An EC-funded project aimed at improving the social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

Interpreting Sign Language Recognition using Transformers and MediaPipe Landmarks


Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model.

In this work, we modified the SPOTER architecture by adding a learnable array of parameters that is multiplied element-wise with the inputs to provide interpretability. This operation has a linear cost and does not significantly increase the model size, yet it offers several advantages:

  • Dataset-level understanding of the predictions, hence helping in the early detection of hazardous biases.
  • Interpretation of the most relevant features learned by the model.
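To make the element-wise weighting concrete, here is a minimal sketch in plain Python of the forward pass only. It is not the code from the paper: the function names, the initialisation at 1.0, the sharing of one weight per feature across all frames, and the toy values are assumptions made for illustration. In the real model the array would be trained jointly with the Transformer.

```python
from typing import List, Sequence

Frame = List[float]  # one flattened frame: (x, y) per landmark


def init_feature_weights(num_features: int) -> List[float]:
    """Assumed initialisation: all ones, so the layer starts as an identity."""
    return [1.0] * num_features


def apply_feature_weights(
    frames: Sequence[Frame], weights: Sequence[float]
) -> List[Frame]:
    """Scale every input feature by its weight (element-wise product per frame).

    The same weight vector is reused for every frame (an assumption), so the
    extra parameter count equals the number of input features.
    """
    for frame in frames:
        if len(frame) != len(weights):
            raise ValueError("each frame needs exactly one value per weight")
    return [[value * w for value, w in zip(frame, weights)] for frame in frames]


# Toy input: 2 frames, 3 hypothetical landmarks with (x, y) each -> 6 features.
frames = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
]
weights = init_feature_weights(6)
weights[0] = 2.0  # pretend training amplified this feature
weights[5] = 0.0  # pretend training suppressed this one

scaled = apply_feature_weights(frames, weights)
```

Because the weights sit directly on the inputs, each one can be read as a score for a single landmark coordinate, which is what makes the vector inspectable after training.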

We evaluated our approach on two public datasets: WLASL100 (SLR) and IPNHand (gesture recognition). WLASL100 includes 100 glosses from 2,038 videos signed by 97 people, and IPNHand contains more than 4,000 gesture instances performed by 50 subjects, covering 13 one-handed gestures.

The learned array highlighted the most informative input features for solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model predictions.

Regarding WLASL100, the system highlighted the right-hand landmarks as the most informative ones, including the motion of the fingers. Moreover, the arm landmarks were more important than the rest of the pose landmarks and the individual fingers of the left hand. Concerning IPNHand, the system highlighted the landmarks of the wrist and of the thumb, index and middle fingertips as the most informative in this dataset. To corroborate that the system was correctly identifying the most informative landmarks in each task, we compared the weights assigned to each input feature at the dataset level and contrasted them with our expert knowledge on the subject (the predominance and the variance of the landmarks).
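As an illustration of how such a dataset-level reading could be produced, the sketch below collapses a learned weight vector into one score per landmark and ranks the landmarks. The scoring rule (mean absolute weight over a landmark's coordinates), the landmark names, and the weight values are all hypothetical; they are not results or methods taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def landmark_importance(
    weights: Sequence[float],
    names: Sequence[str],
    coords_per_landmark: int = 2,
) -> Dict[str, float]:
    """Mean absolute weight per landmark (assumed scoring rule)."""
    if len(weights) != len(names) * coords_per_landmark:
        raise ValueError("weights must hold one value per landmark coordinate")
    scores: Dict[str, float] = {}
    for i, name in enumerate(names):
        start = i * coords_per_landmark
        chunk = weights[start:start + coords_per_landmark]
        scores[name] = sum(abs(w) for w in chunk) / coords_per_landmark
    return scores


def rank_landmarks(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort landmarks from most to least important."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical landmark names and made-up weights, for illustration only.
names = ["right_wrist", "right_index_tip", "left_wrist", "nose"]
learned = [1.5, -1.0, 2.0, 2.5, 0.5, 0.25, 0.0, 0.5]

ranking = rank_landmarks(landmark_importance(learned, names))
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A ranking of this kind can then be compared against prior expectations, such as which hand is dominant in the dataset or which landmarks vary the most.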

We believe that the insights gained in this work could be exploited for the development of more efficient SLR pipelines and applied to other application domains.

Read the full paper here:
