An EC-funded project aimed at improving the social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

29 May 2024
by fjmirandav

Testing theory of mind in Large Language models and humans

In a recent groundbreaking study published in the renowned journal Nature, a team of researchers from the ASTOUND project consortium explored theory of mind capabilities in humans and large language models (LLMs) such as GPT-4 and LLaMA2. This study, central to the ASTOUND project (GA 101071191), examines how well these AI models can track and interpret human mental states, an ability essential to social interaction and communication.

Our team, alongside other prominent researchers, embarked on a comprehensive examination of theory of mind in both humans and AI. The study involved a series of tests designed to measure various aspects of theory of mind, including understanding false beliefs, interpreting indirect requests, and recognizing irony and faux pas.

We tested two families of LLMs (GPT-4 and LLaMA2) against a battery of measurements, comparing their performance with a sample of 1,907 human participants. This rigorous approach ensured a fair and systematic comparison between human and artificial intelligences.

The findings highlight that while AI models can mimic human-like reasoning in several theory of mind tasks, they also reveal distinct limitations and biases. For instance, GPT models often adopt a hyperconservative approach, hesitating to commit to conclusions without sufficient evidence, which contrasts with human tendencies to make more definitive judgments.

This study was a collaborative effort involving experts from various institutions, including our own team. Our involvement was crucial in designing and conducting the experiments, analyzing the data, and interpreting the results.

The insights gained from this research are invaluable for future developments in AI. Understanding the nuances of how AI models process social information can guide the creation of more sophisticated and human-like AI systems. It also opens avenues for further research into mitigating biases and improving the robustness of AI’s social reasoning abilities.

Read the full article here:

15 December 2023
by fjmirandav

Sign Language Motion Generation from Sign Characteristics

Motion generation is an innovative research field which consists of producing movements, gestures, or animations by computer systems. This process involves the use of mathematical algorithms to create dynamic, natural, and fluid motions that simulate human movements, enhancing human–computer interactions. Motion generation has many different applications, such as the movement of virtual characters in video games or animated films, robotics, and sign language representation in communication systems for deaf people. In sign language communication systems, motion generation increases the naturalness of interactive avatars or virtual assistants that respond with sign language output. Greater naturalness supports the development of friendly communication systems between deaf and hearing people. These systems empower and enhance the communication capabilities of deaf individuals, encouraging inclusivity and facilitating their integration into various social and professional scenarios.

Most state-of-the-art sign language motion generation systems are based on expert rules or prerecorded movements. This work proposes to train a module to automatically generate sign language motion from sign phonemes, represented using HamNoSys [1]. The proposed generation system is based on deep learning using a transformer-based approach [2]. HamNoSys is a phonetic transcription system for sign language, which includes sign characteristics or phonemes such as hand location, shape, and movement.

To the best of the authors’ knowledge, the proposed system is the first motion generation system for sign language based on transformers. The main contributions of this paper are the following:

  • Proposal and evaluation of a deep learning architecture based on transformers for generating the sequence of landmarks to represent a sign.
  • This proposed approach also includes a stop module for deciding the end of the generation process. This stop module is also evaluated in different scenarios.
  • Additional analyses for improving the system accuracy, considering different padding strategies, interpolation approaches, and data augmentation techniques.
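The padding strategies and the stop module mentioned above can be illustrated with a small sketch. It is not the paper's implementation: the function name `pad_with_stop_targets`, the two padding modes, and the convention that the stop target is 1 from the last real frame onwards are all assumptions made for illustration.

```python
import numpy as np

def pad_with_stop_targets(sequences, max_len, mode="zeros"):
    """Pad landmark sequences to max_len and build per-frame stop targets.

    sequences: list of non-empty arrays shaped (frames, features).
    mode: "zeros" pads with zeros, "last" repeats the final real frame.
    Returns (padded, stop) shaped (n, max_len, features) and (n, max_len).
    """
    n_features = sequences[0].shape[1]
    padded = np.zeros((len(sequences), max_len, n_features), dtype=np.float32)
    stop = np.ones((len(sequences), max_len), dtype=np.float32)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            raise ValueError("empty sequence")
        length = min(len(seq), max_len)
        padded[i, :length] = seq[:length]
        if mode == "last" and length < max_len:
            padded[i, length:] = seq[length - 1]
        # Assumed convention: frames before the final real frame mean
        # "keep generating" (0); the final frame and padding mean "stop" (1).
        stop[i, :length - 1] = 0.0
    return padded, stop

# Two toy "signs" of 3 and 5 frames with 2 features per frame.
signs = [np.ones((3, 2)), np.full((5, 2), 2.0)]
padded, stop = pad_with_stop_targets(signs, max_len=5, mode="last")
print(stop)
```

A generator trained on such targets can end its output at the first frame where the predicted stop value crosses a threshold, instead of always producing `max_len` frames.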

Read the full paper here:


  1. Hanke, T. HamNoSys – Representing sign language data in language resources and language processing contexts. LREC 2004, 5, 1–6.
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.

13 December 2023
by fjmirandav

Sign Language Dataset for Automatic Motion Generation

Sign language datasets play a crucial role in developing systems that enable effective communication for individuals with hearing impairments. While several sign language datasets exist, they focus on sign language recognition and translation and do not include information at the phoneme level. Many existing datasets rely only on specific sources such as speech or text descriptions attached to videos, which fall short of capturing the intricate details inherent in sign languages. Without phoneme annotations, these datasets cannot support the development of motion generation systems for sign language.

In fields other than sign language processing, there are several motion datasets describing different human activities. For example, the Human3.6M dataset [1] contains 3.6 million accurate 3D human poses captured from four different viewpoints, together with their corresponding images. This dataset contains typical human activities such as taking photos, posing, eating, or talking on the phone, performed by 11 professional actors. Some examples of the annotations in the dataset are “a person waves with left hand” and “the person is walking in a circular shape”.

Other datasets combine natural language annotations and gesture representations to train systems able to generate avatar motion. For instance, the KIT Motion-Language dataset [2] contains 3911 gestures, with a total duration of 11.23 h, and 6278 annotations in natural language that contain 52,903 words. The authors converted the marker-based motion capture data to the Master Motor Map framework representation (avatars). To obtain motion annotations in natural language, they applied a crowd-sourcing approach and a web-based tool called Motion Annotation.

The HumanML3D dataset [3] consists of 14,616 3D human motion clips and 44,970 text descriptions, covering a vocabulary of 5371 distinct words and a total duration of 28.59 h. This dataset covers a wide range of body movements and postures. Some examples of the text descriptions are “a person sits down and crosses their legs, before getting up” or “a person stretches arms out and makes arm circles”.

NTU RGB+D 120 [4] is a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects, that contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes, including daily, mutual, and health-related activities. These action classes cover labels such as “move heavy objects”, “pushing other person”, “arm circles”, and “squat down”.

The BABEL dataset [5] provides action labels per frame for natural and continuous human movement data and contains 43.5 h of recording from 252 action categories. These action categories cover labels such as “stand”, “run”, “support body with right hand”, and “jump over obstacle”.

Regarding datasets related to sign language, Word-Level American Sign Language (WLASL) is the largest video dataset for ASL recognition [6], including 2000 common words performed by over 100 signers. This dataset has been exploited to recognize signs but also to generate 2D human pose representations using OpenPose [7]. Another dataset, called How2Sign [8], includes speech and transcriptions of videos. It covers a vocabulary of 16k English words and provides a rich set of annotations including gloss, category labels, as well as automatically extracted 2D keypoints for more than 6M frames. The LSE-Sign database [9] covers Spanish Sign Language, with 2400 individual signs as well as grammatical, phonological, and articulatory information. Other studies combine different types of sensors for sign language recognition [10]. However, these datasets do not include both sign phonemes and sign motion landmarks, which prevents the training of an automatic system with a sufficient level of detail to generate sign language motion from sign characteristics. These datasets have traditionally been used for sign language recognition [11].

In this paper, we introduce a new sign language dataset that addresses this limitation by incorporating phoneme representations for each sign. By providing these phonemes for each sign, we bridge this gap and unlock new possibilities for sign language motion generation with enough precision. The main contributions of this paper are as follows:

  • The first sign language dataset for automatic motion generation, including sign videos and the corresponding phonemes in HamNoSys. HamNoSys is a transcription system for any sign language and was developed at the University of Hamburg, Hamburg (Germany).
  • A detailed description of the methodology for generating the dataset: phonemes and motion information.
  • A strategy for landmark extraction from sign language videos. This strategy includes the use of MediaPipe for combining pose and hand landmarks. A solution is provided for dealing with coherence problems during the landmark extraction process along the frame sequence of a sign.
  • Finally, the paper presents preliminary experiments for automatic motion generation from sign language phonemes using state-of-the-art deep learning algorithms based on transformers.
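The post does not describe how the coherence problems along the frame sequence are solved. One plausible source of such problems is frames in which a hand is not detected, leaving a gap in the landmark track. The sketch below shows one common way to handle this, linear interpolation between detected frames. The function name `fill_missing_frames` and the use of NaN to mark undetected frames are assumptions for illustration, not the dataset's actual procedure.

```python
import numpy as np

def fill_missing_frames(track):
    """Linearly interpolate frames whose landmarks are missing.

    track: array shaped (frames, features); a frame with any NaN is
    treated as "not detected". Gaps at the start or end are filled
    with the nearest detected frame.
    """
    track = np.asarray(track, dtype=np.float64)
    frames = np.arange(track.shape[0])
    detected = ~np.isnan(track).any(axis=1)
    if not detected.any():
        raise ValueError("no detected frames to interpolate from")
    filled = np.empty_like(track)
    for j in range(track.shape[1]):
        # np.interp holds the first/last known value outside the known range.
        filled[:, j] = np.interp(frames, frames[detected], track[detected, j])
    return filled

nan = np.nan
# Four frames of a 2-feature track; frames 1 and 3 were not detected.
track = np.array([[0.0, 0.0], [nan, nan], [2.0, 4.0], [nan, nan]])
filled = fill_missing_frames(track)
print(filled)
```

Interpolating per feature keeps the sequence length unchanged, so pose and hand tracks can still be concatenated frame by frame afterwards.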

The motivation behind this research arises from the need to fill a gap in sign language dataset generation. By introducing the first dataset for automatic motion generation, encompassing phonemes and motion information, this study aims to contribute to the advancement of sign language research. Furthermore, the exploration of preliminary experiments using state-of-the-art transformers for generating motion from these sign language phonemes serves as a driving force to expand the frontiers of automatic motion generation in sign language applications.

Read the full paper here:


  1. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339.
  2. Plappert, M.; Mandery, C.; Asfour, T. The KIT Motion-Language Dataset. Big Data 2016, 4, 236–252.
  3. Guo, C.; Zou, S.H.; Zuo, X.X.; Wang, S.; Ji, W.; Li, X.Y.; Cheng, L. Generating Diverse and Natural 3D Human Motions from Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5142–5151.
  4. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
  5. Punnakkal, A.R.; Chandrasekaran, A.; Athanasiou, N.; Quiros-Ramirez, A.; Black, M.J. BABEL: Bodies, Action and Behavior with English Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 722–731.
  6. Li, D.X.; Opazo, C.R.; Yu, X.; Li, H.D. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1448–1458.
  7. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
  8. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i-Nieto, X. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 2734–2743.
  9. Gutierrez-Sigut, E.; Costello, B.; Baus, C.; Carreiras, M. LSE-Sign: A lexical database for Spanish Sign Language. Behav. Res. Methods 2016, 48, 123–137.
  10. Amin, M.S.; Rizvi, S.T.H.; Hossain, M.M. A Comparative Review on Applications of Different Sensors for Sign Language Recognition. J. Imaging 2022, 8, 98.
  11. Dhulipala, S.; Adedoyin, F.F.; Bruno, A. Sign and Human Action Detection Using Deep Learning. J. Imaging 2022, 8, 192.

29 November 2023
by fjmirandav

Improving Hand Pose Recognition using Localization and Zoom Normalizations over MediaPipe Landmarks

Hand Pose Recognition presents significant challenges that need to be addressed, such as varying lighting conditions or complex backgrounds, which can hinder accurate and robust hand pose estimation. This can be mitigated by employing MediaPipe to facilitate the efficient extraction of representative landmarks from static images combined with the use of Convolutional Neural Networks.

Extracting these landmarks from the hands mitigates the impact of lighting variability or the presence of complex backgrounds. However, this process does not address the variability in the location and size of the hands. Processing modules that normalize these points so that they are independent of the location of the wrist and the zoom of the hands can therefore significantly mitigate the effects of this variability. In all the experiments performed in this work, based on American Sign Language alphabet datasets of 870, 27,000, and 87,000 images, applying the proposed normalizations significantly improves model performance in a limited-resource scenario. In particular, under conditions of high variability, applying both normalizations increased the accuracy by 45.08 percentage points, from 43.94 ± 0.64% to 89.02 ± 0.40%.
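A minimal sketch of what localization and zoom normalization could look like is given below. It assumes the MediaPipe Hands layout, where the wrist is landmark 0, and it assumes that zoom is normalized by the largest wrist-to-landmark distance; the paper may define the scale differently, and `normalize_hand` is a hypothetical name.

```python
import numpy as np

WRIST = 0  # In the MediaPipe Hands layout, landmark 0 is the wrist.

def normalize_hand(landmarks):
    """Make hand landmarks independent of hand position and size.

    landmarks: array shaped (21, 2) or (21, 3).
    """
    pts = np.asarray(landmarks, dtype=np.float64)
    # Localization normalization: express every point relative to the wrist.
    centered = pts - pts[WRIST]
    # Zoom normalization (assumed definition): divide by the largest
    # distance from the wrist so the hand always spans unit radius.
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0:
        return centered
    return centered / scale

rng = np.random.default_rng(0)
hand = rng.random((21, 2))
# The same hand pose, three times larger and shifted in the image.
moved = hand * 3.0 + np.array([0.4, -0.2])
print(np.allclose(normalize_hand(hand), normalize_hand(moved)))
```

After both steps, two images of the same hand pose at different positions and distances from the camera map to the same landmark vector, which is the variability the normalizations are meant to remove.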

Read the full paper here:

18 October 2023
by fjmirandav

Interpreting Sign Language Recognition using Transformers and MediaPipe Landmarks

Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model.

In this work, we modified the SPOTER architecture to include a learnable array of parameters that performs an element-wise multiplication of the inputs to add interpretability. This operation comes at a linear cost and does not significantly increase the model size, and it offers several advantages:

  • Dataset-level understanding of the predictions, hence helping in the early detection of hazardous biases.
  • Interpretation of the most relevant features learned by the model.

We evaluated our approach on two public datasets: WLASL100 (SLR) and IPNHand (gesture recognition). WLASL100 includes 100 glosses from 2,038 videos signed by 97 people, and IPNHand contains more than 4,000 gesture instances performed by 50 subjects, covering 13 one-hand gestures.

The learned array highlighted the most informative input features that contributed to solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model predictions.

Regarding WLASL100, the system highlighted the right-hand landmarks as the most informative ones, including finger motion. Moreover, the arm landmarks were more important than the rest of the pose landmarks and specific fingers of the left hand. Concerning IPNHand, the system highlighted the landmarks of the wrist and the fingertips of the thumb, index, and middle fingers as the most informative. To corroborate that the system was correctly identifying the most informative landmarks in each task, we compared the weights assigned to each input feature at the dataset level and contrasted them with our expert knowledge of the subject (the predominance and the variance of the landmarks).

We believe that the insights gained in this work could be exploited for the development of more efficient SLR pipelines and applied to other application domains.

Read the full paper here: