An EC-funded project aimed at improving the social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

13 December 2023
by fjmirandav

Sign Language Dataset for Automatic Motion Generation

Sign language datasets play a crucial role in developing systems that enable effective communication for individuals with hearing impairments. While several sign language datasets exist, they focus on sign language recognition and translation and do not include information at the phoneme level. Many existing datasets rely only on sources such as speech or text descriptions attached to videos, which fall short of capturing the intricate details inherent in sign languages. The absence of phonemes prevents the development of motion generation systems for sign language.

In fields other than sign language processing, several motion datasets describe different human activities. For example, the Human3.6M dataset [1] contains 3.6 million accurate 3D human poses under four different viewpoints and their corresponding images. This dataset covers typical human activities such as taking photos, posing, eating, or talking on the phone, performed by 11 professional actors. Examples of the annotations in the dataset are “a person waves with left hand” and “the person is walking in a circular shape”.

Other datasets combine natural language annotations and gesture representations to train systems able to generate avatar motion. For instance, the KIT Motion-Language dataset [2] contains 3911 gestures, with a total duration of 11.23 h, and 6278 annotations in natural language that contain 52,903 words. The authors converted the marker-based motion capture data to the Master Motor Map framework representation (avatars). To obtain motion annotations in natural language, they applied a crowd-sourcing approach and a web-based tool called Motion Annotation.

The HumanML3D dataset [3] consists of 14,616 3D human motion clips and 44,970 text descriptions, covering a vocabulary of 5371 distinct words and a total duration of 28.59 h. This dataset covers a wide range of body movements and postures. Examples of the text descriptions are “a person sits down and crosses their legs, before getting up” or “a person stretches arms out and makes arm circles”.

NTU RGB+D 120 [4] is a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects, that contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes, including daily, mutual, and health-related activities. These action classes cover labels such as “move heavy objects”, “pushing other person”, “arm circles”, and “squat down”.

The BABEL dataset [5] provides action labels per frame for natural and continuous human movement data and contains 43.5 h of recording from 252 action categories. These action categories cover labels such as “stand”, “run”, “support body with right hand”, and “jump over obstacle”.

Regarding datasets related to sign language, Word-Level American Sign Language (WLASL) is the largest video dataset for ASL recognition [6], including 2000 common words performed by over 100 signers. This dataset has been exploited not only to recognize signs but also to generate 2D human pose representations using OpenPose [7]. Another dataset, How2Sign [8], includes speech and transcriptions of videos. It contains a 16k English word vocabulary and a rich set of annotations including glosses, category labels, and automatically extracted 2D keypoints for more than 6M frames. The LSE-Sign database [9] covers Spanish Sign Language, including 2400 individual signs together with grammatical, phonological, and articulatory information. Other studies combine different types of sensors for sign language recognition [10]. However, these datasets do not include both sign phonemes and sign motion landmarks, preventing the training of an automatic system with a sufficient level of detail to generate sign language motion from sign characteristics. These datasets have traditionally been used for sign language recognition [11].

In this paper, we introduce a new sign language dataset that addresses this limitation by incorporating phoneme representations for each sign. By providing these phonemes for each sign, we bridge this gap and unlock new possibilities for sign language motion generation with enough precision. The main contributions of this paper are as follows:

  • The first sign language dataset for automatic motion generation, including sign videos and the corresponding phonemes in HamNoSys, a transcription system for any sign language developed at the University of Hamburg (Germany).
  • A detailed description of the methodology for generating the dataset: phonemes and motion information.
  • A strategy for landmarks extraction from sign language videos. This strategy includes the use of MediaPipe for combining pose and hand landmarks. A solution is provided for dealing with coherence problems during the landmark extraction process along the frame sequence of a sign.
  • Finally, the paper presents preliminary experiments for automatic motion generation from sign language phonemes using state-of-the-art deep learning algorithms based on transformers.
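To illustrate the coherence handling mentioned in the contributions, the sketch below patches frames in which the landmark extractor fails to detect a hand by interpolating between the nearest neighboring detections. This is a minimal sketch under our own assumptions (landmarks stored as one NumPy array per frame, `None` marking a failed detection; the helper name is ours), not the paper's exact procedure:

```python
import numpy as np

def fill_missing_landmarks(frames):
    """frames: list of (N, 2) landmark arrays, or None where detection failed.
    Fill gaps by linear interpolation between the nearest detected frames;
    at sequence boundaries, copy the nearest available detection."""
    detected = [i for i, f in enumerate(frames) if f is not None]
    if not detected:
        raise ValueError("no landmarks detected in the sequence")
    filled = list(frames)
    for i, f in enumerate(frames):
        if f is not None:
            continue
        prev = max((j for j in detected if j < i), default=None)
        nxt = min((j for j in detected if j > i), default=None)
        if prev is None:                      # gap at the start
            filled[i] = frames[nxt].copy()
        elif nxt is None:                     # gap at the end
            filled[i] = frames[prev].copy()
        else:                                 # interior gap: interpolate
            w = (i - prev) / (nxt - prev)
            filled[i] = (1 - w) * frames[prev] + w * frames[nxt]
    return filled
```

Interpolating in landmark space rather than re-running the detector keeps the frame sequence temporally coherent, which matters when the sequence is later used as a motion generation target.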

The motivation behind this research arises from the need to fill a gap in sign language dataset generation. By introducing the first dataset for automatic motion generation, encompassing phonemes and motion information, this study aims to contribute to the advancement of sign language research. Furthermore, the exploration of preliminary experiments using state-of-the-art transformers for generating motion from these sign language phonemes serves as a driving force to expand the frontiers of automatic motion generation in sign language applications.


  1. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339.
  2. Plappert, M.; Mandery, C.; Asfour, T. The KIT Motion-Language Dataset. Big Data 2016, 4, 236–252.
  3. Guo, C.; Zou, S.H.; Zuo, X.X.; Wang, S.; Ji, W.; Li, X.Y.; Cheng, L. Generating Diverse and Natural 3D Human Motions from Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5142–5151.
  4. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
  5. Punnakkal, A.R.; Chandrasekaran, A.; Athanasiou, N.; Quiros-Ramirez, A.; Black, M.J. BABEL: Bodies, Action and Behavior with English Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 722–731.
  6. Li, D.X.; Opazo, C.R.; Yu, X.; Li, H.D.; Soc, I.C. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1448–1458.
  7. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
  8. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i-Nieto, X. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 2734–2743.
  9. Gutierrez-Sigut, E.; Costello, B.; Baus, C.; Carreiras, M. LSE-Sign: A lexical database for Spanish Sign Language. Behav. Res. Methods 2016, 48, 123–137.
  10. Amin, M.S.; Rizvi, S.T.H.; Hossain, M.M. A Comparative Review on Applications of Different Sensors for Sign Language Recognition. J. Imaging 2022, 8, 98.
  11. Dhulipala, S.; Adedoyin, F.F.; Bruno, A. Sign and Human Action Detection Using Deep Learning. J. Imaging 2022, 8, 192.

29 November 2023
by fjmirandav

Improving Hand Pose Recognition using Localization and Zoom Normalizations over MediaPipe Landmarks

Hand Pose Recognition presents significant challenges, such as varying lighting conditions or complex backgrounds, which can hinder accurate and robust hand pose estimation. These challenges can be mitigated by employing MediaPipe to efficiently extract representative landmarks from static images, combined with Convolutional Neural Networks.

Extracting these landmarks from the hands mitigates the impact of lighting variability and complex backgrounds. However, this process does not address the variability in the location and size of the hands. Therefore, processing modules that normalize these points independently of the location of the wrist and the zoom of the hands can significantly mitigate these variabilities. In all the experiments performed in this work, based on American Sign Language alphabet datasets of 870, 27,000, and 87,000 images, the proposed normalizations yield significant improvements in model performance in a limited-resource scenario. In particular, under conditions of high variability, applying both normalizations increased the accuracy from 43.94 ± 0.64% to 89.02 ± 0.40%, an improvement of 45.08 percentage points.
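The two normalizations can be sketched in a few lines. This is an illustrative interpretation (the function name and the choice of scale estimate are ours, and may differ from the paper's exact formulation): translate the landmarks so the wrist sits at the origin, then rescale so the farthest landmark lies at unit distance.

```python
import numpy as np

def normalize_hand(landmarks, wrist_index=0):
    """Normalize a (21, 2) array of MediaPipe-style hand landmarks:
    localization (translate so the wrist is at the origin) and
    zoom (scale so the farthest landmark lies at unit distance)."""
    pts = np.asarray(landmarks, dtype=float)
    pts = pts - pts[wrist_index]               # localization normalization
    scale = np.linalg.norm(pts, axis=1).max()  # estimate of the hand "zoom"
    if scale > 0:
        pts = pts / scale                      # zoom normalization
    return pts
```

After this step, two images of the same hand shape taken at different positions and distances from the camera map to (nearly) the same landmark vector, which is what removes the location and zoom variability from the classifier's input.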

18 October 2023
by fjmirandav

Interpreting Sign Language Recognition using Transformers and MediaPipe Landmarks

Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model.

In this work, we modified the SPOTER architecture by including a learnable array of parameters that performs an element-wise multiplication of the inputs to add interpretability. This operation comes at a linear cost and does not significantly increase the model size, while offering several advantages:

  • Dataset-level understanding of the predictions, hence helping in the early detection of hazardous biases.
  • Interpretation of the most relevant features learned by the model.
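The mechanism behind both advantages is a single learnable vector gating the inputs. Below is a minimal PyTorch sketch of the idea (the class name and feature count are illustrative, not SPOTER's actual code): the gate is initialized to ones, trained jointly with the model, and after training the magnitude of each weight can be read off as a per-feature importance score.

```python
import torch
import torch.nn as nn

class InputImportanceGate(nn.Module):
    """Learnable array multiplied element-wise with the model inputs.
    After training, the magnitude of each weight indicates how much the
    corresponding landmark coordinate contributed to the predictions."""
    def __init__(self, num_features: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, num_features); broadcast over batch and time
        return x * self.weights

gate = InputImportanceGate(108)      # e.g. 54 landmarks x 2 coordinates
x = torch.randn(8, 100, 108)         # a batch of landmark sequences
assert gate(x).shape == x.shape      # the gate preserves the input shape
```

Because the multiplication is element-wise, the cost is linear in the number of input features, and inspecting `gate.weights` at the dataset level gives a single human-interpretable importance vector rather than per-sample attributions.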

We evaluated our approach on two public datasets: WLASL100 (SLR) and IPNHand (gesture recognition). WLASL100 includes 100 glosses from 2,038 videos signed by 97 people, and IPNHand contains more than 4,000 gesture instances performed by 50 subjects, covering 13 gestures performed with one hand.

The learned array highlighted the most informative input features for the recognition task, resulting in a human-interpretable vector that lets us interpret the model’s predictions.

For WLASL100, the system highlighted the right-hand landmarks, including finger motion, as the most informative. Moreover, the arm landmarks were more important than the rest of the pose landmarks and the specific fingers of the left hand. For IPNHand, the system highlighted landmarks from the wrist and the fingertips of the thumb, index, and middle fingers as the most informative in this dataset. To corroborate that the system correctly identified the most informative landmarks in each task, we compared the weight assigned to each input feature at the dataset level and contrasted them with our expert knowledge of the subject (the predominance and the variance of the landmarks).

We believe that the insights gained in this work could be exploited for the development of more efficient SLR pipelines and applied to other application domains.

3 October 2023
by fjmirandav

Video Memorability Prediction From Jointly-learnt Semantic and Visual Features

What if we could predict whether we will remember a video after a second viewing? What aspects of a social post, advertisement or educational content make it more or less memorable? In a world with an abundance of multimedia content, this question is crucial for improving how we communicate through videos. With this aim, we present in this paper a system that can predict the memorability score of a short video, i.e., its likelihood of being remembered over time, based on its frames or textual descriptions.

Recent findings in psychology and neuroscience challenge the idea that memory is purely a matter of personal judgment. Instead, they suggest that certain visual aspects tend to stick in our memory more often (Isola et al., 2011; Lin et al., 2021; Xie et al., 2020). Additionally, the memorability of visual content isn’t just about personal opinions but is also influenced by the semantics or main topic of the scene that is presented (Bylinskii et al., 2022). This has led to approaches that consider not just the images themselves but also textual descriptions of those scenes (Kleinlein et al., 2021).

Our approach combines text and images to enhance predictions of what people will remember. We merge images and text descriptions to jointly train an image and text encoder, creating a representation that extracts semantic information from both sources. We achieve this using Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021), which creates similar vector representations for image-text pairs depicting the same concepts.

Expanding on this approach, we adapt the representations generated by these jointly-learned encoders by fine-tuning them using image-text pairs extracted from videos in a dataset specifically curated for memorability analysis, namely Memento10K (Newman et al., 2020). Hence, we refine the CLIP models into F-CLIP models enriched with semantic knowledge from both visual and textual elements.

In essence, we further pre-train these encoders to indirectly generate useful vector representations for predicting video memorability, capturing semantics from both frames and text descriptions.

We can then use these models as feature extractors by decoupling them and freezing their weights. The generated features are subsequently channeled into a Bayesian Ridge Regressor (BRR), which plays the pivotal role in our predictive pipeline by generating memorability scores. This model is a simple variation of the Ordinary Least Squares Linear Regressor that provides comparable results while being more robust to ill-posed problems. By employing this straightforward yet resilient model, our primary focus centers on comparing the various features we extract and their impact on memorability prediction.
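The feature-extraction-plus-regression stage can be sketched as follows. Note the hedge: in the actual pipeline the features come from the frozen F-CLIP encoders; here random vectors of the same shape stand in for them, so the numbers are meaningless and only the plumbing is illustrated.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import BayesianRidge

# Illustrative stand-in data: in the paper, each 512-d vector would be a
# frozen F-CLIP embedding of a video frame or caption.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 512))        # one embedding per video
scores = rng.uniform(0.4, 1.0, size=500)      # ground-truth memorability

# Fit the Bayesian Ridge Regressor on a training split, predict on the rest
brr = BayesianRidge()
brr.fit(features[:400], scores[:400])
predictions = brr.predict(features[400:])

# Rank-based evaluation, as reported in the paper (SRCC)
srcc, _ = spearmanr(predictions, scores[400:])
```

Keeping the regressor this simple is a deliberate design choice: any difference in SRCC between runs can be attributed to the feature extractor rather than to the capacity of the predictor.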

Our results indicate that, as a general pattern, our fine-tuning approach improves memorability prediction, as reflected in the Spearman Rank Correlation Coefficient (SRCC). Specifically, our method achieved an SRCC of 0.575 when using text as input, while the default pre-trained text encoder scored 0.538. However, we also noticed that this enhancement is less pronounced when dealing with visual data. One explanation for this difference is that text tends to offer richer and more accessible information than visual data, especially in the context of Memento10K’s often lower-quality videos.

In conclusion, our study demonstrates that blending text and images, as seen in the F-CLIP model, improves video memorability prediction, offering promise for various applications. However, there’s still more to explore in this field. Future work may involve leveraging temporal information within videos, incorporating pixel-level descriptors, and exploring the advanced capabilities of Large Language Models (LLMs). Additionally, integrating other types of signals that correlate with our way of creating memories, such as Electro-Encephalogram (EEG), could open up new avenues for understanding and predicting video memorability in more comprehensive ways.

In a nutshell, our research highlights the potential of cross-modal learning, paving the way for better video content creation and communication strategies in our information-rich world.


Isola, P., Parikh, D., Torralba, A., & Oliva, A. (2011). Understanding the intrinsic memorability of images. *Advances in neural information processing systems*, *24*.

Lin, Q., Yousif, S. R., Chun, M. M., & Scholl, B. J. (2021). Visual memorability in the absence of semantic content. *Cognition*, *212*, 104714.

Xie, W., Bainbridge, W. A., Inati, S. K., Baker, C. I., & Zaghloul, K. A. (2020). Memorability of words in arbitrary verbal associations modulates memory retrieval in the anterior temporal lobe. *Nature human behaviour*, *4*(9), 937-948.

Bylinskii, Z., Goetschalckx, L., Newman, A., & Oliva, A. (2022). Memorability: An image-computable measure of information utility. *Human Perception of Visual Information: Psychological and Computational Perspectives*, 207-239.

Kleinlein, R., Luna-Jiménez, C., & Fernández-Martínez, F. (2021). THAU-UPM at MediaEval 2021: From Video Semantics To Memorability Using Pretrained Transformers. In *MediaEval 2021 workshop*.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In *International conference on machine learning* (pp. 8748-8763). PMLR.

Newman, A., Fosco, C., Casser, V., Lee, A., McNamara, B., & Oliva, A. (2020). Multimodal memorability: Modeling effects of semantics and decay on video memorability. In *Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16* (pp. 223-240). Springer International Publishing.

28 September 2023
by luisfernandodharo

Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots

Current State-of-the-Art (SotA) chatbots are able to produce high-quality sentences, handling different conversation topics and longer interactions. Unfortunately, the generated responses depend greatly on the data on which they have been trained, the specific dialogue history and current turn used for guiding the response, the internal decoding mechanisms, and ranking strategies, among others. Therefore, it may happen that for semantically similar questions asked by users, the chatbot provides different answers, which can be considered a form of hallucination or a source of confusion in long-term interactions.

In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detecting inconsistent answers using k-means and the Silhouette coefficient.
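Phase (b) can be sketched as follows: embed the chatbot's answers to a set of paraphrased questions, cluster them with k-means for several values of k, and keep the k that maximizes the Silhouette coefficient. This is a minimal sketch under our own assumptions (the helper name is ours, and sentence embeddings from any encoder are assumed to be precomputed), not the paper's exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_distinct_answers(embeddings, max_k=None):
    """Cluster sentence embeddings of a chatbot's answers to paraphrases
    of the same question. The k maximizing the Silhouette coefficient
    estimates how many distinct answers were given; k > 1 flags an
    inconsistency."""
    X = np.asarray(embeddings, dtype=float)
    n = len(X)
    if max_k is None:
        max_k = n - 1                 # silhouette needs 2 <= k <= n - 1
    best_k, best_score = 1, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

On two tight, well-separated groups of answer embeddings, the Silhouette coefficient peaks at k = 2, correctly signaling that the chatbot gave two semantically different answers to the same underlying question.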

To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. The experimental results demonstrate the efficacy of the topic detection algorithm, achieving an F1 weighted score of 0.67 when detecting 13 distinct topics and an F1 weighted score of 0.45 when detecting 18 distinct topics. At the subtopic level, a weighted F1 score of 0.67 was achieved. We also show that our proposed approach outperforms a larger model trained on specific dialogue data. An advantage of our approach is that it is scalable, allowing the incorporation of new fine-grained categories and subcategories that the larger model is not able to recognize.

For the purpose of detecting inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. The algorithm exhibits precise estimation of the number of distinct responses: it achieves an MSE of 3.4 over a set of 109 handcrafted responses (15 sets of original questions plus their paraphrases, passed to 4 small chatbot models). For the 120 questions created with GPT-4 (15 question sets, each consisting of 1 original question and its 7 paraphrases, fed into 4 State-of-the-Art chatbots), the overall MSE was 3.2. These results show that even LLMs produce inconsistent answers, and that our approach is a good proxy for detecting such cases.

As future work, we will primarily focus on two main aspects: expanding the range of high-level topics and subsequently evaluating the algorithm’s performance in identifying subtopics. Additionally, we included this topic and subtopic classifier in the dialogue management for the chatbot that we used during our participation in the Alexa Socialbot Grand Challenge (SGC5) [1]. Regarding the detection of inconsistent responses, our efforts will be directed towards the development of controllable algorithms and architectures, such as TransferTransfo [2] or CTRL [3], leveraging persona profiles within these frameworks with the idea of generating more consistent responses. Furthermore, we seek to explore mechanisms to incorporate these identified inconsistencies into an automated evaluation of dialogue systems [4, 5], according to the recommendations made in [6].


[1]. Estecha-Garitagoitia, Marcos, Mario Rodríguez-Cantelar, Alfredo Garrachón Ruiz, Claudia Garoé Fernández García, Sergio Esteban Romero, Cristina Conforto, Alberto Saiz Fernández, Luis Fernando Fernández Salvador, and Luis Fernando D’Haro. “THAURUS: An Innovative Multimodal Chatbot Based on the Next Generation of Conversational AI.” Alexa Prize SocialBot Grand Challenge 5.

[2]. Wolf, T.; Sanh, V.; Chaumond, J.; Delangue, C. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv 2019, arXiv:cs.CL/1901.08149. 

[3]. Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv 2019, arXiv:cs.CL/1909.05858.

[4]. Zhang, C.; Sedoc, J.; D’Haro, L.F.; Banchs, R.; Rudnicky, A. Automatic Evaluation and Moderation of Open-domain Dialogue Systems. arXiv 2021, arXiv:cs.CL/2111.02110.

[5]. Zhang, C.; D’Haro, L.F.; Friedrichs, T.; Li, H. MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11657–11666.