Grupo de Tecnología del Habla y Aprendizaje Automático

Speech Technology and Machine Learning Group

Revolutionizing Video Understanding: QLoRA Fine-Tuning for State-of-the-Art Memorability Prediction

Have you ever wondered what makes some videos so unforgettable? What if we could create AI that understands and predicts a video's memorability? Our brand new paper "Parameter-Efficient Adaptation of Large Vision-Language Models for Video Memorability Prediction" by Martín-Fernández et al. (2025) introduces a novel approach to predicting video memorability using the QLoRA technique to fine-tune the Qwen-VL model. This method achieves a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744 on the Memento10k dataset, demonstrating the effectiveness of leveraging generalist multimodal pre-training in specialized domains.
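For readers unfamiliar with the metric, here is a minimal sketch of how a Spearman Rank Correlation Coefficient can be computed: rank both score lists, then take the Pearson correlation of the ranks. The function names and the five example scores are hypothetical and are not taken from the paper or from Memento10k. The paper's own evaluation code may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical memorability scores for five videos (illustration only).
predicted = [0.91, 0.60, 0.75, 0.82, 0.55]
annotated = [0.88, 0.70, 0.65, 0.90, 0.50]
print(round(spearman(predicted, annotated), 3))  # 0.8
```

A value of 1 means the model orders the videos exactly as the human annotations do, while 0 means the two orderings are unrelated.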

Click here to read the paper

The doctoral thesis of Dr. Manuel Gil Martín honored by the Madrid City Council in the Margarita Salas Awards

Manuel's thesis, awarded an honorable mention, is titled "Contributions to Human Motion Modeling and Recognition using Non-intrusive Wearable Sensors." His advisor is Rubén San Segundo Hernández, a full professor in the Department of Electronic Engineering and also a member of THAU. The work focuses on the characterization of human movement through inertial and physiological signals obtained from wearable devices, analyzed using signal processing techniques and deep learning algorithms. The Margarita Salas Awards promote research talent and innovation, with 575 entries this year. The ceremony took place on October 9th at the Palacio de Cibeles. Link to the thesis: https://oa.upm.es/70493/

Link to the awards

Welcome to THAU

Welcome to the Speech Technology and Machine Learning group website

2nd place in the EmoSPeech Challenge at IberLEF 2024

We're thrilled to announce that our team secured second place in the EmoSPeech task at IberLEF 2024, which focused on multimodal speech-text emotion recognition in Spanish. We developed two strategies: fine-tuning the Qwen-Audio-Chat model with Low-Rank Adaptation (LoRA), and a novel Whisper-Gemma model that combines the Whisper-large-v3 audio encoder with the Gemma Large Language Model (LLM). The Qwen-Audio-Chat model achieved a macro-F1 score of 0.8248, and the Whisper-Gemma model scored 0.7904. These results show that parameter-efficient fine-tuning, and the pairing of robust audio encoders with LLMs, can improve Speech Emotion Recognition (SER) systems. Our work will be presented at IberLEF 2024 on September 24th, 2024, in Valladolid, Spain, as part of SEPLN 2024.
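The scores above are macro-averaged F1 values. As a rough illustration, the sketch below computes F1 separately for each emotion class and then takes the unweighted mean, so rare emotions count as much as frequent ones. The function name, the labels, and the predictions are hypothetical. The challenge's official scorer may handle edge cases differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reference labels and system predictions (illustration only).
reference = ["joy", "anger", "joy", "sadness", "anger", "joy"]
predicted = ["joy", "anger", "sadness", "sadness", "joy", "joy"]
print(round(macro_f1(reference, predicted), 3))  # 0.667
```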

Information about the EmoSPeech challenge

Our Team Makes Strides in Emotion Recognition at Odyssey 2024!

The Odyssey workshop brings together researchers from around the world to push the boundaries of speaker and language recognition technologies, including the complex field of emotion detection in speech. The challenge focused on analyzing speech recordings from the MSP-Podcast corpus to identify specific emotions. We're proud to announce that our team ranked 7th out of 69 participants, a strong showing in this competitive environment. Stay tuned for further updates as we delve deeper into emotion recognition research and explore its potential applications!

More info