Sign language datasets play a crucial role in developing systems that enable effective communication for individuals with hearing impairments. Although several sign language datasets exist, they focus on sign language recognition and translation and do not include information at the phoneme level. Many existing datasets rely only on specific sources, such as speech or text descriptions attached to videos, which fall short of capturing the intricate details inherent in sign languages. Without phoneme-level annotations, it is not possible to develop motion generation systems for sign language.
Outside sign language processing, several motion datasets describe different human activities. For example, the Human3.6M dataset [1] contains 3.6 million accurate 3D human poses captured from four different viewpoints, together with the corresponding images. This dataset covers typical human activities, such as taking photos, posing, eating, or talking on the phone, performed by 11 professional actors. Examples of the annotations in the dataset are “a person waves with left hand” and “the person is walking in a circular shape”.
Other datasets combine natural language annotations and gesture representations to train systems able to generate avatar motion. For instance, the KIT Motion-Language dataset [2] contains 3911 gestures, with a total duration of 11.23 h, and 6278 annotations in natural language that contain 52,903 words. The authors converted the marker-based motion capture data to the Master Motor Map framework representation (avatars). To obtain motion annotations in natural language, they applied a crowd-sourcing approach and a web-based tool called Motion Annotation.
The HumanML3D dataset [3] consists of 14,616 3D human motion clips and 44,970 text descriptions, covering a vocabulary of 5371 distinct words and a total duration of 28.59 h. This dataset covers a wide range of body movements and postures. Some examples of the text descriptions are “a person sits down and crosses their legs, before getting up” or “a person stretches arms out and makes arm circles”.
NTU RGB+D 120 [4] is a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects and containing more than 114 thousand video samples and 8 million frames. It covers 120 different action classes, including daily, mutual, and health-related activities, with labels such as “move heavy objects”, “pushing other person”, “arm circles”, and “squat down”.
The BABEL dataset [5] provides action labels per frame for natural and continuous human movement data and contains 43.5 h of recording from 252 action categories. These action categories cover labels such as “stand”, “run”, “support body with right hand”, and “jump over obstacle”.
Regarding datasets related to sign language, the Word-Level American Sign Language (WLASL) dataset is the largest video dataset for ASL recognition [6], including 2000 common words performed by over 100 signers. This dataset has been used not only to recognize signs but also to generate 2D human pose representations using OpenPose [7]. Another dataset, How2Sign [8], includes speech and transcriptions of the videos. It covers a vocabulary of 16k English words and provides a rich set of annotations, including glosses, category labels, and automatically extracted 2D keypoints for more than 6M frames. The LSE-Sign database [9] covers Spanish Sign Language, providing 2400 individual signs as well as grammatical, phonological, and articulatory information. Other studies combine different types of sensors for sign language recognition [10]. However, none of these datasets include both sign phonemes and sign motion landmarks, which prevents training an automatic system with a sufficient level of detail to generate sign language motion from sign characteristics. These datasets have traditionally been used for sign language recognition [11].
In this paper, we introduce a new sign language dataset that addresses this limitation by incorporating phoneme representations for each sign. These phoneme annotations bridge the gap and unlock new possibilities for generating sign language motion with sufficient precision. The main contributions of this paper are as follows:
- The first sign language dataset for automatic motion generation, including sign videos and the corresponding phonemes in HamNoSys. HamNoSys is a transcription system applicable to any sign language, developed at the University of Hamburg (Germany).
- A detailed description of the methodology for generating the dataset: phonemes and motion information.
- A strategy for landmark extraction from sign language videos. This strategy uses MediaPipe to combine pose and hand landmarks and provides a solution for dealing with coherence problems along the frame sequence of a sign during landmark extraction (see the first sketch after this list).
- Finally, the paper presents preliminary experiments on automatic motion generation from sign language phonemes using state-of-the-art deep learning algorithms based on transformers (see the second sketch after this list).
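To make the landmark extraction strategy more concrete, the following is a minimal sketch, not the authors' exact pipeline, that uses MediaPipe Holistic to obtain pose and hand landmarks for each frame and applies a simple coherence heuristic: when a body part is not detected in a frame, the landmarks from the previous frame are reused. The function names and the fallback rule are illustrative assumptions.

```python
# Minimal sketch of per-frame landmark extraction with MediaPipe Holistic.
# The fallback rule for missing detections is an illustrative assumption,
# not necessarily the coherence solution described in the paper.
import cv2
import mediapipe as mp
import numpy as np

N_POSE, N_HAND = 33, 21  # MediaPipe landmark counts for the pose and each hand


def to_flat(landmark_list):
    """Flatten a MediaPipe landmark list into a 1-D array of x, y, z values."""
    return np.array([[lm.x, lm.y, lm.z] for lm in landmark_list.landmark]).ravel()


def extract_sequence(video_path):
    """Return a (num_frames, (33 + 21 + 21) * 3) array of landmarks for one sign video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    # Neutral starting values used until the first successful detection.
    prev = {"pose": np.zeros(N_POSE * 3), "lh": np.zeros(N_HAND * 3), "rh": np.zeros(N_HAND * 3)}
    with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                        min_tracking_confidence=0.5) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            detected = {"pose": results.pose_landmarks,
                        "lh": results.left_hand_landmarks,
                        "rh": results.right_hand_landmarks}
            # Coherence fallback: reuse the previous frame when a part is not detected.
            combined = {k: to_flat(v) if v is not None else prev[k] for k, v in detected.items()}
            prev = combined
            frames.append(np.concatenate([combined["pose"], combined["lh"], combined["rh"]]))
    cap.release()
    return np.stack(frames)
```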
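For the transformer-based motion generation experiments, the second sketch below shows one plausible formulation under stated assumptions: a standard PyTorch encoder-decoder Transformer that encodes HamNoSys phoneme token IDs and regresses one landmark frame per decoder step with teacher forcing. The model sizes, vocabulary size, and class names are hypothetical and are not the configuration reported in the paper.

```python
# Minimal sketch of a phoneme-to-motion model: phoneme token IDs in,
# landmark frames out. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class PhonemeToMotion(nn.Module):
    def __init__(self, vocab_size=200, d_model=256, landmark_dim=(33 + 21 + 21) * 3):
        super().__init__()
        self.phoneme_embed = nn.Embedding(vocab_size, d_model)   # phoneme IDs -> vectors
        self.frame_proj = nn.Linear(landmark_dim, d_model)       # previous frames -> vectors
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.output_head = nn.Linear(d_model, landmark_dim)      # vectors -> landmark frame

    def forward(self, phoneme_ids, prev_frames):
        """phoneme_ids: (B, S) int tokens; prev_frames: (B, T, landmark_dim) floats."""
        src = self.phoneme_embed(phoneme_ids)
        tgt = self.frame_proj(prev_frames)
        # Causal mask so each output frame only attends to earlier frames.
        causal = self.transformer.generate_square_subsequent_mask(prev_frames.size(1))
        decoded = self.transformer(src, tgt, tgt_mask=causal)
        return self.output_head(decoded)  # (B, T, landmark_dim) predicted frames


# Example: 8 phoneme tokens in, 30 landmark frames out (teacher forcing during training).
model = PhonemeToMotion()
pred = model(torch.randint(0, 200, (1, 8)), torch.zeros(1, 30, (33 + 21 + 21) * 3))
```

At inference time, frames would be generated autoregressively by feeding each predicted frame back into the decoder.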
The motivation behind this research arises from the need to fill a gap in sign language dataset generation. By introducing the first dataset for automatic motion generation, encompassing phonemes and motion information, this study aims to contribute to the advancement of sign language research. Furthermore, the exploration of preliminary experiments using state-of-the-art transformers for generating motion from these sign language phonemes serves as a driving force to expand the frontiers of automatic motion generation in sign language applications.
Read the full paper here: https://www.mdpi.com/2313-433X/9/12/262
References:
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339.
- Plappert, M.; Mandery, C.; Asfour, T. The KIT Motion-Language Dataset. Big Data 2016, 4, 236–252.
- Guo, C.; Zou, S.H.; Zuo, X.X.; Wang, S.; Ji, W.; Li, X.Y.; Cheng, L. Generating Diverse and Natural 3D Human Motions from Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5142–5151.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
- Punnakkal, A.R.; Chandrasekaran, A.; Athanasiou, N.; Quiros-Ramirez, A.; Black, M.J. BABEL: Bodies, Action and Behavior with English Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 722–731.
- Li, D.X.; Opazo, C.R.; Yu, X.; Li, H.D. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1448–1458.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
- Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i-Nieto, X. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 2734–2743.
- Gutierrez-Sigut, E.; Costello, B.; Baus, C.; Carreiras, M. LSE-Sign: A lexical database for Spanish Sign Language. Behav. Res. Methods 2016, 48, 123–137.
- Amin, M.S.; Rizvi, S.T.H.; Hossain, M.M. A Comparative Review on Applications of Different Sensors for Sign Language Recognition. J. Imaging 2022, 8, 98.
- Dhulipala, S.; Adedoyin, F.F.; Bruno, A. Sign and Human Action Detection Using Deep Learning. J. Imaging 2022, 8, 192.