Towards large vocabulary continuous sign language recognition: from artificial to real-life tasks

  • Auf dem Weg zur kontinuierlichen Gebärdenspracherkennung mit großem Vokabular: Von künstlichen zu lebensechten Daten

Koller, Oscar Tobias Anatol; Ney, Hermann (Thesis advisor); Bowden, Richard (Thesis advisor)

Aachen (2020)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


Deaf people represent a minority that faces strong accessibility challenges due to a world focused on oral-auditory communication. This thesis deals with large vocabulary continuous sign language recognition, which has the potential to overcome accessibility issues and communication barriers between Deaf and hearing people. The full communication pipeline is bidirectional and composed of recognition, translation and generation sub-tasks going from sign to spoken and from spoken to sign language. Sign language recognition targets one complex sub-problem in the communication direction from sign to spoken language: recognising the sequence of signs in a signed video utterance. In the scope of this thesis, signs are represented by semantic gloss descriptors, which are used to transcribe a signed utterance. It is assumed that sign language video and gloss transcriptions share the same temporal order. The translation problem, which is not addressed in this work, focuses on reordering and translating the recognition output into spoken language, which could then be written or spoken out by the generation part.
Automatic sign language recognition is a multi-disciplinary task that draws on techniques from its numerous neighbouring fields, such as speech recognition, computer vision and linguistics. Historically, research on sign language recognition has been relatively scattered, and researchers often independently captured their own small-scale data sets for experimentation. This has several disadvantages. In most cases, the data sets do not cover enough of the complexity that sign language encompasses. Moreover, most previous work does not tackle continuous sign language but only isolated single signs. Vocabularies have been small and very limited (fewer than 100 different signs), and no work has ever targeted real-life sign language.
Until now, the employed data sets only comprised artificial and staged sign language footage, which was planned and recorded with the aim of enabling automatic recognition. The kind of signs to be encountered, the structure of sentences, the signing speed, the choice of expression and the dialects were usually controlled and determined beforehand. This work aims at moving sign language recognition to more realistic scenarios. For this purpose we create the first real-life large vocabulary continuous sign language corpora, which are based on recordings of a broadcast channel featuring natural sign language of professional interpreters. This kind of data provides unprecedented complexity for recognition. In the scope of this thesis, we made these corpora publicly available free of charge. A conventional GMM-HMM statistical sign language recognition system with distinct and manually engineered features is created and evaluated on this challenging task. We then leverage the recent advances in deep learning and propose modern hybrid CNN-LSTM-HMM models, which are shown to halve the recognition error. We analyse the impact of various architectural design decisions with the aim of giving guidance to researchers in the field. Finally, we develop a weakly supervised learning scheme based on hybrid multi-stream CNN-LSTM-HMMs that allows the efficient spotting of subunits such as articulated handshapes and mouth patterns in sign language footage.