About the Project: In the realm of Sign Language Translation (SLT), groundbreaking advancements have emerged, creating a bridge between sign language and spoken language. This rapidly developing field spans Natural Language Processing, Machine Translation, Visual Processing, Multi-modal Data Representation, and SL Linguistics. Our project aims to harness the collective expertise of these domains to achieve a common goal.

Why SLT Matters: Sign Language Translation stands at the forefront of inclusivity, breaking barriers in communication accessibility for the Deaf and hard-of-hearing communities. It’s not merely a technological advancement; it’s a pivotal tool that bridges the gap between sign language—a rich, expressive mode of communication—and the spoken word. By enabling seamless translation between these languages, SLT empowers individuals to interact, learn, and engage universally, fostering social integration and equal opportunities. This field holds the key to creating a world where communication knows no bounds.
Leveraging Large Language Models (LLMs): We’re at the forefront of leveraging the power of Large Language Models (LLMs) in SLT. Because pre-trained LLMs already encode rich linguistic knowledge, they can serve as efficient decoders in a translation model. They can also guide image/pose encoders toward robust linguistic representations of the input sequences.

Why You Should Join: This workshop offers a unique opportunity for young minds to delve into a pioneering field that not only merges technology and language but also bridges communication gaps for the hearing-impaired. As a participant, you’ll collaborate with experts, engage in hands-on experimentation, and contribute to shaping the future of communication technology.

Who Should Apply: Passionate students with a keen interest in Natural Language Processing, Machine Translation, Visual Processing, Multi-modal Data Representation, SL Linguistics, or related fields. No prior experience in SLT is required—just a thirst for knowledge and a drive to innovate.
Join us in unraveling the mysteries of Sign Language Translation. Together, we’ll pioneer advancements that redefine communication paradigms and make a lasting impact on society.

Motivation

In recent years, there has been notable progress in Sign Language Translation (SLT) [5], i.e., the translation of a sign language video directly into its spoken-form counterpart (Figure 1). SLT encompasses several areas of research: Natural Language Processing, Machine Translation, Visual Processing, Multi-modal data representation and fusion, and SL linguistics. As such, it is a solid candidate for bringing together experts from these fields to work on a common goal. There are two major input representations of SL: (1) Appearance-based – a sequence of images depicting the signing person; (2) Pose-based – a sequence of estimated skeletal poses of the signer. Pose-based approaches have greater application potential than appearance-based methods due to their much lower computational cost and their ability to operate on low-end mobile devices. Furthermore, it is easier to augment and generate pose representations [1]. On the other hand, appearance-based methods achieve higher recognition rates. In this project, we want to leverage the potential of Large Language Models (LLMs) for the task of SLT. A pre-trained LLM already encodes knowledge about a language and can thus be used as an efficient decoder of the translation model. It should also be able to guide an image/pose encoder toward a strong linguistic representation of the input sequences.
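To make the two input representations concrete, the following minimal sketch contrasts them; the clip length, frame size, and keypoint count are illustrative assumptions rather than project choices:

```python
import numpy as np

# Appearance-based input: a clip of T RGB frames showing the signing person.
T, H, W = 128, 224, 224                                  # illustrative clip length and frame size
appearance_clip = np.zeros((T, H, W, 3), dtype=np.uint8)

# Pose-based input: the same clip reduced to K estimated keypoints per frame
# (here assumed 33 body + 2x21 hand keypoints), each with a confidence score.
K = 75
pose_clip = np.zeros((T, K, 3), dtype=np.float32)        # (x, y, confidence)

# The pose representation is orders of magnitude smaller, which is the main
# reason it can run on low-end mobile devices.
print(appearance_clip.nbytes // 1024, "KiB vs", pose_clip.nbytes // 1024, "KiB")
```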

State of the Art

SLT is one component of automatic sign language understanding. Historically, the problem was tackled in a bottom-up fashion. At first, methods only translated individual signs into glosses by means of classification [9]. A recent survey [8] shows that the most commonly used approaches involve fine-tuning pre-trained visual models. Appearance-based visual models outperform pose-based methods [1] in terms of accuracy until self-supervised pose-based pre-training is applied [3].

Later, methods were developed that predict a sequence of glosses, which is then translated into the textual representation of the spoken form of the language [2]. Since the development of robust foundation models in NLP, Vision, and multi-modal processing, the focus of researchers has shifted to end-to-end SLT, so-called Gloss-Free SLT [13]; in that work, the authors utilize visual-language pre-training. Cross-language unsupervised learning on different sign languages is a newer direction, with only a few attempts so far [3, 12].

The state-of-the-art methods indicate that: (1) appearance-based approaches (i.e., those employing a visual encoder) outperform pose-based approaches; (2) self-supervised visual- and pose-based pre-training is beneficial; (3) cross-sign-language learning improves results in the individual languages. The methods are tested mainly on two benchmark datasets: Phoenix Weather 2014 [2] and CSL-Daily [14]. Within our team, we have developed a (then state-of-the-art) pose-based isolated sign language recognition system [1] and a sign-pose generative model [11] working in an end-to-end fashion (i.e., from text to a sequence of sign poses).

Research Proposal

Our main goal is to develop an SLT system based on an LLM. LLMs are now becoming multi-modal, which generally means image + text. There are no LLMs yet that can handle “long-ish” video understanding, although some generative models can consume short clips (e.g., BLIP-2 [4], Vid2Seq [10]). The main problem is data scarcity for any reasonably defined task. In this project, we want to build the first multi-modal LLM that consumes “long-form” video in the form of SL. The task of SLT is reasonably well defined (language to language), and there is enough data available that we can handle this project with a relatively small team, in a short time frame, and with reasonable computational power. Recently, the authors of the LLaVA model [6] (see Figure 2) demonstrated that it is possible to tune an image encoder to produce language tokens as input for an LLM decoder to solve various problems, including image captioning, question answering, and image reasoning. The authors of the AnyMAL model [7] further demonstrated that the same tuning principle works with many other modalities, such as sound, IMU motion sensors, and video.
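As a rough illustration of this LLaVA-style setup, the sketch below projects frozen vision-encoder features into the embedding space of an LLM decoder. The dimensions, module names, and the single linear projection are our own simplifying assumptions (LLaVA itself pairs a pre-trained CLIP encoder with a Vicuna decoder), and the decoder is assumed to accept pre-computed input embeddings:

```python
import torch
import torch.nn as nn

class VisualPrefixSLT(nn.Module):
    """Sketch: map vision-encoder features into the LLM embedding space (LLaVA-style)."""

    def __init__(self, vision_encoder, llm, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # pre-trained and kept frozen (assumption)
        self.llm = llm                                    # decoder assumed to take `inputs_embeds`
        self.projection = nn.Linear(vision_dim, llm_dim)  # the only newly trained module here

    def forward(self, frames, target_token_embeddings):
        # frames: (B, T, C, H, W) sign language clip
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.vision_encoder(frames.flatten(0, 1))    # (B*T, vision_dim)
        visual_tokens = self.projection(feats).view(B, T, -1)    # (B, T, llm_dim)

        # Prepend the projected visual tokens to the text embeddings and let the
        # LLM decode the spoken-language translation autoregressively.
        inputs = torch.cat([visual_tokens, target_token_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```

In this sketch only the projection layer is trained, which mirrors the lightweight alignment stage of visual instruction tuning; in practice the LLM can additionally be fine-tuned or adapted.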

During the workshop, we will make use of the principles of visual and video instruction tuning to tackle the problem of SLT. Instead of a text to summarize, we give the model a projected representation extracted from the image sequence by a pre-trained vision encoder. After visual instruction tuning with a reasonably sized training run, the model produces a language response (i.e., the translation). We want to experiment with different architectures of the image/pose/video encoder. We will begin with a masked autoencoder pre-trained on masked images from sign language sequences to handle the appearance-based input, and a BERT-like style of pre-training for the pose-based input. Furthermore, we want to experiment with whether pose-conditioned masking of the appearance input pre-trains a more robust model.
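As one possible reading of pose-conditioned masking, the hypothetical helper below biases the choice of masked-autoencoder patches toward patches that contain body/hand keypoints, so that reconstruction focuses on the linguistically informative regions. The function name, scoring weights, and mask ratio are illustrative assumptions, not a committed design:

```python
import torch

def pose_conditioned_mask(keypoints, image_size=224, patch_size=16, mask_ratio=0.75):
    """Sketch: pick MAE patches to mask, biased toward patches containing keypoints.

    keypoints: (N, 2) tensor of (x, y) pixel coordinates for one frame (body/hands).
    Returns a boolean mask over the flattened patch grid, True = masked.
    """
    grid = image_size // patch_size                    # e.g. a 14 x 14 patch grid
    num_patches = grid * grid
    num_masked = int(mask_ratio * num_patches)

    # Map each keypoint to the patch it falls into and boost that patch's score.
    patch_idx = (keypoints // patch_size).long().clamp(0, grid - 1)
    flat_idx = patch_idx[:, 1] * grid + patch_idx[:, 0]
    score = torch.ones(num_patches)                    # base score for background patches
    score[flat_idx] += 10.0                            # prefer signing regions (assumed weight)

    # Sample patches to mask with probability proportional to the score.
    masked = torch.multinomial(score, num_masked, replacement=False)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[masked] = True
    return mask
```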

We also want to explore different options for processing the temporal nature of the input image sequence. At one extreme, each image in the sequence can be projected into a textual token; at the other, the whole sequence can be projected into a single token. In between lie options that process micro-sequences of a given or adaptive size. The main research question we want to answer is: “Can an LLM guide a pose/image encoder in learning a strong textual representation for sign language translation?” For this we will need strong grounding: learning at word-level scope first, gradually moving to longer sequences, and robust self-supervised pre-training of the encoders.
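The sketch below illustrates the middle option: fixed-size micro-sequences (windows) of frame features are pooled into one token each, where window = 1 recovers one token per frame and window = T collapses the whole clip into a single token. The window size, feature dimension, and mean pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn

def frames_to_tokens(frame_features, window=8, projection=None):
    """Sketch: pool micro-sequences of frame features into LLM input tokens.

    frame_features: (B, T, D) per-frame features from the vision/pose encoder.
    window=1 -> one token per frame; window=T -> one token for the whole clip.
    """
    B, T, D = frame_features.shape
    pad = (-T) % window                                  # pad so T divides evenly into windows
    if pad:
        frame_features = torch.cat(
            [frame_features, frame_features.new_zeros(B, pad, D)], dim=1)
    pooled = frame_features.view(B, -1, window, D).mean(dim=2)   # (B, ceil(T/window), D)
    return projection(pooled) if projection is not None else pooled

# Example: 120 frames pooled into 15 tokens, then projected to an assumed LLM width of 4096.
proj = nn.Linear(512, 4096)
tokens = frames_to_tokens(torch.randn(2, 120, 512), window=8, projection=proj)
print(tokens.shape)   # torch.Size([2, 15, 4096])
```

An adaptive variant would replace the fixed window with segment boundaries predicted from the signing, but that is left open as one of the design questions for the workshop.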

Data

There are several datasets available that we can use during the project, covering both isolated and continuous SL. For pre-training, we can use multi-lingual data. The word-level datasets include LSA64 (Argentinian), WLASL, MS-ASL (American), AUTSL, BosphorusSign22k (Turkish), CSL-Daily (Chinese), and GSL (Greek). The sentence-level datasets include RWTH-PHOENIX-2014T, Public DGS Corpus (German), BOBSL (British), SWISSTXT (Swiss), CSL-Daily (Chinese), KETI (Korean), How2Sign, OpenASL, YouTube-ASL (American), SP-10, and AfriSign (various). Together, they amount to a few thousand hours of video. We will also keep an eye on newly released datasets. Finally, we have started a collaboration with Czech National TV on downloading interpreted videos with subtitles and/or audio that could serve as training, validation, and test data alongside the existing datasets.

Timeline

Before the workshop, we will prepare the majority of the training data. Poses will be extracted from the sign language videos using existing frameworks such as MMPose or MediaPipe. The new SL data will need to be parsed into sentence-like parts suitable for learning. We will pre-train the appearance- and pose-based encoders using self-supervision, prepare training pipelines for SLT with the LLaVA model, and test them on a subset of the data.
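A minimal sketch of the pose-extraction step with MediaPipe Holistic is shown below; the video path is a placeholder, and keeping only 33 body and 2 × 21 hand landmarks (dropping the face mesh) is our own simplification:

```python
import cv2
import mediapipe as mp
import numpy as np

def landmarks_to_array(landmarks, n_points):
    """Return an (n_points, 3) array of (x, y, z); zeros if the landmarks are missing."""
    if landmarks is None:
        return np.zeros((n_points, 3), dtype=np.float32)
    return np.array([[p.x, p.y, p.z] for p in landmarks.landmark], dtype=np.float32)

def extract_poses(video_path):
    """Sketch: per-frame body and hand landmarks from a signing video (MediaPipe Holistic)."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        frames.append(np.concatenate([
            landmarks_to_array(results.pose_landmarks, 33),        # body
            landmarks_to_array(results.left_hand_landmarks, 21),   # left hand
            landmarks_to_array(results.right_hand_landmarks, 21),  # right hand
        ]))
    cap.release()
    holistic.close()
    return np.stack(frames) if frames else np.empty((0, 75, 3), dtype=np.float32)

# poses = extract_poses("signing_clip.mp4")   # placeholder path; result shape (T, 75, 3)
```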

During the workshop, we will continue the training after preliminary observations and discussions. We want to evaluate the SLT system both with quantitative scores such as BLEU and with human-level qualitative studies. We will prepare several scenarios of communication between a sign language user and a person without knowledge of sign language, in which the SLT system has to guide the users toward common goals such as shopping or finding a place. The quality of the dialogue will be evaluated in several respects: whether the goal was achieved, how many queries were needed, how much time it took, and so on. After the workshop, we would like to test the model on more general tasks such as gesture or action recognition.
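For the quantitative part, a minimal sketch of corpus-level BLEU scoring with the sacreBLEU toolkit could look as follows; the hypothesis and reference sentences are invented placeholders:

```python
import sacrebleu

# System outputs and the corresponding reference spoken-language translations.
hypotheses = ["the weather will be sunny tomorrow", "thank you very much"]
references = [["tomorrow the weather will be sunny", "thank you very much"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")   # corpus-level score on a 0-100 scale
```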

Team

The project requires experts from several fields.
• Vision – Experience with vision pre-training, pose estimation, video analysis, and general image processing.
• NLP – Experience with pre-training of LLMs or similar mask-based self-supervised transformer models, auto-regressive generative models, and evaluation of NLP tasks.
• Speech – Experience with speech pre-training and representation, time-series processing, and automatic speech recognition.
• Multi-modality – Experience with modality instruction tuning and modality fusion.
• SL tech – Experience with SL processing (recognition, translation, generation) and a basic understanding of SL linguistics: manual/non-manual components, facial expression, hand shape.
• SL linguist – Understands at least one SL and the underlying principles of communication and generation.

Several experts have expressed interest in cooperating on this project during the workshop, either as consultants or as hands-on team members. Vision: Marek Hruz (University of West Bohemia, team leader, SL recognition), David Jacobs (Meta, video processing, Technical Program/Project Manager), Greg Shakhnarovich (TTI-Chicago, image understanding), Lale Akarun (SL processing). NLP: Jan Svec (University of West Bohemia, pre-training and fine-tuning of LLMs, speech processing). Speech: Murat Saraclar (SL understanding, pre-training, fine-tuning). Multi-modality: Florian Metze (Meta). SL linguist: Annemarie Kocab (Johns Hopkins), Lenka Okrouhlikova (Charles University). General consulting: Shankar Kumar, Katrin Kirchhoff (Amazon), Yan Huang (Microsoft).

References

[1] M. Boháček and M. Hrúz. Sign pose-based transformer for word-level sign language recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 182–191, January 2022.

[2] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7784–7793, 2018.

[3] H. Hu, W. Zhao, W. Zhou, and H. Li. SignBERT+: Hand-model-aware self-supervised pre-training for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[4] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.

[5] Z. Liang, H. Li, and J. Chai. Sign language translation: A survey of approaches and techniques. Electronics, 12(12):2678, 2023.

[6] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

[7] S. Moon, A. Madotto, Z. Lin, T. Nagarajan, M. Smith, S. Jain, C.-F. Yeh, P. Murugesan, P. Heidari, Y. Liu, et al. AnyMAL: An efficient and scalable any-modality augmented language model. arXiv preprint arXiv:2309.16058, 2023.

[8] N. Sarhan and S. Frintrop. Unraveling a decade: A comprehensive survey on isolated sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3210–3219, October 2023.

[9] J. Trmal, M. Hrúz, J. Zelinka, P. Campr, and L. Müller. Feature space transforms for Czech sign-language recognition. In Ninth Annual Conference of the International Speech Communication Association, 2008.

[10] A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid. Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023.

[11] J. Zelinka and J. Kanis. Neural sign language synthesis: Words are our glosses. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.

[12] W. Zhao, H. Hu, W. Zhou, J. Shi, and H. Li. BEST: BERT pre-training for sign language recognition with coupling tokenization. arXiv preprint arXiv:2302.05075, 2023.

[13] B. Zhou, Z. Chen, A. Clapés, J. Wan, Y. Liang, S. Escalera, Z. Lei, and D. Zhang. Gloss-free sign language translation: Improving from visual-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20871–20881, 2023.

[14] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1316–1325, 2021.

Team Leader

Marek Hruz

Senior Members

Murat Saraclar
Ivan Gruber
Miroslav Hlavac
Kevin Duh

Graduate Students

Jakub Straka
Tomas Zelezny
Shester Gueuwou
Jiri Mayer
Dominik Machacek
Xuan Zhang

Masters Student

Karahan Sahin

Opening Day Team Presentation (Video)(PDF)
Closing Presentation (Video)

Center for Language and Speech Processing