Visual Language Embeddings

2020 Seventh Frederick Jelinek Memorial Summer Workshop

Team Leader

Murat Saraçlar, Bogazici U., Turkey

Senior Members

Ebru Arısoy, MEF U., Turkey

Ivan Gruber, U. of West Bohemia, Czechia

Miroslav Hlavac, U. of West Bohemia, Czechia

Marek Hruz, U. of West Bohemia, Czechia

Herman Kamper, Stellenbosch U., S. Africa

Jakub Kanis, U. of West Bohemia, Czechia

Jan Trmal, Johns Hopkins University, USA

Graduate Students

Amanda Duarte, UPC, Spain (PhD Student – remote)

Alp Kındıroğlu, Bogazici U., Turkey (PhD Student)

Oğulcan Özdemir, Bogazici U., Turkey (PhD Student)

Bowen Shi, TTIC, USA (PhD Student)

Benjamin van Niekerk, Stellenbosch U., S. Africa (PhD Student)

Senior Affiliates

Lale Akarun, Bogazici U., Turkey (part-time)

Kadir Gökgöz, Bogazici U., Turkey (remote)

Karen Livescu, TTIC, USA (part-time)

Greg Shakhnarovich, TTIC, USA (part-time)

Sign languages, the native languages of Deaf communities, are visual languages that convey meaning through hand, face, and body gestures. Each culture has its own sign language, and some of these languages are not well documented. The advances in spoken language recognition and in language-to-language translation are not yet paralleled in sign language processing. This may be due to the difficulty of the visual tasks, the scarcity of data, and the smaller size of the communities that use these languages.

Most sign language resources are either recorded in the lab (controlled environment, isolated signs, often one-handed, sometimes finger-spelled) or are domain specific (e.g., weather reports). Sign languages are often annotated using glosses, which are semantic labels. Some datasets have been annotated using special notation systems such as HamNoSys and SignWriting. Other sign resources consist of signed interpretations of the speech and sounds in television videos. These resources often contain subtitles corresponding to the spoken content, which are loosely aligned with the video. If subtitles are not directly available, an automatic speech recognition system can be used to obtain them from the speech. Our earlier work used news programs accompanied by signed speech (Santemiz 2009), in which the speaker translated each word to a sign simultaneously with speech. More recently, films and other TV programs annotated with a separate sign language channel displayed as a layer on the original video have become abundant. The availability of such data has motivated end-to-end neural sign translation attempts (Camgöz 2018). However, unit discovery and tokenization improve the success of translation substantially. Some initial unit discovery attempts have yielded promising results on sign language resources recorded with RGB-D cameras (Tournay 2019).

Sign languages and gestures share common articulators and have similar units. Therefore, it makes sense to model both of these visual means of communication together. The interest in gesture-based control of devices has intensified with the availability of RGB-D sensors and libraries that enable real-time articulated pose capture of the human body (Keskin 2012). Gesture-based control is especially popular in healthcare, where touchless input is required to preserve the sterility of the hands, and in noisy environments where voice input is not feasible.

The availability of human pose data with ground truth annotations has enabled the training of CNNs that can estimate the human pose from 2D images (Cao 2018). Recent advances in hand skeleton estimation (Moon 2018, Xiong 2019) have enabled hand-skeleton-based approaches. These approaches work well on depth data and are able to predict the 3D locations of finger joints with a precision of around 8 to 15 mm, depending on the complexity of the data. It has been observed that an average error of around 20 mm approaches the limit of human accuracy (Supancic 2015). Even though the estimation of joint locations from purely 2D data lacks this precision, progress in the field is rapid.


In this project, we aim to apply the recent developments in deep learning to obtain visual embeddings, first at the frame level and ultimately for sequences. Our goal is to discover units of visual languages – gestures and signs – in an unsupervised fashion using these embeddings. Since visual languages make use of information about the position, orientation, and pose of each hand, as well as facial expressions, together with the movements of both hands and even the face, separate embeddings for these different aspects will be combined. Since motion blur and occlusions are very common in this setting, we will also incorporate confidence scores into the estimation process.

We will work on multiple sign languages in order to obtain multilingual embeddings. Multilingual modeling has been shown to improve performance for automatic speech recognition and machine translation, especially for low-resource languages. The embeddings will make use of both gesture and sign data. Previous work indicates that gesture information helps fingerspelling recognition (Shi and Livescu, 2017).


The visual language embeddings lie at the core of the project. These embeddings will follow the usual idea: we want similar hand poses, facial expressions, and ultimately signs or gestures to be close to each other in the embedded space. The key questions are: what does "similar" mean, and what does "close" mean? For our purposes, similarity is linked to the semantic interpretation of the signs and gestures, and distances in the embedding space will be tied to the proxy tasks.

On the frame level:

Hand shape (pose)

Hand appearance in the image is influenced by three factors: hand morphology, which may vary from person to person; the articulated pose of the hand skeleton, in terms of the angles of the hand joints; and the camera pose, which depends on the hand orientation with respect to the camera. In sign language recognition research, the term “hand shape” refers to a joint feature extracted from the combined appearance. However, meaning is conveyed through the pose and should not be influenced by hand morphology. Hand pose is the configuration of the fingers stripped of the identity of the person performing the gesture. The same hand pose performed by different people will have a similar relative spatial distribution of the finger joints. This spatial distribution can be described, e.g., by relative angles between and/or positions of the joints. One open question is whether to encode the hand orientation directly into the embedding or to keep it as separate information. Another interesting topic is 2D vs. 3D joint positions. We will monitor methods for estimating joint locations from purely 2D data in order to implement one before the workshop. Alternatively, we may use adversarial training to eliminate the effects of different hand morphologies.
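The normalization described above can be sketched in a few lines. This is an illustrative example, not part of the project code: it strips hand morphology from a skeleton by removing translation and person-specific scale, and computes relative angles between bones. The 21-joint layout (0 = wrist, 9 = middle-finger base) is an assumption borrowed from common hand-keypoint conventions.

```python
import numpy as np

def normalize_hand(joints, wrist=0, ref_joint=9):
    """joints: (21, 3) array of 3D joint positions.
    Removes translation (wrist to origin) and person-specific scale
    (length of the wrist -> middle-finger-base reference bone)."""
    joints = np.asarray(joints, dtype=float)
    centered = joints - joints[wrist]            # translation invariance
    scale = np.linalg.norm(centered[ref_joint])  # person-specific hand size
    return centered / (scale + 1e-8)             # scale invariance

def joint_angles(joints, bones):
    """Relative angles between consecutive bones, invariant to morphology.
    bones: list of (parent, child) index pairs forming a chain."""
    vecs = [joints[c] - joints[p] for p, c in bones]
    angles = []
    for a, b in zip(vecs[:-1], vecs[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)
```

Two hands that differ only in size and position map to the same normalized pose and the same bone angles, which is exactly the invariance the embedding should capture.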

Our early work has shown that using different data sources, with a different task defined on each, results in better embeddings that improve the performance of sign-to-spoken-language translation.

Face embeddings

Face embeddings (Schroff 2015, Deng 2019) are usually used to identify faces; that is, the embedding of a person's face should be independent of the expression. In this work we aim to do the exact opposite: preserve the expression and discard the identity. The expression of the face can be described by the distribution of facial landmarks or features – the shape of the eyes, eyebrows, mouth, etc. We need to design the description so that it is relative to the neutral facial expression of the analyzed person, e.g., the relative change in landmark positions. To detect the landmarks we can use either a statistical approach (Cootes 2001) or machine learning (Asthana 2014, Kazemi 2014).
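The idea of an expression descriptor relative to the neutral face can be sketched as follows. This is a hypothetical illustration: the 68-landmark layout and the use of inter-ocular distance for scale normalization are assumptions, not choices fixed by the project.

```python
import numpy as np

def canonicalize(landmarks, left_eye=36, right_eye=45):
    """landmarks: (68, 2) 2D facial landmarks.
    Removes translation (centering) and scale (inter-ocular distance)."""
    lm = np.asarray(landmarks, dtype=float)
    lm = lm - lm.mean(axis=0)                       # remove translation
    iod = np.linalg.norm(lm[right_eye] - lm[left_eye])
    return lm / (iod + 1e-8)                        # remove scale

def expression_descriptor(landmarks, neutral_landmarks):
    """Identity-reduced descriptor: displacement from the neutral face."""
    return canonicalize(landmarks) - canonicalize(neutral_landmarks)
```

A frame showing the neutral expression, even at a different position and scale, yields a zero descriptor; only deviations from the neutral face survive.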

Another side channel is lipreading. Signers often use “mouthings” to articulate corresponding spoken words as they sign. We may develop simple mouth embeddings and use them as a side channel.


The dynamics of gestures can be described in two ways: first, the movement of limbs and other body parts; second, the apparent motion in the form of optical flow, which describes changes in texture. The trajectories of body parts have to be normalized by the observed lengths of the skeletal parts. The optical flow can be expressed as a relative histogram of flow directions.
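The relative histogram of flow directions mentioned above can be sketched like this. The flow is assumed to be a dense (H, W, 2) field of per-pixel (dx, dy) displacements; the bin count and magnitude weighting are illustrative choices.

```python
import numpy as np

def flow_direction_histogram(flow, n_bins=8):
    """flow: (H, W, 2) dense optical flow field.
    Returns an n_bins histogram of flow directions, weighted by flow
    magnitude and normalized to sum to one (a relative histogram)."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                  # angle in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Because the histogram is normalized, it describes the distribution of motion directions independently of the absolute amount of motion in the frame.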

Embeddings will be realized through deep neural networks. Modern architectures will be used, and training will follow either the “triplet loss” of (Schroff 2015) or the “arc loss” of (Deng 2019).

To establish a confidence measure on the frame-level embeddings, we will compute standard image-degrading factors such as noise, motion blur, and resolution. For the hand pose, we will compare the detected pose to a sign language hand pose prior.
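One such degradation cue can be sketched with the variance of a Laplacian response, a common blur indicator. The mapping of sharpness to a [0, 1] confidence below is an illustrative choice, not a calibrated one.

```python
import numpy as np

def laplacian_variance(gray):
    """gray: (H, W) grayscale image as a float array.
    Variance of a 4-neighbour discrete Laplacian over the interior;
    low variance indicates a blurry (low-detail) frame."""
    g = np.asarray(gray, dtype=float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def sharpness_confidence(gray, scale=100.0):
    """Map sharpness to (0, 1): higher Laplacian variance -> higher confidence.
    The `scale` constant is a hypothetical tuning parameter."""
    v = laplacian_variance(gray)
    return v / (v + scale)
```

A high-frequency (sharp) frame scores well above a flat, detail-free one, so downstream models can down-weight embeddings from motion-blurred frames.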

In order to obtain sequence embeddings, we will follow recent work on acoustic word embeddings based on encoder-decoder architectures including variational and correspondence autoencoders (Kamper 2019).
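A simple non-neural baseline from the acoustic word embedding literature is to downsample each variable-length sequence of frame embeddings to a fixed number of frames and flatten the result, before moving to trained encoder-decoder models. The choice of 10 frames and linear interpolation here is illustrative.

```python
import numpy as np

def downsample_embed(frames, n=10):
    """frames: (T, D) sequence of frame-level embeddings.
    Linearly interpolates to n frames and flattens to a fixed-size
    (n * D,) vector, regardless of the original length T."""
    frames = np.asarray(frames, dtype=float)
    T, D = frames.shape
    idx = np.linspace(0, T - 1, n)           # fractional sample positions
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (idx - lo)[:, None]
    sampled = (1 - w) * frames[lo] + w * frames[hi]
    return sampled.ravel()
```

Sequences of different lengths map to vectors of the same dimensionality, which is the property the encoder-decoder embeddings must also provide.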


Datasets

  • BosphorusSign / HospiSign (Turkish Sign Language)
  • RWTH-PHOENIX-Weather
  • Continuous German Sign Language Dataset (DGS-Korpus)
  • Czech Weather reports
  • ChicagoFSWild (ASL)
  • SeBeDer (movies with Turkish Sign Language descriptions)
  • TRT Broadcast News for the hearing impaired (Turkish)


In order to evaluate the quality of the embeddings, we will make use of various proxy tasks of increasing complexity. Our first task will be the same-different task, where the embeddings are used to decide whether two signs (or gestures) are the same or different. Our main task will be the unsupervised discovery of signs (or gestures). For this task, we will use matching and clustering performance measures similar to those of the Zero Resource Speech Challenges. In addition, we will use the embeddings in hand action recognition and sign language recognition tasks. For sign language recognition, we will use the neural sign language translation framework (Camgöz 2018).
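The same-different evaluation can be sketched as scoring pairs of embeddings by cosine similarity and measuring how well the scores separate same-sign pairs from different-sign pairs, e.g., with average precision. This is an illustrative implementation of that metric, not the project's evaluation code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def average_precision(scores, labels):
    """scores: similarity per pair; labels: 1 for same-sign pairs, 0 otherwise.
    Ranks pairs by score and averages precision at each true-pair rank."""
    order = np.argsort(scores)[::-1]          # highest similarity first
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    ranks = np.arange(1, len(labels) + 1)
    precisions = hits / ranks
    return float((precisions * labels).sum() / max(labels.sum(), 1))
```

An embedding that ranks every same-sign pair above every different-sign pair achieves an average precision of 1.0; any ranking errors pull the score down.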

Other alternative tasks include a hand blurry/not-blurry task, a hand up/down task, and detection of certain signs that appear very frequently (e.g., the "no" sign and the question sign).


Center for Language and Speech Processing