AI Research Internships for Undergraduates

The Johns Hopkins University Center for Language and Speech Processing is hosting the Eighth Frederick Jelinek Memorial Summer Workshop this summer and is seeking outstanding members of the current junior class enrolled in US universities to join this residential research workshop on human language technologies (HLT) from June 13 to August 5, 2022.

The internship includes a comprehensive two-week summer school on HLT, followed by six weeks of intensive research on select topics.

The 8-week workshop provides an intense, dynamic intellectual environment. Undergraduates work closely alongside senior researchers as part of a multi-university research team assembled for the summer to attack an HLT problem of current interest.

Teams and Topics

The teams and topics for 2022 are:

  • Speech Translation for Under-Resourced Languages
  • Multilingual and Code-Switching Speech Recognition
  • Leveraging Pre-training Models for Speech Processing

Full team descriptions appear at the end of this announcement.

We hope that this highly selective and stimulating experience will encourage students to pursue graduate study in HLT and AI, as it has for many years.

The summer workshop provides:

  • An opportunity to explore an exciting new area of research
  • A two-week tutorial on current speech and language technology
  • Mentoring by experienced researchers
  • Participation in project planning activities
  • Use of cloud computing services
  • A $6,000 stipend and $2,800 towards meals and incidental expenses
  • Private furnished accommodation for the duration of the workshop
  • Travel expenses to and from the workshop venue

Applications should be received by Friday, March 11, 2022. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Tuesday, March 15, 2022.

Questions can be directed to the JSALT 2022 organizing committee at [email protected] 

Applicants are evaluated solely on relevant skills, employment experience, past academic record, and the strength of their letters of recommendation. No limitation is placed on the undergraduate major. Women and minorities are encouraged to apply.


The Application Process

The application process has three stages.

  1. Completion and submission of the application form by March 11, 2022.
  2. Submission of the applicant’s CV to [email protected] by March 11, 2022.
  3. Provision of a letter of recommendation by the applicant’s faculty nominator, whose contact information was given in stage 1. The nominator will be asked to submit the letter electronically to [email protected] by March 15, 2022, in support of the applicant’s admission to the program.

Please note that the application will not be considered complete until it includes both the CV and the letter.


Team Descriptions:

Speech Translation for Under-Resourced Languages

Seamless communication between people speaking different languages is a long-term dream of humanity, and artificial intelligence aims to make it a reality. Despite recent dramatic improvements in machine translation, speech recognition, and speech translation, speech-to-speech translation (SST) remains a central problem in natural language processing, especially for under-resourced languages. During this workshop, our team will design a system that gathers and shares information across speech and text modalities in different languages, building a joint, cross-lingual, speech-and-text representation learning framework. Utterances in their spoken and written forms in several different languages will be clustered together in a common embedding space, enabling data augmentation and knowledge transfer to low-resource languages. Creating this common multilingual and multimodal space requires merging speech and text representation spaces across languages while accounting for the important differences that exist between the modalities. The team will then explore adapting this space to under-resourced languages with incomplete data, using recent techniques such as transfer learning, adapter networks, and prompting, while avoiding catastrophic forgetting. The final goal is to generate written and spoken output from representations in the common embedding space, enabling many-to-many speech/text-to-speech/text translation.
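The shared-space idea above can be sketched in a few lines. In this toy illustration, tiny hand-picked vectors stand in for the outputs of trained speech and text encoders (the vectors, names, and dimensions are all illustrative assumptions, not workshop code): utterances with the same meaning land close together in the common space regardless of modality or language, so nearest-neighbour search can align them.

```python
import numpy as np

def l2_normalize(v):
    """Project a vector onto the unit sphere so cosine similarity is a dot product."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

# Toy "embeddings": in a real system these would come from trained speech and
# text encoders mapping both modalities into one shared space.
speech_hello_en = np.array([0.9, 0.1, 0.0])    # spoken English "hello"
text_hello_en   = np.array([0.8, 0.2, 0.1])    # same utterance, written form
text_bonjour_fr = np.array([0.85, 0.15, 0.05]) # same meaning, another language
text_goodbye_en = np.array([0.0, 0.2, 0.9])    # different meaning

# Utterances with the same meaning cluster together across modality and
# language, so nearest-neighbour search in the shared space can align them.
candidates = {
    "text_hello_en": text_hello_en,
    "text_bonjour_fr": text_bonjour_fr,
    "text_goodbye_en": text_goodbye_en,
}
nearest = max(candidates, key=lambda k: cosine(speech_hello_en, candidates[k]))
print(nearest)  # → text_bonjour_fr
```

The same similarity structure is what makes the data augmentation and knowledge transfer mentioned above possible: a low-resource language's utterances can borrow neighbours from better-resourced languages in the shared space.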

Multilingual and Code-Switching Speech Recognition

The growing adoption of personal voice assistants and smartphones and the prevalence of code-switching in the spoken communications of many societies are forcing automatic speech recognition (ASR) systems to handle mixed-language input. Designing such systems is challenging, mainly due to data scarcity, grammatical complexity, and unbalanced language distributions. We will develop key speech and language processing technology enabling users to speak in more than one language, aiding essential downstream tasks and conversational AI technology. We will advance state-of-the-art large-vocabulary speech recognition in monolingual, multilingual, and code-switching settings across multiple language pairs in several ways: (i) understand where and why code-switching happens in speech and analyze human code-switching points, addressing complex social factors, dominant language of education/literacy, and topic/domain; (ii) design multilingual ASR systems where possible by leveraging pretrained, self-supervised models; (iii) explore methods to handle low-resourced languages/dialects by generating synthetic code-switched speech and text while upholding language-dependent construction and triggers; (iv) explore evaluation measures for mixed-script outputs, especially for intra-word code-mixing.
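To make point (iii) concrete, here is a deliberately minimal sketch of synthetic code-switched text generation: English nouns are swapped for their Spanish translations using a tiny bilingual lexicon. The lexicon and the "switch only nouns" rule are illustrative assumptions; real systems use learned word alignment and linguistically informed models of switch points and triggers.

```python
# Toy bilingual lexicon; pretend a POS tagger has marked these words as nouns.
LEXICON_EN_ES = {"dog": "perro", "house": "casa", "book": "libro"}
NOUNS = set(LEXICON_EN_ES)

def code_switch(sentence):
    """Generate a synthetic code-switched sentence by replacing English nouns
    with their Spanish translations; all other words are left untouched."""
    out = []
    for word in sentence.split():
        out.append(LEXICON_EN_ES[word] if word in NOUNS else word)
    return " ".join(out)

print(code_switch("the dog is in the house"))  # → "the perro is in the casa"
```

Sentences generated this way can augment scarce real code-switched training data, though respecting grammatical constraints on where switches may occur is exactly the hard part the team description highlights.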

Leveraging Pre-training Models for Speech Processing

Deep learning has revolutionized speech processing, delivering impressive gains in scenarios rich in labeled data. But heavy reliance on labeled data hinders the development of new speech applications and limits progress in new languages and domains. There are about 7,000 languages globally, and it is impossible to label large amounts of speech data for all of them. To overcome the dire need for labeled data, the pre-training paradigm emerged: models are first pre-trained to develop generic knowledge. Inspired by the fact that human babies learn their native languages by listening and interacting with their families and surroundings with little or no labeled data, scientists have developed pre-training techniques that require only unlabeled data, which is ubiquitous and can easily be collected from the Internet. These pre-trained models are then adapted to target downstream applications. Because general-purpose knowledge has been developed during pre-training, each downstream application requires only minimal labeled data for fine-tuning. Pre-training has proven crucial in advancing the state of the art in speech processing. This project will extend pre-trained models to more speech applications, find the most efficient ways to utilize them, and make them more robust and environmentally friendly.
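The pre-train/fine-tune workflow described above can be caricatured in a few lines. In this toy sketch, "pre-training" is merely estimating feature statistics from plentiful unlabeled data, and "fine-tuning" is fitting a small linear head on a much smaller labeled set; in practice, pre-training means self-supervised learning of a large neural encoder, and all data and dimensions here are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: "pre-training" on abundant unlabeled data. Here the learned
# knowledge is just per-feature normalization statistics; in reality it is
# the weights of a large self-supervised encoder.
unlabeled = rng.normal(loc=5.0, scale=2.0, size=(10_000, 4))
mean, std = unlabeled.mean(axis=0), unlabeled.std(axis=0)

def encode(x):
    """The 'pre-trained model': a fixed transform learned without labels."""
    return (x - mean) / std

# Stage 2: "fine-tuning" on a tiny labeled set (scarce by comparison).
x_small = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
y_small = (x_small.sum(axis=1) > 20.0).astype(float)  # synthetic labels

z = encode(x_small)
# A minimal linear head fit by least squares on the encoded features.
features = np.c_[z, np.ones(len(z))]
w, *_ = np.linalg.lstsq(features, y_small, rcond=None)

def predict(x):
    zx = encode(np.atleast_2d(x))
    return (np.c_[zx, np.ones(len(zx))] @ w > 0.5).astype(int)
```

The division of labor is the point: the expensive, label-free stage is done once, and each downstream task needs only the small labeled set used to fit its lightweight head.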
