AI Research Internships for Undergraduates

***Applications should be received by Monday, March 10th, 2025. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Friday, March 14th, 2025. See below for the application process and the link to apply.***

The Johns Hopkins University Center for Language and Speech Processing is hosting the Eleventh Frederick Jelinek Memorial Summer Workshop this summer and is seeking outstanding members of the current junior class enrolled in US universities to join this residential research workshop on human language technologies (HLT) from June 9th to August 1st, 2025.

The internship includes a comprehensive 2-week summer school on HLT, followed by intensive research projects on select topics for 6 weeks.

The 8-week workshop provides an intense, dynamic intellectual environment. Undergraduates work closely alongside senior researchers as part of a multi-university research team, which has been assembled for the summer to attack HLT problems of current interest.


Teams and Topics:

The teams and topics being considered for 2025 are:

  • Play your Part: Towards LLM role-playing agents that stick to their role
  • Advancing Expert-Level Reasoning and Understanding in Large Audio Language Models
  • End-to-End Multi-Channel Multi-Talker ASR (EMMA)
  • TTS4ALL: TTS in low-resource scenarios: data management, methodology, models, evaluation

Full descriptions are given under “Team Descriptions” below.

We hope this stimulating and highly selective experience will encourage students to pursue graduate study in HLT and AI, as it has for many years.

The summer workshop provides:

  • An opportunity to explore an exciting new area of research
  • A two-week tutorial on current speech and language technology
  • Mentoring by experienced researchers
  • Participation in project planning activities
  • Use of cloud computing services
  • A $6,000 stipend and some additional funds for meals and incidental expenses
  • Private furnished accommodation for the duration of the workshop
  • Travel expenses to and from the workshop venue

Questions can be directed to the JSALT 2025 organizing committee at: [email protected] 

Applicants are evaluated only on relevant skills, employment experience, past academic record, and the strength of letters of recommendation. No limitation is placed on the undergraduate major. Women and minorities are encouraged to apply.

APPLY HERE

The Application Process

The application process has three stages.

  1. Completion and submission of the application form by March 10, 2025.
  2. Submission of the applicant’s CV to [email protected] by March 10, 2025.
  3. Submission of a recommendation letter by the applicant’s faculty nominator, whose contact information was provided in stage 1, in support of the applicant’s admission to the program. The letter should be sent electronically to [email protected] by March 14, 2025.

Please note that the application will not be considered complete until it includes both the CV and the letter.

Team Descriptions:

Play your Part: Towards LLM role-playing agents that stick to their role

Large Language Models (LLMs)—neural networks trained as auto-regressive generative models on web-scale text datasets—can be prompted to perform various tasks, including dialogue, enabling natural, human-like interaction. This has led to their widespread use in chatbots like ChatGPT. These systems prompt an LLM to role-play an agent by describing its persona and following a dialogue template.
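
As a rough illustration of the prompting pattern described above (not code from the project itself), the sketch below assembles a chat-style prompt from a persona description and the dialogue so far; `chat_completion` is a hypothetical stand-in for whichever chat LLM API is actually used.

```python
# Minimal sketch of persona-based role-play prompting: the persona goes in a
# system message, followed by the dialogue template of alternating turns.
# `chat_completion` is a hypothetical placeholder for the chosen LLM API.

def build_role_play_prompt(persona: str, history: list[dict], user_turn: str) -> list[dict]:
    """Assemble a chat-format prompt: persona description first, then the dialogue so far."""
    system_msg = {
        "role": "system",
        "content": "You are playing the following character. Stay in character at all times.\n"
                   f"Persona: {persona}",
    }
    return [system_msg, *history, {"role": "user", "content": user_turn}]


persona = "A 68-year-old patient describing recurring chest pain to a doctor."
history: list[dict] = []   # earlier turns, e.g. {"role": "assistant", "content": "..."}
messages = build_role_play_prompt(persona, history, "Can you describe your symptoms?")
# reply = chat_completion(messages)   # hypothetical call to the chosen LLM
```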

To facilitate interaction with LLMs and prevent harmful behavior, complex prompts are crafted to shape the persona of the simulated character. Additionally, most LLMs undergo Human Preference Alignment (HPA), where they are fine-tuned to increase helpful, harmless outputs and reduce harmful or non-helpful content, as defined by human evaluators. However, because their outputs are sampled from a learned distribution rather than governed by explicit rules, LLMs are difficult to control, which reduces trust in their use: they can unpredictably deviate from the intended “script”. Such deviations may occur due to hallucinations or shifts in behavior caused by altered instructions. This issue is accentuated in long-form interaction, with empirical evidence and theoretical arguments showing that long contexts reduce controllability through the initial instructions.

This project aims to address the issue of consistency and controllability in LLM agents within the challenging context of long-form interactions. We propose a dual-pronged approach. Firstly, we will explore metrics to identify and quantify deviations from desired behavior, along with the necessary evaluation sets to measure these metrics effectively. Secondly, we will delve into mitigating such deviations through the development of improved control techniques. Our methods will be based on gaining a deeper understanding of the mechanisms underlying role-playing and jailbreaking through modern mechanistic interpretability techniques, and the analysis of interaction dynamics using a model-based approach. Two applications involving long-form interaction and of significant practical relevance—multi-turn task-oriented dialogues and the simulation of doctor-patient interactions with diverse personas—will inform the design of our methods and serve as testbeds for their evaluation.
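
To make the first prong concrete, here is one simple form a deviation metric could take, not necessarily one of the metrics the team will develop: embed the persona description and each generated turn with an off-the-shelf sentence encoder (the sentence-transformers library is assumed to be installed) and flag turns whose similarity to the persona falls below a tuned threshold.

```python
# Toy persona-drift detector: cosine similarity between each generated turn and
# the persona description. An illustrative baseline, not the project's metric.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

model = SentenceTransformer("all-MiniLM-L6-v2")

def persona_consistency(persona: str, turns: list[str], threshold: float = 0.3) -> list[bool]:
    """Return one flag per turn: True if the turn stays close to the persona."""
    vecs = model.encode([persona, *turns])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    persona_vec, turn_vecs = vecs[0], vecs[1:]
    return [float(persona_vec @ t) >= threshold for t in turn_vecs]

flags = persona_consistency(
    "A cautious customer-support agent for an airline.",
    ["Your flight can be rebooked free of charge.",
     "Let me tell you my favourite lasagna recipe."],
)
print(flags)   # the off-topic second turn should score lower
```

In practice the threshold would be calibrated on held-out dialogues; embedding similarity is only a crude proxy for the richer behavioral and interpretability-based measures the project targets.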

Advancing Expert-Level Reasoning and Understanding in Large Audio Language Models

To exhibit intelligence in the physical world, both AI agents and humans must comprehend and then reason about sound (including speech, non-speech sounds, and music). However, research in complex reasoning with audio has lagged behind modalities such as language and vision. This discrepancy is due to several challenges: the limited capabilities of current audio-understanding algorithms and architectures, the scarcity of large-scale training datasets, and the lack of comprehensive benchmarks for assessing advanced audio processing capabilities. The recent open-source MMAU benchmark has revealed that even state-of-the-art Large Audio Language Models (LALMs), including proprietary ones, achieve only 53% accuracy on complex audio reasoning tasks. This deficiency represents a crucial bottleneck in the development of multimodal AI systems and the progression toward AGI.

We are embarking on an intensive project to address critical limitations in foundational LALMs. Our workshop is focused on advancing expert-level understanding and complex reasoning in audio-language models. The team, drawn from several universities and industry partners in the US, Europe, and Asia, and comprising students and senior professionals from various disciplines, will allow us to achieve these goals.

End-to-End Multi-Channel Multi-Talker ASR (EMMA)

Our aim is to advance robust speech processing for everyday conversational scenarios, addressing some limitations in current state-of-the-art approaches.

Current speech foundation models such as Whisper cannot natively handle multi-talker, multi-channel conversational speech. Instead, they must be embedded in a complex pipeline that combines independently trained subsystems for diarization, source separation, and automatic speech recognition (ASR), which suffers from error propagation.

This project pursues two complementary directions: 1) developing a modular multi-channel ASR system using a streamlined pipeline of existing pre-trained components, including diarization and target-speaker ASR, and fine-tuning the whole pipeline jointly to avoid error propagation; and 2) building a novel, more computationally efficient “Whisper-style” foundation model for joint diarization and ASR with extended context handling.
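
As a schematic of direction 1), the sketch below chains a diarizer with a target-speaker ASR component. `Diarizer` and `TargetSpeakerASR` are hypothetical placeholders for existing pre-trained models, and the joint fine-tuning step is omitted.

```python
# Schematic of the modular pipeline: diarize first, then run target-speaker ASR
# on each speaker segment. Both components are dummy placeholders; in the project
# they would be pre-trained models fine-tuned jointly to limit error propagation.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds

class Diarizer:
    def __call__(self, audio) -> list[Segment]:
        # Placeholder: a real diarization model would return who spoke when.
        return [Segment("spk1", 0.0, 3.2), Segment("spk2", 2.9, 6.1)]

class TargetSpeakerASR:
    def __call__(self, audio, segment: Segment) -> str:
        # Placeholder: a real target-speaker ASR model would transcribe this segment.
        return f"<transcript of {segment.speaker} from {segment.start}s to {segment.end}s>"

def transcribe_meeting(audio, diarizer: Diarizer, asr: TargetSpeakerASR) -> list[tuple[str, str]]:
    """Streamlined pipeline: (speaker, transcript) for each diarized segment."""
    return [(seg.speaker, asr(audio, seg)) for seg in diarizer(audio)]

print(transcribe_meeting(audio=None, diarizer=Diarizer(), asr=TargetSpeakerASR()))
```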

Key research questions include the feasibility of fully end-to-end meeting transcription, how to handle multi-channel data effectively with single-channel pre-trained models, and how to integrate components, particularly diarization and ASR, in a differentiable way.

TTS4ALL: TTS in low-resource scenarios: data management, methodology, models, evaluation

The project aims to train and evaluate TTS systems effectively in situations with scarce training data and complex linguistic contexts.

We aim to set up effective data collection, preparation, and evaluation protocols adapted to the situation described above.

We will also explore effective strategies for training TTS models for spoken languages without a written form or for dialects without standardized writing systems. In addition, we will address the use of Self-Supervised Learning (SSL) for building TTS and investigate SSL layers to find where linguistic content and emotion are encoded.
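
One standard way to ask where a property is encoded, sketched below, is layer-wise probing: train a simple classifier on each layer's representations and compare accuracies across layers. The sketch assumes per-layer SSL features have already been extracted; the random features and labels are placeholders, not real data.

```python
# Layer-wise probing sketch: fit a linear probe on each SSL layer's features and see
# which layers best predict the property of interest (e.g., emotion). Features and
# labels here are random placeholders for real SSL-model activations and annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(hidden_states: np.ndarray, labels: np.ndarray) -> list[float]:
    """hidden_states: (n_layers, n_utterances, dim); labels: (n_utterances,).
    Returns mean cross-validated probe accuracy for each layer."""
    scores = []
    for layer_feats in hidden_states:
        probe = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(probe, layer_feats, labels, cv=3).mean())
    return scores

rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 60, 64))   # 12 layers, 60 utterances, 64-dim features
labels = rng.integers(0, 2, size=60)    # e.g., binary emotion labels
print(probe_layers(feats, labels))      # higher accuracy = property more decodable there
```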

Furthermore, we will benefit from our multidisciplinary and highly skilled team to build TTS for additional applications, including speech pseudonymization and streaming TTS. Speech pseudonymization is an area lacking existing resources and previous studies. It involves altering the linguistic content of recorded natural speech to protect the speaker’s identity while maintaining the intelligibility of the utterance. This could be particularly useful in scenarios where privacy is a concern, such as legal or child-protection contexts. Streaming TTS is also an emerging topic; it allows speech to be generated as symbolic inputs (text or discrete tokens) are provided. This could be particularly useful for integrating TTS with the output of a textual Large Language Model (LLM) or for simultaneous speech translation. Streaming TTS could enable real-time applications where immediate feedback is required, such as conversational agents or live broadcasting.
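
To illustrate the streaming idea, the toy sketch below emits audio chunk by chunk as text tokens arrive from an upstream LLM; `synthesize_chunk` is a hypothetical stand-in for an incremental TTS model.

```python
# Toy streaming-TTS loop: buffer incoming symbolic inputs (text tokens) and emit
# audio as soon as a small chunk is ready, instead of waiting for the full utterance.
from typing import Iterable, Iterator

def synthesize_chunk(text: str) -> bytes:
    # Hypothetical incremental synthesizer; a real model would return waveform samples.
    return f"<audio for: {text}>".encode()

def streaming_tts(token_stream: Iterable[str], chunk_size: int = 5) -> Iterator[bytes]:
    """Yield audio chunks as tokens arrive, e.g. from an LLM or a translation system."""
    buffer: list[str] = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield synthesize_chunk(" ".join(buffer))
            buffer.clear()
    if buffer:                              # flush the final partial chunk
        yield synthesize_chunk(" ".join(buffer))

# Simulated LLM output; in practice tokens would arrive incrementally.
for audio in streaming_tts("speech is generated while the text is still being produced".split()):
    print(len(audio), "bytes ready for playback")
```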
