AI Research Internships for Undergraduates

The Johns Hopkins University Center for Language and Speech Processing is organizing the Ninth Frederick Jelinek Memorial Summer Workshop from June 12 to August 5, 2023, hosted this year at the University of Le Mans, France. We are seeking outstanding members of the current junior class at US universities to join this residential research experience in human language technologies. Please complete this application no later than April 13, 2023.

The internship includes a comprehensive two-week summer school on human language technology (HLT), followed by six weeks of intensive research on select project topics.

The 8-week workshop provides an intense, dynamic intellectual environment.  Undergraduates work closely alongside senior researchers as part of a multi-university research team, which has been assembled for the summer to attack HLT problems of current interest.

Teams and Topics

The teams and topics for 2023 are:

  1. Better Together: Text + Context 
  2. Finite State Methods with Modern Neural Architectures for Speech Applications
  3. Automatic Design of Conversational Models from Human-to-Human Conversation
  4. Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions

We hope that this highly selective and stimulating experience will encourage students to pursue graduate study in HLT and AI, as it has been doing for many years.

The summer workshop provides:

  • An opportunity to explore an exciting new area of research
  • A two-week tutorial on current speech and language technology
  • Mentoring by experienced researchers
  • Participation in project planning activities
  • A $6,000 stipend and $2,800 towards meals and incidental expenses
  • Private furnished accommodation for the duration of the workshop
  • Travel expenses to and from the workshop venue

Applications should be received by Thursday, April 13, 2023. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Tuesday, April 18, 2023.

Questions may be directed to [email protected] 

Applicants are evaluated only on relevant skills, employment experience, past academic record, and the strength of letters of recommendation.  No limitation is placed on the undergraduate major.  Women and underrepresented minorities are encouraged to apply.

 APPLY HERE

The Application Process

The application process has three stages.

  1. Completion and submission of the application form by April 13, 2023.
  2. Submission of the applicant’s CV to [email protected] by April 13, 2023.
  3. A letter of recommendation from the applicant’s Faculty Nominator, whose contact information was provided in stage 1. The nominator will be asked to submit the letter electronically to [email protected] by April 18, 2023. Please note that the application will not be considered complete until it includes both the CV and the letter.

Feel free to contact the JSALT 2023 committee at [email protected] with any questions or concerns you may have.

Team Descriptions:

Better Together: Text + Context

It is standard practice to represent documents as embeddings, and we will do this in multiple ways: embeddings based on deep nets (e.g., BERT) capture the text, while embeddings based on node2vec and graph neural nets (GNNs) capture the citation graph. Embeddings encode each of N ≈ 200M documents as a vector of K ≈ 768 hidden dimensions, and the cosine of two vectors denotes the similarity of the corresponding documents. We will evaluate these embeddings and show that combinations of text and citations are better than either by itself on standard benchmarks of downstream tasks.

As deliverables, we will make embeddings available to the community so they can be used in a range of applications: ranked retrieval, recommender systems, and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP, and systems. Standard embeddings are time-invariant: the representation of a document does not change after it is published. But citation graphs evolve over time. The representation of a document should therefore combine time-invariant contributions from the authors with constantly evolving responses from the audience, as in social media.
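The sketch below illustrates, under simplifying assumptions, how text and citation views of a document might be combined and compared. It uses NumPy, random vectors in place of real BERT and node2vec embeddings, and a hypothetical mixing weight alpha; none of these choices are prescribed by the project.

    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def combine(text_emb, graph_emb, alpha=0.5):
        # Concatenate L2-normalized text and citation-graph embeddings,
        # weighting the two views with a (hypothetical) mixing coefficient alpha.
        t = text_emb / np.linalg.norm(text_emb)
        g = graph_emb / np.linalg.norm(graph_emb)
        return np.concatenate([alpha * t, (1 - alpha) * g])

    # Toy example: stand-ins for a BERT text embedding (768 dims) and a
    # node2vec citation embedding (128 dims, illustrative) for two documents.
    rng = np.random.default_rng(0)
    doc_a = combine(rng.normal(size=768), rng.normal(size=128))
    doc_b = combine(rng.normal(size=768), rng.normal(size=128))
    print(cosine(doc_a, doc_b))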

Finite State Methods with Modern Neural Architectures for Speech Applications

Many advanced technologies, such as voice search, assistant devices (e.g., Alexa, Cortana, Google Home, …), and spoken machine translation systems, use speech signals as input. These systems are built in one of two ways:

  • End-to-end: a single system (usually a deep neural network) is built with the speech signal as input and the target signal as final output (for example, spoken English as input and French text as output). While this approach greatly simplifies the overall design of the system, it comes with two significant drawbacks:
    • lack of modularity: no sub-component can be modified or reused in another system
    • large data requirements: supervised, task-specific data (input-output pairs) are hard to collect
  • Cascade: a separately built ASR system converts the speech signal into text, and the output text is then passed to another back-end system. This approach greatly improves the modularity of the individual components of the pipeline and drastically reduces the need for task-specific data. The main disadvantages are:
    • noisy ASR output: the downstream network is usually fed the 1-best hypothesis of the ASR system, which is prone to errors (no account of uncertainty)
    • separate optimization: each module is optimized separately, and joint training of the whole pipeline is almost impossible because we cannot differentiate through the ASR best path

In this project, we seek a speech representation interface that offers the advantages of both end-to-end and cascade systems without suffering from the drawbacks of either.
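As a rough illustration of the 1-best bottleneck described above (not the project's proposed interface), the PyTorch sketch below contrasts feeding a downstream model the hard argmax transcript with feeding it expected token embeddings under the ASR posterior; only the soft interface lets gradients flow back into the ASR front-end. All modules and dimensions here are made up for illustration.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 1000, 256

    # Hypothetical downstream module (e.g., a translation or understanding model)
    # that consumes token embeddings rather than raw text.
    token_embedding = nn.Embedding(vocab_size, embed_dim)
    downstream = nn.GRU(embed_dim, embed_dim, batch_first=True)

    # Suppose the ASR front-end emits per-frame scores over the vocabulary.
    asr_logits = torch.randn(1, 50, vocab_size, requires_grad=True)

    # Cascade with 1-best: argmax discards uncertainty and breaks the gradient
    # path back to the ASR model.
    one_best = asr_logits.argmax(dim=-1)
    hard_out, _ = downstream(token_embedding(one_best))

    # Soft interface: take the expected embedding under the ASR posterior, so
    # gradients from a downstream loss flow back into the ASR logits.
    posteriors = asr_logits.softmax(dim=-1)
    soft_inputs = posteriors @ token_embedding.weight   # shape (1, 50, embed_dim)
    soft_out, _ = downstream(soft_inputs)

    soft_out.sum().backward()
    print(asr_logits.grad is not None)   # True: jointly trainable end to end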

Automatic Design of Conversational Models from Human-to-Human Conversation

Currently used conversation models (or dialog models) are mostly hand-designed by data analysts as a conversation graph consisting of the system’s prompts and the user’s answers. More advanced conversation models [1, 2] are based on large language models fine-tuned on the dialog task and still require significant amounts of training data. These models produce surprisingly fluent outputs, but they are not trustworthy because of hallucination (which can produce unexpected and wrong answers), so their adoption in commerce is limited.

Our goal is to explore ways to design conversation models in the form of finite state graphs [1] semi-automatically or fully automatically from an unlabeled set of audio or textual training dialogs. Words, phrases, or user turns can be converted to embeddings using (large) language models trained specifically on conversational data [3, 4]. These embeddings represent points in a vector space and carry semantic information, and the conversations are trajectories in that space. By merging, pruning, and modeling the trajectories, we can obtain skeleton dialog models. These models could be used for fast exploration of data content, content visualization, topic detection and topic-based clustering, speech analysis, and, above all, for much faster and cheaper design of fully trustworthy conversation models for commercial dialog agents. The models can also target specific dialog strategies, such as the fastest way to reach a conversation goal (to provide useful information, sell a product, or entertain users for the longest time).

One promising approach to building a conversational model from data is presented in [4]. Variational Recurrent Neural Networks are trained to produce discrete embeddings with a categorical distribution, and the categories serve as conversation states. A transition probability matrix among states is then calculated, and low-probability transitions are pruned out to obtain a graph.
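A minimal sketch of that last step, assuming dialogs have already been mapped to sequences of discrete state ids (for instance, categories produced by a VRNN as in [4]): count state-to-state transitions, normalize them to probabilities, and prune low-probability edges to leave a sparse dialog graph. The state names and the pruning threshold are invented for illustration.

    from collections import defaultdict

    def build_dialog_graph(state_sequences, prune_threshold=0.05):
        # Estimate a transition-probability matrix over discrete dialog states
        # and prune low-probability edges to obtain a sparse dialog graph.
        counts = defaultdict(lambda: defaultdict(int))
        for seq in state_sequences:
            for src, dst in zip(seq, seq[1:]):
                counts[src][dst] += 1

        graph = {}
        for src, dsts in counts.items():
            total = sum(dsts.values())
            kept = {dst: n / total for dst, n in dsts.items()
                    if n / total >= prune_threshold}
            if kept:
                graph[src] = kept
        return graph

    # Toy example: each dialog is a sequence of discrete state ids
    # (in the project these would come from categorical VRNN embeddings).
    dialogs = [
        ["greet", "ask_account", "verify", "resolve", "bye"],
        ["greet", "ask_account", "resolve", "bye"],
        ["greet", "complaint", "verify", "resolve", "bye"],
    ]
    for src, dsts in build_dialog_graph(dialogs).items():
        print(src, "->", dsts)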

Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions

Speaker diarization aims at answering the question of “who speaks when” in a recording. It is a key task for many speech technologies such as automatic speech recognition (ASR), speaker identification, and dialog monitoring in different multi-speaker scenarios, including TV/radio, meetings, and medical conversations. In many domains, such as health or human-machine interaction, predicting speaker segments is not enough; additional para-linguistic information (age, gender, emotional state, speech pathology, etc.) must be included. However, most existing real-world applications are based on mono-modal modules trained separately, resulting in sub-optimal solutions. In addition, the current trend toward explainable AI is vital for transparency of decision-making with machine learning: the user (a doctor, a judge, or a human scientist) has to justify choices made on the basis of the system output.

This project aims at converting these outputs into interpretable clues (mispronounced phonemes, low speech rate, etc.) that explain the automatic diarization decisions. While the question of simultaneously performing speech recognition and speaker diarization was addressed at JSALT 2020, this proposal intends to develop a multi-task diarization system based on a joint latent representation of speaker and para-linguistic information. The latent representation embeds multiple modalities, such as acoustics, linguistics, and vision. This joint embedding space will be projected into a sparse and non-negative space in which all dimensions are interpretable by design. In the end, the diarization output will be a rich segmentation in which speech segments are characterized by multiple labels and by interpretable attributes derived from the latent space.
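As a rough sketch of what projecting into a sparse, non-negative space can look like (one possible realization, not the project's chosen method), the NumPy code below computes a non-negative sparse code for an embedding against a dictionary via projected gradient descent with an L1 penalty. Here the embedding and dictionary are random stand-ins; in practice the dictionary would be learned so that each dimension aligns with an interpretable attribute.

    import numpy as np

    def nonneg_sparse_code(x, D, l1=0.1, steps=500):
        # Find a sparse, non-negative code h such that D @ h approximates x,
        # using projected gradient descent on a least-squares + L1 objective.
        h = np.zeros(D.shape[1])
        lr = 1.0 / np.linalg.norm(D.T @ D, 2)    # step size from the Lipschitz constant
        for _ in range(steps):
            grad = D.T @ (D @ h - x) + l1        # gradient of 0.5*||D h - x||^2 + l1*sum(h)
            h = np.maximum(h - lr * grad, 0.0)   # gradient step, then project onto h >= 0
        return h

    rng = np.random.default_rng(0)
    embedding = rng.normal(size=64)              # stand-in for a joint speaker/para-linguistic vector
    dictionary = rng.normal(size=(64, 16))       # 16 nominally interpretable dimensions (illustrative)
    code = nonneg_sparse_code(embedding, dictionary)
    print(np.round(code, 2))                     # sparse, non-negative activations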
