The Johns Hopkins University Center for Language and Speech Processing is organizing the Ninth Frederick Jelinek Memorial Summer Workshop from June 12 to August 5, 2023, hosted this year at the University of Le Mans, France. We are seeking outstanding members of the current junior class at US universities to join this residential research experience in human language technologies. Please complete this application no later than April 13, 2023.
The internship includes a comprehensive 2-week summer school on human language technology (HLT), followed by 6 weeks of intensive research projects on selected topics.
The 8-week workshop provides an intense, dynamic intellectual environment. Undergraduates work closely alongside senior researchers as part of a multi-university research team, which has been assembled for the summer to attack HLT problems of current interest.
Teams and Topics
The teams and topics for 2023 are:
Better Together: Text + Context
Finite State Methods with Modern Neural Architectures for Speech Applications
Automatic Design of Conversational Models from Human-to-Human Conversation
Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions
We hope that this highly selective and stimulating experience will encourage students to pursue graduate study in HLT and AI, as it has for many years.
The summer workshop provides:
Applications should be received by Thursday, April 13, 2023. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Tuesday, April 18, 2023.
Questions may be directed to [email protected]
Applicants are evaluated only on relevant skills, employment experience, past academic record, and the strength of letters of recommendation. No limitation is placed on the undergraduate major. Women and underrepresented minorities are encouraged to apply.
The Application Process
The application process has three stages.
Feel free to contact the JSALT 2023 committee at [email protected] with any questions or concerns you may have.
Team Descriptions:
Better Together: Text + Context
It is standard practice to represent documents as embeddings, and we will do this in multiple ways: embeddings based on deep nets (e.g., BERT) capture the text, while embeddings based on node2vec and graph neural networks (GNNs) capture the citation graph. Embeddings encode each of N ≈ 200M documents as a vector of K ≈ 768 hidden dimensions, and the cosine of two vectors measures the similarity of the corresponding documents. We will evaluate these embeddings and show that combinations of text and citations are better than either by itself on standard benchmarks of downstream tasks.
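As a rough illustration of how a text embedding and a citation-graph embedding might be fused and compared, here is a minimal sketch; it is not the team's actual pipeline, and the concatenation strategy, array shapes, and random vectors are illustrative assumptions.

import numpy as np

K = 768  # hidden dimensions per embedding (as in the description above)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def combine(text_emb: np.ndarray, graph_emb: np.ndarray) -> np.ndarray:
    """One simple way to fuse a text embedding (e.g., from BERT) with a
    citation embedding (e.g., from node2vec): L2-normalize each view and
    concatenate them."""
    t = text_emb / np.linalg.norm(text_emb)
    g = graph_emb / np.linalg.norm(graph_emb)
    return np.concatenate([t, g])

# Two hypothetical documents, each with a text and a citation-graph embedding.
doc_a = combine(np.random.randn(K), np.random.randn(K))
doc_b = combine(np.random.randn(K), np.random.randn(K))
print(cosine(doc_a, doc_b))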
As deliverables, we will make embeddings available to the community for use in a range of applications: ranked retrieval, recommender systems, and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP, and systems. Standard embeddings are time invariant: the representation of a document does not change after it is published. But citation graphs evolve over time, so the representation of a document should combine time-invariant contributions from the authors with the constantly evolving responses of the audience, much as in social media.
Finite State Methods with Modern Neural Architectures for Speech Applications
Many advanced technologies, such as Voice Search, assistant devices (e.g., Alexa, Cortana, Google Home, …), or spoken machine translation systems, take speech signals as input. These systems are typically built in one of two ways: as a cascade, in which an automatic speech recognition (ASR) module produces a transcript that is then processed by downstream text-based components, or as an end-to-end model trained to map the speech signal directly to the target output.
In this project we seek a speech representation interface that combines the advantages of both the end-to-end and cascade approaches while avoiding their drawbacks.
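As a purely illustrative sketch of what such an interface could look like, the toy code below passes a small weighted lattice of competing ASR hypotheses, rather than a single 1-best string, to a downstream consumer. The data structures, scores, and example hypotheses are assumptions for illustration, not the project's actual design.

from dataclasses import dataclass

@dataclass
class Arc:
    src: int      # source state
    dst: int      # destination state
    word: str     # word hypothesis carried by this arc
    logp: float   # log-probability assigned by the ASR front end

@dataclass
class Lattice:
    start: int
    final: int
    arcs: list

def best_path(lat: Lattice) -> list:
    """Viterbi-style search for the highest-scoring word sequence,
    assuming states are topologically ordered by their integer ids."""
    score = {lat.start: (0.0, [])}
    for arc in sorted(lat.arcs, key=lambda a: a.src):
        if arc.src not in score:
            continue
        s, words = score[arc.src]
        cand = (s + arc.logp, words + [arc.word])
        if arc.dst not in score or cand[0] > score[arc.dst][0]:
            score[arc.dst] = cand
    return score[lat.final][1]

# A hypothetical lattice with two competing hypotheses for the same audio.
lat = Lattice(start=0, final=2, arcs=[
    Arc(0, 1, "recognize", -0.2), Arc(0, 1, "wreck a nice", -1.6),
    Arc(1, 2, "speech", -0.1), Arc(1, 2, "beach", -2.3),
])
print(best_path(lat))  # a downstream system could instead consume the whole lattice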
Automatic Design of Conversational Models from Human-to-Human Conversation
Currently used conversation models (or dialog models) are mostly hand-designed by data analysts as a conversation graph consisting of the system's prompts and the user's answers. More advanced conversation models [1, 2] are based on large language models fine-tuned on the dialog task and still require significant amounts of training data. These models produce surprisingly fluent outputs, but they cannot be fully trusted because of hallucination (which can produce unexpected and wrong answers), so their adoption in commerce is limited.
Our goal is to explore ways to design conversation models in the form of finite state graphs [1], semi-automatically or fully automatically, from an unlabeled set of audio or textual training dialogs. Words, phrases, or user turns can be converted to embeddings using (large) language models trained specifically on conversational data [3, 4]. These embeddings represent points in a vector space and carry semantic information, so each conversation traces a trajectory in that space. By merging, pruning, and modeling the trajectories, we can obtain skeleton dialog models. These models could be used for fast exploration of data content, content visualization, topic detection, topic-based clustering, and speech analysis, and above all for much faster and cheaper design of fully trustable conversation models for commercial dialog agents. The models can also target specific dialog strategies, such as the fastest way to reach a conversation goal (to provide useful information, sell a product, or entertain users for the longest time).

One promising approach to building a conversational model from data is presented in [4]: Variational Recurrent Neural Networks are trained to produce discrete embeddings with a categorical distribution, and the categories serve as conversation states. A transition probability matrix among states is then estimated, and low-probability transitions are pruned out to obtain a graph.
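To make the pipeline concrete, here is a simplified sketch under strong assumptions: plain k-means over turn embeddings stands in for the VRNN's categorical latent states described above, and the cluster count and pruning threshold are illustrative values, not the project's settings.

import numpy as np
from sklearn.cluster import KMeans

def dialog_skeleton(turn_embeddings, turn_dialog_ids, n_states=8, min_prob=0.05):
    """turn_embeddings: (num_turns, dim) array of turn embeddings.
    turn_dialog_ids: dialog id for each turn, so transitions never cross dialogs."""
    # Discretize turns into "conversation states" (stand-in for categorical latents).
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(turn_embeddings)

    # Count transitions between consecutive turns of the same dialog.
    counts = np.zeros((n_states, n_states))
    for i in range(len(states) - 1):
        if turn_dialog_ids[i] == turn_dialog_ids[i + 1]:
            counts[states[i], states[i + 1]] += 1

    # Row-normalize to transition probabilities and prune rare transitions.
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    edges = [(s, t, probs[s, t])
             for s in range(n_states) for t in range(n_states)
             if probs[s, t] >= min_prob]
    return states, edges  # states per turn, plus the pruned dialog-graph skeleton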
Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions
Speaker diarization aims to answer the question of "who speaks when" in a recording. It is a key task for many speech technologies, such as automatic speech recognition (ASR), speaker identification, and dialog monitoring, in a range of multi-speaker scenarios including TV/radio, meetings, and medical conversations. In many domains, such as health or human-machine interaction, predicting speaker segments is not enough: it is also necessary to include additional para-linguistic information (age, gender, emotional state, speech pathology, etc.). However, most existing real-world applications are based on mono-modal modules trained separately, which results in sub-optimal solutions. In addition, the current push for explainable AI is vital for transparency in machine-learning decision making: the user (a doctor, a judge, or a human scientist) has to be able to justify choices made on the basis of the system output.
This project aims at converting these outputs into interpretable clues (mispronounced phonemes, low speech rate, etc.) that explain the automatic diarization. While joint speech recognition and speaker diarization was addressed at JSALT 2020, this proposal intends to develop a multi-task diarization system based on a joint latent representation of speaker and para-linguistic information. The latent representation embeds multiple modalities, such as acoustics, linguistics, or vision. This joint embedding space will be projected into a sparse, non-negative space in which every dimension is interpretable by design. In the end, the diarization output will be a rich segmentation in which speech segments carry multiple labels and interpretable attributes derived from the latent space.
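One plausible way to realize such a projection, given here only as a hedged sketch (the project may use a different, jointly trained model), is a non-negative matrix factorization over rectified embeddings, which yields sparse, non-negative activations that can be inspected dimension by dimension. The function and parameter names are hypothetical.

import numpy as np
from sklearn.decomposition import NMF

def interpretable_projection(embeddings: np.ndarray, n_components: int = 16):
    """embeddings: (num_segments, dim) joint latent vectors for speech segments."""
    # NMF needs a non-negative input, so split each vector into its positive
    # and negative parts, [x+, x-], doubling the dimensionality.
    x = np.concatenate([np.maximum(embeddings, 0), np.maximum(-embeddings, 0)], axis=1)
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    codes = model.fit_transform(x)    # sparse, non-negative activations per segment
    return codes, model.components_   # each component is a candidate interpretable axis

# Each segment in a diarization output could then be tagged with the components
# it activates most strongly (e.g., a dimension correlated with low speech rate).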