BEGIN:VCALENDAR VERSION:2.0 PRODID:-//128.220.36.25//NONSGML kigkonsult.se iCalcreator 2.26.9// CALSCALE:GREGORIAN METHOD:PUBLISH X-FROM-URL:https://www.clsp.jhu.edu X-WR-TIMEZONE:America/New_York BEGIN:VTIMEZONE TZID:America/New_York X-LIC-LOCATION:America/New_York BEGIN:STANDARD DTSTART:20231105T020000 TZOFFSETFROM:-0400 TZOFFSETTO:-0500 RDATE:20241103T020000 TZNAME:EST END:STANDARD BEGIN:DAYLIGHT DTSTART:20240310T020000 TZOFFSETFROM:-0500 TZOFFSETTO:-0400 RDATE:20250309T020000 TZNAME:EDT END:DAYLIGHT END:VTIMEZONE BEGIN:VEVENT UID:ai1ec-21031@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nMost people take for granted that when they speak\, they will be heard and understood. But for the millions who live with speech impairments caused by physical or neurological conditions\, trying to communicate with others can be difficult and lead to frustration. While there have been a great number of recent advances in Automatic Speech Recognition (ASR) technologies\, these interfaces can be inaccessible for those with speech impairments.\nIn this talk\, we will present Parrotron\, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram\, without utilizing any intermediate discrete representation. The system is also trained to emit words in addition to a spectrogram\, in parallel. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent\, prosody\, and background noise\, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from speakers with a variety of speech impairments (due to ALS\, Cerebral-Palsy\, Deafness\, Stroke\, Brain Injury\, etc.)
\, resulting in significant improvements in intelligibility and naturalness\, measured via a speech recognizer and listening tests. Finally\, demonstrating the utility of this model on other speech tasks\, we show that the same model architecture can be trained to perform a speech separation task.\nDimitri will give a brief description of some key moments in the development of speech recognition algorithms that he was involved in and their applications to YouTube closed captions\, Live Transcribe and wearable subtitles.\nFadi will then speak about the development of Parrotron.\nBiographies\nDimitri Kanevsky started his career at Google working on speech recognition algorithms. Prior to joining Google\, Dimitri was a Research staff member in the Speech Algorithms Department at IBM. Prior to IBM\, he worked at a number of centers for higher mathematics\, including the Max Planck Institute in Germany and the Institute for Advanced Study in Princeton. He currently holds 295 US patents and was Master Inventor at IBM. MIT Technology Review recognized Dimitri's conversational-biometrics-based security patent as one of the five most influential patents for 2003. In 2012 Dimitri was honored at the White House as a Champion of Change for his efforts to advance access to science\, technology\, engineering\, and math.\nFadi Biadsy has been a senior staff research scientist at Google NY for the past ten years. He has been exploring and leading multiple projects at Google\, including speech recognition\, speech conversion\, language modeling\, and semantic understanding. He received his PhD from Columbia University in 2011. At Columbia\, he researched a variety of speech and language processing projects including dialect and accent recognition\, speech recognition\, charismatic speech and question answering. He holds a BSc and MSc in mathematics and computer science.
He worked on handwriting recognition during his master's degree and he worked as a senior software developer for five years at Dalet digital media systems building multimedia broadcasting systems. DTSTART;TZID=America/New_York:20211105T120000 DTEND;TZID=America/New_York:20211105T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Fadi Biadsy and Dimitri Kanevsky (Google) “Speech Recognition: From Speaker Dependent to Speaker Independent to Full Personalization” “Parrotron: A Unified E2E Speech-to-Speech Conversion and ASR Model for Atypical Speech” URL:https://www.clsp.jhu.edu/events/fadi-biadsy-and-dimitri-kanevsky-google/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n
\\nAbstract
\nMost people take for granted that when they speak\, they will be heard and understood. But for the millions who live with speech impairments caused by physical or neurological conditions\, trying to communicate with others can be difficult and lead to frustration. While there have been a great number of recent advances in Automatic Speech Recognition (ASR) technologies\, these interfaces can be inaccessible for those with speech impairments.
\nIn this talk\, we will present Parrotron\, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram\, without utilizing any intermediate discrete representation. The system is also trained to emit words in addition to a spectrogram\, in parallel. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent\, prosody\, and background noise\, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from speakers with a variety of speech impairments (due to ALS\, Cerebral-Palsy\, Deafness\, Stroke\, Brain Injury\, etc.)\, resulting in significant improvements in intelligibility and naturalness\, measured via a speech recognizer and listening tests. Finally\, demonstrating the utility of this model on other speech tasks\, we show that the same model architecture can be trained to perform a speech separation task.
\nDimitri will give a brief description of some key moments in the development of speech recognition algorithms that he was involved in and their applications to YouTube closed captions\, Live Transcribe and wearable subtitles.
\nFadi will then speak about the development of Parrotron.
\nBiographies
\nDimitri Kanevsky started his career at Google working on speech recognition algorithms. Prior to joining Google\, Dimitri was a Research staff member in the Speech Algorithms Department at IBM. Prior to IBM\, he worked at a number of centers for higher mathematics\, including the Max Planck Institute in Germany and the Institute for Advanced Study in Princeton. He currently holds 295 US patents and was Master Inventor at IBM. MIT Technology Review recognized Dimitri's conversational-biometrics-based security patent as one of the five most influential patents for 2003. In 2012 Dimitri was honored at the White House as a Champion of Change for his efforts to advance access to science\, technology\, engineering\, and math.
\nFadi Biadsy has been a senior staff research scientist at Google NY for the past ten years. He has been exploring and leading multiple projects at Google\, including speech recognition\, speech conversion\, language modeling\, and semantic understanding. He received his PhD from Columbia University in 2011. At Columbia\, he researched a variety of speech and language processing projects including dialect and accent recognition\, speech recognition\, charismatic speech and question answering. He holds a BSc and MSc in mathematics and computer science. He worked on handwriting recognition during his master's degree and he worked as a senior software developer for five years at Dalet digital media systems building multimedia broadcasting systems.
\n X-TAGS;LANGUAGE=en-US:2021\,Biadsy and Kanevsky\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-21041@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nNarration is a universal human practice that serves as a key site of education\, collective memory\, fostering social belief systems\, and furthering human creativity. Recent studies in economics (Shiller\, 2020)\, climate science (Bushell et al.\, 2017)\, political polarization (Kubin et al.\, 2021)\, and mental health (Adler et al.\, 2016) suggest an emerging interdisciplinary consensus that narrative is a central concept for understanding human behavior and beliefs. For close to half a century\, the field of narratology has developed a rich set of theoretical frameworks for understanding narrative. And yet these theories have largely gone untested on large\, heterogeneous collections of texts. Scholars continue to generate schemas by extrapolating from small numbers of manually observed documents. In this talk\, I will discuss how we can use machine learning to develop data-driven theories of narration to better understand what Labov and Waletzky called “the simplest and most fundamental narrative structures.” How can machine learning help us approach what we might call a minimal theory of narrativity?\nBiography\nAndrew Piper is Professor and William Dawson Scholar in the Department of Languages\, Literatures\, and Cultures at McGill University. He is the director of _.txtlab_\, a laboratory for cultural analytics\, and editor of the /Journal of Cultural Analytics/\, an open-access journal dedicated to the computational study of culture. He is the author of numerous books and articles on the relationship of technology and reading\, including /Book Was There: Reading in Electronic Times/ (Chicago 2012)\, /Enumerations: Data and Literary Study/ (Chicago 2018)\, and most recently\, /Can We Be Wrong?
The Problem of Textual Evidence in a Time of Data/ (Cambridge 2020). DTSTART;TZID=America/New_York:20211112T120000 DTEND;TZID=America/New_York:20211112T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Andrew Piper (McGill University) “How can we use machine learning to understand narration?” URL:https://www.clsp.jhu.edu/events/andrew-piper-mcgill-university-how-can-we-use-machine-learning-to-understand-narration/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nNarration is a universal human practice that serves as a key site of education\, collective memory\, fostering social belief systems\, and furthering human creativity. Recent studies in economics (Shiller\, 2020)\, climate science (Bushell et al.\, 2017)\, political polarization (Kubin et al.\, 2021)\, and mental health (Adler et al.\, 2016) suggest an emerging interdisciplinary consensus that narrative is a central concept for understanding human behavior and beliefs. For close to half a century\, the field of narratology has developed a rich set of theoretical frameworks for understanding narrative. And yet these theories have largely gone untested on large\, heterogeneous collections of texts. Scholars continue to generate schemas by extrapolating from small numbers of manually observed documents. In this talk\, I will discuss how we can use machine learning to develop data-driven theories of narration to better understand what Labov and Waletzky called “the simplest and most fundamental narrative structures.” How can machine learning help us approach what we might call a minimal theory of narrativity?
\nBiography
\nAndrew Piper is Professor and William Dawson Scholar in the Department of Languages\, Literatures\, and Cultures at McGill University. He is the director of _.txtlab_\, a laboratory for cultural analytics\, and editor of the /Journal of Cultural Analytics/\, an open-access journal dedicated to the computational study of culture. He is the author of numerous books and articles on the relationship of technology and reading\, including /Book Was There: Reading in Electronic Times/ (Chicago 2012)\, /Enumerations: Data and Literary Study/ (Chicago 2018)\, and most recently\, /Can We Be Wrong? The Problem of Textual Evidence in a Time of Data/ (Cambridge 2020).
\n X-TAGS;LANGUAGE=en-US:2021\,November\,Piper END:VEVENT BEGIN:VEVENT UID:ai1ec-21057@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nThis talk will outline the major challenges in porting mainstream speech technology to the domain of clinical applications\; in particular\, the need for personalised systems\, the challenge of working in an inherently sparse data domain and developing meaningful collaborations with all stakeholders. The talk will give an overview of recent state-of-the-art research from current projects\, including in the areas of recognition of disordered speech\, automatic processing of conversations and the automatic detection and tracking of paralinguistic information at the University of Sheffield (UK)’s Speech and Hearing (SPandH) & Healthcare lab.\nBiography\nHeidi is a Senior Lecturer (associate professor) in Computer Science at the University of Sheffield\, United Kingdom. Her research interests are in the application of AI-based voice technologies to healthcare\, in particular the detection and monitoring of people’s physical and mental health including verbal and non-verbal traits for expressions of emotion\, anxiety\, depression and neurodegenerative conditions in e.g.\, therapeutic or diagnostic settings. DTSTART;TZID=America/New_York:20211119T120000 DTEND;TZID=America/New_York:20211119T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Heidi Christensen (University of Sheffield\, UK) Virtual Seminar “Automated Processing of Pathological Speech: Recent Work and Ongoing Challenges” URL:https://www.clsp.jhu.edu/events/heidi-christensen-university-of-sheffield-uk-virtual-seminar-automated-processing-of-pathological-speech-recent-work-and-ongoing-challenges/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nThis talk will outline the major challenges in porting mainstream speech technology to the domain of clinical applications\; in particular\, the need for personalised systems\, the challenge of working in an inherently sparse data domain and developing meaningful collaborations with all stakeholders. The talk will give an overview of recent state-of-the-art research from current projects\, including in the areas of recognition of disordered speech\, automatic processing of conversations and the automatic detection and tracking of paralinguistic information at the University of Sheffield (UK)’s Speech and Hearing (SPandH) & Healthcare lab.
\nBiography
\nHeidi is a Senior Lecturer (associate professor) in Computer Science at the University of Sheffield\, United Kingdom. Her research interests are in the application of AI-based voice technologies to healthcare\, in particular the detection and monitoring of people’s physical and mental health including verbal and non-verbal traits for expressions of emotion\, anxiety\, depression and neurodegenerative conditions in e.g.\, therapeutic or diagnostic settings.
\n X-TAGS;LANGUAGE=en-US:2021\,Christensen\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-21259@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nNatural language processing has been revolutionized by neural networks\, which perform impressively well in applications such as machine translation and question answering. Despite their success\, neural networks still have some substantial shortcomings: Their internal workings are poorly understood\, and they are notoriously brittle\, failing on example types that are rare in their training data. In this talk\, I will use the unifying thread of hierarchical syntactic structure to discuss approaches for addressing these shortcomings. First\, I will argue for a new evaluation paradigm based on targeted\, hypothesis-driven tests that better illuminate what models have learned\; using this paradigm\, I will show that even state-of-the-art models sometimes fail to recognize the hierarchical structure of language (e.g.\, to conclude that “The book on the table is blue” implies “The table is blue.”) Second\, I will show how these behavioral failings can be explained through analysis of models’ inductive biases and internal representations\, focusing on the puzzle of how neural networks represent discrete symbolic structure in continuous vector space. I will close by showing how insights from these analyses can be used to make models more robust through approaches based on meta-learning\, structured architectures\, and data augmentation.\nBiography\nTom McCoy is a PhD candidate in the Department of Cognitive Science at Johns Hopkins University. As an undergraduate\, he studied computational linguistics at Yale. His research combines natural language processing\, cognitive science\, and machine learning to study how we can achieve robust generalization in models of language\, as this remains one of the main areas where current AI systems fall short.
In particular\, he focuses on inductive biases and representations of linguistic structure\, since these are two of the major components that determine how learners generalize to novel types of input. DTSTART;TZID=America/New_York:20220131T120000 DTEND;TZID=America/New_York:20220131T131500 LOCATION:Ames Hall 234 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Tom McCoy (Johns Hopkins University) “Opening the Black Box of Deep Learning: Representations\, Inductive Biases\, and Robustness” URL:https://www.clsp.jhu.edu/events/tom-mccoy-johns-hopkins-university-opening-the-black-box-of-deep-learning-representations-inductive-biases-and-robustness/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nNatural language processing has been revolutionized by neural networks\, which perform impressively well in applications such as machine translation and question answering. Despite their success\, neural networks still have some substantial shortcomings: Their internal workings are poorly understood\, and they are notoriously brittle\, failing on example types that are rare in their training data. In this talk\, I will use the unifying thread of hierarchical syntactic structure to discuss approaches for addressing these shortcomings. First\, I will argue for a new evaluation paradigm based on targeted\, hypothesis-driven tests that better illuminate what models have learned\; using this paradigm\, I will show that even state-of-the-art models sometimes fail to recognize the hierarchical structure of language (e.g.\, to conclude that “The book on the table is blue” implies “The table is blue.”) Second\, I will show how these behavioral failings can be explained through analysis of models’ inductive biases and internal representations\, focusing on the puzzle of how neural networks represent discrete symbolic structure in continuous vector space. I will close by showing how insights from these analyses can be used to make models more robust through approaches based on meta-learning\, structured architectures\, and data augmentation.
\nBiography
\nTom McCoy is a PhD candidate in the Department of Cognitive Science at Johns Hopkins University. As an undergraduate\, he studied computational linguistics at Yale. His research combines natural language processing\, cognitive science\, and machine learning to study how we can achieve robust generalization in models of language\, as this remains one of the main areas where current AI systems fall short. In particular\, he focuses on inductive biases and representations of linguistic structure\, since these are two of the major components that determine how learners generalize to novel types of input.
\n X-TAGS;LANGUAGE=en-US:2022\,January\,McCoy END:VEVENT BEGIN:VEVENT UID:ai1ec-22403@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nVoice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one’s voice to sound like that of another without changing the linguistic content. Voice conversion belongs to a general technical field of speech synthesis\, which converts text to speech or changes the properties of speech\, for example\, voice identity\, emotion\, and accents. Voice conversion involves multiple speech processing techniques\, such as speech analysis\, spectral conversion\, prosody conversion\, speaker characterization\, and vocoding. With the recent advances in theory and practice\, we are now able to produce human-like voice quality with high speaker similarity. In this talk\, Dr. Sisman will present the recent advances in voice conversion and discuss their promise and limitations. Dr. Sisman will also provide a summary of the available resources for expressive voice conversion research.\nBiography\nDr. Berrak Sisman (Member\, IEEE) received the Ph.D. degree in electrical and computer engineering from National University of Singapore in 2020\, fully funded by A*STAR Graduate Academy under Singapore International Graduate Award (SINGA). She is currently working as a tenure-track Assistant Professor at the Erik Jonsson School Department of Electrical and Computer Engineering at University of Texas at Dallas\, United States. Prior to joining UT Dallas\, she was a faculty member at Singapore University of Technology and Design (2020-2022). She was a Postdoctoral Research Fellow at the National University of Singapore (2019-2020). She was an exchange doctoral student at the University of Edinburgh and a visiting scholar at The Centre for Speech Technology Research (CSTR)\, University of Edinburgh (2019).
She was a visiting researcher at RIKEN Advanced Intelligence Project in Japan (2018). Her research is focused on machine learning\, signal processing\, emotion\, speech synthesis and voice conversion.\nDr. Sisman has served as the Area Chair at INTERSPEECH 2021\, INTERSPEECH 2022\, IEEE SLT 2022 and as the Publication Chair at ICASSP 2022. She has been elected as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term from January 2022 to December 2024. She plays leadership roles in conference organizations and is active in technical committees. She has served as the General Coordinator of the Student Advisory Committee (SAC) of International Speech Communication Association (ISCA). DTSTART;TZID=America/New_York:20221104T120000 DTEND;TZID=America/New_York:20221104T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Berrak Sisman (University of Texas at Dallas) “Speech Synthesis and Voice Conversion: Machine Learning can Mimic Anyone’s Voice” URL:https://www.clsp.jhu.edu/events/berrak-sisman-university-of-texas-at-dallas/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nVoice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one’s voice to sound like that of another without changing the linguistic content. Voice conversion belongs to a general technical field of speech synthesis\, which converts text to speech or changes the properties of speech\, for example\, voice identity\, emotion\, and accents. Voice conversion involves multiple speech processing techniques\, such as speech analysis\, spectral conversion\, prosody conversion\, speaker characterization\, and vocoding. With the recent advances in theory and practice\, we are now able to produce human-like voice quality with high speaker similarity. In this talk\, Dr. Sisman will present the recent advances in voice conversion and discuss their promise and limitations. Dr. Sisman will also provide a summary of the available resources for expressive voice conversion research.
\nDr. Berrak Sisman (Member\, IEEE) received the Ph.D. degree in electrical and computer engineering from National University of Singapore in 2020\, fully funded by A*STAR Graduate Academy under Singapore International Graduate Award (SINGA). She is currently working as a tenure-track Assistant Professor at the Erik Jonsson School Department of Electrical and Computer Engineering at University of Texas at Dallas\, United States. Prior to joining UT Dallas\, she was a faculty member at Singapore University of Technology and Design (2020-2022). She was a Postdoctoral Research Fellow at the National University of Singapore (2019-2020). She was an exchange doctoral student at the University of Edinburgh and a visiting scholar at The Centre for Speech Technology Research (CSTR)\, University of Edinburgh (2019). She was a visiting researcher at RIKEN Advanced Intelligence Project in Japan (2018). Her research is focused on machine learning\, signal processing\, emotion\, speech synthesis and voice conversion.
\nDr. Sisman has served as the Area Chair at INTERSPEECH 2021\, INTERSPEECH 2022\, IEEE SLT 2022 and as the Publication Chair at ICASSP 2022. She has been elected as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term from January 2022 to December 2024. She plays leadership roles in conference organizations and is active in technical committees. She has served as the General Coordinator of the Student Advisory Committee (SAC) of International Speech Communication Association (ISCA).
\n X-TAGS;LANGUAGE=en-US:2022\,November\,Sisman END:VEVENT BEGIN:VEVENT UID:ai1ec-22408@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nAI-powered applications increasingly adopt Deep Neural Networks (DNNs) for solving many prediction tasks\, leading to more than one DNN running on resource-constrained devices. Supporting many models simultaneously on a device is challenging due to the linearly increased computation\, energy\, and storage costs. An effective approach to address the problem is multi-task learning (MTL) where a set of tasks are learned jointly to allow some parameter sharing among tasks. MTL creates multi-task models based on common DNN architectures and has shown significantly reduced inference costs and improved generalization performance in many machine learning applications. In this talk\, we will introduce our recent efforts on leveraging MTL to improve accuracy and efficiency for edge computing. The talk will introduce multi-task architecture design systems that can automatically identify resource-efficient multi-task models with low inference costs and high task accuracy.\n\nBiography\n\n\nHui Guan is an Assistant Professor in the College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst\, the flagship campus of the UMass system. She received her Ph.D. in Electrical Engineering from North Carolina State University in 2020. Her research lies at the intersection of machine learning and systems\, with an emphasis on improving the speed\, scalability\, and reliability of machine learning through innovations in algorithms and programming systems. Her current research focuses on both algorithm and system optimizations of deep multi-task learning and graph machine learning. DTSTART;TZID=America/New_York:20221111T120000 DTEND;TZID=America/New_York:20221111T131500 LOCATION:Hackerman Hall B17 @ 3400 N.
Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Hui Guan (University of Massachusetts Amherst) “Towards Accurate and Efficient Edge Computing Via Multi-Task Learning” URL:https://www.clsp.jhu.edu/events/hui-guan-university-of-massachusetts-amherst/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nAbstract
\nDriven by the goal of eradicating language barriers on a global scale\, machine translation has solidified itself as a key focus of artificial intelligence research today. However\, such efforts have coalesced around a small subset of languages\, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe\, high-quality results\, all while keeping ethical considerations in mind? In this talk\, I introduce No Language Left Behind\, an initiative to break language barriers for low-resource languages. In No Language Left Behind\, we took on the low-resource language translation challenge by first contextualizing the need for translation support through exploratory interviews with native speakers. Then\, we created datasets and models aimed at narrowing the performance gap between low- and high-resource languages. We proposed multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically\, we evaluated the performance of over 40\,000 different translation directions using a human-translated benchmark\, Flores-200\, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art\, laying important groundwork towards realizing a universal translation system in an open-source manner.
\nBiography
\nAngela is a research scientist at Meta AI Research in New York\, focusing on supporting efforts in speech and language research. Recent projects include No Language Left Behind (https://ai.facebook.com/research/no-language-left-behind/) and Universal Speech Translation for Unwritten Languages (https://ai.facebook.com/blog/ai-translation-hokkien/). Before translation\, Angela focused on research in on-device models for NLP and computer vision and text generation.
\n\n X-TAGS;LANGUAGE=en-US:2022\,Fan\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-23302@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION: DTSTART;TZID=America/New_York:20230130T120000 DTEND;TZID=America/New_York:20230130T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Daniel Fried (CMU) URL:https://www.clsp.jhu.edu/events/daniel-fried-cmu/ X-COST-TYPE:free X-TAGS;LANGUAGE=en-US:2023\,Fried\,January END:VEVENT BEGIN:VEVENT UID:ai1ec-23910@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nEffective communication lies at the heart of social harmony and individual well-being. However\, key areas of our society face profound challenges in how we talk about things\, or to each other. In this talk\, I will show how these challenges manifest: from the manner in which TV reporters discuss current events to online health discussions in banned Reddit communities\, and interactions between law enforcement and communities of color during routine car stops. My research applies theories from linguistics and psychology to analyze patterns in such dialogue using large language models (LLMs)\, statistics\, and experimental design. In this presentation\, I will introduce three research studies that highlight how specific patterns in our language choices are predictive of real-world outcomes. First\, I will illustrate how partisan divides in the language of America’s two major broadcasting news stations over the past decade directly correlate with semantic polarity trends on Twitter\, empirically linking for the first time how online discussions are influenced by televised media. Second\, I will show how “gists” or causal statements in social media discussions about pandemic health practices unveil underlying beliefs and attitudes\, which in turn\, can forecast broader health trends across the U.S.
Finally\, by examining the linguistic interactions captured from thousands of recordings from police body-worn cameras\, I demonstrate how the first 45 words spoken by a police officer during a car stop with a Black driver can be quite telling about how the stop will conclude. Persistent challenges in dialogue marked by tensions and biases can have wide-ranging implications for both individuals and society. These studies call for a broader awareness of the influence of our language choices across institutional\, media\, and online contexts.\n\nBio\n\n\nEugenia Rho is an Assistant Professor of Computer Science at Virginia Tech\, where she leads the SAIL (Society + AI & Language) Lab. Her research lies at the intersection of Natural Language Processing (NLP) and Human-Computer Interaction (HCI). Her work aims to advance Computational Social Science (CSS) by using computational linguistics to better understand how AI-mediated systems impact interactions across people and machines. DTSTART;TZID=America/New_York:20231103T120000 DTEND;TZID=America/New_York:20231103T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Eugenia Rho (Virginia Tech) “Words Matter: How Language Choices Predict Societal Trends and Outcomes in Media\, Health and Policing” URL:https://www.clsp.jhu.edu/events/eugenia-rho-virginia-tech/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n
\\nAbstract
\nAbstract
\nMultilingual machine translation has proven immensely useful for both parameter efficiency and overall performance for many language pairs via complete parameter sharing. However\, some language pairs in multilingual models can see worse performance than in bilingual models\, especially in the one-to-many translation setting. Motivated by their empirical differences\, we examine the geometric differences in representations from bilingual models versus those from one-to-many multilingual models. Specifically\, we measure the isotropy of these representations using intrinsic dimensionality and IsoScore\, in order to measure how these representations utilize the dimensions in their underlying vector space. We find that for a given language pair\, its multilingual model decoder representations are consistently less isotropic than comparable bilingual model decoder representations. Additionally\, we show that much of this anisotropy in multilingual decoder representations can be attributed to modeling language-specific information\, therefore limiting remaining representational capacity.
\n X-TAGS;LANGUAGE=en-US:2023\,November\,Verma END:VEVENT BEGIN:VEVENT UID:ai1ec-24157@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nIn this talk\, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE\, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio\, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens\, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder\, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically\, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks\, outperforming other recent models that use external supervised pre-training.\nBio\nFlorian Metze is a Research Scientist Manager at Meta AI in New York\, supporting a team of researchers and engineers working on multi-modal (image\, video\, audio\, text) content understanding for Meta’s Family of Apps (Instagram\, Threads\, Facebook\, WhatsApp). He used to be an Associate Research Professor at Carnegie Mellon University\, in the School of Computer Science’s Language Technologies Institute\, where he still is an Adjunct Professor. He is also a co-founder of Abridge\, a company working on extracting information from doctor patient conversations. His work covers many areas of speech recognition and multi-media analysis with a focus on end-to-end deep learning. Currently\, he focuses on multi-modal processing of videos\, and using that information to recommend unconnected content. 
In the past\, he has worked on low resource and multi-lingual speech processing\, speech recognition with articulatory features\, large-scale multi-media retrieval and summarization\, information extraction from medical interviews\, and recognition of personality or similar meta-data from speech.\nFor more information\, please see http://www.cs.cmu.edu/directory/fmetze\n DTSTART;TZID=America/New_York:20231110T120000 DTEND;TZID=America/New_York:20231110T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Florian Metze (CMU) “Masked Autoencoders that Listen” URL:https://www.clsp.jhu.edu/events/florian-metze-cmu/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nIn this talk\, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE\, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio\, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens\, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder\, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically\, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks\, outperforming other recent models that use external supervised pre-training.
\nBio
\nFlorian Metze is a Research Scientist Manager at Meta AI in New York\, supporting a team of researchers and engineers working on multi-modal (image\, video\, audio\, text) content understanding for Meta’s Family of Apps (Instagram\, Threads\, Facebook\, WhatsApp). He used to be an Associate Research Professor at Carnegie Mellon University\, in the School of Computer Science’s Language Technologies Institute\, where he still is an Adjunct Professor. He is also a co-founder of Abridge\, a company working on extracting information from doctor patient conversations. His work covers many areas of speech recognition and multi-media analysis with a focus on end-to-end deep learning. Currently\, he focuses on multi-modal processing of videos\, and using that information to recommend unconnected content. In the past\, he has worked on low resource and multi-lingual speech processing\, speech recognition with articulatory features\, large-scale multi-media retrieval and summarization\, information extraction from medical interviews\, and recognition of personality or similar meta-data from speech.\n
For more information\, please see http://www.cs.cmu.edu/directory/fmetze
\n\n X-TAGS;LANGUAGE=en-US:2023\,Metze\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-24159@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Student Seminars CONTACT: DESCRIPTION: DTSTART;TZID=America/New_York:20231113T120000 DTEND;TZID=America/New_York:20231113T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Student Seminar – Kate Sanders URL:https://www.clsp.jhu.edu/events/student-seminar-kate-sanders/ X-COST-TYPE:free X-TAGS;LANGUAGE=en-US:2023\,November\,Sanders END:VEVENT BEGIN:VEVENT UID:ai1ec-24163@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nThe almost unlimited multimedia content available on video-sharing websites has opened new challenges and opportunities for building robust multimodal solutions. This seminar will describe our novel multimodal architectures that (1) are robust to missing modalities\, (2) can identify noisy or less discriminative features\, and (3) can leverage unlabeled data. First\, we present a strategy that effectively combines auxiliary networks\, a transformer architecture\, and an optimized training mechanism for handling missing features. This problem is relevant since it is expected that during inference the multimodal system will face cases with missing features due to noise or occlusion. We implement this approach for audiovisual emotion recognition achieving state-of-the-art performance. Second\, we present a multimodal framework for dealing with scenarios characterized by noisy or less discriminative features. This situation is commonly observed in audiovisual automatic speech recognition (AV-ASR) with clean speech\, where the performance often drops compared to a speech-only solution due to the variability of visual features. 
The proposed approach is a deep learning solution with a gating layer that diminishes the effect of noisy or uninformative visual features\, keeping only useful information. The approach improves\, or at least\, maintains performance when visual features are used. Third\, we discuss alternative strategies to leverage unlabeled multimodal data. A promising approach is to use multimodal pretext tasks that are carefully designed to learn better representations for predicting a given task\, leveraging the relationship between acoustic and facial features. Another approach is using multimodal ladder networks where intermediate representations are predicted across modalities using lateral connections. These models offer principled solutions to increase the generalization and robustness of common speech-processing tasks when using multimodal architectures. \nBio\nCarlos Busso is a Professor at the University of Texas at Dallas’s Electrical and Computer Engineering Department\, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interest is in human-centered multimodal machine intelligence and application\, with a focus on the broad areas of affective computing\, multimodal human-machine interfaces\, in-vehicle active safety systems\, and machine learning methods for multimodal processing. He has worked on audio-visual emotion recognition\, analysis of emotional modulation in gestures and speech\, designing realistic human-like virtual characters\, and detection of driver distractions. He is a recipient of an NSF CAREER Award. In 2014\, he received the ICMI Ten-Year Technical Impact Award. In 2015\, his student received the third prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain)\, and the Best Paper Award at the AAAC ACII 2017 (with Yannakakis and Cowie). 
He received the Best of IEEE Transactions on Affective Computing Paper Collection in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). He received the ACM ICMI Community Service Award in 2023. In 2023\, he received the Distinguished Alumni Award in the Mid-Career/Academia category by the Signal and Image Processing Institute (SIPI) at the University of Southern California. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is an IEEE Fellow. He is a member of the ISCA\, and AAAC and a senior member of ACM. DTSTART;TZID=America/New_York:20231117T120000 DTEND;TZID=America/New_York:20231117T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Carlos Busso (University of Texas at Dallas) “Multimodal Machine Learning for Human-Centric Tasks” URL:https://www.clsp.jhu.edu/events/carl-busso-university-of-texas-at-dallas-multimodal-machine-learning-for-human-centric-tasks/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\n
Abstract
\nThe almost unlimited multimedia content available on video-sharing websites has opened new challenges and opportunities for building robust multimodal solutions. This seminar will describe our novel multimodal architectures that (1) are robust to missing modalities\, (2) can identify noisy or less discriminative features\, and (3) can leverage unlabeled data. First\, we present a strategy that effectively combines auxiliary networks\, a transformer architecture\, and an optimized training mechanism for handling missing features. This problem is relevant since it is expected that during inference the multimodal system will face cases with missing features due to noise or occlusion. We implement this approach for audiovisual emotion recognition achieving state-of-the-art performance. Second\, we present a multimodal framework for dealing with scenarios characterized by noisy or less discriminative features. This situation is commonly observed in audiovisual automatic speech recognition (AV-ASR) with clean speech\, where the performance often drops compared to a speech-only solution due to the variability of visual features. The proposed approach is a deep learning solution with a gating layer that diminishes the effect of noisy or uninformative visual features\, keeping only useful information. The approach improves\, or at least\, maintains performance when visual features are used. Third\, we discuss alternative strategies to leverage unlabeled multimodal data. A promising approach is to use multimodal pretext tasks that are carefully designed to learn better representations for predicting a given task\, leveraging the relationship between acoustic and facial features. Another approach is using multimodal ladder networks where intermediate representations are predicted across modalities using lateral connections. 
These models offer principled solutions to increase the generalization and robustness of common speech-processing tasks when using multimodal architectures.
\nBio
\nCarlos Busso is a Professor at the University of Texas at Dallas’s Electrical and Computer Engineering Department\, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interest is in human-centered multimodal machine intelligence and application\, with a focus on the broad areas of affective computing\, multimodal human-machine interfaces\, in-vehicle active safety systems\, and machine learning methods for multimodal processing. He has worked on audio-visual emotion recognition\, analysis of emotional modulation in gestures and speech\, designing realistic human-like virtual characters\, and detection of driver distractions. He is a recipient of an NSF CAREER Award. In 2014\, he received the ICMI Ten-Year Technical Impact Award. In 2015\, his student received the third prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain)\, and the Best Paper Award at the AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). He received the ACM ICMI Community Service Award in 2023. In 2023\, he received the Distinguished Alumni Award in the Mid-Career/Academia category by the Signal and Image Processing Institute (SIPI) at the University of Southern California. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is an IEEE Fellow. He is a member of the ISCA\, and AAAC and a senior member of ACM.
\n X-TAGS;LANGUAGE=en-US:2023\,Busso\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-24165@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Student Seminars CONTACT: DESCRIPTION: DTSTART;TZID=America/New_York:20231127T120000 DTEND;TZID=America/New_York:20231127T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Student Seminar – Aleem Khan URL:https://www.clsp.jhu.edu/events/student-seminar-aleem-khan/ X-COST-TYPE:free X-TAGS;LANGUAGE=en-US:2023\,Khan\,November END:VEVENT BEGIN:VEVENT UID:ai1ec-24239@www.clsp.jhu.edu DTSTAMP:20240329T113252Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:Abstract\nNon-invasive neural interfaces have the potential to transform human-computer interaction by providing users with low friction\, information rich\, always available inputs. Reality Labs at Meta is developing such an interface for the control of augmented reality devices based on electromyographic (EMG) signals captured at the wrist. Speech and audio technologies turn out to be especially well suited to unlocking the full potential of these signals and interactions and this talk will present several specific problems and the speech and audio approaches that have advanced us towards this ultimate goal of effortless and joyful interfaces. We will provide the necessary neuroscientific background to understand these signals\, describe automatic speech recognition-inspired interfaces generating text and beamforming-inspired interfaces for identifying individual neurons\, and then explain how they connect with egocentric machine intelligence tasks that might reside on these devices.\nBiography\nMichael I Mandel is a Research Scientist in Reality Labs at Meta. Previously\, he was an Associate Professor of Computer and Information Science at Brooklyn College and the CUNY Graduate Center working at the intersection of machine learning\, signal processing\, and psychoacoustics. 
He earned his BSc in Computer Science from the Massachusetts Institute of Technology and his MS and PhD with distinction in Electrical Engineering from Columbia University as a Fu Foundation Presidential Scholar. He was an FQRNT Postdoctoral Research Fellow in the Machine Learning laboratory (LISA/MILA) at the Université de Montréal\, an Algorithm Developer at Audience Inc\, and a Research Scientist in Computer Science and Engineering at the Ohio State University. His work has been supported by the National Science Foundation\, including via a CAREER award\, the Alfred P. Sloan Foundation\, and Google\, Inc. DTSTART;TZID=America/New_York:20240129T120000 DTEND;TZID=America/New_York:20240129T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Michael I Mandel (Meta) “Speech and Audio Processing in Non-Invasive Brain-Computer Interfaces at Meta” URL:https://www.clsp.jhu.edu/events/michael-i-mandel-cuny/ X-COST-TYPE:free X-ALT-DESC;FMTTYPE=text/html:\\n\\n\\nAbstract
\nNon-invasive neural interfaces have the potential to transform human-computer interaction by providing users with low friction\, information rich\, always available inputs. Reality Labs at Meta is developing such an interface for the control of augmented reality devices based on electromyographic (EMG) signals captured at the wrist. Speech and audio technologies turn out to be especially well suited to unlocking the full potential of these signals and interactions and this talk will present several specific problems and the speech and audio approaches that have advanced us towards this ultimate goal of effortless and joyful interfaces. We will provide the necessary neuroscientific background to understand these signals\, describe automatic speech recognition-inspired interfaces generating text and beamforming-inspired interfaces for identifying individual neurons\, and then explain how they connect with egocentric machine intelligence tasks that might reside on these devices.
\nBiography
\nMichael I Mandel is a Research Scientist in Reality Labs at Meta. Previously\, he was an Associate Professor of Computer and Information Science at Brooklyn College and the CUNY Graduate Center working at the intersection of machine learning\, signal processing\, and psychoacoustics. He earned his BSc in Computer Science from the Massachusetts Institute of Technology and his MS and PhD with distinction in Electrical Engineering from Columbia University as a Fu Foundation Presidential Scholar. He was an FQRNT Postdoctoral Research Fellow in the Machine Learning laboratory (LISA/MILA) at the Université de Montréal\, an Algorithm Developer at Audience Inc\, and a Research Scientist in Computer Science and Engineering at the Ohio State University. His work has been supported by the National Science Foundation\, including via a CAREER award\, the Alfred P. Sloan Foundation\, and Google\, Inc.
\n X-TAGS;LANGUAGE=en-US:2024\,January\,Mandel END:VEVENT END:VCALENDAR