BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//128.220.36.25//NONSGML kigkonsult.se iCalcreator 2.26.9//
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-FROM-URL:https://www.clsp.jhu.edu
X-WR-TIMEZONE:America/New_York
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:STANDARD
DTSTART:20231105T020000
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
RDATE:20241103T020000
TZNAME:EST
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20240310T020000
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
RDATE:20250309T020000
TZNAME:EDT
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:ai1ec-21267@www.clsp.jhu.edu
DTSTAMP:20240328T190614Z
CATEGORIES;LANGUAGE=en-US:Seminars
CONTACT:
DESCRIPTION:Abstract\nIn this talk\, I present a multipronged strategy for zero-shot cross-lingual Information Extraction\, that is\, the construction of an IE model for some target language given existing annotations exclusively in some other language. This work is part of the JHU team’s effort under the IARPA BETTER program. I explore data augmentation techniques\, including data projection and self-training\, and how different pretrained encoders affect them. Through extensive experiments and extensions of these techniques\, we find that a combination of approaches\, both new and old\, outperforms any single cross-lingual strategy.\nBiography\nMahsa Yarmohammadi is an assistant research scientist at CLSP\, JHU\, where she leads state-of-the-art research in cross-lingual language and speech applications and algorithms. A primary focus of Yarmohammadi’s research is using deep learning techniques to transfer existing resources into other languages and to learn representations of language from multilingual data. She also works on automatic speech recognition and speech translation. Yarmohammadi received her PhD in computer science and engineering from Oregon Health & Science University (2016). She joined CLSP as a postdoctoral fellow in 2017.
DTSTART;TZID=America/New_York:20220204T120000
DTEND;TZID=America/New_York:20220204T131500
LOCATION:Ames 234\, presented virtually via Zoom: https://wse.zoom.us/j/96735183473
SEQUENCE:0
SUMMARY:Mahsa Yarmohammadi (Johns Hopkins University) “Data Augmentation for Zero-shot Cross-Lingual Information Extraction”
URL:https://www.clsp.jhu.edu/events/mahsa-yarmohammadi-johns-hopkins-university-data-augmentation-for-zero-shot-cross-lingual-information-extraction/
X-COST-TYPE:free
END:VEVENT
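A minimal sketch of the data projection step mentioned in the abstract above, assuming word alignments between an annotated source sentence and its translation are already available; this is an illustrative, generic span-projection routine, not the talk's or the BETTER program's implementation.

# Illustrative only: project labeled spans from an annotated source sentence
# onto its translation via word alignments (one simple form of data projection
# for cross-lingual IE; the alignments themselves are assumed given).

def project_spans(src_spans, alignment, tgt_len):
    """Map labeled source-token spans onto target tokens via word alignment.

    src_spans: list of (start, end, label), end exclusive, over source tokens.
    alignment: list of (src_idx, tgt_idx) word-alignment pairs.
    tgt_len:   number of target tokens.
    """
    projected = []
    for start, end, label in src_spans:
        tgt_idx = sorted(t for s, t in alignment if start <= s < end and t < tgt_len)
        if tgt_idx:
            # Keep the contiguous envelope of aligned target tokens.
            projected.append((tgt_idx[0], tgt_idx[-1] + 1, label))
    return projected


if __name__ == "__main__":
    # Hypothetical toy example: "John visited Paris" -> "John a visité Paris"
    src_spans = [(0, 1, "PER"), (2, 3, "LOC")]
    alignment = [(0, 0), (1, 2), (2, 3)]
    print(project_spans(src_spans, alignment, tgt_len=4))
    # -> [(0, 1, 'PER'), (3, 4, 'LOC')]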
Abstract
As humans\, our understanding of language is grounded in a rich mental model about “how the world works” – one that we learn through perception and interaction. We use this understanding to reason beyond what we literally observe or read\, imagining how situations might unfold in the world. Machines today struggle at this kind of reasoning\, which limits how they can communicate with humans.
In my talk\, I will discuss three lines of work to bridge this gap between machines and humans. I will first discuss how we might measure grounded understanding. I will introduce a suite of approaches for constructing benchmarks\, using machines in the loop to filter out spurious biases. Next\, I will introduce PIGLeT: a model that learns physical commonsense understanding by interacting with the world through simulation\, using this knowledge to ground language. From an English-language description of an event\, PIGLeT can anticipate how the world state might change – outperforming text-only models that are orders of magnitude larger. Finally\, I will introduce MERLOT\, which learns about situations in the world by watching millions of YouTube videos with transcribed speech. Through training objectives inspired by the developmental psychology idea of multimodal reentry\, MERLOT learns to fuse language\, vision\, and sound into powerful representations.
Together\, these directions suggest a path forward for building machines that learn language rooted in the world.
Biography
Rowan Zellers is a final-year PhD candidate in Computer Science & Engineering at the University of Washington\, advised by Yejin Choi and Ali Farhadi. His research focuses on enabling machines to understand language\, vision\, sound\, and the world beyond these modalities. He has been recognized through an NSF Graduate Fellowship and a NeurIPS 2021 outstanding paper award. His work has appeared in several media outlets\, including Wired\, the Washington Post\, and the New York Times. He graduated from Harvey Mudd College with a B.S. in Computer Science & Mathematics and has interned at the Allen Institute for AI.
X-TAGS;LANGUAGE=en-US:2022\,February\,Zellers
END:VEVENT
BEGIN:VEVENT
UID:ai1ec-22403@www.clsp.jhu.edu
DTSTAMP:20240328T190614Z
CATEGORIES;LANGUAGE=en-US:Seminars
CONTACT:
DESCRIPTION:Abstract\nVoice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one’s voice to sound like that of another without changing the linguistic content. Voice conversion belongs to the broader technical field of speech synthesis\, which converts text to speech or modifies properties of speech such as voice identity\, emotion\, and accent. Voice conversion involves multiple speech processing techniques\, such as speech analysis\, spectral conversion\, prosody conversion\, speaker characterization\, and vocoding. With recent advances in theory and practice\, we are now able to produce human-like voice quality with high speaker similarity. In this talk\, Dr. Sisman will present recent advances in voice conversion and discuss their promise and limitations. Dr. Sisman will also provide a summary of the available resources for expressive voice conversion research.\nBiography\nDr. Berrak Sisman (Member\, IEEE) received the Ph.D. degree in electrical and computer engineering from the National University of Singapore in 2020\, fully funded by the A*STAR Graduate Academy under the Singapore International Graduate Award (SINGA). She is currently a tenure-track Assistant Professor in the Erik Jonsson School’s Department of Electrical and Computer Engineering at the University of Texas at Dallas\, United States. Prior to joining UT Dallas\, she was a faculty member at the Singapore University of Technology and Design (2020-2022). She was a Postdoctoral Research Fellow at the National University of Singapore (2019-2020). She was an exchange doctoral student at the University of Edinburgh and a visiting scholar at the Centre for Speech Technology Research (CSTR)\, University of Edinburgh (2019). She was a visiting researcher at the RIKEN Advanced Intelligence Project in Japan (2018). Her research focuses on machine learning\, signal processing\, emotion\, speech synthesis\, and voice conversion.\nDr. Sisman has served as an Area Chair at INTERSPEECH 2021\, INTERSPEECH 2022\, and IEEE SLT 2022\, and as Publication Chair at ICASSP 2022. She has been elected a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term January 2022 to December 2024. She plays leadership roles in conference organization and is active in technical committees. She has served as the General Coordinator of the Student Advisory Committee (SAC) of the International Speech Communication Association (ISCA).
DTSTART;TZID=America/New_York:20221104T120000
DTEND;TZID=America/New_York:20221104T131500
LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218
SEQUENCE:0
SUMMARY:Berrak Sisman (University of Texas at Dallas) “Speech Synthesis and Voice Conversion: Machine Learning can Mimic Anyone’s Voice”
URL:https://www.clsp.jhu.edu/events/berrak-sisman-university-of-texas-at-dallas/
X-COST-TYPE:free
X-TAGS;LANGUAGE=en-US:2022\,November\,Sisman
END:VEVENT
BEGIN:VEVENT
UID:ai1ec-24157@www.clsp.jhu.edu
DTSTAMP:20240328T190614Z
CATEGORIES;LANGUAGE=en-US:Seminars
CONTACT:
DESCRIPTION:Abstract\nIn this talk\, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE\, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio\, feeding only the non-masked tokens through the encoder layers. The decoder then re-orders and decodes the encoded context\, padded with mask tokens\, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder\, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically\, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks\, outperforming other recent models that use external supervised pre-training.\nBio\nFlorian Metze is a Research Scientist Manager at Meta AI in New York\, supporting a team of researchers and engineers working on multi-modal (image\, video\, audio\, text) content understanding for Meta’s Family of Apps (Instagram\, Threads\, Facebook\, WhatsApp). He was previously an Associate Research Professor at Carnegie Mellon University in the School of Computer Science’s Language Technologies Institute\, where he remains an Adjunct Professor. He is also a co-founder of Abridge\, a company working on extracting information from doctor-patient conversations. His work covers many areas of speech recognition and multimedia analysis\, with a focus on end-to-end deep learning. Currently\, he focuses on multi-modal processing of videos and on using that information to recommend unconnected content. In the past\, he has worked on low-resource and multilingual speech processing\, speech recognition with articulatory features\, large-scale multimedia retrieval and summarization\, information extraction from medical interviews\, and recognition of personality and similar metadata from speech.\nFor more information\, please see http://www.cs.cmu.edu/directory/fmetze\n
DTSTART;TZID=America/New_York:20231110T120000
DTEND;TZID=America/New_York:20231110T131500
LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218
SEQUENCE:0
SUMMARY:Florian Metze (CMU) “Masked Autoencoders that Listen”
URL:https://www.clsp.jhu.edu/events/florian-metze-cmu/
X-COST-TYPE:free
X-TAGS;LANGUAGE=en-US:2023\,Metze\,November
END:VEVENT
END:VCALENDAR
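A minimal sketch of the masking step described in the Audio-MAE abstract above, assuming a log-mel spectrogram whose sides are multiples of the patch size; illustrative only, not the Audio-MAE implementation. The spectrogram is split into patch tokens, a high fraction is masked at random, and only the visible patches would be fed to the Transformer encoder; the decoder later restores the original order with mask tokens before reconstruction.

# Illustrative only: MAE-style patch masking for an audio spectrogram.
import numpy as np

def mask_spectrogram_patches(spec, patch=16, mask_ratio=0.8, seed=0):
    """Return (visible_patches, visible_indices, num_patches).

    spec: 2-D array (time x frequency) whose sides are multiples of `patch`.
    """
    t, f = spec.shape
    # Split into non-overlapping patches, each flattened to a token vector.
    patches = spec.reshape(t // patch, patch, f // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    n = patches.shape[0]
    rng = np.random.default_rng(seed)
    keep = rng.permutation(n)[: int(n * (1 - mask_ratio))]
    keep.sort()
    return patches[keep], keep, n


if __name__ == "__main__":
    spec = np.random.rand(128, 128).astype(np.float32)  # stand-in log-mel spectrogram
    visible, idx, n = mask_spectrogram_patches(spec)
    print(f"kept {len(idx)} of {n} patches")  # ~20% visible at mask_ratio=0.8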