BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//128.220.36.25//NONSGML kigkonsult.se iCalcreator 2.26.9//
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-FROM-URL:https://www.clsp.jhu.edu
X-WR-TIMEZONE:America/New_York
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:STANDARD
DTSTART:20231105T020000
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
RDATE:20241103T020000
TZNAME:EST
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20240310T020000
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
RDATE:20250309T020000
TZNAME:EDT
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:ai1ec-21023@www.clsp.jhu.edu
DTSTAMP:20240328T231209Z
CATEGORIES;LANGUAGE=en-US:Seminars
CONTACT:
DESCRIPTION:Abstract\nSpeech data is notoriously difficult to work with due to the variety of codecs\, recording lengths\, and metadata formats. We present Lhotse\, a speech data representation library that draws upon lessons learned from the Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes and data preparation recipes for over 30 popular speech corpora. Various datasets can be easily combined and re-purposed for different tasks. The library handles multi-channel recordings\, long recordings\, local and cloud storage\, and lazy and on-the-fly operations\, among other features. We introduce the Cut and CutSet concepts\, which simplify common data wrangling tasks for audio and help incorporate the acoustic context of speech utterances. Finally\, we show how Lhotse leverages PyTorch data API abstractions and adapts them to handle speech data for deep learning.\nBiography\nPiotr Zelasko is an assistant research scientist in the Center for Language and Speech Processing (CLSP) who specializes in automatic speech recognition (ASR) and spoken language understanding (SLU). His current research focuses on applying multilingual and crosslingual speech recognition systems to categorize the phonetic inventory of a previously unknown language and on improving defenses against adversarial attacks on both speaker identification and automatic speech recognition systems. He is also addressing the question of how to structure a spontaneous conversation into high-level semantic units such as dialog acts or topics. Finally\, he is working on Lhotse + K2\, the next-generation speech processing research software ecosystem. Before joining Johns Hopkins\, Zelasko worked as a machine learning consultant for Avaya (2017-2019) and as a machine learning engineer for Techmo (2015-2017). Zelasko received his PhD (2019) in electronics engineering\, as well as his master’s (2014) and undergraduate (2013) degrees in acoustic engineering from AGH University of Science and Technology in Kraków\, Poland.
DTSTART;TZID=America/New_York:20211029T120000
DTEND;TZID=America/New_York:20211029T131500
LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218
SEQUENCE:0
SUMMARY:Piotr Zelasko (CLSP at JHU) “Lhotse: a speech data representation library for the modern deep learning ecosystem”
URL:https://www.clsp.jhu.edu/events/piotr-zelasko-clsp-at-jhu-lhotse-a-speech-data-representation-library-for-the-modern-deep-learning-ecosystem/
X-COST-TYPE:free
X-TAGS;LANGUAGE=en-US:2021\,October\,Zelasko
END:VEVENT
BEGIN:VEVENT
UID:ai1ec-21275@www.clsp.jhu.edu
DTSTAMP:20240328T231209Z
CATEGORIES;LANGUAGE=en-US:Student Seminars
CONTACT:
DESCRIPTION:Abstract\nAutomatic discovery of phone or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ contrastive predictive coding (CPC)\, where the model learns representations by predicting the next frame given past context. However\, CPC only looks at the audio signal’s structure at the frame level. Speech structure exists beyond the frame level\, i.e.\, at the phone level or even higher. We propose a segmental contrastive predictive coding (SCPC) framework to learn from the signal structure at both the frame and phone levels.\n\nSCPC is a hierarchical model with three stages trained in an end-to-end manner. In the first stage\, the model predicts future feature frames and extracts frame-level representations from the raw waveform. In the second stage\, a differentiable boundary detector finds variable-length segments. In the last stage\, the model predicts future segments to learn segment representations. Experiments show that our model outperforms existing phone and word segmentation methods on the TIMIT and Buckeye datasets.
DTSTART;TZID=America/New_York:20220211T120000
DTEND;TZID=America/New_York:20220211T131500
LOCATION:Ames Hall 234 @ 3400 N. Charles Street\, Baltimore\, MD 21218
SEQUENCE:0
SUMMARY:Student Seminar – Saurabhchand Bhati “Segmental Contrastive Predictive Coding for Unsupervised Acoustic Segmentation”
URL:https://www.clsp.jhu.edu/events/student-seminar-saurabhchand-bhati/
X-COST-TYPE:free
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Abstract\nThe growing power in computing and AI promises a near-term future of human-machine teamwork. In this talk\, I will present my research group’s efforts in understanding the complex dynamics of human-machine interaction and designing intelligent machines aimed to assist and collaborate with people. I will focus on 1) tools for onboarding machine teammates and authoring machine assistance\, 2) methods for detecting\, and broadly managing\, errors in collaboration\, and 3) building blocks of knowledge needed to enable ad hoc human-machine teamwork. I will also highlight our recent work on designing assistive\, collaborative machines to support older adults aging in place.\nBiography\nChien-Ming Huang is the John C. Malone Assistant Professor in the Department of Computer Science at the Johns Hopkins University. His research focuses on designing interactive AI aimed to assist and collaborate with people. He publishes in top-tier venues in HRI\, HCI\, and robotics\, including Science Robotics\, HRI\, CHI\, and CSCW. His research has received media coverage from MIT Technology Review\, Tech Insider\, and Science Nation. Huang completed his postdoctoral training at Yale University and received his Ph.D. in Computer Science at the University of Wisconsin–Madison. He is a recipient of the NSF CAREER award. https://www.cs.jhu.edu/~cmhuang/
X-TAGS;LANGUAGE=en-US:2023\,Huang\,September
END:VEVENT
BEGIN:VEVENT
UID:ai1ec-24479@www.clsp.jhu.edu
DTSTAMP:20240328T231209Z
CATEGORIES;LANGUAGE=en-US:Student Seminars
CONTACT:
DESCRIPTION:Abstract\nThe speech field is evolving to solve more challenging scenarios\, such as multi-channel recordings with multiple simultaneous talkers. Given the many microphone setups in use\, we present the UniX-Encoder\, a universal encoder designed for multiple tasks that works with any microphone array\, in both single- and multi-talker environments. Our research enhances previous multi-channel speech processing efforts in four key areas: 1) Adaptability: In contrast to traditional models constrained to specific microphone array configurations\, our encoder is universally compatible. 2) Multi-task capability: Beyond the single-task focus of previous systems\, UniX-Encoder acts as a robust upstream model\, adeptly extracting features for diverse tasks\, including ASR and speaker recognition. 3) Self-supervised training: The encoder is trained without requiring labeled multi-channel data. 4) End-to-end integration: In contrast to models that first beamform and then process single channels\, our encoder offers an end-to-end solution\, bypassing explicit beamforming or separation. To validate its effectiveness\, we tested the UniX-Encoder on a synthetic multi-channel dataset derived from the LibriSpeech corpus. Across tasks like speech recognition and speaker diarization\, our encoder consistently outperformed combinations such as the WavLM model with the BeamformIt frontend.
DTSTART;TZID=America/New_York:20240311T200500
DTEND;TZID=America/New_York:20240311T210500
SEQUENCE:0
SUMMARY:Zili Huang (JHU) “UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing”
URL:https://www.clsp.jhu.edu/events/zili-huang-jhu-unix-encoder-a-universal-x-channel-speech-encoder-for-ad-hoc-microphone-array-speech-processing/
X-COST-TYPE:free
X-TAGS;LANGUAGE=en-US:2024\,Huang\,March
END:VEVENT
END:VCALENDAR