BEGIN:VCALENDAR VERSION:2.0 PRODID:-//128.220.36.25//NONSGML kigkonsult.se iCalcreator 2.26.9// CALSCALE:GREGORIAN METHOD:PUBLISH X-FROM-URL:https://www.clsp.jhu.edu X-WR-TIMEZONE:America/New_York BEGIN:VTIMEZONE TZID:America/New_York X-LIC-LOCATION:America/New_York BEGIN:STANDARD DTSTART:20231105T020000 TZOFFSETFROM:-0400 TZOFFSETTO:-0500 RDATE:20241103T020000 TZNAME:EST END:STANDARD BEGIN:DAYLIGHT DTSTART:20240310T020000 TZOFFSETFROM:-0500 TZOFFSETTO:-0400 RDATE:20250309T020000 TZNAME:EDT END:DAYLIGHT END:VTIMEZONE BEGIN:VEVENT UID:ai1ec-23515@www.clsp.jhu.edu DTSTAMP:20240329T081147Z CATEGORIES;LANGUAGE=en-US:Student Seminars CONTACT: DESCRIPTION:
Abstract
\nHow important are different temporal s peech modulations for speech recognition? We answer this question from two complementary perspectives. Firstly\, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual i nformation between temporal modulations with frame-wise phoneme labels. Lo oking from another perspective\, we ask – which speech modulations an Auto matic Speech Recognition (ASR) system prefers for its operation. Data-driv en weights are learned over the modulation spectrum and optimized for an e nd-to-end ASR task. Both methods unanimously agree that speech information is mostly contained in slow modulation. Maximum mutual information occurs around 3-6 Hz which also happens to be the range of modulations most pref erred by the ASR. In addition\, we show that the incorporation of this kno wledge into ASRs significantly reduces their dependency on the amount of t raining data.
\n\n