BEGIN:VCALENDAR VERSION:2.0 PRODID:-//128.220.36.25//NONSGML kigkonsult.se iCalcreator 2.26.9// CALSCALE:GREGORIAN METHOD:PUBLISH X-FROM-URL:https://www.clsp.jhu.edu X-WR-TIMEZONE:America/New_York BEGIN:VTIMEZONE TZID:America/New_York X-LIC-LOCATION:America/New_York BEGIN:STANDARD DTSTART:20231105T020000 TZOFFSETFROM:-0400 TZOFFSETTO:-0500 RDATE:20241103T020000 TZNAME:EST END:STANDARD BEGIN:DAYLIGHT DTSTART:20240310T020000 TZOFFSETFROM:-0500 TZOFFSETTO:-0400 RDATE:20250309T020000 TZNAME:EDT END:DAYLIGHT END:VTIMEZONE BEGIN:VEVENT UID:ai1ec-24157@www.clsp.jhu.edu DTSTAMP:20240329T142928Z CATEGORIES;LANGUAGE=en-US:Seminars CONTACT: DESCRIPTION:
Abstract
\nIn this talk\, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE\, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio\, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens\, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder\, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically\, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks\, outperforming other recent models that use external supervised pre-training.
\nBio\nFlorian Metze is a Research Scientist Manager at Meta AI in New York\, supporting a team of researchers and engineers working on multi-modal (image\, video\, audio\, text) content understanding for Meta’s Family of Apps (Instagram\, Threads\, Facebook\, WhatsApp). He used to be an Associate Research Professor at Carnegie Mellon University\, in the School of Computer Science’s Language Technologies Institute\, where he still is an Adjunct Professor. He is also a co-founder of Abridge\, a company working on extracting information from doctor-patient conversations. His work covers many areas of speech recognition and multi-media analysis with a focus on end-to-end deep learning. Currently\, he focuses on multi-modal processing of videos\, and using that information to recommend unconnected content. In the past\, he has worked on low resource and multi-lingual speech processing\, speech recognition with articulatory features\, large-scale multi-media retrieval and summarization\, information extraction from medical interviews\, and recognition of personality or similar meta-data from speech.
\nFor more information\, please see http://www.cs.cmu.edu/directory/fmetze
\nDTSTART;TZID=America/New_York:20231110T120000 DTEND;TZID=America/New_York:20231110T131500 LOCATION:Hackerman Hall B17 @ 3400 N. Charles Street\, Baltimore\, MD 21218 SEQUENCE:0 SUMMARY:Florian Metze (CMU) “Masked Autoencoders that Listen” URL:https://www.clsp.jhu.edu/events/florian-metze-cmu/ X-COST-TYPE:free X-TAGS;LANGUAGE=en-US:2023\,Metze\,November END:VEVENT END:VCALENDAR