Neural Speaker Diarization in the Context of Automatic Speech Recognition
Speakers: Kyu J. Han, Tae Jin Park, and Dimitrios Dimitriadis
Kyu J. Han
Senior Director of Speech Modeling and ML Data Labeling @ASAPP
Bio: Dr. Kyu J. Han is a Senior Director of Speech Modeling and ML Data Labeling at ASAPP, leading an applied science team primarily working on automatic speech recognition and voice analytics for ASAPP's machine learning services to its enterprise customers in the customer experience (CX) domain. He received his Ph.D. from the University of Southern California, USA in 2009 under Prof. Shrikanth Narayanan and has since held research positions at IBM Watson, Ford Research, Capio (acquired by Twilio), JD AI Research and ASAPP. At IBM he participated in the IARPA Biometrics Exploitation Science and Technology (BEST) project and the DARPA Robust Automatic Transcription of Speech (RATS) program. He led a research team at Capio that achieved state-of-the-art performance in telephony speech recognition and successfully completed a government-funded project for noise-robust, on-prem ASR system integration across 13 different languages. Dr. Han is actively involved in speech community activities, serving as a reviewer for IEEE, ISCA and ACL journals and conferences. He was a member of the Speech and Language Processing Technical Committee (SLTC) of the IEEE Signal Processing Society from 2019 to 2021, where he served as the Chair of the Workshop Subcommittee, and was also a member of the Organizing Committee for IEEE SLT 2021. He was a Survey Talk speaker and a Doctoral Consortium panelist at Interspeech 2019, a Tutorial speaker at Interspeech 2020, and an Industry Expert Talk speaker at ICASSP 2022. In addition, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017. He is a Senior Member of the IEEE.
Tae Jin Park
Deep Learning Applied Scientist @Nvidia
Bio: Taejin Park received his B.S. degree in Electrical Engineering and M.S. degree in Electrical Engineering and Computer Science from Seoul National University, Seoul, South Korea, in 2010 and 2012, respectively. In 2012, he joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea, as a researcher. He received his Ph.D. in Electrical Engineering and an M.S. in Computer Science from the University of Southern California (USC). He is currently an applied scientist at NVIDIA. His research interests include machine learning and speech signal processing, concentrating on speaker diarization.
Dimitrios Dimitriadis
Principal Researcher @Microsoft Research
Bio: Dr. D. Dimitriadis has been a Principal Researcher at Microsoft Research since 2018, focusing on federated learning. He has worked as a Research Staff Member at IBM (2014-2018), a Principal Member of Technical Staff at AT&T Labs (2009-2014), and a lecturer (P.D. 407/80) at the School of ECE, NTUA, Greece. He is a Senior Member of the IEEE and has served as a chair at several conferences and workshops. He has published more than 100 papers in peer-reviewed scientific journals and conferences, with over 2200 citations. He received his Ph.D. degree from NTUA in February 2005 with the thesis "Non-Linear Speech Processing, Modulation Models and Applications to Speech Recognition".
Speaker Diarization: A Journey from Unsupervised to Supervised Approaches
Speaker diarization is the task of determining "who spoke when". It is a critical component in various AI applications that need to process long-form audio/video streams involving multiple speakers, such as dialogues, meetings, telephone recordings, and broadcast news. Traditional speaker diarization pipelines usually consist of multiple modules: voice activity detection, change point detection, speaker embedding, and speaker clustering. Although some of these modules, such as speaker embedding, are usually trained in a supervised and discriminative way, the clustering procedure remains unsupervised and nonparametric. This makes it difficult to optimize the diarization error rate, or other diarization metrics, over the entire pipeline. Recently, supervised neural speaker diarization approaches, such as UIS-RNN, EEND and DNC, have made it possible to implement multiple stages as an all-neural solution, enabling the pipeline to be trained in a fully supervised and more "end-to-end" fashion. These supervised approaches benefit more from large-scale training datasets and usually achieve significantly better performance than unsupervised approaches. In this tutorial, we will first review the basics of speaker diarization and the traditional unsupervised methods. Then we will introduce several representative supervised solutions and discuss the challenges in developing these supervised approaches. Finally, we present further discussions and future directions.
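To make the unsupervised clustering stage of the traditional pipeline concrete, here is a minimal sketch of agglomerative clustering over segment-level speaker embeddings. The embeddings are simulated with synthetic vectors around two centroids (a stand-in for a trained embedding extractor); the distance threshold is a hypothetical value that, in a real system, would itself need to be tuned or estimated, which illustrates why this stage is hard to optimize for diarization error rate end to end.

```python
# Sketch of the clustering stage of a traditional diarization pipeline,
# assuming speaker embeddings have already been extracted per segment
# (simulated here with synthetic 128-dim vectors for two speakers).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Simulate segment embeddings clustered around two speaker centroids.
centroid_a = rng.normal(0.0, 1.0, size=128)
centroid_b = rng.normal(5.0, 1.0, size=128)
segments = np.vstack([
    centroid_a + 0.1 * rng.normal(size=(5, 128)),  # 5 segments of speaker A
    centroid_b + 0.1 * rng.normal(size=(5, 128)),  # 5 segments of speaker B
])

# Agglomerative clustering on pairwise cosine distances.
dists = pdist(segments, metric="cosine")
tree = linkage(dists, method="average")

# Cut the dendrogram at a (hypothetical) distance threshold to obtain
# speaker labels; no label supervision reaches this step.
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)
```

Note that nothing in this step is trained on diarization labels: the threshold and linkage rule are fixed heuristics, which is exactly the gap that supervised approaches like UIS-RNN, EEND and DNC aim to close.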
Unraveling the acoustic biomarkers of COVID-19
Speaker: Sriram Ganapathy (http://www.leap.ee.iisc.ac.in/sriram/)
Associate Professor, Learning and Extraction of Acoustic Patterns (LEAP) Lab, Indian Institute of Science, Bangalore
The pandemic induced by COVID-19 over the last 30 months needs no introduction. The efforts to deal with the pandemic have been truly global, including identifying genomic patterns, modeling disease progression, and developing vaccines, drugs and tests. This talk will focus on the testing part, where there is a persistent need for testing methodologies that provide a good trade-off between accuracy, cost, time and scale. Audio-based testing of COVID-19 has the potential to be a niche tool that combines the advantages of being remote, inexpensive, scalable and rapid. Motivated by this goal, several groups across the world have attempted to develop protocols, datasets, and machine learning methodologies. This talk will provide a brief motivation and an overview of the work performed by various groups on the task of identifying audio-based biomarkers of COVID-19. I will discuss our work, Project Coswara, a tool for COVID-19 diagnostics using crowd-sourced multi-modal data. The rationale for the project, the data collection and the development of the web-based tool will be highlighted. The machine learning aspects of the tool development, including generalization to new variants of COVID-19 and the fairness analysis of the developed models, will be presented. The talk will also summarize the findings from the series of data challenges, named DiCOVA (Diagnosis of COVid-19 using Audio), organized using the data collected as part of the project.
Speech Emotion Recognition: Challenges, Approaches and Future Directions
Speakers: Longbiao Wang and Xixin Wu
Speech emotion recognition (SER) is an important technique for natural speech-based human-machine interaction. Though the task is quite challenging, SER has attracted much research attention and achieved great progress over the past decades. Many application scenarios have been explored and have demonstrated the value of the technique. This tutorial will describe the challenges of the SER problem and provide an overview of recent advances in SER, where signal processing and machine learning techniques are fused to obtain significant improvements and yield new insights. Exemplary approaches, including feature extraction, representation learning and model architectures, will be illustrated in the hope of inspiring future research. Some interesting questions that need further investigation will also be highlighted.
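One recurring representation-learning building block in SER is statistics pooling: frame-level acoustic features of arbitrary length are aggregated into a fixed-size utterance-level vector that a downstream emotion classifier can consume. The sketch below illustrates the idea with random frame features standing in for real filterbank or learned features; the function name and dimensions are illustrative, not from any specific SER system.

```python
# Minimal sketch of utterance-level statistics pooling, a common
# aggregation step in SER models. Frame features are random stand-ins
# for e.g. log-mel filterbanks or learned frame embeddings.
import numpy as np

rng = np.random.default_rng(0)

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Concatenate per-dimension mean and standard deviation over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Two utterances of different lengths (frames x feature dims).
short_utt = rng.normal(size=(120, 40))
long_utt = rng.normal(size=(430, 40))

emb_short = statistics_pooling(short_utt)
emb_long = statistics_pooling(long_utt)

# Both utterances map to the same fixed dimensionality (2 * 40 = 80),
# so a downstream emotion classifier is independent of duration.
print(emb_short.shape, emb_long.shape)
```

The design point is duration invariance: whatever the frame-level front end produces, pooling decouples the classifier from utterance length, which is one reason this pattern appears across many SER architectures.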