Tutorial Speakers

Neural Speaker Diarization in the Context of Automatic Speech Recognition

Speakers: Kyu J. Han, Tae Jin Park, and Dimitrios Dimitriadis

Chair: Ming Li

Kyu J. Han

Senior Director of Speech Modeling and ML Data Labeling @ASAPP

Bio: Dr. Kyu J. Han is a Senior Director of Speech Modeling and ML Data Labeling at ASAPP, leading an applied science team primarily working on automatic speech recognition and voice analytics for ASAPP's machine learning services to its enterprise customers in the customer experience (CX) domain. He received his Ph.D. from the University of Southern California, USA in 2009 under Prof. Shrikanth Narayanan and has since held research positions at IBM Watson, Ford Research, Capio (acquired by Twilio), JD AI Research, and ASAPP. At IBM he participated in the IARPA Biometrics Exploitation Science and Technology (BEST) project and the DARPA Robust Automatic Transcription of Speech (RATS) program. He led a research team at Capio that achieved state-of-the-art performance in telephony speech recognition and successfully completed a government-funded project for noise-robust, on-prem ASR system integration across 13 different languages. Dr. Han is actively involved in speech community activities, serving as a reviewer for IEEE, ISCA, and ACL journals and conferences. He was a member of the Speech and Language Processing Technical Committee (SLTC) of the IEEE Signal Processing Society from 2019 to 2021, where he served as the Chair of the Workshop Subcommittee. He also served as a member of the Organizing Committee for IEEE SLT 2021. He was a Survey Talk speaker and a Doctoral Consortium panelist at Interspeech 2019, a Tutorial speaker at Interspeech 2020, and an Industry Expert Talk speaker at ICASSP 2022. In addition, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017. He is a Senior Member of the IEEE.

Tae Jin Park

Deep Learning Applied Scientist @Nvidia

Bio: Taejin Park received his B.S. degree in Electrical Engineering and M.S. degree in Electrical Engineering and Computer Science from Seoul National University, Seoul, South Korea, in 2010 and 2012, respectively. In 2012, he joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea, as a researcher. He received his Ph.D. degree in Electrical Engineering and M.S. degree in Computer Science from the University of Southern California (USC). Taejin Park is currently working as an applied scientist at NVIDIA. His research interests include machine learning and speech signal processing, with a focus on speaker diarization.

Dimitrios Dimitriadis

Principal Researcher @Microsoft Research

Bio: Dr. D. Dimitriadis has been a Principal Researcher at Microsoft Research since 2018, focusing on federated learning. He previously worked as a Research Staff Member at IBM (2014-2018), a Principal Member of Technical Staff at AT&T Labs (2009-2014), and a Lecturer (P.D. 407/80) at the School of ECE, NTUA, Greece. He is a Senior Member of the IEEE and has served as a chair for several conferences and workshops. He has published more than 100 papers in peer-reviewed scientific journals and conferences, with over 2,200 citations. He received his Ph.D. degree from NTUA in February 2005; his thesis was titled "Non-Linear Speech Processing, Modulation Models and Applications to Speech Recognition".

Tutorial Speakers

Speaker Diarization: A Journey from Unsupervised to Supervised Approaches

Speakers: Chao Zhang and Quan Wang

Chair: Ming Li


Speaker diarization is the task of determining “who spoke when”. It is a critical component in various AI applications that need to process long-form audio/video streams involving multiple speakers, such as dialogues, meetings, telephone recordings, and broadcast news. Traditional speaker diarization pipelines usually consist of multiple modules: voice activity detection, change point detection, speaker embedding, and speaker clustering. Although some of these modules, such as speaker embedding, are usually trained in a supervised and discriminative way, the clustering procedures remain unsupervised and nonparametric. This makes it difficult to optimize the diarization error rate or other diarization metrics for the entire pipeline. Recently, supervised neural speaker diarization approaches, such as UIS-RNN, EEND, and DNC, allow an all-neural solution implementing multiple stages, and enable the pipeline to be trained in a fully supervised and more “end-to-end” fashion. These supervised approaches benefit more from large-scale training datasets and usually perform significantly better than unsupervised approaches. In this tutorial, we will first review the basics of speaker diarization and the traditional unsupervised methods. Then we will introduce several representative supervised solutions and discuss the challenges in developing them. Finally, we will present further discussions and future directions.
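To make the unsupervised clustering stage of the traditional pipeline concrete, here is a minimal, self-contained sketch in pure NumPy. The embeddings are synthetic stand-ins (not real speaker embeddings), and the plain k-means used here is only illustrative; it is not the clustering algorithm of any specific system mentioned in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frame-level speaker embeddings from two speakers.
spk_a = rng.normal(0.0, 0.1, size=(20, 8)) + np.array([1.0] * 4 + [0.0] * 4)
spk_b = rng.normal(0.0, 0.1, size=(20, 8)) + np.array([0.0] * 4 + [1.0] * 4)
X = np.vstack([spk_a, spk_b])

def length_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cluster_embeddings(X, k, iters=50):
    """Plain k-means on length-normalized embeddings (Euclidean ~ cosine)."""
    X = length_normalize(X)
    # Deterministic initialization: first and last frames as seeds for k=2.
    centers = X[[0, -1]].copy() if k == 2 else X[:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = length_normalize(X[labels == j].mean(axis=0))
    return labels

labels = cluster_embeddings(X, k=2)  # one speaker label per frame
```

Because this stage assigns labels by geometric similarity alone, with no loss tied to the diarization error rate, the pipeline cannot be optimized end-to-end, which is exactly the limitation the supervised approaches above address.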

Chao Zhang (Google)
Quan Wang (Google)
Chao Zhang received his B.E. and M.S. degrees in 2009 and 2012 from the Department of Computer Science & Technology, Tsinghua University, and his Ph.D. degree in 2017 from the Cambridge University Engineering Department. He is currently a Senior Research Scientist at Google. Before that, he was a Research Associate at Cambridge University and an Advisor of JD.com. He has published 60 peer-reviewed papers in speech and language processing, and has received the IBM Spoken Language Processing Award from ICASSP 2014 and the best student paper awards from NCMMSC 2011, ASRU 2019, and SLT 2021. He is also a Visiting Fellow of Cambridge University and has served as an Associate Member of the IEEE SLTC since 2021. His research interests include speech recognition, speaker diarization, and emotion recognition.

Quan Wang is currently a Staff Software Engineer at Google, leading the Speaker, Voice & Language team. Quan is an IEEE Senior Member and a former Machine Learning Scientist on the Amazon Alexa team. Quan received his B.E. degree in 2010 from the Department of Automation, Tsinghua University, and his Ph.D. degree in 2014 from Rensselaer Polytechnic Institute. Quan has published 50+ patents and papers at top conferences, with 2,600+ citations. Quan is the author of the textbook "Voice Identity Techniques: From core algorithms to engineering practice", which won the Distinguished Author of the Year 2020 Award. Quan is also an online instructor, teaching the Speaker Recognition course on Udemy.

Tutorial Speakers

Unraveling the acoustic biomarkers of COVID-19

Speaker: Sriram Ganapathy (http://www.leap.ee.iisc.ac.in/sriram/)

Associate Professor, Indian Institute of Science, Bangalore; Learning and Extraction of Acoustic Patterns (LEAP) Lab

Chair: Meng Sun


The COVID-19 induced pandemic of the last 30 months needs no introduction. The efforts to deal with the pandemic have been truly global, including identification of genomic patterns, developing models of disease progression, vaccination, drugs, and testing. This talk will focus on testing, where there is a persistent need for methodologies that provide a good trade-off among accuracy, cost, time, and scale. Audio-based testing of COVID-19 has the potential to be a niche tool that combines the advantages of being remote, inexpensive, scalable, and rapid. Motivated by this goal, several groups across the world have attempted to develop protocols, datasets, and machine learning methodologies. This talk will provide a brief motivation and overview of the work performed by various groups on the task of identifying audio-based biomarkers of COVID-19. I will discuss our work, Project Coswara, a tool for COVID-19 diagnostics using crowd-sourced multi-modal data. The rationale for the project, the data collection, and the development of the web-based tool will be highlighted. The machine learning aspects of the tool's development, including generalization to new variants of COVID-19 and fairness analysis of the models, will be presented. The talk will also summarize the findings from the series of data challenges, named DiCOVA (Diagnosis of COVid-19 using Audio), organized using the data collected as part of the project.

Bio: Sriram Ganapathy is an Associate Professor in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights, from 2011 to 2015. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, and his Master of Engineering from the Indian Institute of Science, Bangalore. He also worked as a Research Assistant at the Idiap Research Institute, Switzerland, from 2006 to 2008. At the LEAP lab, his research interests include signal processing, machine learning methodologies for speech and speaker recognition, and auditory neuroscience. Over the past 10 years, he has published more than 120 peer-reviewed journal/conference publications in the areas of deep learning and speech/audio processing. Dr. Ganapathy currently serves as the IEEE SigPort Chief Editor, is a nominated member of the IEEE Education Board, and is a subject editor for the Elsevier Speech Communication journal. He is a recipient of several awards, including the Department of Science and Technology (DST) Early Career Award in India and the Verisk Analytics AI Faculty Award. He is a Senior Member of the IEEE Signal Processing Society and a member of the International Speech Communication Association.
Tutorial Speakers

Speech Emotion Recognition: Challenges, Approaches and Future Directions

Speakers: Longbiao Wang and Xixin Wu

Chair: Meng Sun


Speech emotion recognition (SER) is an important technique for natural, speech-based human-machine interaction. Though the task is quite challenging, SER has attracted much research attention and made great progress over the past decades. Many application scenarios have been explored and have demonstrated the value of the technique. This tutorial will describe the challenges of the SER problem and provide an overview of recent advances in SER, where signal processing and machine learning techniques are fused to obtain significant improvements and shed new insights. Exemplary approaches, including feature extraction, representation learning, and model architectures, will be illustrated in the hope of inspiring future research. Some interesting questions that need further investigation will also be highlighted.
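As a concrete illustration of the feature extraction stage mentioned above, the following is a minimal NumPy sketch of two classic low-level descriptors (frame energy and zero-crossing rate) pooled into utterance-level statistics. The synthetic waveform and function names are hypothetical placeholders, not material from the tutorial itself:

```python
import numpy as np

# Synthetic one-second "utterance" at 16 kHz: a decaying 220 Hz tone
# standing in for real speech (purely illustrative).
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 220 * t) * np.exp(-2 * t)

def frame_features(wave, frame_len=400, hop=160):
    """Per-frame energy and zero-crossing rate (25 ms frames, 10 ms hop)."""
    feats = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = wave[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

feats = frame_features(wave)                              # (n_frames, 2)
# Utterance-level "functionals" (mean and std pooled over frames),
# a common way to summarize frame-level descriptors for a classifier.
stats = np.concatenate([feats.mean(axis=0), feats.std(axis=0)])
```

In a full SER system, such hand-crafted descriptors are either fed to a classifier directly or replaced by learned representations, which is one of the trade-offs the tutorial discusses.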

Longbiao Wang (TJU)
Xixin Wu (CUHK)
Bio: Longbiao Wang received his Dr. Eng. degree from Toyohashi University of Technology, Japan, in 2008. He was an assistant professor in the Faculty of Engineering at Shizuoka University, Japan, from April 2008 to September 2012, and an associate professor at Nagaoka University of Technology, Japan, from October 2012 to August 2016. He is currently a Professor at Tianjin University, China, and a visiting professor at the Japan Advanced Institute of Science and Technology (JAIST), Japan. His research interests include acoustic signal processing, robust speech recognition, speaker recognition, and speech emotion recognition.
Bio: Xixin Wu received his B.S., M.S., and Ph.D. degrees from Beihang University, Tsinghua University, and The Chinese University of Hong Kong (CUHK), respectively. He is currently a Research Assistant Professor with the Stanley Ho Big Data Decision Analytics Research Centre, CUHK. Before this, he worked as a Research Associate with the Machine Intelligence Laboratory, Cambridge University, U.K. In 2020, his team won first place in the INTERSPEECH non-native children’s speech recognition challenge. His team also ranked second in the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT). His research interests include speech synthesis and recognition, speaker verification, speech emotion recognition, and neural network uncertainty.