Speaker: Jenthe Thienpondt, IDLab, Ghent University
Chair: Kong Aik Lee, Institute for Infocomm Research, A*STAR, Singapore
Title: Beyond ECAPA-TDNN: Lessons Learned and Future Trends in Speaker Recognition

Abstract: After the performance leap induced by deep learning models in various scientific fields, the speaker recognition community followed swiftly. Some initially promising results were obtained by adopting convolutional neural networks from the domain of computer vision, but the breakthrough moment came with the introduction of models better tailored to speech processing in the form of time-delay neural networks. In the following years, various extensions and alterations were proposed, resulting in a steady stream of year-over-year performance gains on various speaker recognition benchmarks. Among them, the ECAPA-TDNN model grew to be a popular state-of-the-art speaker recognition architecture and has since been used for other related tasks such as language identification and speaker diarization. In this presentation, we will share lessons learned after the development of the ECAPA-TDNN model, discuss how it compares against other architectures, and assess the current trend towards hybrid networks in the fields of speaker recognition and language identification.

Jenthe Thienpondt is a senior machine learning researcher at IDLab, a research group from imec embedded under Ghent University. His main research interests are speaker recognition and language identification. He finished his Master’s degree in Information Engineering Technology at Ghent University in 2018 with a thesis covering speaker recognition in the speech processing research group at IDLab under the supervision of Prof. Dr. Kris Demuynck. He co-developed the popular ECAPA-TDNN speaker recognition architecture with colleague Brecht Desplanques and successfully competed in various speaker verification challenges such as the VoxCeleb Speaker Recognition Challenge (VoxSRC) and the Short-duration Speaker Verification Challenge (SdSVC) of Interspeech.

Keynote Speaker

Speaker: Tomi Kinnunen, University of Eastern Finland

Chair: Niko Brümmer, Phonexia, South Africa

Title: A decade of experiences in evaluating speaker recognition under attacks

Abstract: Thanks to decades of research and standardized performance evaluation with common data and evaluation metrics, today’s automatic speaker verification (ASV) systems can handle many challenging data conditions robustly. Nonetheless, the high performance builds on an important (though often implicit) assumption of non-adversarial users – an assumption that falls apart when the system is faced with targeted spoofing attacks (such as voice cloning) and other kinds of adversaries. To assess vulnerabilities and to develop countermeasures, we need an extended mindset both in terms of systems and performance evaluation. One important question concerns the joint operation of ASV and anti-spoofing: should we treat them as separate tasks, or should we extend the notion of speaker detection by making our ASV systems aware of attacks? In this talk, I provide a condensed summary of experiences gathered from the first decade of evaluating speaker recognition under different attacks, stemming from both ASVspoof challenges and other selected case studies. My aim is to stimulate further discussion about the future of speech anti-spoofing, whose potential extends from protection of ASV systems to detection of speech deepfakes.

Tomi H. Kinnunen is a full professor of speech technology at the University of Eastern Finland (UEF). He received his Ph.D. degree (computer science) from the University of Joensuu in 2005. From 2005 to 2007, he was an Associate Scientist at the Institute for Infocomm Research (I2R), Singapore. Since 2007, he has been with UEF. From 2010 to 2012, he was funded by a postdoctoral grant from the Academy of Finland. He has been a PI or co-PI in three other large Academy of Finland-funded projects and a partner in the H2020-funded OCTAVE project. He chaired the Odyssey workshop in 2014. From 2015 to 2018, he served as an Associate Editor for IEEE/ACM Trans. on Audio, Speech, and Language Processing and from 2016 to 2018 as a Subject Editor in Speech Communication. In 2015 and 2016, he visited the National Institute of Informatics, Japan, for 6 months under a mobility grant from the Academy of Finland. He is one of the co-founders of the ASVspoof challenge, a nonprofit initiative that seeks to evaluate and improve the security of voice biometric solutions under spoofing attacks.

Keynote Speaker

Speaker: Thomas Fang Zheng, Tsinghua University

Chair: Koichi Shinoda, Tokyo Institute of Technology, Japan

Title: The Trusted Identity Authentication in China and A Brain-inspired Decision-Making Strategy for Fake Speech Detection

Abstract: Trusted identity authentication is very important in cyberspace activities. In this talk, the ongoing plan for trusted identity authentication in China will be introduced. As is well known, biometric recognition technologies, such as fingerprint, face, and voice, can be regarded as a good solution; however, security, privacy protection, real intention, and convenience and cost have become extremely important concerns. Based on an analysis of all these requirements, voice biometrics (voiceprint) is believed to be the best option.
Among those concerns, spoofing is one of the biggest threats to voiceprint identity authentication. Current anti-spoofing methods struggle with both known attacks and unseen types of fake speech data. This talk will introduce a brain-inspired decision-making strategy for fake speech detection, which consists of several neuron-like unnatural-clue detectors and a brain-inspired decision maker. This strategy enables high detection accuracy for different kinds of speech fakes and efficient incremental learning for known attacks.
Dr. Thomas Fang Zheng is a full professor and Director of the Center for Speech and Language Technologies (CSLT), Tsinghua University. His research and development interests are in speech and language processing; he has published over 300 papers and 13 books and holds over 20 patents. He has received several awards, including 1st Prize of the Science and Technology Award (Technical Innovation) of the Chinese Institute of Electronics (2021), 3rd Prize of the Science and Technology Award, Ministry of Public Security of China (2007), and 2nd Prize of the Beijing City Scientific and Technical Progress Award (2001).
He is an IEEE Senior Member, an APSIPA Distinguished Lecturer (2012-2013), APSIPA Vice-President (2013-2014, 2016-2019), CCF (China Computer Federation) Deputy Director of its Speech Dialogue and Auditory Technical Committee, CIPS (Chinese Information Processing Society of China) Executive Council Member and Director of its Speech Information Technical Committee, and founding chair of the steering committee of the National Conference on Man-Machine Speech Communication (NCMMSC) of China. He is or has been an associate editor of IEEE Trans. on ASLP (2012-2015), an editorial board member of Speech Communication (2005-present), an associate editor of APSIPA Transactions on SIP (2011-present), and a series editor of SpringerBriefs in Signal Processing (2014-present).
He served as General Co-Chair of Oriental COCOSDA 2003, General Co-Chair of APSIPA ASC 2011, General Co-Chair of IEEE ChinaSIP 2013, Area Chair of Interspeech 2014, General Co-Chair of ISCSLP 2014, Publication Chair of IEEE ICASSP 2016, General Co-Chair of APSIPA ASC 2019, and General Co-Chair of Interspeech 2020.