Abstract: After the performance leap induced by deep learning models in various scientific fields, the speaker recognition community followed swiftly. Some initially promising results were obtained by adopting convolutional neural networks from the domain of computer vision, but the breakthrough came with the introduction of models better tailored to speech processing, in the form of time-delay neural networks. In the following years, various extensions and alterations were proposed, resulting in a steady stream of year-over-year performance gains on various speaker recognition benchmarks. Among them, the ECAPA-TDNN model grew to be a popular state-of-the-art speaker recognition architecture and has since been used for other related tasks such as language identification and speaker diarization. In this presentation, we will share lessons learned after the development of the ECAPA-TDNN model, discuss how it compares against other architectures, and assess the current trend towards hybrid networks in the fields of speaker recognition and language identification.
Speaker: Tomi Kinnunen, University of Eastern Finland
Chair: Niko Brümmer, Phonexia, South Africa
Title: A decade of experiences in evaluating speaker recognition under attacks
Abstract: Thanks to decades of research and standardized performance evaluation with common data and evaluation metrics, today’s automatic speaker verification (ASV) systems can handle many challenging data conditions robustly. Nonetheless, this high performance builds on an important (though often implicit) assumption of non-adversarial users – an assumption that falls apart when the system is faced with targeted spoofing attacks (such as voice cloning) and other kinds of adversaries. To assess vulnerabilities and to develop countermeasures, we need an extended mindset in terms of both systems and performance evaluation. One important question concerns the joint operation of ASV and anti-spoofing: should we treat them as separate tasks, or should we extend the notion of speaker detection by making our ASV systems aware of attacks? In this talk, I provide a condensed summary of experiences gathered from the first decade of evaluating speaker recognition under different attacks, stemming from both the ASVspoof challenges and other selected case studies. My aim is to stimulate further discussion about the future of speech anti-spoofing, whose potential extends from the protection of ASV systems to the detection of speech deepfakes.
Speaker: Thomas Fang Zheng, Tsinghua University
Chair: Koichi Shinoda, Tokyo Institute of Technology, Japan
Title: The Trusted Identity Authentication in China and A Brain-inspired Decision-Making Strategy for Fake Speech Detection