22nd Interspeech 2021: Brno, Czechia
- Hynek Hermansky, Honza Černocký, Lukáš Burget, Lori Lamel, Odette Scharenborg, Petr Motlíček:
22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021. ISCA 2021
Speech Synthesis: Other Topics
- Michael Pucher, Thomas Woltron:
Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks. 1-5
- Markéta Řezáčková, Jan Švec, Daniel Tihelka:
T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion. 6-10
- Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber:
Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values. 11-15
- Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers:
A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages. 16-20
Disordered Speech
- Tanya Talkar, Nancy Pearl Solomon, Douglas S. Brungart, Stefanie E. Kuchinsky, Megan M. Eitel, Sara M. Lippa, Tracey A. Brickell, Louis M. French, Rael T. Lange, Thomas F. Quatieri:
Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury. 21-25
- Juan Camilo Vásquez-Correa, Julian Fritsch, Juan Rafael Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss:
On Modeling Glottal Source Information for Phonation Assessment in Parkinson's Disease. 26-30
- Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Anne Pavy-Le Traon, Olivier Rascol, Wassilios G. Meissner, Virginie Woisard:
Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson's Disease and Multiple System Atrophy. 31-35
- Pu Wang, Bagher BabaAli, Hugo Van hamme:
A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech. 36-40
- Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Maura Pugliatti, Mariachiara Sensi, Luciano Fadiga, Leonardo Badino:
EasyCall Corpus: A Dysarthric Speech Dataset. 41-45
Speech Signal Analysis and Representation II
- Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda:
A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling. 46-50
- Metehan Yurt, Pavan Kantharaju, Sascha Disch, Andreas Niedermeier, Alberto N. Escalante-B., Veniamin I. Morgenshtern:
Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods. 51-55
- RaviShankar Prasad, Mathew Magimai-Doss:
Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering. 56-60
- Yann Teytaut, Axel Roebel:
Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice. 61-65
Feature, Embedding and Neural Architecture for Speaker Recognition
- Seong-Hu Kim, Yong-Hwa Park:
Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition. 66-70
- Jiajun Qi, Wu Guo, Bin Gu:
Bidirectional Multiscale Feature Aggregation for Speaker Verification. 71-75
- Yu-Jia Zhang, Yih-Wen Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan:
Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods. 76-80
- Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu:
Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification. 81-85
- Tinglong Zhu, Xiaoyi Qin, Ming Li:
Binary Neural Network for Speaker Verification. 86-90
- Youzhi Tu, Man-Wai Mak:
Mutual Information Enhanced Training for Speaker Embedding. 91-95
- Ge Zhu, Fei Jiang, Zhiyao Duan:
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding. 96-100
- Yan Liu, Zheng Li, Lin Li, Qingyang Hong:
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification. 101-105
- Hongning Zhu, Kong Aik Lee, Haizhou Li:
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. 106-110
Speech Synthesis: Toward End-to-End Synthesis II
- Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang:
TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions. 111-115
- Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho:
FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis. 116-120
- Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari:
Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer. 121-125
- Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima:
Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech. 126-130
- Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang:
Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis. 131-135
- Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J. F. Gales:
Deliberation-Based Multi-Pass Speech Synthesis. 136-140
- Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R. J. Skerry-Ryan, Yonghui Wu:
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. 141-145
- Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Köhler, Qing He:
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis. 146-150
- Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu:
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS. 151-155
- Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar:
Speed up Training with Variable Length Inputs by Efficient Batching Strategies. 156-160
Speech Enhancement and Intelligibility
- Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao:
Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement. 161-165
- Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li:
Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement. 166-170
- Changjie Pan, Feng Yang, Fei Chen:
Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences. 171-175
- Ritujoy Biswas, Karan Nathwani, Vinayak Abrol:
Transfer Learning for Speech Intelligibility Improvement in Noisy Environments. 176-180
- Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani:
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility. 181-185
- Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li:
Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement. 186-190
- Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang:
Speech Enhancement with Weakly Labelled Data from AudioSet. 191-195
- Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao:
Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement. 196-200
- Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao:
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. 201-205
- Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty:
A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction. 206-210
- Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou:
Self-Supervised Learning Based Phone-Fortified Speech Enhancement. 211-215
- Khandokar Md. Nayem, Donald S. Williamson:
Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement. 216-220
- Jianwei Zhang, Suren Jayasuriya, Visar Berisha:
Restoring Degraded Speech via a Modified Diffusion Model. 221-225
Spoken Dialogue Systems I
- Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak:
User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems. 226-230
- Nuo Chen, Chenyu You, Yuexian Zou:
Self-Supervised Dialogue Learning for Spoken Conversational Question Answering. 231-235
- Ruolin Su, Ting-Wei Wu, Biing-Hwang Juang:
Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking. 236-240
- Yuya Chiba, Ryuichiro Higashinaka:
Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information. 241-245
- Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose, Akinori Ito:
Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems. 246-250
- Weiyuan Xu, Peilin Zhou, Chenyu You, Yuexian Zou:
Semantic Transportation Prototypical Network for Few-Shot Intent Detection. 251-255
- Li Tang, Yuke Si, Longbiao Wang, Jianwu Dang:
Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios. 256-260
- Haoyu Wang, John Chen, Majid Laali, Kevin Durda, Jeff King, William Campbell, Yang Liu:
Leveraging ASR N-Best in Deep Entity Retrieval. 261-265
Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR
- Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen:
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition. 266-270
- Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig:
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties. 271-275
- Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals:
Speech Acoustic Modelling Using Raw Source and Filter Components. 276-280
- Masakiyo Fujimoto, Hisashi Kawai:
Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture. 281-285
- Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition. 286-290
- Junqi Chen, Xiao-Lei Zhang:
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays. 291-295
- Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo:
Multi-Channel Transformer Transducer for Speech Recognition. 296-300
- Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe:
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios. 301-305
- Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang:
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition. 306-310
- Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve:
Rethinking Evaluation in ASR: Are Our Models Robust Enough? 311-315
- Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu:
Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition. 316-320
Voice Activity Detection and Keyword Spotting
- Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren:
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams. 321-325
- Ui-Hyun Kim:
Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection. 326-330
- Hyun-Jin Park, Pai Zhu, Ignacio López-Moreno, Niranjan Subrahmanya:
Noisy Student-Teacher Training for Robust Keyword Spotting. 331-335
- Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu:
Multi-Channel VAD for Transcription of Group Discussion. 336-340
- Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee:
Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments. 341-345
- Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Enrollment-Less Training for Personalized Voice Activity Detection. 346-350
- Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model. 351-355
- Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo:
FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications. 356-360
- Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park:
End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention. 361-365
- Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak:
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation. 366-370
- Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu:
A Lightweight Framework for Online Voice Activity Detection in the Wild. 371-375
Voice and Voicing
- Aurélie Chlébowski, Nicolas Ballier:
"See what I mean, huh?" Evaluating Visual Inspection of F0 Tracking in Nasal Grunts. 376-380
- Bruce Xiao Wang, Vincent Hughes:
System Performance as a Function of Calibration Methods, Sample Size and Sampling Variability in Likelihood Ratio-Based Forensic Voice Comparison. 381-385
- Anne Bonneau:
Voicing Assimilations by French Speakers of German in Stop-Fricative Sequences. 386-390
- Titas Chakraborty, Vaishali Patil, Preeti Rao:
The Four-Way Classification of Stops with Voicing and Aspiration for Non-Native Speech Evaluation. 391-395
- Saba Urooj, Benazir Mumtaz, Sarmad Hussain, Ehsan ul Haq:
Acoustic and Prosodic Correlates of Emotions in Urdu Speech. 396-400
- Nour Tamim, Silke Hamann:
Voicing Contrasts in the Singleton Stops of Palestinian Arabic: Production and Perception. 401-405
- Thomas Coy, Vincent Hughes, Philip Harrison, Amelia Jane Gully:
A Comparison of the Accuracy of Dissen and Keshet's (2016) DeepFormants and Traditional LPC Methods for Semi-Automatic Speaker Recognition. 406-410
- Michael Jessen:
MAP Adaptation Characteristics in Forensic Long-Term Formant Analysis. 411-415
- Justin J. H. Lo:
Cross-Linguistic Speaker Individuality of Long-Term Formant Distributions: Phonetic and Forensic Perspectives. 416-420
- Rachel Soo, Khia A. Johnson, Molly Babel:
Sound Change in Spontaneous Bilingual Speech: A Corpus Study on the Cantonese n-l Merger in Cantonese-English Bilinguals. 421-425
- Wendy Lalhminghlui, Priyankoo Sarmah:
Characterizing Voiced and Voiceless Nasals in Mizo. 426-430
The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) - COVID-19 Cough, COVID-19 Speech, Escalation & Primates
- Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Léon J. M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp:
The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. 431-435
- Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso:
Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19. 436-440
- Philipp Klumpp, Tobias Bocklet, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Juan Rafael Orozco-Arroyave, Elmar Nöth:
The Phonetic Footprint of Covid-19? 441-445
- Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Junior, Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva:
Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021. 446-450
- Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien:
Visual Transformers for Primates Classification and Covid Detection. 451-455
- Thomas Pellegrini:
Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment. 456-460
- Robert Müller, Steffen Illium, Claudia Linnhoff-Popien:
A Deep and Recurrent Architecture for Primate Vocalization Classification. 461-465
- Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya:
Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification. 466-470
- Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller:
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild. 471-475
- José Vicente Egas López, Mercedes Vetráb, László Tóth, Gábor Gosztolya:
Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features. 476-480
- Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov:
Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech. 481-485
- Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André:
Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification. 486-490
Survey Talk 1: Heidi Christensen
- Heidi Christensen:
Towards Automatic Speech Recognition for People with Atypical Speech.
Embedding and Network Architecture for Speaker Recognition
- Chau Luu, Peter Bell, Steve Renals:
Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization. 491-495
- Magdalena Rybicka, Jesús Villalba, Piotr Żelasko, Najim Dehak, Konrad Kowalczyk:
Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition. 496-500
- Themos Stafylakis, Johan Rohdin, Lukáš Burget:
Speaker Embeddings by Modeling Channel-Wise Correlations. 501-505
- Weipeng He, Petr Motlíček, Jean-Marc Odobez:
Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction. 506-510
- Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao, Lukáš Burget, Jan Černocký:
ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform. 511-515
Speech Perception I
- Xiao Xiao, Nicolas Audibert, Grégoire Locqueville, Christophe d'Alessandro, Barbara Kühnert, Claire Pillot-Loiseau:
Prosodic Disambiguation Using Chironomic Stylization of Intonation with Native and Non-Native Speakers. 516-520
- Aleese Block, Michelle Cohn, Georgia Zellou:
Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-Produced and TTS Voices. 521-525
- Mohammad Jalilpour-Monesi, Bernd Accou, Tom Francart, Hugo Van hamme:
Extracting Different Levels of Speech Information from EEG Using an LSTM-Based Model. 526-530
- Louis ten Bosch, Lou Boves:
Word Competition: An Entropy-Based Approach in the DIANA Model of Human Word Comprehension. 531-535
- Louis ten Bosch, Lou Boves:
Time-to-Event Models for Analyzing Reaction Time Sequences. 536-540
- Sophie Brand, Kimberley Mulder, Louis ten Bosch, Lou Boves:
Models of Reaction Times in Auditory Lexical Decision: RTonset versus RToffset. 541-545
Acoustic Event Detection and Acoustic Scene Classification
- Gwantae Kim, David K. Han, Hanseok Ko:
SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features. 546-550
- Helin Wang, Yuexian Zou, Wenwu Wang:
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification. 551-555
- Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu:
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection. 556-560
- Ritika Nandi, Shashank Shekhar, Manjunath Mulimani:
Acoustic Scene Classification Using Kervolution-Based SubSpectralNet. 561-565
- Harshavardhan Sundar, Ming Sun, Chao Wang:
Event Specific Attention for Polyphonic Sound Event Detection. 566-570
- Yuan Gong, Yu-An Chung, James R. Glass:
AST: Audio Spectrogram Transformer. 571-575
- Soonshin Seo, Donghyun Lee, Ji-Hwan Kim:
Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene. 576-580
- Helen L. Bear, Veronica Morfi, Emmanouil Benetos:
An Evaluation of Data Augmentation Methods for Sound Scene Geotagging. 581-585
- Chiori Hori, Takaaki Hori, Jonathan Le Roux:
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. 586-590
- Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao:
Variational Information Bottleneck for Effective Low-Resource Audio Classification. 591-595
- Soham Deshmukh, Bhiksha Raj, Rita Singh:
Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks. 596-600
- Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi:
Acoustic Event Detection with Classifier Chains. 601-605
Diverse Modes of Speech Acquisition and Processing
- Shu-Chuan Tseng, Yi-Fen Liu:
Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children. 606-610
- Feng Wang, Jing Chen, Fei Chen:
Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing. 611-615
- Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh:
A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping. 616-620
- Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar:
Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results. 621-625
- Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu:
An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech. 626-630
- Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, RADAR-CNS Consortium:
Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder. 631-635
- Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast:
An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables. 636-640
- Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals:
Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video. 641-645
- David Ferreira, Samuel S. Silva, Francisco Curado, António J. S. Teixeira:
RaSSpeR: Radar-Based Silent Speech Recognition. 646-650
- Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang:
Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces. 651-655
Multi-Channel Speech Enhancement and Hearing Aids
- Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Andreas K. Maier:
LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement. 656-660
- Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii:
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation. 661-665
- Siyuan Zhang, Xiaofei Li:
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement. 666-670
- Hyungchan Song, Jong Won Shin:
Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks. 671-675
- Pablo Pérez Zarazaga, Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri:
Cancellation of Local Competing Speaker with Near-Field Localization for Distributed ad-hoc Sensor Network. 676-680
- Hao Zhang, DeLiang Wang:
A Deep Learning Method to Multi-Channel Active Noise Control. 681-685
- Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz:
Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing. 686-690
- Zehai Tu, Ning Ma, Jon Barker:
Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model. 691-695
- Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr:
Explaining Deep Learning Models for Speech Enhancement. 696-700
- Weilong Huang, Jinwei Feng:
Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones. 701-705
Self-Supervision and Semi-Supervision for Neural ASR Training
- Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma:
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning. 706-710
- Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas:
wav2vec-C: A Self-Supervised Model for Speech Representation Learning. 711-715
- Electra Wallington, Benji Kershenbaum, Ondřej Klejch, Peter Bell:
On the Learning Dynamics of Semi-Supervised Training for ASR. 716-720
- Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli:
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. 721-725
- Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori:
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition. 726-730
- Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim:
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models. 731-735
- Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno:
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation. 736-740
- Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert:
slimIPL: Language-Model-Free Iterative Pseudo-Labeling. 741-745
- Xianghu Yue, Haizhou Li:
Phonetically Motivated Self-Supervised Speech Representation Learning. 746-750
- Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He:
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS. 751-755
Spoken Language Processing I
- Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff:
Speaker-Conversation Factorial Designs for Diarization Error Analysis. 756-760
- Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel:
SmallER: Scaling Neural Entity Resolution for Edge Devices. 761-765
- Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling:
Disfluency Detection with Unlabeled Data and Small BERT Models. 766-770
- Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang:
Discriminative Self-Training for Punctuation Prediction. 771-775
- Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens. 776-780
- Binghuai Lin, Liyuan Wang:
A Noise Robust Method for Word-Level Pronunciation Assessment. 781-785
- Jonathan Wintrode:
Targeted Keyword Filtering for Accelerated Spoken Topic Identification. 786-790
- Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze:
Multimodal Speech Summarization Through Semantic Concept Learning. 791-795
- Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon:
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization. 796-800
- Marcin Włodarczak, Emer Gilmartin:
Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish. 801-805
Voice Conversion and Adaptation II
- Samuel J. Broughton, Md. Asif Jalal, Roger K. Moore:
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion. 806-810
- Kun Zhou, Berrak Sisman, Haizhou Li:
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training. 811-815
- Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhen-Hua Ling:
Adversarial Voice Conversion Against Neural Spoofing Detectors. 816-820
- Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller:
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation. 821-825
- Ziyi Chen, Pengyuan Zhang:
TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion. 826-830
- Zhichao Wang, Xinyong Zhou, Fengyu Yang, Tao Li, Hongqiang Du, Lei Xie, Wendong Gan, Haitao Chen, Hai Li:
Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion. 831-835
- Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee:
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations. 836-840
- Christopher Liberatore, Ricardo Gutierrez-Osuna:
An Exemplar Selection Algorithm for Native-Nonnative Voice Conversion. 841-845
- Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng:
Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion. 846-850
- Manh Luong, Viet-Anh Tran:
Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder. 851-855
Privacy-Preserving Machine Learning for Audio & Speech Processing
- Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino, Nicholas W. D. Evans, Melek Önen, Massimiliano Todisco:
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation. 856-860
- Ranya Aloufi, Hamed Haddadi, David Boyle:
Configurable Privacy-Preserving Automatic Speech Recognition. 861-865
- Scott Novotney, Yile Gu, Ivan Bulyko:
Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation. 866-870
- Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh:
Communication-Efficient Agnostic Federated Averaging. 871-875
- Timm Koppelmann, Alexandru Nelus, Lea Schönherr, Dorothea Kolossa, Rainer Martin:
Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification. 876-880
- Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee:
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification. 881-885
- Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang:
Continual Learning for Fake Audio Detection. 886-890
- Muhammad A. Shah, Joseph Szurley, Markus Müller, Athanasios Mouchtaris, Jasha Droppo:
Evaluating the Vulnerability of End-to-End Automatic Speech Recognition Models to Membership Inference Attacks. 891-895
- Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo:
SynthASR: Unlocking Synthetic Data for Speech Recognition. 896-900
The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics
- Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Kumar Sharma, Prashant Krishnan V, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda:
DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics. 901-905
- Madhu R. Kamble, José Andrés González López, Teresa Grau, Juan M. Espín, Lorenzo Cascioli, Yiqing Huang, Alejandro Gómez Alanís, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas W. D. Evans, Maria A. Zuluaga, Massimiliano Todisco:
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge. 906-910
- Vincent Karas, Björn W. Schuller:
Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features. 911-915
- Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács:
Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines. 916-920
- Rohan Kumar Das, Maulik C. Madhavi, Haizhou Li:
Diagnosis of COVID-19 Using Auditory Acoustic Cues. 921-925
- John B. Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David G. Beiser, David Chestek:
Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation. 926-930
- Gauri Deshpande, Björn W. Schuller:
The DiCOVA 2021 Challenge - An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio. 931-935
- Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan:
COVID-19 Detection from Spectral Features on the DiCOVA Dataset. 936-940
- Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller:
Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information. 941-945
- Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu:
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis. 946-950
- Flávio Ávila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh:
Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds. 951-955
Show and Tell 1
- Gábor Kiss, Dávid Sztahó, Miklós Gábriel Tulics:
Application for Detecting Depression, Parkinson's Disease and Dysphonic Speech. 956-957 - Lenka Weingartová, Veronika Volna, Ewa Balejová:
Beey: More Than a Speech-to-Text Editor. 958-959 - Takayuki Arai:
Downsizing of Vocal-Tract Models to Line up Variations and Reduce Manufacturing Costs. 960-961 - Maël Fabien, Shantipriya Parida, Petr Motlícek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen:
ROXANNE Research Platform: Automate Criminal Investigations. 962-964 - Alexandre Flucha, Anthony Larcher, Ambuj Mehrish, Sylvain Meignier, Florian Plaut, Nicolas Poupon, Yevhenii Prokopalo, Adrien Puertolas, Meysam Shamsi, Marie Tahon:
The LIUM Human Active Correction Platform for Speaker Diarization. 965-966 - Yoo Rhee Oh, Kiyoung Park:
On-Device Streaming Transformer-Based End-to-End Speech Recognition. 967-968 - Jaroslav Cmejla, Tomás Kounovský, Jakub Janský, Jirí Málek, M. Rozkovec, Zbynek Koldovský:
Advanced Semi-Blind Speaker Extraction and Tracking Implemented in Experimental Device with Revolving Dense Microphone Array. 969-970
Keynote 1: Hermann Ney
- Hermann Ney:
Forty Years of Speech and Language Processing: From Bayes Decision Rule to Deep Learning.
ASR Technologies and Systems
- Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw. 971-975 - Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Aligned Contrastive Predictive Coding. 976-980 - Benjamin Suter, Josef Novák:
Neural Text Denormalization for Speech Transcripts. 981-985 - Aditya Joglekar, Seyed Omid Sadjadi, Meena Chandra Shekar, Christopher Cieri, John H. L. Hansen:
Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio. 986-990
Phonation and Voicing
- Hannah Leykum:
Voice Quality in Verbal Irony: Electroglottographic Analyses of Ironic Utterances in Standard Austrian German. 991-995 - Mathilde Hutin, Yaru Wu, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker:
Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing. 996-1000 - Ivan Kraljevski, Maria Paola Bissiri, Frank Duckhorn, Constanze Tschöpe, Matthias Wolff:
Glottal Stops in Upper Sorbian: A Data-Driven Approach. 1001-1005 - Bogdan Ludusan, Petra Wagner, Marcin Wlodarczak:
Cue Interaction in the Perception of Prosodic Prominence: The Role of Voice Quality. 1006-1010 - Jenifer Vega Rodríguez, Nathalie Vallée:
Glottal Sounds in Korebaju. 1011-1014 - Anaïs Chanclu, Imen Ben Amor, Cédric Gendrot, Emmanuel Ferragne, Jean-François Bonastre:
Automatic Classification of Phonation Types in Spontaneous Speech: Towards a New Workflow for the Characterization of Speakers' Voice Quality. 1015-1018
Health and Affect I
- Rob J. J. H. van Son:
Measuring Voice Quality Parameters After Speaker Pseudonymization. 1019-1023 - Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz:
Audio-Visual Recognition of Emotional Engagement of People with Dementia. 1024-1028 - Pascal Hecker, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Uwe Reichel, Zhao Ren, Simone Hantke, Florian Eyben, Dagmar M. Schuller, Bert Arnrich, Björn W. Schuller:
Speaking Corona? Human and Machine Recognition of COVID-19 from Voice. 1029-1033 - Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, Julia Hirschberg:
Acoustic-Prosodic, Lexical and Demographic Cues to Persuasiveness in Competitive Debate Speeches. 1034-1038
Robust Speaker Recognition
- Bengt J. Borgström:
Unsupervised Bayesian Adaptation of PLDA for Speaker Verification. 1039-1043 - Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li:
The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III. 1044-1048 - Yafeng Chen, Wu Guo, Bin Gu:
Improved Meta-Learning Training for Speaker Verification. 1049-1053 - Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong:
Variational Information Bottleneck Based Regularization for Speaker Recognition. 1054-1058 - Niko Brümmer, Luciana Ferrer, Albert Swart:
Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make? 1059-1063 - Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio López-Moreno:
SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System. 1064-1068 - Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu:
AntVoice Neural Speaker Embedding System for FFSVC 2020. 1069-1073 - Jianchen Li, Jiqing Han, Hongwei Song:
Gradient Regularization for Noise-Robust Speaker Verification. 1074-1078 - Saurabh Kataria, Jesús Villalba, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak:
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification. 1079-1083 - Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo:
Scaling Effect of Self-Supervised Speech Models. 1084-1088 - Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang:
Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network. 1089-1093 - Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li:
Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification. 1094-1098 - Jose Patino, Natalia A. Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans:
Speaker Anonymisation Using the McAdams Coefficient. 1099-1103
Source Separation, Dereverberation and Echo Cancellation
- Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang:
Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments. 1104-1108 - Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu:
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation. 1109-1113 - Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan:
Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function. 1114-1118 - Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu:
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation. 1119-1123 - Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy:
Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement. 1124-1128 - Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot:
Scene-Agnostic Multi-Microphone Speech Dereverberation. 1129-1133 - Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi:
Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex. 1134-1138 - Hao Zhang, DeLiang Wang:
A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation. 1139-1143 - Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu:
Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation. 1144-1148 - Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo:
Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition. 1149-1153
Speech Signal Analysis and Representation I
- Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Estimating Articulatory Movements in Speech Production with Transformer Networks. 1154-1158 - Dongchao Yang, Helin Wang, Yuexian Zou:
Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification. 1159-1163 - Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen:
Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation. 1164-1168 - Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao:
Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation. 1169-1173 - Chiranjeevi Yarra, Prasanta Kumar Ghosh:
Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion. 1174-1178 - Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee:
An Attribute-Aligned Strategy for Learning Speech Representation. 1179-1183 - Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen:
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. 1184-1188 - Jason Lilley, H. Timothy Bunnell:
Unsupervised Training of a DNN-Based Formant Tracker. 1189-1193 - Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee:
SUPERB: Speech Processing Universal PERformance Benchmark. 1194-1198 - Cong Zhang, Jian Zhu:
Synchronising Speech Segments with Musical Beats in Mandarin and English Singing. 1199-1203 - Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak N. Patel:
FRILL: A Non-Semantic Speech Embedding for Mobile Devices. 1204-1208 - Hiroki Mori:
Pitch Contour Separation from Overlapping Speech. 1209-1213 - Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen:
Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning. 1214-1218
Spoken Language Understanding I
- Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao:
Data Augmentation for Spoken Language Understanding via Pretrained Language Models. 1219-1223 - Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow:
FANS: Fusing ASR and NLU for On-Device SLU. 1224-1228 - Yiran Cao, Nihal Potdar, Anderson R. Avila:
Sequential End-to-End Intent and Slot Label Classification and Localization. 1229-1233 - Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason D. Williams, Alex Acero:
DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants. 1234-1238 - Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang:
A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection. 1239-1243 - Qian Chen, Wen Wang, Qinglin Zhang:
Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning. 1244-1248 - Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen:
Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models. 1249-1253 - Jatin Ganhotra, Samuel Thomas, Hong-Kwang Jeff Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury:
Integrating Dialog History into End-to-End Spoken Language Understanding Systems. 1254-1258 - Ting Han, Chongxuan Huang, Wei Peng:
Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking. 1259-1263 - Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black:
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding. 1264-1268
Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings
- Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li:
Semantic Data Augmentation for End-to-End Mandarin Speech Recognition. 1269-1273 - Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian:
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition. 1274-1278 - Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan:
Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children - INTERSPEECH 2021 Shared Task SPAPL System. 1279-1283 - Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays:
Robust Continuous On-Device Personalization for Automatic Speech Recognition. 1284-1288 - Shashi Kumar, Shakti P. Rath, Abhishek Pandey:
Speaker Normalization Using Joint Variational Autoencoder. 1289-1293 - Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu:
The TAL System for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children's Speech. 1294-1298 - Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler:
On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR. 1299-1303 - Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson:
Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding. 1304-1308 - Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong:
Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need. 1309-1313 - Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau:
Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning. 1314-1318 - Wei Chu, Peng Chang, Jing Xiao:
Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII's System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge. 1319-1323
Voice Conversion and Adaptation I
- Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao:
CVC: Contrastive Learning for Non-Parallel Voice Conversion. 1324-1328 - Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda:
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion. 1329-1333 - Sefik Emre Eskimez, Dimitrios Dimitriadis, Ken'ichi Kumatani, Robert Gmyr:
One-Shot Voice Conversion with Speaker-Agnostic StarGAN. 1334-1338 - Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada:
Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data. 1339-1343 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. 1344-1348 - Yinghao Aaron Li, Ali Zare, Nima Mesgarani:
StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion. 1349-1353 - Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall:
Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis. 1354-1358 - Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka:
StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition. 1359-1363 - Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock:
Two-Pathway Style Embedding for Arbitrary Voice Conversion. 1364-1368 - Yufei Liu, Chengzhu Yu, Shuai Wang, Zhenchuan Yang, Yang Chao, Weibin Zhang:
Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics. 1369-1373 - Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li:
Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation. 1374-1378 - Hongqiang Du, Lei Xie:
Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder. 1379-1383
Voice Quality Characterization for Clinical Voice Assessment: Voice Production, Acoustics, and Auditory Perception
- Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox:
Optimizing an Automatic Creaky Voice Detection Method for Australian English Speaking Females. 1384-1388 - Joshua Penney, Andy Gibson, Felicity Cox, Michael I. Proctor, Anita Szakay:
A Comparison of Acoustic Correlates of Voice Quality Across Different Recording Devices: A Cautionary Tale. 1389-1393 - Anna Sfakianaki, George P. Kafentzis:
Investigating Voice Function Characteristics of Greek Speakers with Hearing Loss Using Automatic Glottal Source Feature Extraction. 1394-1398 - Mark A. Huckvale, Catinca Buciuleac:
Automated Detection of Voice Disorder in the Saarbrücken Voice Database: Effects of Pathology Subset and Audio Materials. 1399-1403 - Steven M. Lulich, Rita R. Patel:
Accelerometer-Based Measurements of Voice Quality in Children During Semi-Occluded Vocal Tract Exercise with a Narrow Straw in Air. 1404-1408 - Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost:
Articulatory Coordination for Speech Motor Tracking in Huntington Disease. 1409-1413 - Carlos A. Ferrer, Efren Aragón, María E. Hdez-Díaz, Marc S. De Bodt, Roman Cmejla, Marina Englert, Mara Behlau, Elmar Nöth:
Modeling Dysphonia Severity as a Function of Roughness and Breathiness Ratings in the GRBAS Scale. 1414-1418
Miscellaneous Topics in ASR
- Nikolay Karpov, Alexander Denisenko, Fedor Minkin:
Golos: Russian Dataset for Speech Research. 1419-1423 - Samik Sadhu, Hynek Hermansky:
Radically Old Way of Computing Spectra: Applications in End-to-End ASR. 1424-1428 - Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo:
Self-Supervised End-to-End ASR for Low Resource L2 Swedish. 1429-1433 - Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko:
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition. 1434-1438 - Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia A. Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier:
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. 1439-1443
Phonetics I
- Pavel Sturm, Radek Skarnitzl, Tomás Nechanský:
Prosodic Accommodation in Face-to-Face and Telephone Dialogues. 1444-1448 - Josiane Riverin-Coutlée, Conceição Cunha, Enkeleida Kapia, Jonathan Harrington:
Dialect Features in Heterogeneous and Homogeneous Gheg Speaking Communities. 1449-1453 - Margaret Zellers, Alena Witzlack-Makarevich, Lilja Saeboe, Saudah Namyalo:
An Exploration of the Acoustic Space of Rhotics and Laterals in Ruruuli. 1454-1458 - Kübra Bodur, Sweeney Branje, Morgane Peirolo, Ingrid Tiscareno, James Sneed German:
Domain-Initial Strengthening in Turkish: Acoustic Cues to Prosodic Hierarchy in Stop Consonants. 1459-1463
Target Speaker Detection, Localization and Separation
- Katerina Zmolíková, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Cernocký:
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics. 1464-1468 - Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz:
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers. 1469-1473 - Lukás Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdánský, Jirí Málek:
Using X-Vectors for Speech Activity Detection in Broadcast Streams. 1474-1478 - Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features. 1479-1483 - Midia Yousefi, John H. L. Hansen:
Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network. 1484-1488
Language and Accent Recognition
- Hexin Liu, Leibny Paola García-Perera, Xinyi Zhang, Justin Dauwels, Andy W. H. Khong, Sanjeev Khudanpur, Suzy J. Styles:
End-to-End Language Diarization for Bilingual Code-Switching Speech. 1489-1493 - Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina:
Modeling and Training Strategies for Language Recognition Systems. 1494-1498 - Hui Wang, Lin Liu, Yan Song, Lei Fang, Ian McLoughlin, Li-Rong Dai:
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification. 1499-1503 - Keqi Deng, Songjun Cao, Long Ma:
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning. 1504-1508 - Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu:
Exploring wav2vec 2.0 on Speaker Verification and Language Identification. 1509-1513 - Gundluru Ramesh, C. Shiva Kumar, K. Sri Rama Murty:
Self-Supervised Phonotactic Representations for Language Identification. 1514-1518 - Jicheng Zhang, Yizhou Peng, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng:
E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition. 1519-1523 - Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S. R. Mahadeva Prasanna:
Excitation Source Feature Based Dialect Identification in Ao - A Low Resource Language. 1524-1528
Low-Resource Speech Recognition
- Shreya Khare, Ashish R. Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj:
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration. 1529-1533 - Siyuan Feng, Piotr Zelasko, Laureano Moro-Velázquez, Odette Scharenborg:
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation. 1534-1538 - Herman Kamper, Benjamin van Niekerk:
Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks. 1539-1543 - Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li:
Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning. 1544-1548 - Christiaan Jacobs, Herman Kamper:
Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language. 1549-1553 - Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper:
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing. 1554-1558 - Shun Takahashi, Sakriani Sakti, Satoshi Nakamura:
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages. 1559-1563 - Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander I. Rudnicky:
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021. 1564-1568 - Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu:
Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features. 1569-1573 - Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux:
The Zero Resource Speech Challenge 2021: Spoken Language Modelling. 1574-1578 - Gautham Krishna Gudur, Satheesh Kumar Perepu:
Zero-Shot Federated Learning with New Classes for Audio Classification. 1579-1583 - Andrew Rouditchenko, Angie W. Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogério Schmidt Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James R. Glass:
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. 1584-1588
Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis
- Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho:
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement. 1589-1593 - Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis:
Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features. 1594-1598 - Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin:
Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information. 1599-1603 - Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing:
Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations. 1604-1608 - Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder. 1609-1613 - Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari:
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis. 1614-1618 - Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan:
Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech. 1619-1623 - Ege Kesim, Engin Erzin:
Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation. 1624-1628 - Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao:
Speech2Video: Cross-Modal Distillation for Speech to Video Generation. 1629-1633
Speech Coding and Privacy
- Junhyeok Lee, Seungu Han:
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. 1634-1638 - Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu:
QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization. 1639-1643 - Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi:
X-net: A Joint Scale Down and Scale Up Method for Voice Call. 1644-1648 - Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao:
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution. 1649-1653 - Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu:
Half-Truth: A Partially Fake Audio Detection Dataset. 1654-1658 - Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen:
Data Quality as Predictor of Voice Anti-Spoofing Generalization. 1659-1663 - Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin:
Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features. 1664-1668 - Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin:
Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget. 1669-1673 - Ingo Siegert:
Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant. 1674-1678 - Adam Gabrys, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote:
Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows. 1679-1683 - Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil:
Voice Privacy Through x-Vector and CycleGAN-Based Anonymization. 1684-1688 - Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen:
A Two-Stage Approach to Speech Bandwidth Extension. 1689-1693 - Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack:
Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder. 1694-1698 - Dimitrios Stoidis, Andrea Cavallaro:
Protecting Gender and Identity with Disentangled Speech Representations. 1699-1703
Speech Perception II
- Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani:
Perception of Standard Arabic Synthetic Speech Rate. 1704-1707 - Takeshi Kishiyama:
The Influence of Parallel Processing on Illusory Vowels. 1708-1712 - Anupama Chingacham, Vera Demberg, Dietrich Klakow:
Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors. 1713-1717 - Olympia Simantiraki, Martin Cooke:
SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility. 1718-1722 - Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa:
VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification. 1723-1727 - Min Xu, Jing Shao, Lan Wang:
Effects of Aging and Age-Related Hearing Loss on Talker Discrimination. 1728-1732 - Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang:
Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication. 1733-1737 - Camryn Terblanche, Philip Harrison, Amelia Jane Gully:
Human Spoofing Detection Performance on Degraded Speech. 1738-1742 - Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun:
Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research. 1743-1747 - Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman:
Towards the Explainability of Multimodal Speech Emotion Recognition. 1748-1752 - Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel:
Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels. 1753-1756 - Takanori Ashihara, Takafumi Moriya, Makio Kashino:
Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance. 1757-1761
Streaming for ASR/RNN Transducers
- Thai-Son Nguyen, Sebastian Stüker, Alex Waibel:
Super-Human Performance in Online Low-Latency Recognition of Conversational Speech. 1762-1766 - Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong:
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems. 1767-1771 - Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer:
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion. 1772-1776 - Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon:
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling. 1777-1781 - Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong:
Streaming Multi-Talker Speech Recognition with Joint Speaker Identification. 1782-1786 - Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami:
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture. 1787-1791 - Andreas Schwarz, Ilya Sklyar, Simon Wiesler:
Improving RNN-T ASR Accuracy Using Context Audio. 1792-1796 - Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma:
HMM-Free Encoder Pre-Training for Streaming RNN Transducer. 1797-1801 - Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske:
Reducing Exposure Bias in Training Recurrent Neural Network Transducers. 1802-1806 - Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao:
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models. 1807-1811 - Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno:
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition. 1812-1816 - Hirofumi Inaguma, Tatsuya Kawahara:
StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR. 1817-1821 - Niko Moritz, Takaaki Hori, Jonathan Le Roux:
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition. 1822-1826 - Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu Jeong Han, Shinji Watanabe:
Multi-Mode Transformer Transducer with Stochastic Future Context. 1827-1831
ConferencingSpeech 2021 Challenge: Far-Field Multi-Channel Speech Enhancement for Video Conferencing
- Xinlei Ren, Xu Zhang, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement. 1832-1836 - Rui Zhu, Feiran Yang, Yuepeng Li, Shidong Shang:
A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation. 1837-1841 - Taihui Wang, Feiran Yang, Rui Zhu, Jun Yang:
Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model. 1842-1846 - Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long:
Improving Channel Decorrelation for Multi-Channel Target Speech Extraction. 1847-1851 - Jinjiang Liu, Xueliang Zhang:
Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement. 1852-1856 - R. G. Prithvi Raj, Rohit Kumar, M. K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M. Ali Basha Shaik:
SRIB-LEAP Submission to Far-Field Multi-Channel Speech Enhancement Challenge for Video Conferencing. 1857-1861 - Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng:
Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model. 1862-1866
Survey Talk 2: Sriram Ganapathy
- Sriram Ganapathy:
Uncovering the Acoustic Cues of COVID-19 Infection.
Keynote 2: Pascale Fung
- Pascale Fung:
Ethical and Technological Challenges of Conversational AI.
Language Modeling and Text-Based Innovations for ASR
- Dominique Fohr, Irina Illina:
BERT-Based Semantic Model for Rescoring N-Best Speech Recognition List. 1867-1871 - Karel Benes, Lukás Burget:
Text Augmentation for Language Models in High Error Recognition Scenario. 1872-1876 - Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney:
On Sampling-Based Training Criteria for Neural Language Modeling. 1877-1881 - Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo:
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network. 1882-1886
Speaker, Language, and Privacy
- Christopher Cieri, James Fiumara, Jonathan Wright:
Using Games to Augment Corpora for Language Recognition and Confusability. 1887-1891 - Gianni Fenu, Mirko Marras, Giacomo Medda, Giacomo Meloni:
Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition. 1892-1896 - Leying Zhang, Zhengyang Chen, Yanmin Qian:
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification. 1897-1901 - Paul-Gauthier Noé, Mohammad MohammadAmini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre:
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation. 1902-1906
Assessment of Pathological Speech and Language I
- Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost:
Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson's Disease. 1907-1911 - Robin Vaysse, Jérôme Farinas, Corine Astésano, Régine André-Obrecht:
Automatic Extraction of Speech Rhythm Descriptors for Speech Intelligibility Assessment in the Context of Head and Neck Cancers. 1912-1916 - Jinzi Qi, Hugo Van hamme:
Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-Encoders. 1917-1921 - Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha:
The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation. 1922-1926 - Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlícek, Mathew Magimai-Doss:
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition. 1927-1931 - Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó:
Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces. 1932-1936
Communication and Interaction, Multimodality
- Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan:
Cross-Modal Learning for Audio-Visual Video Parsing. 1937-1941 - Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison:
A Psychology-Driven Computational Analysis of Political Interviews. 1942-1946 - Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura:
Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure. 1947-1951 - Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna:
Effects of Voice Type and Task on L2 Learners' Awareness of Pronunciation Errors. 1952-1956 - Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia:
Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues. 1957-1961 - Shamila Nasreen, Julian Hough, Matthew Purver:
Detecting Alzheimer's Disease Using Interactional and Acoustic Features from Spontaneous Speech. 1962-1966 - Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos:
Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent. 1967-1971 - Carlos Toshinori Ishi, Taiken Shintani:
Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations. 1972-1976
Language and Lexical Modeling for ASR
- Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding. 1977-1981 - Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li:
A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems. 1982-1986 - Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin:
Incorporating External POS Tagger for Punctuation Restoration. 1987-1991 - Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo:
Phonetically Induced Subwords for End-to-End Speech Recognition. 1992-1996 - Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf:
Revisiting Parity of Human vs. Machine Conversational Speech Transcription. 1997-2001 - W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman:
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition. 2002-2006 - Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila:
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems. 2007-2011 - Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu:
Token-Level Supervised Contrastive Learning for Punctuation Restoration. 2012-2016 - Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou:
BART Based Semantic Correction for Mandarin Automatic Speech Recognition System. 2017-2021 - Lingfeng Dai, Qi Liu, Kai Yu:
Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR. 2022-2026 - Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske:
Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio. 2027-2031 - Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel:
A Discriminative Entity-Aware Language Model for Virtual Assistants. 2032-2036 - Mahdi Namazifar, John Malik, Li Erran Li, Gökhan Tür, Dilek Hakkani-Tür:
Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models. 2037-2041
Novel Neural Network Architectures for ASR
- Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency. 2042-2046 - Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen:
Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation. 2047-2051 - Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney:
Librispeech Transducer Model with Internal Language Model Prior Correction. 2052-2056 - Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu:
A Deliberation-Based Joint Acoustic and Text Decoder. 2057-2061 - Zoltán Tüske, George Saon, Brian Kingsbury:
On the Limit of English Conversational Speech Recognition. 2062-2066 - Keyu An, Yi Zhang, Zhijian Ou:
Deformable TDNN with Adaptive Receptive Fields for Speech Recognition. 2067-2071 - Zhao You, Shulin Feng, Dan Su, Dong Yu:
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts. 2077-2081 - Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien:
Online Compressive Transformer for End-to-End Speech Recognition. 2082-2086 - Binghuai Lin, Liyuan Wang:
End to End Transformer-Based Contextual Speech Recognition Based on Pointer Network. 2087-2091 - Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones:
A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition. 2092-2096 - Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux:
Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers. 2097-2101 - Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh:
Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation. 2102-2106 - Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer:
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios. 2107-2111
Speech Localization, Enhancement, and Quality Assessment
- Przemyslaw Falkowski-Gilski:
Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students. 2112-2116 - Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa:
PILOT: Introducing Transformers for Probabilistic Sound Event Localization. 2117-2121 - Masahito Togami, Robin Scheibler:
Sound Source Localization with Majorization Minimization. 2122-2126 - Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller:
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. 2127-2131 - Babak Naderi, Ross Cutler:
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing. 2132-2136 - Jianhua Geng, Sifan Wang, Juan Li, Jingwei Li, Xin Lou:
Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor. 2137-2141 - Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu:
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment. 2142-2146 - Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs. 2147-2151 - Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai:
Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization. 2152-2156 - Rongliang Liu, Nengheng Zheng, Xi Chen:
Feature Fusion by Attention Networks for Robust DOA Estimation. 2157-2161 - Shoufeng Lin, Zhaojie Luo:
Far-Field Speaker Localization and Adaptive GLMB Tracking. 2162-2166 - Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias:
On the Design of Deep Priors for Unsupervised Audio Restoration. 2167-2171 - Weiguang Chen, Cheng Xue, Xionghu Zhong:
Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments. 2172-2176
Speech Synthesis: Neural Waveform Generation
- Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae:
GAN Vocoder: Multi-Resolution Discriminator Is All You Need. 2177-2181 - Jian Cong, Shan Yang, Lei Xie, Dan Su:
Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis. 2182-2186 - Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda:
Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN. 2187-2191 - Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari:
Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator. 2192-2196 - Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee:
Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis. 2197-2201 - Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho:
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. 2202-2206 - Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim:
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. 2207-2211 - Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh:
Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis. 2212-2216 - Patrick Lumban Tobing, Tomoki Toda:
High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling. 2217-2221 - Zhengxi Liu, Yanmin Qian:
Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition. 2222-2226 - Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim:
High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model. 2227-2231
Spoken Machine Translation
- Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang:
SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction. 2232-2236 - Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun:
Subtitle Translation as Markup Translation. 2237-2241 - Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau:
Large-Scale Self- and Semi-Supervised Learning for Speech Translation. 2242-2246 - Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino:
CoVoST 2 and Massively Multilingual Speech Translation. 2247-2251 - Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang:
AlloST: Low-Resource Speech Translation Without Source Transcription. 2252-2256 - Johanes Effendi, Sakriani Sakti, Satoshi Nakamura:
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer. 2257-2261 - Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura:
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation. 2262-2266 - Rong Ye, Mingxuan Wang, Lei Li:
End-to-End Speech Translation via Cross-Modal Progressive Training. 2267-2271 - Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura:
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation. 2272-2276 - Alejandro Pérez González de Martos, Javier Iranzo-Sánchez, Adrià Giménez-Pastor, Javier Jorge, Joan Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchís, Alfons Juan:
Towards Simultaneous Machine Interpretation. 2277-2281 - Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi:
Lexical Modeling of ASR Errors for Robust Speech Translation. 2282-2286 - Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson:
Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation. 2287-2291 - Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu:
Effects of Feature Scaling and Fusion on Sign Language Translation. 2292-2296
SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification
- Alexander Alenin, Anton Okhotnikov, Rostislav Makarov, Nikita Torgashov, Ilya Shigabeev, Konstantin Simonchik:
The ID R&D System Description for Short-Duration Speaker Verification Challenge 2021. 2297-2301 - Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck:
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification. 2302-2306 - Aleksei Gusev, Alisa Vinogradova, Sergey Novoselov, Sergei Astapov:
SdSVC Challenge 2021: Tips and Tricks to Boost the Short-Duration Speaker Verification System Performance. 2307-2311 - Woo Hyun Kang, Nam Soo Kim:
Team02 Text-Independent Speaker Verification System for SdSV Challenge 2021. 2312-2316 - Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, Ming Li:
Our Learned Lessons from Cross-Lingual Speaker Verification: The CRMI-DKU System Description for the Short-Duration Speaker Verification Challenge 2021. 2317-2321 - Peng Zhang, Peng Hu, Xueliang Zhang:
Investigation of IMU&Elevoc Submission for the Short-Duration Speaker Verification Challenge 2021. 2322-2326 - Jie Yan, Shengyu Yao, Yiqian Pan, Wei Chen:
The Sogou System for Short-Duration Speaker Verification Challenge 2021. 2327-2331 - Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian:
The SJTU System for Short-Duration Speaker Verification Challenge 2021. 2332-2336
Show and Tell 2
- Sungjae Cho, Soo-Young Lee:
Multi-Speaker Emotional Text-to-Speech Synthesizer. 2337-2338 - Ales Prazák, Zdenek Loose, Josef V. Psutka, Vlasta Radová, Josef Psutka, Jan Svec:
Live TV Subtitling Through Respeaking. 2339-2340 - Stefan Fragner, Tobias Topar, Maximilian Giller, Lukas Pfeifenberger, Franz Pernkopf:
Autonomous Robot for Measuring Room Impulse Responses. 2341-2342 - Jonas Beskow, Charlie Caper, Johan Ehrenfors, Nils Hagberg, Anne Jansen, Chris Wood:
Expressive Robot Performance Based on Facial Motion Capture. 2343-2344 - Mónica Domínguez, Juan Soler Company, Leo Wanner:
ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction. 2345-2346 - Sai Guruju, Jithendra Vepa:
Addressing Compliance in Call Centers with Entity Extraction. 2347-2348 - Krishnachaitanya Gogineni, Tarun Reddy Yadama, Jithendra Vepa:
Audio Segmentation Based Conversational Silence Detection for Contact Center Calls. 2349-2350
Graph and End-to-End Learning for Speaker Recognition
- Desh Raj, Sanjeev Khudanpur:
Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem. 2351-2355 - Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas W. D. Evans:
Graph Attention Networks for Anti-Spoofing. 2356-2360 - Victoria Mingote, Antonio Miguel, Alfonso Ortega Giménez, Eduardo Lleida:
Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems. 2361-2365 - Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao, Lukás Burget, Jan Cernocký:
Effective Phase Encoding for End-To-End Speaker Verification. 2366-2370
Spoken Language Processing II
- Ha Nguyen, Yannick Estève, Laurent Besacier:
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. 2371-2375 - Dominik Machácek, Matús Zilinec, Ondrej Bojar:
Lost in Interpreting: Speech Translation from Source or Interpreter? 2376-2380 - Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frédéric Precioso:
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion. 2381-2385 - Sarenne Wallbridge, Peter Bell, Catherine Lai:
It's Not What You Said, it's How You Said it: Discriminative Perception of Speech as a Multichannel Communication System. 2386-2390
Speech and Audio Analysis
- Thilo Michael, Gabriel Mittag, Andreas Bütow, Sebastian Möller:
Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations. 2391-2395 - Christian Bergler, Manuel Schmitt, Andreas K. Maier, Helena Symonds, Paul Spong, Steven R. Ness, George Tzanetakis, Elmar Nöth:
ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification. 2396-2400 - Wim Boes, Hugo Van hamme:
Audiovisual Transfer Learning for Audio Tagging and Sound Event Detection. 2401-2405 - Natalia Nessler, Milos Cernak, Paolo Prandoni, Pablo Mainar:
Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling. 2406-2410 - Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie:
Audio Retrieval with Natural Language Queries. 2411-2415
Cross/Multi-Lingual and Code-Switched ASR
- Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett:
Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio. 2416-2420 - Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel:
Efficient Weight Factorization for Multilingual Speech Recognition. 2421-2425 - Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli:
Unsupervised Cross-Lingual Representation Learning for Speech Recognition. 2426-2430 - Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition. 2431-2435 - Krishna D. N, Pinyi Wang, Bruno Bozza:
Using Large Self-Supervised Models for Low-Resource Speech Recognition. 2436-2440 - Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy:
Dual Script E2E Framework for Multilingual and Code-Switching ASR. 2441-2445 - Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan K. M., Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish R. Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan:
MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2446-2450 - Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven C. H. Hoi:
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition. 2451-2455 - Hardik B. Sailor, Kiran Praveen, Vikas Agrawal, Abhinav Jain, Abhishek Pandey:
SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2456-2460 - Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black:
Hierarchical Phone Recognition with Compositional Phonetics. 2461-2465 - Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali:
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR. 2466-2470 - Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe:
Differentiable Allophone Graphs for Language-Universal Speech Recognition. 2471-2475
Health and Affect II
- Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip:
Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice. 2476-2480 - Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman:
Robust Laughter Detection in Noisy Environments. 2481-2485 - Mizuki Nagano, Yusuke Ijima, Sadao Hiroya:
Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech. 2486-2490 - Huda Alsofyani, Alessandro Vinciarelli:
Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children. 2491-2495 - Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli:
Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units. 2496-2500 - Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi:
Emotion Carrier Recognition from Personal Narratives. 2501-2505 - Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz:
Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training. 2506-2510 - Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu:
TDCA-Net: Time-Domain Channel Attention Network for Depression Detection. 2511-2515 - Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso:
Visual Speech for Obstructive Sleep Apnea Detection. 2516-2520 - Héctor A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyüz, Lama Nachman:
Analysis of Contextual Voice Changes in Remote Meetings. 2521-2525 - Nadee Seneviratne, Carol Y. Espy-Wilson:
Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model. 2526-2530
Neural Network Training Methods for ASR
- Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang:
Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models. 2531-2535 - Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow:
Learning a Neural Diff for Speech Models. 2536-2540 - Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals:
Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models. 2541-2545 - Jiabin Xue, Tieran Zheng, Jiqing Han:
Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training. 2546-2550 - Heng-Jui Chang, Hung-yi Lee, Lin-Shan Lee:
Towards Lifelong Learning of End-to-End ASR. 2551-2555 - Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu:
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence. 2556-2560 - Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran:
Regularizing Word Segmentation by Creating Misspellings. 2561-2565 - Peidong Wang, Tara N. Sainath, Ron J. Weiss:
Multitask Training with Text Data for End-to-End Speech Recognition. 2566-2570 - Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie:
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition. 2571-2575 - Jasha Droppo, Oguz Elibol:
Scaling Laws for Acoustic Models. 2576-2580 - Jayadev Billa:
Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language. 2581-2585 - Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan:
4-Bit Quantization of LSTM-Based Speech Recognition Models. 2586-2590 - Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi:
Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation. 2591-2595 - Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong:
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition. 2596-2600 - Dongcheng Jiang, Chao Zhang, Philip C. Woodland:
Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning. 2601-2605
Prosodic Features and Structure
- Constantijn Kaland, Matthew Gordon:
How f0 and Phrase Position Affect Papuan Malay Word Identification. 2606-2610 - Anna Bothe Jespersen, Pavel Sturm, Mísa Hejná:
On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish. 2611-2615 - Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson:
An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus. 2616-2620 - Branislav Gerazov, Michael Wagner:
ProsoBeast Prosody Annotation Tool. 2621-2625 - Trang Tran, Mari Ostendorf:
Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts. 2626-2630 - Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-Chin Chang:
Targeted and Targetless Neutral Tones in Taiwanese Southern Min. 2631-2635 - Mária Gósy, Kálmán Abari:
The Interaction of Word Complexity and Word Duration in an Agglutinative Language. 2636-2640 - Ho-hsien Pan, Shao-Ren Lyu:
Taiwan Min Nan (Taiwanese) Checked Tones Sound Change. 2641-2645 - Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter:
In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German. 2646-2650 - Christer Gobl:
The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion. 2651-2655 - Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang:
Parsing Speech for Grouping and Prominence, and the Typology of Rhythm. 2656-2660 - Benazir Mumtaz, Massimiliano Canzi, Miriam Butt:
Prosody of Case Markers in Urdu. 2661-2665 - Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen:
Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects. 2666-2670 - Khia A. Johnson:
Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech. 2671-2675
Single-Channel Speech Enhancement
- Aswin Sivaraman, Sunwoo Kim, Minje Kim:
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification. 2676-2680 - Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott:
Speech Denoising with Auditory Models. 2681-2685 - Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka:
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement. 2686-2690 - Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen:
Multi-Stage Progressive Speech Enhancement Network. 2691-2695 - Oscar Chang, Dung N. Tran, Kazuhito Koishida:
Single-Channel Speech Enhancement Using Learnable Loss Mixup. 2696-2700 - Xiaoqi Zhang, Jun Du, Li Chai, Chin-Hui Lee:
A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement. 2701-2705 - Vikas Agrawal, Shashi Kumar, Shakti P. Rath:
Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition. 2706-2710 - Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi:
DEMUCS-Mobile : On-Device Lightweight Speech Enhancement. 2711-2715 - Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan:
Speech Denoising Without Clean Training Data: A Noise2Noise Approach. 2716-2720 - Feng Dang, Pengyuan Zhang, Hangting Chen:
Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints. 2721-2725 - Xudong Zhang, Liang Zhao, Feng Gu:
Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs). 2726-2730 - Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han:
Learning Speech Structure to Improve Time-Frequency Masks. 2731-2735 - Eesung Kim, Hyeji Seo:
SE-Conformer: Time-Domain Speech Enhancement Using Conformer. 2736-2740
Speech Synthesis: Tools, Data, Evaluation
- Thananchai Kongthaworn, Burin Naowarat, Ekapol Chuangsuwanich:
Spectral and Latent Speech Representation Distortion for TTS Evaluation. 2741-2745 - Cassia Valentini-Botinhao, Simon King:
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech. 2746-2750 - Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian:
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis. 2751-2755 - Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li:
AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. 2756-2760 - Nicholas Eng, C. T. Justine Hui, Yusuke Hioka, Catherine I. Watson:
Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis. 2761-2765 - Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao:
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. 2766-2770 - Sai Sirisha Rallabandi, Abhinav Bharadwaj, Babak Naderi, Sebastian Möller:
Perception of Social Speaker Characteristics in Synthetic Speech. 2771-2775 - Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang:
Hi-Fi Multi-Speaker English TTS Dataset. 2776-2780 - Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee:
Utilizing Self-Supervised Representations for MOS Prediction. 2781-2785 - Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol:
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. 2786-2790 - Jason Taylor, Korin Richmond:
Confidence Intervals for ASR-Based TTS Evaluation. 2791-2795
INTERSPEECH 2021 Deep Noise Suppression Challenge
- Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Asokan Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan:
INTERSPEECH 2021 Deep Noise Suppression Challenge. 2796-2800 - Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li:
A Simultaneous Denoising and Dereverberation Framework with Target Decoupling. 2801-2805 - Ziyi Xu, Maximilian Strake, Tim Fingscheidt:
Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data. 2806-2810 - Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu:
DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. 2811-2815 - Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie:
DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. 2816-2820 - Kanghao Zhang, Shulin He, Hao Li, Xueliang Zhang:
DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement. 2821-2825 - Xu Zhang, Xinlei Ren, Xiguang Zheng, Lianwu Chen, Chen Zhang, Liang Guo, Bing Yu:
Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss. 2826-2830 - Koen Oostermeijer, Qing Wang, Jun Du:
Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement. 2831-2835
Neural Network Training Methods and Architectures for ASR
- Nicolae-Catalin Ristea, Radu Tudor Ionescu:
Self-Paced Ensemble Learning for Speech and Audio Classification. 2836-2840 - Atsushi Kojima:
Knowledge Distillation for Streaming Transformer-Transducer. 2841-2845 - Timo Lohrenz, Zhengyang Li, Tim Fingscheidt:
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition. 2846-2850 - Salah Zaiem, Titouan Parcollet, Slim Essid:
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning. 2851-2855 - Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney:
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models. 2856-2860 - Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard:
Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model. 2861-2865
Emotion and Sentiment Analysis I
- Clément Le Moine, Nicolas Obin, Axel Roebel:
Speaker Attentive Speech Emotion Recognition. 2866-2870 - Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso:
Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions. 2871-2875 - Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos:
M3: MultiModal Masking Applied to Sentiment Analysis. 2876-2880
Linguistic Components in End-to-End ASR
- Ondrej Klejch, Electra Wallington, Peter Bell:
The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2881-2885 - Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney:
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition. 2886-2890 - Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney:
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept. 2891-2895 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:
Modeling Dialectal Variation for Swiss German Automatic Speech Recognition. 2896-2900 - Ekaterina Egorova, Hari Krishna Vydana, Lukás Burget, Jan Cernocký:
Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System. 2901-2905 - Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García-Perera, Sanjeev Khudanpur:
Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition. 2906-2910
Assessment of Pathological Speech and Language II
- Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik:
Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features. 2911-2915 - Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan:
Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder. 2916-2920 - Waldemar Jesko:
Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms. 2921-2925 - Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio:
Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech. 2926-2930 - Si Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee:
Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding. 2931-2935 - Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna:
Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions. 2936-2940 - Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O'Malley, Heidi Christensen:
Identifying Cognitive Impairment Using Sentence Representation Vectors. 2941-2945 - Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright:
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children. 2946-2950 - Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo:
Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data. 2951-2955 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization. 2956-2960 - Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh:
Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson's Disease and Healthy Subjects. 2961-2965 - R'mani Haulcy, James R. Glass:
CLAC: A Speech Corpus of Healthy English Speakers. 2966-2970
Multimodal Systems
- Leanne Nortje, Herman Kamper:
Direct Multimodal Few-Shot Learning of Speech and Images. 2971-2975 - Ramon Sanabria, Austin Waters, Jason Baldridge:
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. 2976-2980 - Huan Zhao, Kaili Ma:
A Fast Discrete Two-Step Learning Hashing for Scalable Cross-Modal Retrieval. 2981-2985 - Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu:
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition. 2986-2990 - Kayode Olaleye, Herman Kamper:
Attention-Based Keyword Localisation in Speech Using Visual Grounding. 2991-2995 - Khazar Khorrami, Okko Räsänen:
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models. 2996-3000 - Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee:
Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries. 3001-3005 - Andrew Rouditchenko, Angie W. Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogério Feris, Brian Kingsbury, Michael Picheny, James R. Glass:
Cascaded Multilingual Audio-Visual Learning from Videos. 3006-3010 - Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic:
LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision. 3011-3015 - Richard Rose, Olivier Siohan, Anshuman Tripathi, Otavio Braga:
End-to-End Audio-Visual Speech Recognition for Overlapping Speech. 3016-3020 - Yifei Wu, Chenda Li, Song Yang, Zhongqin Wu, Yanmin Qian:
Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party. 3021-3025
Source Separation I
- Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu:
Ultra Fast Speech Separation Model with Teacher Student Learning. 3026-3030 - Murtiza Ali, Ashwani Koul, Karan Nathwani:
Group Delay Based Re-Weighted Sparse Recovery Algorithms for Robust and High-Resolution Source Separation in DOA Framework. 3031-3035 - Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen:
Continuous Speech Separation Using Speaker Inventory for Long Recording. 3036-3040 - Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, Wenwu Wang:
Crossfire Conditional Generative Adversarial Networks for Singing Voice Extraction. 3041-3045 - Kai Wang, Hao Huang, Ying Hu, Zhihua Huang, Sheng Li:
End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain. 3046-3050 - Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi:
Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation. 3051-3055 - Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee:
Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training. 3056-3060 - Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang:
Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation. 3061-3065 - Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li:
Investigation of Practical Aspects of Single Channel Speech Separation for ASR. 3066-3070 - Yi Luo, Nima Mesgarani:
Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation. 3071-3075 - Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu:
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation. 3076-3080
Speaker Diarization I
- Yi-Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke:
End-to-End Neural Diarization: From Transformer to Conformer. 3081-3085 - Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee:
Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network. 3086-3090 - Xucheng Wan, Kai Liu, Huan Zhou:
Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference. 3091-3095 - Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Leibny Paola García-Perera, Kenji Nagamatsu:
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization. 3096-3100 - Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung:
Adapting Speaker Embeddings for Speaker Diarisation. 3101-3105 - Yu-Xuan Wang, Jun Du, Maokui He, Shutong Niu, Lei Sun, Chin-Hui Lee:
Scenario-Dependent Speaker Diarization for DIHARD-III Challenge. 3106-3110 - Hervé Bredin, Antoine Laurent:
End-To-End Speaker Segmentation for Overlap-Aware Resegmentation. 3111-3115 - Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Leibny Paola García-Perera, Kenji Nagamatsu:
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers. 3116-3120 - Or Haim Anidjar, Itshak Lapidot, Chen Hajaj, Amit Dvir:
A Thousand Words are Worth More Than One Recording: Word-Embedding Based Speaker Change Detection. 3121-3125
Speech Synthesis: Prosody Modeling I
- Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:
Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis. 3126-3130 - Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo:
Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows. 3131-3135 - Chenpeng Du, Kai Yu:
Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network. 3136-3140 - Kenichi Fujita, Atsushi Ando, Yusuke Ijima:
Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis. 3141-3145 - Yuxiang Zou, Shichao Liu, Xiang Yin, Haopeng Lin, Chunfeng Wang, Haoyu Zhang, Zejun Ma:
Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation. 3146-3150 - Mayank Sharma, Yogesh Virkar, Marcello Federico, Roberto Barra-Chicote, Robert Enyedi:
Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing. 3151-3155 - Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee:
Applying the Information Bottleneck Principle to Prosodic Representation Learning. 3156-3160 - Alice Baird, Silvan Mertes, Manuel Milling, Lukas Stappen, Thomas Wiest, Elisabeth André, Björn W. Schuller:
A Prototypical Network Approach for Evaluating Generated Emotional Speech. 3161-3165
Speech Production II
- Tsukasa Yoshinaga, Kohei Tada, Kazunori Nozaki, Akiyoshi Iida:
A Simplified Model for the Vocal Tract of [s] with Inclined Incisors. 3166-3170 - Takayuki Arai:
Vocal-Tract Models to Visualize the Airstream of Human Breath and Droplets While Producing Speech. 3171-3175 - Ryo Tanji, Hidefumi Ohmura, Kouichi Katsurada:
Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data. 3176-3180 - Rafia Inaam, Tsukasa Yoshinaga, Takayuki Arai, Hiroshi Yokoyama, Akiyoshi Iida:
Comparison Between Lumped-Mass Modeling and Flow Simulation of the Reed-Type Artificial Vocal Fold. 3181-3185 - Raphael Werner, Susanne Fuchs, Jürgen Trouvain, Bernd Möbius:
Inhalations in Speech: Acoustic and Physiological Characteristics. 3186-3190 - Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu:
Model-Based Exploration of Linking Between Vowel Articulatory Space and Acoustic Space. 3191-3195 - Mikey Elmers, Raphael Werner, Beeke Muhlack, Bernd Möbius, Jürgen Trouvain:
Take a Breath: Respiratory Sounds Improve Recollection in Synthetic Speech. 3196-3200 - Taijing Chen, Adam C. Lammert, Benjamin Parrell:
Modeling Sensorimotor Adaptation in Speech Through Alterations to Forward and Inverse Models. 3201-3205 - Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino:
Mixture of Orthogonal Sequences Made from Extended Time-Stretched Pulses Enables Measurement of Involuntary Voice Fundamental Frequency Response to Pitch Perturbation. 3206-3210
Spoken Dialogue Systems II
- Chenyu You, Nuo Chen, Yuexian Zou:
Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering. 3211-3215 - Wenying Duan, Xiaoxi He, Zimu Zhou, Hong Rao, Lothar Thiele:
Injecting Descriptive Meta-Information into Pre-Trained Language Models with Hypernetworks. 3216-3220 - Mahdin Rohmatillah, Jen-Tzung Chien:
Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy. 3221-3225 - Shinya Fujie, Hayato Katayama, Jin Sakuma, Tetsunori Kobayashi:
Timing Generating Networks: Neural Network Based Precise Turn-Taking Timing Prediction in Multiparty Conversation. 3226-3230 - Kehan Chen, Zezhong Li, Suyang Dai, Wei Zhou, Haiqing Chen:
Human-to-Human Conversation Dataset for Learning Fine-Grained Turn-Taking Action. 3231-3235 - Mukuntha Narayanan Sundararaman, Ayush Kumar, Jithendra Vepa:
PhonemeBERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript. 3236-3240 - Hongyin Luo, James R. Glass, Garima Lalwani, Yi Zhang, Shang-Wen Li:
Joint Retrieval-Extraction Training for Evidence-Aware Dialog Response Selection. 3241-3245 - Ashish Shenoy, Sravan Bodapati, Monica Sunkara, Srikanth Ronanki, Katrin Kirchhoff:
Adapting Long Context NLM for ASR Rescoring in Conversational Agents. 3246-3250
Oriental Language Recognition
- Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li, Qingyang Hong, Dong Wang:
Oriental Language Recognition (OLR) 2020: Summary and Analysis. 3251-3255 - Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina:
Language Recognition on Unknown Conditions: The LORIA-Inria-MULTISPEECH System for AP20-OLR Challenge. 3256-3260 - Tianlong Kong, Shouyi Yin, Dawei Zhang, Wang Geng, Xin Wang, Dandan Song, Jinwen Huang, Huiyu Shi, Xiaorui Wang:
Dynamic Multi-Scale Convolution for Dialect Identification. 3261-3265 - Ding Wang, Shuaishuai Ye, Xinhui Hu, Sheng Li, Xinkang Xu:
An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model. 3266-3270 - Haibin Yu, Jing Zhao, Song Yang, Zhongqin Wu, Yuting Nie, Wei-Qiang Zhang:
Language Recognition Based on Unsupervised Pretrained Models. 3271-3275 - Zheng Li, Yan Liu, Lin Li, Qingyang Hong:
Additive Phoneme-Aware Margin Softmax Loss for Language Recognition. 3276-3280
Automatic Speech Recognition in Air Traffic Management
- Nataly Jahchan, Florentin Barbier, Ariyanidevi Dharma Gita, Khaled Khelif, Estelle Delpech:
Towards an Accent-Robust Approach for ATC Communications Transcription. 3281-3285 - Igor Szöke, Santosh Kesiraju, Ondrej Novotný, Martin Kocour, Karel Veselý, Jan Cernocký:
Detecting English Speech in the Air Traffic Control Voice Communication. 3286-3290 - Oliver Ohneiser, Seyyed Saeed Sarfjoo, Hartmut Helmke, Shruthi Shetty, Petr Motlícek, Matthias Kleinert, Heiko Ehr, Sarunas Murauskas:
Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances. 3291-3295 - Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlícek, Karel Veselý, Martin Kocour, Igor Szöke:
Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems. 3296-3300 - Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga-Gomez, Igor Szöke, Jan Cernocký, Dietrich Klakow, Petr Motlícek:
Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition. 3301-3305 - Benjamin Elie, Jodie Gauvain, Jean-Luc Gauvain, Lori Lamel:
Modeling the Effect of Military Oxygen Masks on Speech Characteristics. 3306-3310
Show and Tell 3
- Benjamin Milde, Tim Fischer, Steffen Remus, Chris Biemann:
MoM: Minutes of Meeting Bot. 3311-3312 - Alexander Wilbrandt, Simon Stone, Peter Birkholz:
Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording. 3313-3314 - Joan Codina-Filbà, Guillermo Cámbara, Alex Peiró Lilja, Jens Grivolla, Roberto Carlini, Mireia Farrús:
The INGENIOUS Multilingual Operations App. 3315-3316 - Joanna Rownicka, Kilian Sprenkamp, Antonio Tripiana, Volodymyr Gromoglasov, Timo P. Kunz:
Digital Einstein Experience: Fast Text-to-Speech for Conversational AI. 3317-3318 - Robert Geislinger, Benjamin Milde, Timo Baumann, Chris Biemann:
Live Subtitling for BigBlueButton with Open-Source Software. 3319-3320 - Davis Nicmanis, Askars Salimbajevs:
Expressive Latvian Speech Synthesis for Dialog Systems. 3321-3322 - Pramod H. Kachare, Prem C. Pandey, Vishal Mane, Hirak Dasgupta, K. S. Nataraj, Akshada Rathod, Sheetal K. Pathak:
ViSTAFAE: A Visual Speech-Training Aid with Feedback of Articulatory Efforts. 3323-3324
Survey Talk 3: Karen Livescu
- Karen Livescu:
Learning Speech Models from Multi-Modal Data.
Keynote 3: Mounya Elhilali
- Mounya Elhilali:
Adaptive Listening to Everyday Soundscapes.
Speech Production I
- Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie:
Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated. 3325-3329 - Rémi Blandin, Marc Arnela, Simon Félix, Jean-Baptiste Doc, Peter Birkholz:
Comparison of the Finite Element Method, the Multimodal Method and the Transmission-Line Model for the Computation of Vocal Tract Transfer Functions. 3330-3334 - Petra Wagner, Sina Zarrieß, Joana Cholin:
Effects of Time Pressure and Spontaneity on Phonotactic Innovations in German Dialogues. 3335-3339 - Salvador Medina, Sarah Taylor, Mark Tiede, Alexander G. Hauptmann, Iain A. Matthews:
Importance of Parasagittal Sensor Information in Tongue Motion Capture Through a Diphonic Analysis. 3340-3344 - Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber:
Learning Robust Speech Representation with an Articulatory-Regularized Variational Autoencoder. 3345-3349 - Heather Weston, Laura L. Koenig, Susanne Fuchs:
Changes in Glottal Source Parameter Values with Light to Moderate Physical Load. 3350-3354
Speech Enhancement and Coding
- Mohammad Hassan Vali, Tom Bäckström:
End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding. 3355-3359 - Santhan Kumar Reddy Nareddula, Subrahmanyam Gorthi, Rama Krishna Sai Subrahmanyam Gorthi:
Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement. 3360-3364 - Lubos Marcinek, Michael Stone, Rebecca E. Millman, Patrick Gaydecki:
N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification. 3365-3369
Emotion and Sentiment Analysis II
- Yangyang Xia, Li-Wei Chen, Alexander Rudnicky, Richard M. Stern:
Temporal Context in Speech Emotion Recognition. 3370-3374 - Hang Li, Wenbiao Ding, Zhongqin Wu, Zitao Liu:
Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition. 3375-3379 - Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen:
Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit. 3380-3384 - Fan Qian, Jiqing Han:
Multimodal Sentiment Analysis with Temporal Modality Attention. 3385-3389 - Mani Kumar Tellamekala, Enrique Sanchez, Georgios Tzimiropoulos, Timo Giesbrecht, Michel F. Valstar:
Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition. 3390-3394 - Haoqi Li, Yelin Kim, Cheng-Hao Kuo, Shrikanth S. Narayanan:
Acted vs. Improvised: Domain Adaptation for Elicitation Approaches in Audio-Visual Emotion Recognition. 3395-3399 - Leonardo Pepino, Pablo Riera, Luciana Ferrer:
Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. 3400-3404 - Jiawang Liu, Haoxiang Wang:
Graph Isomorphism Network for Speech Emotion Recognition. 3405-3409 - Pooja Kumawat, Aurobinda Routray:
Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition. 3410-3414 - Aaron Keesing, Yun Sing Koh, Michael Witbrock:
Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech. 3415-3419 - Suwon Shon, Pablo Brusco, Jing Pan, Kyu Jeong Han, Shinji Watanabe:
Leveraging Pre-Trained Language Model for Speech Sentiment Analysis. 3420-3424
Multi- and Cross-Lingual ASR, Other Topics in ASR
- Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki:
Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching. 3425-3429 - Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:
Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone. 3430-3434 - Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong:
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer. 3435-3439 - Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak:
Reducing Streaming ASR Model Delay with Self Alignment. 3440-3444 - Anuj Diwan, Preethi Jyothi:
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages. 3445-3449 - Takashi Fukuda, Samuel Thomas:
Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer. 3450-3454 - Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo:
Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End. 3455-3459 - Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao:
Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models. 3460-3464 - Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, Miguel Jette:
Earnings-21: A Practical Benchmark for ASR in the Wild. 3465-3469 - Eric Sun, Jinyu Li, Zhong Meng, Yu Wu, Jian Xue, Shujie Liu, Yifan Gong:
Improving Multilingual Transformer Transducer Models by Reducing Language Confusions. 3470-3474 - Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, Yasser Hifny:
Arabic Code-Switching Speech Recognition Using Monolingual Data. 3475-3479
Source Separation II
- Aviad Eisenberg, Boaz Schwartz, Sharon Gannot:
Online Blind Audio Source Separation Using Recursive Expectation-Maximization. 3480-3484 - Yi Luo, Cong Han, Nima Mesgarani:
Empirical Analysis of Generalized Iterative Speech Separation Networks. 3485-3489 - Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach:
Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers. 3490-3494 - Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker:
Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation. 3495-3499 - Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki:
Few-Shot Learning of New Sound Classes for Target Sound Extraction. 3500-3504 - Cong Han, Yi Luo, Nima Mesgarani:
Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues. 3505-3509 - Shell Xu Hu, Md Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias:
AvaTr: One-Shot Speaker Extraction with Transformers. 3510-3514 - Saurjya Sarkar, Emmanouil Benetos, Mark B. Sandler:
Vocal Harmony Separation Using Time-Domain Neural Networks. 3515-3519 - Matthew Maciejewski, Shinji Watanabe, Sanjeev Khudanpur:
Speaker Verification-Based Evaluation of Single-Channel Speech Separation. 3520-3524 - Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu:
Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection. 3525-3529 - Chengyun Deng, Shiqian Ma, Yongtao Sha, Yi Zhang, Hui Zhang, Hui Song, Fei Wang:
Robust Speaker Extraction Network Based on Iterative Refined Adaptation. 3530-3534 - Wupeng Wang, Chenglin Xu, Meng Ge, Haizhou Li:
Neural Speaker Extraction with Speaker-Speech Cross-Attention Network. 3535-3539 - Rémi Rigal, Jacques Chodorowski, Benoît Zerr:
Deep Audio-Visual Speech Separation Based on Facial Motion. 3540-3544
Speaker Diarization II
- Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy:
LEAP Submission for the Third DIHARD Diarization Challenge. 3545-3549 - Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan:
Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings. 3550-3554 - Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe:
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speakers. 3555-3559 - Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na:
ECAPA-TDNN Embeddings for Speaker Diarization. 3560-3564 - Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara:
Advances in Integration of End-to-End Neural and Clustering-Based Diarization for Real Conversational Speech. 3565-3569 - Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman:
The Third DIHARD Diarization Challenge. 3570-3574 - Tsun-Yat Leung, Lahiru Samarakoon:
Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty. 3575-3579 - Benjamin O'Brien, Natalia A. Tomashenko, Anaïs Chanclu, Jean-François Bonastre:
Anonymous Speaker Clusters: Making Distinctions Between Anonymised Speech Recordings with Clustering Interface. 3580-3584 - Kiran Karra, Alan McCree:
Speaker Diarization Using Two-Pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings. 3585-3589
Speech Synthesis: Toward End-to-End Synthesis I
- Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao:
Federated Learning with Dynamic Transformer for Text to Speech. 3590-3594 - Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang:
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks. 3595-3599 - Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng:
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration. 3600-3604 - Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim:
Diff-TTS: A Denoising Diffusion Model for Text-to-Speech. 3605-3609 - Jae-Sung Bae, Taejun Bak, Young-Sun Joo, Hoon-Young Cho:
Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech. 3610-3614 - Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux:
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. 3615-3619 - Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman:
A Learned Conditional Prior for the VAE Acoustic Space of a TTS System. 3620-3624 - Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou:
A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization. 3625-3629 - Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda:
Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder. 3630-3634 - Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee:
Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech. 3635-3639 - Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu:
Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet. 3640-3644 - Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluísio, Moacir Antonelli Ponti:
SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. 3645-3649
Tools, Corpora and Resources
- Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James R. Glass:
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. 3650-3654 - Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post:
The Multilingual TEDx Corpus for Speech Recognition and Translation. 3655-3659 - David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu:
Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments. 3660-3664 - Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen:
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario. 3665-3669 - Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan:
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio. 3670-3674 - You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung:
Look Who's Talking: Active Speaker Detection in the Wild. 3675-3679 - Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Ali Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah:
AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children's Speech. 3680-3684 - Per Fallgren, Jens Edlund:
Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson. 3685-3689 - Elena Ryumina, Oxana Verkholyak, Alexey Karpov:
Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition. 3690-3694 - Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez-Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez González de Martos, Jorge Civera, Albert Sanchís, Alfons Juan:
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. - Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B. Hegde, Vinay P. Namboodiri, C. V. Jawahar:
Towards Automatic Speech to Sign Language Generation. 3700-3704 - Won-Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim:
kosp2e: Korean Speech to English Translation Corpus. 3705-3709 - Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang:
speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment. 3710-3714
Non-Autoregressive Sequential Modeling for Speech Processing
- Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan:
An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition. 3715-3719 - Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie:
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain. 3720-3724 - Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan:
Pushing the Limits of Non-Autoregressive Speech Recognition. 3725-3729 - Alexander H. Liu, Yu-An Chung, James R. Glass:
Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies. 3730-3734 - Jumon Nozaki, Tatsuya Komatsu:
Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions. 3735-3739 - Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi:
Toward Streaming ASR with Non-Autoregressive Insertion-Based Model. 3740-3744 - Jaesong Lee, Jingu Kang, Shinji Watanabe:
Layer Pruning on Demand with Intermediate CTC. 3745-3749 - Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong:
Real-Time End-to-End Monaural Multi-Speaker Speech Recognition. 3750-3754 - Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe:
Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models. 3755-3759 - Stanislav Beliaev, Boris Ginsburg:
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis. 3760-3764 - Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan:
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. 3765-3769 - Nanxin Chen, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba, Najim Dehak:
Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition. 3770-3774 - Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng:
VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis. 3775-3779
The ADReSSo Challenge: Detecting Cognitive Decline Using Speech Only
- Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney:
Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge. 3780-3784 - Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Philipp Klumpp, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Korbinian Riedhammer:
Influence of the Interviewer on the Automatic Assessment of Alzheimer's Disease in the Context of the ADReSSo Challenge. 3785-3789 - Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, Robert M. Roth:
WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection. 3790-3794 - Lara Gauder, Leonardo Pepino, Luciana Ferrer, Pablo Riera:
Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models. 3795-3799 - Aparna Balagopalan, Jekaterina Novikova:
Comparing Acoustic-Based Approaches for Alzheimer's Disease Detection. 3800-3804 - Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz:
Alzheimer's Disease Detection from Spontaneous Speech Through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models. 3805-3809 - Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen:
Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer's Dementia Detection Through Spontaneous Speech. 3810-3814 - Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Margaret Lech, Elena Pirogova:
Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer's Dementia Recognition from Spontaneous Speech. 3815-3819 - Morteza Rohanian, Julian Hough, Matthew Purver:
Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs. 3820-3824 - Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Zelasko, Jesús Villalba, Najim Dehak:
Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios. 3825-3829 - Jun Chen, Jieping Ye, Fengyi Tang, Jiayu Zhou:
Automatic Detection of Alzheimer's Disease Using Spontaneous Speech Only. 3830-3834 - Ning Wang, Yupeng Cao, Shuai Hao, Zongru Shao, K. P. Subbalakshmi:
Modular Multi-Modal Attention Network for Alzheimer's Disease Detection Using Patient Audio and Language Data. 3835-3839
Robust and Far-Field ASR
- Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanovic:
Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-Field Speech Recognition. 3840-3844 - Roberto Gretter, Marco Matassoni, Daniele Falavigna, A. Misra, Chee Wee Leong, Kate M. Knill, Linlin Wang:
ETLT 2021: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech. 3845-3849 - Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann:
Age-Invariant Training for End-to-End Child Speech Recognition Using Adversarial Multi-Task Learning. 3850-3854 - Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini:
Learning to Rank Microphones for Distant Speech Recognition. 3855-3859 - Lucile Gelin, Thomas Pellegrini, Julien Pinquier, Morgane Daniel:
Simulating Reading Mistakes for Child Speech Transformer-Based Phone Recognition. 3860-3864
Speech Synthesis: Prosody Modeling II
- Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier:
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input. 3865-3869 - Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby:
Exploring Emotional Prototypes in a High Dimensional TTS Latent Space. 3870-3874 - Devang S. Ram Mohan, Qinmin Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King:
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis. 3875-3879 - Alexandra Torresquintero, Tian Huey Teh, Christopher G. R. Wallis, Marlene Staib, Devang S. Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King:
ADEPT: A Dataset for Evaluating Prosody Transfer. 3880-3884 - Thi Thu Trang Nguyen, Nguyen Hoang Ky, Albert Rilliard, Christophe d'Alessandro:
Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech. 3885-3889
Source Separation III
- Shaked Dovrat, Eliya Nachmani, Lior Wolf:
Many-Speakers Single Channel Speech Separation with Optimal Permutation Training. 3890-3894 - Mieszko Fras, Marcin Witkowski, Konrad Kowalczyk:
Combating Reverberation in NTF-Based Speech Separation Using a Sub-Source Weighted Multichannel Wiener Filter and Linear Prediction. 3895-3899 - Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler:
A Hands-On Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation. 3900-3904 - Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz:
GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation. 3905-3909
Non-Native Speech
- Kimiko Tsukada, Yu Rong, Joo-Yeon Kim, Jeong-Im Han, John Hajek:
Cross-Linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared. 3910-3914 - Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek:
Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention. 3915-3919 - Bettina Braun, Nicole Dehé, Marieke Einfeldt, Daniela Wochner, Katharina Zahner-Ritter:
Testing Acoustic Voice Quality Classification Across Languages and Speech Styles. 3920-3924 - Qianyutong Zhang, Kexin Lyu, Zening Chen, Ping Tang:
Acquisition of Prosodic Focus Marking by Three- to Six-Year-Old Children Learning Mandarin Chinese. 3925-3928 - Maryam Sadat Mirzaei, Kourosh Meshgi:
Adaptive Listening Difficulty Detection for L2 Learners Through Moderating ASR Resources. 3929-3933 - Hongwei Ding, Binghuai Lin, Liyuan Wang:
F0 Patterns of L2 English Speech by Mandarin Chinese Learners. 3934-3938 - Binghuai Lin, Liyuan Wang:
A Neural Network-Based Noise Compensation Method for Pronunciation Assessment. 3939-3943 - Jacek Kudera, Philip Georgis, Bernd Möbius, Tania Avgustinova, Dietrich Klakow:
Phonetic Distance and Surprisal in Multilingual Priming: Evidence from Slavic. 3944-3948 - Yuqing Zhang, Zhu Li, Binghuai Lin, Jinsong Zhang:
A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives. 3949-3953 - Minglin Wu, Kun Li, Wai-Kim Leung, Helen Meng:
Transformer Based End-to-End Mispronunciation Detection and Diagnosis. 3954-3958 - Calbert Graham:
L1 Identification from L2 Speech Using Neural Spectrogram Analysis. 3959-3963
Phonetics II
- Miran Oh, Dani Byrd, Shrikanth S. Narayanan:
Leveraging Real-Time MRI for Illuminating Linguistic Velum Action. 3964-3968 - Zirui Liu, Yi Xu:
Segmental Alignment of English Syllables with Singleton and Cluster Onsets. 3969-3973 - Mísa Hejná:
Exploration of Welsh English Pre-Aspiration: How Wide-Spread is it? 3974-3978 - Beeke Muhlack, Mikey Elmers, Heiner Drenhaus, Jürgen Trouvain, Marjolein van Os, Raphael Werner, Margarita Ryzhova, Bernd Möbius:
Revisiting Recall Effects of Filler Particles in German and English. 3979-3983 - Chunyu Ge, Yixuan Xiong, Peggy Mok:
How Reliable Are Phonetic Data Collected Remotely? Comparison of Recording Devices and Environments on Acoustic Measurements. 3984-3988 - Jing Huang, Feng-fan Hsieh, Yueh-Chin Chang:
A Cross-Dialectal Comparison of Apical Vowels in Beijing Mandarin, Northeastern Mandarin and Southwestern Mandarin: An EMA and Ultrasound Study. 3989-3993 - Mark Gibson, Oihane Muxika, Marianne Pouplier:
Dissecting the Aero-Acoustic Parameters of Open Articulatory Transitions. 3994-3998 - Amelia Jane Gully:
Quantifying Vocal Tract Shape Variation and its Acoustic Impact: A Geometric Morphometric Approach. 3999-4003 - Adriana Guevara-Rukoz, Shi Yu, Sharon Peperkamp:
Speech Perception and Loanword Adaptations: The Case of Copy-Vowel Epenthesis. 4004-4008 - Zhe-chen Guo, Rajka Smiljanic:
Speakers Coarticulate Less When Facing Real and Imagined Communicative Difficulties: An Analysis of Read and Spontaneous Speech from the LUCID Corpus. 4009-4013 - Einar Meister, Lya Meister:
Developmental Changes of Vowel Acoustics in Adolescents. 4014-4018 - Sonia D'Apolito, Barbara Gili Fivela:
Context and Co-Text Influence on the Accuracy Production of Italian L2 Non-Native Sounds. 4019-4023 - Wilbert Heeringa, Hans Van de Velde:
A New Vowel Normalization for Sociophonetics. 4024-4028 - Rosey Billington, Hywel Stoakes, Nick Thieberger:
The Pacific Expansion: Optimizing Phonetic Transcription of Archival Corpora. 4029-4033
Search/Decoding Techniques and Confidence Measures for ASR
- Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen:
FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization. 4034-4038 - Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Y. Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko:
LT-LM: A Novel Non-Autoregressive Language Model for Single-Shot Lattice Rescoring. 4039-4043 - Cyril Allauzen, Ehsan Variani, Michael Riley, David Rybach, Hao Zhang:
A Hybrid Seq-2-Seq ASR Design for On-Device and Server Applications. 4044-4048 - Hirofumi Inaguma, Tatsuya Kawahara:
VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording. 4049-4053 - Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei:
WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. 4054-4058 - Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima:
Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition. 4059-4063 - Mun-Hak Lee, Joon-Hyuk Chang:
Deep Neural Network Calibration for E2E Speech Recognition System. 4064-4068 - Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland:
Residual Energy-Based Models for End-to-End Speech Recognition. 4069-4073 - David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw:
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction. 4074-4078 - Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain:
Insights on Neural Representations for End-to-End Speech Recognition. 4079-4083 - Amber Afshan, Kshitiz Kumar, Jian Wu:
Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models. 4084-4088
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics
- Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita:
Unsupervised Learning of Disentangled Speech Content and Style Representation. 4089-4093 - Eunbi Choi, Hwa-Yeon Kim, Jong-Hwan Kim, Jae-Min Kim:
Label Embedding for Chinese Grapheme-to-Phoneme Conversion. 4094-4098 - Haiteng Zhang:
PDF: Polyphone Disambiguation in Chinese by Using FLAT. 4099-4103 - Junjie Li, Zhiyu Zhang, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-Pooling Strategy and Window-Based Attention. 4104-4108 - Yi Shi, Congyi Wang, Yu Chen, Bin Wang:
Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning. 4109-4113 - Yue Chen, Zhen-Hua Ling, Qing-Feng Liu:
A Neural-Network-Based Approach to Identifying Speakers in Novels. 4114-4118 - Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai:
UnitNet-Based Hybrid Speech Synthesis. 4119-4123 - Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura:
Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder. 4124-4128 - Haozhe Zhang, Zhihua Huang, Zengqiang Shang, Pengyuan Zhang, Yonghong Yan:
LinearSpeech: Parallel Text-to-Speech with Linear Complexity. 4129-4133
Speech Type Classification and Diagnosis
- Noa Mansbach, Evgeny Hershkovitch Neiterman, Amos Azaria:
An Agent for Competing with Humans in a Deceptive Game Based on Vocal Cues. 4134-4138 - Ahmed Fakhry, Xinyi Jiang, Jaclyn Xiao, Gunvant Chaudhari, Asriel Han:
A Multi-Branch Deep Learning Network for Automated Detection of COVID-19. 4139-4143 - Youxuan Ma, Zongze Ren, Shugong Xu:
RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform. 4144-4148 - Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza:
Fake Audio Detection in Resource-Constrained Settings Using Microfeatures. 4149-4153 - Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Shuo Liu, Meishu Song, Björn W. Schuller:
Coughing-Based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks. 4154-4158 - Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das:
Knowledge Distillation for Singing Voice Detection. 4159-4163 - Ryu Takeda, Kazunori Komatani:
Age Estimation with Speech-Age Model for Heterogeneous Speech Datasets. 4164-4168 - Kah Kuan Teh, Huy Dat Tran:
Open-Set Audio Classification with Limited Training Resources Based on Augmentation Enhanced Variational Auto-Encoder GAN with Detection-Classification Joint Training. 4169-4173 - Takahiro Fukumori:
Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification. 4174-4178 - Shikha Baghel, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna, Prithwijit Guha:
Automatic Detection of Shouted Speech Segments in Indian News Debates. 4179-4183 - Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh:
Generalized Spoofing Detection Inspired from Audio Generation Artifacts. 4184-4188 - Weiguang Chen, Van Tung Pham, Eng Siong Chng, Xionghu Zhong:
Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion. 4189-4193
Spoken Term Detection & Voice Search
- Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow:
Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study. 4194-4198 - Zheng Gao, Radhika Arava, Qian Hu, Xibin Gao, Thahir Mohamed, Wei Xiao, Mohamed Abdelhady:
Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding. 4199-4203 - Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw:
Personalized Keyphrase Detection Using Speaker and Environment Information. 4204-4208 - Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir:
Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation. 4209-4213 - Mark Mazumder, Colby R. Banbury, Josh Meyer, Pete Warden, Vijay Janapa Reddi:
Few-Shot Keyword Spotting in Any Language. 4214-4218 - Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou:
Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting. 4219-4223 - Yangbin Chen, Tom Ko, Jianping Wang:
A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples. 4224-4228 - Dongyub Lee, Byeongil Ko, Myeongcheol Shin, Taesun Whang, Daniel Lee, Eun Hwa Kim, EungGyun Kim, Jaechoon Jo:
Auxiliary Sequence Labeling Tasks for Disfluency Detection. 4229-4233 - Hang Zhou, Wenchao Hu, Yu Ting Yeung, Xiao Chen:
Energy-Friendly Keyword Spotting System Using Add-Based Convolution. 4234-4238 - Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li:
The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results. 4239-4243 - Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie:
Auto-KWS 2021 Challenge: Task, Datasets, and Baselines. 4244-4248 - Axel Berg, Mark O'Connor, Miguel Tairum Cruz:
Keyword Transformer: A Self-Attention Model for Keyword Spotting. 4249-4253 - Abhijeet Awasthi, Kevin Kilgour, Hassan Rom:
Teaching Keyword Spotters to Spot New Keywords with Limited Examples. 4254-4258
Voice Anti-Spoofing and Countermeasure
- Xin Wang, Junichi Yamagishi:
A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection. 4259-4263 - Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas W. D. Evans:
An Initial Investigation for Detecting Partially Spoofed Audio. 4264-4268 - Yang Xie, Zhenchuan Zhang, Yingchun Yang:
Siamese Network with wav2vec Feature for Spoofing Speech Detection. 4269-4273 - Xingliang Cheng, Mingxing Xu, Thomas Fang Zheng:
Cross-Database Replay Detection in Terminal-Dependent Speaker Verification. 4274-4278 - Yuxiang Zhang, Wenchao Wang, Pengyuan Zhang:
The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System. 4279-4283 - Zhiyuan Peng, Xu Li, Tan Lee:
Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification. 4284-4288 - Hefei Ling, Leichao Huang, Junrui Huang, Baiyan Zhang, Ping Li:
Attention-Based Convolutional Neural Network for ASV Spoofing Detection. 4289-4293 - Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-yi Lee:
Voting for the Right Answer: Adversarial Defense for Speaker Verification. 4294-4298 - Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas W. D. Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee:
Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing. 4299-4303 - Jesús Villalba, Sonal Joshi, Piotr Zelasko, Najim Dehak:
Representation Learning to Classify and Detect Adversarial Attacks Against Speaker and Speech Recognition Systems. 4304-4308 - You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan:
An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems. 4309-4313 - Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng:
Channel-Wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks. 4314-4318 - Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas W. D. Evans:
Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection. 4319-4323
OpenASR20 and Low Resource ASR Development
- Kay Peterson, Audrey Tong, Yan Yu:
OpenASR20: An Open Challenge for Automatic Speech Recognition of Conversational Telephone Speech in Low-Resource Languages. 4324-4328 - Srikanth R. Madikeri, Petr Motlícek, Hervé Bourlard:
Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages. 4329-4333 - Qiu-Shi Zhu, Jie Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai:
An Improved Wav2Vec 2.0 Pre-Training Approach Using Enhanced Local Dependency Modeling for Speech Recognition. 4334-4338 - Hung-Pang Lin, Yu-Jia Zhang, Chia-Ping Chen:
Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges. 4339-4343 - Jing Zhao, Zhiqiang Lv, Ambyera Han, Guan-Bo Wang, Gui-Xin Shi, Jian Kang, Jinghao Yan, Pengfei Hu, Shen Huang, Weiqiang Zhang:
The TNT Team System Descriptions of Cantonese and Mongolian for IARPA OpenASR20. 4344-4348 - Tanel Alumäe, Jiaming Kong:
Combining Hybrid and End-to-End Approaches for the OpenASR20 Challenge. 4349-4353 - Ethan Morris, Robbie Jimerson, Emily Prud'hommeaux:
One Size Does Not Fit All in Resource-Constrained ASR. 4354-4358
Survey Talk 4: Alejandrina Cristia
- Alejandrina Cristia:
Child Language Acquisition Studied with Wearables.
Keynote 4: Tomáš Mikolov
- Tomáš Mikolov:
Language Modeling and Artificial Intelligence.
Voice Activity Detection
- Pablo Gimeno, Alfonso Ortega Giménez, Antonio Miguel, Eduardo Lleida:
Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021. 4359-4363 - Tyler Vuong, Yangyang Xia, Richard M. Stern:
The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge. 4364-4368 - Seyyed Saeed Sarfjoo, Srikanth R. Madikeri, Petr Motlícek:
Speech Activity Detection Based on Multilingual Speech Recognition System. 4369-4373 - Jarrod Luckenbaugh, Samuel Abplanalp, Rachel Gonzalez, Daniel Fulford, David Gard, Carlos Busso:
Voice Activity Detection with Teacher-Student Domain Emulation. 4374-4378 - Omid Ghahabi, Volker Fischer:
EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III. 4379-4382
Keyword Search and Spoken Language Processing
- Kuba Lopatka, Katarzyna Kaszuba-Miotke, Piotr Klinke, Pawel Trella:
Device Playback Augmentation with Echo Cancellation for Keyword Spotting. 4383-4387 - Bolaji Yusuf, Alican Gök, Batuhan Gündogdu, Murat Saraclar:
End-to-End Open Vocabulary Keyword Search. 4388-4392 - Danny Merkx, Stefan L. Frank, Mirjam Ernestus:
Semantic Sentence Similarity: Size does not Always Matter. 4393-4397 - Jan Svec, Lubos Smídl, Josef V. Psutka, Ales Prazák:
Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings. 4398-4402 - François Buet, François Yvon:
Toward Genre Adapted Closed Captioning. 4403-4407
Applications in Transcription, Education and Learning
- Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek:
Weakly-Supervised Word-Level Pronunciation Error Detection in Non-Native English Speech. 4408-4412 - Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:
End-to-End Speaker-Attributed ASR with Transformer. 4413-4417 - Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey:
Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction. 4418-4422 - Jazmín Vidal, Cyntia Bonomi, Marcelo Sancinetti, Luciana Ferrer:
Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System. 4423-4427 - Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma:
Explore wav2vec 2.0 for Mispronunciation Detection. 4428-4432 - Shintaro Ando, Nobuaki Minematsu, Daisuke Saito:
Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings. 4433-4437 - Binghuai Lin, Liyuan Wang:
Deep Feature Transfer Learning for Automatic Pronunciation Assessment. 4438-4442 - Huayun Zhang, Ke Shi, Nancy F. Chen:
Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil. 4443-4447 - Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhang:
A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis. 4448-4452 - Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter:
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech. 4453-4457 - Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima:
End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning. 4458-4462 - Ronald Cumbal, Birger Moëll, José Lopes, Olov Engwall:
"You don't understand me!": Comparing ASR Results for L1 and L2 Speakers of Swedish. 4463-4467 - Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg:
NeMo Inverse Text Normalization: From Development to Production. 4468-4472 - Satsuki Naijo, Akinori Ito, Takashi Nose:
Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability. 4473-4477
Emotion and Sentiment Analysis III
- Fasih Haider, Saturnino Luz:
Affect Recognition Through Scalogram and Multi-Resolution Cochleagram Features. 4478-4482 - Jiawang Liu, Haoxiang Wang:
A Speech Emotion Recognition Framework for Better Discrimination of Confusions. 4483-4487 - Ruichen Li, Jinming Zhao, Qin Jin:
Speech Emotion Recognition via Multi-Level Cross-Modal Distillation. 4488-4492 - Koichiro Ito, Takuya Fujioka, Qinghua Sun, Kenji Nagamatsu:
Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes. 4493-4497 - Deboshree Bose, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
Parametric Distributions to Model Numerical Emotion Labels. 4498-4502 - Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang:
Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition. 4503-4507 - Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church:
Speech Emotion Recognition with Multi-Task Learning. 4508-4512 - Nadee Seneviratne, Carol Y. Espy-Wilson:
Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables. 4513-4517 - Yuhua Wang, Guang Shen, Yuezhu Xu, Jiahang Li, Zhengdao Zhao:
Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition. 4518-4522 - Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu:
Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition. 4523-4527
Resource-Constrained ASR
- Gonçalo Mordido, Matthijs Van Keirsbilck, Alexander Keller:
Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices. 4528-4532 - Mengli Cheng, Chengyu Wang, Jun Huang, Xiaobo Wang:
Weakly Supervised Construction of ASR Systems from Massive Video Data. 4533-4537 - Byeonggeun Kim, Simyung Chang, Jinkyu Lee, Dooyong Sung:
Broadcasted Residual Learning for Efficient Keyword Spotting. 4538-4542 - Rupak Vignesh Swaminathan, Brian John King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris:
CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition. 4543-4547 - Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin:
Extremely Low Footprint End-to-End ASR System for Smart Device. 4548-4552 - Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer:
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition. 4553-4557 - Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow:
Amortized Neural Networks for Low-Latency Speech Recognition. 4558-4562 - Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He:
Tied & Reduced RNN-T Decoder. 4563-4567 - Jangho Kim, Simyung Chang, Nojun Kwak:
PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation. 4568-4572 - Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra:
Collaborative Training of Acoustic Encoders for Speech Recognition. 4573-4577 - Xiong Wang, Sining Sun, Lei Xie, Long Ma:
Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition. 4578-4582 - Titouan Parcollet, Mirco Ravanelli:
The Energy and Carbon Footprint of Training End-to-End Speech Recognizers. 4583-4587
Speaker Recognition: Applications
- Long Chen, Venkatesh Ravichandran, Andreas Stolcke:
Graph-Based Label Propagation for Semi-Supervised Speaker Identification. 4588-4592 - Ruirui Li, Chelsea J.-T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke:
Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition. 4593-4597 - Sandro Cumani, Salvatore Sarni:
A Generative Model for Duration-Dependent Score Calibration. 4598-4602 - Jason Pelecanos, Quan Wang, Ignacio López-Moreno:
Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition. 4603-4607 - Saurabh Kataria, Shi-Xiong Zhang, Dong Yu:
Multi-Channel Speaker Verification for Single and Multi-Talker Speech. 4608-4612 - Dirk Padfield, Daniel J. Liebling:
Chronological Self-Training for Real-Time Speaker Diarization. 4613-4617 - Runqiu Xiao, Xiaoxiao Miao, Wenchao Wang, Pengyuan Zhang, Bin Cai, Liuping Luo:
Adaptive Margin Circle Loss for Speaker Verification. 4618-4622 - Benjamin O'Brien, Christine Meunier, Alain Ghio:
Presentation Matters: Evaluating Speaker Identification Tasks. 4623-4627 - Fuchuan Tong, Yan Liu, Song Li, Jie Wang, Lin Li, Qingyang Hong:
Automatic Error Correction for Speaker Embedding Learning with Noisy Labels. 4628-4632 - Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li:
An Integrated Framework for Two-Pass Personalized Voice Trigger. 4633-4637 - Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh:
Masked Proxy Loss for Text-Independent Speaker Verification. 4638-4642
Speech Synthesis: Speaking Style and Emotion
- Keon Lee, Kyumin Park, Daeyoung Kim:
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. 4643-4647 - Rui Liu, Berrak Sisman, Haizhou Li:
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability. 4648-4652 - Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi:
Emotional Prosody Control for Speech Generation. 4653-4657 - Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su:
Controllable Context-Aware Conversational Speech Synthesis. 4658-4662 - Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim:
Expressive Text-to-Speech Using Style Tag. 4663-4667 - Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu:
Adaptive Text to Speech for Spontaneous Style. 4668-4672 - Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng:
Towards Multi-Scale Style Control for Expressive Speech Synthesis. 4673-4677 - Shifeng Pan, Lei He:
Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis. 4678-4682 - Daxin Tan, Tan Lee:
Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement. 4683-4687 - Xiaochun An, Frank K. Soong, Lei Xie:
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS. 4688-4692 - Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws:
Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture. 4693-4697
Spoken Language Understanding II
- Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen:
Intent Detection and Slot Filling for Vietnamese. 4698-4702 - Haitao Lin, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong:
Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models. 4703-4707 - Judith Gaspers, Quynh Do, Daniil Sorokin, Patrick Lehnen:
The Impact of Intent Distribution Mismatch on Semi-Supervised Spoken Language Understanding. 4708-4712 - Yidi Jiang, Bidisha Sharma, Maulik C. Madhavi, Haizhou Li:
Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification. 4713-4717 - Nick J. C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang:
Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-Trained DNN-HMM-Based Acoustic-Phonetic Model. 4718-4722 - Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Jeff Kuo, Samuel Thomas, Edmilson da Silva Morais:
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs. 4723-4727 - Xianwei Zhang, Liang He:
End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining. 4728-4732 - Hamidreza Saghir, Samridhi Choudhary, Sepehr Eghbali, Clement Chung:
Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge. 4733-4737 - Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris:
End-to-End Spoken Language Understanding for Generalized Voice Assistants. 4738-4742 - Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, Josiah Poon:
Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling. 4743-4747
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
- Ross Cutler, Ando Saabas, Tanel Pärnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen, Robert Aichner, Sriram Srinivasan:
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge. 4748-4752 - Lukas Pfeifenberger, Matthias Zöhrer, Franz Pernkopf:
Acoustic Echo Cancellation with Cross-Domain Learning. 4753-4757 - Shimin Zhang, Yuxiang Kong, Shubo Lv, Yanxin Hu, Lei Xie:
F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement. 4758-4762 - Ernst Seidel, Jan Franzen, Maximilian Strake, Tim Fingscheidt:
Y2-Net FCRN for Acoustic Echo and Noise Suppression. 4763-4767 - Renhua Peng, Linjuan Cheng, Chengshi Zheng, Xiaodong Li:
Acoustic Echo Cancellation Using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information. 4768-4772 - Amir Ivry, Israel Cohen, Baruch Berdugo:
Nonlinear Acoustic Echo Cancellation with Deep Learning. 4773-4777
Speech Recognition of Atypical Speech
- Jordan R. Green, Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Katrin Tomanek:
Automatic Speech Recognition of Disordered Speech: Personalized Models Outperforming Human Listeners on Short Phrases. 4778-4782 - Michael Neumann, Oliver Roesler, Jackson Liscombe, Hardik Kothare, David Suendermann-Oeft, David Pautler, Indu Navar, Aria Anvar, Jochen Kumm, Raquel Norel, Ernest Fraenkel, Alexander V. Sherman, James D. Berry, Gary L. Pattee, Jun Wang, Jordan R. Green, Vikram Ramanarayanan:
Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale. 4783-4787 - Enno Hermann, Mathew Magimai-Doss:
Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination. 4788-4792 - Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng:
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition. 4793-4797 - Sarah E. Gutz, Hannah P. Rowe, Jordan R. Green:
Speaking with a KN95 Face Mask: ASR Performance and Speaker Compensation. 4798-4802 - Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng:
Adversarial Data Augmentation for Disordered Speech Recognition. 4803-4807 - Xurong Xie, Rukiye Ruzi, Xunying Liu, Lan Wang:
Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition. 4808-4812 - Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng:
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion. 4813-4817 - Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng:
Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition. 4818-4822 - Shanqing Cai, Lisie Lillianfeld, Katie Seaver, Jordan R. Green, Michael P. Brenner, Philip C. Nelson, D. Sculley:
A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks. 4823-4827 - Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno:
Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. 4828-4832 - Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Jordan R. Green, Katrin Tomanek:
Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia. 4833-4837 - Eun Jung Yeo, Sunhee Kim, Minhwa Chung:
Automatic Severity Classification of Korean Dysarthric Speech Using Phoneme-Level Pronunciation Features. 4838-4842 - Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin Tomanek, Jordan R. Green, Michael P. Brenner:
Comparing Supervised Models and Learned Speech Representations for Classifying Intelligibility of Disordered Speech on Selected Phrases. 4843-4847 - Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis G. Georgiou, Sachin Kajarekar, Jeffrey P. Bigham:
Analysis and Tuning of a Voice Assistant System for Dysfluent Speech. 4848-4852
Show and Tell 4
- Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Masanori Morise, Hideki Banno, Toshio Irino:
Interactive and Real-Time Acoustic Measurement Tools for Speech Data Acquisition and Presentation: Application of an Extended Member of Time Stretched Pulses. 4853-4854 - Daniel Tihelka, Markéta Rezácková, Martin Gruber, Zdenek Hanzlícek, Jakub Vít, Jindrich Matousek:
Save Your Voice: Voice Banking and TTS for Anyone. 4855-4856 - Yang Zhang, Evelina Bakhturina, Boris Ginsburg:
NeMo (Inverse) Text Normalization: From Development to Production. 4857-4859 - Corentin Hembise, Lucile Gelin, Morgane Daniel:
Lalilo: A Reading Assistant for Children Featuring Speech Recognition-Based Reading Mistake Detection. 4860-4861 - Manh Hung Nguyen, Vu Hoang, Tu Anh Nguyen, Trung H. Bui:
Automatic Radiology Report Editing Through Voice. 4862-4863 - Ke Shi, Kye Min Tan, Huayun Zhang, Siti Umairah Md. Salleh, Shikang Ni, Nancy F. Chen:
WittyKiddy: Multilingual Spoken Language Learning for Kids. 4864-4865 - Chunxiang Jin, Minghui Yang, Zujie Wen:
Duplex Conversation in Outbound Agent System. 4866-4867 - Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Web Interface for Estimating Articulatory Movements in Speech Production from Acoustics and Text. 4868-4869