Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

Shi, Bowen; Settle, Shane; Livescu, Karen

arXiv:2007.00183 (eess)

[Submitted on 1 Jul 2020 (v1), last revised 24 Nov 2020 (this version, v2)]

Abstract: Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.
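
To make the decoding problem concrete, the following is a minimal CPU sketch of Viterbi decoding for a generic segmental model, written with NumPy. It is not the authors' GPU implementation and does not reproduce their segment scoring function. The function name `segmental_viterbi`, the layout of `seg_scores`, and the toy numbers are assumptions made only for illustration. The sketch assumes every (start frame, segment length, word) triple already has a score and that segments are capped at a maximum length.

```python
import numpy as np


def segmental_viterbi(seg_scores):
    """Best segmentation and labeling under a segmental model (illustrative sketch).

    seg_scores[s, l - 1, v] is the (hypothetical) score of labeling the
    segment that starts at frame s and spans l frames with word v.
    Returns (best_score, segments), where each segment is
    (start, end_exclusive, word_id).
    """
    num_frames, max_len, _vocab = seg_scores.shape
    # best[t] is the score of the best segmentation of frames 0..t-1.
    best = np.full(num_frames + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (num_frames + 1)

    for t in range(1, num_frames + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length
            # Choosing the best word for this segment costs O(vocabulary),
            # which is why large word vocabularies make decoding expensive.
            v = int(np.argmax(seg_scores[s, length - 1]))
            cand = best[s] + seg_scores[s, length - 1, v]
            if cand > best[t]:
                best[t] = cand
                back[t] = (s, v)

    # Follow back-pointers from the last frame to recover the segments.
    segments = []
    t = num_frames
    while t > 0:
        s, v = back[t]
        segments.append((s, t, v))
        t = s
    segments.reverse()
    return float(best[num_frames]), segments


# Toy input: 3 frames, segments of up to 2 frames, 2 words.
scores = np.full((3, 2, 2), -5.0)
scores[0, 1, 1] = 2.0  # frames 0-1 labeled with word 1
scores[2, 0, 0] = 1.0  # frame 2 labeled with word 0
best_score, segments = segmental_viterbi(scores)
```

The loop visits every end frame, every allowed segment length, and (inside `argmax`) every word, so the work grows with the vocabulary size. This is the cost the abstract refers to when it says the number of paths is proportional to the vocabulary.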

Comments:	SLT 2021
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2007.00183 [eess.AS]
	(or arXiv:2007.00183v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2007.00183

Submission history

From: Bowen Shi
[v1] Wed, 1 Jul 2020 02:22:09 UTC (344 KB)
[v2] Tue, 24 Nov 2020 17:03:52 UTC (432 KB)
