Prosody dependent speech recognition on radio news corpus of American English

Ken Chen; Mark Hasegawa-Johnson; Aaron Cohen; Sarah Borys; Sung Suk Kim; Jennifer Cole; Jeung Yoon Choi

doi:10.1109/TSA.2005.853208

Research output: Contribution to journal › Article › peer-review

44 Scopus citations

Abstract

Does prosody help word recognition? This paper proposes a novel probabilistic framework in which word and phoneme models depend on prosody in a way that reduces word error rates (WER) relative to a prosody-independent recognizer with comparable parameter count. In the proposed prosody-dependent speech recognizer, word and phoneme models are conditioned on two important prosodic variables: the intonational phrase boundary and the pitch accent. An information-theoretic analysis is provided to show that prosody-dependent acoustic and language modeling can increase the mutual information between the true word hypothesis and the acoustic observation by exciting the interaction between the prosody-dependent acoustic model and the prosody-dependent language model. Empirically, results indicate that the influence of these prosodic variables on allophonic models is mainly restricted to a small subset of distributions: the duration PDFs (modeled using an explicit duration hidden Markov model, or EDHMM) and the acoustic-prosodic observation PDFs (normalized pitch frequency). The influence of prosody on cepstral features is limited to a subset of phonemes: for example, vowels may be influenced by both accent and phrase position, but phrase-initial and phrase-final consonants are independent of accent. Leveraging these results, effective prosody-dependent allophonic models are built with minimal increase in parameter count. These prosody-dependent speech recognizers are able to reduce word error rates by up to 11% relative to prosody-independent recognizers with comparable parameter count, in experiments based on the prosodically transcribed Boston Radio News corpus.
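The abstract reports its headline result as a relative WER reduction, which is easy to confuse with an absolute difference in percentage points. A minimal sketch of the distinction follows; the function name and the numeric values are illustrative assumptions, not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs chosen only for illustration; NOT results from the paper.
baseline = 0.25            # prosody-independent recognizer (assumed value)
prosody_dependent = 0.2225  # prosody-dependent recognizer (assumed value)

absolute = baseline - prosody_dependent
relative = relative_wer_reduction(baseline, prosody_dependent)
print(f"absolute reduction: {absolute:.4f}")
print(f"relative reduction: {relative:.1%}")
```

With these assumed inputs, the absolute drop is under three percentage points, while the relative reduction is about 11% of the baseline error.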

Original language	English (US)
Pages (from-to)	232-244
Number of pages	13
Journal	IEEE Transactions on Audio, Speech and Language Processing
Volume	14
Issue number	1
DOIs	https://doi.org/10.1109/TSA.2005.853208
State	Published - Jan 2006
Externally published	Yes

Keywords

ANN
Acoustic model
Duration
HMM
Mutual information
Pitch
Prosody
ToBI
Word error rate

ASJC Scopus subject areas

Acoustics and Ultrasonics
Electrical and Electronic Engineering

Cite this

@article{c09e2b8aef59419f9ead9c0c47321c4c,

title = "Prosody dependent speech recognition on radio news corpus of American English",

abstract = "Does prosody help word recognition? This paper proposes a novel probabilistic framework in which word and phoneme are dependent on prosody in a way that reduces word error rates (WER) relative to a prosody-independent recognizer with comparable parameter count. In the proposed prosody-dependent speech recognizer, word and phoneme models are conditioned on two important prosodic variables: the intonational phrase boundary and the pitch accent. An information-theoretic analysis is provided to show that prosody dependent acoustic and language modeling can increase the mutual information between the true word hypothesis and the acoustic observation by exciting the interaction between prosody dependent acoustic model and prosody dependent language model. Empirically, results indicate that the influence of these prosodic variables on allophonic models are mainly restricted to a small subset of distributions: the duration PDFs (modeled using an explicit duration hidden Markov model or EDHMM) and the acoustic-prosodic observation PDFs (normalized pitch frequency). Influence of prosody on cepstral features is limited to a subset of phonemes: for example, vowels may be influenced by both accent and phrase position, but phrase-initial and phrase-final consonants are independent of accent. Leveraging these results, effective prosody dependent allophonic models are built with minimal increase in parameter count. These prosody dependent speech recognizers are able to reduce word error rates by up to 11 % relative to prosody independent recognizers with comparable parameter count, in experiments based on the prosodically-transcribed Boston Radio News corpus.",

keywords = "ANN, Acoustic model, Duration, HMM, Mutual information, Pitch, Prosody, ToBI, Word error rate",

author = "Ken Chen and Mark Hasegawa-Johnson and Aaron Cohen and Sarah Borys and Kim, {Sung Suk} and Jennifer Cole and Choi, {Jeung Yoon}",

note = "Funding Information: Manuscript received July 1, 2003; revised August 29, 2004. This work was supported by the University of Illinois Critical Research Initiative and by NSF Award 0132900. Statements in this paper reflect the opinions and conclusions of the authors and are not endorsed by the NSF. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bayya Yegnanarayana.",

year = "2006",

month = jan,

doi = "10.1109/TSA.2005.853208",

language = "English (US)",

volume = "14",

pages = "232--244",

journal = "IEEE Transactions on Audio, Speech and Language Processing",

issn = "1558-7916",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "1",

}

TY - JOUR

T1 - Prosody dependent speech recognition on radio news corpus of American English

AU - Chen, Ken

AU - Hasegawa-Johnson, Mark

AU - Cohen, Aaron

AU - Borys, Sarah

AU - Kim, Sung Suk

AU - Cole, Jennifer

AU - Choi, Jeung Yoon

N1 - Funding Information: Manuscript received July 1, 2003; revised August 29, 2004. This work was supported by the University of Illinois Critical Research Initiative and by NSF Award 0132900. Statements in this paper reflect the opinions and conclusions of the authors and are not endorsed by the NSF. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bayya Yegnanarayana.

PY - 2006/1

Y1 - 2006/1

N2 - Does prosody help word recognition? This paper proposes a novel probabilistic framework in which word and phoneme are dependent on prosody in a way that reduces word error rates (WER) relative to a prosody-independent recognizer with comparable parameter count. In the proposed prosody-dependent speech recognizer, word and phoneme models are conditioned on two important prosodic variables: the intonational phrase boundary and the pitch accent. An information-theoretic analysis is provided to show that prosody dependent acoustic and language modeling can increase the mutual information between the true word hypothesis and the acoustic observation by exciting the interaction between prosody dependent acoustic model and prosody dependent language model. Empirically, results indicate that the influence of these prosodic variables on allophonic models are mainly restricted to a small subset of distributions: the duration PDFs (modeled using an explicit duration hidden Markov model or EDHMM) and the acoustic-prosodic observation PDFs (normalized pitch frequency). Influence of prosody on cepstral features is limited to a subset of phonemes: for example, vowels may be influenced by both accent and phrase position, but phrase-initial and phrase-final consonants are independent of accent. Leveraging these results, effective prosody dependent allophonic models are built with minimal increase in parameter count. These prosody dependent speech recognizers are able to reduce word error rates by up to 11 % relative to prosody independent recognizers with comparable parameter count, in experiments based on the prosodically-transcribed Boston Radio News corpus.

AB - Does prosody help word recognition? This paper proposes a novel probabilistic framework in which word and phoneme are dependent on prosody in a way that reduces word error rates (WER) relative to a prosody-independent recognizer with comparable parameter count. In the proposed prosody-dependent speech recognizer, word and phoneme models are conditioned on two important prosodic variables: the intonational phrase boundary and the pitch accent. An information-theoretic analysis is provided to show that prosody dependent acoustic and language modeling can increase the mutual information between the true word hypothesis and the acoustic observation by exciting the interaction between prosody dependent acoustic model and prosody dependent language model. Empirically, results indicate that the influence of these prosodic variables on allophonic models are mainly restricted to a small subset of distributions: the duration PDFs (modeled using an explicit duration hidden Markov model or EDHMM) and the acoustic-prosodic observation PDFs (normalized pitch frequency). Influence of prosody on cepstral features is limited to a subset of phonemes: for example, vowels may be influenced by both accent and phrase position, but phrase-initial and phrase-final consonants are independent of accent. Leveraging these results, effective prosody dependent allophonic models are built with minimal increase in parameter count. These prosody dependent speech recognizers are able to reduce word error rates by up to 11 % relative to prosody independent recognizers with comparable parameter count, in experiments based on the prosodically-transcribed Boston Radio News corpus.

KW - ANN

KW - Acoustic model

KW - Duration

KW - HMM

KW - Mutual information

KW - Pitch

KW - Prosody

KW - ToBI

KW - Word error rate

UR - http://www.scopus.com/inward/record.url?scp=33744970676&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33744970676&partnerID=8YFLogxK

U2 - 10.1109/TSA.2005.853208

DO - 10.1109/TSA.2005.853208

M3 - Article

AN - SCOPUS:33744970676

SN - 1558-7916

VL - 14

SP - 232

EP - 244

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

IS - 1

ER -
