Automatic Phonetic Transcription for Danish Speech Recognition

Andreas Søeborg Kirkedal

    Research output: Contribution to conference › Conference abstract for conference › Research › peer-review

    Abstract

    Automatic speech recognition (ASR) uses dictionaries that map orthographic words to their phonetic representation. To minimize the occurrence of out-of-vocabulary words, ASR requires large phonetic dictionaries to model pronunciation. Hand-crafted, high-quality phonetic dictionaries are difficult to acquire and expensive to create. For languages with productive compounding, like German, or agglutinative languages, like Finnish, phonetic dictionaries are also hard to maintain. For this reason, automatic phonetic transcription tools have been produced for many languages. The quality of automatic phonetic transcriptions varies greatly with respect to language and transcription strategy. For some languages, where the difference between the graphemic and phonetic representations is small, graphemic transcriptions can be used to create ASR systems with acceptable performance. In other languages, like Danish, the graphemic and phonetic representations are very dissimilar, and more complex rewriting rules must be applied to create the correct phonetic representation. Automatic phonetic transcribers use different strategies, from deep analysis to shallow rewriting rules, to produce phonetic representations, e.g. morphological analysis, decompounding and letter-to-sound rules. Two different phonetic transcribers for Danish will be compared in this study: eSpeak (Duddington, 2010) and Phonix (Henrichsen, 2014). Both transcribers produce a richer transcription than ASR can utilise, including stress, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Filtering out the symbols for suprasegmental features in a post-processing step simplifies the transcriptions into a format that is suitable for ASR purposes.
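The post-processing step described above could be sketched roughly as follows. This is a minimal illustration, not the filter used in the study: the function name `strip_suprasegmentals` and the symbol set are assumptions, loosely modelled on X-SAMPA-style markers, and the real inventory depends on each transcriber's output alphabet.

```python
# Hypothetical marker inventory (an assumption, not taken from eSpeak or Phonix):
# '"' primary stress, '%' secondary stress, '.' syllable boundary, '?' stød.
SUPRASEGMENTALS = frozenset({'"', "%", ".", "?"})


def strip_suprasegmentals(transcription, symbols=SUPRASEGMENTALS):
    """Remove suprasegmental marker characters from a space-separated phone string."""
    phones = []
    for token in transcription.split():
        # Drop marker characters whether they stand alone or are attached to a phone.
        cleaned = "".join(ch for ch in token if ch not in symbols)
        if cleaned:
            phones.append(cleaned)
    return " ".join(phones)


# Illustrative input only; not the actual output of either transcriber.
print(strip_suprasegmentals('" h u ? s'))
```

Passing the symbol set as a parameter keeps the filter independent of any one phonetic alphabet, so the same function could be reused with a different inventory per transcriber.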
    eSpeak is an open source speech synthesizer originally created for English and now extended to cover 50 languages. Due to the nature of open source software, the quality of support for each language depends greatly on who encoded it. The Danish version was created by a Danish native speaker and contains more than 8,600 spelling-to-phoneme rules, plus more than 11,000 additional rules for particular words and word classes. In comparison, English has 5,852 spelling-to-phoneme rules and 4,133 additional rules, and […] 8,278 rules and 3,829 additional rules.
    Phonix applies deep morphological analysis as a preprocessing step. Should the analysis fail, Phonix applies several fallback strategies: dictionary lookup, compound splitting and letter-to-sound rules. Phonix and eSpeak will be compared in an ASR scenario using the Kaldi speech recognition toolkit (Povey et al., 2011), and both will be compared to a graphemic baseline. A mapping between the two phonetic alphabets, which are both based on X-SAMPA or IPA (Wells et al., 1997), will also be produced. Using this mapping, a dictionary with multiple pronunciations and a hybrid system, in which eSpeak functions as a fallback strategy in Phonix, will be compared as well.
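The two combined systems mentioned above, a hybrid with a fallback transcriber and a dictionary with multiple pronunciations, might be sketched as follows. Everything here is hypothetical: the transcribers are modelled as plain functions that return a space-separated phone string or `None` on failure, and `fallback_map` stands in for the phone-alphabet mapping, whose actual content is not given in the abstract. The output lines follow the common Kaldi lexicon layout of one `word phone phone ...` line per pronunciation.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A transcriber returns a space-separated phone string, or None if it fails.
Transcriber = Callable[[str], Optional[str]]


def map_phones(transcription: str, phone_map: Dict[str, str]) -> str:
    """Rewrite each phone via phone_map; phones without an entry pass through."""
    return " ".join(phone_map.get(p, p) for p in transcription.split())


def hybrid_transcribe(word: str, primary: Transcriber, fallback: Transcriber,
                      fallback_map: Dict[str, str]) -> Optional[str]:
    """Use the primary transcriber; if it fails, use the mapped fallback output."""
    result = primary(word)
    if result is not None:
        return result
    result = fallback(word)
    return None if result is None else map_phones(result, fallback_map)


def multi_pronunciation_lexicon(words: Iterable[str], primary: Transcriber,
                                fallback: Transcriber,
                                fallback_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect the distinct pronunciations that the two transcribers propose."""
    lexicon: Dict[str, List[str]] = {}
    for word in words:
        candidates = [primary(word)]
        other = fallback(word)
        if other is not None:
            candidates.append(map_phones(other, fallback_map))
        prons: List[str] = []
        for pron in candidates:
            # Skip failures and duplicates, keeping first-seen order.
            if pron is not None and pron not in prons:
                prons.append(pron)
        if prons:
            lexicon[word] = prons
    return lexicon


def to_lexicon_lines(lexicon: Dict[str, List[str]]) -> List[str]:
    """Format the lexicon as one 'word phone phone ...' line per pronunciation."""
    return [f"{word} {pron}" for word in sorted(lexicon) for pron in lexicon[word]]
```

In the hybrid system the fallback is consulted only when the primary transcriber produces nothing, whereas the multi-pronunciation dictionary always queries both and keeps every distinct result, so the two sketches differ in lexicon size as well as coverage.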
    Original language: English
    Publication date: 2014
    Number of pages: 1
    Publication status: Published - 2014
    Event: 2014 CRITT - WCRE Conference: Translation in Transition: Between Cognition, Computing and Technology - Copenhagen Business School, Frederiksberg, Denmark
    Duration: 30 Jan 2014 to 31 Jan 2014
    http://bridge.cbs.dk/platform/?q=conference2014

    Cite this

    Kirkedal, A. S. (2014). Automatic Phonetic Transcription for Danish Speech Recognition. Abstract from 2014 CRITT - WCRE Conference, Frederiksberg, Denmark.