Abstract
Today's synthetic voices are largely based on diphone synthesis (DiSyn) and unit selection synthesis (UnitSyn). In most DiSyn systems, prosodic envelopes are generated with formal models, while UnitSyn systems refer to extensive, highly indexed sound databases. Each approach has its drawbacks, such as low naturalness (DiSyn) and dependence on huge amounts of background data (UnitSyn). We present a hybrid model based on high-level speech data. As preliminary tests show, prosodic models combining DiSyn style at the phone level with UnitSyn style at the supra-segmental levels may approach UnitSyn quality on a DiSyn footprint. Our test data are Danish, but our algorithm is language neutral.
Original language | English |
---|---|
Title of host publication | SLTC 2012 : Proceedings of the Conference |
Number of pages | 2 |
Place of Publication | Lund |
Publisher | Lund University |
Publication date | 2012 |
Pages | 37-38 |
Publication status | Published - 2012 |
Event | The Fourth Swedish Language Technology Conference 2012 - Lund, Sweden; Duration: 24 Oct 2012 → 26 Oct 2012; Conference number: 4 |
Conference
Conference | The Fourth Swedish Language Technology Conference 2012 |
---|---|
Number | 4 |
Country/Territory | Sweden |
City | Lund |
Period | 24/10/2012 → 26/10/2012 |