This thesis presents a concatenation-based singing voice synthesis system for Chinese songs. The system takes melody and lyrics information from a KAR file (a variant of MIDI that carries lyrics) and employs text-to-speech techniques to synthesize Chinese songs. Based on the lyrics, the system first selects the matching syllable clips from a pre-recorded collection of all 411 syllables in Mandarin Chinese. It then shifts the pitch of each syllable clip to the target note using one of several methods, including PSOLA (Pitch Synchronous Overlap and Add), cross-fading, resampling, and residual signal with PSOLA. Time-scale modification is then achieved by a linear mapping and duplication of pitch-mark-aligned waveforms. Once each clip has the correct pitch and duration, the clips are concatenated to form a complete vocal rendition of the song. To make the singing voice sound natural, the system applies several post-processing steps: adding coarticulation and vibrato, and normalizing energy. Potential applications and future research directions are also covered in the thesis.
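As a rough illustration of the simplest pitch-shifting method named above (resampling), the following sketch maps a MIDI note number to a target frequency and resamples a clip by linear interpolation. It is not the thesis's implementation: the function names, the 220 Hz recorded pitch, the 16 kHz sampling rate, and the synthetic sine "syllable" are all assumptions made for the example.

```python
import math

def midi_to_hz(note):
    """Convert a MIDI note number to Hz (equal temperament, A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def resample_pitch_shift(samples, ratio):
    """Shift pitch by `ratio` using linear-interpolation resampling.

    Reading the clip `ratio` times faster multiplies its pitch by `ratio`,
    but also divides its length by `ratio`, so duration must be restored
    by a separate time-scale modification step.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio                      # fractional read position in the input
        k = int(pos)
        frac = pos - k
        nxt = samples[min(k + 1, len(samples) - 1)]  # clamp at the last sample
        out.append((1.0 - frac) * samples[k] + frac * nxt)
    return out

# Hypothetical usage: a 0.1 s sine stands in for a recorded syllable clip.
sr = 16000            # assumed sampling rate
recorded_hz = 220.0   # assumed pitch of the recorded clip
clip = [math.sin(2 * math.pi * recorded_hz * n / sr) for n in range(sr // 10)]

ratio = midi_to_hz(64) / recorded_hz   # target note E4 taken from the melody
shifted = resample_pitch_shift(clip, ratio)
print(len(clip), len(shifted))
```

Because resampling ties pitch and duration together (and also shifts the formants), pitch-synchronous methods such as PSOLA are generally preferred when the timbre of the syllable must be preserved.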
Chinese Abstract
English Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Direction
1.3 Overview of the Characteristics of Singing Voice and Speech
1.4 Related Work
Chapter 2 Preprocessing for Singing Voice Synthesis
2.1 Endpoint Detection Methods
2.2 Pitch Tracking Methods
2.3 Pitch Marking Methods
Chapter 3 Basic Theory and Methods of Speech and Singing Voice Synthesis
3.1 Overview of Common Methods for Speech and Singing Voice Synthesis
3.2 Resample
3.3 PSOLA (Pitch Synchronous Overlap and Add)
3.4 Cross-fading
3.5 Residual Signal with PSOLA
Chapter 4 Mandarin Song Synthesis System
Preface
4.1 Adjusting Pitch and Duration
4.2 Energy Adjustment
4.3 Effects of Coarticulation, Vibrato, and Echo
4.4 Other Improvements
4.5 Experimental Results and Analysis
Chapter 5 Conclusions and Future Work
References