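Author: 林政源
Author (English): Chen Yuan Lin
Thesis title: 國語歌曲的歌聲合成
Thesis title (English): Singing Voice Synthesis of Mandarin Chinese Songs
Advisor: 張智星
Advisor (English): Roger Jang
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2001
Graduation academic year: 89 (ROC calendar)
Language: Chinese
Pages: 45
Keywords (Chinese): 基週同步合成疊加法; 交叉消退法; 重新取樣法; 剩餘誤差訊號; 連音; 抖音; 基週粹取; 基週標取
Keywords (English): Pitch Synchronous Overlap and Add; Cross-Fading; Resample; Residual signal; Coarticulation; Vibrato; Pitch tracking; Pitch mark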
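The first half of this thesis studies the preprocessing needed for speech synthesis, namely segmenting syllables and locating pitch periods: how to design endpoint detection, pitch tracking, and pitch marking. The second half examines methods used in speech synthesis to adjust pitch, including PSOLA (Pitch Synchronous Overlap and Add), cross-fading, resampling, and residual signal with PSOLA. Duration is stretched or compressed by a linear mapping that finds the waveform at the corresponding position and concatenates it. The thesis also discusses how to normalize amplitude (energy), how to simulate vibrato, and how to produce a karaoke-style echo. It further covers preliminary ideas on coarticulation, including an experiment that synthesizes the 411 Mandarin syllables from the 37 basic Zhuyin (注音) symbols, and finally attempts to synthesize songs from frog sounds. The implemented system already produces good results.
Perfect singing voice synthesis remains a distant goal. As with speech synthesis, many difficulties must still be overcome, but the applications it enables are very convenient. Imagine telling a computer which song you want to hear: the computer not only recognizes the spoken title but also uses singing voice synthesis to render the requested song. Would that not be a delight? The same technology could drive a virtual singer, combining the synthesized voice with matching mouth shapes, or be built into plush toys that, given a score or a song, immediately sing it in a voice like Donald Duck's. Many similar applications exist. The key question is whether synthesized output can become indistinguishable from a real human voice, and that day should not be far off.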
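The post-processing effects named in the summary above (energy normalization, vibrato, and echo) can be illustrated with a minimal Python sketch. The record does not say how the thesis implements them, so everything here is an assumption made for illustration: the function names, the sinusoidal vibrato model, the single-tap echo, the RMS-based normalization, and all default parameter values are hypothetical.

```python
import math


def vibrato_contour(base_hz, n_frames, frame_rate=100.0, rate_hz=6.0, depth=0.03):
    """Return a per-frame pitch contour that oscillates around base_hz.

    Assumed model: a sinusoidal deviation of +/- depth (as a fraction of
    base_hz) at rate_hz cycles per second, sampled frame_rate times per second.
    """
    return [
        base_hz * (1.0 + depth * math.sin(2.0 * math.pi * rate_hz * k / frame_rate))
        for k in range(n_frames)
    ]


def add_echo(samples, delay, gain=0.5):
    """Single-tap echo: y[n] = x[n] + gain * x[n - delay].

    The output is `delay` samples longer than the input so the echo tail
    is not cut off.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    out = list(samples) + [0.0] * delay
    for n, x in enumerate(samples):
        out[n + delay] += gain * x
    return out


def normalize_rms(samples, target_rms=0.1):
    """Scale a clip so its root-mean-square amplitude equals target_rms."""
    if not samples:
        return []
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # a silent clip cannot be rescaled
    scale = target_rms / rms
    return [x * scale for x in samples]
```

In a concatenative system, a contour like the one from `vibrato_contour` would typically serve as the per-frame target pitch for the pitch-shifting stage, while `normalize_rms` would be applied to each syllable clip before concatenation so that loudness stays even across clips.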
This thesis presents a concatenation-based singing voice synthesis system for Chinese songs. The system takes melody and lyrics information from a KAR file (a variant of MIDI that carries lyrics) and employs text-to-speech techniques to synthesize Chinese songs. Guided by the lyrics, the system first selects suitable syllable clips from a pre-recorded collection of all 411 syllables in Mandarin Chinese. It then applies the necessary pitch shift to the syllable clips using various methods, including PSOLA (Pitch Synchronous Overlap and Add), cross-fading, resampling, and residual signal with PSOLA. Time-scale modification is achieved by linear mapping and duplication of waveforms aligned to pitch marks. Once pitch and duration are correct, the resulting vocal clips are concatenated to form a complete vocal rendition of the song. To produce a natural-sounding singing voice, the system applies several post-processing methods, including the addition of coarticulation and vibrato and the use of energy normalization. Potential applications and future research directions are also covered in the thesis.
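Two of the steps described above, pitch shifting by resampling and time-scale modification by linear mapping over pitch-marked waveforms, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the resampler uses plain linear interpolation, and periods are joined by simple concatenation where a real system would more likely smooth the joins with overlap-add.

```python
def resample_pitch_shift(samples, ratio):
    """Shift pitch by `ratio` using linear-interpolation resampling.

    ratio > 1 raises the pitch and shortens the clip; ratio < 1 lowers the
    pitch and lengthens it. This duration change is the known side effect
    of plain resampling.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = int((len(samples) - 1) / ratio) + 1
    out = []
    for n in range(out_len):
        pos = n * ratio                      # fractional read position
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


def split_at_pitch_marks(samples, marks):
    """Cut a clip into pitch periods delimited by consecutive pitch marks."""
    return [samples[a:b] for a, b in zip(marks, marks[1:])]


def stretch_periods(periods, target_count):
    """Choose `target_count` periods by linearly mapping each output index
    to a source index, which duplicates periods to lengthen a clip and
    skips periods to shorten it."""
    if not periods or target_count <= 0:
        return []
    if target_count == 1:
        return [periods[0]]
    last = len(periods) - 1
    out = []
    for k in range(target_count):
        src = int(k * last / (target_count - 1) + 0.5)  # nearest source period
        out.append(periods[src])
    return out


def join_periods(periods):
    """Concatenate periods back into one sample sequence (no smoothing)."""
    return [s for p in periods for s in p]
```

For example, a clip with three pitch periods stretched to six periods repeats each period twice, which doubles the duration while leaving the length of each period, and therefore the pitch, unchanged.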
Chinese Abstract
English Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Research Direction
1.3 Characteristics of Singing Voice and Speech
1.4 Related Work
Chapter 2 Preprocessing for Singing Voice Synthesis
2.1 Endpoint Detection Methods
2.2 Pitch Tracking Methods
2.3 Pitch Mark Methods
Chapter 3 Basic Theory and Methods of Speech and Singing Voice Synthesis
3.1 Overview of Common Methods for Speech and Singing Voice Synthesis
3.2 Resample
3.3 PSOLA (Pitch Synchronous Overlap and Add)
3.4 Cross-fading
3.5 Residual signal with PSOLA
Chapter 4 Mandarin Song Synthesis System
Preface
4.1 Adjusting Pitch and Duration
4.2 Energy Adjustment
4.3 Coarticulation, Vibrato, and Echo Effects
4.4 Other Improvements
4.5 Experimental Results and Analysis
Chapter 5 Conclusions and Future Work
References