Author: 陳冠宇
Author (English): Kuan-Yu Chen
Title: 基於生成對抗網路之非監督式音素辨識
Title (English): Unsupervised Phoneme Recognition via Generative Adversarial Network
Advisor: 李琳山 (Lin-Shan Lee)
Committee: 陳信宏, 鄭秋豫, 王小川, 李宏毅
Oral Defense Date: 2019-07-09
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Communication Engineering (電信工程學研究所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2020
Academic Year of Graduation: 108
Language: Chinese
Pages: 74
Keywords (Chinese): 生成對抗網路、非監督式、語音辨識
Keywords (English): Generative Adversarial Network, Unsupervised, ASR
DOI: 10.6342/NTU202000709
Abstract: With the rapid development of machine learning, supervised speech recognition has achieved good accuracy and has long been integrated into everyday life. Such supervised systems rely on large amounts of manually labeled data to train their models, but obtaining labeled data often requires substantial resources. In contrast, in the era of Big Data, large quantities of unlabeled data are easy to obtain, which is why unsupervised speech recognition is both attractive and necessary. This thesis therefore starts with phoneme recognition and proposes two different unsupervised phoneme recognition architectures, both of which use the recently and widely studied Generative Adversarial Network (GAN) to achieve unsupervised learning. Previous unsupervised speech processing techniques could only discover similar speech tokens in speech signals; they had no way to identify which words or phonemes those tokens correspond to. The first method proposed in this thesis therefore uses a GAN to learn the mapping between speech tokens and phonemes, achieving the effect of speech recognition. However, experiments and related studies show that the greatest difficulty in building an unsupervised speech recognition system is the variable length and segmental structure of speech signals: each recognition unit, such as a character, word, or phoneme, corresponds to a continuous stretch of signal that can be long or short. The second method of this thesis still learns with a GAN, but improves the way speech signals are processed and further proposes a harmonized training procedure with a Hidden Markov Model (HMM), in which the GAN and the HMM are trained alternately to improve the overall recognition accuracy.
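The record contains no code, but the first architecture described in the abstract, a generator that maps clustered speech tokens to phoneme labels, adversarially matched against unpaired phoneme text, can be illustrated with a minimal sketch. The sketch below is an assumption-laden toy, not the thesis implementation: all module shapes, cluster/phoneme counts, and the random stand-in data are invented for illustration, and the vanilla GAN loss stands in for whatever objective the thesis actually uses.

```python
# Minimal sketch (NOT the thesis implementation) of architecture 1:
# a generator maps sequences of speech-token cluster indices to phoneme
# distributions; a discriminator, trained on unpaired real phoneme
# sequences, pushes the generator's outputs toward real phoneme text.
# All sizes and the random toy data are illustrative assumptions.
import torch
import torch.nn as nn

N_CLUSTERS, N_PHONEMES, SEQ_LEN, BATCH = 300, 48, 20, 32

class Generator(nn.Module):
    """Maps each cluster index to a distribution over phonemes."""
    def __init__(self):
        super().__init__()
        self.map = nn.Embedding(N_CLUSTERS, N_PHONEMES)
    def forward(self, clusters):                       # (B, T) int64
        return torch.softmax(self.map(clusters), -1)   # (B, T, P)

class Discriminator(nn.Module):
    """Scores whether a phoneme-distribution sequence looks like real text."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_PHONEMES, 64, batch_first=True)
        self.out = nn.Linear(64, 1)
    def forward(self, x):                              # (B, T, P)
        _, h = self.rnn(x)
        return self.out(h[-1]).squeeze(-1)             # (B,) logits

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    # Toy stand-ins: cluster sequences from audio, unpaired phoneme text.
    clusters = torch.randint(0, N_CLUSTERS, (BATCH, SEQ_LEN))
    real_ids = torch.randint(0, N_PHONEMES, (BATCH, SEQ_LEN))
    real = torch.nn.functional.one_hot(real_ids, N_PHONEMES).float()

    # Discriminator update: real phoneme text vs. generated distributions.
    fake = G(clusters).detach()
    loss_d = bce(D(real), torch.ones(BATCH)) + bce(D(fake), torch.zeros(BATCH))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: make its outputs indistinguishable from real text.
    loss_g = bce(D(G(clusters)), torch.ones(BATCH))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In practice, GANs of this kind tend to use a Wasserstein objective with gradient penalty (cf. Sections 2.3.2-2.3.3) rather than the vanilla loss above, since a standard discriminator can trivially separate one-hot real text from the generator's soft outputs. The second architecture wraps a loop around such a GAN: the GAN's transcriptions serve as labels for training an HMM, whose re-decoded output in turn refines the GAN's training targets, alternating until recognition accuracy converges.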
Abstract (Chinese) ..................................... i
Abstract (English) ..................................... ii
1. Introduction ........................................ 1
1.1 Motivation ......................................... 1
1.2 Research Direction ................................. 4
1.3 Thesis Organization ................................ 5
2. Background .......................................... 6
2.1 Deep Neural Networks ............................... 6
2.1.1 Overview ......................................... 6
2.1.2 Training ......................................... 8
2.1.3 Convolutional Neural Networks .................... 10
2.1.4 Recurrent Neural Networks ........................ 12
2.2 Autoencoders ....................................... 14
2.2.1 Overview ......................................... 14
2.2.2 Variational Autoencoders ......................... 15
2.2.3 Sequence-to-Sequence Autoencoders ................ 17
2.2.4 Audio Word Vectors ............................... 17
2.3 Generative Adversarial Networks .................... 19
2.3.1 Overview ......................................... 19
2.3.2 Wasserstein GAN .................................. 21
2.3.3 Improved Wasserstein GAN ......................... 22
2.4 Conventional Speech Recognition .................... 24
2.4.1 Overview ......................................... 24
2.4.2 Feature Extraction ............................... 24
2.4.3 Acoustic Models .................................. 25
2.4.4 Language Models .................................. 25
2.4.5 Decoding ......................................... 26
2.5 Audio Segmentation ................................. 27
2.6 Chapter Summary .................................... 29
3. Learning the Mapping with Generative Adversarial Networks ... 30
3.1 Overview ........................................... 30
3.2 Model Architecture ................................. 31
3.2.1 Vectorizing Speech Tokens ........................ 32
3.2.2 Clustering Audio Vectors ......................... 33
3.2.3 GAN and the Mapping Relationship ................. 36
3.3 Experimental Setup ................................. 39
3.3.1 Dataset .......................................... 39
3.3.2 Model Configuration .............................. 40
3.3.3 Experiment Configuration ......................... 40
3.4 Results and Discussion ............................. 41
3.4.1 Analysis of Different Numbers of Clusters ........ 41
3.4.2 Recognition Accuracy ............................. 46
3.4.3 Comparison with Supervised Speech Recognition .... 47
3.5 Chapter Summary .................................... 48
4. Harmonized Training of GAN and Hidden Markov Models .. 49
4.1 Overview ........................................... 49
4.2 Generative Adversarial Network ..................... 51
4.2.1 Generator ........................................ 52
4.2.2 Discriminator .................................... 53
4.2.3 Training Objectives .............................. 54
4.3 HMM Harmonized Training ............................ 56
4.4 Experimental Setup ................................. 57
4.4.1 Dataset .......................................... 57
4.4.2 Model Configuration .............................. 58
4.4.3 Experiment Configuration ......................... 60
4.5 Results and Discussion ............................. 61
4.5.1 Recognition Accuracy ............................. 61
4.5.2 Ablation Study ................................... 64
4.5.3 Analysis of Parameters Related to Phoneme Boundaries ... 65
4.5.4 Comparison with Supervised Speech Recognition Models ... 67
4.6 Chapter Summary .................................... 68
5. Conclusion and Future Work .......................... 69
5.1 Contributions and Conclusions ...................... 69
5.2 Future Work ........................................ 70
5.2.1 Scaling to Larger Datasets ....................... 70
5.2.2 Improving the Generator Architecture ............. 70
References ............................................. 71