With the rapid development of machine learning, supervised speech recognition has achieved good accuracy and has been integrated into everyday life. However, such supervised systems must be trained on large amounts of manually labeled data, and obtaining labeled data requires considerable resources. In contrast, in the era of big data, large amounts of unlabeled data are easy to obtain, which is why unsupervised speech recognition is both appealing and necessary. This thesis therefore starts from phoneme recognition and proposes two different unsupervised phoneme recognition architectures, both of which use the recently studied generative adversarial network (GAN) to achieve unsupervised learning.

Previous unsupervised speech processing techniques could only discover recurring acoustic patterns (speech tokens) in speech signals; they could not identify which words or phonemes those tokens correspond to. The first method proposed in this thesis therefore uses a generative adversarial network to learn the mapping relationship between speech tokens and phonemes, thereby achieving speech recognition.

However, the experiments here and in related studies show that the greatest difficulty in building an unsupervised speech recognition system is the variable length and segmental structure of the speech signal: each recognition unit, whether a word, a character, or a phoneme, corresponds to a continuous stretch of signal whose duration varies. The second method of this thesis therefore still learns with a generative adversarial network, but improves how the speech signal is processed and proposes a co-training scheme that also employs a hidden Markov model (HMM): the generative adversarial network and the hidden Markov model are trained alternately and collaboratively, which improves the overall recognition accuracy.
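As a rough, hypothetical illustration of the adversarial idea shared by both methods, the sketch below (in PyTorch, which the thesis does not necessarily use) trains a generator to map frame-level speech features to phoneme posteriors against a discriminator that has seen only unpaired phoneme sequences. All layer sizes and the toy random data are invented for the example, and the WGAN gradient penalty and the HMM co-training step are omitted for brevity; this is a minimal sketch, not the thesis's actual architecture.

```python
# Minimal, illustrative sketch of adversarial phoneme mapping (not the
# thesis implementation). Toy dimensions and data; all names hypothetical.
import torch
import torch.nn as nn

N_PHONES, FEAT_DIM, SEQ_LEN, BATCH = 48, 39, 100, 8

# Generator: per-frame speech features -> per-frame phoneme distributions.
generator = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, N_PHONES), nn.Softmax(dim=-1),
)

# Discriminator: scores a phoneme-distribution sequence as real text
# (unpaired phoneme sequences) versus generator output.
discriminator = nn.Sequential(
    nn.Linear(N_PHONES, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

for step in range(1000):
    # Toy stand-ins: unlabeled speech features and unpaired phoneme text
    # as one-hot sequences; a real system would load these from corpora.
    speech = torch.randn(BATCH, SEQ_LEN, FEAT_DIM)
    real_phones = torch.eye(N_PHONES)[torch.randint(N_PHONES, (BATCH, SEQ_LEN))]

    # Discriminator step: real phoneme sequences score high, generated ones
    # low (WGAN-style critic objective, gradient penalty omitted here).
    fake_phones = generator(speech).detach()
    d_loss = discriminator(fake_phones).mean() - discriminator(real_phones).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: produce phoneme sequences the discriminator accepts.
    g_loss = -discriminator(generator(speech)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

In the thesis's second method, a hidden Markov model is additionally trained on the transcriptions the generator produces, and the two models then refine each other in alternation; that co-training step is deliberately omitted from this sketch.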
Abstract (Chinese)
Abstract (English)
Chapter 1: Introduction
1.1 Research Motivation
1.2 Research Direction
1.3 Thesis Organization
Chapter 2: Background
2.1 Deep Neural Networks
2.1.1 Overview
2.1.2 Training Methods
2.1.3 Convolutional Neural Networks
2.1.4 Recurrent Neural Networks
2.2 Autoencoders
2.2.1 Overview
2.2.2 Variational Autoencoders
2.2.3 Sequence-to-Sequence Autoencoders
2.2.4 Audio Word Vectors (Audio Word2Vec)
2.3 Generative Adversarial Networks
2.3.1 Overview
2.3.2 Wasserstein GAN
2.3.3 Improved Wasserstein GAN
2.4 Conventional Speech Recognition
2.4.1 Overview
2.4.2 Feature Extraction
2.4.3 Acoustic Model
2.4.4 Language Model
2.4.5 Decoding
2.5 Audio Segmentation
2.6 Chapter Summary
Chapter 3: Learning Mapping Relationships with Generative Adversarial Networks
3.1 Overview
3.2 Model Architecture
3.2.1 Vectorization of Acoustic Tokens
3.2.2 Clustering of Audio Vectors
3.2.3 Generative Adversarial Network and the Mapping Relationship
3.3 Experimental Setup
3.3.1 Dataset
3.3.2 Model Configuration
3.3.3 Experiment Configuration
3.4 Experimental Results and Discussion
3.4.1 Analysis of Different Numbers of Clusters
3.4.2 Speech Recognition Accuracy
3.4.3 Comparison with Supervised Speech Recognition
3.5 Chapter Summary
Chapter 4: Co-training of Generative Adversarial Networks and Hidden Markov Models
4.1 Overview
4.2 Generative Adversarial Network
4.2.1 Generator
4.2.2 Discriminator
4.2.3 Training Objective
4.3 Hidden Markov Model Co-training
4.4 Experimental Setup
4.4.1 Dataset
4.4.2 Model Configuration
4.4.3 Experiment Configuration
4.5 Experimental Results and Discussion
4.5.1 Speech Recognition Accuracy
4.5.2 Ablation Study
4.5.3 Analysis of Parameters Related to Phoneme Segmentation Boundaries
4.5.4 Comparison with Supervised Speech Recognition Models
4.6 Chapter Summary
Chapter 5: Conclusion and Future Work
5.1 Contributions and Conclusions
5.2 Future Work
5.2.1 Applying the Methods to Larger Datasets
5.2.2 Improving the Generator Architecture
References