There is often a trade-off between performance and latency in streaming
automatic speech recognition (ASR). Traditional methods, such as look-ahead
and chunk-based approaches, usually require information from future frames to
improve recognition accuracy, which incurs inevitable latency even if the computation
is fast enough. A causal model that computes without any future frames can
avoid this latency, but its performance is significantly worse than that of
traditional methods. In this paper, we propose revision strategies to improve
the causal model. First, we introduce a real-time encoder-state revision
strategy that modifies previous states: encoder forward computation starts as
soon as data is received, and earlier encoder states are revised a few frames
later, with no need to wait for any right context. Furthermore, a CTC spike
position alignment decoding algorithm is designed to reduce the time cost
introduced by the revision strategy. All experiments are conducted on the
Librispeech dataset. By fine-tuning the CTC-based wav2vec2.0 model, our best
method achieves WERs of 3.7/9.2 on the test-clean/other sets, which is
competitive with chunk-based methods and knowledge distillation methods.
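
To illustrate the idea behind spike-position alignment, the following is a minimal sketch, not the paper's implementation: greedy CTC decoding that records "spike" positions (frame indices where a non-blank token is first emitted). When the encoder revises only its most recent states, tokens whose spikes precede the revised region can be kept, and decoding restarts just after the last stable spike instead of from frame 0. All function names and the toy vocabulary here are assumptions for illustration.

```python
import numpy as np

BLANK = 0  # CTC blank token id (assumed convention)

def ctc_greedy_with_spikes(log_probs, prev=BLANK):
    """Greedy CTC decode over (T, V) per-frame log-probabilities.
    Returns (tokens, spikes): collapsed non-blank token ids and the
    frame index at which each token's spike occurred."""
    tokens, spikes = [], []
    for t, frame in enumerate(log_probs):
        tok = int(frame.argmax())
        if tok != BLANK and tok != prev:  # new spike: non-blank, not a repeat
            tokens.append(tok)
            spikes.append(t)
        prev = tok
    return tokens, spikes

def redecode_from_revision(log_probs, spikes, tokens, first_revised_frame):
    """After the encoder revises frames from `first_revised_frame` onward,
    keep tokens whose spikes lie before that region and re-decode the rest,
    starting just after the last stable spike."""
    keep = sum(1 for s in spikes if s < first_revised_frame)
    start = spikes[keep - 1] + 1 if keep else 0
    prev = tokens[keep - 1] if keep else BLANK
    new_tokens, new_spikes = ctc_greedy_with_spikes(log_probs[start:], prev)
    return tokens[:keep] + new_tokens, spikes[:keep] + [s + start for s in new_spikes]
```

Because CTC outputs are dominated by blanks with sparse spikes, only the suffix after the last unrevised spike needs re-decoding, which is the source of the time savings the algorithm targets.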