There is often a trade-off between performance and latency in streaming
automatic speech recognition (ASR). Traditional methods, such as look-ahead
and chunk-based approaches, usually require information from future frames to
improve recognition accuracy, which incurs inevitable latency even if the computation
is fast enough. A causal model that computes without any future frames can
avoid this latency, but its performance is significantly worse than that of
traditional methods. In this paper, we propose revision strategies to improve
the causal model. First, we introduce a real-time encoder-state revision
strategy that modifies previous states: encoder forward computation starts as
soon as data is received, and earlier encoder states are revised a few frames
later, with no need to wait for any right context. Furthermore, a CTC spike
position alignment decoding algorithm is designed to reduce the time cost
introduced by the revision strategy. All experiments are conducted on the
Librispeech dataset. By fine-tuning the CTC-based wav2vec2.0 model, our best
method achieves WERs of 3.7/9.2 on the test-clean/other sets, which is
competitive with chunk-based methods and knowledge distillation methods.
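
To illustrate the idea behind spike-position alignment, the following is a minimal sketch, not the paper's implementation: greedy CTC decoding that records "spike" positions (frame indices where a non-blank token is first emitted). When the encoder revises only its most recent states, tokens whose spikes precede the revised region can be kept, and decoding restarts just after the last stable spike instead of from frame 0. All function names and the toy vocabulary here are assumptions for illustration.

```python
import numpy as np

BLANK = 0  # CTC blank token id (assumed convention)

def ctc_greedy_with_spikes(log_probs, prev=BLANK):
    """Greedy CTC decode over (T, V) per-frame log-probabilities.
    Returns (tokens, spikes): collapsed non-blank token ids and the
    frame index at which each token's spike occurred."""
    tokens, spikes = [], []
    for t, frame in enumerate(log_probs):
        tok = int(frame.argmax())
        if tok != BLANK and tok != prev:  # new spike: non-blank, not a repeat
            tokens.append(tok)
            spikes.append(t)
        prev = tok
    return tokens, spikes

def redecode_from_revision(log_probs, spikes, tokens, first_revised_frame):
    """After the encoder revises frames from `first_revised_frame` onward,
    keep tokens whose spikes lie before that region and re-decode the rest,
    starting just after the last stable spike."""
    keep = sum(1 for s in spikes if s < first_revised_frame)
    start = spikes[keep - 1] + 1 if keep else 0
    prev = tokens[keep - 1] if keep else BLANK
    new_tokens, new_spikes = ctc_greedy_with_spikes(log_probs[start:], prev)
    return tokens[:keep] + new_tokens, spikes[:keep] + [s + start for s in new_spikes]
```

Because CTC outputs are dominated by blanks with sparse spikes, only the suffix after the last unrevised spike needs re-decoding, which is the source of the time savings the algorithm targets.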