Direct Segmentation Models for Streaming Speech Translation
Published in EMNLP, 2020
Machine Translation (MT) systems are trained on full sentences, but in the cascaded Speech Translation scenario the output of the ASR system does not necessarily form sentences, which hampers translation quality. This publication introduces a streaming-ready segmenter that is applied to the output of the ASR system in order to maximize downstream translation quality.
This is how I described this publication in my thesis:
This paper studies how to optimize the processing and segmentation of the ASR system output so that the downstream MT performance is maximized. Specifically, this publication focuses on the segmentation problem for the streaming scenario. We introduce a novel neural segmenter architecture, Direct Segmentation (DS), which treats the segmentation process as a classification problem. Using a sliding window approach, for every position of the ASR stream, the segmenter decides whether or not to produce a chunk by using a fixed local history and a small look-ahead window. The performance of this approach is evaluated on the previously introduced Europarl-ST corpus, by training an offline MT system and testing its performance when combined with different segmenters, for the English ↔ {German, French, Spanish} directions. Experiments are also performed showing that adding audio features to the segmenter improves performance.
The proposed architecture is computationally efficient while outperforming other segmentation approaches, and it works out of the box in the streaming scenario. Additionally, the work studies how the MT training data should be processed so that it better matches the ASR transcriptions, avoiding the need for an intermediate inverse text normalization step.
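The sliding-window decision described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the names `segment_stream` and `toy_split_probability`, the window sizes, the 0.5 threshold, and the rule-based scorer standing in for the neural classifier are all assumptions made for this example.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

# Hypothetical signature: (history words, look-ahead words) -> split probability.
Classifier = Callable[[List[str], List[str]], float]


def toy_split_probability(history: List[str], future: List[str]) -> float:
    """Stand-in for the neural classifier (an assumption, NOT the paper's model).

    Scores a split as likely when the next word looks like a clause opener.
    """
    openers = {"but", "so", "then", "however"}
    return 0.9 if future and future[0] in openers else 0.1


def segment_stream(
    words: Iterable[str],
    classifier: Classifier,
    history_size: int = 4,   # assumed size of the fixed local history
    future_size: int = 2,    # assumed size of the look-ahead window
    threshold: float = 0.5,  # assumed decision threshold
) -> Iterator[List[str]]:
    """Yield chunks of words, deciding after each word whether to split."""
    history: Deque[str] = deque(maxlen=history_size)  # last words already emitted
    pending: Deque[str] = deque()  # current word followed by its look-ahead
    chunk: List[str] = []

    for word in words:
        pending.append(word)
        if len(pending) <= future_size:
            continue  # the look-ahead window is not full yet
        current = pending.popleft()
        chunk.append(current)
        history.append(current)
        # Binary decision: does a chunk end right after `current`?
        if classifier(list(history), list(pending)) >= threshold:
            yield chunk
            chunk = []

    # End of stream: decide the remaining words with a shrinking look-ahead.
    while pending:
        current = pending.popleft()
        chunk.append(current)
        history.append(current)
        if pending and classifier(list(history), list(pending)) >= threshold:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # flush whatever is left as the final chunk


if __name__ == "__main__":
    stream = "we will vote tomorrow but the report is not ready so we wait".split()
    for piece in segment_stream(stream, toy_split_probability):
        print(" ".join(piece))
```

In this sketch each split decision waits until `future_size` further words have arrived, so the look-ahead window directly sets how far the segmenter lags behind the ASR stream.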