Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates
Published in ICASSP, 2020
Speech Translation datasets are a scarce resource, and this greatly hampers research in the area. Europarl-ST, first released in 2019, was a game-changer for Speech Translation research, thanks to the wide range of languages covered and a careful filtering pipeline. Currently (early 2023), close to 100 publications have cited Europarl-ST.
This is how I described this publication in my thesis:
Current ST research is often hampered by the lack of specific data resources for this task, as currently available ST datasets are restricted to a limited set of language pairs. This work presents Europarl-ST, a novel multilingual ST corpus containing paired audio-text samples from and into 6 European languages (English, German, French, Spanish, Italian, Portuguese), for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. The corpus creation process is described in detail: the data has been carefully aligned and filtered in order to provide a reliable benchmark for streaming ST systems.
The paper presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource, carried out using the English, German, French and Spanish sets, for a total of 12 ST directions. The results show the usefulness of this resource for both domain adaptation and evaluation, and highlight some of the challenges to be solved on the road to streaming ST.