Parallel data, also known as parallel corpora, are collections of sentences paired with their translations. These datasets are used to train and evaluate machine translation models.
Parallel data can be created manually, automatically, or synthetically from monolingual data. Manual sources include human translation (traditional translation services) and human post-editing. Automatic methods crawl text in two languages and align it so that each sentence is matched with its translation. Synthetic techniques such as back-translation or back-copying generate the missing side of each pair from monolingual text.
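As a rough illustration of back-translation, the sketch below pairs each genuine target-language sentence with a machine-made source sentence. The function names and the toy dictionary are hypothetical stand-ins invented for this example; in practice, translate_to_source would wrap a real target-to-source translation model.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language text."""
    pairs = []
    for tgt in target_sentences:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip blank lines in the monolingual data
        src = translate_to_source(tgt).strip()
        if src:  # keep the pair only if the model produced something
            pairs.append((src, tgt))
    return pairs


# Toy stand-in for a real French-to-English MT model (an assumption).
_TOY_FR_TO_EN = {
    "Bonjour.": "Hello.",
    "Merci beaucoup.": "Thank you very much.",
}


def toy_french_to_english(sentence: str) -> str:
    # Returns an empty string when the toy "model" has no translation.
    return _TOY_FR_TO_EN.get(sentence, "")


french_monolingual = ["Bonjour.", "Merci beaucoup.", "Phrase inconnue."]
synthetic_pairs = back_translate(french_monolingual, toy_french_to_english)
```

The target side of every pair is human-written text and only the source side is synthetic, which is why the technique is typically used to add training data for the source-to-target direction.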
However, the availability of parallel data poses several challenges. While it is readily accessible for widely used language pairs, it is often unavailable for less common ones. Additionally, parallel data can contain errors such as misaligned sentences, faulty sentence segmentation, incorrect encodings, or a mixture of languages. These errors degrade what algorithms learn from the examples and, in turn, the quality of machine translation output. “Data cleansing” is the term for the filtering methods applied to parallel data (parallel corpora) before it becomes part of a training set.
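A minimal sketch of such a cleansing pass is shown below. The function name, the thresholds, and the specific heuristics are assumptions chosen for illustration, not a standard recipe. Word counts rely on whitespace splitting, so the length checks would need a proper tokenizer for languages written without spaces, and detecting mixed languages would need a language identifier, which is omitted here.

```python
import unicodedata
from typing import Iterable, List, Tuple


def clean_parallel(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 2.5,  # assumed threshold, tune per language pair
    max_words: int = 200,    # assumed threshold
) -> List[Tuple[str, str]]:
    """Filter (source, target) pairs with a few simple heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalize Unicode so visually identical strings compare equal.
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        if not src or not tgt:
            continue  # one side is empty
        if "\ufffd" in src or "\ufffd" in tgt:
            continue  # replacement character signals an encoding error
        if src == tgt:
            continue  # source copied to target, probably untranslated
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_words or n_tgt > max_words:
            continue  # overly long lines often mean bad segmentation
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # very different lengths suggest misalignment
        if (src, tgt) in seen:
            continue  # exact duplicate pair
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


raw_pairs = [
    ("Hello world", "Bonjour le monde"),
    ("", "Vide"),
    ("Hello world", "Bonjour le monde"),
    ("Yes", "Oui, bien sûr, je suis tout à fait d'accord avec vous"),
    ("Same text", "Same text"),
    ("Bad \ufffd bytes", "Mauvais octets"),
]
cleaned = clean_parallel(raw_pairs)
```

Only the first pair survives in this example: the others are dropped for being empty, duplicated, badly mismatched in length, identical on both sides, or carrying an encoding error.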
What is Parallel Data Used For?
The primary purpose of parallel data was to train statistical machine translation (SMT) systems in the first decades of the 2000s and, since about 2017, neural machine translation (NMT) engines.
With the emergence of SMT and, later, NMT systems, parallel corpora have become highly sought-after data. They are vital assets for training machine translation systems and valuable resources for other Artificial Intelligence (AI) applications that depend on multilingual datasets, such as Natural Language Generation (NLG).
Are comparable corpora the same as parallel corpora?
No, comparable corpora are not the same as parallel corpora. While parallel corpora consist of aligned sentence pairs in different languages, comparable corpora contain texts in different languages or language variants that cover similar topics or domains but are not direct translations of each other, so there is no explicit alignment between specific sentences.
Comparable corpora are often used for tasks such as cross-lingual information retrieval, cross-lingual document classification, and cross-lingual summarization. They provide a valuable resource for studying language variations, dialects, and related languages, but they require additional preprocessing steps, such as sentence alignment or document alignment, to establish correspondences between the texts in different languages or language variants.