Our linguists are skilled in understanding and interpreting day-to-day, conversational and nuanced language so you can improve your translation systems.
NLPC can create parallel corpora data sets from and into English for most of the world's languages. With a diverse team of linguists around the world, we have delivered projects in several ways for Big Tech companies building their own translation models. We also have ongoing contracts with some translation agencies to augment their data and provide IP services from their translation memories.
There are several methods for obtaining massive amounts of parallel text data for machine learning.
- Translation (clients provide the text to be translated and verified by humans)
If you have a certain amount of specific data to be translated to reinforce your models, NLPC can:
- Translate the original: our tools will ensure no machine translation is copied and pasted.
- Post-edit the original content with machine translation (yours or a third party's)
- Manual data augmentation
Our bilingual experts can manually augment the data you provide (TSV, CSV, XLSX, TMX or XLIFF files, or any format or database with delimited values) by editing both the source and the target. This type of data augmentation is laborious but extremely useful when you are dealing with idiomatic expressions for which few examples exist and which are needed to fine-tune the models.
- NLPC collects original in-domain samples and translates them
We can run data collection services in the domain you desire (social media, dialogs, messaging, email-type communications, etc.) and set up a translation or post-editing workflow according to your specifications.
- Our stock parallel data (own repositories)
NLPC has acquired exclusive rights over translation memories from translation companies. These translation memories have been duly anonymized, augmented, segmented and shuffled so that the resulting corpus is completely free of copyright and IP. In addition, NLPC has added its own translations resulting from its ongoing stock creation.
Parallel corpora data sets are divided thematically (per domain) and also into:
- English Parallel Corpora (source or target)
- Non-English Parallel Text Data (English to Vietnamese, Spanish into Japanese, or Brazilian Portuguese to Polish, Russian or Arabic)
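For readers who prepare their own delimited parallel data, part of the preparation mentioned above (segment pairs that are cleaned and shuffled) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NLPC's actual tooling: the function names `load_pairs` and `dedupe_and_shuffle` and the sample sentences are hypothetical, the input is assumed to be tab-separated text with the source in the first column and the target in the second, and anonymization is left out entirely because it depends on the data.

```python
import csv
import io
import random


def load_pairs(tsv_text):
    """Read (source, target) pairs from tab-separated text.

    Rows with a missing or empty source or target are skipped.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pairs = []
    for row in reader:
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def dedupe_and_shuffle(pairs, seed=0):
    """Drop exact duplicate pairs, then shuffle so document order is lost."""
    unique = list(dict.fromkeys(pairs))  # removes duplicates, keeps first occurrence
    rng = random.Random(seed)            # fixed seed makes the shuffle reproducible
    rng.shuffle(unique)
    return unique


# Hypothetical sample: one duplicate pair and one row with an empty target.
sample = "Hello\tHola\nGood morning\tBuenos días\nHello\tHola\nThanks\t\n"
corpus = dedupe_and_shuffle(load_pairs(sample), seed=42)
```

Shuffling at the segment level means the original documents cannot be read back in order from the corpus, while each source and target pair stays aligned.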