Parallel Text Data for Machine Learning (Translation)

Our linguists are skilled in understanding and interpreting day-to-day, conversational and nuanced language so you can improve your translation systems.
NLPC can create parallel corpus data sets from and into English for most of the world's languages. With a diverse team of linguists around the world, we have delivered projects in several ways for Big Tech companies building their own translation models. We also have ongoing contracts with translation agencies to augment their data and provide IP services based on their translation memories.
There are several ways to obtain large amounts of parallel text data for machine learning.

  • Translation (clients provide the text, which is translated and verified by humans)

If you have a specific set of data to be translated to reinforce your models, NLPC can:

  • Translate the original: our tools ensure that no machine translation is copied and pasted.
  • Post-edit the original content with machine translation (yours or a third party's).

  • Manual data augmentation

Our bilingual experts can manually augment the data you provide (TSV, CSV, XLSX, TMX, XLIFF, or any format or database with delimited values) by editing both the source and the target. This type of data augmentation is laborious but extremely useful for idiomatic expressions that have few examples and that are needed to fine-tune the models.

  • NLPC collects original in-domain samples and translates them

We can run data collection services in the domain you need (social media, dialogs, messaging, email-type communications, etc.) and set up a translation or post-editing workflow according to your specifications.

  • Our stock parallel data (own repositories)

NLPC has acquired exclusive rights over translation memories from translation companies. These translation memories have been duly anonymized, augmented, segmented and shuffled so that the resulting corpus is completely free of copyright and IP restrictions. In addition, NLPC has added its own translations resulting from its ongoing stock creation.
Parallel corpus data sets are divided thematically (by domain) and also into:
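For readers who work with translation memories themselves, the kind of preparation described above can be sketched in a few lines of Python. This is an illustration only and not NLPC's actual pipeline: the function names (`read_tmx_pairs`, `mask_emails`, `prepare_corpus`), the `<EMAIL>` placeholder, the sample TMX content and the idea that masking e-mail addresses stands in for anonymization are all assumptions made for the example. Real anonymization covers far more than e-mail addresses.

```python
import random
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Deliberately simple pattern; an assumption for this sketch only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have non-empty text in both languages.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def mask_emails(text):
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL.sub("<EMAIL>", text)


def prepare_corpus(tmx_text, src_lang, tgt_lang, seed=0):
    """Mask e-mail addresses on both sides, then shuffle the pairs."""
    pairs = [
        (mask_emails(src), mask_emails(tgt))
        for src, tgt in read_tmx_pairs(tmx_text, src_lang, tgt_lang)
    ]
    # Shuffling removes the original document order of the segments.
    random.Random(seed).shuffle(pairs)
    return pairs


# Hypothetical sample data, invented for this example.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    <tuv xml:lang="es"><seg>Buenos días.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Write to ana@example.com.</seg></tuv>
    <tuv xml:lang="es"><seg>Escriba a ana@example.com.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Thank you.</seg></tuv>
    <tuv xml:lang="es"><seg>Gracias.</seg></tuv></tu>
</body></tmx>"""

if __name__ == "__main__":
    for source, target in prepare_corpus(SAMPLE_TMX, "en", "es"):
        print(f"{source}\t{target}")
```

The output is tab-separated source and target text, one pair per line, which matches the delimited formats mentioned earlier on this page. A fixed seed keeps the shuffle reproducible between runs.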

  • English Parallel Corpora (source or target)
  • Non-English Parallel Text Data: English to Vietnamese, Spanish into Japanese, or Brazilian Portuguese to Polish, Russian or Arabic.

Why Choose Us


We Understand You

Our team is made up of machine learning and deep learning engineers, linguists, and software professionals with years of experience in developing machine translation and other NLP systems.

We don’t just sell data – we understand your business case.

Extend Your Team

Our worldwide teams have been carefully selected and have served hundreds of clients across thousands of use cases, from the simple to the most demanding.

Quality that Scales

We have a proven record of delivering accurate data securely, on time and on budget. Our processes are designed to scale and to adapt to your growing needs and projects.

Predictability Through a Subscription Model

Do you need a regular supply of annotated data? Are you working within a yearly budget? Our contract terms include everything you need to predict ROI, with predictable hourly pricing designed to remove the risk of hidden costs.