Jenny Smith, Author at Ethical, Task-Specific Data To Train Smarter AI

Data Strategies for Under-Resourced Languages

September 25, 2025

Artificial intelligence has transformed how we access knowledge and connect across languages. But for smaller or under-resourced languages, the digital shift has brought new risks. Instead of preservation, poorly trained AI systems often accelerate decline. Recent analyses from MIT, including cases from Greenlandic, Fulfulde, and Inuktitut Wikipedias, show how error-filled

The GPT-5 Wake-Up Call: When Bigger Stopped Being Better

September 7, 2025

This is a guest post by one of our most esteemed clients, Manuel Herranz, CEO of Pangeanic. We collect, classify and supply data for AI training at NLP Consultancy, and working with data allows us to test models and understand what the market wants. Our close relationship with Pangeanic as

The End of Anonymous AI: How China and Spain Are Forcing a New Era of Transparency

April 26, 2025

A Watershed Moment for AI Accountability: China and Spain are setting new global benchmarks in AI regulation, demanding clear labelling of AI-generated content, both visible and invisible, by 2025. This regulatory shift marks the beginning of a new era: transparency-by-design in the age of generative AI. At NLP CONSULTANCY, where

DeepSeek-R1: The Contender Outperforming Giants in AI

January 25, 2025

In an ever-more-complex and competitive landscape dominated by titans like ChatGPT-4 and Anthropic’s Claude, DeepSeek-R1 has emerged as a surprising frontrunner. Although it has become clear that DeepSeek wasn’t built on a $5M budget, this new language model not only competes with industry giants but also outperforms them in critical benchmarks.

Long-form parallel corpora

January 4, 2025

The demand for high-quality datasets has never been more critical. Among these datasets, long-form parallel corpora stand out as indispensable resources for advancing multilingual communication and linguistic automation. This is due to the new fluency of LLMs that we have grown used to since late 2022, with the advent of

How to avoid bias in NLP

August 26, 2024

Discover comprehensive strategies to detect and mitigate bias in NLP models. Learn how diverse data collection, algorithmic fairness techniques, and human oversight create more ethical and equitable AI language systems. Practical insights for developers and businesses.

Data Annotation: The Key to Building High-Performing AI and ML Models

October 14, 2023

Artificial intelligence (AI) and machine learning (ML) models are transforming the way we live and work. From powering our search engines and social media feeds to recommending products and diagnosing diseases, AI and ML are becoming increasingly essential to our everyday lives. But how do these powerful models work? At

Why Idiomatic Expressions Are Vital For Machine Translation Systems

July 20, 2023

Machine translation (MT) systems, particularly Neural Machine Translation and LLM translation, have made enormous progress in recent years, allowing for seamless communication between different languages. However, to truly capture the essence and nuances of language, it is essential to include idiomatic expressions in the training process. Idioms are an essential

If Apple Translate is to improve, it better consider these parallel corpora tips

July 16, 2023

Apple Translate, like other machine translation systems, relies on large amounts of high-quality parallel corpora to train and improve its translation models. This data consists of the same text in two or more languages, aligned so that the system can learn correspondences between linguistic

The Importance of Live Data for ASR Training

June 4, 2023

Artificial intelligence has become indispensable in Automatic Speech Recognition (ASR), with applications ranging from voice assistants and call center services to assistive tools for the deaf and elderly. The accuracy of ASR systems is heavily dependent on substantial training data. This data can be speech, simulated dialogues involving
