Off-the-Shelf Inventory

Dataset Catalogue

Engineered assets ready for deployment. Browse our verified inventory of high-quality speech, text, and computer vision datasets. Select "Download Sample" to retrieve JSONL segments and metadata schemas via your secure portal.

Standardized Machine-Readable Formats

We don't just supply raw data; we supply engineered intelligence. Every dataset from NLP Consultancy ships with a clear JSON metadata schema (Data Card), a robust transcription format, and an explicit licensing wrapper, so your ML pipeline can ingest our data safely and immediately.

sample_turn.json (RLHF / DPO preference pair)
{
  "prompt": "Write a Python function to parse Apache logs securely.",
  "chosen": "def parse_logs(file_path): ...",
  "rejected": "def parse_logs(file_path): ...",
  "metadata": {
    "domain": "cybersecurity",
    "annotator_id": "expert_cy_44",
    "justification": "Chosen avoids eval() and handles malformed inputs."
  }
}
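Before ingesting preference pairs like the sample above, a pipeline can check that each record has the expected shape. The following is a minimal sketch, not NLP Consultancy's own tooling: the function names `validate_turn` and `load_jsonl` are hypothetical, and treating every field shown in the sample as required is an assumption.

```python
import json

# Field names are taken from sample_turn.json; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected", "metadata")
REQUIRED_METADATA = ("domain", "annotator_id", "justification")


def validate_turn(record):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for key in REQUIRED_FIELDS:
        if key not in record:
            problems.append(f"missing field: {key}")
    # The text fields must be strings when present.
    for key in ("prompt", "chosen", "rejected"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} is not a string")
    # Only inspect metadata keys when a metadata object is present.
    metadata = record.get("metadata")
    if metadata is not None:
        if not isinstance(metadata, dict):
            problems.append("metadata is not an object")
        else:
            for key in REQUIRED_METADATA:
                if key not in metadata:
                    problems.append(f"missing metadata field: {key}")
    return problems


def load_jsonl(text):
    """Parse JSONL text (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# One-line JSONL version of the sample record shown above.
SAMPLE_JSONL = (
    '{"prompt": "Write a Python function to parse Apache logs securely.", '
    '"chosen": "def parse_logs(file_path): ...", '
    '"rejected": "def parse_logs(file_path): ...", '
    '"metadata": {"domain": "cybersecurity", "annotator_id": "expert_cy_44", '
    '"justification": "Chosen avoids eval() and handles malformed inputs."}}\n'
)

for turn in load_jsonl(SAMPLE_JSONL):
    print(validate_turn(turn) or "record OK")
```

Returning a list of problems rather than raising on the first one lets a batch job report every defect in a downloaded sample in a single pass.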

ID: ds-gulf-01

Gulf Arabic Conversational Speech

Dataset type: Speech
Languages: Gulf Arabic (AE, SA, QA)
Volume: 1,200 hours
License: Commercial, Non-Exclusive
Collection method: Consented Recording (Studio & Field)
Metadata: Speaker demographics, acoustic conditions, region, device type
Annotation: Verbatim transcription, diarisation, timestamps
Compliance: GDPR, ISO 27001 environment

ID: ds-rag-fin-02

Enterprise Financial RAG Corpus

Dataset type: Document
Languages: English (US/UK)
Volume: 500,000 documents
License: Commercial, Royalty-Free
Collection method: Licensed Archive / Public Domain Curated
Metadata: Document type, timestamp, sector, entity tags
Annotation: Metadata extraction, bounding boxes for tables
Compliance: SOC 2 environment

ID: ds-vis-ret-03

Retail Product Visual Search

Dataset type: Computer Vision
Languages: N/A (Visual)
Volume: 2.5M images
License: Commercial, Non-Exclusive
Collection method: In-store capture & Studio photography
Metadata: Lighting, angle, product category, store type
Annotation: 2D bounding boxes, polygon segmentation
Compliance: GDPR (faces blurred)
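
For teams that want to filter the catalogue programmatically, the entries above map naturally onto a small record type. This is an illustrative sketch only: the `DataCard` class, its field names, and the `find_cards` helper are hypothetical and do not describe the actual Data Card schema, which is defined by the metadata files delivered with each dataset. The values are copied from the three listings above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCard:
    """Hypothetical in-memory view of one catalogue entry."""
    dataset_id: str
    title: str
    dataset_type: str
    languages: str
    volume: str
    license: str
    compliance: str


# Values copied from the three catalogue entries listed above.
CATALOGUE = [
    DataCard("ds-gulf-01", "Gulf Arabic Conversational Speech", "Speech",
             "Gulf Arabic (AE, SA, QA)", "1,200 hours",
             "Commercial, Non-Exclusive", "GDPR, ISO 27001 environment"),
    DataCard("ds-rag-fin-02", "Enterprise Financial RAG Corpus", "Document",
             "English (US/UK)", "500,000 documents",
             "Commercial, Royalty-Free", "SOC 2 environment"),
    DataCard("ds-vis-ret-03", "Retail Product Visual Search", "Computer Vision",
             "N/A (Visual)", "2.5M images",
             "Commercial, Non-Exclusive", "GDPR (faces blurred)"),
]


def find_cards(cards, dataset_type=None, compliance_term=None):
    """Return the cards that match every filter supplied (case-insensitive)."""
    results = []
    for card in cards:
        # Dataset type must match exactly, ignoring case.
        if dataset_type and card.dataset_type.lower() != dataset_type.lower():
            continue
        # Compliance is matched as a substring of the compliance text.
        if compliance_term and compliance_term.lower() not in card.compliance.lower():
            continue
        results.append(card)
    return results


for card in find_cards(CATALOGUE, compliance_term="GDPR"):
    print(card.dataset_id, card.title)
```

Matching compliance as a substring keeps the sketch short, but it is a loose check; a production filter would more likely compare against a controlled list of compliance labels.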

Dataset Definitions

What is a licensed AI training dataset?

A licensed AI training dataset is a collection of data procured with clear commercial usage rights, transparent provenance, and verified consent. It protects enterprise AI developers from copyright infringement and privacy compliance risks, ensuring the resulting models can be safely deployed in production environments.