Multi-Dialect US English Speech Datasets

Modeling North American speech requires accounting for wide variation in vowel quality and rhoticity across the continent. We provide high-fidelity datasets spanning General American (GenAm), Southern American (SAM), African American Vernacular English (AAVE), and distinct Northeastern varieties.

Precision Modeling for Continental Diversity

US English is defined by several major phonological shifts that can degrade ASR performance if they are not specifically targeted during training. From rhoticity variation in the Northeast to the cot-caught merger prevalent in the West, our datasets provide the balanced acoustic distribution needed for robust voice systems.

Our US English Speech Datasets are annotated with specific attention to sociolinguistic markers, including the pin-pen merger in the South, the Northern Cities Shift in the Midwest, and the Southern Vowel Shift in Texas, supporting equity and accuracy in high-stakes deployments.

VOWEL MERGER SENSITIVITY

Detailed capture of the low-back merger (cot vs. caught) and the Southern front-vowel merger (pin vs. pen) for accurate lexical disambiguation.

RHOTICITY GRADIENTS

Comprehensive coverage of non-rhotic variants in New York City, Boston, and the Deep South, ensuring stability across /r/ realization patterns.

Dialectal Acoustic Standards

We isolate the phonetic markers that define the major acoustic zones of North American speech.

General American (GenAm)

The benchmark for broadcast media, professional AI, and business communication.

PHONETIC CHARACTERISTICS

Fully rhotic. T-glottalization in final positions. Alveolar flapping of intervocalic /t/ and /d/ (e.g., butter). Minimal vowel mergers.

ASR CHALLENGES

Over-reliance on GenAm leads to high failure rates in urban and rural regional contexts. Precise flap modeling is critical for GenAm accuracy.
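
As a rough illustration of what flap modeling can look like at the lexicon level, the sketch below rewrites /t/ and /d/ as a flap when they fall between a vowel and an unstressed vowel. It assumes ARPAbet-style phone strings with stress digits (as in "B AH1 T ER0" for butter) and uses DX as the flap symbol. The function names are hypothetical, and the rule is a deliberate simplification that ignores word boundaries and speaker variation.

```python
# ARPAbet vowel symbols; stress digits (0, 1, 2) are stripped before lookup.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}


def is_vowel(phone: str) -> bool:
    """True if the phone is a vowel, with or without a stress digit."""
    return phone.rstrip("012") in VOWELS


def is_unstressed_vowel(phone: str) -> bool:
    """True if the phone is a vowel explicitly marked with stress level 0."""
    return is_vowel(phone) and phone.endswith("0")


def apply_flapping(phones: list[str]) -> list[str]:
    """Replace T or D with DX between a vowel and an unstressed vowel."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] in ("T", "D")
                and is_vowel(phones[i - 1])
                and is_unstressed_vowel(phones[i + 1])):
            out[i] = "DX"
    return out


# "butter": the /t/ sits between a stressed and an unstressed vowel, so it flaps.
print(apply_flapping(["B", "AH1", "T", "ER0"]))
# "attack": the /t/ precedes a stressed vowel, so it is left unchanged.
print(apply_flapping(["AH0", "T", "AE1", "K"]))
```

A rule like this is typically used to generate pronunciation variants for a lexicon rather than to replace the canonical form outright.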

Southern American (SAM)

Featuring the iconic Southern Drawl and distinct monophthongization.

PHONETIC CHARACTERISTICS

Monophthongization of /ai/ (e.g., time becomes [ta:m]). Pin-pen merger. Non-rhoticity in coastal pockets. Lengthened vowel duration (the Southern Drawl).

ASR CHALLENGES

Vowel length variation confuses standard temporal models. Merger of front vowels requires strong n-gram language models for context-based correction.
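
To make the idea of context-based correction concrete, here is a minimal sketch of a bigram language model choosing between "pin" and "pen" when the acoustics cannot separate them. The six-sentence corpus, the function names, and the add-one smoothing are all assumptions made for illustration. A production system would use a far larger model and would combine this score with the acoustic score.

```python
import math
from collections import Counter

# Tiny hypothetical training corpus, for illustration only.
CORPUS = [
    "i need a pen to sign the form",
    "she signed with a blue pen",
    "hand me a pen and paper",
    "the pin on the map marks the office",
    "enter your pin at the terminal",
    "type your pin to unlock the card",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def disambiguate(words, index, candidates, unigrams, bigrams):
    """Pick the candidate at words[index] that the language model scores highest."""
    def score(candidate):
        tokens = ["<s>"] + words[:index] + [candidate] + words[index + 1:] + ["</s>"]
        return log_prob(tokens, unigrams, bigrams)
    return max(candidates, key=score)


unigrams, bigrams = train_bigrams(CORPUS)
heard = "enter your pen at the terminal".split()
print(disambiguate(heard, 2, ["pin", "pen"], unigrams, bigrams))
```

Here the surrounding words ("your ... at") are what favor "pin". With no informative context, the model has nothing to separate the two candidates.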

AAVE & Urban Variants

Distinct socio-linguistic patterns prevalent in urban centers across the US.

PHONETIC CHARACTERISTICS

Consonant cluster reduction (e.g., desk becomes [des]). /th/ realized as [d] (e.g., this) or [f] (e.g., bath). Unique pitch contours and stress patterns.

ASR CHALLENGES

Often suffers from the highest WER in biased systems. Requires diverse urban training sets to handle consonant reduction and specific prosodic markers.

Midwest (Inland North)

Focusing on the Northern Cities Shift (NCS), a radical rotation of vowels in the Great Lakes region.

PHONETIC CHARACTERISTICS

Raising of "short a" /ae/ (e.g., cat sounding like kyat). Fronting of /o/ (e.g., block sounding like black). These changes are linked stages of a single chain shift rather than independent features.

ASR CHALLENGES

Vowel rotation leads to significant phonetic ambiguity between standard GenAm and Inland North realizations. Highly sensitive to vowel formant shifts.
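
The ambiguity described above can be sketched with a nearest-target vowel classifier in F1/F2 space. All numbers below are illustrative round values in Hz, chosen as assumptions for the example. They are not measurements from these datasets, and Euclidean distance in Hz is only a crude stand-in for perceptual distance.

```python
import math

# Illustrative GenAm-like reference targets as (F1, F2) in Hz.
# Assumed values for demonstration, not measured from any corpus.
GENAM_TARGETS = {
    "AE": (660.0, 1720.0),  # as in "black"
    "EH": (530.0, 1840.0),  # as in "bed"
    "AA": (730.0, 1090.0),  # as in "block"
}


def nearest_vowel(f1: float, f2: float, targets=GENAM_TARGETS) -> str:
    """Return the vowel label whose reference target is closest in F1/F2 space."""
    return min(
        targets,
        key=lambda v: math.hypot(f1 - targets[v][0], f2 - targets[v][1]),
    )


# Hypothetical Inland North tokens: a raised /ae/ and a fronted /o/.
raised_ae = (540.0, 1900.0)     # intended vowel: AE
fronted_aa = (720.0, 1500.0)    # intended vowel: AA

print(nearest_vowel(*raised_ae))
print(nearest_vowel(*fronted_aa))
```

With these assumed values, the raised /ae/ token lands nearest the EH target and the fronted /o/ token lands nearest the AE target, which mirrors the block/black confusion. A model that has seen only GenAm-like targets has no basis for recovering the intended vowel.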

Texas & Southwestern

Deep documentation of the Southern Vowel Shift (SVS) and Texas-specific lexical markers.

PHONETIC CHARACTERISTICS

Pin-pen merger is near-universal. Vowel breaking (e.g., bed becomes [be-uhd]). Strong monophthongization of /ai/. More consistently rhotic than Gulf Southern speech.

ASR CHALLENGES

Complex diphthongization patterns require higher temporal resolution in acoustic models. SVS creates overlaps with standard Northern vowel targets.

Global Generalization

Our datasets are more than collections of audio: they are balanced, engineered samples of the American linguistic experience. From the high-rises of Manhattan to the sprawl of Southern California, we deliver the data voice AI needs to generalize across the full range of US English.

US English Dataset Configurations

VARIANT / LOCALE | AUDIO SPECS | PRIMARY USE CASE
General American (US) | 16 kHz / 48 kHz, Studio & Telephony | Media Captioning, Enterprise IVR
Southern American (SAM) | 16 kHz, Diverse SNR | In-Car Systems, Regional Service Centers
Northeastern / NYC / Boston | 16 kHz, Urban Ambient Noise | Law Enforcement Bodycam Transcription
Urban AAVE / African American | 16 kHz, Conversational spontaneous | Bias Reduction, Social Media Monitoring
Inland North / Midwest (NCS) | 48 kHz, High Fidelity | Vowel Shift Research, Regional Voice UI
Texas / Southwest (SVS) | 16 kHz / 44.1 kHz, Multi-Environment | In-Car Voice Control, Logistics ASR

Technical Implementation FAQ

How do your datasets handle the Cot-Caught merger?

Our datasets are geographically tagged so models can learn the merger as a feature of Western and Midwestern speech. We provide specific metadata that identifies speakers who merge the vowels in words like "don" and "dawn," allowing for custom acoustic modeling.
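
One way such metadata could feed custom modeling is sketched below: a per-speaker flag decides whether the lexicon collapses the "caught" vowel (AO) into the "cot" vowel (AA). The SpeakerMeta class, its field names, and the speaker IDs are hypothetical and do not describe the actual delivery schema. Phone strings are assumed to be ARPAbet with stress digits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerMeta:
    """Hypothetical per-speaker metadata record."""
    speaker_id: str
    region: str
    cot_caught_merged: bool  # assumed flag name, for illustration only


def speaker_pronunciation(phones, speaker):
    """Collapse AO into AA for merged speakers; leave other speakers unchanged."""
    if not speaker.cot_caught_merged:
        return list(phones)
    # Keep the stress digit: "AO1" becomes "AA1".
    return ["AA" + p[2:] if p.startswith("AO") else p for p in phones]


DAWN = ["D", "AO1", "N"]
DON = ["D", "AA1", "N"]

west = SpeakerMeta("spk_001", "West", cot_caught_merged=True)
northeast = SpeakerMeta("spk_002", "Northeast", cot_caught_merged=False)

# For the merged speaker, "dawn" and "don" share one pronunciation.
print(speaker_pronunciation(DAWN, west) == DON)
# For the unmerged speaker, the two words stay distinct.
print(speaker_pronunciation(DAWN, northeast) == DON)
```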

Why is AAVE inclusion critical for US ASR systems?

Published evaluations have found that many off-the-shelf ASR systems exhibit significant racial bias, with WERs for Black speakers roughly twice as high as for White speakers. Our AAVE-inclusive corpora are specifically designed to narrow this accuracy gap by providing high-quality, phonetically diverse training samples.
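
A gap like this is only visible when WER is reported per speaker group rather than as a single average. The sketch below computes word error rate from a word-level edit distance and aggregates it by group. The group labels, transcripts, and function names are hypothetical placeholders, not results from any real system.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str):
    """Return (word-level edit distance, reference length) for one utterance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) triples."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words}


# Hypothetical evaluation set, for illustration only.
SAMPLES = [
    ("group_a", "turn the lights off", "turn the lights off"),
    ("group_a", "call my sister now", "call my sister"),
    ("group_b", "he been working all day", "he be working all day"),
    ("group_b", "put it on the desk", "put it on the death"),
]

print(wer_by_group(SAMPLES))
```

Pooling errors and reference words before dividing, rather than averaging per-utterance WERs, keeps short utterances from dominating the group score.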

Do you provide data for North American Hispanic variants?

Yes. We have specialized corpora for Chicano English and other Hispanic-influenced US English variants, which often feature distinct syllable timing and specific consonant realizations. This coverage is essential for serving the more than 60 million Hispanic residents of the US.

What is the Northern Cities Shift and why does it matter?

The Northern Cities Shift (NCS) is a complex chain shift of vowels in the Inland North region (Chicago, Detroit, Cleveland). Because it significantly alters the acoustic target for "short-a" and other vowels, models trained exclusively on General American often fail to recognize common words in this region. Our datasets explicitly map these shifts to ensure high accuracy across the Great Lakes belt.

Procure US English Speech Data

Connect with our linguistic experts to discuss phonetic modeling requirements and data licensing.