Latvian Speech Datasets: Tonal Baltic Precision

Latvian is one of only two surviving Baltic languages, and it poses a distinctive ASR challenge: a phonemic pitch-accent system combined with complex Indo-European fusional morphology. Our datasets are built to capture both the tonal detail and the morphological depth that modern speech models need.

Panoramic view of Riga Old Town, the linguistic and economic heart of Latvia

Data Collection

Riga, Latvia

Our primary recording hub for the Central dialect, where sessions are captured in studio-grade conditions.

Latvian ASR: Mastering Baltic Phonology

As one of only two surviving Baltic languages, Latvian occupies a distinct niche in the Indo-European family. For speech technology it is a difficult target: it is one of the few European languages to retain a functional pitch-accent system. At NLPC, our Latvian corpora are engineered so that models can learn these tonal and morphological distinctions.

The Pitch-Accent Challenge

Latvian syllables can be pronounced with three distinct intonations: Level, Falling, and Broken. These are not merely expressive; they are phonemic. A change in intonation can shift the lexical meaning of a word. Conventional ASR models often fail to capture these subtle frequency shifts. Our datasets are recorded in studio-grade environments to preserve the fundamental frequency (F0) data necessary for tonal classification.
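To make the role of F0 concrete, here is a minimal sketch of frame-by-frame pitch tracking using plain autocorrelation. It is an illustration under stated assumptions, not NLPC's pipeline: the signals are synthetic sine glides rather than Latvian speech, the function names (`synth_glide`, `estimate_f0`, `f0_contour`) are hypothetical, and the 16 kHz rate is chosen only to keep the pure-Python loop fast.

```python
import math

def synth_glide(f_start, f_end, sample_rate, duration_s):
    """Synthesize a sine whose frequency moves linearly from f_start to f_end."""
    n = int(sample_rate * duration_s)
    samples, phase = [], 0.0
    for i in range(n):
        freq = f_start + (f_end - f_start) * i / n
        phase += 2 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples

def estimate_f0(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate F0 of one frame from the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def f0_contour(samples, sample_rate, frame_s=0.04):
    """Split the signal into non-overlapping frames and estimate F0 for each."""
    frame_len = int(sample_rate * frame_s)
    return [
        estimate_f0(samples[start:start + frame_len], sample_rate)
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]

SAMPLE_RATE = 16000  # assumption: low rate keeps this demo quick

# A steady 120 Hz tone stands in for a level contour,
# a 160 -> 100 Hz glide for a falling one.
steady = f0_contour(synth_glide(120, 120, SAMPLE_RATE, 0.2), SAMPLE_RATE)
falling = f0_contour(synth_glide(160, 100, SAMPLE_RATE, 0.3), SAMPLE_RATE)

print("steady :", [round(f, 1) for f in steady])
print("falling:", [round(f, 1) for f in falling])
```

The steady tone yields a roughly flat contour, while the glide yields a descending one. Real speech would need a more robust pitch tracker and voicing detection, and an F0 contour alone should not be assumed sufficient to separate all three intonations.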

Fusional Morphology

Unlike the agglutinative Uralic languages to the north, Latvian is a fusional Indo-European language. Its seven cases and complex verb conjugations involve stem alternations and suffixes that encode several grammatical categories at once. Our datasets include high-density morphological annotation, allowing models to predict inflectional changes across varied contexts.
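The fusional pattern can be illustrated with the noun "lācis" (bear). The sketch below is not the dataset's annotation schema; the table and the `analyze` helper are hypothetical, and only five of the cases are listed to keep it short. It shows two things: a single ending marks case and number together, and the stem alternates between "lāc-" and "lāč-". The forms follow the standard paradigm as we understand it and should be checked against a reference grammar before reuse.

```python
# Forms of "lācis" (bear), keyed by (case, number).
# Assumption: a partial paradigm, listed for illustration only.
LACIS_FORMS = {
    ("nominative", "singular"): "lācis",
    ("genitive", "singular"): "lāča",
    ("dative", "singular"): "lācim",
    ("accusative", "singular"): "lāci",
    ("locative", "singular"): "lācī",
    ("nominative", "plural"): "lāči",
    ("genitive", "plural"): "lāču",
    ("dative", "plural"): "lāčiem",
    ("accusative", "plural"): "lāčus",
    ("locative", "plural"): "lāčos",
}

# The palatalized stem is checked first so it is never mistaken for "lāc".
STEMS = ("lāč", "lāc")

def analyze(form):
    """Return each (case, number) reading of a surface form, split into stem and ending."""
    readings = []
    for (case, number), surface in LACIS_FORMS.items():
        if surface == form:
            stem = next(s for s in STEMS if surface.startswith(s))
            readings.append({
                "stem": stem,
                "ending": surface[len(stem):],
                "case": case,
                "number": number,
            })
    return readings

for word in ("lācis", "lāča", "lāčiem"):
    print(word, analyze(word))
```

Note that the ending cannot be split into a separate "case" piece and "number" piece, which is what distinguishes this from an agglutinative system.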

Baltic Heritage & Modern Use

Latvian is the sole official language of Latvia and one of the official languages of the EU. In an era of sovereign AI, high-quality local data is essential for the government, legal, and financial sectors. We provide datasets covering standard Riga Latvian as well as the Latgalian variety, giving broader coverage of how the language is spoken across Latvia.

High-Fidelity Audio Architecture

Because tone is phonemic in Latvian, low-bitrate audio is insufficient for training high-end models. We deliver audio at 48 kHz / 24-bit PCM as standard, with time-aligned transcripts that mark both word boundaries and syllable-level intonation peaks.
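A delivered file can be checked against the stated 48 kHz / 24-bit format with Python's standard `wave` module. The sketch below is one possible check, not an NLPC tool: `check_wav_format` and `write_silence` are hypothetical helpers, and the test files are generated silence. It also works out the uncompressed bitrate, which for mono audio is 48,000 × 24 = 1,152 kbps.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24 bits per sample

def check_wav_format(path):
    """Read the WAV header and compare it with the expected delivery format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_ok": rate == EXPECTED_RATE_HZ,
        "bit_depth_ok": width == EXPECTED_SAMPLE_WIDTH_BYTES,
        "channels": channels,
        "duration_s": frames / rate,
        "bitrate_kbps": rate * width * 8 * channels / 1000,
    }

def write_silence(path, rate, width_bytes, seconds=1):
    """Write a mono PCM file of silence, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width_bytes)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * width_bytes * seconds))

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 48_000, 3)   # matches the stated format
    write_silence(bad, 16_000, 2)    # 16 kHz / 16-bit, should fail both checks
    good_report = check_wav_format(good)
    bad_report = check_wav_format(bad)

print(good_report)
print(bad_report)
```

Some 24-bit files use the WAVE_FORMAT_EXTENSIBLE header, which older Python versions of `wave` may reject; a dedicated audio library may be a better fit in that case.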

Technical JSON Metadata Sample

{
  "dataset_id": "NLPC-LV-ASR-2024",
  "language": "lv-LV",
  "morphology": "Inflective / Fusional",
  "accent_system": "Pitch-Accent (3 Syllable Intonations)",
  "phonology": "Tonal Contrast",
  "samples": [
    {
      "audio_id": "LV_00452",
      "text": "Lācis",
      "analysis": {
        "root": "lāc- (bear)",
        "case": "nominative (-is)",
        "intonation": "Broken (lauzta)",
        "phonetic": "[lâːtsis]"
      },
      "translation": "Bear"
    }
  ]
}
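As a usage illustration, the metadata sample above can be loaded and sanity-checked in a few lines of Python. This is a sketch under assumptions: the set of required fields is inferred from the single sample shown, not from a published schema, and `validate_sample` is a hypothetical helper.

```python
import json

# Copy of the metadata sample shown above.
SAMPLE_JSON = """
{
  "dataset_id": "NLPC-LV-ASR-2024",
  "language": "lv-LV",
  "morphology": "Inflective / Fusional",
  "accent_system": "Pitch-Accent (3 Syllable Intonations)",
  "phonology": "Tonal Contrast",
  "samples": [
    {
      "audio_id": "LV_00452",
      "text": "Lācis",
      "analysis": {
        "root": "lāc- (bear)",
        "case": "nominative (-is)",
        "intonation": "Broken (lauzta)",
        "phonetic": "[lâːtsis]"
      },
      "translation": "Bear"
    }
  ]
}
"""

# The three syllable intonations named in the text.
INTONATIONS = ("Level", "Falling", "Broken")

# Assumption: these fields are required because the one sample has them.
REQUIRED_FIELDS = ("audio_id", "text", "analysis", "translation")

def validate_sample(sample):
    """Return a list of problems found in one sample record (empty if none)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in sample]
    intonation = sample.get("analysis", {}).get("intonation", "")
    # The label may carry a gloss, e.g. "Broken (lauzta)", so match on the prefix.
    if not intonation.startswith(INTONATIONS):
        problems.append(f"unknown intonation: {intonation!r}")
    return problems

metadata = json.loads(SAMPLE_JSON)
problems = [p for sample in metadata["samples"] for p in validate_sample(sample)]
print(metadata["dataset_id"], "problems:", problems)
```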