100,000+ Hours. 50+ Dialects. One Catalog.
Production-ready European audio data
Find What Your Model Needs
Most audio vendors give you a spreadsheet. We give you a catalog built for how ML engineers actually evaluate training data.
Every dataset in this catalog is indexed across the dimensions that determine whether audio ships to production or fails in deployment.
Filter by language architecture
Not just "German" but Swiss German (Zürich), Swiss German (Bern), Austrian German (Vienna), Bavarian, Swabian. Not just "English" but Glaswegian, Scouse, Belfast, Dublin. Select the specific variant your users actually speak.
Filter by acoustic conditions
Studio reference. Office ambient. Automotive highway at 120 km/h. Factory floor with heavy machinery. Hospital ward. Call center crosstalk. Warehouse logistics. Maritime bridge. Each environment is tagged with SNR ranges and noise classification.
Filter by elicitation type
Scripted read speech for baseline phoneme coverage. Spontaneous conversation for real-world variation. Command and control for voice interface training. Emotional expression for prosodic models. Specify what your architecture requires.
Filter by speaker demographics
Age bands. Gender distribution. Native vs. non-native. Regional origin. Education level. Occupation category. Every speaker profile is documented. Build balanced datasets or target specific populations.
Filter by device characteristics
Professional studio microphone. Smartphone (iOS/Android, model-specific). Laptop built-in. Headset. Far-field array. In-vehicle microphone array. Know exactly what hardware captured your training data.
Filter by compliance requirements
GDPR consent with revocation support. Biometric-safe processing under Article 9. EU AI Act Article 10 documentation. Full provenance chain. De-identified variants available. Select the governance level your legal team requires.
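The six filter dimensions above amount to a structured query over dataset metadata. A minimal sketch of what that looks like programmatically — the record shape, field names, and entries here are illustrative, not YPAI's actual catalog API:

```python
from dataclasses import dataclass

# Hypothetical catalog entry; fields mirror the filter dimensions
# above but are examples, not the real schema.
@dataclass
class DatasetEntry:
    name: str
    language: str
    dialect: str
    environment: str
    snr_db_min: float
    elicitation: str
    gdpr_consent: bool

catalog = [
    DatasetEntry("Swiss German Zürich Conversational", "German",
                 "Swiss German (Zürich)", "office", 25.0,
                 "spontaneous", True),
    DatasetEntry("Automotive Highway European", "German",
                 "Standard", "automotive", 5.0, "command", True),
]

def search(entries, **criteria):
    """Return entries matching every keyword criterion exactly."""
    return [e for e in entries
            if all(getattr(e, k) == v for k, v in criteria.items())]

hits = search(catalog, dialect="Swiss German (Zürich)", gdpr_consent=True)
```

The point is that each filter dimension is a first-class metadata field, so intersecting criteria (dialect AND environment AND consent level) is a single query, not a spreadsheet hunt.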
Eight Dataset Families
1. Dialects and Regional Speech
The problem:
Your model trained on standard German hits 8% WER on broadcast news and 35% WER on a customer from Zürich. Training data recorded in Berlin doesn't transfer to Bavaria. British English benchmarks collapse in Glasgow.
Why it matters:
Dialect variation isn't an edge case. For pan-European deployment, regional speech is the majority of your production traffic. Models trained on standard accents systematically exclude large user populations.
What we provide:
Native speakers recorded in their home regions, with documented geographic origin, dialect classification, and sub-regional variation. Deep vertical coverage, not thin horizontal breadth.
Swiss German Zürich Conversational
2,400 hours of spontaneous dialogue from Zürich canton. Urban and suburban speakers. Alemannic dialect features documented.
Swiss German Bern Regional
1,800 hours covering Bernese Oberland, Emmental, and Seeland sub-dialects. Rural and small-town speakers.
Bavarian Multi-Regional
3,200 hours across Upper Bavaria, Lower Bavaria, and Upper Palatinate. Munich urban contrasted with rural Altbayern.
Austrian German Vienna
2,100 hours of Viennese German. Service industry, professional, and casual registers.
Glaswegian Conversational
1,400 hours of working-class and professional Glasgow speech. Central Belt variation.
Scouse Liverpool Urban
900 hours of Merseyside English. Multi-generational coverage.
Andalusian Spanish Multi-City
2,800 hours covering Seville, Málaga, Granada, and Cádiz. Seseo, ceceo, and aspirated /s/ variants documented.
Catalan Barcelona Regional
1,600 hours of Central Catalan. Native and bilingual speakers with code-switching to Spanish.
2. Code-Switching and Multilingual Speech
The problem:
Real European speakers switch languages mid-sentence. A Frankfurt banker mixes English and German. A Brussels professional alternates French and Dutch. Your monolingual training data can't follow the conversation.
Why it matters:
Code-switching is the default in European business, tech, and urban contexts. Models trained on monolingual corpora fail on real-world multilingual users. Intent recognition breaks when the language changes.
What we provide:
Authentic bilingual speakers producing natural code-switching. Annotated language boundaries. Both intra-sentential switching (mid-sentence) and inter-sentential switching (sentence-level alternation).
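The annotated language boundaries described above reduce to token-level language tags, from which switch points fall out directly. An illustrative sketch — the tag format and helper are hypothetical, not the delivered annotation schema:

```python
# Token-level language tags for a code-switched German-English
# utterance ("We have to reschedule the meeting to Thursday").
# Tags: "de" = German, "en" = English, "mixed" = blended form.
utterance = [
    ("Wir", "de"), ("müssen", "de"), ("das", "de"),
    ("Meeting", "en"), ("reschedulen", "mixed"),
    ("auf", "de"), ("Donnerstag", "de"),
]

def switch_points(tokens):
    """Indices where the language tag changes, i.e. intra-sentential
    switch boundaries."""
    return [i for i in range(1, len(tokens))
            if tokens[i][1] != tokens[i - 1][1]]
```

Switch points inside one utterance mark intra-sentential switching; a tag change only at utterance boundaries would be inter-sentential alternation.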
English-German Frankfurt Business
1,800 hours. Finance, consulting, and tech contexts. Intra-sentential switching dominant.
English-German Berlin Tech
1,200 hours. Startup and software development contexts. High English lexical borrowing.
English-French Brussels Professional
1,400 hours. EU institutional, legal, and business contexts.
English-French Paris Urban
1,100 hours. Service industry and professional contexts. North African French influence.
French-Arabic Paris Marseille
1,600 hours. First and second-generation speakers. Maghrebi Arabic features.
German-Turkish Berlin Cologne
1,300 hours. Second and third-generation speakers. Kiezdeutsch features documented.
English-Spanish Barcelona Miami
900 hours. Catalan-influenced Spanish with English switching.
Swedish-Finnish Helsinki Bilingual
700 hours. Finland-Swedish speakers with Finnish code-switching.
3. Noisy and Real-World Environments
The problem:
Your ASR works perfectly in the lab. Then it enters production. Highway wind noise. Factory machinery. Hospital alarms. The gap between benchmark WER and production WER is 2.8x to 5.7x.
Why it matters:
Studio recordings don't prepare models for deployment. Clean speech corpora create a domain mismatch that no amount of fine-tuning on clean data will fix. You need training data from the acoustic environments where your model will actually run.
What we provide:
Speech recorded in real operational environments with calibrated noise levels. SNR metadata. Noise type classification. Device and microphone position documentation.
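The benchmark-to-production gap quoted above is measured with the standard word error rate: word-level edit distance divided by reference length. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

At a 2.8x to 5.7x degradation, a model benchmarking at 4% WER lands between roughly 11% and 23% in production noise.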
Automotive Highway European
4,200 hours across 18 vehicle models. 80-130 km/h conditions. HVAC on/off. Windows up/down. Driver and passenger positions.
Automotive Urban Stop-Start
2,800 hours. City traffic conditions. Engine idle. Intersection stops. Turn signal and indicator noise.
Factory Floor Manufacturing
1,600 hours. CNC machinery. Conveyor systems. Forklift traffic. PPE-muffled speech (masks, ear protection).
Warehouse Logistics
1,200 hours. Pallet handling. Forklift operations. Scanner beeps. Ambient ventilation.
Hospital Ward Ambient
1,400 hours. Medical alarms. Paging systems. Multi-speaker clinical environments. Patient room and corridor acoustics.
Call Center Crosstalk
2,200 hours. Adjacent agent bleed. Headset audio. 8kHz telephony compression. Hold music background.
Maritime Bridge Operations
600 hours. Engine room proximity. Radio chatter. Weather exposure. Norwegian and English mixed commands.
Offshore Platform Industrial
500 hours. Machinery noise. Wind exposure. Safety equipment environments. Norwegian-English code-switching.
4. Vertical-Specific Datasets
The problem:
Your general-purpose ASR doesn't know that "CABG" means coronary artery bypass graft. It transcribes "Brent crude" as "brand crude." Domain vocabulary isn't optional; it's the difference between 5% WER and 25% WER.
Why it matters:
Every vertical has specialized terminology that general speech models never encountered in training. Medical, financial, legal, and technical domains require purpose-built corpora.
What we provide:
Industry-specific vocabulary coverage. Domain expert speakers. Realistic operational contexts. Terminology validation by subject matter experts.
Automotive Voice Command European
3,400 hours. Navigation, media control, climate, and communication commands. 15 languages. Wake word and barge-in scenarios.
Clinical Dictation German
2,100 hours across 28 specialties. Board-certified physician speakers. Cardiology, radiology, pathology, emergency medicine emphasis.
Clinical Dictation French
1,800 hours. 22 specialties. Parisian and regional accents. Inpatient and outpatient contexts.
Financial Trading German English
800 hours. Trading floor recordings. FX, equities, and fixed income terminology. Multi-speaker crosstalk.
Legal Dictation German
1,100 hours. Contract law, corporate law, litigation. Formal and informal registers.
Energy Sector Norwegian English
700 hours. Oil and gas operations. Offshore and onshore contexts. Technical terminology.
Maritime Operations Nordic
500 hours. Bridge commands. Port communications. Safety procedures. Norwegian, Swedish, Danish, English.
5. Clinical and Pathology Speech
The problem:
Speech biomarker research requires audio from clinical populations. Parkinson's tremor. Post-stroke dysarthria. Cognitive decline markers. Healthy control datasets don't capture pathological speech patterns.
Why it matters:
Healthcare AI needs training data from real patient populations, collected with appropriate consent and privacy protections. Clinical audio requires specialized collection protocols and ethical oversight.
What we provide:
Patient speech collected under clinical research protocols. IRB-equivalent ethics approval. Longitudinal tracking capability. Condition-specific recruitment.
Parkinson's Disease German
180 hours. Early and mid-stage patients. Medication on/off states. Tremor and rigidity markers.
Post-Stroke Dysarthria European
220 hours. Aphasia types documented. Recovery progression. Six languages.
Mild Cognitive Impairment Nordic
160 hours. Memory clinic patients. Age-matched healthy controls. Longitudinal samples.
Depression Screening German
140 hours. PHQ-9 validated severity levels. Prosodic and lexical markers.
Respiratory Condition Markers
120 hours. Asthma, COPD, post-COVID. Breathing patterns and voice quality changes.
6. Emotional and Prosodic Speech
The problem:
Your TTS sounds robotic because it was trained on neutral read speech. Your sentiment model can't distinguish anger from frustration. Prosodic variation requires purpose-built training data.
Why it matters:
Next-generation voice AI requires emotional range. Character voices. Dynamic dialogue. Customer sentiment detection. Neutral corpora can't teach these patterns.
What we provide:
Acted and spontaneous emotional speech. Valence and arousal annotations. Prosodic contour documentation. Voice actor and natural speaker variants.
Acted Emotion German Full Range
800 hours. Professional voice actors. Six primary emotions plus blends. High and low intensity variants.
Acted Emotion English (UK) Full Range
900 hours. Regional actors. RP and regional accent variants. Character archetypes.
Spontaneous Emotion Call Center
1,400 hours. Real customer interactions recorded with documented consent. Frustration, satisfaction, confusion, urgency labeled.
Whispered Speech European
300 hours. Five languages. ASMR-adjacent and privacy-context whispers.
Shouted and Projected Speech
400 hours. Sports context. Emergency context. Crowd noise overlay variants.
Gaming Character Archetypes
600 hours. Fantasy, sci-fi, historical character types. European voice actors.
7. Minority and Low-Resource Languages
The problem:
Sami has 30,000 speakers. Basque has 750,000. Frisian has 500,000. These communities deserve voice AI that works, but commercial providers ignore them. Your European deployment isn't complete without regional and minority coverage.
Why it matters:
EU accessibility requirements increasingly cover minority language support. Public sector deployments in Norway, Finland, Spain require regional language capability. Low-resource languages need purpose-built collection, not thin crowd-sourced samples.
What we provide:
Community-partnered collection. Cultural and linguistic consultation. Dialect documentation. Orthographic and phonetic transcription.
Northern Sami Norway Finland
120 hours. Native speakers from Kautokeino, Karasjok, and Finnish Lapland. Read and spontaneous speech.
Basque Euskara Regional
340 hours. Gipuzkoan, Bizkaian, and standard Batua variants. Urban and rural speakers.
Catalan Full Regional
1,200 hours. Central, Valencian, Balearic, and Northwestern variants.
Welsh North and South
280 hours. Gwynedd and Carmarthenshire variants. First-language and learner speakers.
Breton Brittany Regional
140 hours. Elderly native speakers. Revitalization context learners.
Frisian West Frisian
160 hours. Netherlands province speakers. Dutch code-switching documented.
Faroese Iceland Comparison
90 hours. Primarily Faroese, with paired Icelandic recordings for mutual-intelligibility comparison.
8. Synthetic-Safe Grounding Datasets
The problem:
You're fine-tuning Whisper or training a custom ASR, and you need clean, legally unambiguous, consent-verified audio. Public datasets have unclear provenance. Your legal team wants to know exactly where every hour came from.
Why it matters:
EU AI Act Article 10 requires documented training data provenance. Undocumented data creates regulatory exposure. Foundation model training requires bulletproof consent chains.
What we provide:
Studio-quality reference recordings with unambiguous consent. Zero synthetic contamination. Full speaker demographics. Explicit commercial licensing.
Nordic Reference Corpus Clean
2,400 hours. Norwegian, Swedish, Danish, Finnish, Icelandic. Studio conditions. CC-BY-SA licensing.
German Reference Multi-Accent Clean
3,200 hours. Standard German with Austrian, Swiss, and regional variants. Studio conditions.
European Phoneme Coverage Balanced
1,800 hours. Twelve languages. Phonetically balanced sentence sets. IPA alignment.
Demographic Balanced European
4,200 hours. Age, gender, and regional quotas across ten countries. Bias evaluation documentation included.
High-Value Collections
Swiss German Complete Regional
6,200 hours across Zürich, Bern, Basel, Lucerne, and St. Gallen cantons. The most comprehensive Alemannic German corpus available for commercial licensing. Includes spontaneous conversation, read speech, and command-and-control scenarios. Sub-dialect classification at municipality level. Urban/rural speaker distribution documented. Recorded 2022-2024 on smartphones and professional equipment.
Why it matters: Swiss German is mutually unintelligible with Standard German. Models trained on High German fail systematically on Swiss users. This corpus closes the gap.
Nordic Languages Bundle
14,000 hours across Norwegian (Bokmål, Nynorsk, five dialect regions), Swedish (four dialect regions including Finland-Swedish), Danish (Copenhagen and Jutlandic), Finnish, and Icelandic. Scripted and unscripted variants. Full demographic metadata across age, gender, and regional origin.
Why it matters: No competitor offers comparable Nordic depth. Speechmatics lists these languages but doesn't publish dialect-specific coverage. This bundle covers Nordic deployment end-to-end.
European Automotive In-Cabin
8,400 hours across 22 vehicle models from eight manufacturers. Highway (100-140 km/h), urban, and idle conditions. HVAC states documented. Driver and passenger positions. 18 languages with native-accent speakers. Infotainment commands, navigation requests, and spontaneous conversation.
Why it matters: In-cabin acoustic conditions cannot be synthesized. Augmenting studio recordings with noise overlays doesn't replicate real vehicle transfer functions. This corpus captures ground-truth in-vehicle speech.
Clinical Dictation European
6,800 hours across German, French, Spanish, Italian, and Dutch. 35+ medical specialties. Board-certified and practicing physicians. Inpatient and outpatient contexts. HIPAA-equivalent consent protocols. De-identified variants available.
Why it matters: Medical terminology causes 3-5x WER degradation versus general speech. Specialty-specific vocabulary (cardiology vs. radiology vs. pathology) requires purpose-built corpora. This dataset covers the clinical documentation use case at scale.
British Isles Complete Accent Collection
8,200 hours covering Glaswegian, Scouse, Geordie, Belfast, Dublin, Cork, Welsh English, West Country, Yorkshire, and Birmingham. Native speakers recorded in home environments. Multi-generational samples. Spontaneous conversation and elicited speech.
Why it matters: British English isn't one accent. Models trained on RP or general British data fail on regional speakers. Customer service, healthcare, and public sector applications require regional coverage.
European Call Center Multilingual
12,400 hours across 22 languages. Real call center recordings with documented consent. Customer service, technical support, complaints, and sales contexts. Emotional state annotations. PII fully redacted. 8kHz telephony and VoIP quality variants.
Why it matters: Call center audio has unique acoustic characteristics: narrow bandwidth, compression artifacts, headset coloration, crosstalk. Models trained on wideband audio degrade on telephony. This corpus matches production conditions.
Code-Switching European Business
6,800 hours of bilingual speech across eight language pairs. English-German (Frankfurt, Berlin), English-French (Paris, Brussels), French-Arabic (Paris, Marseille), German-Turkish (Berlin, Cologne), and others. Intra-sentential and inter-sentential switching annotated.
Why it matters: Monolingual models fail on multilingual users. Code-switching is standard in European business, tech, and urban contexts. This corpus enables real-world multilingual ASR.
Emotional Speech European Acted
3,200 hours of professional voice actor recordings. Six primary emotions at three intensity levels. German, English (UK), French, Spanish, and Italian. Valence and arousal annotations. Prosodic contour documentation.
Why it matters: TTS systems require emotional range. Sentiment analysis requires labeled emotional speech. Neutral corpora can't teach these patterns. Professional acted speech provides ground-truth emotional expression.
What Ships With Every Dataset
Audio without metadata is unusable. You can't fine-tune on speakers you can't characterize. You can't balance training sets without demographic data. You can't satisfy compliance requirements without consent documentation.
Every YPAI dataset includes structured metadata at the recording, speaker, and collection-session levels.
Recording-level metadata
- Duration (milliseconds)
- Sample rate (typically 16kHz or 48kHz)
- Bit depth
- File format
- Recording date
- Device type and model
- Microphone type
- Acoustic environment classification
- Signal-to-noise ratio estimate
- Clipping detection flag
- Silence ratio
Speaker-level metadata
- Unique speaker ID (pseudonymized)
- Age band
- Gender
- Native language
- Dialect/accent classification
- Geographic origin (country, region, city where applicable)
- Education level
- Occupation category
- Self-reported language proficiency (CEFR scale for non-native)
- Years of residence in recording location
Session-level metadata
- Collection date
- Collection method (app, studio, field recording)
- Recording environment description
- Noise classification
- Device positioning
- Elicitation type (scripted, spontaneous, command, emotional)
Consent and provenance
- Consent ID linking to master consent record
- Consent version (for updated consent forms)
- Consent scope (what the audio may be used for)
- Revocation status (propagated from master consent system)
- Collection organization
- Collection protocol reference
- Annotation lineage (who annotated what, when, using which guidelines)
EU AI Act Article 10 documentation
- Demographic distribution analysis
- Geographic representation analysis
- Known limitations and gaps
- Data quality measures
- Bias evaluation methodology
- Training data sheet in standardized format
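Taken together, the levels above form one structured record per recording. An illustrative shape — field names are examples for this sketch, not the actual delivery schema — with the kind of consent gate the revocation metadata enables:

```python
# Example record combining the metadata levels listed above.
# All field names are illustrative, not YPAI's delivery format.
record = {
    "recording": {
        "duration_ms": 12480,
        "sample_rate_hz": 48000,
        "bit_depth": 24,
        "format": "wav",
        "snr_db_estimate": 18.5,
        "environment": "automotive_highway",
        "clipping_detected": False,
    },
    "speaker": {
        "speaker_id": "spk_4f2a9c",   # pseudonymized
        "age_band": "35-44",
        "native_language": "de",
        "dialect": "Swiss German (Zürich)",
    },
    "session": {
        "elicitation": "spontaneous",
        "collection_method": "app",
    },
    "consent": {
        "consent_id": "cns_001234",
        "consent_version": 3,
        "scope": ["asr_training", "research"],
        "revoked": False,
    },
}

def usable_for(rec: dict, purpose: str) -> bool:
    """A recording is usable only if its consent scope covers the
    purpose and the consent has not been revoked."""
    c = rec["consent"]
    return purpose in c["scope"] and not c["revoked"]
```

Because every recording links back to a master consent record, a revocation flips one field and the gate above excludes the audio from all future training runs.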
Browse by Industry
Automotive Voice AI
In-cabin recordings. Highway, urban, and idle conditions. 70+ language and dialect combinations. Wake word and command datasets. Navigation, media, climate, and communication scenarios.
Healthcare and Clinical Speech
Physician dictation across 35+ specialties. Ambient clinical conversation. Medical terminology in European languages. De-identified variants. HIPAA-equivalent consent.
Finance and Call Center
Trading floor recordings. Call center customer service. Financial terminology. PCI-DSS pre-scrubbed variants. Emotional state annotations.
Industrial and Manufacturing
Factory floor speech. Warehouse logistics. Maritime operations. Offshore platforms. PPE-muffled speech. Industrial noise environments.
Gaming and Entertainment
Emotional speech for TTS. Character archetypes. Voice actor recordings. Prosodic variation. Whispered and projected speech.
Broadcasting and Media
Multi-speaker panel discussions. Sports commentary. News broadcast. Proper noun emphasis. Live captioning training data.
Get Started Today
We don't ask you to trust marketing claims. We ask you to evaluate samples. Request sample datasets—specify your language, dialect, environment, and vertical. We'll send representative samples with full metadata so you can evaluate fit before any commercial discussion.
Norwegian-headquartered. EEA data residency. 100,000+ hours cataloged. Audio data your models can trust.