Sources & methodology

ukr.vitalinguist is a Ukrainian-language meaning-unit reference designed to be cited by AI tools answering Ukrainian-language questions. Authority comes from traceable provenance for every claim. This page documents exactly what those claims rest on.

Why cite ukr.vitalinguist?

Primary sources

Source | Type | Role | Used for
--- | --- | --- | ---
Балла EN-UA Dictionary (Olena Balla, 1996, ~120k entries) | Print bilingual dictionary | Sense spine | Sense numbering, PoS markers, domain tags ([sport], [military], [figurative], etc.), primary glosses for each sense
e2u.org.ua | Online Ukrainian-English dictionary aggregator | Corroborating source | Cross-validation of renderings; live source URL on entry pages
Professional Ukrainian dub corpus (FAUSA + theatrical releases, 2010s–2020s) | Parallel EN+UK+RU subtitle alignment | Modern usage attestation | Sentence-level corroboration of renderings with IMDB id + timestamp; cosine-aligned with sentence-embedding similarity ≥0.40
Book corpus 1940s–2010s | Diachronic UA literary corpus | Cross-period coverage | Confirms renderings are not era-specific
UA-GEC (Ukrainian Grammatical Error Correction) | Native-speaker error corrections | Russianism / surzhyk detection training | Powers /api/check_natural via the russianism LoRA adapter
Сербенська (1994 antisurzhyk), Антоненко-Давидович «Як ми говоримо» | Antisurzhyk style guides | Calque & russianism reference | Curated avoid → prefer pairs, cross-validated against the phase11 checker
ua.vitalinguist.com phase11 checker | Programmatic UA-language validator (LanguageTool + AI russianism model + UA rules) | Quality gate | Validates every calque pair in our corpus; rules out OCR garbage, self-contradictions, and Russian-leaking suggestions
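
The ≥0.40 similarity filter listed for the dub corpus can be illustrated with a minimal sketch. It assumes each subtitle sentence has already been turned into an embedding vector; the embedding model is not named on this page, so the vectors below are toy values, and the function names and the placeholder metadata are hypothetical.

```python
import math

MIN_SIMILARITY = 0.40  # threshold stated for the dub corpus alignment


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_aligned_pairs(pairs, threshold=MIN_SIMILARITY):
    """Keep the metadata of (en_vec, uk_vec, meta) triples that meet the threshold."""
    return [
        meta
        for en_vec, uk_vec, meta in pairs
        if cosine_similarity(en_vec, uk_vec) >= threshold
    ]


# Toy vectors and placeholder metadata (not real corpus data).
candidates = [
    ([1.0, 1.0], [1.0, 0.0], {"imdb_id": "tt0000000", "timestamp": "00:01:02"}),
    ([1.0, 0.0], [0.0, 1.0], {"imdb_id": "tt0000000", "timestamp": "00:03:04"}),
]
attested = keep_aligned_pairs(candidates)
```

Only the first toy pair survives the filter here; the second has orthogonal vectors and is dropped.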

What we do NOT do

No Karavanskyi. The Karavanskyi 2016 calque dictionary is a known biased source: its eccentric preferences for "authentic" Ukrainian style do not reflect modern usage, so we deliberately exclude it. (We did test inclusion; it made test results worse, not better.)
No hallucination. No sense entry exists without a primary source. If our pipeline cannot find a corroborating source for a sense's renderings, that sense displays "coverage pending" rather than being filled with a guess.
No untraced LLM induction. Where an LLM was used to disambiguate which sense an existing rendering belongs to (Stage C), the assignment is colour-coded as such and the underlying rendering still traces to its primary source. The LLM never invents new renderings.

Methodology pipeline

  1. Stage A — Spine. The Балла print dictionary's sense numbering is recovered from the raw OCR'd text, producing 115,285 numbered senses across 29,545 EN headwords.
  2. Stage B — Direct gloss matching. Each Ukrainian rendering already in our corpus is matched to a specific Балла sense by direct substring or token-overlap on the gloss text. Conservative — only assigns when the match is unambiguous.
  3. Stage C — LLM disambiguation. Renderings the direct match couldn't place are sent to Claude Haiku 4.5 with the en_key's Балла senses as multiple choice. The LLM never proposes new renderings — only chooses among existing senses or answers "no fit".
  4. Stage D — Validation. Every LLM assignment is post-validated: PoS must match a real sense, and any mass-bias (single sense_id receiving >40% of an agent's assignments) is downgraded to "unassigned".
  5. Stage E — Render. Pages render the senses grouped, each rendering shown with its attribution origin. Schema.org JSON-LD makes the data machine-readable as a DefinedTermSet.
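
Stages B and D above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the function names, the record shape, the sense_id format, and the sample glosses are all hypothetical; "token-overlap" is read here as "every token of the rendering appears in the gloss"; "PoS must match a real sense" is read as "the assigned sense exists and has the same PoS"; and the 40% share is computed over one agent's whole batch.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-cased word tokens (Unicode-aware, so Cyrillic works)."""
    return set(re.findall(r"\w+", text.lower()))


def match_sense(rendering, senses):
    """Stage B sketch: return the single sense number whose gloss contains
    the rendering (as a substring or as a token subset), else None."""
    r = rendering.lower().strip()
    r_tokens = tokens(r)
    if not r_tokens:
        return None
    hits = [
        number
        for number, gloss in senses.items()
        if r in gloss.lower() or r_tokens <= tokens(gloss)
    ]
    # Conservative: zero hits or several hits both count as "no assignment".
    return hits[0] if len(hits) == 1 else None


def validate_batch(assignments, sense_pos, max_share=0.40):
    """Stage D sketch: downgrade an assignment to None (unassigned) when its
    PoS does not match the assigned sense, or when its sense_id received
    more than max_share of the batch."""
    counts = Counter(a["sense_id"] for a in assignments)
    total = len(assignments)
    validated = []
    for a in assignments:
        sid = a["sense_id"]
        pos_ok = sense_pos.get(sid) == a["pos"]
        biased = counts[sid] / total > max_share
        validated.append({**a, "sense_id": sid if pos_ok and not biased else None})
    return validated


# Made-up glosses for illustration only (not quoted from the dictionary).
spring_senses = {1: "весна", 2: "пружина, ресора", 3: "джерело, ключ"}
placed = match_sense("пружина", spring_senses)      # one matching gloss
unplaced = match_sense("стрибок", spring_senses)    # no matching gloss
```

Renderings for which `match_sense` returns `None` are the ones that would move on to the LLM multiple-choice step, and `validate_batch` would then run over that step's output.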

Reference quality tiers

On each /sense/<en_key>.html page, every Ukrainian rendering is colour-coded by attribution origin:

Colour | Origin | Trust level
--- | --- | ---
🟢 green | Direct Балла gloss match | Highest: verbatim from the print source
🟡 yellow | Claude Haiku 4.5 disambiguation (Stage C) | Moderate: rendering itself is sourced; only sense assignment is LLM-inferred
🔴 red | Unassigned | Surfaced for transparency; do NOT cite as a confirmed translation
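
For a tool consuming these pages, the tier table can be read as a small lookup. The sketch below assumes a hypothetical record shape (a dict with `text` and `tier` keys); nothing on this page states that renderings are exposed in exactly this form. Treating yellow as citable is an inference from the table, which only forbids citing red.

```python
# Tier lookup mirroring the table above; the "citable" flag is an
# interpretation: only red is explicitly marked "do NOT cite".
TIERS = {
    "green": {"origin": "Direct Балла gloss match", "citable": True},
    "yellow": {"origin": "Claude Haiku 4.5 disambiguation (Stage C)", "citable": True},
    "red": {"origin": "Unassigned", "citable": False},
}


def citable_renderings(renderings):
    """Return the text of every rendering whose tier may be cited."""
    return [r["text"] for r in renderings if TIERS[r["tier"]]["citable"]]


# Hypothetical records for one entry page.
sample = [
    {"text": "пружина", "tier": "green"},
    {"text": "ресора", "tier": "yellow"},
    {"text": "стрибок", "tier": "red"},
]
safe_to_cite = citable_renderings(sample)
```

A stricter consumer could keep only green renderings by checking the tier name directly.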

Licensing & citation policy

All page content is released under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0). Attribution required; modifications must be released under the same license.

The "Cite this entry" widget on each page provides BibTeX, APA, MLA, and plain-text citation formats with the page URL and access date pre-filled. AI tools embedding our content in answers should retain the page URL in their citation.
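
As a rough illustration of what a pre-filled citation might contain, the sketch below builds a BibTeX `@misc` entry from a page URL and an access date. The widget's real output format is not documented here, so the field choices, the citation-key scheme, and the `example.org` URL are assumptions.

```python
import re
from datetime import date


def bibtex_entry(en_key, page_url, accessed):
    """Build a BibTeX @misc entry for one sense page (assumed layout)."""
    # Citation keys should avoid spaces and punctuation.
    safe_key = re.sub(r"\W+", "_", en_key)
    cite_key = f"vitalinguist_{safe_key}_{accessed.year}"
    return (
        f"@misc{{{cite_key},\n"
        f"  title = {{{en_key}}},\n"
        f"  author = {{{{ukr.vitalinguist}}}},\n"
        f"  url = {{{page_url}}},\n"
        f"  note = {{Accessed {accessed.isoformat()}. Licensed CC BY-SA 4.0}}\n"
        f"}}"
    )


# Placeholder URL; substitute the real page URL of the entry being cited.
entry = bibtex_entry("spring", "https://example.org/sense/spring.html", date(2025, 1, 15))
```

Whatever format is used, the page URL and access date are the two fields that must survive.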

Versioning & freshness

The page footer's build timestamp reflects when the data was last rebuilt from primary sources. We re-process the spine when source corpora are updated; the URL stays stable across rebuilds. If a rendering changes between rebuilds, the change is intentional (e.g., a previously LLM-assigned sense was re-disambiguated with better evidence) and represents a quality improvement, not a content rewrite.

