Eastern Huasteca Nahuatl for Speech · Modern Standard Nahuatl for Writing

Flor y Canto Nahuatl

In xochitl in cuicatl — The flower, the song

Flor y Canto Nahuatl exists to establish Eastern Huasteca Nahuatl as a teachable spoken standard, Modern Standard Nahuatl as a respected written norm, and a living poetic high register that reconnects Nahuatl speech, literature, song, and public life.

Start Here

What this is · Why it matters · What you can do with it

This project offers the largest freely available structured dataset of Classical Nahuatl that we know of, paired with a formal framework for standardizing the language across speech, writing, and literature, and an open instructional track of 32 bilingual lessons mapped from beginner (A1) through a B2 bridge. Everything is open. Everything is free. Where to start depends on who you are.

For Linguists & Researchers

The dataset you've been looking for doesn't exist anywhere else.

28,709 parsed entries from Siméon's 1885 dictionary in structured JSON. 8,465 Wiktionary lexical rows across four Nahuatl varieties. Two Universal Dependencies (UD) treebanks. 55,904 classical examples. Provenance-tracked, license-tagged, queryable. Use it for NLP, typology, historical linguistics, or other computational work.

GitHub repository

For Nahuatl Speakers & Learners

Your language now has infrastructure.

A three-layer framework that respects spoken Nahuatl as the foundation, establishes a clear written standard, and creates space for poetry, song, and literature. Governance documents define how the language is handled across registers — so the standard serves speakers, not the other way around.

Read the governance documents

For Developers & Builders

Build on this. That's why it's open.

Structured JSON and JSONL ready for ingestion. CSV exports for quick analysis. A provenance pipeline you can fork. Build a dictionary app, a learning tool, a search engine, or a language model: the data is released under CC BY-SA 3.0 / GFDL, and public-domain sources remain public domain. No paywall. No API key. Just download it.

Browse the data

For Educators & Curriculum Builders

32 bilingual lessons, sequenced and ready.

An open instructional track built from the Nāhuatlahtolli course: 32 canonical lessons in English and Spanish, proficiency-mapped from A1 through a B2 bridge, with vocabulary prioritization, dialogue extraction, assessment items, and product bundles. Built for reuse: fork it, adapt it, teach with it.

See the curriculum

The Three-Layer Framework

Speech · Writing · Literature

Spoken Foundation

EHN

Eastern Huasteca Nahuatl. The living spoken base drawn from community speech. All pronunciation, phonology, and conversational register are grounded here.

Written Standard

MSN

Modern Standard Nahuatl. The neutral written reference norm for education, publishing, governance, and formal communication. Clear, consistent, teachable.

Poetic Register

MSN-P

The elevated literary register for poetry, song, ceremony, and public oratory. Classical resonance with modern clarity. Where the language creates, not just communicates.

The Data

Open · Structured · Provenance-tracked

28,709 Siméon Dictionary Entries
8,465 Wiktionary Lexical Rows
55,904 Classical Examples
4 Nahuatl Varieties Covered
19,549 UD Treebank Tokens
2,169 Annotated Sentences
26,806 Unique Headwords
60,663 Classical Text Blocks

A machine-readable dataset of Classical Nahuatl, built around the first machine-readable version of Siméon's 1885 dictionary. It is parsed from Siméon, Wiktionary across four varieties (Classical, Central, Eastern Huasteca, Highland Puebla), and two Universal Dependencies treebanks. Every entry carries provenance, license tracking, and source confidence scoring.

All data is free and open under CC BY-SA 3.0 / GFDL. Public-domain sources remain public domain.
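
As a rough illustration of working with the JSONL exports, the sketch below reads records line by line and counts them per variety. The sample rows and the field names (word, lang, license, source) are assumptions made up for this example, not the project's confirmed schema. Check the actual files before relying on them.

```python
import io
import json
from collections import Counter

# Hypothetical sample rows. The field names ("word", "lang", "license",
# "source") are assumptions for illustration, not the confirmed schema.
SAMPLE_JSONL = """\
{"word": "xochitl", "lang": "Classical Nahuatl", "license": "CC BY-SA 3.0", "source": "wiktionary"}
{"word": "cuicatl", "lang": "Classical Nahuatl", "license": "CC BY-SA 3.0", "source": "wiktionary"}
{"word": "xochitl", "lang": "Eastern Huasteca Nahuatl", "license": "CC BY-SA 3.0", "source": "wiktionary"}
"""


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_variety(rows):
    """Count rows per variety, skipping rows that lack a 'lang' field."""
    return Counter(row["lang"] for row in rows if "lang" in row)


# In real use, replace io.StringIO(...) with open(path, encoding="utf-8").
rows = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
counts = count_by_variety(rows)
print(counts)
```

Reading one line at a time keeps memory use flat, which matters more for the larger files than for a three-row sample.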

Instructional Track

Open · Bilingual · Proficiency-Mapped

32 Bilingual Lessons
4,491 Vocabulary Items
215 Extracted Dialogues
350 Construction Candidates

A complete open instructional pipeline built from Nāhuatlahtolli, the University of Texas COERLL course for Eastern Huasteca Nahuatl. It contains 32 canonical lessons in English and Spanish, cleaned, deduplicated, and organized into proficiency bands from A1 through a B2 bridge.

The pipeline includes lesson normalization, unit assembly, vocabulary prioritization, dialogue extraction, assessment generation, product bundling, and a spoken EHN primer foundation. Every stage produces a SQLite database that builds on the last — fork any stage and extend it.
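
One way to picture "every stage produces a SQLite database that builds on the last" is the sketch below, where each stage starts as a copy of the previous database and adds its own tables. The table and column names are hypothetical and the rows are placeholders. This is not the project's actual schema or pipeline code.

```python
import sqlite3


def new_stage(previous=None):
    """Create an in-memory database, seeded with a copy of the previous stage."""
    conn = sqlite3.connect(":memory:")
    if previous is not None:
        previous.backup(conn)  # copies every table from the earlier stage
    return conn


def table_names(conn):
    """Return the sorted table names in a database."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cur]


# Stage 1 (hypothetical): normalized lessons.
stage1 = new_stage()
stage1.execute(
    "CREATE TABLE lessons (id INTEGER PRIMARY KEY, title TEXT, band TEXT)"
)
stage1.executemany(
    "INSERT INTO lessons (id, title, band) VALUES (?, ?, ?)",
    [(1, "Greetings", "A1"), (2, "Professions", "A2")],
)
stage1.commit()

# Stage 2 (hypothetical): inherits the lessons and adds a vocabulary table.
stage2 = new_stage(stage1)
stage2.execute(
    "CREATE TABLE vocabulary (lesson_id INTEGER REFERENCES lessons(id), headword TEXT)"
)
stage2.execute("INSERT INTO vocabulary VALUES (1, 'placeholder_headword')")
stage2.commit()

print(table_names(stage1))
print(table_names(stage2))
```

Because stage 1 is never modified by stage 2, anyone can fork from an earlier stage without rerunning or disturbing the later ones.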

A1 — Beginner

Alphabet, greetings, questions, names, numbers, colors, basic verbs, possessives, diminutives, household vocabulary, food, fields, conditional, non-specific objects.

A2 — Elementary

Professions, past tense (three parts), city navigation, colors and numbers in context. Building toward independent sentence production.

B1–B2 — Intermediate

Intransitive and transitive verbs, time division, family, appearance, reflexives, likes/dislikes, imperatives, market language, illness vocabulary, cleansing ceremonies.

Governance Documents

Constitutional framework · Version 0.1

Founding Charter
Identity, mission, principles, and governing rules of FCN. Version 0.1, adopted March 25, 2026.
Register Charter
The official register system: ten statuses, domain assignments, conversion principles, publication rules.
Mission Statements
One-sentence, one-paragraph, one-page, and website versions of the FCN mission.
Core Premises
The non-negotiable assumptions and constraints that govern all project decisions.
Success Criteria
Measurable milestones for versions 0.1, 0.5, and 1.0 of the FCN framework.
Source Hierarchy
How FCN evaluates, prioritizes, and reconciles sources. Six source classes, function-specific authority, conflict resolution.
Orthography Manual
Orthographic rules for EHN-to-MSN normalization, search fallback logic, and spelling conventions.
Poetic Register Inventory
Starter inventory for MSN-P: poetic lexicon, refrain particles, vocatives, elevated diction, rhetorical formulas, and annotation rules.
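
The search fallback logic mentioned under the Orthography Manual could look something like the sketch below: try the exact spelling first, then fall back to a comparison that ignores case and vowel-length marks. The two tiers and the function names are assumptions for illustration. The manual's actual rules may differ and are not reproduced here.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks such as macrons and fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept).casefold()


def lookup(query, headwords):
    """Return (matches, tier): exact spelling first, then a mark-insensitive fallback."""
    exact = [h for h in headwords if h == query]
    if exact:
        return exact, "exact"
    key = strip_marks(query)
    loose = [h for h in headwords if strip_marks(h) == key]
    if loose:
        return loose, "normalized"
    return [], "none"


# A tiny illustrative headword list, not taken from the dataset.
headwords = ["Nāhuatlahtolli", "xōchitl", "cuīcatl"]

print(lookup("xōchitl", headwords))     # found on the exact tier
print(lookup("Xochitl", headwords))     # found on the normalized tier
print(lookup("tlahtolli", headwords))   # not found on either tier
```

Returning the tier alongside the matches lets an interface tell the user when a result came from a looser comparison rather than the spelling they typed.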

Resources

Code · Data · Curriculum · Music

GitHub Repository

Source code, parsers, governance documents, and project infrastructure. Includes fcn_source_parsers.py, fcn_legal_ingest.py, and the full pipeline.

S3 Data Bucket

Public data files: Siméon parsed JSON, Kaikki JSONL across four varieties, UD treebanks, classical example bank, and lexical rows.

Music — Sam Itzli

Original compositions in Nahuatl. Worship, poetry, and song in the tradition of in xochitl in cuicatl. New uploads daily.

Siméon Dataset

28,709 structured entries from Rémi Siméon's 1885 Dictionnaire de la langue nahuatl. The first machine-readable version. 6.2 MB JSON.

Master Lexicon Database

The FCN master lexicon in SQLite — all sources merged, register-tagged, provenance-tracked, with orthography candidates and editorial status. 100 MB.

Instructional Track Database

The complete Phase 8 instructional pipeline in SQLite — lessons, dialogues, vocabulary, units, assessments, product bundles, and spoken EHN primer.
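
Universal Dependencies treebanks are normally distributed as CoNLL-U text files, so a quick sanity check on the UD files in the data bucket might look like the sketch below. It assumes the files follow the standard CoNLL-U layout. The sample is placeholder text, not Nahuatl data, and this count of syntactic words may not be computed the same way as the token figure quoted above.

```python
# Placeholder CoNLL-U sample: ten tab-separated columns per word line,
# "#" comment lines, and a blank line between sentences.
SAMPLE_CONLLU = """\
# sent_id = demo-1
# text = placeholder sentence
1\ttoken_a\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\ttoken_b\t_\tVERB\t_\t_\t0\troot\t_\t_

# sent_id = demo-2
1-2\ttoken_cd\t_\t_\t_\t_\t_\t_\t_\t_
1\ttoken_c\t_\tPRON\t_\t_\t2\tnsubj\t_\t_
2\ttoken_d\t_\tVERB\t_\t_\t0\troot\t_\t_
"""


def count_conllu(text):
    """Return (sentences, syntactic words) for CoNLL-U text.

    Comment lines, multiword-token ranges (IDs like "1-2"), and empty
    nodes (IDs like "1.1") are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
    return sentences, words


print(count_conllu(SAMPLE_CONLLU))
```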
