Flor y Canto Nahuatl

Start Here

What this is · Why it matters · What you can do with it

The largest freely available structured dataset of Classical Nahuatl in existence, paired with a formal framework for standardizing the language across speech, writing, and literature, and an open instructional track of 32 bilingual lessons mapped from beginner (A1) to upper intermediate (B2). Everything is open. Everything is free. Here's where to start, depending on who you are.

For Linguists & Researchers

The dataset you've been looking for doesn't exist anywhere else.

28,709 parsed entries from Siméon's 1885 dictionary in structured JSON. 8,465 Wiktionary lexical rows across four Nahuatl varieties. Two Universal Dependencies (UD) treebanks. 55,904 classical examples. Provenance-tracked, license-tagged, queryable. Use it for NLP, typology, historical linguistics, or computational work.

GitHub repository →

For Nahuatl Speakers & Learners

Your language now has infrastructure.

A three-layer framework that respects spoken Nahuatl as the foundation, establishes a clear written standard, and creates space for poetry, song, and literature. Governance documents define how the language is handled across registers — so the standard serves speakers, not the other way around.

Read the governance documents →

For Developers & Builders

Build on this. That's why it's open.

Structured JSON and JSONL ready for ingestion. CSV exports for quick analysis. A provenance pipeline you can fork. Build a dictionary app, a learning tool, a search engine, or a language model: the data is licensed CC BY-SA 3.0 / GFDL, and public-domain sources remain public domain. No paywall. No API key. Just download it.

Browse the data →
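
As a rough illustration of ingesting the JSONL exports, the sketch below reads records line by line and filters them by license. The field names ("headword", "source", "license") and the sample rows are assumptions made for this example, not the project's documented schema; check the actual files before relying on them.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_license(records: Iterable[dict], allowed: set[str]) -> list[dict]:
    """Keep records whose (hypothetical) "license" field is in `allowed`."""
    return [r for r in records if r.get("license") in allowed]


# Made-up sample rows; the real field names and values may differ.
SAMPLE = io.StringIO(
    '{"headword": "example-a", "source": "simeon-1885", "license": "public-domain"}\n'
    '{"headword": "example-b", "source": "wiktionary", "license": "CC BY-SA 3.0"}\n'
)

records = list(read_jsonl(SAMPLE))
public_domain = filter_by_license(records, {"public-domain"})
print(len(records), len(public_domain))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle (opened with encoding="utf-8") to read_jsonl.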

For Educators & Curriculum Builders

32 bilingual lessons, sequenced and ready.

An open instructional track built from Nāhuatlahtolli: 32 canonical lessons in English and Spanish, proficiency-mapped from A1 to B2, with vocabulary prioritization, dialogue extraction, assessment items, and product bundles. Built for reuse — fork it, adapt it, teach with it.

See the curriculum →

The Three-Layer Framework

Speech · Writing · Literature

Spoken Foundation

EHN

Eastern Huasteca Nahuatl. The living spoken base drawn from community speech. All pronunciation, phonology, and conversational register grounded here.

Written Standard

MSN

Modern Standard Nahuatl. The neutral written reference norm for education, publishing, governance, and formal communication. Clear, consistent, teachable.

Poetic Register

MSN-P

The elevated literary register for poetry, song, ceremony, and public oratory. Classical resonance with modern clarity. Where the language creates, not just communicates.

The Data

Open · Structured · Provenance-tracked

28,709

Siméon Dictionary Entries

8,465

Wiktionary Lexical Rows

55,904

Classical Examples

4

Nahuatl Varieties Covered

19,549

UD Treebank Tokens

2,169

Annotated Sentences

26,806

Unique Headwords

60,663

Classical Text Blocks

The first machine-readable dataset of Classical Nahuatl. Parsed from Siméon's 1885 dictionary, Wiktionary across four varieties (Classical, Central, Eastern Huasteca, Highland Puebla), and two Universal Dependencies treebanks. Every entry carries provenance, license tracking, and source confidence scoring.

All data is free and open under CC BY-SA 3.0 / GFDL. Public-domain sources remain public domain.
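
Universal Dependencies treebanks are normally distributed in the CoNLL-U format, so counts like the token and sentence figures above can be checked with a few lines of code. The sketch below assumes the treebank files follow standard CoNLL-U and counts syntactic words (integer IDs) as tokens; whether the project's published figures use the same counting rule is not stated here. The sample text is a placeholder, not real treebank data.

```python
from typing import Iterable


def count_conllu(lines: Iterable[str]) -> tuple[int, int]:
    """Return (sentences, tokens) for a CoNLL-U stream.

    Comment lines start with "#", a blank line ends a sentence, and token
    lines are tab-separated with the token ID in the first column.
    Multiword-token ranges ("1-2") and empty nodes ("1.1") are not counted.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            tokens += 1
        in_sentence = True
    if in_sentence:  # the stream did not end with a blank line
        sentences += 1
    return sentences, tokens


# Placeholder sentences with dummy word forms.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tw1\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\tw2\t_\t_\t_\t_\t1\tdep\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1-2\tw3w4\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tw3\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\tw4\t_\t_\t_\t_\t1\tdep\t_\t_\n"
)

sentences, tokens = count_conllu(SAMPLE.splitlines())
print(sentences, tokens)
```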

Instructional Track

Open · Bilingual · Proficiency-Mapped

32

Bilingual Lessons

4,491

Vocabulary Items

215

Extracted Dialogues

350

Construction Candidates

A complete open instructional pipeline built from Nāhuatlahtolli — the University of Texas COERLL course for Eastern Huasteca Nahuatl. 32 canonical lessons in English and Spanish, cleaned, deduplicated, and organized into proficiency bands from A1 through a B2 bridge.

The pipeline includes lesson normalization, unit assembly, vocabulary prioritization, dialogue extraction, assessment generation, product bundling, and a spoken EHN primer foundation. Every stage produces a SQLite database that builds on the last — fork any stage and extend it.
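
Because each stage is described as producing a SQLite database, a first step when forking a stage is simply to see what it contains. The sketch below lists the tables in a database and counts their rows without assuming any schema. The table names in the stand-in database ("lessons", "vocabulary") are invented for the demonstration and are not the pipeline's actual tables.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of user tables in an open SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table to its row count."""
    counts = {}
    for table in list_tables(conn):
        # Names come from sqlite_master; double any embedded quotes and wrap.
        quoted = '"' + table.replace('"', '""') + '"'
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Stand-in database; a real stage file would be opened with sqlite3.connect(path).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lessons (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE vocabulary (id INTEGER PRIMARY KEY, lesson_id INTEGER, form TEXT)"
)
conn.executemany("INSERT INTO lessons (title) VALUES (?)", [("demo 1",), ("demo 2",)])
print(row_counts(conn))
```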

A1 — Beginner

Alphabet, greetings, questions, names, numbers, colors, basic verbs, possessives, diminutives, household vocabulary, food, fields, conditional, non-specific objects.

A2 — Elementary

Professions, past tense (three parts), city navigation, colors and numbers in context. Building toward independent sentence production.

B1–B2 — Intermediate

Intransitive and transitive verbs, time division, family, appearance, reflexives, likes/dislikes, imperatives, market language, illness vocabulary, cleansing ceremonies.

Governance Documents

Constitutional framework · Version 0.1

Founding Charter

Identity, mission, principles, and governing rules of Flor y Canto Nahuatl (FCN). Version 0.1, adopted March 25, 2026.

The official register system: ten statuses, domain assignments, conversion principles, publication rules.

Mission Statements

One-sentence, one-paragraph, one-page, and website versions of the FCN mission.

Core Premises

The non-negotiable assumptions and constraints that govern all project decisions.

Success Criteria

Measurable milestones for versions 0.1, 0.5, and 1.0 of the FCN framework.

Source Hierarchy

How FCN evaluates, prioritizes, and reconciles sources. Six source classes, function-specific authority, conflict resolution.

Orthography Manual

Orthographic rules for EHN-to-MSN normalization, search fallback logic, and spelling conventions.
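
To give a feel for what search fallback can look like, the sketch below tries an exact headword match first and then falls back to a match on a folded form (case-folded, combining marks such as macrons removed). This is a generic technique chosen for illustration; the folding rule and the function names are assumptions, and the manual's actual normalization and fallback rules may differ.

```python
import unicodedata
from typing import Optional


def fold(form: str) -> str:
    """Case-fold and strip combining marks (e.g. macrons) for loose matching."""
    decomposed = unicodedata.normalize("NFD", form.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def lookup(query: str, headwords: list[str]) -> Optional[str]:
    """Try an exact match first, then fall back to a folded match."""
    if query in headwords:
        return query
    folded_index: dict[str, str] = {}
    for hw in headwords:
        folded_index.setdefault(fold(hw), hw)  # first headword wins on collision
    return folded_index.get(fold(query))


HEADWORDS = ["Nāhuatlahtolli"]
print(lookup("Nāhuatlahtolli", HEADWORDS))  # exact match
print(lookup("nahuatlahtolli", HEADWORDS))  # folded fallback
print(lookup("missing", HEADWORDS))         # no match
```

Keeping the exact match as the first step means a query typed in the standard spelling never gets redirected to a different headword that happens to fold to the same string.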

Poetic Register Inventory

Starter inventory for MSN-P: poetic lexicon, refrain particles, vocatives, elevated diction, rhetorical formulas, and annotation rules.

Resources

GitHub Repository

S3 Data Bucket

Music — Sam Itzli

Siméon Dataset

Master Lexicon Database

Instructional Track Database