Open NLP Models for
Every Wikipedia Language
Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for 358+ languages. Built for researchers, educators, and developers.
pip install wikilangs

What is Wikilangs?
Multilingual by Design
Pre-trained models for every Wikipedia language, from English to Moroccan Arabic to Swahili. Same API, any language.
Lightweight & Fast
Traditional NLP models optimized for resource-constrained environments. No GPU required. Run anywhere.
Research-Ready
Every language includes comprehensive evaluation reports with ablation studies, metrics, and visualizations.
Easy Integration
Simple Python API, HuggingFace integration, and LLM utilities for extending language models with new vocabularies.
Available Models
BPE Tokenizers
Byte-Pair Encoding tokenizers in 4 vocabulary sizes: 8k, 16k, 32k, 64k.
- SentencePiece format
- HuggingFace compatible
- Subword tokenization
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tokens = tok.tokenize("Hello, world!")

N-gram Models
Language models for text scoring and next-token prediction.
- 2, 3, 4, 5-gram sizes
- Word and subword variants
- Log probability scoring
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
score = ng.score("Natural language processing")

Markov Chains
Text generation models with configurable context depth.
- Context depths 1-5
- Probabilistic generation
- Seed text support
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
text = mc.generate(length=50)

Vocabularies
Word dictionaries with frequency and IDF information.
- Frequency counts
- IDF scores
- Prefix search
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
info = vocab.lookup("language")

Word Embeddings
Position-aware embeddings via BabelVec in multiple dimensions.
- 32d, 64d, 128d sizes
- Sentence embeddings
- RoPE, decay, sinusoidal
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
vec = emb.embed_word("language")

LLM Integration
Extend large language models with Wikilangs vocabularies.
- Vocabulary merging
- Token freezing
- Transformers compatible
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)

What Can You Build?
View All Recipes
Wikilangs models enable a wide range of NLP applications. Here are some inspiring examples:
Language Detection
Identify what language text is written in by scoring against multiple language models.
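The idea can be sketched without wikilangs itself: train a tiny character-trigram model per candidate language and pick the language whose model assigns the text the highest log probability. The toy corpora and the `detect` helper below are illustrative, not part of the wikilangs API; a real pipeline would score with wikilangs n-gram models instead.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    # Overlapping character n-grams, e.g. "cat " -> ["cat", "at "].
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpus, n=3):
    # Relative trigram frequencies act as a tiny per-language model.
    counts = Counter(char_ngrams(corpus.lower(), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(model, text, n=3, floor=1e-6):
    # Sum of log probabilities; unseen trigrams get a small floor.
    return sum(math.log(model.get(g, floor)) for g in char_ngrams(text.lower(), n))

# Toy corpora; in practice you would score against one model per language.
models = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat"),
    "es": train("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

def detect(text):
    # The highest-scoring language model wins.
    return max(models, key=lambda lang: score(models[lang], text))
```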
View Recipe →

Autocomplete
Build smart text completion with prefix matching and n-gram predictions.
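A minimal self-contained sketch of the technique, assuming only a small training corpus: rank vocabulary words matching the typed prefix by frequency, and predict the next word from bigram counts. The `Autocomplete` class is illustrative, not part of the wikilangs API.

```python
from collections import Counter, defaultdict

class Autocomplete:
    def __init__(self, corpus):
        words = corpus.lower().split()
        self.vocab = Counter(words)          # word frequencies for ranking
        self.bigrams = defaultdict(Counter)  # next-word counts per word
        for prev, nxt in zip(words, words[1:]):
            self.bigrams[prev][nxt] += 1

    def complete(self, prefix, k=3):
        # Vocabulary words sharing the prefix, most frequent first.
        hits = [(w, c) for w, c in self.vocab.items() if w.startswith(prefix)]
        return [w for w, _ in sorted(hits, key=lambda x: -x[1])[:k]]

    def predict_next(self, word, k=3):
        # Most likely following words under the bigram counts.
        return [w for w, _ in self.bigrams[word.lower()].most_common(k)]

ac = Autocomplete("natural language processing makes language models useful for language tasks")
```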
View Recipe →

Text Similarity
Measure semantic similarity for FAQ matching, duplicate detection, and search.
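As a rough sketch of the FAQ-matching idea, the snippet below compares a query against candidate questions with cosine similarity over sparse bag-of-words vectors; in a real pipeline, dense wikilangs sentence embeddings would replace these toy vectors. The FAQ bank and helper names are illustrative.

```python
from collections import Counter
import math

def bow(text):
    # Sparse bag-of-words vector as a word -> count mapping.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

faq = [
    "how do i reset my password",
    "how do i delete my account",
    "where can i download the app",
]
query = "password reset help"
best = max(faq, key=lambda q: cosine(bow(q), bow(query)))
```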
View Recipe →

Extend LLM Vocabulary
Add language support to LLMs like LLaMA for low-resource languages.
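The core of vocabulary extension is a merge that appends only tokens the base vocabulary lacks, then grows the embedding matrix to match. The pure-Python sketch below shows just the merge step; with HuggingFace transformers the equivalent flow is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The base vocabulary and tokens here are illustrative.

```python
def merge_vocab(base_vocab, new_tokens):
    # Append only tokens the base vocabulary lacks, assigning fresh ids
    # after the existing range so original token ids stay stable.
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    added = []
    for tok in new_tokens:
        if tok not in merged:
            merged[tok] = next_id
            next_id += 1
            added.append(tok)
    return merged, added

# Tiny illustrative base vocabulary; "salam" and "darija" stand in for
# tokens a Moroccan Arabic ('ary') vocabulary might contribute.
base = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
merged, added = merge_vocab(base, ["hello", "salam", "darija"])
```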
View Recipe →

Code-Switching
Detect when text switches between languages: Spanglish, Hinglish, and more.
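One way to sketch this: score each word against a model per language and tag it with the winner, so switch points show up as tag changes. The toy corpora below stand in for per-language wikilangs models, and `tag_words` is an illustrative helper, not part of the API.

```python
from collections import Counter
import math

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(corpus):
    counts = Counter(trigrams(corpus))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(model, text, floor=1e-6):
    return sum(math.log(model.get(g, floor)) for g in trigrams(text))

# Toy corpora standing in for per-language models.
models = {
    "en": train("i am going to the market to buy some food for dinner tonight"),
    "es": train("voy al mercado a comprar comida para la cena de esta noche"),
}

def tag_words(sentence):
    # Label each word with the language whose model scores it highest;
    # padding with spaces keeps word-boundary trigrams.
    return [(w, max(models, key=lambda l: score(models[l], f" {w} ")))
            for w in sentence.lower().split()]

tags = dict(tag_words("i am going al mercado"))
```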
View Recipe →

Anomaly Detection
Find gibberish, spam, and out-of-domain content using perplexity scoring.
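The perplexity-scoring idea can be sketched with a tiny character-trigram model: in-domain text scores close to the training distribution, while gibberish hits the probability floor on nearly every trigram and its perplexity explodes. The corpus and threshold below are illustrative; a real pipeline would use a wikilangs n-gram model.

```python
from collections import Counter
import math

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Toy in-domain corpus.
corpus = "the cat sat on the mat and the dog sat on the log near the door"
counts = Counter(trigrams(corpus))
total = sum(counts.values())

def perplexity(text, floor=1e-6):
    # exp of the average negative log probability per trigram;
    # the higher the value, the less the text looks like the corpus.
    grams = trigrams(text)
    logp = sum(math.log((counts.get(g, 0) / total) or floor) for g in grams)
    return math.exp(-logp / len(grams))

def is_gibberish(text, threshold=1000.0):
    # Out-of-domain strings blow far past any reasonable threshold.
    return perplexity(text) > threshold
```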
View Recipe →

Who Uses Wikilangs?
Linguists & Researchers
Study language patterns, compare linguistic features across languages, and access pre-computed statistics for your research.
NLP Practitioners
Bootstrap projects with pre-trained models. Use lightweight alternatives to large language models for specific tasks.
Educators
Teach NLP concepts with real models. Demonstrate tokenization, language modeling, and text generation across languages.
Low-Resource Language Developers
Access models for underrepresented languages. Extend existing LLMs with new language support.
Affiliation
A private research lab specializing in low-resource languages, cultural alignment, and agentic systems.
Sponsored By
We're grateful to Featherless.ai for their generous support in making this project possible.
Ready to get started?
Install wikilangs and start exploring 358+ languages in minutes.