Open NLP Models for
Every Wikipedia Language
Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for 358+ languages. Built for researchers, educators, and developers.
pip install wikilangs

What is Wikilangs?
Multilingual by Design
Pre-trained models for every Wikipedia language, from English to Moroccan Arabic to Swahili. Same API, any language.
Lightweight & Fast
Traditional NLP models optimized for resource-constrained environments. No GPU required. Run anywhere.
Research-Ready
Every language includes comprehensive evaluation reports with ablation studies, metrics, and visualizations.
Easy Integration
Simple Python API, HuggingFace integration, and LLM utilities for extending language models with new vocabularies.
Available Models
BPE Tokenizers
Byte-Pair Encoding tokenizers in 4 vocabulary sizes: 8k, 16k, 32k, 64k.
- SentencePiece format
- HuggingFace compatible
- Subword tokenization
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tokens = tok.tokenize("Hello, world!")

N-gram Models
Language models for text scoring and next-token prediction.
- 2, 3, 4, 5-gram sizes
- Word and subword variants
- Log probability scoring
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
score = ng.score("Natural language processing")

Markov Chains
Text generation models with configurable context depth.
- Context depths 1-5
- Probabilistic generation
- Seed text support
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
text = mc.generate(length=50)

Vocabularies
Word dictionaries with frequency and IDF information.
- Frequency counts
- IDF scores
- Prefix search
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
info = vocab.lookup("language")

Word Embeddings
Position-aware embeddings via BabelVec in multiple dimensions.
- 32d, 64d, 128d sizes
- Sentence embeddings
- RoPE, decay, sinusoidal
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
vec = emb.embed_word("language")

LLM Integration
Extend large language models with Wikilangs vocabularies.
- Vocabulary merging
- Token freezing
- Transformers compatible
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)

What Can You Build?
View All Recipes
Wikilangs models enable a wide range of NLP applications. Here are some inspiring examples:
Language Detection
Identify what language text is written in by scoring against multiple language models.
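The idea can be sketched without wikilangs itself: train a tiny character-trigram model per candidate language and pick the language whose model assigns the text the highest log probability. The toy corpora and the `detect` helper below are illustrative, not part of the wikilangs API; a real pipeline would score with wikilangs n-gram models instead.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    # Overlapping character n-grams, e.g. "cat " -> ["cat", "at "].
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpus, n=3):
    # Relative trigram frequencies act as a tiny per-language model.
    counts = Counter(char_ngrams(corpus.lower(), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(model, text, n=3, floor=1e-6):
    # Sum of log probabilities; unseen trigrams get a small floor.
    return sum(math.log(model.get(g, floor)) for g in char_ngrams(text.lower(), n))

# Toy corpora; in practice you would score against one model per language.
models = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat"),
    "es": train("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

def detect(text):
    # The highest-scoring language model wins.
    return max(models, key=lambda lang: score(models[lang], text))
```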
View Recipe →

Autocomplete
Build smart text completion with prefix matching and n-gram predictions.
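A minimal self-contained sketch of the technique, assuming only a small training corpus: rank vocabulary words matching the typed prefix by frequency, and predict the next word from bigram counts. The `Autocomplete` class is illustrative, not part of the wikilangs API.

```python
from collections import Counter, defaultdict

class Autocomplete:
    def __init__(self, corpus):
        words = corpus.lower().split()
        self.vocab = Counter(words)          # word frequencies for ranking
        self.bigrams = defaultdict(Counter)  # next-word counts per word
        for prev, nxt in zip(words, words[1:]):
            self.bigrams[prev][nxt] += 1

    def complete(self, prefix, k=3):
        # Vocabulary words sharing the prefix, most frequent first.
        hits = [(w, c) for w, c in self.vocab.items() if w.startswith(prefix)]
        return [w for w, _ in sorted(hits, key=lambda x: -x[1])[:k]]

    def predict_next(self, word, k=3):
        # Most likely following words under the bigram counts.
        return [w for w, _ in self.bigrams[word.lower()].most_common(k)]

ac = Autocomplete("natural language processing makes language models useful for language tasks")
```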
View Recipe →

Text Similarity
Measure semantic similarity for FAQ matching, duplicate detection, and search.
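As a rough sketch of the FAQ-matching idea, the snippet below compares a query against candidate questions with cosine similarity over sparse bag-of-words vectors; in a real pipeline, dense wikilangs sentence embeddings would replace these toy vectors. The FAQ bank and helper names are illustrative.

```python
from collections import Counter
import math

def bow(text):
    # Sparse bag-of-words vector as a word -> count mapping.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

faq = [
    "how do i reset my password",
    "how do i delete my account",
    "where can i download the app",
]
query = "password reset help"
best = max(faq, key=lambda q: cosine(bow(q), bow(query)))
```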
View Recipe →

Extend LLM Vocabulary
Add language support to LLMs like LLaMA for low-resource languages.
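The core of vocabulary extension is a merge that appends only tokens the base vocabulary lacks, then grows the embedding matrix to match. The pure-Python sketch below shows just the merge step; with HuggingFace transformers the equivalent flow is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The base vocabulary and tokens here are illustrative.

```python
def merge_vocab(base_vocab, new_tokens):
    # Append only tokens the base vocabulary lacks, assigning fresh ids
    # after the existing range so original token ids stay stable.
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    added = []
    for tok in new_tokens:
        if tok not in merged:
            merged[tok] = next_id
            next_id += 1
            added.append(tok)
    return merged, added

# Tiny illustrative base vocabulary; "salam" and "darija" stand in for
# tokens a Moroccan Arabic ('ary') vocabulary might contribute.
base = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
merged, added = merge_vocab(base, ["hello", "salam", "darija"])
```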
View Recipe →

Code-Switching
Detect when text switches between languages: Spanglish, Hinglish, and more.
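One way to sketch this: score each word against a model per language and tag it with the winner, so switch points show up as tag changes. The toy corpora below stand in for per-language wikilangs models, and `tag_words` is an illustrative helper, not part of the API.

```python
from collections import Counter
import math

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(corpus):
    counts = Counter(trigrams(corpus))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(model, text, floor=1e-6):
    return sum(math.log(model.get(g, floor)) for g in trigrams(text))

# Toy corpora standing in for per-language models.
models = {
    "en": train("i am going to the market to buy some food for dinner tonight"),
    "es": train("voy al mercado a comprar comida para la cena de esta noche"),
}

def tag_words(sentence):
    # Label each word with the language whose model scores it highest;
    # padding with spaces keeps word-boundary trigrams.
    return [(w, max(models, key=lambda l: score(models[l], f" {w} ")))
            for w in sentence.lower().split()]

tags = dict(tag_words("i am going al mercado"))
```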
View Recipe →

Anomaly Detection
Find gibberish, spam, and out-of-domain content using perplexity scoring.
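The perplexity-scoring idea can be sketched with a tiny character-trigram model: in-domain text scores close to the training distribution, while gibberish hits the probability floor on nearly every trigram and its perplexity explodes. The corpus and threshold below are illustrative; a real pipeline would use a wikilangs n-gram model.

```python
from collections import Counter
import math

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Toy in-domain corpus.
corpus = "the cat sat on the mat and the dog sat on the log near the door"
counts = Counter(trigrams(corpus))
total = sum(counts.values())

def perplexity(text, floor=1e-6):
    # exp of the average negative log probability per trigram;
    # the higher the value, the less the text looks like the corpus.
    grams = trigrams(text)
    logp = sum(math.log((counts.get(g, 0) / total) or floor) for g in grams)
    return math.exp(-logp / len(grams))

def is_gibberish(text, threshold=1000.0):
    # Out-of-domain strings blow far past any reasonable threshold.
    return perplexity(text) > threshold
```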
View Recipe →

Who Uses Wikilangs?
Linguists & Researchers
Study language patterns, compare linguistic features across languages, and access pre-computed statistics for your research.
NLP Practitioners
Bootstrap projects with pre-trained models. Use lightweight alternatives to large language models for specific tasks.
Educators
Teach NLP concepts with real models. Demonstrate tokenization, language modeling, and text generation across languages.
Low-Resource Language Developers
Access models for underrepresented languages. Extend existing LLMs with new language support.
Affiliation
A private research lab specializing in low-resource languages, cultural alignment, and agentic systems.
Sponsored By
We're grateful to Featherless.ai for their generous support in making this project possible.
Ready to get started?
Install wikilangs and start exploring 358+ languages in minutes.