Open NLP Models for Every Wikipedia Language

Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for 358+ languages. Built for researchers, educators, and developers.

pip install wikilangs
358+ Languages
4 Tokenizer Sizes
5 N-gram Depths
5 Markov Depths
3 Embedding Dims
6,765,741 Total Words

What is Wikilangs?

Multilingual by Design

Pre-trained models for every Wikipedia language, from English to Moroccan Arabic to Swahili. Same API, any language.

Lightweight & Fast

Traditional NLP models optimized for resource-constrained environments. No GPU required. Run anywhere.

Research-Ready

Every language includes comprehensive evaluation reports with ablation studies, metrics, and visualizations.

Easy Integration

Simple Python API, HuggingFace integration, and LLM utilities for extending language models with new vocabularies.

Available Models

BPE Tokenizers

Byte-Pair Encoding tokenizers in 4 vocabulary sizes: 8k, 16k, 32k, 64k.

  • SentencePiece format
  • HuggingFace compatible
  • Subword tokenization
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tokens = tok.tokenize("Hello, world!")
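
To get a feel for how vocabulary size affects segmentation, the sketch below tokenizes one sentence at each of the four advertised sizes. It reuses only the calls shown above; the expectation that larger vocabularies yield fewer tokens is a general BPE property, not a measured result.

from wikilangs import tokenizer

sentence = "Tokenization granularity depends on vocabulary size."
for size in (8000, 16000, 32000, 64000):
    tok = tokenizer('latest', 'en', size)
    # Larger BPE vocabularies merge subwords more aggressively,
    # so the same sentence should come out in fewer tokens.
    print(size, len(tok.tokenize(sentence)))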

N-gram Models

Language models for text scoring and next-token prediction.

  • 1- to 5-gram depths
  • Word and subword variants
  • Log probability scoring
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
score = ng.score("Natural language processing")
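
Because scores are log probabilities, they are most useful for ranking alternatives. A minimal sketch, using only the score call above: a fluent sentence should score higher (less negative) than a scrambled one.

from wikilangs import ngram

ng = ngram('latest', 'en', gram_size=3)
# A higher log probability means the model finds the text more
# plausible, so natural word order should outrank a shuffled version.
fluent = ng.score("the cat sat on the mat")
shuffled = ng.score("mat the on cat sat the")
print(fluent > shuffled)  # expected: True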

Markov Chains

Text generation models with configurable context depth.

  • Context depths 1-5
  • Probabilistic generation
  • Seed text support
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
text = mc.generate(length=50)
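
The feature list mentions seed text support, though the parameter name is not shown above. The sketch below assumes a hypothetical seed keyword argument; check the package docs for the actual name.

from wikilangs import markov

mc = markov('latest', 'en', depth=3)
# Unseeded generation, as in the snippet above.
print(mc.generate(length=50))
# Seeded continuation; `seed` is an assumed parameter name.
print(mc.generate(length=50, seed="The history of"))

As with any Markov chain, depth trades variety for coherence: depth=1 reads like word salad, while depth=5 reproduces longer verbatim spans of the training text.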

Vocabularies

Word dictionaries with frequency and IDF information.

  • Frequency counts
  • IDF scores
  • Prefix search
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
info = vocab.lookup("language")
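
To build intuition for the frequency and IDF fields, compare a very common word with a rare one. The sketch below assumes only that lookup returns a printable record, as implied above; the exact shape of the result may differ.

from wikilangs import vocabulary

vocab = vocabulary('latest', 'en')
# A frequent word should show a high count and low IDF, a rare word
# the opposite: IDF rewards words that appear in few documents.
for word in ("the", "serendipity"):
    print(word, vocab.lookup(word))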

Word Embeddings

Position-aware embeddings via BabelVec in multiple dimensions.

  • 32d, 64d, 128d sizes
  • Sentence embeddings
  • RoPE, decay, and sinusoidal positional weighting
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
vec = emb.embed_word("language")
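
Word vectors become useful once you compare them. A minimal sketch, assuming embed_word returns a plain numeric vector that NumPy can consume:

import numpy as np
from wikilangs import embeddings

emb = embeddings('latest', 'en', dimension=64)

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, near 0 for unrelated.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words should score higher than unrelated pairs.
print(cosine(emb.embed_word("language"), emb.embed_word("speech")))
print(cosine(emb.embed_word("language"), emb.embed_word("banana")))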

LLM Integration

Extend large language models with Wikilangs vocabularies.

  • Vocabulary merging
  • Token freezing
  • Transformers compatible
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)
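
A sketch of the end-to-end flow with Hugging Face Transformers. The checkpoint name is a placeholder and the surrounding steps are assumptions; only the add_language_tokens call itself is taken from the API above.

from transformers import AutoModelForCausalLM, AutoTokenizer
from wikilangs.llm import add_language_tokens

# "your-base-model" is a placeholder; substitute any causal LM checkpoint.
model = AutoModelForCausalLM.from_pretrained("your-base-model")
# The matching HF tokenizer; how wikilangs updates it is not shown above.
hf_tokenizer = AutoTokenizer.from_pretrained("your-base-model")

# Add a 32k Moroccan Arabic (ary) vocabulary to the model, as above.
add_language_tokens(model, 'ary', 32000)

Newly added embedding rows still need fine-tuning on target-language text before they carry signal; the token-freezing option listed above presumably exists to protect the original weights during that step.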

Who Uses Wikilangs?

Linguists & Researchers

Study language patterns, compare linguistic features across languages, and access pre-computed statistics for your research.

NLP Practitioners

Bootstrap projects with pre-trained models. Use lightweight alternatives to large language models for specific tasks.

Educators

Teach NLP concepts with real models. Demonstrate tokenization, language modeling, and text generation across languages.

Low-Resource Language Developers

Access models for underrepresented languages. Extend existing LLMs with new language support.

Created By

Omar Kamali

Researcher focused on multilingual NLP and accessible AI

Affiliation

Omneity Labs

Private research lab specializing in low-resource languages, cultural alignment, and agentic systems

Ready to get started?

Install wikilangs and start exploring 358+ languages in minutes.