A BERT-based Masked Language Model pre-trained from scratch on Vietnamese text, built with pure PyTorch.
ViMLM pre-trains a BERT encoder on Vietnamese corpora using the Masked Language Modeling (MLM) objective. The goal is to produce contextual word representations that capture Vietnamese morphology, tonal patterns, and syntax — suitable for fine-tuning on downstream NLP tasks.
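To make the MLM objective concrete, here is a minimal sketch of BERT-style token masking in plain Python. It assumes the scheme from Devlin et al. (select about 15% of tokens; of those, 80% become the mask token, 10% become a random token, 10% stay unchanged). The function name `mask_tokens`, its parameters, and the token ids are hypothetical and are not taken from the `src/` package, which may mask differently and works on PyTorch tensors rather than lists.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_prob=0.15, seed=None):
    """Return (inputs, labels) for one sequence under BERT-style masking.

    Illustrative only: names and defaults are assumptions, not the project's API.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    # -100 matches the default ignore_index of torch.nn.CrossEntropyLoss,
    # so unselected positions would contribute nothing to the loss.
    labels = [-100] * len(inputs)

    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with probability mlm_prob.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random id (may coincide with the original
            # or with a special id; a real implementation might exclude those)
            inputs[i] = rng.randrange(vocab_size)
        # remaining 10%: keep the original token in the input

    return inputs, labels


# Hypothetical ids: 0 and 2 stand in for [CLS] and [SEP], 4 for [MASK].
ids = [0, 11, 12, 13, 14, 15, 2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=100, special_ids={0, 2}, seed=0)
```

Positions where `labels` is `-100` are ignored by the loss, so the encoder is trained only on the selected tokens while still seeing the full bidirectional context.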
Existing multilingual models (mBERT, XLM-R) share their vocabulary and capacity across many languages, so Vietnamese is under-represented. Training on a dedicated Vietnamese corpus is intended to yield richer, language-specific representations.
- src/: Python package containing the model components, pipelines, datasets, and callbacks.
- notebooks/: Contains the standalone pretraining notebook.
- config/: Pretraining configuration files.
- data/: Raw corpora for training and evaluation.
For quick testing or training in cloud environments with zero local setup, use the standalone pretraining notebook in notebooks/. It includes all modules (model, dataset, training logic) inline, so you can open and run it directly in Google Colab on a GPU instance.
We recommend the fast uv package manager for installing dependencies into a standard Python virtual environment:
# Create a Python virtual environment
python3 -m venv .venv
# Install dependencies
uv pip install --python .venv "torch>=2.4.1" huggingface_hub==0.35.3 PyYAML==6.0.2 transformers==4.53.2 wandb==0.28.0

To run the Masked Language Model pretraining locally:
# Start pretraining with config parameters
.venv/bin/python -m src

- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Nguyen et al. (2020). PhoBERT: Pre-trained language models for Vietnamese