Turning 30,000 Arabic domains into a better crawl

Code and data accompanying the work described in this blog post: filtering, geolocating and categorising a donated list of Arabic seed domains.

ArabicDomainQuality.xlsx: The original data received from QCRI.
arabic_seeds.ipynb: A notebook detailing the data processing and analysis.
crawl_lang_info.tsv: Summarised language information for the domains found in the CC-MAIN-2026-{21,17,12} archives.
DomainQuality_Dashboard.ipynb: Additional analysis of the quality of the pre-filtered domains, carried out by researchers at QCRI.
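
For a quick look at crawl_lang_info.tsv outside the notebooks, something like the following may help. It is only a minimal sketch: it assumes the file is tab-separated with a header row, and it makes no assumption about which columns exist. The helper name preview_tsv is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def preview_tsv(path, n_rows=5):
    """Return (column names, first n_rows rows as dicts) from a tab-separated file.

    Assumes the first line of the file is a header row.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = []
        for row in reader:
            if len(rows) >= n_rows:
                break
            rows.append(row)
        # fieldnames is None when the file is empty
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    tsv_path = Path("crawl_lang_info.tsv")
    if tsv_path.exists():
        columns, rows = preview_tsv(tsv_path)
        print(columns)
        for row in rows:
            print(row)
```

Printing the column names first is deliberate: the notebook remains the reference for what each column means.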

We use uv to manage Python dependencies; they are declared in pyproject.toml and pinned in uv.lock.

Acknowledgements

Thank you to Hamdy S. Hussein, Dr. Kareem M. Darwish and Dr. Mohamed Ahmed Yassin Eltabakh of the Qatar Computing Research Institute for providing the initial seed list, quality annotations and exploratory visualisations. These were created as part of the Fanar Project, an Arabic generative AI platform.
