commoncrawl/arabic-seed-processing

Turning 30,000 Arabic domains into a better crawl

Code and data accompanying the work described in this blog post to filter, geolocate and categorise a donation of Arabic seed domains.

Contents

  • ArabicDomainQuality.xlsx: The original data received from QCRI.
  • arabic_seeds.ipynb: A notebook detailing the data processing and analysis.
  • crawl_lang_info.tsv: Summarised language information for the domains found in the CC-MAIN-2026-{21,17,12} archives.
  • DomainQuality_Dashboard.ipynb: Additional analysis of the quality of the pre-filtered domains, carried out by researchers at QCRI.
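For a quick look at crawl_lang_info.tsv without opening the notebook, something like the following may help. It is a minimal sketch using only the Python standard library; the helper name `peek_tsv` is hypothetical, and it assumes the file is UTF-8 encoded with a header in its first row, which this README does not confirm.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_tsv(path, n=5):
    """Return (header, first n data rows) of a tab-separated file.

    Assumption: the first row is a header. If the file has no header,
    the first data row is returned in its place.
    """
    with Path(path).open(newline="", encoding="utf-8") as fh:
        # QUOTE_NONE treats quote characters as ordinary text,
        # so only tabs separate fields.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])  # empty list if the file is empty
        rows = list(islice(reader, n))
    return header, rows
```

Calling `peek_tsv("crawl_lang_info.tsv")` from the repository root returns the column names and the first five rows as lists of strings, which is enough to check the layout before loading the file elsewhere.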

We use uv to manage Python dependencies.

Acknowledgements

Thank you to Hamdy S. Hussein, Dr. Kareem M. Darwish and Dr. Mohamed Ahmed Yassin Eltabakh of the Qatar Computing Research Institute for providing the initial seed list, quality annotations and exploratory visualisations. These were created as part of the Fanar Project, an Arabic generative AI platform.
