This repository contains a collection of common datasets for testing lossless compression.
For each dataset, the individual files were compressed using the xz utility.

| Corpus | Compressed size | Decompressed size | Path | URL |
|---|---|---|---|---|
| Artificial | 88K | 304K | ./compressed/artificial | https://corpus.canterbury.ac.nz/descriptions/#artificl |
| Calgary | 916K | 3.1M | ./compressed/calgary | https://corpus.canterbury.ac.nz/descriptions/#calgary |
| Canterbury | 500K | 2.7M | ./compressed/canterbury | https://corpus.canterbury.ac.nz/descriptions/#cantrbry |
| Large | 2.5M | 11M | ./compressed/large | https://corpus.canterbury.ac.nz/descriptions/#large |
| Large Text | 245M | 1.1G | ./compressed/large_text | https://mattmahoney.net/dc/textdata.html |
| Miscellaneous | 428K | 980K | ./compressed/miscellaneous | https://corpus.canterbury.ac.nz/descriptions/#misc |
| Neuro | 22M | 55M | ./compressed/neuro | https://github.com/neurolabusc/zlib-bench |
| Silesia | 47M | 203M | ./compressed/silesia | https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia |
| Snappy | 868K | 2.9M | ./compressed/snappy | https://github.com/google/snappy |
| Squash | 228M | 756M | ./compressed/squash | https://github.com/nemequ/squash-corpus |

The following tools are required:

- Git Large File Storage: to access the actual dataset files
- XZ Utils: to decompress the datasets with xz

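Before running the scripts, it may help to confirm that the required tools are on the PATH. The following is a minimal sketch and not part of this repository: the helper name `require_tool` is hypothetical, and the commented `git lfs pull` step assumes the repository was cloned before Git LFS was set up, so that only pointer files are present.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): report whether a
# command is available on PATH. Returns non-zero when it is missing.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example usage (commented out so the snippet has no side effects):
# require_tool git && require_tool git-lfs && require_tool xz
# git lfs pull    # replace LFS pointer files with the real dataset files
```

If any tool is reported as missing, install it through your system's package manager before continuing.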
For simple usage, you can decompress all datasets by running the following:

    ./decompress.sh -a

After that, the decompressed datasets can be found in ./decompressed/.

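If only a single file is needed, xz can also be called directly. The sketch below is an illustration, not the logic of `decompress.sh`: the helper name `decompress_one` is hypothetical, and it assumes the compressed files carry a `.xz` suffix.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): decompress one .xz file
# into an output directory, leaving the compressed original untouched.
# Assumes the input file name ends in ".xz".
decompress_one() {
    src=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    # Strip the directory part and the .xz suffix to get the output name.
    name=$(basename "$src" .xz)
    # --stdout sends the decompressed data to standard output, so the
    # source file is neither modified nor removed.
    xz --decompress --stdout "$src" > "$out_dir/$name"
}

# Example usage (replace <file> with a real file name from the dataset):
# decompress_one compressed/canterbury/<file>.xz decompressed/canterbury
```

Writing to standard output rather than decompressing in place keeps the tracked files in ./compressed unchanged.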
Additionally, you can merge each decompressed dataset into a single file by running the following:

    ./merge.sh -a

The resulting files will be stored in ./merged/.

    $ ./decompress.sh
    Decompress datasets using xz utility
    Usage: ./decompress.sh [-a|--all] [-h|--help]
                           [-d|--dataset-dir <path>] [-o|--output-dir <path>]
                           {<dataset>}
    Options:
      -a|--all          Decompress all datasets from --dataset-dir
      -h|--help         Print this help
      -d|--dataset-dir  Directory to find datasets for iterative decompression
                        (default: ./compressed)
      -o|--output-dir   Directory to which decompressed datasets will be saved
                        (default: ./decompressed)

The file search depth for both the datasets directory and the individual datasets is fixed at 1.

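A search depth of 1 means that files directly inside a directory are picked up, while files in nested subdirectories are ignored. The sketch below illustrates that behaviour; it is not taken from `decompress.sh`, the helper name `list_depth_one` is hypothetical, and it assumes a `find` that supports `-maxdepth` (GNU and BSD find both do).

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): list regular files
# located directly inside a directory, ignoring subdirectory contents.
list_depth_one() {
    # -maxdepth 1 stops find from descending below the given directory;
    # -type f keeps regular files only.
    find "$1" -maxdepth 1 -type f
}

# Example usage:
# list_depth_one compressed/calgary
```

In practice this means a dataset whose files are organized into subfolders would need to be flattened, or passed one subfolder at a time.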
    $ ./merge.sh
    Merge datasets into singular large files
    Usage: ./merge.sh [-a|--all] [-h|--help]
                      [-d|--dataset-dir <path>] [-o|--output-dir <path>]
                      {<dataset>}
    Options:
      -a|--all          Merge all datasets from --dataset-dir
      -h|--help         Print this help
      -d|--dataset-dir  Directory to find datasets for iterative merging
                        (default: ./decompressed)
      -o|--output-dir   Directory to which merged datasets will be saved
                        (default: ./merged)

Before merging, the files are sorted in case-insensitive alphanumeric order.

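The sort order matters because it determines the byte layout of the merged file. As a rough illustration, the sketch below concatenates files in case-insensitive name order. It is not the code of `merge.sh`, and the script's exact ordering rule may differ; the helper name `merge_sorted` is hypothetical, and it assumes file names contain no newlines and that the output file lives outside the source directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): concatenate the files
# directly inside a directory into one output file, ordered by name
# without regard to letter case.
merge_sorted() {
    src_dir=$1
    out_file=$2
    # Start from an empty output file.
    : > "$out_file" || return 1
    # sort -f folds lowercase to uppercase when comparing; LC_ALL=C keeps
    # the ordering independent of the caller's locale.
    ls "$src_dir" | LC_ALL=C sort -f | while IFS= read -r name; do
        # Skip subdirectories and anything else that is not a regular file.
        if [ -f "$src_dir/$name" ]; then
            cat "$src_dir/$name" >> "$out_file"
        fi
    done
}

# Example usage:
# merge_sorted decompressed/silesia /tmp/silesia.merged
```

With a case-sensitive sort in the C locale, a file named `C.txt` would come before `b.txt`; folding case places `b.txt` first, which keeps the result stable across differently capitalized file names.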
- Decompress all datasets and save to ~/my_datasets:

      ./decompress.sh -a -o ~/my_datasets

- Decompress Canterbury and Calgary:

      ./decompress.sh compressed/canterbury compressed/calgary

- Merge all datasets from ~/my_datasets:

      ./merge.sh -a -d ~/my_datasets

- Merge decompressed Silesia and save to ~/my_datasets:

      ./merge.sh -o ~/my_datasets decompressed/silesia

The selection and arrangement of datasets in this collection, along with the additional scripts created by the repository maintainer, are dedicated to the public domain under the CC0 1.0 Universal license.

I do not claim ownership of any of the datasets or corpora provided. The individual datasets remain the property of their respective authors and are subject to their own original licenses. For licensing information, please see the source URLs in the datasets table above.