comp-data

This repository contains a collection of common datasets for testing lossless compression.

Datasets

For each dataset, the individual files were compressed using the xz utility.

Corpus	Compressed size	Decompressed size	Path	URL
Artificial	88K	304K	./compressed/artificial	https://corpus.canterbury.ac.nz/descriptions/#artificl
Calgary	916K	3.1M	./compressed/calgary	https://corpus.canterbury.ac.nz/descriptions/#calgary
Canterbury	500K	2.7M	./compressed/canterbury	https://corpus.canterbury.ac.nz/descriptions/#cantrbry
Large	2.5M	11M	./compressed/large	https://corpus.canterbury.ac.nz/descriptions/#large
Large Text	245M	1.1G	./compressed/large_text	https://mattmahoney.net/dc/textdata.html
Miscellaneous	428K	980K	./compressed/miscellaneous	https://corpus.canterbury.ac.nz/descriptions/#misc
Neuro	22M	55M	./compressed/neuro	https://github.com/neurolabusc/zlib-bench
Silesia	47M	203M	./compressed/silesia	https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
Snappy	868K	2.9M	./compressed/snappy	https://github.com/google/snappy
Squash	228M	756M	./compressed/squash	https://github.com/nemequ/squash-corpus

Requirements

Git Large File Storage: to access the actual dataset files
XZ Utils: for decompressing the datasets using xz
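
A quick way to confirm that both requirements are installed is sketched below. The helper name `missing_tools` is hypothetical and not part of this repository; the check only assumes that Git LFS installs its usual `git-lfs` executable on `PATH`.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools git git-lfs xz)
if [ -n "$missing" ]; then
    # Word splitting is intentional here: it joins the names on one line.
    echo "Missing tools:" $missing >&2
else
    echo "All required tools found."
fi
```

If the repository was cloned before Git LFS was installed, running `git lfs pull` inside the clone should replace the pointer files with the actual dataset files.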

Usage

For simple usage, you can decompress all datasets by running the following:

./decompress.sh -a

The decompressed datasets can then be found in ./decompressed/.

Additionally, you can merge each decompressed dataset into a single file by running the following:

./merge.sh -a

The resulting files are stored in ./merged/.

Decompressing datasets

$ ./decompress.sh
Decompress datasets using xz utility

Usage: ./decompress.sh [-a|--all] [-h|--help]
                       [-d|--dataset-dir <path>] [-o|--output-dir <path>]
                       {<dataset>}
Options:
   -a|--all           Decompress all datasets from --dataset-dir
   -h|--help          Print this help
   -d|--dataset-dir   Directory to find datasets for iterative decompression
                      (default: ./compressed)
   -o|--output-dir    Directory to which decompressed datasets will be saved
                      (default: ./decompressed)

The file search depth is fixed at 1 for both the datasets directory and each individual dataset, so files in nested subdirectories are not picked up.
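
To illustrate what a search depth of 1 means in practice, here is a minimal sketch. It is not taken from decompress.sh; the function name `list_depth1` and the demo directory layout are assumptions made for illustration, and it relies on the `-mindepth`/`-maxdepth` options available in GNU and BSD `find`.

```shell
#!/bin/sh
# List regular files exactly one level below a directory (no recursion).
list_depth1() {
    find "$1" -mindepth 1 -maxdepth 1 -type f | sort
}

# Build a throwaway layout: one file at depth 1, one file nested deeper.
depth_demo=$(mktemp -d)
mkdir -p "$depth_demo/corpus/nested"
touch "$depth_demo/corpus/a.xz" "$depth_demo/corpus/nested/b.xz"

# Only corpus/a.xz is listed; corpus/nested/b.xz is skipped.
list_depth1 "$depth_demo/corpus"
```

With XZ Utils installed, each listed file could then be decompressed to a chosen location with something like `xz -dc file.xz > output_file`.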

Merging datasets

$ ./merge.sh
Merge datasets into singular large files

Usage: ./merge.sh [-a|--all] [-h|--help]
                  [-d|--dataset-dir <path>] [-o|--output-dir <path>]
                  {<dataset>}
Options:
   -a|--all           Merge all datasets from --dataset-dir
   -h|--help          Print this help
   -d|--dataset-dir   Directory to find datasets for iterative merging
                      (default: ./decompressed)
   -o|--output-dir    Directory to which merged datasets will be saved
                      (default: ./merged)

Before merging, the files are sorted in case-insensitive alphanumeric order.
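
The ordering described above can be approximated as follows. This is a sketch under assumptions, not the code of merge.sh: the function name `merge_dir` is hypothetical, and it treats "case-insensitive alphanumeric order" as a plain case-folded sort (`sort -f`), which may differ from what the script actually does.

```shell
#!/bin/sh
# Concatenate the depth-1 files of a directory into one output file,
# ordered case-insensitively by path.
# $1: dataset directory, $2: output file (must be outside $1)
merge_dir() {
    find "$1" -mindepth 1 -maxdepth 1 -type f | LC_ALL=C sort -f |
    while IFS= read -r file; do
        cat "$file"
    done > "$2"
}

# Demo: with case folding, B.txt sorts between a.txt and c.txt.
merge_demo=$(mktemp -d)
mkdir -p "$merge_demo/data" "$merge_demo/out"
echo 1 > "$merge_demo/data/a.txt"
echo 2 > "$merge_demo/data/B.txt"
echo 3 > "$merge_demo/data/c.txt"

merge_dir "$merge_demo/data" "$merge_demo/out/merged"
cat "$merge_demo/out/merged"
```

A case-sensitive sort in the C locale would place B.txt first, because uppercase letters precede lowercase ones; folding case keeps the order independent of how file names are capitalized.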

Examples

Decompress all datasets and save to ~/my_datasets:

./decompress.sh -a -o ~/my_datasets

Decompress Canterbury and Calgary:

./decompress.sh compressed/canterbury compressed/calgary

Merge all datasets from ~/my_datasets:

./merge.sh -a -d ~/my_datasets

Merge decompressed Silesia and save to ~/my_datasets:

./merge.sh -o ~/my_datasets decompressed/silesia

License

The selection and arrangement of datasets in this collection, along with the additional scripts created by the repository maintainer, are dedicated to the public domain under the CC0 1.0 Universal license.

I don't claim ownership of any of the datasets/corpora provided. The individual datasets remain the property of their respective authors and are subject to their own original licenses. For licensing information, please see the sources listed in the Datasets section.
