This repository contains everything needed to build the OneZoom tree and the other files required by the backend. It also contains scripting libraries for harvesting information from Wikidata and images from Wikimedia Commons, which can be used to populate a running OneZoom instance.
The first step to using this repo is to create a Python virtual environment and activate it:
# From the root of the repo, create a Python environment and activate it
python3 -m venv .venv
source .venv/bin/activate
# Install this package in editable mode, including the 'dev' dependencies
pip install -e '.[dev]'
# Set up git hooks including linting and DVC
pre-commit install --hook-type pre-push --hook-type post-checkout --hook-type pre-commit
After the first time, you only need to run source .venv/bin/activate whenever you want to activate the environment in a new shell.
To be able to run the pipeline, you'll also need to install wget.
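As a quick sanity check before running anything, the two prerequisites above (an active virtual environment and wget on the PATH) can be verified from Python. This is a minimal sketch, not part of the repository; the function names are hypothetical and only the standard library is used.

```python
import shutil
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def missing_tools(tools=("wget",)) -> list:
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not in_virtualenv():
        print("Warning: no virtual environment is active (source .venv/bin/activate)")
    for tool in missing_tools():
        print(f"Missing required tool: {tool}")
```

Running the script prints nothing when both prerequisites are satisfied.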
Assuming you have installed the 'dev' dependencies, you should be able to run the test suite:
python -m pytest --conf-file tests/appconfig.ini
This uses a basic conf file that sets up a fake OneZoom database. However, if you wish to test using the
real OneZoom database, you can specify a different path to an appconfig.ini file, or omit the --conf-file
option entirely, in which case the test suite will look for ../OZtree/private/appconfig.ini. That default assumes
that this repository is a sibling to a non-live OZtree installation, and that the database used by this OZtree
installation is active.
python -m pytest # Uses the "real" OneZoom database - take care!
By default the tests use mocked APIs. You can also run with the real APIs using the --real-apis switch, in which case
you will need a valid Azure Image cropping key in your appconfig.ini.
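The conf-file fallback described above can be pictured with a small helper. This is only an illustrative sketch of the described behaviour, not the test suite's actual code: resolve_conf_file is a hypothetical name, and it assumes relative paths are interpreted from the repository root.

```python
from pathlib import Path
from typing import Optional

# Default location named in the text, relative to this repository's root.
DEFAULT_CONF = Path("..") / "OZtree" / "private" / "appconfig.ini"


def resolve_conf_file(conf_file: Optional[str], repo_root: Path) -> Path:
    """Return the appconfig.ini path implied by an optional --conf-file value."""
    if conf_file is not None:
        path = Path(conf_file)
        # Assumption: relative paths are taken relative to the repo root.
        return path if path.is_absolute() else repo_root / path
    # No --conf-file given: fall back to the sibling OZtree installation.
    return repo_root / DEFAULT_CONF


if __name__ == "__main__":
    root = Path.cwd()
    for choice in ("tests/appconfig.ini", None):
        target = resolve_conf_file(choice, root)
        status = "found" if target.exists() else "not found"
        print(f"--conf-file={choice!r} -> {target} ({status})")
```

Checking which file would be picked up is a cheap way to avoid running the tests against the real database by accident.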
This project uses DVC to manage the pipeline. The build parameters are defined in params.yaml and the pipeline stages are declared in dvc.yaml.
You'll need to ask for the DVC remote credentials on the OneZoom Slack channel in order to pull cached results. To store the credentials locally, run the following commands:
dvc remote modify --local onezoom-r2 access_key_id '{ACCESS_KEY_ID}'
dvc remote modify --local onezoom-r2 secret_access_key '{SECRET_ACCESS_KEY}'
Then, if someone has already run the pipeline and pushed the results to the DVC remote, you can reproduce the build and any of the intermediate stages without downloading any of the massive source files:
source .venv/bin/activate
dvc repro --pull
DVC will pull only the cached outputs needed for stages that haven't changed. If all stages are cached, nothing needs to be re-run.
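If you prefer not to paste secrets into your shell history, the credential setup and pull can be scripted. The sketch below wraps the same dvc commands shown above with subprocess; the environment variable names OZ_DVC_ACCESS_KEY_ID and OZ_DVC_SECRET_ACCESS_KEY are hypothetical, chosen only for this example.

```python
import os
import subprocess

REMOTE = "onezoom-r2"  # remote name taken from the commands above


def credential_commands(access_key_id: str, secret_access_key: str) -> list:
    """Build the two `dvc remote modify --local` invocations as argument lists."""
    base = ["dvc", "remote", "modify", "--local", REMOTE]
    return [
        base + ["access_key_id", access_key_id],
        base + ["secret_access_key", secret_access_key],
    ]


def configure_and_pull(env=os.environ) -> None:
    """Store the credentials locally, then reproduce the build from the cache."""
    # Hypothetical variable names: export them in your shell beforehand.
    commands = credential_commands(
        env["OZ_DVC_ACCESS_KEY_ID"], env["OZ_DVC_SECRET_ACCESS_KEY"]
    )
    commands.append(["dvc", "repro", "--pull"])
    for command in commands:
        # check=True stops at the first command that fails.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    configure_and_pull()
```

Passing the arguments as lists avoids shell quoting problems if a key contains special characters.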
- Set ot_version in params.yaml to the desired OpenTree synthesis version (e.g. "v16.1"). Available versions can be found in the synthesis manifest. The OpenTree tree and taxonomy will be downloaded automatically by the download_opentree pipeline stage.
- Some source files are unversioned so will use cached results unless forced. To force re-download them all with the latest upstream data:
  dvc repro --force download_eol discover_enwiki_sql_url discover_wikidata_url download_and_filter_pageviews
  Note that download_and_filter_wikidata and download_and_filter_pageviews take several hours to run.
- Run the pipeline and push results to the shared cache:
  dvc repro
  dvc push
  If you followed the instructions to install pre-commit hooks, the dvc push will happen automatically during your git push.
- Commit dvc.lock to git.
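Because some stages take hours, it can be worth confirming which OpenTree version is configured before starting a run. The following is a rough sketch that scans params.yaml line by line rather than parsing YAML properly. It assumes ot_version is written on a single line such as ot_version: "v16.1", and the version pattern is inferred only from that one example, so treat both as assumptions.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a line such as:  ot_version: "v16.1"  (quotes optional)
OT_VERSION_LINE = re.compile(
    r"""^\s*ot_version\s*:\s*["']?([^"'\s#]+)""", re.MULTILINE
)
# Assumed shape of a synthesis version, based only on the "v16.1" example.
VERSION_SHAPE = re.compile(r"v\d+(\.\d+)*")


def read_ot_version(params_text: str) -> Optional[str]:
    """Return the configured ot_version, or None if no such line exists."""
    match = OT_VERSION_LINE.search(params_text)
    return match.group(1) if match else None


def looks_like_version(value: str) -> bool:
    """Return True if `value` has the assumed form: 'v' then dotted numbers."""
    return VERSION_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    version = read_ot_version(Path("params.yaml").read_text())
    if version is None:
        print("No ot_version line found in params.yaml")
    elif not looks_like_version(version):
        print(f"Unexpected ot_version value: {version}")
    else:
        print(f"Configured OpenTree synthesis version: {version}")
```

A line-based scan keeps the sketch dependency-free; a real YAML parser would be more robust if the file layout differs from the assumption.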
For detailed step-by-step documentation, see oz_tree_build/README.markdown.