Merged
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
Original file line number Diff line number Diff line change
@@ -18,11 +18,11 @@ Please review your pull request for the following steps. Feel free to delete any
- [ ] If no generative AI was used, then tick this box

Alternatively, if generative AI was used, then confirm you have:

- [ ] Attributed any generative AI (such as GitHub Copilot) that was used in this PR. See [our contributing guide](https://github.com/ACCESS-Community-Hub/PyEarthTools?tab=contributing-ov-file#generative-ai-usage) for more information.
- [ ] Included the name and version of the tool or system in the pull request
- [ ] Described the scope of that use

Finally,

- [ ] Mark the PR as ready to review. Note - we encourage you to ask for feedback at the outset or at any time during the work.
12 changes: 6 additions & 6 deletions README.md
@@ -12,10 +12,10 @@
|![](https://pyearthtools.readthedocs.io/en/latest/_images/notebooks_demo_FourCastNeXt_Inference_9_1.png)<br>A weather prediction from a model trained with PyEarthTools.|![](https://pyearthtools.readthedocs.io/en/latest/_images/notebooks_tutorial_Working_with_Climate_Data_14_2.svg)<br>A data processing flow composed for working with climate data.|
|:-:|:-:|

Source Code: [github.com/ACCESS-Community-Hub/PyEarthTools](https://github.com/ACCESS-Community-Hub/PyEarthTools)
Documentation: [pyearthtools.readthedocs.io](https://pyearthtools.readthedocs.io)
Tutorial Gallery: [available here](https://pyearthtools.readthedocs.io/en/latest/notebooks/Gallery.html)
New Users Guide: [available here](https://pyearthtools.readthedocs.io/en/latest/newuser.html)

**If you use `PyEarthTools` for your work or a publication, [please cite our work](https://pyearthtools.readthedocs.io/en/latest/#citing-pyearthtools).**

@@ -59,8 +59,8 @@ PyEarthTools is a Python framework containing modules for:
- training ML models and managing experiments;
- performing inference with ML models;
- and evaluating ML models (coming soon).

PyEarthTools runs effectively on HPC (supercomputers), cloud, workstations and laptops.

## Overview of the Packages within PyEarthTools

4 changes: 2 additions & 2 deletions docs/api/bundled_models/bundled_index.md
@@ -1,9 +1,9 @@
# Bundled Models Index

Unlike the other directories in the 'packages' directory of PyEarthTools, the "bundled_models" directory does not itself contain a "bundled models" Python package. Rather, it contains multiple model packages in separate directories. Each of these bundled models **is** a Python package. As such, "bundled_models" is not itself installable. This page will provide an index table for each bundled model.

At the current time, the following bundled models are available:
- [FourCastNeXt by Guo et al. (2024).](https://doi.org/10.48550/arXiv.2401.05584)
- [LUCIE by Guan et al. (2025).](https://doi.org/10.48550/arXiv.2405.16297)

Bundled models also have configuration files in addition to the Python code. Each yaml file is also included in the table for the bundled model.
2 changes: 1 addition & 1 deletion docs/api/data/data_how_to.md
@@ -4,7 +4,7 @@ The PyEarthTools [Data API](/api/data/data_index.md) provides Data Accessors, wh

They handle the nuances of how the data set is stored and organised, such as how to walk the filesystem, how to match a user query to the files on disk, and how to subset the requested variables out of the data structure. They may also handle any transformations of the raw data that are needed, such as file compression.

A more detailed how-to guide will be written in future describing how to use the various classes in the data module.

For a general overview and examples of how to make use of some of the data module's functionality, see:

2 changes: 1 addition & 1 deletion docs/api/pipeline/pipeline_how_to.md
@@ -10,7 +10,7 @@ It is somewhat similar to an IterableDataset in PyTorch, or a DataLoader in PyTo

For more information, please see:

- [Introduction to data pipelines](project:/notebooks/tutorial/Data_Pipelines.ipynb)
- [Working with Multiple Data Sources](project:/notebooks/tutorial/MultipleSources.ipynb)
- [The pipeline API tutorials](project:/notebooks/Gallery.md#Deep-Dive---The-Pipeline-Module) in the tutorial gallery
- The [pyearthtools.pipeline](pipeline_index.md) API documentation index
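
The lazily-iterated, composable pattern described above can be sketched in plain Python. This is a hypothetical illustration only — the `Pipeline` class and `then` method here are invented for the sketch and are not the actual `pyearthtools.pipeline` API:

```python
# Minimal, hypothetical sketch of a composable data pipeline: each step
# is a transform applied lazily to every sample as it is iterated,
# similar in spirit to chaining transforms over a PyTorch IterableDataset.
class Pipeline:
    def __init__(self, source, steps=None):
        self.source = source          # any iterable of raw samples
        self.steps = steps or []      # ordered list of transforms

    def then(self, step):
        """Return a new pipeline with one more transform appended."""
        return Pipeline(self.source, self.steps + [step])

    def __iter__(self):
        for sample in self.source:
            for step in self.steps:
                sample = step(sample)
            yield sample

# Usage: scale each value, then pair it with its square.
pipe = Pipeline(range(3)).then(lambda x: x * 10).then(lambda x: (x, x * x))
print(list(pipe))  # [(0, 0), (10, 100), (20, 400)]
```

Because each `then` call returns a new pipeline, partially-built pipelines can be shared and extended without mutating one another.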
8 changes: 4 additions & 4 deletions docs/contributing.md
@@ -41,10 +41,10 @@ Generative AI tools can be helpful, but contributors must be transparent about u
6. All contributions must adhere to the [code of conduct](https://github.com/ACCESS-Community-Hub/PyEarthTools?tab=coc-ov-file).
7. Given that generative tools are evolving rapidly, this policy will likely be adjusted over time.

[1] https://joss.readthedocs.io/en/latest/policies.html#ai-usage-policy
[2] https://www.ametsoc.org/ams/publications/ethical-guidelines-and-ams-policies/author-disclosure-and-obligations/
[3] https://www.egu.eu/news/1031/statement-on-the-use-of-ai-based-tools-for-the-presentation-and-publication-of-research-results-in-earth-planetary-and-space-science/
[4] https://rmets.onlinelibrary.wiley.com/hub/ai-policy


## Contributor Recognition in Zenodo
18 changes: 9 additions & 9 deletions docs/data.md
@@ -13,10 +13,10 @@ Data is not provided by PyEarthTools (either directly or as a cloud service), it

PyEarthTools can efficiently access large, multi-terabyte data sets. These data sets are typically held on-disk at dedicated computing facilities.

At the moment, PyEarthTools has existing integrations with the data holdings at three HPC facilities:
- NCI (Australia).
- Met Office (UK).
- Earth Sciences New Zealand (formerly NIWA).

If you are working at another HPC facility, feel free to get in touch to discuss how to most effectively utilise PyEarthTools in your environment.

@@ -42,38 +42,38 @@ Additionally, you can explore the [geonetwork](https://geonetwork.nci.org.au/geo

### Connecting PyEarthTools to a new dataset in any HPC facility by coding a new data accessor

For on-disk data access, you will need to create a new accessor based on the [`pyearthtools.data.ArchiveIndex`](project:./api/data/data_api.md#pyearthtools.data.indexes.ArchiveIndex) class (or [`pyearthtools.data.Index`](project:./api/data/data_api.md#pyearthtools.data.indexes.Index) for some use cases). Additional instructions for this still need to be written. In the meantime, refer to the [NCI site archive source code](https://github.com/ACCESS-Community-Hub/PyEarthTools/tree/develop/packages/nci_site_archive) for examples.

The [HadISD tutorials](project:./notebooks/Gallery.md#Working-with-Station-Data-(Medium-Hardware-Requirements)) also demonstrate the process of creating a new data accessor. While these tutorials focus on connecting to the HadISD dataset, the patterns in these tutorials are repeatable and can be used for other datasets.
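
The general shape of a custom accessor can be sketched as follows. This is a hypothetical illustration only — the class name `MyArchive`, the `filesystem_path` method, and the directory layout are invented for this sketch; the real subclassing contract is defined by `pyearthtools.data.ArchiveIndex` and is best learned from the NCI site archive source code:

```python
from pathlib import Path

# Hypothetical sketch: the core job of an accessor is to translate a
# user query (variable, date) into the on-disk location of the data,
# keeping knowledge of the archive's directory layout in one place.
class MyArchive:
    ROOT = Path("/data/my_archive")  # invented path, for illustration

    def __init__(self, variable: str):
        self.variable = variable

    def filesystem_path(self, date: str) -> Path:
        """Map a query date (YYYY-MM-DD) to this archive's file layout."""
        year, month, _ = date.split("-")
        return self.ROOT / self.variable / year / f"{self.variable}_{year}{month}.nc"

accessor = MyArchive("temperature")
print(accessor.filesystem_path("2023-05-01").as_posix())
# /data/my_archive/temperature/2023/temperature_202305.nc
```

A real accessor would additionally open the located file and subset the requested variables, as described above.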

Additional considerations and rules-of-thumb for HPC environments are:

- Use large files rather than many small files. This makes formats like GRIB and NetCDF more appropriate than Zarr in many cases.
- If using dask for chunking data, use largeish chunks, aligned to the time dimension (or primary index dimension).
- Do not zip up large datasets; use the file format's internal compression instead. Zip is inherently single-threaded, so it can require a long, slow, bottlenecked decompression step before data subsets can be read from a large file.
- Use a format like Parquet for point clouds, station data or other irregularly-spaced, sparse data.
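
To make the chunk-alignment advice concrete, here is a plain-Python sketch with invented numbers (in practice this is the behaviour controlled by, for example, a `chunks={"time": ...}` argument when opening data with xarray/dask):

```python
# Sketch: why alignment to the time dimension matters. With chunks
# aligned to time, a query for a contiguous time window only touches
# the few chunks overlapping that window, not every chunk on disk.
def chunks_touched(chunk_size: int, start: int, stop: int) -> list:
    """Indices of the time-aligned chunks overlapping [start, stop)."""
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))

# Hourly data in largeish chunks of 720 time steps (30 days):
# reading one week (168 steps, here steps 1000-1167) touches one chunk.
print(chunks_touched(720, 1000, 1168))  # [1]
```

With small, misaligned chunks the same window would straddle many chunks, multiplying the number of reads.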

## Using PyEarthTools on a Workstation or Laptop

You can use PyEarthTools successfully on a workstation or laptop with data you download yourself.

While many geoscience datasets are so large (e.g. hundreds of terabytes) that they can only be used effectively in HPC environments, there are also many smaller datasets of interest which can be downloaded on a workstation or laptop.

### With pre-configured datasets, supported by PyEarthTools, which you download from the internet

The [Quick Start](project:./notebooks/Gallery.md#Quick-Start-(Low-Hardware-Requirements)) tutorials can run on a 4GB GPU, and include the download step for fetching around 3-10GB of data. They will also work in HPC environments.

The [station data](project:./notebooks/Gallery.md#Working-with-Station-Data-(Medium-Hardware-Requirements)) tutorials do not need a GPU, but require more data. They have been tested on a laptop with 36GB of RAM as well as on an HPC node with over 100GB of RAM. 29GB of station data will be downloaded. Additional disk space is needed for reprocessing the data, although intermediate files can later be deleted. These notebooks may require user modification to run with less than 36GB of RAM, but it should be possible with at least 16GB of RAM.

### Connecting PyEarthTools to a new dataset on a workstation or laptop

For on-disk data access, you will need to create a new accessor based on the [`pyearthtools.data.ArchiveIndex`](project:./api/data/data_api.md#pyearthtools.data.indexes.ArchiveIndex) class (or [`pyearthtools.data.Index`](project:./api/data/data_api.md#pyearthtools.data.indexes.Index) for some use cases). Additional instructions for this still need to be written. In the meantime, refer to the [NCI site archive source code](https://github.com/ACCESS-Community-Hub/PyEarthTools/tree/develop/packages/nci_site_archive) for examples.

The [HadISD tutorials](project:./notebooks/Gallery.md#Working-with-Station-Data-(Medium-Hardware-Requirements)) also demonstrate the process of creating a new data accessor. While these tutorials focus on connecting to the HadISD dataset, the patterns in these tutorials are repeatable and can be used for other datasets.

Additional considerations and rules-of-thumb for working on workstations or laptops are:

- Using a storage format like Zarr is suitable, because it can efficiently use small files to index the data
- RAM is likely to be more constrained, so consider limiting the number of workers for tools like dask or PyTorch
- Some datasets have so-called ARCO (analysis-ready, cloud-optimised) versions available. Downloading an entire dataset in this fashion may be cost-prohibitive and inefficient for model training, but may be suitable for occasional access or to download model initial conditions for a single model run.
- There are also some versions of some datasets which have been heavily compressed using lossy compression techniques, but are still close analogs for the original data. These could be used for model training, but there will be caveats as to the accuracy of such models due to the lossy compression of the training data.
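
The worker-limit advice can be illustrated with a generic standard-library sketch (the function and data here are invented; with dask the equivalent knob is its worker/scheduler configuration, and with PyTorch it is the `num_workers` argument of the DataLoader):

```python
from concurrent.futures import ThreadPoolExecutor

# Generic sketch: cap parallelism so peak memory stays bounded.
# With max_workers=2, at most two chunks are being processed at once.
def process_chunk(chunk):
    return sum(chunk)  # stand-in for real per-chunk work

chunks = [range(0, 10), range(10, 20), range(20, 30)]
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = list(pool.map(process_chunk, chunks))
print(totals)  # [45, 145, 245]
```

Lowering the worker count trades throughput for a smaller memory footprint, which is usually the right trade on a laptop.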
12 changes: 6 additions & 6 deletions docs/index.md
@@ -19,10 +19,10 @@
<figcaption>A data processing flow composed for working with climate data.</figcaption>
</figure>

Source Code: [github.com/ACCESS-Community-Hub/PyEarthTools](https://github.com/ACCESS-Community-Hub/PyEarthTools)
Documentation: [pyearthtools.readthedocs.io](https://pyearthtools.readthedocs.io)
Tutorial Gallery: [available here](./notebooks/Gallery)
New Users Guide: [available here](newuser.md)

**If you use `PyEarthTools` for your work or a publication, [please cite our work](https://pyearthtools.readthedocs.io/en/latest/#citing-pyearthtools).**

@@ -72,8 +72,8 @@ PyEarthTools is a Python framework containing modules for:
- training ML models and managing experiments;
- performing inference with ML models;
- and evaluating ML models (coming soon).

PyEarthTools runs effectively on HPC (supercomputers), cloud, workstations and laptops.

## Overview of the Packages within PyEarthTools

2 changes: 0 additions & 2 deletions docs/notebooks/tutorial/Working_with_Climate_Data.ipynb
@@ -6285,5 +6285,3 @@
"nbformat": 4,
"nbformat_minor": 5
}


@@ -18677,4 +18677,4 @@ output:
- CAPE
- CAPEmax
- CIN
- CINmax
@@ -120,6 +120,7 @@ def test_CoordinateFlatten_skip_missing():
def test_undo_CoordinateFlatten():

import sys

print(f"Recursion limit set to {str(sys.getrecursionlimit())}")

f = reshape.CoordinateFlatten(["height"])
1 change: 0 additions & 1 deletion packages/utils/NOTICE.md
@@ -5,4 +5,3 @@ The file config.py in various modules contains code taken from https://github.co
The file packages/utils/src/pyearthtools/utils/initialisation/init_parsing.py contains code from https://github.com/Lightning-AI/pytorch-lightning, released under the Apache 2.0 license, with copyright attributed to the Lightning AI team.

The file packages/utils/src/pyearthtools/utils/parsing/init_parsing.py contains code from https://github.com/Lightning-AI/pytorch-lightning, released under the Apache 2.0 license, with copyright attributed to the Lightning AI team.
