[DOCS] Draft release post for SedonaDB 0.4.0 #3064
In addition to data frame operators, we increasingly realized that our hard-won library of 170+ spatial functions was difficult to explore and use (despite improved [SQL reference documentation](https://sedona.apache.org/sedonadb/latest/reference/sql/)!). Following the pattern of [Pandas-style datatype-specific accessors](https://pandas.pydata.org/docs/reference/series.html#accessors), you can now write expressions as chains with inline documentation helping you as you go.

```python
countries.select(
    countries.name, geometry=countries.geometry.geo.centroid().geo.buffer(0.1)
).limit(4)
```
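For readers who have not met the accessor pattern before, the chaining above can be illustrated with a minimal pure-Python sketch. The `Expr` and `GeoAccessor` classes here are hypothetical stand-ins that only build up a SQL-like string; they are not SedonaDB's implementation, and the mapping from `centroid()`/`buffer()` to `ST_Centroid`/`ST_Buffer` is an assumption made for illustration.

```python
class GeoAccessor:
    """Hypothetical namespace grouping geometry methods (not SedonaDB's class)."""

    def __init__(self, expr):
        self._expr = expr

    def centroid(self):
        # Every method returns a new Expr, which is what makes chaining possible.
        return Expr(f"ST_Centroid({self._expr.sql})")

    def buffer(self, distance):
        return Expr(f"ST_Buffer({self._expr.sql}, {distance})")


class Expr:
    """Hypothetical expression node that only records SQL-like text."""

    def __init__(self, sql):
        self.sql = sql

    @property
    def geo(self):
        # The accessor is created on demand and wraps the current expression.
        return GeoAccessor(self)


chained = Expr("geometry").geo.centroid().geo.buffer(0.1)
print(chained.sql)
```

Because the methods live on the accessor rather than on the expression itself, an editor's autocomplete after `.geo.` can list only geometry-specific functions, which is the discoverability benefit the paragraph describes.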
For this we could also have a screen cap video of what this looks like to type (where the functions and their parameters are shown as you go)
## Packaging for conda-forge

We're excited to announce that sedonadb is now available on conda-forge! Users of the conda ecosystem can now install SedonaDB with:

```shell
conda install -c conda-forge sedonadb
```

Thank you to [p-vdp](https://github.com/p-vdp) for driving this work!
@p-vdp Did I get this right / is there anything else that should go in this section?
## GPU-Accelerated Spatial Join

```bash
docker run -it --rm --gpus all -p 8888:8888 apache/sedona:sedonadb-latest
```
Hi, here are my release notes:
SedonaDB now introduces hardware acceleration via an integrated GPU Spatial Join Library. This feature significantly boosts the performance of compute-intensive spatial joins by offloading highly parallel filtering and refinement operations to the GPU.
Key Capabilities & Enhancements
- Ray Tracing (RT) Core Acceleration: Repurposes dedicated GPU RT cores to accelerate the bounding-box filtering stage of spatial queries and Point-in-Polygon (PIP) tests. This delivers massive performance gains on complex spatial joins (e.g., intersects, contains). The evaluation of PIP queries is heavily optimized to exploit RT cores, while other geometric operations run on CUDA cores.
- GPU-Optimized Storage Layout: Unlike conventional GPU databases that load entire datasets into device memory, SedonaDB only loads geometries in Well-Known Binary (WKB) format to the GPU during query execution. This allows large queries to run efficiently even with limited device memory. The WKB data is subsequently converted into a GPU-friendly format, maximizing memory throughput and enabling parallel random access directly on the device.
- CPU Fallback: Currently, only a subset of spatial predicates is supported. When executing an unsupported spatial join, the engine automatically falls back to the CPU implementation.
Prerequisites & Deployment
By default, the GPU feature is disabled and is not included in the standard published Python packages.
Hardware Requirements:
- An NVIDIA GPU with a compute capability of 7.5 or higher. A GPU without RT cores (e.g., A100, H100) should also work.
Quick Start with Docker
We provide an official Docker image to easily try this feature with a single command:
docker run -it --rm --gpus all -p 8888:8888 apache/sedona:sedonadb-latest
⚠️ Note: This pre-built image supports GPU models with compute capabilities 7.5, 8.6, and 8.9.
For other GPU models, we encourage users to build the image from source to avoid time-consuming Just-In-Time (JIT) compilation:
docker build -f docker/sedonadb-gpu.dockerfile --build-arg CMAKE_CUDA_ARCHITECTURES="<your GPU compute capability>" -t sedonadb-gpu .
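The choice between the pre-built image and a source build can be sketched as a small helper. This is only an illustration of the rules stated above (minimum compute capability 7.5; pre-built image covering 7.5, 8.6, and 8.9); `recommend_image` and its return labels are hypothetical and not part of SedonaDB.

```python
# Values taken from the notes above; the helper itself is hypothetical.
MINIMUM_CAPABILITY = (7, 5)
PREBUILT_CAPABILITIES = {(7, 5), (8, 6), (8, 9)}


def parse_capability(text):
    """Turn a string such as '8.6' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def recommend_image(capability_text):
    """Return 'unsupported', 'prebuilt', or 'build-from-source'."""
    capability = parse_capability(capability_text)
    if capability < MINIMUM_CAPABILITY:
        return "unsupported"
    if capability in PREBUILT_CAPABILITIES:
        return "prebuilt"
    # Other supported GPUs would otherwise pay for JIT compilation,
    # so a source build with a matching architecture is suggested.
    return "build-from-source"


for cc in ("7.0", "7.5", "8.0", "8.9", "9.0"):
    print(cc, recommend_image(cc))
```

Comparing `(major, minor)` tuples rather than floats avoids surprises with two-digit minor versions.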
Usage
Launching the container provides a JupyterLab instance. From there, you can connect to SedonaDB and enable GPU acceleration using the following configuration:
import sedonadb
ctx = sedonadb.connect()
# Enable the GPU feature
ctx.sql("SET gpu.enable = true")
# Increase the batch size to feed sufficient data to the GPU
ctx.sql("SET datafusion.execution.batch_size = 100000")

## Raster Infrastructure

While we're not ready to announce that SedonaDB supports raster data, SedonaDB contributors dedicated significant time laying the foundation for first-class raster and ND-array data support, drawing the best from [Sedona Spark's Raster SQL](https://sedona.apache.org/latest/api/sql/Raster-Functions/), [PostGIS Raster Support](https://postgis.net/docs/RT_reference.html), [GDAL](https://gdal.org/), and [Zarr](https://zarr.dev/) with vectorized execution and SedonaDB's ground-up spatial support. We look forward to building this feature in earnest with the community over the next few months!

Thank you to [Kontinuation](https://github.com/Kontinuation) and [james-willis](https://github.com/james-willis) for designing and driving this functionality!
@james-willis @Kontinuation @jiayuasu I made a passing effort at this paragraph but I'm happy to put whatever here if any of you have suggestions!
Revised draft — now reads a real, public ERA5 rainfall Zarr pyramid (EPSG:3857, anonymous; the cube is chunked by year and spatially), uses the .rst DataFrame accessor, and shows real output. Verified end-to-end against current main.
N-Dimensional Rasters and Zarr
Geospatial raster data is increasingly a datacube: climate reanalyses, satellite time series, and model outputs all stack extra axes — time, year, band — on top of the spatial grid. In 0.4.0, SedonaDB's raster type goes natively N-dimensional, and the new sedonadb-zarr extension reads Zarr groups (v2 or v3) — local or in cloud object storage — straight into a queryable raster column.
Point SedonaDB at a Zarr datacube and explore its shape without reading a single pixel:
import sedona.db
import sedonadb_zarr
sd = sedona.db.connect()
sd.register(sedonadb_zarr.ZarrExtension())
# A public ERA5 rainfall pyramid (Zarr, anonymous). Reading + inspecting
# dimensions is a metadata-only round-trip — no pixel bytes fetched.
url = "https://weathermapdata.rdrn.me/era5_2015_2020_l5.zarr/2"
spec = sedonadb_zarr.Zarr().with_options({"arrays": ["rain_ok"]})
cube = sd.read(url, format=spec)
cube.select(
    cube.raster.rst.num_dimensions().alias("ndim"),
    cube.raster.rst.dim_names().alias("dims"),
    cube.raster.rst.shape().alias("shape"),
    cube.raster.rst.srid().alias("srid"),
).show(1)
┌───────┬──────────────┬───────────────┬────────┐
│ ndim  ┆ dims         ┆ shape         ┆ srid   │
│ int32 ┆ list         ┆ list          ┆ uint32 │
╞═══════╪══════════════╪═══════════════╪════════╡
│ 3     ┆ [year, y, x] ┆ [1, 128, 128] ┆ 3857   │
└───────┴──────────────┴───────────────┴────────┘
sedonadb-zarr emits one row per Zarr chunk, so the storage layout is the data layout. This cube stores one year per chunk and tiles each year into a 4×4 spatial grid, so it loads as 16 × 6 = 96 rows (cube.count() confirms), each holding a single year of one spatial tile. Inspecting dimensions touches only the group schema: no pixel bytes.
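The row count follows directly from the chunk grid, as this small sketch shows. The full-cube shape `[6, 512, 512]` is an assumption inferred from the numbers above (six years, and a 4×4 grid of 128×128 tiles); it is not read from the Zarr metadata here.

```python
import math

# Assumed shapes, inferred from the figures quoted above (not read from the store).
dim_names = ["year", "y", "x"]
cube_shape = [6, 512, 512]
chunk_shape = [1, 128, 128]

# A regular chunk grid has ceil(size / chunk) chunks along each axis.
chunks_per_axis = [
    math.ceil(size / chunk) for size, chunk in zip(cube_shape, chunk_shape)
]
total_rows = math.prod(chunks_per_axis)

print(dict(zip(dim_names, chunks_per_axis)))
print(total_rows)
```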
RS_Slice collapses a named axis. Slicing the year dimension hands back a 2-D [y, x] rainfall field:
sliced = cube.select(plane=cube.raster.rst.slice("year", 0))
sliced.select(
    dims=sliced.plane.rst.dim_names(),
    shape=sliced.plane.rst.shape(),
).show(1)
┌────────┬────────────┐
│ dims   ┆ shape      │
╞════════╪════════════╡
│ [y, x] ┆ [128, 128] │
└────────┴────────────┘
RS_Slice needs pixels, so SedonaDB resolves each row's Zarr chunk on demand — you never call a loader yourself.
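The dimension bookkeeping behind a named-axis slice can be sketched in plain Python. `slice_axis` below is a hypothetical helper that only tracks names and shapes; it does not touch pixels and is not how RS_Slice is implemented.

```python
def slice_axis(dim_names, shape, axis_name, index):
    """Return the dim names and shape left after fixing one named axis."""
    if axis_name not in dim_names:
        raise ValueError(f"unknown dimension: {axis_name!r}")
    axis = dim_names.index(axis_name)
    if not 0 <= index < shape[axis]:
        raise IndexError(f"index {index} out of range for {axis_name!r}")
    # Dropping the sliced axis leaves the remaining axes in their original order.
    remaining_names = dim_names[:axis] + dim_names[axis + 1:]
    remaining_shape = shape[:axis] + shape[axis + 1:]
    return remaining_names, remaining_shape


names, shape = slice_axis(["year", "y", "x"], [1, 128, 128], "year", 0)
print(names, shape)
```

Addressing the axis by name rather than position means the same call works regardless of where `year` sits in the dimension order.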
Note: pixel-reading operations like RS_Slice fetch a chunk's bytes on demand, and do so eagerly when the operator runs. We're separately making slice and other "crop" operators lazy: a lightweight view over the chunk, so bytes aren't retrieved until their values are consumed (#813).
And because each chunk is a georeferenced row, you can see a Zarr's layout on a map without decoding a pixel. RS_Envelope turns each chunk into its footprint; reproject to lon/lat and the 4×4 chunk grid draws straight onto a map:
from lonboard import viz  # in a notebook
f = sd.funcs
chunks = cube.select(geom=f.st_transform(cube.raster.rst.envelope(), "EPSG:4326"))
# Draw outlines only, so the basemap shows through the chunk grid.
viz(
    chunks,
    polygon_kwargs=dict(
        filled=False, stroked=True,
        get_line_color=[236, 64, 160], line_width_min_pixels=2,
    ),
)
(Figure to attach: the chunk grid drawn as a 4×4 lattice over a world basemap.)
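To make the footprint idea concrete without SedonaDB, here is a rough pure-Python sketch that computes the lon/lat bounding box of each tile in a 4×4 grid. It assumes, purely for illustration, that the grid covers the full Web Mercator square; the real extent of this dataset is not stated above. The inverse spherical Mercator formula is standard, and `chunk_footprints` is a hypothetical helper.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)
# Assumption: the grid spans the whole Web Mercator square.
HALF_EXTENT = math.pi * EARTH_RADIUS_M


def mercator_to_lonlat(x, y):
    """Inverse spherical Web Mercator projection, returning degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


def chunk_footprints(tiles_per_side=4):
    """Yield (col, row, (min_lon, min_lat, max_lon, max_lat)) per tile."""
    tile_size = 2.0 * HALF_EXTENT / tiles_per_side
    for row in range(tiles_per_side):
        for col in range(tiles_per_side):
            min_x = -HALF_EXTENT + col * tile_size
            max_y = HALF_EXTENT - row * tile_size  # row 0 is the northernmost tile
            min_lon, min_lat = mercator_to_lonlat(min_x, max_y - tile_size)
            max_lon, max_lat = mercator_to_lonlat(min_x + tile_size, max_y)
            yield col, row, (min_lon, min_lat, max_lon, max_lat)


footprints = list(chunk_footprints())
print(len(footprints))
print(footprints[0])
```

Because Web Mercator maps x to longitude and y to latitude independently, reprojecting the two opposite corners is enough to get each tile's lon/lat envelope.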
For the full walkthrough — load a cube, inspect its dimensions, slice a plane, map the chunks, and hand a plane to NumPy — see Working with Zarr and NDArray data in SedonaDB.
We're excited about what shipped here — and we're just getting started. There's more user-facing functionality for N-dimensional rasters and Zarr on the way, and we'd love your input on where it goes next. If you're working with datacubes or cloud-native raster data, open an issue and tell us what you're building and what you need.
@paleolimbot — revised this draft above. It now reads a real public ERA5 rainfall Zarr pyramid (EPSG:3857) instead of the placeholder cube, uses the .rst DataFrame accessor, and includes real output. The two pieces it needed both landed in sedona-db — CF/rioxarray CRS (#985) and the zlib codec (#987) — so it runs end-to-end against current main.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[DOCS] my subject
What changes were proposed in this PR?
Added a (draft) post for the forthcoming SedonaDB release candidate
How was this patch tested?
Just docs
Did this PR include necessary documentation updates?