refactor: define codec and data type classes upstream in a subpackage #3875

Open
d-v-b wants to merge 3 commits into zarr-developers:main from d-v-b:refactor/upstream-apis

@d-v-b
Contributor

@d-v-b d-v-b commented Apr 6, 2026

Projects that want to implement their own codecs or data types have to import base classes from zarr-python. This means zarr-python can practically never depend on any externally-defined codecs or data types without creating a circular dependency, which is unacceptable. See #3867.

To remedy this situation, this PR defines our codec and data type ABCs in a separate package called zarr-interfaces. zarr-interfaces is a sub-package in this repo. The interfaces in zarr-interfaces are in versioned namespaces, which makes evolution of these APIs straightforward. Projects that want to implement a zarr-compatible codec or data type should depend on zarr-interfaces instead of depending on zarr-python itself. This will allow zarr-python to optionally depend on externally-defined codecs and data types.
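As a rough illustration of the arrangement described above (not code from this PR), the sketch below collapses three packages into one file. `BytesCodecV1`, `ZlibCodec`, and `roundtrip` are hypothetical names invented for this example; the actual ABCs in zarr-interfaces are likely richer and may look quite different.

```python
import zlib
from abc import ABC, abstractmethod


# --- Stand-in for an interfaces package (hypothetical, not the real API) ---
class BytesCodecV1(ABC):
    """Minimal bytes-to-bytes codec contract, assumed for illustration."""

    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


# --- Stand-in for a third-party package: depends only on the interface ---
class ZlibCodec(BytesCodecV1):
    def __init__(self, level: int = 6) -> None:
        self.level = level

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data, self.level)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


# --- Stand-in for the consumer: also depends only on the interface ---
def roundtrip(codec: BytesCodecV1, payload: bytes) -> bytes:
    """Encode then decode, accepting any object that implements the contract."""
    if not isinstance(codec, BytesCodecV1):
        raise TypeError("codec must implement BytesCodecV1")
    return codec.decode(codec.encode(payload))
```

Neither the implementer nor the consumer imports the other in this sketch; both point at the shared contract, which is what would let the consumer optionally depend on the implementer later.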

I'm opening this as a draft because I'm not sure about quite a few things, and I would appreciate feedback on the basic direction.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the `needs release notes` label Apr 6, 2026
@d-v-b
Contributor Author

d-v-b commented Apr 10, 2026

@zarr-developers/python-core-devs does anyone object to the basic proposal here: to upstream our basic codec + data type APIs? I think the current situation is untenable so I'd like to see it fixed. This PR is one approach, but I'm open to alternatives.

```python
from zarr_interfaces.data_type.v1 import ZDType
```

Interfaces are versioned under a `v1` namespace to support future evolution
Member

I wonder if the versioning will create confusion, because it is another version apart from the zarr package and the zarr data format versions.

Contributor Author

I hope it's not confusing! The goal here is to allow zarr-python to gracefully evolve things like the codec API. Since different codec APIs would not interact, we could define the current ABC-based API under v1, and a newer protocol-based API under v2. I think only codec and data type developers would need to know about this, and I would count on that crowd being able to know what the versions mean.
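To make the v1/v2 idea concrete, here is a minimal sketch (an assumption, not the design in this PR) of an ABC-based contract and a protocol-based contract living side by side. All class, method, and function names below are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


# Hypothetical "v1": nominal typing, implementers must subclass the ABC.
class CodecV1(ABC):
    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


# Hypothetical "v2": structural typing, implementers only need matching methods.
@runtime_checkable
class CodecV2(Protocol):
    def encode_chunk(self, data: bytes) -> bytes: ...

    def decode_chunk(self, data: bytes) -> bytes: ...


class ReverseV1(CodecV1):
    def encode(self, data: bytes) -> bytes:
        return data[::-1]

    def decode(self, data: bytes) -> bytes:
        return data[::-1]


class ReverseV2:
    # No base class: this satisfies CodecV2 purely by shape.
    def encode_chunk(self, data: bytes) -> bytes:
        return data[::-1]

    def decode_chunk(self, data: bytes) -> bytes:
        return data[::-1]


def api_version(codec: object) -> str:
    """Report which (hypothetical) interface version a codec implements."""
    if isinstance(codec, CodecV1):
        return "v1"
    if isinstance(codec, CodecV2):
        return "v2"
    raise TypeError("object implements neither codec interface")
```

Because the two contracts share no method names in this sketch, a consumer can dispatch on the version without the interfaces interacting, and v1 implementers are unaffected when v2 is added.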

@d-v-b
Contributor Author

d-v-b commented Apr 14, 2026

Since there have been no objections here, I am going to move forward with this PR.

@maxrjones
Member

> Since there have been no objections here, I am going to move forward with this PR.

I'm not sure you've really addressed Tom's concerns from #3867 (comment). I've restated them in #3867 (comment).

@d-v-b
Contributor Author

d-v-b commented Apr 14, 2026

@maxrjones the primary goal of this change is to allow us to gracefully evolve our codec API by using a package structure that more accurately models the real dependency relationships. Nobody has objected to that. Being able to easily import externally-defined codecs is just a nice side-effect of this refactoring, but this direction is still valuable even if we define all our codecs internally.

@maxrjones
Member

> gracefully evolve our codec API by using a package structure that more accurately models the real dependency relationships.

These seem orthogonal. Sorry that I need to step back from this discussion, but I also wanted to at least voice skepticism before you continue to invest time. I won't block the approach if you find a different approver, but am not convinced.

@d-v-b
Contributor Author

d-v-b commented Apr 14, 2026

> > gracefully evolve our codec API by using a package structure that more accurately models the real dependency relationships.
>
> These seem orthogonal. Sorry that I need to step back from this discussion, but I also wanted to at least voice skepticism before you continue to invest time. I won't block the approach if you find a different approver, but am not convinced.

They are not orthogonal at all. A circular dependency can look OK until you have to change one or the other of the pair. Then the problems emerge. This is exactly what we experienced with 3.x and numcodecs.

For context, the codec API for zarr-python 2.x was defined in a separate package (numcodecs). zarrs defines the codec API, and many other APIs, in separate packages. zarrita.js defines the ndarray and storage APIs in separate packages. It's actually normal and OK to do this!
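The difference between the two dependency shapes can be checked mechanically. The sketch below uses the standard library's `graphlib` on two toy graphs; `external-codec` is a placeholder name for any externally-defined codec package, and both graphs are simplified assumptions rather than real package metadata.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional, Set


def install_order(deps: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return a valid build order for `deps`, or None if the graph has a cycle.

    `deps` maps each package to the set of packages it depends on.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Codec API lives in zarr-python, and zarr-python wants the external codec.
current = {
    "zarr-python": {"external-codec"},
    "external-codec": {"zarr-python"},
}

# Codec API lives in a separate interfaces package that both depend on.
proposed = {
    "zarr-interfaces": set(),
    "external-codec": {"zarr-interfaces"},
    "zarr-python": {"zarr-interfaces", "external-codec"},
}

current_order = install_order(current)
proposed_order = install_order(proposed)
```

With these toy graphs, `current_order` comes back as `None` because of the cycle, while `proposed_order` is a linear order that starts with the interfaces package.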

@maxrjones
Member

> > > gracefully evolve our codec API by using a package structure that more accurately models the real dependency relationships.
> >
> > These seem orthogonal. Sorry that I need to step back from this discussion, but I also wanted to at least voice skepticism before you continue to invest time. I won't block the approach if you find a different approver, but am not convinced.
>
> They are not orthogonal at all. A circular dependency can look OK until you have to change one or the other of the pair. Then the problems emerge. This is exactly what we experienced with 3.x and numcodecs.
>
> For context, the codec API for zarr-python 2.x was defined in a separate package (numcodecs). zarrs defines the codec API, and many other APIs, in separate packages. zarrita.js defines the ndarray and storage APIs in separate packages. It's actually normal and OK to do this!

Yes, I am familiar with monorepos. I find it to be a matter of preference. I manage co-development in virtualizarr and virtual tiff just fine. Sometimes it's annoying to manage release timing; most times it's nice to have independent development.

I'm willing to follow what others prefer here.

@d-v-b
Contributor Author

d-v-b commented Apr 14, 2026

> I'm willing to follow what others prefer here.

I would also like to hear some more voices in the conversation. I am sensitive to this situation because:

  • I have written an external implementation of a codec that I think should be shipped with zarr-python
  • I want to change the zarr-python codec API (to make it faster)

These two directions are in tension as long as we use the current (ahistorical) arrangement of defining the codec API inside zarr-python. Can someone provide an alternative proposal for how we can evolve our codec API while also depending on externally-defined codecs that depend on zarr-python?

@maxrjones maxrjones marked this pull request as ready for review April 14, 2026 14:40