Skip to content

Improve s3fs / fsspec API parity for the built-in S3FileSystem #722

@laughingman7743

Description

@laughingman7743

Background

PyAthena ships its own fsspec.AbstractFileSystem implementation (pyathena/filesystem/s3.py) instead of depending on s3fs, to avoid the s3fs / aiobotocore dependency and version-conflict problems that motivated it (#344, #465, #476, #272).

The trade-off is that users migrating from s3fs occasionally hit behaviors our implementation doesn't cover yet — most recently the transaction small-file bug in #719. This is an umbrella issue to track closing the remaining gaps so the built-in filesystem becomes a more complete drop-in replacement for s3fs (and a more complete AbstractFileSystem).

Method

Compared S3FileSystem / S3File against s3fs (s3fs/core.py, main) and fsspec's AbstractFileSystem / AbstractBufferedFile. fsspec defaults that already work via our primitives — glob, walk, du, mv, head/tail, read_block, recursive get/put, multi-cat, pipe — are considered covered and out of scope.

Already at parity ✅

  • ls / info / exists / find (maxdepth, withdirs) / touch
  • cat_file (ranged, concurrent via _fetch_range), put_file, get_file
  • cp_file incl. multipart copy, bulk delete (_delete_objects), checksum, sign (presigned URL), created / modified
  • multipart upload + transactions / commit / discard (incl. the small-file fix from pyathena s3 fsspec replacement file system not writing when used with transactions & a context manager #719)
  • requester_pays, unsigned access, version_id on the read path
  • async filesystem (AioS3FileSystem / AioS3File)
  • FileNotFoundError translation for 404 / NoSuchKey / NoSuchBucket

Gaps (proposed checklist)

Directory operations

  • mkdir / makedirs / rmdir — not implemented (we inherit fsspec's base behavior). s3fs creates/removes buckets and handles key prefixes.

Object metadata & attributes

  • metadata / getxattr / setxattr (on the fs and on S3File) — read/write user metadata (x-amz-meta-*)
  • get_tags / put_tags — object tagging
  • chmod / canned-ACL support (acl= on writes)

Multipart upload management

  • list_multipart_uploads / clear_multipart_uploads — discover & abort stale/incomplete MPUs (storage-cost cleanup). Today we only call abort_multipart_upload inside discard().

Versioning

  • version_aware mode: version-aware ls / info / cat, plus object_version_info via list_object_versions. We support ?versionId= / version_id on reads but not version-aware listing.
  • (optional) bucket-versioning helpers (make_bucket_versioned, rm_versioned_bucket_contents)

Error translation

  • Map more botocore ClientErrors to the matching OSError subclasses, like s3fs's translate_boto_error (e.g. 403PermissionError), instead of only 404FileNotFoundError.

File-object helpers

  • S3File.url / S3File.metadata / getxattr / setxattr
  • (perf, optional) direct pipe_file single-PUT instead of the inherited open() + write path
  • (verify) async streaming-read parity with s3fs open_async / S3AsyncStreamedFile

Drop-in / DX

Scope / notes

  • Each area can be its own PR with unit tests (mocked, no AWS) plus a couple of integration tests. Keep large-file integration AWS cost bounded and at the filesystem layer (not multiplied across cursor types), as done in Fix empty object write for small files in fsspec transactions #721.
  • Not all of these are equally important — suggest prioritizing by what migrators from s3fs actually hit. Help wanted; happy to take these incrementally.

Implementation & contribution notes

Where

  • Sync: pyathena/filesystem/s3.pyS3FileSystem (the fs) and S3File (the AbstractBufferedFile).
  • Async: pyathena/filesystem/s3_async.pyAioS3FileSystem, and AioS3File(S3File) which inherits _upload_chunk / commit / discard. A fix on S3File usually covers the async path too; add an async test only where the executor (S3AioExecutor) actually differs.

Approach

  • Mirror how s3fs solves it. Read the corresponding code in s3fs/core.py and fsspec/spec.py (AbstractFileSystem / AbstractBufferedFile) before implementing, and verify each gap against the installed source rather than assumptions. Favor the smallest change that matches s3fs semantics over new bespoke state. (Context: Fix empty object write for small files in fsspec transactions #721 fixed the transaction bug with a one-line _upload_chunk returning not final, mirroring s3fs, after a more complex first attempt was dropped.)

Tests

  • Unit (no AWS), in tests/pyathena/filesystem/test_s3.py::TestS3File: build a file via S3File.__new__(S3File) + a MagicMock(spec=S3FileSystem) fs; use SimpleNamespace for S3 part results; helpers _make_write_file / _make_multipart_write_file already exist. Lock the contract in both directions (regression + anti-simplification).
  • Integration (real AWS): put these in TestS3FileSystem / TestAioS3FileSystem. The fs fixture is just S3FileSystem(connect()), so they run once — they are not multiplied across cursor types (pandas/arrow/polars). Parametrize over a small (one-shot PutObject) + one ~10 MiB (multipart) size; block size is 5 MiB (DEFAULT_BLOCK_SIZE = MULTIPART_UPLOAD_MIN_PART_SIZE), so 10 MiB triggers multipart.
  • Keep large-file integration cases to ~2-3 uploads total to bound AWS cost; skip a large case when an inherited sync test already covers the path.
  • Run make lint first, then the filesystem tests. PR Fix empty object write for small files in fsspec transactions #721 is a worked example of all of the above.

Kickoff prompt (paste into a new session)

Copy this, fill in the target item, and use it to start an implementation session:

Implement one item from the PyAthena s3fs-parity checklist in issue #722.

Target item: <paste one checkbox item, e.g. "get_tags / put_tags — object tagging">

Rules:

  • Read CLAUDE.md first. The built-in filesystem lives in pyathena/filesystem/s3.py (S3FileSystem, S3File) and pyathena/filesystem/s3_async.py (AioS3FileSystem, AioS3File(S3File) which inherits the write methods — a fix on S3File usually covers async too).
  • Mirror how s3fs implements this: read s3fs/core.py and fsspec/spec.py from the installed packages and match their semantics with the smallest change. Verify the current gap against the real source, not assumptions.
  • Unit tests (no AWS) in tests/pyathena/filesystem/test_s3.py::TestS3File: build files via S3File.__new__(S3File) + MagicMock(spec=S3FileSystem) + SimpleNamespace; reuse helpers _make_write_file / _make_multipart_write_file. Lock the contract (regression + anti-simplification).
  • Integration tests in TestS3FileSystem / TestAioS3FileSystem only — these run once and are NOT multiplied across cursor types. Parametrize small (one-shot) + one ~10 MiB (multipart) case; keep large uploads to 2-3 total (block size is 5 MiB).
  • Run make format, then make lint (ruff + mypy) and the filesystem tests before finishing.
  • Branch off master, commit with a Co-Authored-By trailer, open a draft PR using the repo PR template, and link this issue. PR Fix empty object write for small files in fsspec transactions #721 is a worked example.

Confirm the chosen item and your plan before implementing.

Spun out of the discussion in #719.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions