You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PyAthena ships its own fsspec.AbstractFileSystem implementation (pyathena/filesystem/s3.py) instead of depending on s3fs, to avoid the s3fs / aiobotocore dependency and version-conflict problems that motivated it (#344, #465, #476, #272).
The trade-off is that users migrating from s3fs occasionally hit behaviors our implementation doesn't cover yet — most recently the transaction small-file bug in #719. This is an umbrella issue to track closing the remaining gaps so the built-in filesystem becomes a more complete drop-in replacement for s3fs (and a more complete AbstractFileSystem).
Method
Compared S3FileSystem / S3File against s3fs (s3fs/core.py, main) and fsspec's AbstractFileSystem / AbstractBufferedFile. fsspec defaults that already work via our primitives — glob, walk, du, mv, head/tail, read_block, recursive get/put, multi-cat, pipe — are considered covered and out of scope.
Already at parity ✅
ls / info / exists / find (maxdepth, withdirs) / touch
cat_file (ranged, concurrent via _fetch_range), put_file, get_file
requester_pays, unsigned access, version_id on the read path
async filesystem (AioS3FileSystem / AioS3File)
FileNotFoundError translation for 404 / NoSuchKey / NoSuchBucket
Gaps (proposed checklist)
Directory operations
mkdir / makedirs / rmdir — not implemented (we inherit fsspec's base behavior). s3fs creates/removes buckets and handles key prefixes.
Object metadata & attributes
metadata / getxattr / setxattr (on the fs and on S3File) — read/write user metadata (x-amz-meta-*)
get_tags / put_tags — object tagging
chmod / canned-ACL support (acl= on writes)
Multipart upload management
list_multipart_uploads / clear_multipart_uploads — discover & abort stale/incomplete MPUs (storage-cost cleanup). Today we only call abort_multipart_upload inside discard().
Versioning
version_aware mode: version-aware ls / info / cat, plus object_version_info via list_object_versions. We support ?versionId= / version_id on reads but not version-aware listing.
Map more botocore ClientErrors to the matching OSError subclasses, like s3fs's translate_boto_error (e.g. 403 → PermissionError), instead of only 404 → FileNotFoundError.
Each area can be its own PR with unit tests (mocked, no AWS) plus a couple of integration tests. Keep large-file integration AWS cost bounded and at the filesystem layer (not multiplied across cursor types), as done in Fix empty object write for small files in fsspec transactions #721.
Not all of these are equally important — suggest prioritizing by what migrators from s3fs actually hit. Help wanted; happy to take these incrementally.
Implementation & contribution notes
Where
Sync: pyathena/filesystem/s3.py — S3FileSystem (the fs) and S3File (the AbstractBufferedFile).
Async: pyathena/filesystem/s3_async.py — AioS3FileSystem, and AioS3File(S3File) which inherits_upload_chunk / commit / discard. A fix on S3File usually covers the async path too; add an async test only where the executor (S3AioExecutor) actually differs.
Approach
Mirror how s3fs solves it. Read the corresponding code in s3fs/core.py and fsspec/spec.py (AbstractFileSystem / AbstractBufferedFile) before implementing, and verify each gap against the installed source rather than assumptions. Favor the smallest change that matches s3fs semantics over new bespoke state. (Context: Fix empty object write for small files in fsspec transactions #721 fixed the transaction bug with a one-line _upload_chunk returning not final, mirroring s3fs, after a more complex first attempt was dropped.)
Tests
Unit (no AWS), in tests/pyathena/filesystem/test_s3.py::TestS3File: build a file via S3File.__new__(S3File) + a MagicMock(spec=S3FileSystem) fs; use SimpleNamespace for S3 part results; helpers _make_write_file / _make_multipart_write_file already exist. Lock the contract in both directions (regression + anti-simplification).
Integration (real AWS): put these in TestS3FileSystem / TestAioS3FileSystem. The fs fixture is just S3FileSystem(connect()), so they run once — they are not multiplied across cursor types (pandas/arrow/polars). Parametrize over a small (one-shot PutObject) + one ~10 MiB (multipart) size; block size is 5 MiB (DEFAULT_BLOCK_SIZE = MULTIPART_UPLOAD_MIN_PART_SIZE), so 10 MiB triggers multipart.
Keep large-file integration cases to ~2-3 uploads total to bound AWS cost; skip a large case when an inherited sync test already covers the path.
Copy this, fill in the target item, and use it to start an implementation session:
Implement one item from the PyAthena s3fs-parity checklist in issue #722.
Target item:<paste one checkbox item, e.g. "get_tags / put_tags — object tagging">
Rules:
Read CLAUDE.md first. The built-in filesystem lives in pyathena/filesystem/s3.py (S3FileSystem, S3File) and pyathena/filesystem/s3_async.py (AioS3FileSystem, AioS3File(S3File) which inherits the write methods — a fix on S3File usually covers async too).
Mirror how s3fs implements this: read s3fs/core.py and fsspec/spec.py from the installed packages and match their semantics with the smallest change. Verify the current gap against the real source, not assumptions.
Unit tests (no AWS) in tests/pyathena/filesystem/test_s3.py::TestS3File: build files via S3File.__new__(S3File) + MagicMock(spec=S3FileSystem) + SimpleNamespace; reuse helpers _make_write_file / _make_multipart_write_file. Lock the contract (regression + anti-simplification).
Integration tests in TestS3FileSystem / TestAioS3FileSystem only — these run once and are NOT multiplied across cursor types. Parametrize small (one-shot) + one ~10 MiB (multipart) case; keep large uploads to 2-3 total (block size is 5 MiB).
Run make format, then make lint (ruff + mypy) and the filesystem tests before finishing.
Background
PyAthena ships its own
fsspec.AbstractFileSystemimplementation (pyathena/filesystem/s3.py) instead of depending ons3fs, to avoid thes3fs/aiobotocoredependency and version-conflict problems that motivated it (#344, #465, #476, #272).The trade-off is that users migrating from
s3fsoccasionally hit behaviors our implementation doesn't cover yet — most recently the transaction small-file bug in #719. This is an umbrella issue to track closing the remaining gaps so the built-in filesystem becomes a more complete drop-in replacement fors3fs(and a more completeAbstractFileSystem).Method
Compared
S3FileSystem/S3Fileagainsts3fs(s3fs/core.py,main) andfsspec'sAbstractFileSystem/AbstractBufferedFile. fsspec defaults that already work via our primitives —glob,walk,du,mv,head/tail,read_block, recursiveget/put, multi-cat,pipe— are considered covered and out of scope.Already at parity ✅
ls/info/exists/find(maxdepth, withdirs) /touchcat_file(ranged, concurrent via_fetch_range),put_file,get_filecp_fileincl. multipart copy, bulk delete (_delete_objects),checksum,sign(presigned URL),created/modifiedcommit/discard(incl. the small-file fix from pyathena s3 fsspec replacement file system not writing when used with transactions & a context manager #719)requester_pays, unsigned access,version_idon the read pathAioS3FileSystem/AioS3File)FileNotFoundErrortranslation for404/NoSuchKey/NoSuchBucketGaps (proposed checklist)
Directory operations
mkdir/makedirs/rmdir— not implemented (we inherit fsspec's base behavior). s3fs creates/removes buckets and handles key prefixes.Object metadata & attributes
metadata/getxattr/setxattr(on the fs and onS3File) — read/write user metadata (x-amz-meta-*)get_tags/put_tags— object taggingchmod/ canned-ACL support (acl=on writes)Multipart upload management
list_multipart_uploads/clear_multipart_uploads— discover & abort stale/incomplete MPUs (storage-cost cleanup). Today we only callabort_multipart_uploadinsidediscard().Versioning
version_awaremode: version-awarels/info/cat, plusobject_version_infovialist_object_versions. We support?versionId=/version_idon reads but not version-aware listing.make_bucket_versioned,rm_versioned_bucket_contents)Error translation
ClientErrors to the matchingOSErrorsubclasses, like s3fs'stranslate_boto_error(e.g.403→PermissionError), instead of only404→FileNotFoundError.File-object helpers
S3File.url/S3File.metadata/getxattr/setxattrpipe_filesingle-PUT instead of the inheritedopen()+writepathopen_async/S3AsyncStreamedFileDrop-in / DX
s3protocol,fsspec.filesystem("s3")returns PyAthena's impl and s3fs's debug env (S3FS_LOGGING_LEVEL) silently does nothing. Consider an opt-out / notice and/or honoring a debug switch.Scope / notes
s3fsactually hit. Help wanted; happy to take these incrementally.Implementation & contribution notes
Where
pyathena/filesystem/s3.py—S3FileSystem(the fs) andS3File(theAbstractBufferedFile).pyathena/filesystem/s3_async.py—AioS3FileSystem, andAioS3File(S3File)which inherits_upload_chunk/commit/discard. A fix onS3Fileusually covers the async path too; add an async test only where the executor (S3AioExecutor) actually differs.Approach
s3fssolves it. Read the corresponding code ins3fs/core.pyandfsspec/spec.py(AbstractFileSystem/AbstractBufferedFile) before implementing, and verify each gap against the installed source rather than assumptions. Favor the smallest change that matches s3fs semantics over new bespoke state. (Context: Fix empty object write for small files in fsspec transactions #721 fixed the transaction bug with a one-line_upload_chunkreturningnot final, mirroring s3fs, after a more complex first attempt was dropped.)Tests
tests/pyathena/filesystem/test_s3.py::TestS3File: build a file viaS3File.__new__(S3File)+ aMagicMock(spec=S3FileSystem)fs; useSimpleNamespacefor S3 part results; helpers_make_write_file/_make_multipart_write_filealready exist. Lock the contract in both directions (regression + anti-simplification).TestS3FileSystem/TestAioS3FileSystem. Thefsfixture is justS3FileSystem(connect()), so they run once — they are not multiplied across cursor types (pandas/arrow/polars). Parametrize over a small (one-shotPutObject) + one ~10 MiB (multipart) size; block size is 5 MiB (DEFAULT_BLOCK_SIZE=MULTIPART_UPLOAD_MIN_PART_SIZE), so 10 MiB triggers multipart.make lintfirst, then the filesystem tests. PR Fix empty object write for small files in fsspec transactions #721 is a worked example of all of the above.Kickoff prompt (paste into a new session)
Copy this, fill in the target item, and use it to start an implementation session:
Spun out of the discussion in #719.