Improve s3fs / fsspec API parity for the built-in S3FileSystem

## Background

PyAthena ships its own `fsspec.AbstractFileSystem` implementation (`pyathena/filesystem/s3.py`) instead of depending on `s3fs`, to avoid the `s3fs` / `aiobotocore` dependency and version-conflict problems that motivated it (#344, #465, #476, #272).

The trade-off is that users migrating from `s3fs` occasionally hit behaviors our implementation doesn't cover yet — most recently the transaction small-file bug in #719. This is an **umbrella issue** to track closing the remaining gaps so the built-in filesystem becomes a more complete drop-in replacement for `s3fs` (and a more complete `AbstractFileSystem`).

## Method

Compared `S3FileSystem` / `S3File` against `s3fs` (`s3fs/core.py`, `main`) and `fsspec`'s `AbstractFileSystem` / `AbstractBufferedFile`. fsspec defaults that already work via our primitives — `glob`, `walk`, `du`, `mv`, `head`/`tail`, `read_block`, recursive `get`/`put`, multi-`cat`, `pipe` — are considered covered and out of scope.

## Already at parity ✅

- `ls` / `info` / `exists` / `find` (maxdepth, withdirs) / `touch`
- `cat_file` (ranged, concurrent via `_fetch_range`), `put_file`, `get_file`
- `cp_file` incl. multipart copy, bulk delete (`_delete_objects`), `checksum`, `sign` (presigned URL), `created` / `modified`
- multipart upload + transactions / `commit` / `discard` (incl. the small-file fix from #719)
- `requester_pays`, unsigned access, `version_id` on the **read** path
- async filesystem (`AioS3FileSystem` / `AioS3File`)
- `FileNotFoundError` translation for `404` / `NoSuchKey` / `NoSuchBucket`

## Gaps (proposed checklist)

### Directory operations
- [ ] `mkdir` / `makedirs` / `rmdir` — not implemented (we inherit fsspec's base behavior). s3fs creates/removes buckets and handles key prefixes.

### Object metadata & attributes
- [ ] `metadata` / `getxattr` / `setxattr` (on the fs and on `S3File`) — read/write user metadata (`x-amz-meta-*`)
- [ ] `get_tags` / `put_tags` — object tagging
- [ ] `chmod` / canned-ACL support (`acl=` on writes)

### Multipart upload management
- [ ] `list_multipart_uploads` / `clear_multipart_uploads` — discover & abort stale/incomplete MPUs (storage-cost cleanup). Today we only call `abort_multipart_upload` inside `discard()`.

### Versioning
- [ ] `version_aware` mode: version-aware `ls` / `info` / `cat`, plus `object_version_info` via `list_object_versions`. We support `?versionId=` / `version_id` on reads but not version-aware listing.
- [ ] (optional) bucket-versioning helpers (`make_bucket_versioned`, `rm_versioned_bucket_contents`)

### Error translation
- [ ] Map more botocore `ClientError`s to the matching `OSError` subclasses, like s3fs's `translate_boto_error` (e.g. `403` → `PermissionError`), instead of only `404` → `FileNotFoundError`.

### File-object helpers
- [ ] `S3File.url` / `S3File.metadata` / `getxattr` / `setxattr`
- [ ] (perf, optional) direct `pipe_file` single-PUT instead of the inherited `open()` + `write` path
- [ ] (verify) async streaming-read parity with s3fs `open_async` / `S3AsyncStreamedFile`

### Drop-in / DX
- [ ] Registry shadowing (raised in #719): when we register as the `s3` protocol, `fsspec.filesystem("s3")` returns PyAthena's impl and s3fs's debug env (`S3FS_LOGGING_LEVEL`) silently does nothing. Consider an opt-out / notice and/or honoring a debug switch.

## Scope / notes

- Each area can be its own PR with unit tests (mocked, no AWS) plus a couple of integration tests. Keep large-file integration AWS cost bounded and at the **filesystem layer** (not multiplied across cursor types), as done in #721.
- Not all of these are equally important — suggest prioritizing by what migrators from `s3fs` actually hit. Help wanted; happy to take these incrementally.

## Implementation & contribution notes

**Where**

- Sync: `pyathena/filesystem/s3.py` — `S3FileSystem` (the fs) and `S3File` (the `AbstractBufferedFile`).
- Async: `pyathena/filesystem/s3_async.py` — `AioS3FileSystem`, and `AioS3File(S3File)` which **inherits** `_upload_chunk` / `commit` / `discard`. A fix on `S3File` usually covers the async path too; add an async test only where the executor (`S3AioExecutor`) actually differs.

**Approach**

- Mirror how `s3fs` solves it. Read the corresponding code in `s3fs/core.py` and `fsspec/spec.py` (`AbstractFileSystem` / `AbstractBufferedFile`) before implementing, and verify each gap against the installed source rather than assumptions. Favor the smallest change that matches s3fs semantics over new bespoke state. (Context: #721 fixed the transaction bug with a one-line `_upload_chunk` returning `not final`, mirroring s3fs, after a more complex first attempt was dropped.)

**Tests**

- Unit (no AWS), in `tests/pyathena/filesystem/test_s3.py::TestS3File`: build a file via `S3File.__new__(S3File)` + a `MagicMock(spec=S3FileSystem)` fs; use `SimpleNamespace` for S3 part results; helpers `_make_write_file` / `_make_multipart_write_file` already exist. Lock the contract in both directions (regression + anti-simplification).
- Integration (real AWS): put these in `TestS3FileSystem` / `TestAioS3FileSystem`. The `fs` fixture is just `S3FileSystem(connect())`, so they run **once** — they are **not** multiplied across cursor types (pandas/arrow/polars). Parametrize over a small (one-shot `PutObject`) + one ~10 MiB (multipart) size; block size is 5 MiB (`DEFAULT_BLOCK_SIZE` = `MULTIPART_UPLOAD_MIN_PART_SIZE`), so 10 MiB triggers multipart.
- Keep large-file integration cases to ~2-3 uploads total to bound AWS cost; skip a large case when an inherited sync test already covers the path.
- Run `make lint` first, then the filesystem tests. PR #721 is a worked example of all of the above.

## Kickoff prompt (paste into a new session)

Copy this, fill in the target item, and use it to start an implementation session:

> Implement one item from the PyAthena s3fs-parity checklist in issue #722.
>
> **Target item:** _<paste one checkbox item, e.g. "`get_tags` / `put_tags` — object tagging">_
>
> Rules:
> - Read `CLAUDE.md` first. The built-in filesystem lives in `pyathena/filesystem/s3.py` (`S3FileSystem`, `S3File`) and `pyathena/filesystem/s3_async.py` (`AioS3FileSystem`, `AioS3File(S3File)` which inherits the write methods — a fix on `S3File` usually covers async too).
> - Mirror how `s3fs` implements this: read `s3fs/core.py` and `fsspec/spec.py` from the installed packages and match their semantics with the smallest change. Verify the current gap against the real source, not assumptions.
> - Unit tests (no AWS) in `tests/pyathena/filesystem/test_s3.py::TestS3File`: build files via `S3File.__new__(S3File)` + `MagicMock(spec=S3FileSystem)` + `SimpleNamespace`; reuse helpers `_make_write_file` / `_make_multipart_write_file`. Lock the contract (regression + anti-simplification).
> - Integration tests in `TestS3FileSystem` / `TestAioS3FileSystem` only — these run once and are NOT multiplied across cursor types. Parametrize small (one-shot) + one ~10 MiB (multipart) case; keep large uploads to 2-3 total (block size is 5 MiB).
> - Run `make format`, then `make lint` (ruff + mypy) and the filesystem tests before finishing.
> - Branch off `master`, commit with a `Co-Authored-By` trailer, open a **draft** PR using the repo PR template, and link this issue. PR #721 is a worked example.
>
> Confirm the chosen item and your plan before implementing.

_Spun out of the discussion in #719._


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve s3fs / fsspec API parity for the built-in S3FileSystem #722

Background

Method

Already at parity ✅

Gaps (proposed checklist)

Directory operations

Object metadata & attributes

Multipart upload management

Versioning

Error translation

File-object helpers

Drop-in / DX

Scope / notes

Implementation & contribution notes

Kickoff prompt (paste into a new session)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Improve s3fs / fsspec API parity for the built-in S3FileSystem #722

Description

Background

Method

Already at parity ✅

Gaps (proposed checklist)

Directory operations

Object metadata & attributes

Multipart upload management

Versioning

Error translation

File-object helpers

Drop-in / DX

Scope / notes

Implementation & contribution notes

Kickoff prompt (paste into a new session)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions