feat(gooddata-sdk): handle datetime granularity and support bytes limits in arrow fetch #1563

Merged
Martozar merged 3 commits into gooddata:master from Martozar:c.mze-cq-105
Apr 21, 2026

Conversation

@Martozar
Contributor

  • Add support for datetime granularity handling using AttributeConverterStore for fetching data using Arrow.
  • Unify Arrow/AFM API by adding max_bytes parameter to Arrow table fetch.
  • Improve logging

risk: low

@codecov

codecov bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 93.84615% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.71%. Comparing base (c00ca3b) to head (add06e9).
⚠️ Report is 170 commits behind head on master.

Files with missing lines Patch % Lines
...ta-sdk/src/gooddata_sdk/compute/model/execution.py 50.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1563       +/-   ##
===========================================
+ Coverage    0.00%   78.71%   +78.71%     
===========================================
  Files         158      230       +72     
  Lines       11048    15449     +4401     
===========================================
+ Hits            0    12161    +12161     
+ Misses      11048     3288     -7760     

☔ View full report in Codecov by Sentry.

@Martozar Martozar force-pushed the c.mze-cq-105 branch 2 times, most recently from 26b9953 to 660ac1a on April 20, 2026 12:15
Comment on lines +253 to +259
total_ref_cols = [f.name for f in table.schema if f.name.startswith(_COL_TOTAL_REF_PREFIX)]
if total_ref_cols:
    if len(total_ref_cols) > 1:
        logger.warning(
            "Arrow table has %d __total_ref* columns; only %r is used for aggregation names.",
            len(total_ref_cols),
            total_ref_cols[0],
        )
Contributor

I wonder, did this actually happen? Multiple total ref columns would mean something is very hosed on the backend...

Being defensive is fine, no problem here, but if you ever run into this warning then perhaps there's a more serious problem in the xtab impl?

Contributor Author

No, it didn't happen yet; it's just a defensive "what if" warning, so as you say, if it happens in the future we should react to it. The original implementation silently picked the first column with the prefix.

Comment on lines +416 to +420
# Read the full HTTP response body before releasing the connection.
# pyarrow's IPC stream reader stops at the Arrow EOS marker and may leave
# the HTTP chunked-encoding terminator unread, which puts the connection in
# Request-sent state and breaks HTTP keep-alive reuse.
data = response.read()
Contributor


I'm not 100% sure about this. It's worth investigating the implications for memory usage & allocations.

Before: Arrow reads the stream and incrementally builds the table chunk by chunk.
Now, most likely: first read the data for all chunks, then build the table. This may (I'm not 100% certain though) lead to bigger spikes in RSS.

IMHO it's worth investigating / measuring. Could be there is no impact due to some technicalities in the underlying Arrow code; could be that the load now needs ~2x more memory.

Contributor Author

I did some benchmarks; there was a bit of a memory usage increase, but it wasn't that brutal (I can run the tests again to get the exact numbers). In any case, from my other measurements, the Arrow path has lower memory consumption than the original AFM, so it's kind of a trade-off.

However, I did run into the issue with the broken connection, and this was a fix for it.
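For readers who want to repeat such a comparison, a minimal sketch of a peak-allocation probe using only the standard library is below. One caveat: `tracemalloc` only sees allocations made through Python's allocator, so Arrow's C++ buffers are invisible to it; comparing process RSS (e.g. via `resource.getrusage`) would be needed for the full picture.

```python
import tracemalloc

def py_peak_bytes(fn, *args, **kwargs):
    """Return the peak Python-level allocation (in bytes) while fn runs.
    Sketch only: Arrow's native buffers are not tracked by tracemalloc."""
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
        return peak
    finally:
        tracemalloc.stop()

# Usage: compare two loading strategies by passing each as a callable, e.g.
# py_peak_bytes(lambda: bytearray(1024 * 1024)) reports roughly 1 MiB.
```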

Contributor


OK. I wonder though, isn't there a subtler way to address it?

Let's say: try to read the stream as before (passing the response to Arrow) -> the happy path, ideal memory footprint. Then if an error happens (except), do response.read() to consume everything that remains, and then finally close.

Or does Arrow take ownership of the stream and close it 'incorrectly' before reading everything, or something like that, so the exception handler cannot do the cleanup anymore?

@Martozar Martozar merged commit 6d554ce into gooddata:master Apr 21, 2026
13 checks passed