feat(gooddata-sdk): handle datetime granularity and support bytes limits in arrow fetch#1563
Conversation
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##           master    #1563       +/-   ##
===========================================
+ Coverage    0.00%   78.71%   +78.71%
===========================================
  Files         158      230       +72
  Lines       11048    15449     +4401
===========================================
+ Hits            0    12161    +12161
+ Misses      11048     3288     -7760
```
Force-pushed from 26b9953 to 660ac1a.
```python
total_ref_cols = [f.name for f in table.schema if f.name.startswith(_COL_TOTAL_REF_PREFIX)]
if total_ref_cols:
    if len(total_ref_cols) > 1:
        logger.warning(
            "Arrow table has %d __total_ref* columns; only %r is used for aggregation names.",
            len(total_ref_cols),
            total_ref_cols[0],
        )
```
I wonder, did this actually happen? Multiple total ref columns mean something is very hosed on the backend...
Being defensive is fine, no problem here, but if you ever run into this warning, perhaps there's a more serious problem in the xtab impl?
No, it hasn't happened yet; it's just a defensive "what if" warning. As you say, if it happens in the future, we should react to it. The original implementation silently picked the first column with the prefix.
```python
# Read the full HTTP response body before releasing the connection.
# pyarrow's IPC stream reader stops at the Arrow EOS marker and may leave
# the HTTP chunked-encoding terminator unread, which puts the connection in
# Request-sent state and breaks HTTP keep-alive reuse.
data = response.read()
```
I'm not 100% sure about this; it's worth investigating the implications for memory usage & allocations.
Before: Arrow reads the stream and incrementally builds the table chunk by chunk.
Now: most likely, the data for all chunks is read first and only then is the table built. This may (I'm not 100% certain, though) lead to bigger spikes in RSS.
IMHO it's worth investigating / measuring. It could be that there is no impact due to some technicalities in the underlying Arrow code; it could be that the load now needs ~2x more memory.
I did some benchmarks; there was a bit of a memory usage increase, but it wasn't that brutal (I can run the tests again to get the exact numbers). In any case, from my other measurements, the Arrow path has lower memory consumption than the original AFM, so it's kind of a trade-off.
However, I did run into the broken-connection issue, and this was the fix for it.
OK. I wonder, though, isn't there a subtler way to address it?
Let's say: try to read the stream as before (passing the response to Arrow), the happy path with the ideal memory footprint. Then, if an error happens (except), do response.read() to consume everything that remains, and finally close.
Or does Arrow take ownership of the stream and close it 'incorrectly' before reading everything, or something like that, so the exception handler cannot do the cleanup anymore?
- `AttributeConverterStore` for fetching data using Arrow.
- `max_bytes` parameter to Arrow table fetch.
- risk: low