Avoid loading full datasets when generating benchmark dataset details #622

Open
R-Palazzo wants to merge 10 commits into main from issue-609-improve-row-count-dataset-details

Conversation

@R-Palazzo

@R-Palazzo R-Palazzo commented Jun 9, 2026

Collaborator

Resolve #609
86ba43c4v

Tested for the multi-table benchmark upload workflow here

@R-Palazzo R-Palazzo requested review from gsheni and sarahmish June 9, 2026 15:26
@R-Palazzo R-Palazzo self-assigned this Jun 9, 2026
@R-Palazzo R-Palazzo requested a review from a team as a code owner June 9, 2026 15:26
@R-Palazzo R-Palazzo removed the request for review from a team June 9, 2026 15:30
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.83%. Comparing base (30ef01a) to head (698455e).

Files with missing lines     Patch %   Lines
sdgym/dataset_explorer.py    95.23%    2 Missing ⚠️
sdgym/s3.py                  90.90%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #622      +/-   ##
==========================================
+ Coverage   85.81%   85.83%   +0.02%     
==========================================
  Files          40       40              
  Lines        3722     3771      +49     
==========================================
+ Hits         3194     3237      +43     
- Misses        528      534       +6     
Flag          Coverage Δ
integration   44.41% <84.90%> (+0.24%) ⬆️
unit          81.75% <94.33%> (+0.15%) ⬆️


Comment thread tests/integration/test_dataset_explorer.py Outdated
Comment thread tests/integration/test_dataset_explorer.py
@R-Palazzo R-Palazzo requested a review from gsheni June 10, 2026 10:35

@sarahmish sarahmish left a comment

Contributor

Do we also want to calculate the number of cells in the dataset (number of rows × number of columns)?

Comment thread sdgym/dataset_explorer.py Outdated
Comment on lines +240 to +250
count = 0
has_header = False
for row in reader:
    if not row:
        continue

    if not has_header:
        has_header = True
        continue

    count += 1

Contributor

I think this can be optimized

Suggested change
-    count = 0
-    has_header = False
-    for row in reader:
-        if not row:
-            continue
-        if not has_header:
-            has_header = True
-            continue
-        count += 1
+    next(reader)  # skip header
+    count = sum(1 for row in reader if row)
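
A minimal, self-contained sketch of the generator-based count suggested above. The function name `count_csv_rows` and the sample bytes are assumptions for illustration, not the code in this PR. It uses `next(reader, None)` rather than `next(reader)` so that an empty file yields 0 instead of raising `StopIteration`.

```python
import csv
import io


def count_csv_rows(csv_file):
    """Count non-empty data rows in a binary CSV stream, excluding the header."""
    text_file = io.TextIOWrapper(csv_file, encoding='utf-8-sig', newline='')
    try:
        reader = csv.reader(text_file)
        next(reader, None)  # skip header; the default avoids StopIteration on empty input
        return sum(1 for row in reader if row)
    finally:
        text_file.detach()  # leave the underlying binary stream open for the caller


# Hypothetical sample data: a header, two data rows, and one blank line.
sample = io.BytesIO(b'id,text\n1,a\n\n2,b\n')
row_count = count_csv_rows(sample)
empty_count = count_csv_rows(io.BytesIO(b''))
```

One behavioral difference from the loop version is worth checking: the loop skips blank rows before it looks for the header, while this sketch treats the first row as the header even if it is blank.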

Contributor

I think you can also count the number of newlines without a csv reader, which would make it even faster:

import io


def _count_csv_rows(csv_file):
    text_file = io.TextIOWrapper(csv_file, encoding='utf-8-sig', newline='')
    count = sum(1 for line in text_file if line.strip()) - 1  # -1 for the header
    text_file.detach()  # keep the underlying binary stream open
    return max(count, 0)

Collaborator Author

I kept the csv reader; otherwise the tests were failing:
https://github.com/sdv-dev/SDGym/actions/runs/27419559432/job/81041188188

For example, the following was counted as 2 rows instead of 1, because the quoted field contains a newline:

id,text
1,"first line
second line"
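
A small sketch reproducing the mismatch on the example above, using only the standard library. The variable names are illustrative and not taken from the PR.

```python
import csv
import io

data = 'id,text\n1,"first line\nsecond line"\n'

# Line-based count: every non-blank physical line, minus one for the header.
line_count = sum(1 for line in io.StringIO(data, newline='') if line.strip()) - 1

# csv-based count: the quoted field with an embedded newline stays one record.
reader = csv.reader(io.StringIO(data, newline=''))
next(reader, None)  # skip header
record_count = sum(1 for row in reader if row)
```

Here `line_count` is 2 while `record_count` is 1, which is why counting newlines is only safe when no field can contain a quoted line break.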

@R-Palazzo

Collaborator Author

Do we want to calculate the number of cells in the dataset too? number of rows x number of columns?

@sarahmish this is something we could add, maybe in a separate issue. For now, the Dataset_Details table already includes Total_Num_Columns and Total_Num_Rows.
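
If a cell count is added later, one possible definition is sketched below. The helper `count_cells`, the table names, and the shapes are hypothetical and are not part of SDGym or this PR.

```python
def count_cells(table_shapes):
    """Sum rows * columns over a mapping of table name -> (num_rows, num_columns)."""
    return sum(rows * cols for rows, cols in table_shapes.values())


# Hypothetical multi-table dataset.
shapes = {'users': (100, 5), 'sessions': (250, 4)}
total_cells = count_cells(shapes)

# Multiplying the dataset-level totals gives a different number.
total_rows = sum(rows for rows, _ in shapes.values())
total_columns = sum(cols for _, cols in shapes.values())
naive_product = total_rows * total_columns
```

For multi-table datasets the per-table sum (1500 here) differs from the product of the totals (3150 here), so the definition would need to be agreed on before a cell count is added.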

@R-Palazzo R-Palazzo force-pushed the issue-609-improve-row-count-dataset-details branch from c161ff1 to 698455e Compare June 15, 2026 09:43
@R-Palazzo R-Palazzo requested a review from sarahmish June 15, 2026 11:59
