Avoid loading full datasets when generating benchmark dataset details by R-Palazzo · Pull Request #622 · sdv-dev/SDGym

R-Palazzo · 2026-06-09T15:26:22Z

Resolve #609
86ba43c4v

Tested for the multi-table benchmark upload workflow here

codecov · 2026-06-09T15:53:24Z

Codecov Report

❌ Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.83%. Comparing base (30ef01a) to head (698455e).

Files with missing lines	Patch %	Lines
sdgym/dataset_explorer.py	95.23%	2 Missing ⚠️
sdgym/s3.py	90.90%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #622      +/-   ##
==========================================
+ Coverage   85.81%   85.83%   +0.02%     
==========================================
  Files          40       40              
  Lines        3722     3771      +49     
==========================================
+ Hits         3194     3237      +43     
- Misses        528      534       +6

Flag	Coverage Δ
integration	`44.41% <84.90%> (+0.24%)`	⬆️
unit	`81.75% <94.33%> (+0.15%)`	⬆️


sarahmish

Do we want to calculate the number of cells in the dataset too? number of rows x number of columns?

sarahmish · 2026-06-10T16:58:11Z

+        count = 0
+        has_header = False
+        for row in reader:
+            if not row:
+                continue
+
+            if not has_header:
+                has_header = True
+                continue
+
+            count += 1


I think this can be optimized

Suggested change

-        count = 0
-        has_header = False
-        for row in reader:
-            if not row:
-                continue
-
-            if not has_header:
-                has_header = True
-                continue
-
-            count += 1
+        next(reader)  # skip header
+        count = sum(1 for row in reader if row)

I think you can also count the number of newlines without a csv reader, which would make it even faster:

def _count_csv_rows(csv_file):
    text_file = io.TextIOWrapper(csv_file, encoding='utf-8-sig', newline='')
    count = sum(1 for line in text_file if line.strip()) - 1  # -1 for the header
    text_file.detach()
    return max(count, 0)
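
A note on the `detach()` call in the snippet above: an `io.TextIOWrapper` closes the binary stream it wraps when the wrapper itself is closed or garbage collected, so detaching hands the stream back to the caller still open. The following is a minimal sketch of that behaviour, not the code from this PR; the helper name `count_nonblank_lines` and the in-memory sample are assumptions made for illustration.

```python
import io


def count_nonblank_lines(binary_file):
    """Count non-blank lines while leaving binary_file open for the caller.

    Hypothetical helper for illustration; not part of SDGym.
    """
    text_file = io.TextIOWrapper(binary_file, encoding='utf-8-sig', newline='')
    try:
        return sum(1 for line in text_file if line.strip())
    finally:
        # Detach so that the wrapper does not close binary_file when the
        # wrapper is garbage collected.
        text_file.detach()


# Sample bytes with a UTF-8 BOM, a header, two data rows and one blank line.
raw = io.BytesIO(b'\xef\xbb\xbfid,text\n1,a\n\n2,b\n')
lines = count_nonblank_lines(raw)
still_open = not raw.closed
```

Using `try`/`finally` here is a design choice of the sketch: the stream is detached even if decoding fails part way through.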

I kept the csv reader; otherwise the tests were failing:
https://github.com/sdv-dev/SDGym/actions/runs/27419559432/job/81041188188

For example, the following was treated as 2 lines

id,text
1,"first line
second line"
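
The difference between the two approaches can be reproduced with a short sketch. It assumes the example above is a header followed by one record whose quoted `text` field contains a newline; the function names are hypothetical and operate on in-memory strings rather than on the file objects used in the PR.

```python
import csv
import io

# One header line plus ONE record whose quoted field spans two physical lines.
SAMPLE = 'id,text\n1,"first line\nsecond line"\n'


def count_rows_by_lines(text):
    """Count data rows by counting non-blank physical lines (minus the header)."""
    count = sum(1 for line in io.StringIO(text, newline='') if line.strip()) - 1
    return max(count, 0)


def count_rows_with_csv_reader(text):
    """Count data rows with csv.reader, which respects quoted newlines."""
    reader = csv.reader(io.StringIO(text, newline=''))
    # The default argument avoids StopIteration when the input is empty.
    next(reader, None)
    return sum(1 for row in reader if row)


line_count = count_rows_by_lines(SAMPLE)
reader_count = count_rows_with_csv_reader(SAMPLE)
```

Here the line-based count reports two rows while `csv.reader` reports one, which matches the failure described above. The sketch passes a default to `next()` so an empty input yields zero rows instead of raising.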

R-Palazzo · 2026-06-12T13:36:00Z

Do we want to calculate the number of cells in the dataset too? number of rows x number of columns?

@sarahmish this is something we could add, maybe in a separate issue. For now, the Dataset_Details table has the Total_Num_Columns and Total_Num_Rows columns.
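
If a cell count were added later, one point to watch for multi-table datasets is that multiplying the two totals is not the same as summing rows x columns per table. The sketch below illustrates this under stated assumptions: the `tables` mapping, its numbers, and the function name `count_cells` are made up for illustration, and it assumes the totals are sums across tables, which this thread does not confirm.

```python
def count_cells(tables):
    """Sum rows * columns over tables given as {name: (num_rows, num_columns)}.

    Hypothetical helper for illustration; not part of SDGym.
    """
    return sum(num_rows * num_columns for num_rows, num_columns in tables.values())


# Made-up shapes for a two-table dataset.
tables = {'users': (100, 5), 'orders': (1000, 8)}

total_cells = count_cells(tables)  # 100 * 5 + 1000 * 8

# Multiplying the dataset-level totals instead overcounts.
total_rows = sum(num_rows for num_rows, _ in tables.values())
total_columns = sum(num_columns for _, num_columns in tables.values())
naive_cells = total_rows * total_columns  # 1100 * 13
```

For a single-table dataset the two values coincide, so the distinction only matters once more than one table is involved.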

R-Palazzo requested review from gsheni and sarahmish June 9, 2026 15:26

R-Palazzo self-assigned this Jun 9, 2026

R-Palazzo requested a review from a team as a code owner June 9, 2026 15:26

R-Palazzo removed the request for review from a team June 9, 2026 15:30

gsheni reviewed Jun 9, 2026

Comment thread tests/integration/test_dataset_explorer.py Outdated

Comment thread tests/integration/test_dataset_explorer.py

R-Palazzo requested a review from gsheni June 10, 2026 10:35

gsheni approved these changes Jun 10, 2026

sarahmish reviewed Jun 10, 2026

R-Palazzo added 10 commits June 15, 2026 10:41

def 609 (75ce472)
tests (1a95cdc)
update (22758a3)
update for run (4144883)
fix (162d7d5)
unit tests (5c70c4d)
integration tests (8d5896e)
update test (01f03ca)
optimize _count_csv_rows (6291101)
fix integration test (698455e)

R-Palazzo force-pushed the issue-609-improve-row-count-dataset-details branch from c161ff1 to 698455e Compare June 15, 2026 09:43

R-Palazzo requested a review from sarahmish June 15, 2026 11:59

R-Palazzo wants to merge 10 commits into main from issue-609-improve-row-count-dataset-details
