Avoid loading full datasets when generating benchmark dataset details#622
Avoid loading full datasets when generating benchmark dataset details#622R-Palazzo wants to merge 10 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #622 +/- ##
==========================================
+ Coverage 85.81% 85.83% +0.02%
==========================================
Files 40 40
Lines 3722 3771 +49
==========================================
+ Hits 3194 3237 +43
- Misses 528 534 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
sarahmish
left a comment
There was a problem hiding this comment.
Do we want to calculate the number of cells in the dataset too? number of rows x number of columns?
| count = 0 | ||
| has_header = False | ||
| for row in reader: | ||
| if not row: | ||
| continue | ||
|
|
||
| if not has_header: | ||
| has_header = True | ||
| continue | ||
|
|
||
| count += 1 |
There was a problem hiding this comment.
I think this can be optimized
| count = 0 | |
| has_header = False | |
| for row in reader: | |
| if not row: | |
| continue | |
| if not has_header: | |
| has_header = True | |
| continue | |
| count += 1 | |
| next(reader) # skip header | |
| count = sum(1 for row in reader if row) |
There was a problem hiding this comment.
I think you can also count the number of new lines without a csv reader which would make it even faster
def _count_csv_rows(csv_file):
text_file = io.TextIOWrapper(csv_file, encoding='utf-8-sig', newline='')
count = sum(1 for line in text_file if line.strip()) - 1 # -1 for the header
text_file.detach()
return max(count, 0)There was a problem hiding this comment.
I kept the csv reader otherwise the tests were failing:
https://github.com/sdv-dev/SDGym/actions/runs/27419559432/job/81041188188
For example, the following was treated as 2 lines
id,text
1,"first line
second line"
@sarahmish this is something we could add, maybe in a separate issue. For now in the |
c161ff1 to
698455e
Compare
Resolve #609
86ba43c4v
Tested for the multi-table benchmark upload workflow here