feat: add keyword argument support for train_from_stream (#114)#155
feat: add keyword argument support for train_from_stream (#114)#155Yegorov wants to merge 4 commits into
train_from_stream (#114)#155Conversation
Greptile SummaryThis PR extends
Confidence Score: 3/5The multi-category keyword API is a useful addition, but the LSI implementation rebuilds the index after each category stream and the missing-argument case silently does nothing across all classifiers. The LSI build_index regression means any caller who streams multiple categories while auto_rebuild is enabled will trigger redundant and increasingly expensive index rebuilds. The silent no-op for partial arguments removes a safety net that existed for free under the old required-parameter contract. lib/classifier/lsi.rb needs the auto_rebuild guard hoisted outside the loop; lib/classifier/bayes.rb and logistic_regression.rb need an explicit guard for the partial-argument case; lib/classifier/streaming.rb base stub should be updated to match the optional signature. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["train_from_stream(category=nil, io=nil, **categories)"] --> B{category && io?}
B -- Yes --> C["{ category => io }"]
B -- No --> D["categories hash"]
C --> E[".each do |cat, stream|"]
D --> E
E --> F[Validate stream]
F --> G[LineReader + Progress]
G --> H[each_batch]
H --> I[Train + update progress]
I --> N{More batches?}
N -- Yes --> H
N -- No --> O{More categories?}
O -- Yes --> E
O -- No --> P[Done]
subgraph LSI_BUG ["LSI only"]
Q["save auto_rebuild, set false"]
R["ensure: restore + build_index per-category"]
end
E -.-> Q
Q --> G
G -.-> R
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
lib/classifier/lsi.rb:667-688
**`build_index` called once per category instead of once after all streams**
The `begin/ensure` block (which saves/restores `@auto_rebuild` and calls `build_index`) is now inside the `.each` loop. When multiple categories are provided, `build_index` fires after every stream rather than once at the end. This defeats the original optimisation: if `auto_rebuild` was `true` before the call, you'll pay the full re-index cost for each category, and every intermediate rebuild immediately becomes obsolete. The fix is to hoist the `auto_rebuild` save, the `@auto_rebuild = false` assignment, and the `ensure` block outside the loop so the index is rebuilt exactly once after all categories have been loaded.
### Issue 2 of 3
lib/classifier/bayes.rb:332-333
**Silent no-op when only one of `category`/`io` is supplied**
When a caller passes only `category` (e.g. `train_from_stream(:spam, batch_size: 100)`) or only `io`, the ternary resolves to `categories` which is an empty hash, so the method returns without training or raising. Before this PR, the required positional parameters made such a call a Ruby `ArgumentError`. Consider an explicit guard to preserve that contract.
```suggestion
def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories)
raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? == io.nil? && category.nil? && categories.empty?
raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?
(category && io ? { category => io } : categories).each do |category, io|
```
### Issue 3 of 3
lib/classifier/streaming.rb:29-30
**Base method signature still requires `category` and `io` as positional arguments**
The `Streaming` module's stub still declares `category` and `io` as required, but every concrete implementation now makes them optional. A caller coding to the base interface would be incorrectly told these are mandatory. Update the stub and its `@rbs` annotation to match.
```suggestion
# @rbs (?(Symbol | String), ?IO, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
def train_from_stream(category = nil, io = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block)
```
Reviews (1): Last reviewed commit: "feat: add keyword argument support for `..." | Re-trigger Greptile |
|
@cardmagic can you take a look, please? |
There was a problem hiding this comment.
The keyword-arg API is a nice addition and the multi-category path works — issues are around the edges (validation ordering, state mutation, base-module rbs). Test suite passes locally (678 runs, 0 failures).
Note: Greptile's earlier review is stale — it flagged a "build_index per-category" regression and "silent no-op for partial args" that you already fixed in a later commit. Those no longer apply.
See inline comments. Must-fix: #1, #2, #3. Should-fix: #4. The rest are nits/suggestions.
| raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? && io.nil? && categories.empty? | ||
| raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil? | ||
|
|
||
| original_auto_rebuild = @auto_rebuild |
There was a problem hiding this comment.
Must-fix #1 — ensure clobbers @auto_rebuild to nil when validation raises.
The two raise ArgumentError on lines 667-668 execute before original_auto_rebuild = @auto_rebuild on line 670. But the method-level ensure (line 675) always runs. Ruby has already declared original_auto_rebuild as a local in scope, so it's nil in the ensure block — meaning line 676 sets @auto_rebuild = nil, silently breaking subsequent operations.
Reproduced against a real Classifier::LSI instance:
Before: auto_rebuild=true
Caught ArgumentError
After: auto_rebuild=nil ← clobbered
KNN inherits this since it delegates to LSI.
Fix: hoist original_auto_rebuild = @auto_rebuild above the two validation raises, or wrap the body in an explicit begin/ensure that starts after validation.
| # Trains from an IO stream with a single category. | ||
| # @rbs (String | Symbol, IO, batch_size: Integer) { (Streaming::Progress) -> void } -> void | ||
| def stream_train_category(category, io, batch_size:) | ||
| raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line) |
There was a problem hiding this comment.
Must-fix #2 — IO validation happens mid-loop after partial mutation.
When called as train_from_stream(a: good_io, b: Object.new), the first stream's documents are added to the index before the second iteration raises. State is now partially mutated.
Fix: pre-validate all IOs in one pass before the .each loop in train_from_stream, so the call either fully succeeds or raises before touching state.
| # | ||
| # @rbs (Symbol | String, IO, ?batch_size: Integer) { (Progress) -> void } -> void | ||
| def train_from_stream(category, io, batch_size: DEFAULT_BATCH_SIZE, &block) | ||
| # @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void |
There was a problem hiding this comment.
Must-fix #3 — base-module type signature mismatches the implementations.
In RBS, **T is the type of each rest-kwarg value. So:
**IO(all four implementations) = each kwarg value is anIO— matches the runtime:train_from_stream(spam: io1)passes an IO as the value**Hash[Symbol, IO](this base stub) = each kwarg value is itself aHash[Symbol, IO]— would meantrain_from_stream(spam: { foo: io })
The base stub should match:
# @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **IO) { (Progress) -> void } -> voidThis may also let you drop the # @type var categories: untyped escape hatch in knn.rb:273.
| raise StandardError, "No such category: #{category}" unless @categories.key?(category) | ||
| raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line) |
There was a problem hiding this comment.
Should-fix #4 — use ArgumentError, not StandardError.
StandardError is the base of nearly every Ruby exception; rescuers can't distinguish bad-input from genuine errors. The public train_from_stream correctly raises ArgumentError (line 333-334); the private helper should too. Same issue in logistic_regression.rb:432-433 and lsi.rb:728.
The existing assert_raises(StandardError) tests still pass since ArgumentError < StandardError, but should be tightened to assert_raises(ArgumentError).
| progress.completed += batch.size | ||
| progress.current_batch += 1 | ||
| yield progress if block_given? | ||
| (category && io ? { category => io } : categories).each do |(category, io)| |
There was a problem hiding this comment.
Nit — unnecessary destructuring + reused variable names.
Two small things in (category && io ? { category => io } : categories).each do |(category, io)|:
- The block params reuse the method-parameter names
categoryandio; inside the block the outer values are inaccessible. - The
|(...)|destructuring is unnecessary —Hash#eachalready yields two args.
Cleaner:
pairs = category ? { category => io } : categories
pairs.each do |cat, stream|
stream_train_category(cat, stream, batch_size: batch_size, &)
endSame pattern at lsi.rb:672 and logistic_regression.rb:398.
| # @rbs (?(String | Symbol | nil), ?IO?, ?batch_size: Integer, **IO) { (Streaming::Progress) -> void } -> void | ||
| def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &) | ||
| raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? && io.nil? && categories.empty? | ||
| raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil? |
There was a problem hiding this comment.
Nit — XOR on booleans reads as Ruby exotica.
category.nil? ^ io.nil? works but ^ on booleans is rare enough in idiomatic Ruby to make readers pause. Consider:
raise ArgumentError, '...' if category.nil? != io.nil?or
raise ArgumentError, '...' if [category, io].one?(&:nil?)| def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE, &block) | ||
| @lsi.train_from_stream(category, io, batch_size: batch_size, &block) | ||
| # @rbs (?(String | Symbol | nil), ?IO?, ?batch_size: Integer, **IO) { (Streaming::Progress) -> void } -> void | ||
| def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) |
There was a problem hiding this comment.
Nit — inconsistent block forwarding.
KNN uses &block here but never references block by name; the other three classifiers all use anonymous & (Ruby 3.1+). Switch for consistency:
def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &)
@lsi.train_from_stream(category, io, batch_size: batch_size, **categories, &)
synchronize { @dirty = true }
endAlso: when must-fix #3 is fixed, the # @type var categories: untyped on line 273 may no longer be needed.
| raise StandardError, "No such category: #{category}" unless @categories.include?(category) | ||
| raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line) |
There was a problem hiding this comment.
Should-fix #4 (cross-ref) — ArgumentError, not StandardError. See the comment on bayes.rb:389-390.
| assert_raises(StandardError) do | ||
| @classifier.train_from_stream(spam: Object.new) | ||
| end | ||
| end |
There was a problem hiding this comment.
Test coverage gap.
The new tests don't exercise the public ArgumentError validation paths (train_from_stream() with no args, or train_from_stream(:spam) without io). No test would have caught must-fix #1. Suggested additions:
def test_train_from_stream_raises_without_args
assert_raises(ArgumentError) { @classifier.train_from_stream }
end
def test_train_from_stream_raises_with_partial_args
assert_raises(ArgumentError) { @classifier.train_from_stream(:spam) }
endLSI additionally needs a test for auto_rebuild: true combined with the multi-category keyword form (the existing test_train_from_stream_rebuilds_index_when_auto_rebuild only uses two positional calls). And a test for heterogeneous valid+invalid kwargs would lock in must-fix #2.
Closes #114