Skip to content

feat: add keyword argument support for train_from_stream (#114)#155

Open
Yegorov wants to merge 4 commits into
cardmagic:masterfrom
Yegorov:114
Open

feat: add keyword argument support for train_from_stream (#114)#155
Yegorov wants to merge 4 commits into
cardmagic:masterfrom
Yegorov:114

Conversation

@Yegorov
Copy link
Copy Markdown
Contributor

@Yegorov Yegorov commented May 21, 2026

Closes #114

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR extends train_from_stream across all four classifiers (Bayes, KNN, LogisticRegression, LSI) to accept multiple categories via keyword arguments (e.g. train_from_stream(spam: io1, ham: io2)), alongside the existing positional form. New tests cover the multi-category path and invalid-IO validation.

  • The LSI implementation moves the begin/ensure block inside the per-category loop, causing build_index to fire once per category rather than once after all streams have been loaded.
  • Making category and io optional without an explicit guard means callers who supply only one of the two arguments silently get a no-op instead of an error.
  • The Streaming base module stub was not updated to match the optional-positional signature now used by every implementation.

Confidence Score: 3/5

The multi-category keyword API is a useful addition, but the LSI implementation rebuilds the index after each category stream and the missing-argument case silently does nothing across all classifiers.

The LSI build_index regression means any caller who streams multiple categories while auto_rebuild is enabled will trigger redundant and increasingly expensive index rebuilds. The silent no-op for partial arguments removes a safety net that existed for free under the old required-parameter contract.

lib/classifier/lsi.rb needs the auto_rebuild guard hoisted outside the loop; lib/classifier/bayes.rb and logistic_regression.rb need an explicit guard for the partial-argument case; lib/classifier/streaming.rb base stub should be updated to match the optional signature.

Important Files Changed

Filename Overview
lib/classifier/lsi.rb Moving the begin/ensure block inside the category loop causes build_index to fire after each individual stream instead of once after all categories, breaking the original optimisation.
lib/classifier/bayes.rb Adds multi-category keyword support. Missing guard means partial argument combinations silently do nothing instead of raising.
lib/classifier/logistic_regression.rb Same pattern as Bayes; same silent no-op concern applies. synchronize and dirty handling looks structurally correct inside the loop.
lib/classifier/knn.rb Thin delegation to lsi.train_from_stream; correctly threads **categories through.
lib/classifier/streaming.rb Base stub not updated to match the optional-positional signature adopted by all implementations; @RBS annotation is also inconsistent.
test/lsi/streaming_test.rb Adds multi-category and invalid-IO tests; does not cover the auto_rebuild regression.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["train_from_stream(category=nil, io=nil, **categories)"] --> B{category && io?}
    B -- Yes --> C["{ category => io }"]
    B -- No --> D["categories hash"]
    C --> E[".each do |cat, stream|"]
    D --> E
    E --> F[Validate stream]
    F --> G[LineReader + Progress]
    G --> H[each_batch]
    H --> I[Train + update progress]
    I --> N{More batches?}
    N -- Yes --> H
    N -- No --> O{More categories?}
    O -- Yes --> E
    O -- No --> P[Done]
    subgraph LSI_BUG ["LSI only"]
        Q["save auto_rebuild, set false"]
        R["ensure: restore + build_index per-category"]
    end
    E -.-> Q
    Q --> G
    G -.-> R
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
lib/classifier/lsi.rb:667-688
**`build_index` called once per category instead of once after all streams**

The `begin/ensure` block (which saves/restores `@auto_rebuild` and calls `build_index`) is now inside the `.each` loop. When multiple categories are provided, `build_index` fires after every stream rather than once at the end. This defeats the original optimisation: if `auto_rebuild` was `true` before the call, you'll pay the full re-index cost for each category, and every intermediate rebuild immediately becomes obsolete. The fix is to hoist the `auto_rebuild` save, the `@auto_rebuild = false` assignment, and the `ensure` block outside the loop so the index is rebuilt exactly once after all categories have been loaded.

### Issue 2 of 3
lib/classifier/bayes.rb:332-333
**Silent no-op when only one of `category`/`io` is supplied**

When a caller passes only `category` (e.g. `train_from_stream(:spam, batch_size: 100)`) or only `io`, the ternary resolves to `categories` which is an empty hash, so the method returns without training or raising. Before this PR, the required positional parameters made such a call a Ruby `ArgumentError`. Consider an explicit guard to preserve that contract.

```suggestion
    def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories)
      raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? == io.nil? && category.nil? && categories.empty?
      raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?

      (category && io ? { category => io } : categories).each do |category, io|
```

### Issue 3 of 3
lib/classifier/streaming.rb:29-30
**Base method signature still requires `category` and `io` as positional arguments**

The `Streaming` module's stub still declares `category` and `io` as required, but every concrete implementation now makes them optional. A caller coding to the base interface would be incorrectly told these are mandatory. Update the stub and its `@rbs` annotation to match.

```suggestion
    # @rbs (?(Symbol | String), ?IO, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
    def train_from_stream(category = nil, io = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block)
```

Reviews (1): Last reviewed commit: "feat: add keyword argument support for `..." | Re-trigger Greptile

Comment thread lib/classifier/lsi.rb Outdated
Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/streaming.rb Outdated
@Yegorov
Copy link
Copy Markdown
Contributor Author

Yegorov commented May 21, 2026

@cardmagic can you take a look, please?

Copy link
Copy Markdown
Owner

@cardmagic cardmagic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keyword-arg API is a nice addition and the multi-category path works — issues are around the edges (validation ordering, state mutation, base-module rbs). Test suite passes locally (678 runs, 0 failures).

Note: Greptile's earlier review is stale — it flagged a "build_index per-category" regression and "silent no-op for partial args" that you already fixed in a later commit. Those no longer apply.

See inline comments. Must-fix: #1, #2, #3. Should-fix: #4. The rest are nits/suggestions.

Comment thread lib/classifier/lsi.rb
Comment on lines +667 to 670
raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? && io.nil? && categories.empty?
raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?

original_auto_rebuild = @auto_rebuild
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must-fix #1ensure clobbers @auto_rebuild to nil when validation raises.

The two raise ArgumentError on lines 667-668 execute before original_auto_rebuild = @auto_rebuild on line 670. But the method-level ensure (line 675) always runs. Ruby has already declared original_auto_rebuild as a local in scope, so it's nil in the ensure block — meaning line 676 sets @auto_rebuild = nil, silently breaking subsequent operations.

Reproduced against a real Classifier::LSI instance:

Before: auto_rebuild=true
Caught ArgumentError
After:  auto_rebuild=nil   ← clobbered

KNN inherits this since it delegates to LSI.

Fix: hoist original_auto_rebuild = @auto_rebuild above the two validation raises, or wrap the body in an explicit begin/ensure that starts after validation.

Comment thread lib/classifier/lsi.rb
# Trains from an IO stream with a single category.
# @rbs (String | Symbol, IO, batch_size: Integer) { (Streaming::Progress) -> void } -> void
def stream_train_category(category, io, batch_size:)
raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must-fix #2 — IO validation happens mid-loop after partial mutation.

When called as train_from_stream(a: good_io, b: Object.new), the first stream's documents are added to the index before the second iteration raises. State is now partially mutated.

Fix: pre-validate all IOs in one pass before the .each loop in train_from_stream, so the call either fully succeeds or raises before touching state.

#
# @rbs (Symbol | String, IO, ?batch_size: Integer) { (Progress) -> void } -> void
def train_from_stream(category, io, batch_size: DEFAULT_BATCH_SIZE, &block)
# @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
Copy link
Copy Markdown
Owner

@cardmagic cardmagic May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must-fix #3 — base-module type signature mismatches the implementations.

In RBS, **T is the type of each rest-kwarg value. So:

  • **IO (all four implementations) = each kwarg value is an IO — matches the runtime: train_from_stream(spam: io1) passes an IO as the value
  • **Hash[Symbol, IO] (this base stub) = each kwarg value is itself a Hash[Symbol, IO] — would mean train_from_stream(spam: { foo: io })

The base stub should match:

# @rbs (?(Symbol | String | nil), ?IO?, ?batch_size: Integer, **IO) { (Progress) -> void } -> void

This may also let you drop the # @type var categories: untyped escape hatch in knn.rb:273.

Comment thread lib/classifier/bayes.rb
Comment on lines +389 to +390
raise StandardError, "No such category: #{category}" unless @categories.key?(category)
raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix #4 — use ArgumentError, not StandardError.

StandardError is the base of nearly every Ruby exception; rescuers can't distinguish bad-input from genuine errors. The public train_from_stream correctly raises ArgumentError (line 333-334); the private helper should too. Same issue in logistic_regression.rb:432-433 and lsi.rb:728.

The existing assert_raises(StandardError) tests still pass since ArgumentError < StandardError, but should be tightened to assert_raises(ArgumentError).

Comment thread lib/classifier/bayes.rb
progress.completed += batch.size
progress.current_batch += 1
yield progress if block_given?
(category && io ? { category => io } : categories).each do |(category, io)|
Copy link
Copy Markdown
Owner

@cardmagic cardmagic May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — unnecessary destructuring + reused variable names.

Two small things in (category && io ? { category => io } : categories).each do |(category, io)|:

  1. The block params reuse the method-parameter names category and io; inside the block the outer values are inaccessible.
  2. The |(...)| destructuring is unnecessary — Hash#each already yields two args.

Cleaner:

pairs = category ? { category => io } : categories
pairs.each do |cat, stream|
  stream_train_category(cat, stream, batch_size: batch_size, &)
end

Same pattern at lsi.rb:672 and logistic_regression.rb:398.

Comment thread lib/classifier/bayes.rb
# @rbs (?(String | Symbol | nil), ?IO?, ?batch_size: Integer, **IO) { (Streaming::Progress) -> void } -> void
def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &)
raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? && io.nil? && categories.empty?
raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — XOR on booleans reads as Ruby exotica.

category.nil? ^ io.nil? works but ^ on booleans is rare enough in idiomatic Ruby to make readers pause. Consider:

raise ArgumentError, '...' if category.nil? != io.nil?

or

raise ArgumentError, '...' if [category, io].one?(&:nil?)

Comment thread lib/classifier/knn.rb
def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE, &block)
@lsi.train_from_stream(category, io, batch_size: batch_size, &block)
# @rbs (?(String | Symbol | nil), ?IO?, ?batch_size: Integer, **IO) { (Streaming::Progress) -> void } -> void
def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — inconsistent block forwarding.

KNN uses &block here but never references block by name; the other three classifiers all use anonymous & (Ruby 3.1+). Switch for consistency:

def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &)
  @lsi.train_from_stream(category, io, batch_size: batch_size, **categories, &)
  synchronize { @dirty = true }
end

Also: when must-fix #3 is fixed, the # @type var categories: untyped on line 273 may no longer be needed.

Comment on lines +432 to +433
raise StandardError, "No such category: #{category}" unless @categories.include?(category)
raise StandardError, 'Stream must respond to #each_line' unless io.respond_to?(:each_line)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix #4 (cross-ref) — ArgumentError, not StandardError. See the comment on bayes.rb:389-390.

assert_raises(StandardError) do
@classifier.train_from_stream(spam: Object.new)
end
end
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage gap.

The new tests don't exercise the public ArgumentError validation paths (train_from_stream() with no args, or train_from_stream(:spam) without io). No test would have caught must-fix #1. Suggested additions:

def test_train_from_stream_raises_without_args
  assert_raises(ArgumentError) { @classifier.train_from_stream }
end

def test_train_from_stream_raises_with_partial_args
  assert_raises(ArgumentError) { @classifier.train_from_stream(:spam) }
end

LSI additionally needs a test for auto_rebuild: true combined with the multi-category keyword form (the existing test_train_from_stream_rebuilds_index_when_auto_rebuild only uses two positional calls). And a test for heterogeneous valid+invalid kwargs would lock in must-fix #2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add keyword argument support for train_from_stream

2 participants