Feature request: sample weights support in RandomForest (and other tree-based models) #356

@nicksrandall

Feature Request

Support for sample_weights: Vec<f64> in RandomForestRegressor::fit() (and ideally RandomForestClassifier and the underlying DecisionTree models as well).

Use Case

I'm training a RandomForest on time-series data where recent observations should be weighted more heavily than older ones (exponential decay: weight = 0.9^months_ago). This is a common pattern in scikit-learn:

model.fit(X, y, sample_weight=weights)

Without sample weights, there's no way to express "this training example matters more than that one" — which is important for recency weighting, class imbalance correction, and importance sampling.
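To make the recency-weighting case concrete, here is a minimal sketch of how I compute the weights on my side. The helper name `recency_weights` is my own and not part of smartcore; the 0.9 decay factor is just the value from my use case.

```rust
/// Exponential-decay recency weights: weight = decay^months_ago.
/// Hypothetical helper for illustration, not a smartcore API.
fn recency_weights(months_ago: &[u32], decay: f64) -> Vec<f64> {
    months_ago
        .iter()
        // Small month counts fit comfortably in i32 for powi.
        .map(|&m| decay.powi(m as i32))
        .collect()
}
```

The resulting vector has one entry per training row and is what I would like to pass alongside `x` and `y`.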

Current State

Looking at the source code, the internal plumbing is close to supporting this:

  • BaseForestRegressor::sample_with_replacement() does uniform bootstrap sampling — this could be extended to weighted sampling
  • BaseTreeRegressor::fit_weak_learner() already accepts samples: Vec<usize> (bootstrap counts) and uses them as integer multipliers in split statistics:
    sum += *sample_i as f64 * y_m.get(i).to_f64().unwrap();
  • Generalizing samples from Vec<usize> (integer counts) to Vec<f64> (continuous weights) in the tree splitter would enable this
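As a rough sketch of what the generalization in the last bullet could look like, the function below computes a weighted mean and (population) variance from `f64` weights. It is standalone code with a hypothetical name, not the actual smartcore splitter; the point is only that integer bootstrap counts cast to `f64` give the same statistics as before, while fractional weights also work.

```rust
/// Weighted mean and population variance of `y` under per-sample weights.
/// Hypothetical helper; returns None on length mismatch or zero total weight.
fn weighted_mean_and_variance(y: &[f64], weights: &[f64]) -> Option<(f64, f64)> {
    if y.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return None;
    }
    // mean = sum(w_i * y_i) / sum(w_i)
    let mean = y.iter().zip(weights).map(|(yi, wi)| wi * yi).sum::<f64>() / total;
    // variance = sum(w_i * (y_i - mean)^2) / sum(w_i)
    let variance = y
        .iter()
        .zip(weights)
        .map(|(yi, wi)| wi * (yi - mean).powi(2))
        .sum::<f64>()
        / total;
    Some((mean, variance))
}
```

With weights `[3.0, 1.0]` this matches what integer counts of 3 and 1 would produce, so the existing bootstrap path could presumably be expressed as a special case.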

Proposed API

Option A — Add to parameters struct:

RandomForestRegressorParameters {
    // ... existing fields ...
    sample_weights: Option<Vec<f64>>,
}

Option B — Extend the fit signature (breaking change):

pub fn fit(x: &X, y: &Y, parameters: P, sample_weights: Option<&[f64]>) -> Result<Self, Failed>

Option A leaves the existing fit signature unchanged, so it is backwards-compatible and probably preferable.
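For illustration, Option A might look roughly like the sketch below. `ForestParams`, `with_sample_weights`, and `validate_weights` are hypothetical stand-ins I made up for this issue, not the real smartcore types; the idea is that `None` keeps today's behaviour and the weights are validated against the number of rows at fit time.

```rust
/// Hypothetical stand-in for the real parameters struct.
#[derive(Debug, Clone, Default)]
struct ForestParams {
    n_trees: usize,
    /// None means "all samples weighted equally" (current behaviour).
    sample_weights: Option<Vec<f64>>,
}

impl ForestParams {
    /// Builder-style setter, assumed here for illustration.
    fn with_sample_weights(mut self, weights: Vec<f64>) -> Self {
        self.sample_weights = Some(weights);
        self
    }
}

/// Checks that weights, if present, match the row count and are usable.
fn validate_weights(params: &ForestParams, n_samples: usize) -> Result<(), String> {
    match &params.sample_weights {
        None => Ok(()),
        Some(w) if w.len() != n_samples => Err(format!(
            "expected {} sample weights, got {}",
            n_samples,
            w.len()
        )),
        Some(w) if w.iter().any(|v| !v.is_finite() || *v < 0.0) => {
            Err("sample weights must be finite and non-negative".to_string())
        }
        Some(w) if w.iter().sum::<f64>() <= 0.0 => {
            Err("sample weights must not sum to zero".to_string())
        }
        Some(_) => Ok(()),
    }
}
```

Existing callers that never set the field would go through the `None` branch and see no change.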

Scope

Two pieces:

  1. Weighted bootstrap sampling in BaseForestRegressor — sample with probability proportional to weights instead of uniformly
  2. Weighted split statistics in BaseTreeRegressor — use float weights instead of integer counts when computing mean/variance for split criteria
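A minimal sketch of the first piece, assuming the bootstrap keeps returning per-row counts like the existing `samples: Vec<usize>`. It draws indices with probability proportional to the weights via cumulative sums and binary search. The `XorShift64` generator is only there to keep the example dependency-free; a real implementation would presumably use whatever RNG the crate already relies on, and `weighted_bootstrap` is a hypothetical name.

```rust
/// Tiny xorshift64 generator, used only so the sketch has no dependencies.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    /// Next value in [0, 1), built from the top 53 bits of the state.
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draws `n_draws` indices with replacement, with probability proportional
/// to `weights`, and returns how many times each index was drawn.
fn weighted_bootstrap(weights: &[f64], n_draws: usize, rng: &mut XorShift64) -> Vec<usize> {
    // Running totals: cumulative[i] = weights[0] + ... + weights[i].
    let mut cumulative = Vec::with_capacity(weights.len());
    let mut total = 0.0;
    for &w in weights {
        total += w;
        cumulative.push(total);
    }
    let mut counts = vec![0usize; weights.len()];
    if total <= 0.0 {
        return counts; // nothing to sample from
    }
    for _ in 0..n_draws {
        let u = rng.next_f64() * total;
        // First index whose cumulative weight exceeds u; the clamp guards
        // against floating-point rounding at the upper edge.
        let i = cumulative
            .partition_point(|&c| c <= u)
            .min(weights.len() - 1);
        counts[i] += 1;
    }
    counts
}
```

The returned counts have the same shape as the current uniform bootstrap output, so in principle the tree code downstream would not need to know whether the draw was weighted.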

scikit-learn Reference

For reference, scikit-learn's implementation:

This is one of the most commonly used features in scikit-learn's RandomForest and would make smartcore a much more viable alternative for real-world ML pipelines.

Thank you for maintaining this crate — the WASM-first posture is exactly what drew me to it!
