Add Q&A Entry Regarding Feature Importance Computation in Khiops Sklearn Estimators #575

@popescu-v

Description

This Q&A entry is a follow-up to the mitigation of #480. It should:

  • explain why we cannot use an attribute like feature_importances_ on Khiops Sklearn estimators (see Improve Feature Importance Support in Sklearn Khiops Estimators #480 (comment)):

    • Khiops uses only part of the features in the input dataset (those kept by feature selection in the preprocessing phase), plus constructed features (feature pairs, trees, multi-table features, or features derived via rules). Importances, computed as Shapley values averaged over the training dataset, are therefore available for these used features, which only partially overlap with the input features.
    • as a result, it is impossible, in practice and in general, to provide a feature_importances_ estimator attribute that meets the expectations of Scikit-learn, i.e. one that contains the importances of all the input features and only those.
    • consequently, providing such an attribute would only make sense in very particular cases:
      • training on single-table (monotable) datasets only, which precludes the multi-table feature construction rules;
      • forbidding the construction of variable pairs, trees, and text features; while possible, this would give up much of the strength of Khiops models and could yield subpar predictors (relative to what is achievable);
    • as this would be very limiting and rather confusing, it seems better not to provide the feature_importances_ attribute directly, and to rely on the model_report_ attribute instead to retrieve the importances of the variables of interest.
  • show how the model_report_ attribute of KhiopsPredictor can be used to determine the importances of evaluated and selected features, with the caveat that this approach is relevant only in the few limiting cases stated above;

  • explain and show how feature importance can be given a consistent meaning in the Khiops context by:

    • using the Core API: call train_recoder to "flatten" a multi-table dataset, then train_predictor to build an SNB predictor, then mark as "unused" in the encoder model the variables that are "unused" in the predictor model.
    • feeding the flattened dataset representation (produced by the encoder, see the point above) to a custom subclass of KhiopsPredictor that provides the feature_importances_ and feature_names_in_ attributes as explained above; the relevant features are then the input features, and their importances are all non-zero because the input contains only features that are "used" in the predictor.
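
To make the overlap problem and its flattened-input resolution concrete, here is a minimal Python sketch. It rests on assumptions: the accessor path `modeling_report.get_snb_predictor().selected_variables`, with `name` and `importance` on each selected variable, is what we believe `model_report_` exposes and should be checked against the khiops-python documentation for the installed version. The helper names, the stand-in report object, the feature names, and the numbers are all hypothetical and only illustrate the shape of the data.

```python
from types import SimpleNamespace


def used_feature_importances(model_report):
    """Map each variable selected by the SNB predictor to its importance.

    Assumption: the report object exposes
    modeling_report.get_snb_predictor().selected_variables, and each
    selected variable has `name` and `importance` attributes.
    """
    predictor = model_report.modeling_report.get_snb_predictor()
    return {var.name: var.importance for var in predictor.selected_variables}


def align_with_inputs(input_names, used_importances):
    """Split used-feature importances into input-aligned and constructed parts.

    Input features that the predictor does not use get 0.0 (a convention of
    this sketch); used features that are not input columns are returned
    separately, since a scikit-learn style array has no slot for them.
    """
    aligned = [used_importances.get(name, 0.0) for name in input_names]
    inputs = set(input_names)
    constructed = {
        name: value
        for name, value in used_importances.items()
        if name not in inputs
    }
    return aligned, constructed


# Hypothetical stand-in for estimator.model_report_ (made-up values).
fake_report = SimpleNamespace(
    modeling_report=SimpleNamespace(
        get_snb_predictor=lambda: SimpleNamespace(
            selected_variables=[
                SimpleNamespace(name="age", importance=0.41),
                SimpleNamespace(name="Mean(Orders.amount)", importance=0.35),
                SimpleNamespace(name="Tree_1", importance=0.12),
            ]
        )
    )
)

input_names = ["age", "income", "city"]
used = used_feature_importances(fake_report)
aligned, constructed = align_with_inputs(input_names, used)
print(aligned)      # importances in input-column order
print(constructed)  # used features with no matching input column
```

In this made-up case two of the three used features are constructed, so `aligned` drops most of the model's importance mass. When the input is the flattened representation restricted to "used" variables, `constructed` is empty and every entry of `aligned` is non-zero, which is the situation in which a feature_importances_ attribute has a consistent meaning.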

    Labels

    Priority/1-Medium (To do after P0); Status/StandBy (The issue is on stand-by, usually blocked by external dependencies)
