The Q&A should be a follow-up issue of the #480 mitigation. It should:

- explain why we cannot use an attribute like `feature_importances_` on Khiops Sklearn estimators (see Improve Feature Importance Support in Sklearn Khiops Estimators #480 (comment)):
  - as Khiops uses only part of the features of the input dataset (those kept by the feature selection of the preprocessing phase), plus the constructed features (feature pairs, trees, multi-table features or features derived via rules), importances (averaged Shapley values over the training dataset) are computed only for these used features, which overlap only partially with the input features;
  - as a result, it is impossible, in practice and in general, to provide a `feature_importances_` estimator attribute that abides by the expectations of scikit-learn, i.e. one that contains the importances of all the input features, and only those;
  - consequently, providing such an attribute would only make sense in very particular cases:
    - using only monotable training datasets (which would preclude the application of multi-table-specific feature construction rules);
    - forbidding the construction of variable pairs, trees, and text features; while this is possible, it would forgo much of the strength of Khiops models, resulting in potentially subpar predictors (with respect to the achievable potential);
  - as this would be very limiting and rather confusing, it seems better not to provide the `feature_importances_` attribute directly, and to rely on the `model_report_` attribute instead to retrieve the importances of the variables of interest;
- show how the `model_report_` KhiopsPredictor attribute can be used to determine the importances of evaluated and selected features, with the caveat that such an approach is only relevant in a few limit cases (as stated above); see the first sketch after this list;
- explain and show how feature importance can be given a consistent meaning in the Khiops context by:
  - using the Core API: `train_recoder` to "flatten" a multi-table dataset, then `train_predictor` to build an SNB predictor, and finally setting as "unused" in the recoder model the variables that are "unused" in the predictor model (see the second sketch below);
  - using the flattened dataset representation (produced by the recoder, see the point above) as input to a custom subclass of KhiopsPredictor that provides the `feature_importances_` and `feature_names_in_` attributes as explained above; the relevant features are then exactly the input features, and their importances are all non-zero because the input only contains features "used" by the predictor (see the third sketch below).
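To illustrate the `model_report_` point, a minimal sketch of reading importances from a fitted classifier is given below. It assumes a recent khiops-python; the report accessors (`modeling_report`, `get_predictor`, `selected_variables`, `importance`) are assumptions to be checked against the installed version, not a confirmed API.

```python
from khiops.sklearn import KhiopsClassifier
import pandas as pd

# Toy monotable dataset (illustrative values only)
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 44, 36],
    "income": [20, 35, 60, 80, 75, 18, 55, 40],
})
y = pd.Series(["no", "no", "yes", "yes", "yes", "no", "yes", "no"])

# Disable tree construction so that the only candidate features are the input columns
clf = KhiopsClassifier(n_trees=0)
clf.fit(X, y)

# model_report_ holds the parsed Khiops analysis report; the accessor names below
# are assumptions (see the lead-in) and may differ between khiops-python versions.
predictor = clf.model_report_.modeling_report.get_predictor("Selective Naive Bayes")
importances = {v.name: v.importance for v in predictor.selected_variables}

# Only the variables actually evaluated/selected by the SNB appear here, which is
# precisely why this cannot be mapped 1-to-1 onto a feature_importances_ array.
print(importances)
```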
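For the Core API route (train a recoder to flatten the multi-table dataset, train an SNB predictor, then align the recoder on the variables used by the predictor), a sketch under explicit assumptions follows: the file and dictionary names are illustrative placeholders, and the positional results directory argument, the returned paths, the `additional_data_tables` keys and the `R_` recoding dictionary prefix are assumptions to verify against the `khiops.core` documentation of the installed version.

```python
from khiops import core as kh

# Illustrative placeholders (assumed multi-table schema, not a real dataset layout)
DICTIONARY_FILE = "Accidents.kdic"
ROOT_DICTIONARY = "Accident"
DATA_TABLE = "Accidents.txt"
SECONDARY_TABLES = {"Accident`Vehicles": "Vehicles.txt"}
TARGET = "Gravity"

# 1) "Flatten" the multi-table dataset: train a recoder whose output dictionary
#    recodes the multi-table representation into a single table.
#    (Return values assumed to be the report path and the recoding dictionary path.)
recoder_report, recoding_dictionary = kh.train_recoder(
    DICTIONARY_FILE, ROOT_DICTIONARY, DATA_TABLE, TARGET, "recoder_output",
    additional_data_tables=SECONDARY_TABLES,
)

# 2) Train the SNB predictor on the same dataset.
predictor_report, predictor_dictionary = kh.train_predictor(
    DICTIONARY_FILE, ROOT_DICTIONARY, DATA_TABLE, TARGET, "predictor_output",
    additional_data_tables=SECONDARY_TABLES,
)

# 3) Set as "unused" in the recoder model the variables that are "unused" in the
#    predictor model, so that the flattened table and the predictor agree.
results = kh.read_analysis_results_file(predictor_report)
snb = results.modeling_report.get_predictor("Selective Naive Bayes")
selected_names = {v.name for v in snb.selected_variables}

domain = kh.read_dictionary_file(recoding_dictionary)
recoder = domain.get_dictionary("R_" + ROOT_DICTIONARY)  # assumed recoder dictionary name
for variable in recoder.variables:
    variable.used = variable.name in selected_names
domain.export_khiops_dictionary_file("flattened.kdic")
```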
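Finally, a hypothetical sketch of the custom subclass described in the last point: it is based on KhiopsClassifier here, reuses the same assumed report accessors as in the first sketch, and is meant to be fitted on the flattened table produced by the recoder (typically with additional feature construction disabled, e.g. `n_trees=0`), so that every input feature is a feature "used" by the predictor.

```python
import numpy as np
from khiops.sklearn import KhiopsClassifier


class FlatTableKhiopsClassifier(KhiopsClassifier):
    """Hypothetical classifier for a pre-flattened table, with sklearn-style importances."""

    def fit(self, X, y=None, **kwargs):
        super().fit(X, y, **kwargs)
        # Map each selected variable of the SNB to its (Shapley-based) importance.
        # The report accessors below are assumptions, as in the previous sketches.
        snb = self.model_report_.modeling_report.get_predictor("Selective Naive Bayes")
        importance_by_name = {v.name: v.importance for v in snb.selected_variables}
        # Because the flattened input table only contains variables "used" by the
        # predictor, every input column (X is assumed to be a pandas DataFrame)
        # receives an importance, expected to be non-zero.
        self.feature_names_in_ = np.asarray(list(X.columns))
        self.feature_importances_ = np.array(
            [importance_by_name.get(name, 0.0) for name in self.feature_names_in_]
        )
        return self
```

Usage would then mirror any scikit-learn estimator: fit on the flattened table, then read `feature_names_in_` and `feature_importances_` directly.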