diff --git a/apply_model/model_export_as_cpp_code_tutorial.md b/apply_model/model_export_as_cpp_code_tutorial.md index 52bfbc6..1d74f5e 100644 --- a/apply_model/model_export_as_cpp_code_tutorial.md +++ b/apply_model/model_export_as_cpp_code_tutorial.md @@ -5,8 +5,7 @@ Catboost model could be saved as standalone C++ code. This can ease an integrati The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/google/cityhash/tree/00b9287e8c1255b5922ef90e304d5287361b2c2a) (NOTE: The exact revision under the link is required). - -### Exporting from Catboost application via command line interface: +### Exporting from Catboost application via command line interface ```bash catboost fit --model-format CPP @@ -14,8 +13,7 @@ catboost fit --model-format CPP By default model is saved into *model.cpp* file. One could alter the output name using *-m* key. If there is more that one model-format specified, then the *.cpp* extention will be added to the name provided after *-m* key. 
- -### Exporting from Catboost python library interface: +### Exporting from Catboost python library interface ```python model = CatBoost() @@ -23,7 +21,6 @@ model.fit(train_pool) model.save_model(OUTPUT_CPP_MODEL_PATH, format="CPP") ``` - ## Models trained with only Float features If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface: @@ -32,14 +29,12 @@ If the model was trained using only numerical features (no cat features), then t double ApplyCatboostModel(const std::vector& features); ``` - ### Parameters | parameter | description | |-----------|--------------------------------------------------| | features | features of a single document to make prediction | - ### Return value Prediction of the model for the document with given features. @@ -58,7 +53,6 @@ double ApplyCatboostModel(const std::vector& features) { C++11 support of non-static data member initializers and extended initializer lists - ## Models trained with Categorical features If the model was trained with categorical features present, then the application function in output code will be generated with the following interface: @@ -67,7 +61,6 @@ If the model was trained with categorical features present, then the application double ApplyCatboostModel(const std::vector& floatFeatures, const std::vector& catFeatures); ``` - ### Parameters | parameter | description | @@ -77,7 +70,6 @@ double ApplyCatboostModel(const std::vector& floatFeatures, const std::ve NOTE: You need to pass float and categorical features separately in the same order they appeared in the train dataset. For example if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass here floatFeatures = {f1, f3}, catFeatures = {f2, f4}. - ### Return value Prediction of the model for the document with given features. 
@@ -92,21 +84,16 @@ double ApplyCatboostModel(const std::vector& floatFeatures, const std::ve } ``` - ### Compiler requiremens C++14 compiler with aggregate member initialization support. Tested compilers: g++ 5(5.4.1 20160904), clang++ 3.8. - ## Current limitations -- MultiClassification models are not supported. - applyCatboostModel() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents. - +- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported. ## Troubleshooting Q: Generated model results differ from native model when categorical features present A: Please check that CityHash version 1 is used. Exact required revision of [C++ Google CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56%29). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/blob/master/util/digest/city.h). This is due other versions of CityHash may produce different hash code for the same string. - - diff --git a/apply_model/model_export_as_python_code_tutorial.md b/apply_model/model_export_as_python_code_tutorial.md index e3d1ed2..40ac67b 100644 --- a/apply_model/model_export_as_python_code_tutorial.md +++ b/apply_model/model_export_as_python_code_tutorial.md @@ -5,8 +5,7 @@ Catboost model could be saved as standalone Python code. This can ease an integr The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). 
- -### Exporting from Catboost application via command line interface: +### Exporting from Catboost application via command line interface ```bash catboost fit --model-format Python @@ -14,8 +13,7 @@ catboost fit --model-format Python By default model is saved into *model.py* file, one could alter the output name using *-m* key. If there is more that one model-format specified, then the *.py* extention will be added to the name provided after *-m* key. - -### Exporting from Catboost python library interface: +### Exporting from Catboost python library interface ```python model = CatBoost() @@ -23,7 +21,6 @@ model.fit(train_pool) model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python") ``` - ## Models trained with only Float features If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface: @@ -32,19 +29,16 @@ If the model was trained using only numerical features (no cat features), then t def apply_catboost_model(float_features): ``` - ### Parameters | parameter | type | description | |----------------|----------------------------|--------------------------------------------------| | float_features | list of int or float values| features of a single document to make prediction | - ### Return value Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal'). 
- ## Models trained with Categorical features If the model was trained with categorical features present, then the application function in output code will be generated with the following interface: @@ -53,7 +47,6 @@ If the model was trained with categorical features present, then the application def apply_catboost_model(float_features, cat_features): ``` - ### Parameters | parameter | type | description | @@ -63,18 +56,28 @@ def apply_catboost_model(float_features, cat_features): NOTE: You need to pass float and categorical features separately in the same order they appeared in the train dataset. For example if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass here float_features=[f1,f3], cat_features=[f2,f4]. - ### Return value Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal'). - ## Current limitations -- MultiClassification models are not supported. -- apply_catboost_model() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents. +- apply_catboost_model() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents. +- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported. ## Troubleshooting Q: Generated model results differ from native model when categorical features present -A: Please check that the CityHash version 1 is used. Exact required revision of [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). 
This is due other versions of CityHash may produce different hash code for the same string.
+A: Please check that the CityHash version 1 is used. Exact required revision of [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). This is due other versions of CityHash may produce different hash code for the same string. One option is to use the library [clickhouse-cityhash](https://pypi.org/project/clickhouse-cityhash/):
+
+```python
+from clickhouse_cityhash.cityhash import CityHash64
+
+def calc_cat_feature_hash(value: str):
+    value_hash = CityHash64(value.encode('utf-8')) % (2 ** 32)
+
+    if value_hash >= 2 ** 31:
+        value_hash -= 2 ** 32
+
+    return value_hash
+```
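
Reviewer sketch (not part of the patch): the snippet below isolates two conventions the tutorials rely on, so they can be checked without a CityHash library installed. It shows the signed 32-bit wrap that `calc_cat_feature_hash` applies to the 64-bit hash, and the float/categorical split described in the NOTE (f1..f4 with f2 and f4 categorical). The helper names `to_signed_32` and `split_features` and the sample row are hypothetical and do not exist in CatBoost or in the exported model code.

```python
def to_signed_32(hash64: int) -> int:
    """Keep the low 32 bits of an unsigned hash and reinterpret them as a signed int32."""
    low = hash64 % (2 ** 32)
    if low >= 2 ** 31:
        low -= 2 ** 32
    return low


def split_features(row, cat_indices):
    """Split one document into (float_features, cat_features), preserving column order."""
    cat_set = set(cat_indices)
    float_features = [v for i, v in enumerate(row) if i not in cat_set]
    cat_features = [v for i, v in enumerate(row) if i in cat_set]
    return float_features, cat_features


# Hypothetical document with columns f1, f2, f3, f4, where f2 and f4 are categorical.
row = [0.5, "red", 1.25, "large"]
float_features, cat_features = split_features(row, cat_indices=[1, 3])
```

The two lists would then be passed as `apply_catboost_model(float_features, cat_features)` to the function generated by the export, assuming the model was trained with the same column order.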