david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)

Home Page: https://isotree.readthedocs.io

License: BSD 2-Clause "Simplified" License

Languages: R 9.73%, Python 10.22%, C++ 75.17%, CMake 0.47%, Cython 3.22%, M4 0.06%, C 1.12%
Topics: isolation-forest, outlier-detection, anomaly-detection, imputation

isotree's Introduction

IsoTree

Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants. It is aimed at outlier/anomaly detection, with additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby is available in an external repository.

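The new concepts in this software are described in:

  • "Revisiting randomized choices in isolation forests"
  • "Isolation forests: looking beyond tree depth"
  • "Distance approximation using Isolation Forests"
  • "Imputing missing values with unsupervised random trees"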

A quick introduction to the Isolation Forest concept as used in this library, along with short Python example notebooks, is linked from the project repository.

(R examples are available in the internal documentation)

Description

Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.
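
The isolation idea described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: it uses the full data set instead of sub-samples, single-column uniform splits only, and the helper names (isolation_depth, average_depth) are made up for this example.

```python
import random

def isolation_depth(data, point, rng, max_depth=50):
    """Count the random axis-aligned splits needed to leave `point` alone."""
    rows = list(data)
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        # Only columns that still vary among the remaining rows can be split.
        splittable = [j for j in range(len(point))
                      if min(r[j] for r in rows) < max(r[j] for r in rows)]
        if not splittable:
            break
        col = rng.choice(splittable)
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        threshold = rng.uniform(lo, hi)
        # Keep only the rows that fall on the same side as `point`.
        side = point[col] <= threshold
        rows = [r for r in rows if (r[col] <= threshold) == side]
        depth += 1
    return depth

def average_depth(data, point, n_trees=200, seed=1):
    """Average the isolation depth over many independent random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(data, point, rng) for _ in range(n_trees))
    return total / n_trees

# Standard normal data plus one obvious outlier (illustrative values)
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outlier = (6.0, 6.0)
data.append(outlier)
inlier = min(data, key=lambda r: r[0] ** 2 + r[1] ** 2)  # point nearest the origin

depth_outlier = average_depth(data, outlier)
depth_inlier = average_depth(data, inlier)
print("average depth, outlier:", depth_outlier)
print("average depth, inlier: ", depth_inlier)
```

The outlier should need far fewer splits on average than the central point, which is the signal that the outlier score is built on.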

Note that this is a black-box model that will not produce explanations or importances - for a different take on explainable outlier detection see OutlierTree.

[Image: example plots]

(Code to produce these plots can be found in the R examples in the documentation)

Comparison against other libraries

The folder timings contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.

Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:

Library       Model  Time (s) 256  Time (s) 1024  Time (s) 10k
isotree       orig   0.00161       0.00631        0.0848
isotree       ext    0.00326       0.0123         0.168
eif           orig   0.149         0.398          4.99
eif           ext    0.16          0.428          5.06
h2o           orig   9.33          11.21          14.23
h2o           ext    1.06          2.07           17.31
scikit-learn  orig   8.3           8.01           6.89
solitude      orig   32.612        34.01          41.01

Example AUC as outlier detector in typical datasets (notebook to produce results here):

  • Satellite dataset:

Library       AUROC defaults  AUROC grid search
isotree       0.70            0.84
eif           -               0.714
scikit-learn  0.687           0.74
h2o           0.662           0.748

  • Annthyroid dataset:

Library       AUROC defaults  AUROC grid search
isotree       0.80            0.982
eif           -               0.808
scikit-learn  0.836           0.836
h2o           0.80            0.80

(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)

Non-random splits

While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in "Revisiting randomized choices in isolation forests" along with some comparisons of different split guiding criteria.
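
As a rough illustration of a guided split, the sketch below picks the threshold that maximizes an averaged gain on a single column. The gain formula used here (relative reduction in standard deviation, averaging both sides equally) is an assumption made for this example and may differ from what the library computes; the function names are hypothetical.

```python
import statistics

def averaged_gain(values, threshold):
    """Relative drop in standard deviation after splitting at `threshold`.

    Assumed formula: (sd_all - mean(sd_left, sd_right)) / sd_all.
    """
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    sd_all = statistics.pstdev(values)
    if not left or not right or sd_all == 0:
        return 0.0
    sd_split = (statistics.pstdev(left) + statistics.pstdev(right)) / 2
    return (sd_all - sd_split) / sd_all

def best_threshold(values):
    """Try the midpoints between consecutive distinct values, keep the best."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    if not candidates:
        return None  # constant column, nothing to split
    return max(candidates, key=lambda t: averaged_gain(values, t))

# Bimodal column: a uniform random split often cuts inside one of the modes,
# while the gain-guided split lands in the gap between them.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
threshold = best_threshold(values)
print("chosen threshold:", threshold)
print("gain at threshold:", averaged_gain(values, threshold))
```

A pooled variant would weight each side's standard deviation by its number of observations instead of averaging the two sides equally.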

Different outlier scoring criteria

Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in "Isolation forests: looking beyond tree depth".

Distance / similarity calculations

The general idea was extended to produce a distance (alternatively, a similarity) between observations according to how many random splits it takes to separate them. The idea is described in "Distance approximation using Isolation Forests".
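
A minimal pure-Python sketch of the separation-depth idea follows. It only counts how many random splits it takes until two points fall on different sides, averaged over many repetitions; it does not reproduce the standardization that the library applies, and the helper names are made up for this example.

```python
import random

def separation_depth(data, a, b, rng, max_depth=50):
    """Count random splits until `a` and `b` land on different sides."""
    rows = list(data)  # must contain both `a` and `b`
    depth = 0
    while depth < max_depth:
        splittable = [j for j in range(len(a))
                      if min(r[j] for r in rows) < max(r[j] for r in rows)]
        if not splittable:
            break  # remaining rows (including a and b) are identical
        col = rng.choice(splittable)
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        threshold = rng.uniform(lo, hi)
        depth += 1
        side_a = a[col] <= threshold
        if side_a != (b[col] <= threshold):
            break  # the two points were separated by this split
        # Both points went the same way: follow them down that branch.
        rows = [r for r in rows if (r[col] <= threshold) == side_a]
    return depth

def average_separation(data, a, b, n_trees=200, seed=1):
    rng = random.Random(seed)
    total = sum(separation_depth(data, a, b, rng) for _ in range(n_trees))
    return total / n_trees

rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
origin, near, far = (0.0, 0.0), (0.05, 0.05), (3.0, 3.0)
data += [origin, near, far]

sep_near = average_separation(data, origin, near)
sep_far = average_separation(data, origin, far)
print("average separation depth, near pair:", sep_near)
print("average separation depth, far pair: ", sep_far)
```

Pairs that take more splits to separate are closer together, so a distance can be obtained by mapping larger separation depths to smaller values.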

Imputation of missing values

The model can also be used to impute missing values in a fashion similar to kNN, by taking the values from observations in the terminal nodes of each tree in which an observation with missing values falls at prediction time, combining the non-missing values of the other observations as a weighted average according to the depth of the node and the number of observations that fall there. This is not related to how the model handles missing values internally, but is rather meant as a faster way of imputing by similarity. Quality is usually not as good as chained equations, but the method is a lot faster and more scalable. Using non-random splits is recommended when the model is used as an imputer. Details are described in "Imputing missing values with unsupervised random trees".
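
The combination step can be sketched as below for one numeric column. The weighting used here (deeper nodes and nodes with fewer observations count more, as depth / number of rows) is a hypothetical choice for illustration only; the library offers its own options for this, and the function name and input layout are invented for the example.

```python
def impute_from_nodes(nodes):
    """Combine terminal-node values into a single imputed number.

    `nodes` holds one (depth, values) pair per tree: the depth of the terminal
    node reached by the incomplete row, and the non-missing values of the
    target column among the other observations in that node.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for depth, values in nodes:
        if not values:
            continue  # this node has nothing to contribute
        weight = depth / len(values)  # hypothetical weighting scheme
        weighted_sum += weight * (sum(values) / len(values))
        total_weight += weight
    if total_weight == 0:
        return None  # no tree could provide a value
    return weighted_sum / total_weight

# Two trees: a deep node holding two observations, a shallow node holding one
nodes = [(4, [10.0, 12.0]), (2, [20.0])]
imputed = impute_from_nodes(nodes)
print("imputed value:", imputed)
```

When no terminal node contains a non-missing value for the column, this sketch returns None rather than inventing a number.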

Highlights

There are already many available implementations of isolation forests for both Python and R (such as the one from the original paper's authors or the one in SciKit-Learn), but at the time of writing, all of them lack some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:

  • Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
  • Can handle missing values (but performance with them is not so good).
  • Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
  • Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
  • Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
  • Can calculate isolation kernels or proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
  • Can produce missing value imputations according to observations that fall on each terminal node.
  • Can work with sparse matrices.
  • Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
  • Supports sample/observation weights, either as sampling importance or as distribution density measurement.
  • Supports user-provided column sample weights.
  • Can sample columns randomly with weights given by kurtosis.
  • Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
  • Can fit trees incrementally to user-provided data samples.
  • Produces serializable model objects with reasonable file sizes.
  • Can convert the models to treelite format (Python-only and depending on the parameters that are used) (example here).
  • Can translate the generated trees into SQL statements.
  • Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.

(Note that categoricals, NAs, and density-like sample weights, are treated heuristically with different options as there is no single logical extension of the original idea to them, and having them present might degrade performance/accuracy for regular numerical non-missing observations)
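
To make the point about harmonic numbers concrete, the sketch below compares the exact harmonic number with the logarithmic approximation and with a higher-order asymptotic expansion, and plugs them into the expected isolation depth c(n) = 2H(n-1) - 2(n-1)/n from the original Isolation Forest paper. Which expansion and which cut-off sizes the library uses are not stated here, so the specific terms below are an assumption.

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic_exact(n):
    """H(n) = 1 + 1/2 + ... + 1/n, summed directly."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

def harmonic_log(n):
    """First-order approximation: ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA

def harmonic_expanded(n):
    """Higher-order expansion: ln(n) + gamma + 1/(2n) - 1/(12 n^2)."""
    return math.log(n) + EULER_GAMMA + 1 / (2 * n) - 1 / (12 * n * n)

def expected_depth(n, harmonic=harmonic_exact):
    """Expected isolation depth c(n) for a sample of n observations."""
    if n <= 1:
        return 0.0
    return 2 * harmonic(n - 1) - 2 * (n - 1) / n

def standardized_score(avg_depth, n):
    """Depth-based outlier score 2^(-avg_depth / c(n)); 0.5 means average."""
    return 2 ** (-avg_depth / expected_depth(n))

for n in (4, 10, 256):
    print(n, harmonic_exact(n), harmonic_log(n), harmonic_expanded(n))
```

The error of the logarithmic approximation is largest at small sizes, which is where small terminal-node remainders fall, so using the exact sum there is cheap and avoids a systematic bias.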

Installation

  • R:

Note: This package benefits from extra optimizations that aren't enabled by default for R packages. See this guide for instructions on how to enable them.

install.packages("isotree")

  • Python:

Note: requires C/C++ compilers configured for Python. See this guide for instructions.

pip install isotree

or if that fails:

pip install --no-use-pep517 isotree

Note for macOS users: on macOS, the Python version of this package might compile without multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:

brew install libomp

And then reinstall this package: pip install --upgrade --no-deps --force-reinstall isotree.


IMPORTANT: the setup script will try to add compilation flag -march=native. This instructs the compiler to tune the package for the CPU on which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable on other computers. If building a binary wheel of this package or putting it into a docker image which will be used on different machines, this can be overridden either by (a) defining an environment variable DONT_SET_MARCH=1, or by (b) manually supplying compilation CFLAGS as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:

export DONT_SET_MARCH=1
pip install isotree

or, by specifying some compilation flag for architecture:

export CFLAGS="-march=x86-64"
export CXXFLAGS="-march=x86-64"
pip install isotree

  • C and C++:
git clone --recursive https://www.github.com/david-cortes/isotree.git
cd isotree
mkdir build
cd build
cmake -DUSE_MARCH_NATIVE=1 ..
cmake --build .

### for a system-wide install in linux
sudo make install
sudo ldconfig

(Will build as a shared object - linkage is then done with -lisotree)

Be aware that the snippet above includes option -DUSE_MARCH_NATIVE=1, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produce objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.

The package has an optional dependency on the Robin-Map library, which is added to this repository as a linked submodule. If this library is not found under /src, the build will use the compiler's own hashmaps, which are less optimal.

  • Ruby:

See external repository with wrapper.

Sample usage

Warning: default parameters in this implementation are very different from default parameters in others such as Scikit-Learn's, and these defaults won't scale to large datasets (see documentation for details).

  • Python:

(Library is Scikit-Learn compatible)

import numpy as np
from isotree import IsolationForest

### Random data from a standard normal distribution
np.random.seed(1)
n = 100
m = 2
X = np.random.normal(size = (n, m))

### Will now add obvious outlier point (3, 3) to the data
X = np.r_[X, np.array([3, 3]).reshape((1, m))]

### Fit a small isolation forest model
iso = IsolationForest(ntrees = 10, nthreads = 1)
iso.fit(X)

### Check which row has the highest outlier score
pred = iso.predict(X)
print("Point with highest outlier score: ",
      X[np.argsort(-pred)[0], ])
  • R:

(see documentation for more examples - help(isotree::isolation.forest))

### Random data from a standard normal distribution
library(isotree)
set.seed(1)
n <- 100
m <- 2
X <- matrix(rnorm(n * m), nrow = n)

### Will now add obvious outlier point (3, 3) to the data
X <- rbind(X, c(3, 3))

### Fit a small isolation forest model
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)

### Check which row has the highest outlier score
pred <- predict(iso, X)
cat("Point with highest outlier score: ",
    X[which.max(pred), ], "\n")
  • C++:

The package comes with two different C++ interfaces: (a) a struct-based interface which exposes the library's full functionality but performs few checks on the inputs it receives and is difficult to use due to the large number of arguments that functions require; and (b) a scikit-learn-like interface in which the model exposes a single class with methods like 'fit' and 'predict', which is less flexible than the struct-based interface but easier to use, and whose function signatures disallow some potential errors due to invalid parameter combinations. Interface (b) is recommended unless some specific functionality from (a) is required.

See files: isotree_cpp_oop_ex.cpp for an example with the scikit-learn-like interface (recommended); and isotree_cpp_ex.cpp for an example with the struct-based interface.

Note that the second interface does not expose all the functionalities - for example, it only supports inputs of classes 'double' and 'int', while the struct-based interface also supports 'float'/'size_t'.

  • C:

See file isotree_c_ex.c.

Note that the C interface is a simple wrapper over a subset of the scikit-learn-like C++ interface, but using only ISO C bindings for better compatibility and easier wrapping in other languages.

  • Ruby

See external repository with wrapper.

Documentation

  • Python: documentation is available at ReadTheDocs.
  • R: documentation is available internally in the package (e.g. help(isolation.forest)) and in CRAN.
  • C++: documentation is available in the public header (include/isotree.hpp) and in the source files. See also the header for the scikit-learn-like interface (include/isotree_oop.hpp).
  • C: interface is not documented per-se, but the same documentation from the C++ header applies to it. See also its header for some non-comprehensive comments about the parameters that functions take (include/isotree_c.h).
  • Ruby: see external repository with wrapper for the syntax and the Python docs for details about the parameters.

Reducing library size and compilation times

By default, this library will compile with some functionalities that are unlikely to be used and which can significantly increase the size of the library and compilation times - if using this library in e.g. embedded devices, it is highly recommended to disable some options, and if creating a docker image for serving models, one might want to make it as minimal as possible. Being a C++ templated library, it generates multiple versions of its functions that are specialized for different types (such as C double and float), and in practice not all the supported types are likely to be used.

In particular, the library supports usage of long double type for more precise aggregated calculations (e.g. standard deviations), which is unlikely to end up used (its usage is determined by a user-passed function argument and not available in the C or C++-OOP interfaces). For a smaller library and faster compilation, support for long double can be disabled by:

  • Defining an environment variable NO_LONG_DOUBLE, which will be accepted by the Python and R build systems - e.g. first run export NO_LONG_DOUBLE=1, then a pip install; or for R, run Sys.setenv("NO_LONG_DOUBLE" = "1") before install.packages.
  • Passing option NO_LONG_DOUBLE to the CMake script - e.g. cmake -DNO_LONG_DOUBLE=1 .. (only when using the CMake system, which is not used by the Python and R versions).

Additionally, the library will produce functions for different floating point and integer types of the input data. In practice, one usually ends up using only double and int types (these are the only types supported in the R interface and in the C and C++-OOP interfaces). When building it as a shared library through the CMake system, these can be disabled (leaving only double and int support) through option NO_TEMPLATED_VERSIONS - e.g.:

cmake -DNO_TEMPLATED_VERSIONS=1 ..

(this option is not available for the Python build system)

References

  • Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
  • Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
  • Hariri, Sahand, Matias Carrasco Kind, and Robert J. Brunner. "Extended Isolation Forest." arXiv preprint arXiv:1811.02141 (2018).
  • Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "On detecting clustered anomalies using SCiForest." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2010.
  • https://sourceforge.net/projects/iforest/
  • https://math.stackexchange.com/questions/3388518/expected-number-of-paths-required-to-separate-elements-in-a-binary-tree
  • Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014.
  • Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
  • Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
  • Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
  • Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
  • Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
  • Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

isotree's People

Contributors

2128506, alkk, ankane, david-cortes, enchufa2, svenvanhal, zds0

isotree's Issues

Undefined symbols with C++ example

Hi, thanks for this library! When trying to build the C++ example, it's failing with:

Undefined symbols for architecture x86_64:
  "fit_iforest(IsoForest*, ExtIsoForest*, double*, unsigned long, int*, unsigned long, int*, double*, unsigned long*, unsigned long*, unsigned long, unsigned long, CoefType, double*, bool, bool, unsigned long, unsigned long, unsigned long, unsigned long, bool, bool, bool, double*, double*, bool, double*, bool, double, double, double, double, double, MissingAction, CategSplit, NewCategAction, bool, Imputer*, unsigned long, UseDepthImp, WeighImpRows, bool, unsigned long long, int)", referenced from:
      _main in isotree_cpp_ex-8e1641.o
ld: symbol(s) not found for architecture x86_64

I think it's due to this commit: 89673ba (as it works with the previous commit)

Using with python3.8 and numpy==1.24.4

Thanks for the great work!
Is it possible to use this project with python3.8 and numpy 1.24.4?

I get an error when I try to install the latest isotree versions with this configuration.

Thank you,
Mike

Pickling an extended model uses 30 GB of RAM

Hi. Thanks for sharing this great library.

When I pickle.dump a trained model with 200 trees, memory usage explodes. An extended forest model (trained on 1 million rows and 350 columns) increases memory usage by over 30 GB and takes up 5 GB on disk when finished. Note that I am using pickle protocol 5 (with Python 3.8.5). Using an earlier protocol crashed my machine (90 GB of RAM) due to memory usage, so I was not able to pickle at all.

By contrast, pickling a SCiForest model does not increase memory usage significantly and only takes up 10% of the disk space when done.

Is that the expected behavior? Anything I can do to optimize memory used? Many thanks,

Unexpected behaviour during imputations with all NA features

While using IsolationForest for imputation, I found that when a feature is entirely NA in the training data (so no imputation can be done), the transformed dataset (an imputed test dataset that does not overlap with the training data and is also entirely NA for that feature) contains mostly zeros (~93%) and some NA values for that feature. I could not replicate the issue with a smaller dataset, but maybe this description can help detect the problem.
For reference, my training and test datasets have shapes (400000, 1000) and there are 3 categorical features with 10 to 40 levels.
To sum up, IsolationForest's transform method introduces some zeros into "un-imputable" features.

New Functions in R package version 0.1.24 not visible

Hi David,

I updated the package to make sure that I had the latest version of isotree in R. However, the following functions don't show up when I try to call them:

  • predict.isolation_forest
  • print.isolation_forest
  • summary.isolation_forest

The error I receive is: Error: 'summary.isolation_forest' is not an exported object from 'namespace:isotree'

Is there anything I need to do in addition to updating the package?

Thanks!

(P.S. I love this package, by the way. It has been extremely useful and informative.)

Problems building isotree library on Windows using Visual Studio 2022 and CMake

Hi dev team,

I'm trying to build the latest version of the isotree library on Windows using Visual Studio and CMake.

Windows 10
Visual Studio 2022 - latest
CMake 3.25.1

The project is created using CMake GUI without any changes. The flag recommended in the ReadMe is set.

While building the library, I run into a redefinition error for the input and output operators.

Is there something special to be done when using Windows as target system?

Thanks for your support.

Sample weight explanation

Hi David, (typo in the title: I meant column weights, not sample weights)

I don't believe there is a full explanation of how the column_weights parameter gets applied in the isotree model. I understand that if I have 5 features I can pass a list to this parameter such as (5,2,3,4,7), in which case my fifth feature has the highest weight, but what does that actually do in the model? Also, the help for this parameter says "Ignored when picking columns by deterministic criterion". How do you pick columns by a deterministic criterion? Is that the extended model? Thank you!

Typo: predict.isotree() documentation (type="avg_dpth" not accepted)

In isotree/R/isoforest.R, the function predict.isolation_forest documentation says that it allows the following types:

@param type Type of prediction to output. Options are:
\itemize{
\item "score" for the standardized outlier score, where values closer to 1 indicate more outlierness, while values
closer to 0.5 indicate average outlierness, and close to 0 more averageness (harder to isolate).
\item "avg_depth" for the non-standardized average isolation depth.
\item "dist" for approximate pairwise distances (must pass more than 1 row) - these are
standardized in the same way as outlierness, values closer to zero indicate nearer points,
closer to one further away points, and closer to 0.5 average distance.
\item "avg_sep" for the non-standardized average separation depth.
\item "tree_num" for the terminal node number for each tree - if choosing this option,
will return a list containing both the outlier score and the terminal node numbers, under entries
score and tree_num, respectively.
\item "impute" for imputation of missing values in newdata.
}

However, avg_depth is not in the allowed types within the code, so trying to use the argument type="avg_depth" returns an error.
>n <- 100
>m <- 2
>X <- matrix(rnorm(n * m), nrow = n)
>X <- rbind(X, c(3, 3))
>iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
>dpths <- predict(iso, X, type="avg_depth")
Error in check.str.option(type, "type", allowed_type) :
'type' must be one of "score", "avg_path", "dist", "avg_sep", "tree_num", "impute".

From line 758 of the same script:
allowed_type <- c("score", "avg_path", "dist", "avg_sep", "tree_num", "impute")

It looks like avg_path is meant to be avg_depth, or the other way around.

Installing isotree compatibility issues

Hi David,

We are having issues installing isotree on our R server. I could install it fine locally on my laptop, but when I requested to have it installed on our RStudio Server, we received incompatibility errors. I am being told it is possibly a C++ issue. Are you familiar with any incompatibilities like this when installing the package?

Error in fit_model

I get this error when running the following model, but I don't always get the error. Sometimes it runs fine.

Error in fit_model(pdata$X_num, pdata$X_cat, unname(pdata$ncat), pdata$Xc, : negative length vectors are not allowed

isotree_mdl2 <- isolation.forest(df,
                                 ntrees = 400,
                                 sample_size = 256,
                                 ndim = 1,
                                 prob_pick_pooled_gain = 0,
                                 prob_pick_avg_gain = 0,
                                 penalize_range = FALSE,
                                 missing_action = "fail",
                                 nthreads = parallel::detectCores() - 9)

Running isotree in databricks environment given requirements

Hi, I seem to be getting errors when trying to install isotree. I have checked the requirements of the package and ensured they are satisfied. Could you please have a look as well? I apologise, as this is similar to another question, but I don't know what else to do given that I have satisfied the requirements.
The command is !pip install isotree

Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: isotree
Building wheel for isotree (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for isotree (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [604 lines of output]
/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/extension.py:134: UserWarning: Unknown Extension options: 'install_requires'
warnings.warn(msg)
/databricks/python/lib/python3.10/site-packages/setuptools/dist.py:530: UserWarning: Normalizing '0.6.1-2' to '0.6.1.post2'
warnings.warn(tmpl.format(**locals()))
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/isotree
copying isotree/__init__.py -> build/lib.linux-x86_64-cpython-310/isotree
running build_ext
--- Checking compiler support for option '-march=native'
--- Checking compiler support for option '-fopenmp'
--- Checking compiler support for '__restrict' qualifier
--- Checking compiler support for option '-O3'
--- Checking compiler support for option '-fno-math-errno'
--- Checking compiler support for option '-fno-trapping-math'
--- Checking compiler support for option '-std=c++17'
--- Checking compiler support for option '-flto=auto'
cythoning isotree/cpp_interface.pyx to isotree/cpp_interface.cpp

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      void tmat_to_dense(double *tmat, double *dmat, size_t n, double fill_diag)
  
      void merge_models(IsoForest*     model,      const IsoForest*     other,
                        ExtIsoForest*  ext_model,  const ExtIsoForest*  ext_other,
                        Imputer*       imputer,    const Imputer*       iother,
                        TreesIndexer*  indexer,    const TreesIndexer*  ind_other) except + nogil
                                                                                                ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:223:95: undeclared name not builtin: nogil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  
          cdef int ret_val = 0
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              ret_val = \
              fit_iforest(model_ptr, ext_model_ptr,
                        ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:841:23: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          cdef TreesIndexer *indexer_ptr = NULL
          if not self.indexer.indices.empty():
              indexer_ptr = &self.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              add_tree(model_ptr, ext_model_ptr,
                     ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1042:20: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          cdef TreesIndexer *indexer_ptr = NULL
          if not self.indexer.indices.empty():
              indexer_ptr = &self.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              add_tree(model_ptr, ext_model_ptr,
                     ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1042:20: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          cdef TreesIndexer*  indexer_ptr = NULL
          if not self.indexer.indices.empty():
              indexer_ptr = &self.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              predict_iforest(numeric_data_ptr, categ_data_ptr,
                            ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1181:27: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          cdef TreesIndexer*  indexer_ptr = NULL
          if not self.indexer.indices.empty():
              indexer_ptr = &self.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              calc_similarity(numeric_data_ptr, categ_data_ptr,
                            ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1305:27: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
              model_ptr      =  &self.isoforest
          else:
              ext_model_ptr  =  &self.ext_isoforest
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              impute_missing_values(numeric_data_ptr, categ_data_ptr, is_col_major,
                                  ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1391:33: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
              ptr_indexer = &self.indexer
          if not other.indexer.indices.empty():
              prt_ind_other = &other.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              merge_models(ptr_model, ptr_other,
                         ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1441:24: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          else:
              ext_model_ptr  =  &self.ext_isoforest
  
          cdef vector[cpp_string] res
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              res = generate_sql(model_ptr, ext_model_ptr,
                               ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1563:30: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          if not self.indexer.indices.empty():
              indexer = &self.indexer
  
          cdef vector[cpp_string] res
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              res = generate_dot(model_ptr, ext_model_ptr, indexer,
                               ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1587:30: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          if not self.indexer.indices.empty():
              indexer = &self.indexer
  
          cdef vector[cpp_string] res
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              res = generate_json(model_ptr, ext_model_ptr, indexer,
                                ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1611:31: Calling gil-requiring function not allowed without gil
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          else:
              ext_model_ptr  =  &self.ext_isoforest
          cdef TreesIndexer*  indexer_ptr = &self.indexer
  
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              set_reference_points(model_ptr, ext_model_ptr, indexer_ptr,
                                 ^
  ------------------------------------------------------------
  
  isotree/cpp_interface.pyx:1752:32: Calling gil-requiring function not allowed without gil
  
  building 'isotree._cpp_interface' extension
  creating build/temp.linux-x86_64-cpython-310
  creating build/temp.linux-x86_64-cpython-310/isotree
  creating build/temp.linux-x86_64-cpython-310/src
  x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -D_USE_XOSHIRO -DNDEBUG -D_FOR_PYTHON "-DCYTHON_EXTERN_C=extern \"C\"" -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -D_USE_ROBIN_MAP -I/databricks/python/lib/python3.10/site-packages/numpy/core/include -I. -I./src -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-20cb01b3-10e5-4c84-aba1-67507ea71074/include -I/usr/include/python3.10 -c isotree/cpp_interface.cpp -o build/temp.linux-x86_64-cpython-310/isotree/cpp_interface.o -march=native -fopenmp -O3 -fno-math-errno -fno-trapping-math -std=c++17 -flto=auto
  isotree/cpp_interface.cpp:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
      1 | #error Do not use this file, it is the result of a failed Cython compilation.
        |  ^~~~~
  error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for isotree
Failed to build isotree
ERROR: Could not build wheels for isotree, which is required to install pyproject.toml-based projects

Can't install Isotree anymore

Hello,
I'm not sure what has changed, since there has been no new release, but I can't pip install isotree anymore. It was working a week ago.

                       from isotree/cpp_interface.cpp:1162:
      /tmp/pip-build-env-wopixisn/overlay/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
         17 | #warning "Using deprecated NumPy API, disable it with " \
            |  ^~~~~~~
      isotree/cpp_interface.cpp:3391:21: error: conflicting declaration of ‘void cy_warning(const char*)’ with ‘C++’ linkage
       3391 | __PYX_EXTERN_C void cy_warning(char const *); /*proto*/
            |                     ^~~~~~~~~~
      In file included from ./src/headers_joined.hpp:63,
                       from isotree/cpp_interface.cpp:1174:
      ./src/isotree.hpp:115:21: note: previous declaration with ‘C’ linkage
        115 |     extern "C" void cy_warning(const char *msg);
            |                     ^~~~~~~~~~
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for isotree
Failed to build isotree
ERROR: Could not build wheels for isotree, which is required to install pyproject.toml-based projects

I tried with python 3.11.1 and 3.11.4
pip version 23.2 and 22.3.1
Ubuntu 22.04.2 LTS

Trapped signals

Hi David, it looks like running fit_iforest adds a signal handler that causes signals to be ignored. You can test it out with a basic Flask app.

from flask import Flask
import numpy as np
from isotree import IsolationForest
app = Flask(__name__)

@app.route('/')
def hello_world():
    X = np.random.normal(size = (100, 2))
    iso = IsolationForest(ntrees = 10, ndim = 2, nthreads = 1)
    iso.fit(X)
    return 'Hello, World!'

Start the server and visit the page. When trying to exit with ctrl+C, it prints:

^CError: procedure was interrupted
^CError: procedure was interrupted
^CError: procedure was interrupted
^CError: procedure was interrupted
^CError: procedure was interrupted
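
For illustration, the usual way to avoid leaving a handler behind is to save the previous SIGINT handler and restore it once the work is done. The following is a minimal sketch in plain Python, not isotree's actual code; the function name `run_with_temporary_sigint_handler` is hypothetical, and `signal.signal` can only be called from the main thread.

```python
import signal


def run_with_temporary_sigint_handler(work):
    """Run work() under a temporary SIGINT handler, then restore the old one.

    Hypothetical helper for illustration; must be called from the main thread.
    """
    interrupted = {"flag": False}

    def _on_sigint(signum, frame):
        # Only record the interruption; the caller decides what to do with it.
        interrupted["flag"] = True

    # signal.signal returns the handler that was installed before this call.
    previous = signal.signal(signal.SIGINT, _on_sigint)
    try:
        result = work()
    finally:
        # Restore the earlier handler even if work() raised.
        # previous is None only if the handler was not installed from Python.
        if previous is not None:
            signal.signal(signal.SIGINT, previous)
    return result, interrupted["flag"]


if __name__ == "__main__":
    before = signal.getsignal(signal.SIGINT)
    value, was_interrupted = run_with_temporary_sigint_handler(lambda: sum(range(10)))
    print(value, was_interrupted, signal.getsignal(signal.SIGINT) is before)
```

If the restore step is skipped, every later Ctrl+C keeps reaching the temporary handler, which would match the repeated message shown above.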

Isotree Isolation forest generating large model pickle file

I am building an anomaly detection model using isotree. When I dump the model with joblib without any compression, the resulting pickle file is 65GB. Loading this file into a Python object for real-time scoring of new data requires around 256GB of RAM. Is there a better way to do this, or are there any tips for reducing the model size without impacting the accuracy of the model?

Importing regular Isolation Forest fails

IsolationForest.import_model() has a critical bug which corrupts regular iForest models.

When importing a model, first a new IsolationForest object is created, whose constructor calls _reset_obj(). However, as this call is made before any loaded properties are set, it falls back to using default values.

When loading a regular iForest model, this causes the property _is_extended_ to be set to True, because the default value for ndim is 3. The subsequent overwrites of parameters from the file in import_model() leave _is_extended_ untouched. Therefore, the resulting object now has both self.ndim == 1 and self._is_extended_ == True.

This renders the imported model useless.

To reproduce:

import numpy as np
from isotree import IsolationForest


def import_bug(model_export_path):
    
    # Create IsolationForest and fit with some dummy data
    iso = IsolationForest(ndim=1)
    iso.fit(np.random.rand(10, 10))

    # Export model to file
    iso.export_model(model_export_path)

    # Import model
    iso_loaded = IsolationForest.import_model(model_export_path)

    # Original object prints "Isolation Forest"
    print(iso)

    # Imported object prints "Extended Isolation Forest" 🤔
    print(iso_loaded)

    assert iso_loaded.ndim == iso.ndim  # ndim == 1
    assert iso_loaded._is_extended_ == iso._is_extended_  # Fails! _is_extended_ is True even though ndim == 1


if __name__ == "__main__":
    import_bug("/tmp/isotree_import_bug.pkl")
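
The same failure mode can be shown without isotree. The sketch below uses a hypothetical `ToyForest` class (none of these names or method bodies come from the library): a derived flag is computed in the constructor from default arguments, goes stale when the saved parameters are written on top, and is fixed by recomputing it after loading.

```python
class ToyForest:
    """Hypothetical stand-in for a model class that keeps a derived flag."""

    def __init__(self, ndim=3):
        self.ndim = ndim
        self._reset_obj()

    def _reset_obj(self):
        # Derived state, computed from whatever ndim is at this moment.
        self._is_extended_ = self.ndim > 1

    @classmethod
    def import_buggy(cls, saved_params):
        obj = cls()                        # _reset_obj() runs with the default ndim=3
        obj.__dict__.update(saved_params)  # ndim is overwritten, the flag is not
        return obj

    @classmethod
    def import_fixed(cls, saved_params):
        obj = cls()
        obj.__dict__.update(saved_params)
        obj._is_extended_ = obj.ndim > 1   # recompute derived state after loading
        return obj


if __name__ == "__main__":
    saved = {"ndim": 1}
    print(ToyForest.import_buggy(saved)._is_extended_)  # stale flag
    print(ToyForest.import_fixed(saved)._is_extended_)  # consistent with ndim
```

Whether the real fix should recompute the flag or store it in the exported metadata is a design choice for the library; this only illustrates the ordering problem.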

pip install issue not fixed but closed

Hi again,

Unfortunately the referenced issue is not "closed" for me.

Like I said, I do not have sudo access to install build-essential. We cannot change the docker image, either. Is there another workaround? Are there pre-built wheels?

Will you respond to closed issues or do we have to keep it open to continue the discussion?

Originally posted by @rmurphy2718 in #44 (comment)

Error when using fit() and build_imputer=True is misleading

Can IsolationForest use a .fit() method when build_imputer=True? Every attempt I make fails, however the error message I receive doesn't necessarily make me think that this is impossible. See the example below:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
cali = pd.concat(fetch_california_housing(return_X_y=True, as_frame=True), axis=1)

for c in cali.columns:
    ind = np.random.choice(cali.shape[0], size=100)
    cali.loc[ind,c] = np.nan

from isotree import IsolationForest
imputer = IsolationForest(
    build_imputer=True
)

# Works as intended
imputer.fit_transform(cali)

# Throws the error below
imputer.fit(cali)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\swilson\anaconda3\envs\impcomp\lib\site-packages\isotree\__init__.py", line 1382, in fit
    self._cpp_obj.fit_model(_get_num_dtype(X_num, sample_weights, column_weights),
  File "isotree\cpp_interface.pyx", line 855, in isotree._cpp_interface.isoforest_cpp_obj.fit_model
RuntimeError: Cannot produce missing data imputations at fit time when using sub-sampling.

The error makes me think that I just have an incorrect parameter, but I have tried different sampling parameters with no luck. The fact that .fit_transform() works with no problems makes me think that .fit() simply isn't supported.

Saving models between interfaces

Hi David,
I'm now trying out your library, and so far so good. I would like to train my models in Python, then later export them and use them for inference in C++. Is there any way to accomplish this with your library's interfaces?
Thanks,
Borja.

some-versus-many distance computation

predict_distance appears to currently support only returning O(n^2) distances, which is not scalable. Could you add an option to pass in two data frames, (X=n x k, Y=m x k) instead of one, such that the returned distances are of dimensionality n x m? Even a one-versus-all option would be very useful.
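
To make the requested output shape concrete, here is a small sketch of a some-versus-many computation. It is not isotree's API: the function name `cross_distances` is hypothetical, and plain Euclidean distance stands in for the isolation-based distance.

```python
import math


def cross_distances(X, Y, metric=math.dist):
    """Return an n x m nested list with metric(x, y) for each x in X and y in Y.

    Hypothetical helper; the default Euclidean metric is only a placeholder.
    """
    return [[metric(x, y) for y in Y] for x in X]


if __name__ == "__main__":
    X = [(0.0, 0.0), (3.0, 4.0)]               # n = 2 query points
    Y = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]   # m = 3 reference points
    D = cross_distances(X, Y)
    print(len(D), len(D[0]))                   # n rows, m columns
```

The cost is n * m metric evaluations instead of n^2, and a one-versus-all query is the special case where X has a single row.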

Mac arm nthreads problem

I'm new to this.

To try to improve efficiency, I set the parameter nthreads=2, but received this warning:
Warning message: In isolation.forest(test, ntrees = 100, nthreads = 2) : Attempting to use more than 1 thread, but package was compiled without OpenMP support. See https://mac.r-project.org/openmp/
I read the guide on that page and found that I should specify some compilation flags for the architecture, maybe in Makevars.
But I don't know what should be added to this file.
Please help.

What is equivalent to contamination parameter in sklearn?

Thank you for providing this interesting package. I'd like to use it instead of scikit-learn, but I'm not getting the same results as with scikit-learn. I used the following configuration:

IsolationForest(max_samples=100, random_state=self.rng, bootstrap=False, warm_start=True, n_jobs=None, contamination=0.003, verbose=0)
Contamination, I believe, is the most significant parameter for my aim. Could you please tell me how I may achieve the same results?
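
For context, contamination in scikit-learn does not change how the trees are built; it only sets the score threshold so that roughly that fraction of the training data is flagged. Assuming a model that returns one outlier score per row with higher meaning more anomalous, a similar effect can be sketched by thresholding the scores directly. The function name `flag_top_fraction` is hypothetical and the scores below are made up.

```python
import math


def flag_top_fraction(scores, contamination):
    """Mark the highest-scoring fraction of rows as outliers.

    Hypothetical helper: assumes higher score means more anomalous.
    Ties at the threshold are all flagged, so slightly more rows may be marked.
    """
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be strictly between 0 and 1")
    n_flag = max(1, math.ceil(contamination * len(scores)))
    # The threshold is the n_flag-th largest score.
    threshold = sorted(scores, reverse=True)[n_flag - 1]
    return [s >= threshold for s in scores]


if __name__ == "__main__":
    scores = [0.40, 0.45, 0.90, 0.50, 0.42]
    print(flag_top_fraction(scores, contamination=0.2))
```

Even with the same threshold rule, the flagged rows can differ between libraries because the forests themselves are built with different defaults and random seeds.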

Error in fit_model(pdata$X_num, pdata$X_cat, unname(pdata$ncat), pdata$Xc, : std::bad_alloc

Hello,

Please let me know if you need more code or explanation, I'm new to reporting issues.

I'm getting the following error when trying to build an isolation tree with 360K observations.

Code:
iso_forest <- isolation.forest(data, ntrees = 100, nthreads = 1)

Error:
Error in fit_model(pdata$X_num, pdata$X_cat, unname(pdata$ncat), pdata$Xc, :
std::bad_alloc

I have a hunch it is because I'm using it on 360K observations. Is there a limit to the number of observations on which I can use isotree?

Models without `categ_cols` fail to export

Models without categ_cols (i.e. models fit on dataframes with numeric columns only) fail to export.

To reproduce, run the python example in the README

import numpy as np
from isotree import IsolationForest

### Random data from a standard normal distribution
np.random.seed(1)
n = 100
m = 2
X = np.random.normal(size = (n, m))

### Will now add obvious outlier point (3, 3) to the data
X = np.r_[X, np.array([3, 3]).reshape((1, m))]

### Fit a small isolation forest model
iso = IsolationForest(ntrees = 10, ndim = 2, nthreads = 1)
iso.fit(X)

### Check which row has the highest outlier score
pred = iso.predict(X)
print("Point with highest outlier score: ",
      X[np.argsort(-pred)[0], ])

and then try to call export_model

iso.export_model('test_isolationforest', use_cpp=True)

the following error is thrown

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-8ede0bd56e06> in <module>
----> 1 iso.export_model('test_isolationforest', use_cpp=True)

isotree/__init__.py in export_model(self, file, use_cpp)
   2128         """
   2129         assert self.is_fitted_
-> 2130         metadata = self._export_metadata()
   2131         with open(file + ".metadata", "w") as of:
   2132             json.dump(metadata, of, indent=4)

isotree/__init__.py in _export_metadata(self)
   2346             "cols_categ" : list(self.cols_categ_),
   2347             "cat_levels" : [list(m) for m in self._cat_mapping],
-> 2348             "categ_cols" : list(self.categ_cols),
   2349             "categ_max" : list(self._cat_max_lev)
   2350         }

TypeError: 'NoneType' object is not iterable

I was able to resolve this by replacing line 2348 (in the data_info dict within export_metadata()) with the following:

            "categ_cols": [] if self.categ_cols is None else list(self.categ_cols),

However, I'm unsure if this is the complete/proper solution. Thanks.
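
The same guard can be written once as a small helper and applied to every optional field before it goes into the JSON metadata. This is only a sketch of the pattern: the names `as_json_list` and `build_data_info` are hypothetical, and only the two dictionary keys are taken from the traceback above.

```python
import json


def as_json_list(value):
    """Return [] for None, otherwise a plain list built from the iterable."""
    return [] if value is None else list(value)


def build_data_info(cols_categ, categ_cols):
    # Hypothetical subset of the metadata dict; every optional field is guarded.
    return {
        "cols_categ": as_json_list(cols_categ),
        "categ_cols": as_json_list(categ_cols),
    }


if __name__ == "__main__":
    # A model fit on numeric columns only: both fields may be None.
    info = build_data_info(cols_categ=None, categ_cols=None)
    print(json.dumps(info, indent=4))
```

On import, an empty list would then have to be treated the same way as a missing value, which is worth checking against the import code.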

Same category is always imputed when enough trees are grown

See this example:

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = pd.concat(load_iris(return_X_y=True, as_frame=True), axis=1)
iris["target"] = iris["target"].astype("category")

amp_iris = iris.copy()
na_where = {}
for c in iris.columns:
    na_where[c] = sorted(np.random.choice(amp_iris.shape[0], size=25, replace=False))
    amp_iris.loc[na_where[c],c] = np.nan

# Only class 0 was imputed
from isotree import IsolationForest
imputer = IsolationForest(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
t = "target"
imp_iris.loc[na_where[t], t].unique()

# Use fewer trees; the imputation is much more accurate
imputer = IsolationForest(
    ntrees=10,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
(imp_iris.loc[na_where[t], t] == iris.loc[na_where[t], t]).mean()

Using any number of trees over 100 caused only the first class (0) to ever be imputed. Using only 10 trees usually makes the imputation much more accurate. I tried playing around with different max_depths, but to no avail. Are there any obvious parameters I am missing to make the categorical imputation more accurate?

Ruby Library

Hi David, I came across your work last week and am a big fan of a number of your projects!

I wanted to let you know I created a Ruby library for IsoTree. It's fairly basic right now (only supports numeric data and has a limited number of methods and options) and follows the Python API. If you have any feedback, let me know or feel free to create an issue on the project. Thanks!

How to re-train when a few values are marked as anomalies when they should not have been?

Hello,
Let's say we have an array of size n. A few items are marked as anomalies that should not have been. How do you recommend refitting the model, so those items are not marked as anomalies in the future?
I considered extending the array with X copies of those items and re-training. Is that the right approach? If yes, what is the optimal value for X?

Example (columnar data):

array = [1, 4, 3, 15, ...]
15 is marked as an anomaly.
We copy 15 multiple times.
array = [1, 4, 3, 15, 15, 15 ...]
And fit the model again.

If it matters, here are the parameters I'm using:

IsolationForest(
            ndim=1, ntrees=100,
            penalize_range=False,
            prob_pick_pooled_gain=0,
            missing_action="impute",  # Dealing with None values
            new_categ_action="impute",  # Dealing with new categories
        ) 
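The duplication idea described above can be sketched in plain Python as follows. The helper name `with_extra_copies` and the value `copies=5` are purely illustrative assumptions; nothing here establishes which value of X is optimal or whether this is the recommended approach.

```python
from typing import List, Sequence


def with_extra_copies(values: Sequence[float],
                      false_positives: Sequence[float],
                      copies: int) -> List[float]:
    """Return the training values plus `copies` extra repetitions of each
    value that was wrongly flagged as an anomaly."""
    if copies < 0:
        raise ValueError("copies must be non-negative")
    augmented = list(values)
    for value in false_positives:
        augmented.extend([value] * copies)
    return augmented


array = [1, 4, 3, 15]
augmented = with_extra_copies(array, false_positives=[15], copies=5)
print(augmented)
```

The augmented list would then be reshaped into a single column and passed to the model's fit method again, exactly as in the original training run.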

model.imputer file not exported when build_imputer == True

When instantiating and fitting a model as IsolationForest(build_imputer=True).fit(), when export_model() is called the required model.imputer file is not saved.

Thus, when subsequently calling IsolationForest.import_model() the model is not loaded with the required imputing functionality.

The fix appears to be simple: the build_imputer param needs to be passed to _cpp_obj.serialize_obj(has_imputer=self.build_imputer)

To reproduce:

import numpy as np
from isotree import IsolationForest


def main():
    
    X = np.random.randn(10, 10)
    X_missing = np.random.randn(10, 10)
    X_missing[0, 0] = np.nan

    iso = IsolationForest(build_imputer=True)
    iso.fit(X)
    iso.export_model('example_model', use_cpp=True)

    assert not np.isnan(iso.transform(X_missing)).any() # properly imputes here

    iso_imported = IsolationForest.import_model('example_model', use_cpp=True)

    assert not np.isnan(iso_imported.transform(X_missing)).any() # fails bc example_model.imputer file unavailable to import


if __name__ == "__main__":
    main()
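When debugging this, it can help to check which files were actually written next to the exported model. The sketch below assumes the naming seen in these threads (a `<file>.imputer` file and a `<file>.metadata` companion); the helper name `missing_model_files` is hypothetical, and placeholder files stand in for a real export.

```python
import os
import tempfile
from typing import List


def missing_model_files(path: str, expect_imputer: bool = True,
                        expect_metadata: bool = True) -> List[str]:
    """List the expected export files that do not exist on disk."""
    expected = [path]
    if expect_metadata:
        expected.append(path + ".metadata")
    if expect_imputer:
        expected.append(path + ".imputer")
    return [f for f in expected if not os.path.isfile(f)]


# Demonstration with placeholder files: the imputer file is left out on purpose.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "example_model")
    for suffix in ("", ".metadata"):
        with open(model_path + suffix, "w") as f:
            f.write("placeholder")
    missing = missing_model_files(model_path)
    print([os.path.basename(m) for m in missing])
```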

pip install not working

Hi David,
I'm trying to reinstall the isotree library in order to use the new python functions to export the serialized model.

When I execute

pip install isotree

The following error appears:

$ pip install isotree
Collecting isotree
  Using cached https://files.pythonhosted.org/packages/8c/f4/c69cdafdc278f2c387b409a4ebdc415d1acd15f6a0d9592b3551e1c2c714/isotree-0.1.15.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-jEjgh3/isotree/setup.py", line 81, in <module>
        include_dirs=[np.get_include(), ".", "./src", cycereal.get_cereal_include_dir()],
      File "/home/bruiz/.local/lib/python2.7/site-packages/cycereal/__init__.py", line 40, in get_cereal_include_dir
        raise ValueError("Could not find header files from 'cycereal' - please try reinstalling with 'pip install --force cycereal'")
    ValueError: Could not find header files from 'cycereal' - please try reinstalling with 'pip install --force cycereal
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jEjgh3/isotree/

Already tried reinstalling cycereal (0.1.3), installed the requirements.txt file in your git code and the same error appears. Any insights?

De-serialized Tree estimators in R

Is there a way to get de-serialized tree estimators after we build the model?
To be precise, the model object in cpp_obj$serialized is not in a readable format, is there a way to look at each tree built by the algorithm? I'm looking for something similar to model.estimators_ (python - sklearn isolation forest) equivalent in R

R package build from source failing on Conda Forge

We're seeing an error on the Conda Forge builds for this package. Essentially, we download and unpack the tarball from CRAN, then run

R CMD INSTALL --build .

However, this has been resulting in a configure error since v0.5.22:

* installing to library ‘/home/conda/feedstock_root/build_artifacts/r-isotree_1701323302199/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/R/library’
* installing *source* package ‘isotree’ ...
** package ‘isotree’ successfully unpacked and MD5 sums checked
** using staged installation
configure: error: cannot find sources (isotree) in . or ..
ERROR: configuration failed for package ‘isotree’

Any idea why this would be happening? Are changes in the configure script incompatible with installing from unpacked source?

Thanks in advance for any insight!

Regular Isolation Forest fitting very slow for large max_samples

Hi David, let me first thank you for this excellent implementation of the Isolation Forest algorithm and variants.

I've discovered some strange behavior regarding the regular Isolation Forest fitting duration. Using a low value for max_samples, e.g. 256, fitting times are comparable if not faster than the scikit-learn implementation. However, increasing this number does not scale as one would expect.

For example, using a training set of 750,000 samples and a max_samples value of 256, fitting an IsoTree iForest takes 0.1s compared to 1.3s for the scikit-learn implementation. When increasing the number of max_samples to 65,536, the fitting time of the IsoTree model explodes to 119.7s, as opposed to only 2.6s for the scikit-learn implementation. The number of trees is fixed for all experiments and nthreads/n_jobs is set to -1.

Fitting an Extended Isolation Forest (ndim=2) in the exact same setting (750k training samples, max_samples=65536) for comparison takes only 2.6s.

Can you shed some light on this behavior?

I'm using the latest IsoTree version from pip (0.1.21)

Please find the benchmark code attached in this gist. My output is as follows:

# samples fit:     75,000
# samples predict: 25,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             0.0s          0.0s
[IsoTree / EIF]            0.0s          0.1s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             0.4s          0.0s
[IsoTree / EIF]            0.0s          0.1s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        0.2s          0.2s
[IsoTree / IF]             9.6s          0.1s
[IsoTree / EIF]            0.4s          0.1s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        0.4s          0.2s
[IsoTree / IF]             6.3s          0.1s
[IsoTree / EIF]            1.9s          0.3s

---

# samples fit:     750,000
# samples predict: 250,000

max_samples: 256
                       Fit time  Predict time
[Scikit-Learn / IF]        1.3s          1.5s
[IsoTree / IF]             0.1s          0.3s
[IsoTree / EIF]            0.1s          0.6s

max_samples: 2048
                       Fit time  Predict time
[Scikit-Learn / IF]        1.3s          1.7s
[IsoTree / IF]             0.6s          0.6s
[IsoTree / EIF]            0.1s          1.1s

max_samples: 16384
                       Fit time  Predict time
[Scikit-Learn / IF]        2.1s          2.2s
[IsoTree / IF]            11.3s          0.6s
[IsoTree / EIF]            0.7s          1.7s

max_samples: 65536
                       Fit time  Predict time
[Scikit-Learn / IF]        2.6s          2.6s
[IsoTree / IF]           119.7s          0.6s
[IsoTree / EIF]            2.6s          2.1s

Connection Lost

Whenever I try to train the model in R, my session is restarted without any further info:

[Info] Connection Lost
[Info] Restarting R...

Problems installing isotree in Sagemaker Studio

Silly question but any chance someone could elaborate what are the requirements to install isotree?

I'm facing some issues while compiling the code. I've tried a few things but nothing worked so thought that I might just ask at the source; being laughed at is a good trade. Logs depend on the kernel so below are logs from a fresh install on a Base Python 2.0 (Python 3.8.12).

Any suggestions are more than welcome.

> !python --version

Python 3.8.12


> !apt-get update && apt-get install -y build-essential

Hit:1 http://security.debian.org/debian-security bullseye-security InRelease
Hit:2 http://deb.debian.org/debian bullseye InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9).
0 upgraded, 0 newly installed, 0 to remove and 150 not upgraded.

> !python -m pip install --upgrade cython

Requirement already satisfied: cython in /usr/local/lib/python3.8/site-packages (3.0.0)


> !python -m pip install -U isotree==0.5.17

Collecting isotree==0.5.17
  Using cached isotree-0.5.17.tar.gz (288 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: isotree
  Building wheel for isotree (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for isotree (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [64 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/isotree
      copying isotree/__init__.py -> build/lib.linux-x86_64-cpython-38/isotree
      running build_ext
      --- Checking compiler support for option '-march=native'
      --- Checking compiler support for option '-fopenmp'
      --- Checking compiler support for '__restrict' qualifier
      --- Checking compiler support for option '-O3'
      --- Checking compiler support for option '-fno-math-errno'
      --- Checking compiler support for option '-fno-trapping-math'
      --- Checking compiler support for option '-std=c++17'
      --- Checking compiler support for option '-flto'
      Compiling isotree/cpp_interface.pyx because it changed.
      [1/1] Cythonizing isotree/cpp_interface.pyx
      building 'isotree._cpp_interface' extension
      creating build/temp.linux-x86_64-cpython-38
      creating build/temp.linux-x86_64-cpython-38/isotree
      creating build/temp.linux-x86_64-cpython-38/src
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -D_USE_XOSHIRO -D_FOR_PYTHON -DSUPPORTS_RESTRICT=1 -D_USE_ROBIN_MAP -I/tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/numpy/core/include -I. -I./src -I/usr/local/include/python3.8 -c isotree/cpp_interface.cpp -o build/temp.linux-x86_64-cpython-38/isotree/cpp_interface.o -march=native -fopenmp -O3 -fno-math-errno -fno-trapping-math -std=c++17 -flto
      In file included from /tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1830,
                       from /tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
                       from /tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                       from isotree/cpp_interface.cpp:1162:
      /tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
         17 | #warning "Using deprecated NumPy API, disable it with " \
            |  ^~~~~~~
      isotree/cpp_interface.cpp:3373:21: error: conflicting declaration of ‘void cy_warning(const char*)’ with ‘C++’ linkage
       3373 | __PYX_EXTERN_C void cy_warning(char const *); /*proto*/
            |                     ^~~~~~~~~~
      In file included from ./src/headers_joined.hpp:63,
                       from isotree/cpp_interface.cpp:1174:
      ./src/isotree.hpp:115:21: note: previous declaration with ‘C’ linkage
        115 |     extern "C" void cy_warning(const char *msg);
            |                     ^~~~~~~~~~
      /tmp/pip-build-env-q0e7ht52/overlay/lib/python3.8/site-packages/setuptools/_distutils/extension.py:134: UserWarning: Unknown Extension options: 'install_requires'
        warnings.warn(msg)
      warning: isotree/cpp_interface.pyx:223:89: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:229:76: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:235:64: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:240:88: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:254:45: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:266:45: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:273:59: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:282:33: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:291:33: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:299:47: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:307:47: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:335:94: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:346:62: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:359:119: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:366:63: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:391:77: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:400:72: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:402:132: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:404:135: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:426:139: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:432:20: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:440:20: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      warning: isotree/cpp_interface.pyx:442:0: The 'IF' statement is deprecated and will be removed in a future Cython version. Consider using runtime conditions or C macros instead. See https://github.com/cython/cython/issues/4310
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for isotree
Failed to build isotree
ERROR: Could not build wheels for isotree, which is required to install pyproject.toml-based projects

Training Isotree with Python on Windows then Deserializing with C++ on Raspberry Pi 3B+ (Linux)

I am trying to serialize the model from example one of isotree_example.ipynb on a Windows system, then deserialize it using C++ on the Raspberry Pi 3B+ with a Linux OS but I am encountering some problems.

The model was serialized using the export_model method with add_metadata_file set to false.
image

The main file being executed is isotree_demo.cpp (shown below) which simply tries to deserialize the model.

image

Before deserializing with deserialize_combined, the model is checked with inspect_serialized_object which gives the following values:
image

Running isotree_demo led to an "unexpected error" in serialize.cpp in the deserialize_model function:
image

In an attempt to debug, I printed the values being checked by the deserialize_model function:
image

I am unsure why saved_int_t and saved_size_t are being set to the PlatformSize enum value 4 (Other) when the Raspberry Pi 3B+ is 32bit.

I attempted to force the 32bit check to true as shown in the screenshot below, but this caused a segmentation fault.
image

Any suggestions would be appreciated.
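Since the question revolves around the int and size_t sizes recorded at serialization time versus those on the target device, it may help to print the same basic facts on both machines. The sketch below uses only the Python standard library; it reports what the running Python interpreter sees, which is assumed (not guaranteed) to match what the C++ compiler on the same machine uses. The function name is hypothetical.

```python
import ctypes
import struct
import sys


def platform_summary() -> dict:
    """Collect basic type sizes and the byte order of the running interpreter."""
    return {
        "sizeof_int": ctypes.sizeof(ctypes.c_int),
        "sizeof_size_t": ctypes.sizeof(ctypes.c_size_t),
        "pointer_bits": struct.calcsize("P") * 8,
        "byteorder": sys.byteorder,
    }


summary = platform_summary()
print(summary)
```

Running this on the Windows machine that exported the model and on the Raspberry Pi shows directly whether the two sides disagree on size_t width or byte order.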

memory issues with build_imputer=True

I use IsolationForest for imputation in a for loop, with a sliding window for time series data. Here is a small example code:

for quarter_start in quarter_starts:
  #some code here
  imputer = IsolationForest(build_imputer=True, min_imp_obs=1, max_depth=None, min_gain=0.25, sample_size=0.5, 
                                  ntrees=100, ndim=2, prob_pick_pooled_gain=1, ntry=10)
  imputer.fit(subset_train[subset_train.columns[2:]])
  subset_imputed = imputer.transform(subset_test[subset_test.columns[2:]])
  #some more code here
  gc.collect()

My problem is that, although I overwrite the imputer object in each iteration of the for loop, memory usage adds up with every iteration. Outside the for loop,

del imputer
gc.collect()

also does not free any memory. I am talking about 400 GB of memory used for one iteration, so I cannot think of any source of this usage other than the imputer object.
Is there any other way to make sure memory is released?
Python version: 3.9.16
OS: Ubuntu 22.04.2 LTS
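To confirm that memory really grows from one iteration to the next, the peak resident set size of the process can be logged inside the loop. This is a minimal sketch for Unix-like systems using the standard `resource` module; the list allocation is only a stand-in for the fit and transform calls, and `peak_rss_mib` is a hypothetical helper name.

```python
import gc
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


readings = []
for step in range(3):
    block = [0.0] * 1_000_000  # stand-in for fitting and transforming
    del block
    gc.collect()
    readings.append(peak_rss_mib())
    print(f"iteration {step}: peak RSS {readings[-1]:.1f} MiB")
```

Because this value is a high-water mark, it never decreases. If it keeps climbing by roughly the same amount in every iteration even though each iteration does the same work, that suggests memory from earlier iterations is not being reused.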

Isotree==0.5.22 is failing on databricks cluster

Hi,

I am trying to install isotree 0.5.22 on databricks cluster (12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12)), however it is giving issues while installing it (even after upgrading pip to latest version: 23.2.1, similar error appears)

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      void tmat_to_dense(double *tmat, double *dmat, size_t n, double fill_diag)
 
      void merge_models(IsoForest*     model,      IsoForest*     other,
                        ExtIsoForest*  ext_model,  ExtIsoForest*  ext_other,
                        Imputer*       imputer,    Imputer*       iother,
                        TreesIndexer*  indexer,    TreesIndexer*  ind_other) except + nogil
                                                                                          ^
  ------------------------------------------------------------
 
  isotree/cpp_interface.pyx:223:89: undeclared name not builtin: nogil
 
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
 
          cdef int ret_val = 0
 
          with nogil, boundscheck(False), nonecheck(False), wraparound(False):
              ret_val =
              fit_iforest(model_ptr, ext_model_ptr,
                        ^
  ------------------------------------------------------------
 
  isotree/cpp_interface.pyx:823:23: Calling gil-requiring function not allowed without gil
 

Complete error message:
image

Difficulty installing Isotree for R on Linux

I am trying to install isotree for R on Linux but I am getting the following error:

In file included from Rwrapper.cpp:75:0:
isotree.h:224:12: error: ‘isinf’ is already declared in this scope
using std::isinf;
^
isotree.h:225:12: error: ‘isnan’ is already declared in this scope
using std::isnan;
^
make: *** [Rwrapper.o] Error 1
ERROR: compilation failed for package ‘isotree’

  • removing ‘/home/myname/R/x86_64-pc-linux-gnu-library/4.0/isotree’
    Warning in install.packages :
    installation of package ‘isotree’ had non-zero exit status

The downloaded source packages are in
‘/tmp/RtmpGs2W4l/downloaded_packages’

I'm out of my depth with this error and don't know how to troubleshoot it. I'd be very grateful if you had any suggestions.

Problem with saving trained isolation forest when `categ_cols` is not None.

Hello David,

I am facing another issue when trying to save a trained isolation forest with categ_cols not None. When I do not provide any categorical column numbers (when it's None), the model is saved, but when I do provide them, I get this error:

  File "/scratch/vsahil/data-drift-explanation/GOAD/goad-pyenv/lib/python3.6/site-packages/isotree/__init__.py", line 2132, in export_model
    json.dump(metadata, of, indent=4)
  File "/usr/lib64/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib64/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'int32' is not JSON serializable


Do you have any clue why this is happening and how can I circumvent this problem? I verified that all the values in the columns marked as categorical are integers with values starting at 0 (I am passing a numpy array in the `fit` function). 
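The traceback shows the standard json encoder rejecting a NumPy int32 value. A generic way around this kind of error, sketched below, is to give the encoder a fallback that unwraps such scalars. `FakeInt32` is a hypothetical stand-in for a NumPy scalar so that the example runs without NumPy, and `to_builtin` is likewise an illustrative name; this is not a description of how isotree handles it.

```python
import json


class FakeInt32:
    """Hypothetical stand-in for a NumPy integer scalar such as numpy.int32."""

    def __init__(self, value: int) -> None:
        self._value = value

    def item(self) -> int:
        # NumPy scalars expose .item() to obtain the equivalent built-in value.
        return self._value


def to_builtin(obj):
    """Fallback for json.dump(s): unwrap objects that expose .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


metadata = {"categ_cols": [FakeInt32(0), FakeInt32(2)]}
encoded = json.dumps(metadata, default=to_builtin)
print(encoded)
```

On the user side, a possible workaround (an assumption, not verified against the library) would be to pass categ_cols as a plain list of built-in ints, for example `[int(c) for c in cols]`.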

pip install issue: cc1plus no such file or directory

Issue

I cannot pip install isotree. Is there a way to fix it without sudo? Is there a pre-built wheel anywhere? Thanks!

Command and output

pip install isotree --no-cache-dir
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: ...
Collecting isotree
  Downloading 
...tar.gz (288 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.1/288.1 kB 15.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: isotree
  Building wheel for isotree (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for isotree (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      /tmp/pip-build-env-v6v2li_1/overlay/lib/python3.9/site-packages/setuptools/_distutils/extension.py:134: UserWarning: Unknown Extension options: 'install_requires'
        warnings.warn(msg)
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-39
      creating build/lib.linux-x86_64-cpython-39/isotree
      copying isotree/__init__.py -> build/lib.linux-x86_64-cpython-39/isotree
      running build_ext
      g++: error: unrecognized command line option ‘-std=c++17’
      g++: error: unrecognized command line option ‘-std=gnu++14’
      --- Checking compiler support for option '-fopenmp'
      --- Checking compiler support for '__restrict' qualifier
      --- Checking compiler support for option '-O3'
      --- Checking compiler support for option '-fno-math-errno'
      --- Checking compiler support for option '-fno-trapping-math'
      --- Checking compiler support for option '-std=c++17'
      --- Checking compiler support for option '-std=gnu++14'
      --- Checking compiler support for option '-std=c++11'
      --- Checking compiler support for option '-flto'
      cythoning isotree/cpp_interface.pyx to isotree/cpp_interface.cpp
      building 'isotree._cpp_interface' extension
      creating build/temp.linux-x86_64-cpython-39
      creating build/temp.linux-x86_64-cpython-39/isotree
      creating build/temp.linux-x86_64-cpython-39/src
      gcc -pthread -B /opt/deep_learning/conda/envs/my_env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/deep_learning/conda/envs/my_env/include -fPIC -O2 -isystem /opt/deep_learning/conda/envs/my_env/include -march=x86-64 -fPIC -D_USE_XOSHIRO -D_FOR_PYTHON -DSUPPORTS_RESTRICT=1 -D_USE_ROBIN_MAP -I/tmp/pip-build-env-v6v2li_1/overlay/lib/python3.9/site-packages/numpy/core/include -I. -I./src -I/opt/deep_learning/conda/envs/my_env/include/python3.9 -c isotree/cpp_interface.cpp -o build/temp.linux-x86_64-cpython-39/isotree/cpp_interface.o -fopenmp -O3 -fno-math-errno -fno-trapping-math -std=c++11 -flto
      gcc: error trying to exec 'cc1plus': execvp: No such file or directory
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for isotree
Failed to build isotree
ERROR: Could not build wheels for isotree, which is required to install pyproject.toml-based projects

System information

Ubuntu 2018.03 on a managed cloud computing service within a docker container. I don't have the ability to sudo install stuff and there is a reluctance to change the docker image.
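The line "error trying to exec 'cc1plus'" in the log means that gcc could not find its C++ compiler proper. Whether it is installed can be checked without root access by asking gcc where it expects the program to be. The sketch below is a hedged diagnostic only; `find_cc1plus` is a hypothetical helper name.

```python
import shutil
import subprocess
from typing import Optional


def find_cc1plus(gcc: str = "gcc") -> Optional[str]:
    """Return the path to cc1plus as seen by gcc, or None if unavailable."""
    if shutil.which(gcc) is None:
        return None
    result = subprocess.run([gcc, "-print-prog-name=cc1plus"],
                            capture_output=True, text=True, check=False)
    path = result.stdout.strip()
    # When the program is not installed, gcc echoes back the bare name.
    if result.returncode != 0 or not path or path == "cc1plus":
        return None
    return path


print(find_cc1plus())
```

A result of None on the affected machine would be consistent with the C++ part of the compiler toolchain being absent from the docker image.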

problem saving (exporting) model with imputer

Hi. I upgraded to the latest version (Successfully installed isotree-0.1.31) so the imputer model can get saved alongside the main model. I am getting the following error:

iso.export_model(model_save_folder + preprocess_missing_model_file_name, use_cpp=True)


RuntimeError Traceback (most recent call last)
in
5 res = _files_fct.save_object_to_pkl(missing_imputer, model_save_folder + preprocess_missing_model_file_name)
6 else:
----> 7 iso.export_model(model_save_folder + preprocess_missing_model_file_name, use_cpp=True)
8 del iso
9 gc.collect()

/anaconda/envs/isotree_2_missing/lib/python3.8/site-packages/isotree/__init__.py in export_model(self, file, use_cpp)
1458 with open(file + ".metadata", "w") as of:
1459 json.dump(metadata, of, indent=4)
-> 1460 self._cpp_obj.serialize_obj(file, use_cpp, self.ndim > 1, has_imputer=self.build_imputer)
1461 return self
1462

isotree/cpp_interface.pyx in isotree._cpp_interface.isoforest_cpp_obj.serialize_obj()

RuntimeError: Failed to write 3064 bytes to output stream! Wrote 864

Any idea? It's a large model - could this be the problem?

Thank you

isolation.forest() is not reproducible whenever `nthreads > 1`

Hi @david-cortes, thanks for a great package. I'm writing a book on tree-based methods and am including a section on isolation forests using your package (which works really well). I've noticed, however, that the anomaly scores are not reproducible (at least for me) when specifying the seed via set.seed() or the random_seed argument. Reproducible example below:

library(isotree)


# Generate fake data (no anomalies)
set.seed(101)
X <- as.data.frame(matrix(rnorm(5 * 100), ncol = 5))

# Fit an isolation forest
ifo <- isolation.forest(X, random_seed = 102)

# Compute anomaly scores
head(scores <- predict(ifo, newdata = X))
# [1] 0.4002608 0.4996714 0.5253563 0.4303659 0.4204118 0.4323855

#
# Run again, but notice different scores with same seed
#

# Generate fake data (no anomalies)
set.seed(101)
X <- as.data.frame(matrix(rnorm(5 * 100), ncol = 5))

# Fit an isolation forest
ifo <- isolation.forest(X, random_seed = 102)

# Compute anomaly scores
head(scores <- predict(ifo, newdata = X))
# [1] 0.3950409 0.4929140 0.5302152 0.4239435 0.4225947 0.4325836

Is this a bug, or am I missing something?
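As general background on why a fixed seed and multiple threads can interact badly: results stay reproducible only if the random draws used for each unit of work do not depend on thread scheduling. The sketch below illustrates one generic technique, giving every tree its own seed derived from the base seed. It is an illustration in Python with stand-in functions (`grow_tree`, `fit_forest` are hypothetical) and makes no claim about how isotree seeds its threads internally.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def grow_tree(seed: int) -> float:
    """Stand-in for building one tree: draws from its own private generator."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))


def fit_forest(base_seed: int, ntrees: int, nthreads: int) -> List[float]:
    """Give tree i the seed base_seed + i, regardless of which thread runs it."""
    seeds = [base_seed + i for i in range(ntrees)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # map() returns results in input order, not completion order.
        return list(pool.map(grow_tree, seeds))


single = fit_forest(base_seed=102, ntrees=20, nthreads=1)
multi = fit_forest(base_seed=102, ntrees=20, nthreads=4)
print(single == multi)
```

With per-tree seeding, the single-threaded and multi-threaded runs produce identical values. A design in which threads share one generator, or in which seeds are assigned per thread rather than per tree, would not have this property.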

Not able to run a loop in parallel using joblib

Thank you for creating such a fantastic repository which is so easily accessible and with such great documentation. I have a question. I have a for loop in which I do predictions using the trained isolation forest. When I was using a different anomaly detection approach, I was able to run that loop in parallel using joblib, but when I switched to using isotree, the parallelization doesn't happen. Whenever n_jobs > 1, the script just hangs and does nothing; it only runs when n_jobs=1.

Any clue why this is happening and how can we parallelize that loop when using isotree?

Model features

Hi David,

Is there a way to tell which features are being chosen for each node?

Thanks,
Tara

Valgrind warning in CRAN checks

The automatic checks from CRAN are detecting an issue when running the examples with Valgrind:
https://www.stats.ox.ac.uk/pub/bdr/memtests/valgrind/isotree/isotree-Ex.Rout

The issue happens when calling R’s saveRDS, and the complaint is about un-initialized bytes that were allocated in the model object, which at that point only lives as a C++ object accessed through the external pointer system (Rcpp::XPtr). Oddly, no warning is given when the object is used all throughout the model fitting function.

Tried posting a message in the package development mailing lists, but so far no hints of what could be wrong:
https://stat.ethz.ch/pipermail/r-package-devel/2020q3/005721.html

The CRAN maintainers would like to see this issue fixed before uploading a new version (the same warning comes in when trying version 0.1.16), but I have no clue what went wrong there or how to fix it. I am not able to reproduce the Valgrind warning on my local machine either.

Any help is appreciated.

Reproducibility problems with Extended Isolation Forest

Hi @david-cortes, thank you for this great package.

I'm currently using isotree to fit an extended isolation forest model. My issue is the following: I created, fitted and tested for anomaly detection an instance of IsolationForest with: (ndim=2, max_samples=int(len(data)/20), ntrees=500, ntry=1, random_seed=0, max_depth=12, missing_action="fail", coefs="normal", standardize_data=True, penalize_range=True, n_threads=2, bootstrap=False, prob_pick_pool_gain=1)

After this, I implemented the same model with the same hyperparameters in another script of mine. However, when I look at the scores after fitting this model to the same data as the previous one, I find different values. They are very close to the ones obtained previously, but still different. I wonder whether there is a source of randomness that I didn't control through my parameters (I thought fixing the random seed would suffice) or whether this is a real issue. Many thanks in advance for your assistance.
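
One possible source of such small differences is the order in which partial results are combined when more than one thread is used (the post sets `n_threads=2`), since that order can vary between runs and floating-point addition is not associative. This is offered as an assumption rather than a diagnosis. The snippet below is a generic illustration of the effect and does not use isotree:

```python
# Adding the same three numbers in two different groupings.
vals = [0.1, 0.2, 0.3]

left_to_right = (vals[0] + vals[1]) + vals[2]
right_to_left = vals[0] + (vals[1] + vals[2])

# The two sums agree only up to rounding error.
difference = abs(left_to_right - right_to_left)
```

If this is the cause, refitting both scripts with a single thread should produce identical scores, which is a quick way to check.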
