isolation-forest

Introduction

This is a Scala/Spark implementation of the Isolation Forest unsupervised outlier detection algorithm. This library was created by James Verbus from the LinkedIn Anti-Abuse AI team.

This library supports distributed training and scoring using Spark data structures. It inherits from the Estimator and Model classes in Spark's ML library in order to take advantage of machinery such as Pipelines. Model persistence on HDFS is supported.
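Because IsolationForest extends Spark ML's Estimator, it can be used directly as a Pipeline stage. Below is a minimal sketch (not taken from the library's documentation); the input columns x1 and x2 and the DataFrame df are hypothetical.

import com.linkedin.relevance.isolationforest.IsolationForest
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// Chain feature assembly and the isolation forest as stages of one pipeline.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val isolationForest = new IsolationForest()
  .setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, isolationForest))

// Fitting yields a PipelineModel whose transform() appends the score and
// prediction columns to the input DataFrame.
// val pipelineModel = pipeline.fit(df)
// val scored = pipelineModel.transform(df)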

Copyright

Copyright 2019 LinkedIn Corporation. All Rights Reserved.

Licensed under the BSD 2-Clause License (the "License"). See License in the project root for license information.

How to use

Building the library

To build using the default of Scala 2.11.8 and Spark 2.3.0, run the following:

./gradlew build

This will produce a jar file in the ./isolation-forest/build/libs/ directory.

If you want to use the library with other Spark and Scala versions, you can specify them when running the build command:

./gradlew build -PsparkVersion=3.4.1 -PscalaVersion=2.13.12

To force a rebuild of the library, you can use:

./gradlew clean build --no-build-cache

Add an isolation-forest dependency to your project

Please check Maven Central for the latest artifact versions.

Gradle example

The artifacts are available in Maven Central, so you can specify the Maven Central repository in the top-level build.gradle file.

repositories {
    mavenCentral()
}

Add the isolation-forest dependency to the module-level build.gradle file. Here is an example for a recent Spark/Scala version combination.

dependencies {
    implementation 'com.linkedin.isolation-forest:isolation-forest_3.2.0_2.13:3.0.1'
}
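If you build with sbt instead, a sketch of the equivalent dependency declaration is below. The Spark and Scala versions are baked into the artifact name, so a plain % (not %%) is used; check Maven Central for the current coordinates.

// build.sbt (hypothetical sbt equivalent of the Gradle example above)
libraryDependencies += "com.linkedin.isolation-forest" % "isolation-forest_3.2.0_2.13" % "3.0.1"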

Maven example

If you are using the Maven Central repository, declare the isolation-forest dependency in your project's pom.xml file. Here is an example for a recent Spark/Scala version combination.

<dependency>
  <groupId>com.linkedin.isolation-forest</groupId>
  <artifactId>isolation-forest_3.2.0_2.13</artifactId>
  <version>3.0.1</version>
</dependency>

Model parameters

| Parameter | Default Value | Description |
| --- | --- | --- |
| numEstimators | 100 | The number of trees in the ensemble. |
| maxSamples | 256 | The number of samples used to train each tree. If this value is between 0.0 and 1.0, it is treated as a fraction; if it is >1.0, it is treated as a count. |
| contamination | 0.0 | The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter. |
| contaminationError | 0.0 | The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default of 0.0 forces an exact calculation of the threshold, which is slow and can fail for large datasets. If the exact calculation is problematic, a good choice for this parameter is often 1% of the specified contamination value. |
| maxFeatures | 1.0 | The number of features used to train each tree. If this value is between 0.0 and 1.0, it is treated as a fraction; if it is >1.0, it is treated as a count. |
| bootstrap | false | If true, draw samples for each tree with replacement; if false, sample without replacement. |
| randomSeed | 1 | The seed used for the random number generator. |
| featuresCol | "features" | The feature vector column. This column must exist in the input DataFrame for training and scoring. |
| predictionCol | "predictedLabel" | The predicted label column. This column is appended to the input DataFrame upon scoring. |
| scoreCol | "outlierScore" | The outlier score column. This column is appended to the input DataFrame upon scoring. |

Training and scoring

Here is an example demonstrating how to import the library, create a new IsolationForest instance, set the model hyperparameters, train the model, and then score the training data. data is a Spark DataFrame with a column named features that contains an org.apache.spark.ml.linalg.Vector of the attributes to use for training. In this example, the DataFrame data also has a label column; it is not used in the training process, but could be useful for model evaluation.

import com.linkedin.relevance.isolationforest._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

/**
  * Load and prepare data
  */

// Dataset from http://odds.cs.stonybrook.edu/shuttle-dataset/
val rawData = spark.read
  .format("csv")
  .option("comment", "#")
  .option("header", "false")
  .option("inferSchema", "true")
  .load("isolation-forest/src/test/resources/shuttle.csv")

val cols = rawData.columns
val labelCol = cols.last
 
val assembler = new VectorAssembler()
  .setInputCols(cols.slice(0, cols.length - 1))
  .setOutputCol("features")
val data = assembler
  .transform(rawData)
  .select(col("features"), col(labelCol).as("label"))

// scala> data.printSchema
// root
//  |-- features: vector (nullable = true)
//  |-- label: integer (nullable = true)

/**
  * Train the model
  */

val contamination = 0.1
val isolationForest = new IsolationForest()
  .setNumEstimators(100)
  .setBootstrap(false)
  .setMaxSamples(256)
  .setMaxFeatures(1.0)
  .setFeaturesCol("features")
  .setPredictionCol("predictedLabel")
  .setScoreCol("outlierScore")
  .setContamination(contamination)
  .setContaminationError(0.01 * contamination)
  .setRandomSeed(1)

val isolationForestModel = isolationForest.fit(data)
 
/**
  * Score the training data
  */

val dataWithScores = isolationForestModel.transform(data)

// scala> dataWithScores.printSchema
// root
//  |-- features: vector (nullable = true)
//  |-- label: integer (nullable = true)
//  |-- outlierScore: double (nullable = false)
//  |-- predictedLabel: double (nullable = false)

The output DataFrame, dataWithScores, is identical to the input data DataFrame, but with two additional result columns appended; their names are set via the model parameters, in this case predictedLabel and outlierScore.
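With the scores in hand, you can filter for the flagged records. A minimal sketch, assuming the convention that predicted outliers are labeled 1.0 when contamination > 0.0:

// Count the rows predicted to be outliers.
val outliers = dataWithScores.filter(col("predictedLabel") === 1.0)
println(s"Flagged ${outliers.count()} of ${dataWithScores.count()} rows as outliers")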

Saving and loading a trained model

Once you've trained an isolationForestModel instance as per the instructions above, you can use the following commands to save the model to HDFS and reload it as needed.

val path = "/user/testuser/isolationForestWriteTest"

/**
  * Persist the trained model on disk
  */

// You can ensure you don't overwrite an existing model by removing .overwrite from this command
isolationForestModel.write.overwrite.save(path)

/**
  * Load the saved model from disk
  */

val isolationForestModel2 = IsolationForestModel.load(path)
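The reloaded model can be used for scoring exactly like the original instance; a quick sanity check is to confirm that it reproduces the original scores.

// Scores from the reloaded model should match those of the original model.
val dataWithScores2 = isolationForestModel2.transform(data)
dataWithScores2.show(5)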

Validation

The original 2008 "Isolation forest" paper by Liu et al. published the AUROC results obtained by applying the algorithm to 12 benchmark outlier detection datasets. We applied our implementation of the isolation forest algorithm to the same 12 datasets, using the same model parameter values as the original paper. We used 10 trials per dataset, each with a unique random seed, and averaged the results. The quoted uncertainty is the one-sigma error on the mean.

| Dataset | Expected mean AUROC (from Liu et al.) | Observed mean AUROC (from this implementation) |
| --- | --- | --- |
| Http (KDDCUP99) | 1.00 | 0.99973 ± 0.00007 |
| ForestCover | 0.88 | 0.903 ± 0.005 |
| Mulcross | 0.97 | 0.9926 ± 0.0006 |
| Smtp (KDDCUP99) | 0.88 | 0.907 ± 0.001 |
| Shuttle | 1.00 | 0.9974 ± 0.0014 |
| Mammography | 0.86 | 0.8636 ± 0.0015 |
| Annthyroid | 0.82 | 0.815 ± 0.006 |
| Satellite | 0.71 | 0.709 ± 0.004 |
| Pima | 0.67 | 0.651 ± 0.003 |
| Breastw | 0.99 | 0.9862 ± 0.0003 |
| Arrhythmia | 0.80 | 0.804 ± 0.002 |
| Ionosphere | 0.85 | 0.8481 ± 0.0002 |

Our implementation produces AUROC values that are in very good agreement with the results in the original Liu et al. publication. The few small discrepancies are likely due to the limited precision of the AUROC values reported in Liu et al.
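For reference, an AUROC measurement of this kind can be computed with Spark's built-in evaluator. The sketch below is not the script used to produce the table above; it assumes the dataWithScores DataFrame from the earlier example and a binary label column in which 1 marks a known outlier.

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Treat the continuous outlier score as the raw prediction and compute the
// area under the ROC curve against the ground-truth binary label.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("outlierScore")
  .setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(dataWithScores)
println(s"AUROC: $auroc")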

Contributions

If you would like to contribute to this project, please review the instructions here.

References

  • F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
  • F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation-based anomaly detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 6, no. 1, p. 3, 2012.
  • Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.

isolation-forest's People

Contributors

eisber, jverbus, shipkit-org


isolation-forest's Issues

Feature request: warm_start

Hi,
I was just looking for possibilities to further train a previous model with additional/new data. This is quite relevant in the big data field, as it would otherwise be necessary to keep a large amount of data in order to retrain the model from scratch every time.

In the following article such a possibility is given with isolation forest using sklearn:
https://medium.com/grabngoinfo/isolation-forest-for-anomaly-detection-cd7871ae99b4

Are you planning to implement something similar for your Spark implementation of isolation forest?

Kind regards and thank you for this great library!

Issue writing in synapse spark 3.2

I'm using Azure Synapse, and nothing I'm doing is allowing me to write models. I've explicitly included spark-avro in my pom file and loaded the spark-avro package into the Spark pool workspace.

    <properties>
        <spark.version>3.2.0</spark.version>
        <scala.version.major>2.12</scala.version.major>
        <scala.version.minor>15</scala.version.minor>
    </properties>
    <dependencies>
        <dependency>
            <groupId>com.linkedin.isolation-forest</groupId>
            <artifactId>isolation-forest_${spark.version}_${scala.version.major}</artifactId>
            <version>3.0.3</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version.major}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version.major}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version.major}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_${scala.version.major}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>com.microsoft.azure.synapse</groupId>
            <artifactId>synapseutils_${scala.version.major}</artifactId>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.jmockit</groupId>
            <artifactId>jmockit</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_${scala.version.major}</artifactId>
        </dependency>
    </dependencies>
2024-01-30 01:31:47,163 INFO ApplicationMaster [shutdown-hook-0]: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.sql.AnalysisException:  Failed to find data source: com.databricks.spark.avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".        
	at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1028)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
	at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:876)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:275)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImplHelper(IsolationForestModelReadWrite.scala:262)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:241)
	at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)

The library gives an error while writing a model using Spark 2.4

First of all, thanks for making the Isolation Forest library open source. We would like to use this library with Spark 2.4.0, but when we run it in a Spark 2.4 job, it gives a json4s-related error while writing the model to HDFS:

java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;

We understand the breaking change is due to Spark 2.4.0, which uses json4s version 3.5.3, while your library targets Spark 2.3, which uses json4s version 3.2.11.

We tried building the Isolation Forest library against Spark 2.4, but the build fails. We understand the Scala code needs to be updated; can you help us make the library compatible with Spark 2.4.0?

Multiple Rows as One Data Point

Hello,
I have a general question about the Isolation Forest algorithm.
My dataframe looks like this:

|       | Metric_1 | Metric_2 |
| ----- | -------- | -------- |
| Row_1 |          |          |
| Row_2 |          |          |

Is it possible for Isolation Forest to treat multiple rows as one data instance? That is, if Isolation Forest identifies an anomaly, the anomaly should refer to multiple rows, for example Row_1 and Row_2.
Currently Isolation Forest flags a single row as an anomaly, but in my data set multiple rows need to be seen as a collective, so only groups of rows can be anomalous. Do you know if there is a solution for this with Isolation Forest or another algorithm?
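One common workaround (a sketch, not an answer from the maintainers) is to aggregate each group of related rows into a single row of features before assembling the feature vector, so that the isolation forest scores groups rather than individual rows. The df DataFrame, group_id column, and aggregations below are hypothetical.

import org.apache.spark.sql.functions.{avg, max}

// Collapse each group of related rows into one row of aggregate features.
val grouped = df
  .groupBy("group_id")
  .agg(
    avg("Metric_1").as("metric_1_avg"),
    max("Metric_2").as("metric_2_max"))

// Assemble metric_1_avg and metric_2_max with a VectorAssembler and train
// the isolation forest on the grouped DataFrame as usual.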

Publish artifact for spark 3.0.0

Hello,

Would it be possible for you to publish the artifacts compiled against spark 3?

I've built it locally and it seems to build without any errors using this command line:
./gradlew test -PsparkVersion=3.0.0 -PscalaVersion=2.12.11

Although I haven't tested if it works in practice.

Publish for Scala 2.13

Hi,

Thanks for this library.
Since it's compatible with Spark 3.2, and that Spark version supports Scala 2.13, it would be nice to publish the library for this Scala version.

Wrong count of anomalies without respecting contamination

I expected the anomaly count and outlierScore to change according to the contamination provided. But when I modify the contamination value, at times it gives an anomaly count of 1, as the outlier score threshold is calculated incorrectly (it equals the top score).
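Per the contaminationError entry in the parameter table above, the default exact threshold calculation (contaminationError = 0.0) is slow and can fail for large datasets; a tolerance of about 1% of the contamination value is often a good choice. A sketch of that configuration, with a hypothetical contamination value:

import com.linkedin.relevance.isolationforest.IsolationForest

// Allow a small tolerance in the threshold calculation instead of forcing an
// exact (and potentially failing) computation.
val contamination = 0.05
val isolationForest = new IsolationForest()
  .setContamination(contamination)
  .setContaminationError(0.01 * contamination)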

Issue saving the model

  1. I've built the library using: ./gradlew clean build -x test -PsparkVersion=2.4.3 -PscalaVersion=2.11.12
  2. I have spark instantiated via spark-2.4.3-hadoop2.6/sbin/start-all.sh
  3. I have the following code that I'm building using sbt (example adapted from README)
import com.linkedin.relevance.isolationforest._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import org.apache.spark.SparkConf
import org.apache.spark.sql.{ SparkSession, DataFrame, Row }

object Main {
  def main(args: Array[String]): Unit = {
    // val conf = new SparkConf().setAppName("IsolShap").setMaster("spark://localhost:4040")
    val spark = SparkSession.builder
      .master("spark://ai-monster:7077")
      .appName("Isol")
      .config("spark.jars", "lib/isolation-forest_2.4.3_2.11-2.0.6.jar")
      .getOrCreate()

    val rawData = spark.read
      .format("csv")
      .option("comment", "#")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("resources/shuttle.csv")

    val cols = rawData.columns
    val labelCol = cols.last

    val assembler = new VectorAssembler()
      .setInputCols(cols.slice(0, cols.length - 1))
      .setOutputCol("features")
    val data = assembler
      .transform(rawData)
      .select(col("features"), col(labelCol).as("label"))

    val contamination = 0.1
    val isolationForest = new IsolationForest()
      .setNumEstimators(100)
      .setBootstrap(false)
      .setMaxSamples(256)
      .setMaxFeatures(1.0)
      .setFeaturesCol("features")
      .setPredictionCol("predictedLabel")
      .setScoreCol("outlierScore")
      .setContamination(contamination)
      .setContaminationError(0.01 * contamination)
      .setRandomSeed(1)

    val isolationForestModel = isolationForest.fit(data)

    // Score the training data
    val dataWithScores = isolationForestModel.transform(data)

    isolationForestModel.save("/tmp/mymodel") // <==== ERROR here

    dataWithScores.take(5).foreach(println)

    dataWithScores.printSchema()
  }
}

The above run fails at the model save stage with the following error:

21/06/03 18:57:27 INFO IsolationForestModelReadWrite$IsolationForestModelWriter: Saving IsolationForestModel tree data to path /tmp/mymodel/data
[error] (run-main-0) java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
[error] java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
[error]         at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImplHelper(IsolationForestModelReadWrite.scala:263)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:242)
[error]         at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
[error]         at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:306)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModel.save(IsolationForestModel.scala:20)
[error]         at Main$.main(Main.scala:63)
[error]         at Main.main(Main.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error] Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
[error]         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
[error] stack trace is suppressed; run last Compile / bgRun for the full output
21/06/03 18:57:27 ERROR Utils: uncaught error in thread spark-listener-group-appStatus, stopping SparkContext

Now, since spark-avro has already been compiled into the LinkedIn jar, I shouldn't have to include it again. But I tried multiple things anyway:

  1. Converted the gradle dependencies from compile to implementation, but that didn't help.
  2. Added spark-avro to the dependency list of sbt while compiling the above code. Then I get:
[error] (run-main-0) org.apache.spark.SparkException: Job aborted.
[error] org.apache.spark.SparkException: Job aborted.
[error]         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
[error]         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
[error]         at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
[error]         at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
[error]         at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
[error]         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
[error]         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[error]         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
[error]         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error]         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[error]         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
[error]         at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
[error]         at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
[error]         at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[error]         at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[error]         at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
[error]         at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
[error]         at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImplHelper(IsolationForestModelReadWrite.scala:263)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:242)
[error]         at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
[error]         at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:306)
[error]         at com.linkedin.relevance.isolationforest.IsolationForestModel.save(IsolationForestModel.scala:20)
[error]         at Main$.main(Main.scala:63)
[error]         at Main.main(Main.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 187, 10.40.14.5, executor 0): java.lang.ClassNotFoundException: org.apache.spark.sql.avro.AvroOutputWriterFactory
[error]         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[error]         at java.lang.Class.forName0(Native Method)
[error]         at java.lang.Class.forName(Class.java:348)
[error]         at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
[error]         at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
[error]         at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
[error]         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
[error]         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
[error]         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
[error]         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
[error]         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
[error]         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
[error]         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
...

I'm using the top-of-tree source. Am I missing something here, or do I need to do something specific? Is there a bug in the packaging script of the library?

Note that the above code works without the save invocation.
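One workaround sometimes used for this class of error (a sketch, not an official resolution of this issue) is to let Spark fetch the spark-avro package matching the cluster's Spark/Scala version at session startup, so that AvroFileFormat is on both the driver and executor classpaths:

// Hypothetical fix: pull in the spark-avro artifact matching the Spark 2.4.3 /
// Scala 2.11 setup described above via spark.jars.packages.
val spark = SparkSession.builder
  .master("spark://ai-monster:7077")
  .appName("Isol")
  .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.3")
  .getOrCreate()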

Facing issues with the json4s package while saving a model; also unable to create a fat jar due to version conflicts between libraries.

I downloaded the iForest jar from: https://dl.bintray.com/linkedin/maven/com/linkedin/isolation-forest/isolation-forest_2.11/0.2.2/
But while trying to save a model, I'm getting the error below. I tried multiple versions of the json4s modules, but still was not able to fix the error.
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;

Apart from this, I tried to compile iForest using the mentioned steps, but that didn't work and threw version conflict errors.
Is it possible for you to share detailed info on the configuration used to compile the package and the versions of the modules being used?

Unable to save and load model

I am using Scala 2.11.8 and Spark 2.4.0.

Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter$$anonfun$8.apply(IsolationForestModelReadWrite.scala:302)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter$$anonfun$8.apply(IsolationForestModelReadWrite.scala:301)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.getMetadataToSave(IsolationForestModelReadWrite.scala:301)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveMetadata(IsolationForestModelReadWrite.scala:280)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImplHelper(IsolationForestModelReadWrite.scala:253)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:241)

InvalidClassException

Hello,
I used your build configuration and successfully built a jar file (isolation-forest_2.11-0.3.1) using gradlew. However, when I use the newly generated jar in my project, it gives me an error while fitting data.
Error detail:
"Caused by: java.io.InvalidClassException: com.linkedin.relevance.isolationforest.IsolationForest; local class incompatible: stream classdesc serialVersionUID = 5883725353499012901, local class serialVersionUID = 6413710209040362293"

I built the source code on:

  • Virtual box Ubuntu Linux
  • Scala 2.11.11
  • Spark 2.4.4

My build.gradle file is the following:

plugins {
  // Apply the scala plugin to add support for Scala
  id 'scala'
}

dependencies {
  compile("com.chuusai:shapeless_2.11:2.3.2")
  // compile("com.databricks:spark-avro_2.11:4.0.0")
  compile("org.apache.spark:spark-avro_2.11:2.4.0")
  compile("org.apache.spark:spark-core_2.11:2.4.0")
  compile("org.apache.spark:spark-mllib_2.11:2.4.0")
  compile("org.apache.spark:spark-sql_2.11:2.4.0")
  compile("org.scalatest:scalatest_2.11:2.2.6")
  compile("org.testng:testng:6.8.8")
}

test {
  useTestNG()
}

archivesBaseName = "${project.name}_2.11"

Can you please help me solve this issue?

P.S. I followed exactly the same steps to build release v0.2.2 (isolation-forest_2.11-0.2.3) and it works perfectly. Only v0.3.0 (isolation-forest_2.11-0.3.1) has the above problem.

Thanks in advance

Load Model Error: java.lang.UnsupportedOperationException: empty collection

Hi Team,

I compiled the project for Spark 2.4 following the instructions on the homepage. It built successfully and passed all tests, but I still get the error below when I try to load a trained model (saving the model works). Can anyone help?

Regards,
James

java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1380)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.loadMetadata(IsolationForestModelReadWrite.scala:192)
at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.loadImpl(IsolationForestModelReadWrite.scala:81)
at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.load(IsolationForestModelReadWrite.scala:50)
at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.load(IsolationForestModelReadWrite.scala:38)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at com.linkedin.relevance.isolationforest.IsolationForestModel$.load(IsolationForestModel.scala:140)
... 40 elided

Spark 3.4.0 support

I can see that this package only supports up to version 3.2.0 of Spark, which is two years old now. Any plans to support newer versions?

PySpark support

First off, thank you for making this available. I'm wondering if anyone has had success interfacing with this via Python/PySpark (or SparkR, for that matter)? If not, is it possible? Given my limited experience with PySpark, it seems very possible.

Unable to save model

I am using Spark 3.1 and Scala 2.12, with the isolation forest artifact below from Maven:

com.linkedin.isolation-forest
isolation-forest_3.0.0_2.12

Recently I started getting the error below:

java.lang.NoClassDefFoundError: org/json4s/JsonAssoc$
at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:239)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)

Below is our code.

def generateAnomalyScoreUsingIsolationForest(spark: SparkSession, year: String, month: String, day: String): Unit = {

    spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")

    val model_path = f"/iforest_$year%s_$month%s_$day%s.model"
    val data_path = f"/anomalyScores_$year%s_$month%s_$day%s.parquet/"

    val df_final_table = spark.sql("select * from AppFeatures_v2")
    val cols = df_final_table.columns
    val labelCol = cols.slice(0,1).mkString("")

    val assembler = new VectorAssembler().setInputCols(cols.slice(1, cols.length)).setOutputCol("features")

    val data = assembler.transform(df_final_table).select(col("features"), col(labelCol).as("label"))

    val contamination = 0.002
    val max_samples = 0.3
    val max_features = 0.4
    val num_estimator = 1000

    val isolationForest = (new IsolationForest()
            .setNumEstimators(num_estimator)
            .setBootstrap(false)
            .setMaxSamples(max_samples)
            .setMaxFeatures(max_features)
            .setFeaturesCol("features")
            .setPredictionCol("predictedLabel")
            .setScoreCol("outlierScore")
            .setContamination(contamination)
            .setContaminationError(0.01 * contamination)
            .setRandomSeed(21))

    val isolationForestModel = isolationForest.fit(data)

    val dataWithScores = isolationForestModel.transform(data)


   // Failing on below line
    isolationForestModel.write.overwrite().save("/iforest_latest.model")
    isolationForestModel.write.overwrite().save(model_path)

    dataWithScores.select("label", "predictedLabel","outlierScore").write.mode("overwrite").option("overwriteSchema", "true").parquet(data_path)
}

It was working until a couple of weeks ago. Can anyone help solve this problem?

question about the withReplacement param in BaggedPoint.scala

if (withReplacement) {
  val poisson = new PoissonDistribution(subsamplingRate)
  convertToBaggedRDDHelper(input, subsamplingRate, numSubsamples, seed, poisson)
} else {
  if (numSubsamples == 1 && subsamplingRate == 1.0) {
    input.map(datum => BaggedPoint(datum, Array(1))) // Create bagged RDD without sampling
  } else {
    val binomial = new BinomialDistribution(1, subsamplingRate)
    convertToBaggedRDDHelper(input, subsamplingRate, numSubsamples, seed, binomial)
  }
}

Thanks for the great work! I have a little question: why does using a PoissonDistribution here mean sampling with replacement, while a BinomialDistribution means sampling without replacement?
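A brief note on the statistics (an illustration, not an official answer): when subsampling with replacement at rate q, the number of times a given point appears in a subsample is well approximated by a Poisson(q) distribution, so per-point counts of 0, 1, 2, ... are possible. Binomial(1, q) is a single Bernoulli trial per point, so each point appears at most once, which is exactly sampling without replacement. A tiny sketch showing the difference in sampled counts (commons-math3 is assumed to be on the classpath, as it is for Spark):

import org.apache.commons.math3.distribution.{BinomialDistribution, PoissonDistribution}

val q = 0.8
val poisson = new PoissonDistribution(q)       // per-point counts in {0, 1, 2, ...}
val bernoulli = new BinomialDistribution(1, q) // per-point counts in {0, 1}

// With replacement: a point can be drawn several times for the same tree.
println((1 to 10).map(_ => poisson.sample()))
// Without replacement: a point is either included once or not at all.
println((1 to 10).map(_ => bernoulli.sample()))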
