
Comments (11)

jverbus commented on May 25, 2024

Hi @pramodreddy2006, thanks for reaching out!

Can you please share the exact parameter settings you were using when you encountered this potential issue? If you're able to share the data used for training and scoring that would be ideal.

Thanks!

from isolation-forest.

pramodreddy2006 commented on May 25, 2024

Archive.zip

Attached is an archive containing the data file (CSV) and my Java class.
It works fine when contamination is 0.05, but I get only 1 anomaly when it is set to 0.1:

```java
IsolationForest isolationForest = new IsolationForest();
isolationForest.setNumEstimators(100);
// isolationForest.setContamination(0.05);
isolationForest.setContamination(0.1);
isolationForest.setFeaturesCol("indexedFeatures");
IsolationForestModel ifModel = isolationForest.fit(dataset);
dataset = ifModel.transform(dataset);

Dataset<Row> countDF = dataset.groupBy("predictedLabel").count();
List<Row> countRows = countDF.collectAsList();
long regular = 0L;
long anomalies = 0L;
for (Row row : countRows) {
    Double label = row.getAs("predictedLabel");
    long count = row.getAs("count");
    if (label > 0) {
        anomalies = count;
    } else {
        regular = count;
    }
}
System.out.println("Regular: " + regular);
System.out.println("Anomalies: " + anomalies);
System.out.println("Outlier Score Threshold: " + ifModel.getOutlierScoreThreshold());
```


jverbus commented on May 25, 2024

Thanks for the additional information!

I translated some of your code into Scala and reproduced the issue.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorIndexer

import com.linkedin.relevance.isolationforest._

val path = "/Users/jverbus/Desktop/Archive/breast_cancer_data.csv"
val df = spark.read.option("header", true).csv(path)

val featureColumnList = Array(
  "clump_thickness",
  "cell_size_uniformity",
  "cell_shape_uniformity",
  "marginal_adhesion",
  "single_ep_cell_size",
  "bare_nuclei",
  "bland_chromatin",
  "normal_nucleoli",
  "mitoses")

val dfCast = df
  .withColumn("clump_thickness", df("clump_thickness").cast("double"))
  .withColumn("cell_size_uniformity", df("cell_size_uniformity").cast("double"))
  .withColumn("cell_shape_uniformity", df("cell_shape_uniformity").cast("double"))
  .withColumn("marginal_adhesion", df("marginal_adhesion").cast("double"))
  .withColumn("single_ep_cell_size", df("single_ep_cell_size").cast("double"))
  .withColumn("bare_nuclei", df("bare_nuclei").cast("double"))
  .withColumn("bland_chromatin", df("bland_chromatin").cast("double"))
  .withColumn("normal_nucleoli", df("normal_nucleoli").cast("double"))
  .withColumn("mitoses", df("mitoses").cast("double"))

val imputer = new Imputer()
imputer.setInputCols(featureColumnList)
imputer.setOutputCols(featureColumnList)

val imputerModel = imputer.fit(dfCast)
val dfCastImputed = imputerModel.transform(dfCast)

val assembler = new VectorAssembler()
  .setInputCols(featureColumnList)
  .setOutputCol("features")
val dfCastImputedAssembled = assembler.transform(dfCastImputed)

val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

val vectorIndexerModel = vectorIndexer.fit(dfCastImputedAssembled)
val dfCastImputedAssembledIndexed = vectorIndexerModel.transform(dfCastImputedAssembled)

val isolationForest05 = new IsolationForest()
isolationForest05.setNumEstimators(100)
isolationForest05.setContamination(0.05)
isolationForest05.setFeaturesCol("indexedFeatures")

val isolationForestModel05 = isolationForest05.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

val isolationForest10 = new IsolationForest()
isolationForest10.setNumEstimators(100)
isolationForest10.setContamination(0.1)
isolationForest10.setFeaturesCol("indexedFeatures")

val isolationForestModel10 = isolationForest10.fit(dfCastImputedAssembledIndexed)
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)
```

Which gives the results:

```
scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|                1.0|
+-------------------+
```


jverbus commented on May 25, 2024

As we expect, the scores are the same regardless of the contamination choice:

```
scala> scores10.select("outlierScore").collect().deep == scores05.select("outlierScore").collect().deep
res25: Boolean = true
```

The odd behavior in the contamination = 0.10 case must be due to the choice of the outlier score threshold, which is calculated using Spark's approxQuantile method:

https://github.com/linkedin/isolation-forest/blob/master/isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForest.scala#L141

We use this because it can be very costly to calculate the threshold exactly on very large datasets.

This file contains Spark's approxQuantile method: https://github.com/apache/spark/blob/eef3abbb903f95178f225ae0f6e3db2d9cf64175/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L61
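For intuition, here is a rough sketch in plain Python (not Spark's implementation) of the guarantee approxQuantile makes: the returned value's rank is only promised to lie within relativeError * n of the target rank, so any score in that rank window is a legal answer, and the number of points at or above the returned threshold can drift accordingly.

```python
# Illustrative sketch of a rank-based approximate quantile guarantee.
# Not Spark's or the isolation-forest library's code.

def nearest_rank_quantile(scores, p):
    # Exact nearest-rank p-quantile (the relativeError = 0 behavior).
    s = sorted(scores)
    n = len(s)
    rank = min(max(round(p * n), 1), n)
    return s[rank - 1]

def quantile_rank_window(scores, p, eps):
    # Any value ranked within [(p - eps) * n, (p + eps) * n] is a valid
    # approximate answer; return the two extreme candidates.
    s = sorted(scores)
    n = len(s)
    lo = min(max(round((p - eps) * n), 1), n)
    hi = min(max(round((p + eps) * n), 1), n)
    return s[lo - 1], s[hi - 1]

scores = [i / 1000 for i in range(1, 1001)]  # 1000 evenly spread scores
contamination = 0.10

threshold = nearest_rank_quantile(scores, 1 - contamination)
low_thr, high_thr = quantile_rank_window(scores, 1 - contamination,
                                         contamination * 0.01)

flagged_exact = sum(1 for x in scores if x >= threshold)  # 101 points
flagged_low = sum(1 for x in scores if x >= low_thr)      # 102 points
flagged_high = sum(1 for x in scores if x >= high_thr)    # 100 points
```

On this well-behaved, evenly spread data the drift is tiny (±eps * n points); the pathology reported here is larger than that guarantee allows, which is what makes it look like a bug in approxQuantile itself.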


jverbus commented on May 25, 2024

This is the threshold calculated by the model with a contamination of 0.05:

```
scala> isolationForestModel05.getOutlierScoreThreshold
res23: Double = 0.6185162204274052
```

We can reproduce this result using the approxQuantile method on our scores dataframe:

```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0.05 * 0.01)
res26: Array[Double] = Array(0.6185162204274052)
```

If we set the relativeError to 0, which forces an exact calculation, we get the same result:

```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0)
res27: Array[Double] = Array(0.6185162204274052)
```

However, the contamination = 0.10 case is different.

This is the threshold calculated by the model with a contamination of 0.10:

```
scala> isolationForestModel10.getOutlierScoreThreshold
res24: Double = 0.67621027121925
```

We can reproduce this result using the approxQuantile method on our scores dataframe:

```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0.10 * 0.01)
res29: Array[Double] = Array(0.67621027121925)
```

If we set the relativeError to 0, which forces an exact calculation, we get a different result:

```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0)
res28: Array[Double] = Array(0.5929591082174609)
```


jverbus commented on May 25, 2024

If we use this exactly calculated threshold for the contamination = 0.10 case, we get reasonable results:

```
scala> val newScores10 = scores10.select("outlierScore").withColumn("newLabel", (col("outlierScore") >= 0.5929591082174609).cast("double"))
newScores10: org.apache.spark.sql.DataFrame = [outlierScore: double, newLabel: double]

scala> newScores10.agg(sum("newLabel")).show()
+-------------+
|sum(newLabel)|
+-------------+
|         70.0|
+-------------+

scala> newScores10.count()
res44: Long = 699
```

70 / 699 ≈ 0.10, which matches the requested contamination.
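The same workaround can be sketched in plain Python (with stand-in random scores, since the idea isn't tied to Spark): compute the exact (1 - contamination) quantile of the outlier scores yourself and derive the labels from it.

```python
import math
import random

def exact_threshold(scores, contamination):
    # Exact (1 - contamination) quantile: the smallest score whose rank is
    # at least ceil((1 - contamination) * n), i.e. the relativeError = 0 result.
    s = sorted(scores)
    k = max(math.ceil((1 - contamination) * len(s)), 1)
    return s[k - 1]

random.seed(0)
scores = [random.random() for _ in range(699)]  # stand-ins for outlierScore
contamination = 0.10

threshold = exact_threshold(scores, contamination)
labels = [1.0 if x >= threshold else 0.0 for x in scores]  # the newLabel column
```

With 699 continuous scores this flags exactly 70 points, matching the requested 10% contamination up to rounding.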


jverbus commented on May 25, 2024

It is important to note that this is a rare case. The model was validated on 12 benchmark datasets with varying contamination values without this issue appearing.

We need to keep the option of an approximate threshold calculation, because an exact calculation can cause issues on very large datasets.

I will add a model parameter that allows the user to choose if they want to exactly calculate the threshold. I will also add a check that will give a warning if the threshold is calculated approximately and the results don't make sense.

In the interim, you can calculate your own exact threshold and corresponding labels based upon the scores dataframe as shown above.


pramodreddy2006 commented on May 25, 2024

@jverbus
Got it. Thanks a lot James.


jverbus commented on May 25, 2024

@pramodreddy2006: Happy to help! Thanks for raising this!


jverbus commented on May 25, 2024

@pramodreddy2006: I just pushed a fix to the issue you reported.

#4

The library now uses an exact calculation of the threshold by default (slower and less scalable). There is a new contaminationError parameter you can set if you want an approximate, but fast and scalable, threshold calculation.

The underlying issue with Spark's approxQuantile() method is still there, so the approximate threshold calculation may occasionally have the bug you observed. If there is a disagreement between the expected and observed number of outliers during training, a warning will be shown to the user.
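Such a sanity check can be sketched as follows (plain Python; check_threshold is a hypothetical helper, not the library's actual implementation): compare the observed outlier fraction against the requested contamination and warn when they disagree badly.

```python
import warnings

def check_threshold(labels, contamination, tol=0.5):
    # Hypothetical helper: warn if the observed outlier fraction deviates
    # from the requested contamination by more than tol (relative error).
    observed = sum(labels) / len(labels)
    if contamination > 0 and abs(observed - contamination) / contamination > tol:
        warnings.warn(
            f"Observed outlier fraction {observed:.4f} is far from the "
            f"requested contamination {contamination:.4f}; the approximate "
            "threshold may be off. Consider an exact calculation."
        )
    return observed

# The 1-of-699 case reported above trips the check; 70 of 699 would not.
observed = check_threshold([1.0] + [0.0] * 698, 0.10)
```

The tolerance here is arbitrary; the point is that an observed fraction of 1/699 against a requested 0.10 is detectable at training time without any extra passes over the data.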

I reported the approxQuantile() bug to the Spark team: https://issues.apache.org/jira/browse/SPARK-29325

Please try version 0.3.0 out and let me know if it works for you!


pramodreddy2006 commented on May 25, 2024

@jverbus Thanks. Works as explained.

