Comments (11)
Hi @pramodreddy2006, thanks for reaching out!
Can you please share the exact parameter settings you were using when you encountered this potential issue? If you're able to share the data used for training and scoring that would be ideal.
Thanks!
Attached is an archive containing the data file (CSV) and my Java class.
It works fine when contamination is 0.05, but I get only one anomaly when it is set to 0.1.
```
IsolationForest isolationForest = new IsolationForest();
isolationForest.setNumEstimators(100);
// isolationForest.setContamination(0.05);
isolationForest.setContamination(0.1);
isolationForest.setFeaturesCol("indexedFeatures");

IsolationForestModel ifModel = isolationForest.fit(dataset);
dataset = ifModel.transform(dataset);

// Count the predicted regular points and anomalies.
Dataset<Row> countDF = dataset.groupBy("predictedLabel").count();
List<Row> countRows = countDF.collectAsList();
Long regular = 0L;
Long anomalies = 0L;
for (Row row : countRows) {
    Double label = row.getAs("predictedLabel");
    long count = row.getAs("count");
    if (label > 0) {
        anomalies = count;
    } else {
        regular = count;
    }
}
System.out.println("Regular :" + regular);
System.out.println("Anomalies :" + anomalies);
System.out.println("Outlier Score Threshold :" + ifModel.getOutlierScoreThreshold());
```
Thanks for the additional information!
I translated some of your code into Scala and reproduced the issue.
```
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorIndexer

import com.linkedin.relevance.isolationforest._

val path = "/Users/jverbus/Desktop/Archive/breast_cancer_data.csv"
val df = spark.read.option("header", true).csv(path)

val featureColumnList = Array(
  "clump_thickness",
  "cell_size_uniformity",
  "cell_shape_uniformity",
  "marginal_adhesion",
  "single_ep_cell_size",
  "bare_nuclei",
  "bland_chromatin",
  "normal_nucleoli",
  "mitoses")

// Cast the raw string columns to doubles.
val dfCast = df
  .withColumn("clump_thickness", df("clump_thickness").cast("double"))
  .withColumn("cell_size_uniformity", df("cell_size_uniformity").cast("double"))
  .withColumn("cell_shape_uniformity", df("cell_shape_uniformity").cast("double"))
  .withColumn("marginal_adhesion", df("marginal_adhesion").cast("double"))
  .withColumn("single_ep_cell_size", df("single_ep_cell_size").cast("double"))
  .withColumn("bare_nuclei", df("bare_nuclei").cast("double"))
  .withColumn("bland_chromatin", df("bland_chromatin").cast("double"))
  .withColumn("normal_nucleoli", df("normal_nucleoli").cast("double"))
  .withColumn("mitoses", df("mitoses").cast("double"))

// Impute missing feature values.
val imputer = new Imputer()
imputer.setInputCols(featureColumnList)
imputer.setOutputCols(featureColumnList)
val imputerModel = imputer.fit(dfCast)
val dfCastImputed = imputerModel.transform(dfCast)

// Assemble the features into a single vector column and index it.
val assembler = new VectorAssembler()
  .setInputCols(featureColumnList)
  .setOutputCol("features")
val dfCastImputedAssembled = assembler.transform(dfCastImputed)

val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
val vectorIndexerModel = vectorIndexer.fit(dfCastImputedAssembled)
val dfCastImputedAssembledIndexed = vectorIndexerModel.transform(dfCastImputedAssembled)

// Fit with contamination = 0.05.
val isolationForest05 = new IsolationForest()
isolationForest05.setNumEstimators(100)
isolationForest05.setContamination(0.05)
isolationForest05.setFeaturesCol("indexedFeatures")
val isolationForestModel05 = isolationForest05.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

// Fit with contamination = 0.10.
val isolationForest10 = new IsolationForest()
isolationForest10.setNumEstimators(100)
isolationForest10.setContamination(0.1)
isolationForest10.setFeaturesCol("indexedFeatures")
val isolationForestModel10 = isolationForest10.fit(dfCastImputedAssembledIndexed)
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)
```
Which gives the results:
```
scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|                1.0|
+-------------------+
```
As we expect, the scores are the same regardless of the contamination choice:
```
scala> scores10.select("outlierScore").collect().deep == scores05.select("outlierScore").collect().deep
res25: Boolean = true
```
The odd behavior in the contamination = 0.10 case must be due to the choice of the outlier score threshold, which is calculated using Spark's approxQuantile method.
We use approxQuantile because it can be very costly to calculate the threshold exactly on very large datasets.
Spark's approxQuantile method is defined here: https://github.com/apache/spark/blob/eef3abbb903f95178f225ae0f6e3db2d9cf64175/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L61
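To make the mechanics concrete, here is a minimal sketch (not the library's internal code) of how a contamination-based threshold relates to approxQuantile. It assumes the scores10 DataFrame and "outlierScore" column from the reproduction above; the relativeError value mirrors the calls shown further down in this thread.
```
// Minimal sketch, not the library's internal implementation.
// Assumes `scores10` and the "outlierScore" column from the reproduction above.
import org.apache.spark.sql.functions.col

val contamination = 0.10

// Fast, approximate quantile (nonzero relativeError).
val approxThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), contamination * 0.01)
  .head

// Exact quantile (relativeError = 0); more expensive on very large datasets.
val exactThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), 0.0)
  .head

// Points with a score at or above the threshold are labeled as anomalies,
// so an overestimated threshold directly shrinks the number of predicted outliers.
val labeled = scores10.withColumn(
  "label",
  (col("outlierScore") >= approxThreshold).cast("double"))
```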
This is the threshold calculated by the model with a contamination of 0.05:
```
scala> isolationForestModel05.getOutlierScoreThreshold
res23: Double = 0.6185162204274052
```
We can reproduce this result using the approxQuantile method on our scores dataframe:
```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0.05 * 0.01)
res26: Array[Double] = Array(0.6185162204274052)
```
If we set the relativeError to 0, which forces an exact calculation, we get the same result:
```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0)
res27: Array[Double] = Array(0.6185162204274052)
```
However, the contamination = 0.10 case is different.

This is the threshold calculated by the model with a contamination of 0.10:
```
scala> isolationForestModel10.getOutlierScoreThreshold
res24: Double = 0.67621027121925
```
We can reproduce this result using the approxQuantile method on our scores dataframe:
```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0.10 * 0.01)
res29: Array[Double] = Array(0.67621027121925)
```
If we set the relativeError to 0, which forces an exact calculation, we get a different result:
```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0)
res28: Array[Double] = Array(0.5929591082174609)
```
If we use this exactly calculated threshold for the contamination = 0.10 case, we get reasonable results:
```
scala> val newScores10 = scores10.select("outlierScore").withColumn("newLabel", (col("outlierScore") >= 0.5929591082174609).cast("double"))
newScores10: org.apache.spark.sql.DataFrame = [outlierScore: double, newLabel: double]

scala> newScores10.agg(sum("newLabel")).show()
+-------------+
|sum(newLabel)|
+-------------+
|         70.0|
+-------------+

scala> newScores10.count()
res44: Long = 699
```
70 / 699 ≈ 0.10
It is important to note that this is a rare case. The model was validated on 12 benchmark datasets with varying contamination values without this issue appearing.
We need to keep the option of an approximate threshold calculation, because an exact calculation can cause issues on very large datasets.
I will add a model parameter that lets the user choose whether to calculate the threshold exactly. I will also add a check that warns the user if the threshold is calculated approximately and the results don't make sense.
In the interim, you can calculate your own exact threshold and corresponding labels from the scores dataframe, as shown above and in the sketch below.
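Here is a rough sketch of that interim workaround, assuming the scores10 DataFrame and "outlierScore" column from the reproduction above:
```
// Interim workaround sketch (not library code): compute the exact threshold
// with relativeError = 0 and re-derive the labels from it.
// Assumes `scores10` and "outlierScore" from the reproduction above.
import org.apache.spark.sql.functions.{col, sum}

val contamination = 0.10

val exactThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), 0.0)
  .head

val relabeled = scores10.withColumn(
  "correctedLabel",
  (col("outlierScore") >= exactThreshold).cast("double"))

// Sanity check: the anomaly fraction should now be close to the requested contamination.
relabeled.agg(sum("correctedLabel")).show()
```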
@jverbus
Got it. Thanks a lot, James.
@pramodreddy2006: Happy to help! Thanks for raising this!
@pramodreddy2006: I just pushed a fix for the issue you reported.
The library now uses an exact calculation of the threshold by default (slower and less scalable). There is a new parameter, contaminationError, that you can specify if you want an approximate, but fast and scalable, threshold calculation.
The underlying issue with Spark's approxQuantile() method is still there, so the approximate threshold calculation may occasionally exhibit the bug you observed. If there is a disagreement between the expected and observed number of outliers during training, a warning will be shown to the user.
I reported the approxQuantile() bug to the Spark team: https://issues.apache.org/jira/browse/SPARK-29325
Please try version 0.3.0 and let me know if it works for you!
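For reference, a rough usage sketch of the behavior described above. The parameter name contaminationError comes from this comment, but the setter name setContaminationError is an assumption based on the library's other setters, so check the 0.3.0 docs for the exact API.
```
// Hypothetical usage sketch for 0.3.0; setContaminationError is assumed from the
// library's naming convention (setContamination, setNumEstimators, ...) and may differ.
// Reuses the `dfCastImputedAssembledIndexed` DataFrame from the reproduction above.
val isolationForest = new IsolationForest()
isolationForest.setNumEstimators(100)
isolationForest.setFeaturesCol("indexedFeatures")
isolationForest.setContamination(0.1)

// By default the threshold is now calculated exactly (slower, less scalable).
// A small nonzero contaminationError requests the fast, approximate calculation,
// which may still occasionally hit the approxQuantile issue discussed above.
isolationForest.setContaminationError(0.1 * 0.01)

val isolationForestModel = isolationForest.fit(dfCastImputedAssembledIndexed)
val scores = isolationForestModel.transform(dfCastImputedAssembledIndexed)
```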
@jverbus Thanks. Works as explained.