Comments (11)
Hi @pramodreddy2006, thanks for reaching out!
Can you please share the exact parameter settings you were using when you encountered this potential issue? If you're able to share the data used for training and scoring that would be ideal.
Thanks!
Attached is an archive containing the data file (CSV) and my Java class.
It works fine when contamination is 0.05, but I get only one anomaly when it is set to 0.1.
```
IsolationForest isolationForest = new IsolationForest();
isolationForest.setNumEstimators(100);
// isolationForest.setContamination(0.05);
isolationForest.setContamination(0.1);
isolationForest.setFeaturesCol("indexedFeatures");

IsolationForestModel ifModel = isolationForest.fit(dataset);
dataset = ifModel.transform(dataset);

// Count the predicted regular points and anomalies.
Dataset<Row> countDF = dataset.groupBy("predictedLabel").count();
List<Row> countRows = countDF.collectAsList();
Long regular = 0L;
Long anomalies = 0L;
for (Row row : countRows) {
    Double label = row.getAs("predictedLabel");
    long count = row.getAs("count");
    if (label > 0) {
        anomalies = count;
    } else {
        regular = count;
    }
}
System.out.println("Regular :" + regular);
System.out.println("Anomalies :" + anomalies);
System.out.println("Outlier Score Threshold :" + ifModel.getOutlierScoreThreshold());
```
Thanks for the additional information!
I translated some of your code into Scala and reproduced the issue.
```
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorIndexer

import com.linkedin.relevance.isolationforest._

val path = "/Users/jverbus/Desktop/Archive/breast_cancer_data.csv"
val df = spark.read.option("header", true).csv(path)

val featureColumnList = Array(
  "clump_thickness",
  "cell_size_uniformity",
  "cell_shape_uniformity",
  "marginal_adhesion",
  "single_ep_cell_size",
  "bare_nuclei",
  "bland_chromatin",
  "normal_nucleoli",
  "mitoses")

// Cast the raw string columns to doubles.
val dfCast = df
  .withColumn("clump_thickness", df("clump_thickness").cast("double"))
  .withColumn("cell_size_uniformity", df("cell_size_uniformity").cast("double"))
  .withColumn("cell_shape_uniformity", df("cell_shape_uniformity").cast("double"))
  .withColumn("marginal_adhesion", df("marginal_adhesion").cast("double"))
  .withColumn("single_ep_cell_size", df("single_ep_cell_size").cast("double"))
  .withColumn("bare_nuclei", df("bare_nuclei").cast("double"))
  .withColumn("bland_chromatin", df("bland_chromatin").cast("double"))
  .withColumn("normal_nucleoli", df("normal_nucleoli").cast("double"))
  .withColumn("mitoses", df("mitoses").cast("double"))

// Impute missing feature values.
val imputer = new Imputer()
imputer.setInputCols(featureColumnList)
imputer.setOutputCols(featureColumnList)
val imputerModel = imputer.fit(dfCast)
val dfCastImputed = imputerModel.transform(dfCast)

// Assemble the features into a single vector column and index it.
val assembler = new VectorAssembler()
  .setInputCols(featureColumnList)
  .setOutputCol("features")
val dfCastImputedAssembled = assembler.transform(dfCastImputed)

val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
val vectorIndexerModel = vectorIndexer.fit(dfCastImputedAssembled)
val dfCastImputedAssembledIndexed = vectorIndexerModel.transform(dfCastImputedAssembled)

// Fit with contamination = 0.05.
val isolationForest05 = new IsolationForest()
isolationForest05.setNumEstimators(100)
isolationForest05.setContamination(0.05)
isolationForest05.setFeaturesCol("indexedFeatures")
val isolationForestModel05 = isolationForest05.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

// Fit with contamination = 0.10.
val isolationForest10 = new IsolationForest()
isolationForest10.setNumEstimators(100)
isolationForest10.setContamination(0.1)
isolationForest10.setFeaturesCol("indexedFeatures")
val isolationForestModel10 = isolationForest10.fit(dfCastImputedAssembledIndexed)
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)
```
Which gives the results:
```
scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|                1.0|
+-------------------+
```
As we expect, the scores are the same regardless of the contamination choice:
```
scala> scores10.select("outlierScore").collect().deep == scores05.select("outlierScore").collect().deep
res25: Boolean = true
```
The odd behavior in the contamination = 0.10 case must be due to the choice of the outlier score threshold, which is calculated using Spark's approxQuantile method.
We use approxQuantile because it can be very costly to calculate the threshold exactly on very large datasets.
Spark's approxQuantile method is defined here: https://github.com/apache/spark/blob/eef3abbb903f95178f225ae0f6e3db2d9cf64175/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L61
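To make the mechanics concrete, here is a minimal sketch (not the library's internal code) of how a contamination-based threshold relates to approxQuantile. It assumes the scores10 DataFrame and "outlierScore" column from the reproduction above; the relativeError value mirrors the calls shown further down in this thread.
```
// Minimal sketch, not the library's internal implementation.
// Assumes `scores10` and the "outlierScore" column from the reproduction above.
import org.apache.spark.sql.functions.col

val contamination = 0.10

// Fast, approximate quantile (nonzero relativeError).
val approxThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), contamination * 0.01)
  .head

// Exact quantile (relativeError = 0); more expensive on very large datasets.
val exactThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), 0.0)
  .head

// Points with a score at or above the threshold are labeled as anomalies,
// so an overestimated threshold directly shrinks the number of predicted outliers.
val labeled = scores10.withColumn(
  "label",
  (col("outlierScore") >= approxThreshold).cast("double"))
```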
This is the threshold calculated by the model with a contamination of 0.05:
```
scala> isolationForestModel05.getOutlierScoreThreshold
res23: Double = 0.6185162204274052
```
We can reproduce this result using the approxQuantile method on our scores dataframe:
```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0.05 * 0.01)
res26: Array[Double] = Array(0.6185162204274052)
```
If we set the relativeError to 0, which forces an exact calculation, we get the same result:
```
scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0)
res27: Array[Double] = Array(0.6185162204274052)
```
However, the contamination = 0.10 case is different.

This is the threshold calculated by the model with a contamination of 0.10:
```
scala> isolationForestModel10.getOutlierScoreThreshold
res24: Double = 0.67621027121925
```
We can reproduce this result using the approxQuantile method on our scores dataframe:
```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0.10 * 0.01)
res29: Array[Double] = Array(0.67621027121925)
```
If we set the relativeError to 0, which forces an exact calculation, we get a different result:
```
scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0)
res28: Array[Double] = Array(0.5929591082174609)
```
If we use this exactly calculated threshold for the contamination = 0.10 case, we get reasonable results:
```
scala> val newScores10 = scores10.select("outlierScore").withColumn("newLabel", (col("outlierScore") >= 0.5929591082174609).cast("double"))
newScores10: org.apache.spark.sql.DataFrame = [outlierScore: double, newLabel: double]

scala> newScores10.agg(sum("newLabel")).show()
+-------------+
|sum(newLabel)|
+-------------+
|         70.0|
+-------------+

scala> newScores10.count()
res44: Long = 699
```
70 / 699 ≈ 0.10
It is important to note that this is a rare case. The model was validated on 12 benchmark datasets with varying contamination values without this issue appearing.
We need to keep the option of an approximate threshold calculation, because an exact calculation can cause issues on very large datasets.
I will add a model parameter that lets the user choose whether to calculate the threshold exactly. I will also add a check that warns the user if the threshold is calculated approximately and the results don't make sense.
In the interim, you can calculate your own exact threshold and corresponding labels from the scores dataframe, as shown above and in the sketch below.
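Here is a rough sketch of that interim workaround, assuming the scores10 DataFrame and "outlierScore" column from the reproduction above:
```
// Interim workaround sketch (not library code): compute the exact threshold
// with relativeError = 0 and re-derive the labels from it.
// Assumes `scores10` and "outlierScore" from the reproduction above.
import org.apache.spark.sql.functions.{col, sum}

val contamination = 0.10

val exactThreshold = scores10.stat
  .approxQuantile("outlierScore", Array(1 - contamination), 0.0)
  .head

val relabeled = scores10.withColumn(
  "correctedLabel",
  (col("outlierScore") >= exactThreshold).cast("double"))

// Sanity check: the anomaly fraction should now be close to the requested contamination.
relabeled.agg(sum("correctedLabel")).show()
```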
@jverbus
Got it. Thanks a lot, James.
@pramodreddy2006: Happy to help! Thanks for raising this!
@pramodreddy2006: I just pushed a fix for the issue you reported.
The library now uses an exact calculation of the threshold by default (slower and less scalable). There is a new parameter, contaminationError, that you can specify if you want an approximate, but fast and scalable, threshold calculation.
The underlying issue with Spark's approxQuantile() method is still there, so the approximate threshold calculation may occasionally exhibit the bug you observed. If there is a disagreement between the expected and observed number of outliers during training, a warning will be shown to the user.
I reported the approxQuantile() bug to the Spark team: https://issues.apache.org/jira/browse/SPARK-29325
Please try version 0.3.0 and let me know if it works for you!
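For reference, a rough usage sketch of the behavior described above. The parameter name contaminationError comes from this comment, but the setter name setContaminationError is an assumption based on the library's other setters, so check the 0.3.0 docs for the exact API.
```
// Hypothetical usage sketch for 0.3.0; setContaminationError is assumed from the
// library's naming convention (setContamination, setNumEstimators, ...) and may differ.
// Reuses the `dfCastImputedAssembledIndexed` DataFrame from the reproduction above.
val isolationForest = new IsolationForest()
isolationForest.setNumEstimators(100)
isolationForest.setFeaturesCol("indexedFeatures")
isolationForest.setContamination(0.1)

// By default the threshold is now calculated exactly (slower, less scalable).
// A small nonzero contaminationError requests the fast, approximate calculation,
// which may still occasionally hit the approxQuantile issue discussed above.
isolationForest.setContaminationError(0.1 * 0.01)

val isolationForestModel = isolationForest.fit(dfCastImputedAssembledIndexed)
val scores = isolationForestModel.transform(dfCastImputedAssembledIndexed)
```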
@jverbus Thanks. Works as explained.