Coder Social home page Coder Social logo

mathyns / emm Goto Github PK

View Code? Open in Web Editor NEW
11.0 1.0 4.0 7.06 MB

Exceptional Model Mining is a descriptive data mining technique to find interesting patterns in datasets. This package contains a Python inmplementation of Exceptional Model Mining that can be applied to any dataset.

Python 0.64% Jupyter Notebook 99.36%
exceptional-model-mining mining-descriptive-patterns pattern-mining emm beam-search

emm's Introduction

Exceptional Model Mining (EMM)

Excetional Model Mining is a descriptive modeling technique to discover interesting patterns in datasets. Whether a pattern is exceptional or not is often quite subjective and dependent on the context. The columns or attributes of the dataset are divided into two categories: descriptive attributes and target attributes. Descriptive attributes are used to divide the dataset into subsets, or often called subgroups. The target attributes will be used to evaluate the interestingness of a subgroup on. In order to evaluate this interestingness a model is needed to capture the interaction between the target variables and a evaluation method to score the interestingness of the interaction.

Best way to explain this technique is by an example. In this example the Windsor housing dataset is used (https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/Housing.html). The table below shows this dataset:

Our assumption with domain knowledge about houses would be that the price increases if the lotsize increases. Exceptional subgroups would be subgroups for which the increase would be significantly lower or higher than for the whole dataset or even decreasing. Our model and evaluation metric should include the coëficient for the correlation between lotsize and price, the difference between the corelation of the subset and the size of the subset. If we do not include the size of the subset in the metric we would end up with small subgroups where an exceptional might occur but is not very evident by the small amount of examples.

Let's first look at the interaction between the target variables lotsize and price on the full dataset. In the figure below lotsize is plotted on the x-axis and price on the y-axis. Each dot represents a row in the dataset and the line is a linear regression through all the data points.

Now we run the EMM algorithm with our as model the (estimated) correlation coefficient between lotsize and price. We apply this model to the full dataset and every subgroup and score each subgroup by taking the absolute difference between the model of the subgroup and the dataset and multiply it by the entropy of the subset and dataset. Exact formulas can be found in the appendix. Running the EMM algorithm with settings depth=3, width=10 and above model and metric results in the most 5 interesting subgroups as visualized below:

This is just an example to show the application of EMM, many different models and evalutation metrics are possible to mine datasets for interesting patterns.

Installation

For now this module is not available on PyPi (yet). Easiest way to use the package is to clone the git repository and import the EMM class to from your own codebase. Cloning as a submodule would be a advisable.

Usage

Quickstart

This easy example demonstrates the basic usage of the EMM module. The results of running this code would be similar to the example used in the introduction.

import pandas as pd
from emm import EMM
df = pd.read_csv('data/Housing.csv', index_col=0)
target_columns = ['lotsize', 'price']
clf = EMM(width=10, depth=2, evaluation_metric='correlation')
clf.search(df, target_cols=target_columns)
clf.visualise(subgroups=5, cols=3) 

Both the EMM() class en the search() method have options that will be discussed in next section.

Options

The EMM class has the following configurable parameters:

Attribute Type Default Options Description
width int None - Width parameter of the beam search: amount of subgroups to keep before moving on to the next depth step
depth int None - Depth parameter of the beam search: amount of iterations in the process, subgroups are described by at most depth attributes
evaluation_metric str or callable None See next section Function to evaluate the subgroups with. You can choose one of the models described in the next section by referencing them as a string or create your own evaluation function
n_jobs int -1 - Amount of CPU cores to use for the EMM algorithm. To indicate to use all available cores use -1 as value.
strategy str 'maximize' ('maximize', minimize') Indicate if the score of the evaluated subgroups must be maximized or minimized to find the most exceptional subgroups.
n_bins int 10 - For int or float columns of the dataset not all options are used to create subgroups. Values are divided into bins for which the amount of bins can be specified.
bin_strategy str or None 'equidepth' ('equidepth', 'equiwidth', None) Method to create bins for int and float columns. If None is specified all possible values are used to create subgroups. This will increase running time significant.
candidate_size int width^2 - After each (depth) step a pruning algorithm is used to remove overlapping subgroups before proceding to the next step. The candidate size is the amount of subgroups to keep in memory before this pruning proces. By default this is the width squared

The search method has the following configurable parameters:

Attribute Type Default Description
data pandas.DataFrame None Data(Frame) with data to mine for exceptional patterns.
target_cols str or list None Single column or list of columns with the target attributes. Obviously the supplied columns must be present in the provided DataFrame.
descriptive_cols str ot list All columns except target columns Single column or list of columns that can be used to create subgroups with. Specified columns may not overlap with the target columns.

Available models and metrics

correlation
distribution
custom

References

This module is built on top of the work of the following papers. All models and metrics are originating from these paper.

[1] Wouter Duivesteijn, Ad J. Feelders, and Arno Knobbe. Exceptional Model Mining: Supervised descriptive local pattern mining with complex target concepts. Data Mining and Knowledge Discovery, 30(1):47–98, 2016.

[2] Matthijs van Leeuwen and Arno Knobbe. Diverse subgroup set discovery. Data Mining and Knowledge Discovery, 25(2):208–242, September 2012.

[3] Mehdi Kaytoue, Marc Plantevit, Albrecht Zimmermann, Anes Bendimerad, and C´eline Robardet. Exceptional contextual subgraph mining. Machine Learning, 106(8):1171–1211, August 2017.

[4] Florian Lemmerich, Martin Becker, Philipp Singer, Denis Helic, Andreas Hotho, and Markus Strohmaier. Mining Subgroups with Exceptional Transition Behavior. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 965–974, Jume 2016.

emm's People

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

emm's Issues

Error in EMM.py

Hey MathynS,
I encountered an issue when using your code for a project I'm working on. I read the README and pasted the provided quick start code in a separate file. When running the code, I got:
[RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.]
This error came from EMM.py in make_subgroups "w.start()"

This made me add the mentioned code and got:
[OSError: [WinError 6] The handle is invalid]
This error came from workers.py in evaluate_subgroups "item=queue_from.get()".

I hope you are able to help me circumvent this error and make the code work on my system.
Thank you in advance,
Berend

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.