gap's Introduction

This script uses the gap statistic to find the best K value for a dataset, running the k-means algorithm many times across the candidate values of K.
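As a rough illustration of what the gap statistic measures, the sketch below computes the gap value and its error term for a single K from within-cluster dispersions that are already known. It follows the usual textbook definition (mean log-dispersion of the reference datasets minus the log-dispersion of the real data). The function name `gap_from_dispersions` and its arguments are hypothetical and are not part of gapkmean, and the package's internals may differ.

```python
import math


def gap_from_dispersions(w_data, w_refs):
    """Return (gap, s) for one value of K.

    Hypothetical helper, not part of gapkmean.
    w_data: within-cluster dispersion of the real data (positive float).
    w_refs: dispersions of the B reference datasets clustered with the same K.
    """
    b = len(w_refs)
    log_refs = [math.log(w) for w in w_refs]
    mean_log_ref = sum(log_refs) / b

    # Gap: how much tighter the real clustering is than the reference ones.
    gap = mean_log_ref - math.log(w_data)

    # Standard deviation of the reference log-dispersions (divides by B),
    # inflated to account for the simulation error of using only B references.
    sd = math.sqrt(sum((x - mean_log_ref) ** 2 for x in log_refs) / b)
    s = sd * math.sqrt(1.0 + 1.0 / b)
    return gap, s
```

A larger gap means the real data clusters more tightly than structureless reference data does for that K.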

Because k-means depends heavily on its initial points, the results can differ from one set of initial points to another. The script therefore uses sklearn to run k-means many times with different initial points, and the number of runs is one of the parameters of the gap statistic function.
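To show why repeated starts help, here is a minimal sketch of k-means with several random initializations that keeps the lowest dispersion found. It is a hand-rolled NumPy stand-in written for illustration only. The package itself relies on sklearn, and the names `kmeans_once` and `best_dispersion` are assumptions, not gapkmean functions.

```python
import numpy as np


def kmeans_once(data, k, rng, n_iter=50):
    """One run of Lloyd's algorithm from random initial centers.

    Hypothetical helper; returns the within-cluster sum of squared distances.
    """
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


def best_dispersion(data, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the lowest dispersion."""
    rng = np.random.default_rng(seed)
    return min(kmeans_once(data, k, rng) for _ in range(n_init))
```

With more starts, the reported dispersion can only stay the same or improve, at the cost of proportionally more computation.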

This module is meant to be imported into other Python scripts and combined with sklearn to find the best K value.

parameters:

refs: np.array or None, the reference (replicated) data to compare against, if you already have some; 
if you have no suitable reference data, pass None and the function will generate it automatically; 

B: int, the number of reference samples used to compute the gap statistic; 10 is recommended, and it should not be reduced to a smaller value;

K: list, the range of K values to test on;

N_init: int, the number of initial starting points for each k-means run under sklearn, used to get a stable clustering result each time; 
you may not need that many starting points, so it can be reduced to speed up the computation;

n_jobs: int, controls parallel computation, which can speed up the run; it can only be changed inside the script and is not an argument of the function;
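For readers wondering what automatically generated reference data might look like, the sketch below draws B datasets uniformly over the bounding box of the real data, which is the common approach for the gap statistic. This is an assumption about the general technique, not a description of gapkmean's code. The name `make_references` is hypothetical, and the array layout that `refs` expects is not stated in this README, so the shape used here is only illustrative.

```python
import numpy as np


def make_references(data, B=10, seed=0):
    """Draw B uniform reference datasets over the bounding box of `data`.

    Hypothetical helper, not part of gapkmean.
    Returns an array of shape (B, n_samples, n_features); this layout is
    an assumption and may not match what the package expects for `refs`.
    """
    rng = np.random.default_rng(seed)
    lo = data.min(axis=0)  # per-feature minimum
    hi = data.max(axis=0)  # per-feature maximum
    # lo and hi broadcast along the last axis, one range per feature.
    return rng.uniform(lo, hi, size=(B,) + data.shape)
```

Each reference dataset has the same number of points and the same per-feature range as the real data but no cluster structure, which is what makes it a useful baseline.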

to install

pip install gapkmean

to use as a module in python

from gapkmean import gap

to find the best K value of K-mean algorithm

# note: `data` should be a numpy.array
gaps, s_k, K = gap.gap_statistic(data, refs=None, B=10, K=range(1,11), N_init=10)
bestKValue = gap.find_optimal_k(gaps, s_k, K)
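For intuition about how a best K can be read off the `gaps` and `s_k` values, here is a sketch of the usual selection rule for the gap statistic: pick the smallest K whose gap is at least the next gap minus the next error term. Whether `gap.find_optimal_k` applies exactly this rule is an assumption, and the function name below is hypothetical.

```python
def first_k_meeting_gap_rule(gaps, s_k, ks):
    """Return the smallest K with gap(K) >= gap(K+1) - s(K+1).

    Hypothetical helper, not part of gapkmean; the three inputs are
    sequences of equal length, aligned by position.
    """
    ks = list(ks)
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - s_k[i + 1]:
            return ks[i]
    # No K satisfied the rule: fall back to the largest K tested.
    return ks[-1]
```

If the fallback is hit, the gap was still rising at the end of the tested range, so it may be worth testing a wider range of K.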

gap's People

Contributors

minddrummer

gap's Issues

Error: No module named gapkmean

The package installed successfully under Python 2.7.12 with scikit-learn==0.17.1 using pip install gapkmean. However, when I tried to import the package from Anaconda, I received the error "No module named gapkmean". The package is listed both by "pip.get_installed_distributions()" and by "conda list".

GapStatistic

Hi minddrummer,
Can I use the same logic to find optimal clusters for Hierarchical Clustering?

python 3.x compatible version

Hi, the current version does not work for Python 3.x.
It seems that replacing "print log" with "print(log)" will make it work for both Python 2.x and Python 3.x.
