Coder Social home page Coder Social logo

faizywaizy / minsizekmeans Goto Github PK

View Code? Open in Web Editor NEW

This project forked from behrouz-babaki/minsizekmeans

0.0 0.0 0.0 220 KB

A python implementation of KMeans clustering with minimum cluster size constraint (Bradley et al., 2000)

License: GNU General Public License v3.0

Shell 7.42% Python 92.58%

minsizekmeans's Introduction

MinSizeKmeans

A python implementation of KMeans clustering with minimum cluster size constraint (Bradley et al., 2000)[1]

Note that the paper suggests using a dedicated minimum-cost network flow algorithm for solving the subproblem, but in my implementation I use standard MIP solvers. My motivation for doing so was this comment from Tobias Achterberg:

If you have a pure network, then the constraint matrix is totally unimodular, so any vertex solution will be integral. And since the simplex algorithm always returns a vertex solution (if one exists), your solution will be integral.

Note that in our experience, a network simplex algorithm is not faster than the dual simplex, which is the reason why Gurobi does not provide a network simplex.

Usage

usage: run_mskmeans.py [-h] [-n NUM_ITER] [-o OUTFILE] datafile k min_size

positional arguments:
  datafile              file containing the coordinates of instances
  k                     number of clusters
  min_size              minimum size of each cluster

optional arguments:
  -h, --help            show this help message and exit
  -n NUM_ITER, --NUM_ITER NUM_ITER
                        run the algorithm for NUM_ITER times and return the best clustering
  -o OUTFILE, --OUTFILE OUTFILE
                        store the result in OUTFILE

To see a run of algorithm on example data, run the script example_run.sh in minsize_kmeans directory.

Dependencies

This program uses Pulp to solve the assignment subproblem. You should also have a MIP solver (Gurobi, Cplex, Cbc, etc.) installed and have it configured to work with Pulp.

Extension

This algorithm can be easily extended to account for a constraint on maximum number of instances in a cluster. To do so, in the network flow formulation of [1], we add constraints on the capacity of arcs that enter the sink node.

To run the algorithm with both minimum and maximum size constraints, use this command:

usage: minmax_kmeans.py [-h] [-n NUM_ITER] [-o OUTFILE] datafile k min_size max_size

positional arguments:
  datafile              file containing the coordinates of instances
  k                     number of clusters
  min_size              minimum size of each cluster
  max_size              maximum size of each cluster

optional arguments:
  -h, --help            show this help message and exit
  -n NUM_ITER, --NUM_ITER NUM_ITER
                        run the algorithm for NUM_ITER times and return the best clustering
  -o OUTFILE, --OUTFILE OUTFILE
                        store the result in OUTFILE

An example run is included in the script example_run.sh in minsize_kmeans directory.

Constrained K-Means with Weighted Instances

Sometimes instead of the size of clusters, we want to constrain the total weight of instances that are in each cluster. Unfortunately, the subproblem for constrained k-means in this case will be NP-complete. However, we can still give this subproblem to a MIP solver and hope that it will be solved in a reasonable time. To run this algorithm, use this command:

usage: weighted_mm_kmeans.py [-h] [-n NUM_ITER] [-o OUTFILE]
                             datafile k weightfile min_weight max_weight

positional arguments:
  datafile              file containing the coordinates of instances
  k                     number of clusters
  weightfile            file containing the weights of instances
  min_weight            minimum total weight for each cluster
  max_weight            maximum total weigth for each cluster

optional arguments:
  -h, --help            show this help message and exit
  -n NUM_ITER, --NUM_ITER NUM_ITER
                        run the algorithm for NUM_ITER times and return the best clustering
  -o OUTFILE, --OUTFILE OUTFILE
                        store the result in OUTFILE

An example run is included in the script example_weighted.sh in minsize_kmeans directory.

References

  1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.

minsizekmeans's People

Contributors

ofir-reich avatar behrouz-babaki avatar pgoelz avatar nattomi avatar voglster avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.