Coder Social home page Coder Social logo

cluster-outliers's Introduction

Streaming k-Clusters with Outliers Approximation

Implement k-cluster with outliers streaming algorithm.

It contains the implementation of two algorithms from these papers:

  • Section 3 from Algorithms for Facility Location Problems with Outliers, Moses Charikar, Samir Khuller, David M. Mount, Giri Narasimhan acm.

  • Algorithm 3.1 from Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity, Richard Matthew McCutchen, Samir Khuller springer. Where the requirement to an offline 4-approximation algorithm has been replaced with the above algorithm.

Building

Dependencies

The core code only requires a C++17/OpenMP capable compilator and the library jsoncpp.

Building maps and graphs requires python3 together with libraries glob, pillow and matplotlib.

Compilation

https://github.com/remi-dupre/cluster-outliers.git
cd cluster-outliers
make release

Usage

Running

# Use every logical threads as a default, but it seems to behave better without
# any hyperthreading going on.
export OMP_NUM_THREADS=4

# Cover the graph saved in dataset.txt using 50 clusters and 10 outliers.
./clustering dataset.txt 50 10

The input file must contain one vertex per line, beeing described with three columns separated with spaces. The first columns is ignored, and the second and last columns contains longitude and latitude of the point.

Testing

# Build 2D-maps in the directory maps/ from existing logs
scripts/map.py

# Plot a benchmark about existing logs
scripts/plot.py

Results

All tests were run on this dataset containing the geolocalisation of 1,000,000 tweets around the world.

Here are measures for several parameters:

  • clusters: the number of allowed clusters in the solution
  • outliers: number of allowed outliers
  • alpha: disks growth parameter in the algorithm
  • cost: max radius of the solution
  • min opt: a lower bound to the optimal solution
  • max opt: an upper bound to the optimal solution
  • time: run time of the algorithm
clusters outliers α cost min opt max ratio time
20 10 1.1 0.8282 0.3125 2.6502 2.6360s
20 50 1.1 0.6900 0.1992 3.4637 18.3627s
20 10 4 1.4160 0.3125 4.5312 1.9803s
20 20 4 0.8843 0.2666 3.3169 2.4288s
20 30 4 0.8665 0.2373 3.6514 2.4714s
20 40 4 1.3749 0.2178 6.3136 2.6495s
20 50 4 1.2991 0.1992 6.5210 3.1103s

For parameters clusters = 20, outliers = 50 and α = 1.1 the output looks like:

Result with k=20 z=50 Animation with k=20 z=50

The bound on the optimal is computed by finding (clusters + outliers + 1) disks that don't intersect, this is how it looks like using previous parameters:

Counter example with k=20 z=50

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.