kjahan / k_means

A Python implementation of k-means clustering algorithm

Home Page: http://www.kazemjahanbakhsh.com/codes/k-means.html

Python 100.00%
machine-learning clustering python3 kmeans-algorithm k-means-clustering k-means

k_means's Introduction

K-Means

General description

This project is a Python implementation of the k-means clustering algorithm.

Requirements

Set up the conda environment (named kmeans) using the environment.yml file:

conda env create -f environment.yml

Activate conda environment:

conda activate kmeans

(On macOS, run unset PYTHONPATH.)

Input

A list of points in two-dimensional space where each point is represented by a latitude/longitude pair.

Output

The clusters of points. By default, the computed clusters are stored in a CSV file, output.csv. You can specify a different output filename with the --output option.

How to run:

python -m src.run --input YOUR_LOC_FILE --clusters CLUSTERS_NO

Note that the runner expects the location file to be in the data folder.

Run tests

python -m pytest tests/

Technical details

This project is an implementation of the k-means algorithm. It starts with a random point and then successively chooses k-1 other points, each the farthest from those already chosen. It uses these k points as the initial cluster centroids and assigns each input point to the cluster with the closest centroid.

Next, it recomputes the centroids as the means of the resulting clusters and repeats the assignment step, finding which cluster each point now belongs to.

The program repeats these two steps until the cluster centroids converge and no longer change. See the following link to read more about this project and to see some real examples of running the k-means algorithm:

K-Means algorithm description
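The steps described above can be sketched in a few functions. This is a minimal illustration, not the project's own code: the function names are hypothetical, "farthest from the previous ones" is read here as the point whose distance to its nearest already-chosen centroid is largest, and keeping the old centroid for an empty cluster is an assumption.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def init_centroids(points, k, rng):
    """Pick one random point, then repeatedly add the farthest remaining point."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Assumption: "farthest" means largest distance to the nearest chosen centroid.
        farthest = max(points, key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def assign(points, centroids):
    """Group points by the index of their closest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        clusters[idx].append(p)
    return clusters


def recompute(clusters, centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid (assumption)."""
    updated = []
    for cluster, old in zip(clusters, centroids):
        if cluster:
            n = len(cluster)
            updated.append((sum(p[0] for p in cluster) / n,
                            sum(p[1] for p in cluster) / n))
        else:
            updated.append(old)
    return updated


def k_means(points, k, seed=0, max_iter=100):
    """Alternate assignment and mean update until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = init_centroids(points, k, rng)
    clusters = assign(points, centroids)
    for _ in range(max_iter):
        updated = recompute(clusters, centroids)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
        clusters = assign(points, centroids)
    return clusters, centroids
```

Calling k_means(points, 2) on a list of (x, y) tuples returns the clusters and their centroids. The order of the clusters depends on which point is drawn first, so compare results after sorting.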

To deactivate the conda environment:

conda deactivate

k_means's People

Contributors

kjahan

k_means's Issues

Separation of concerns?

It appears as if clustering.py inextricably combines the k-means clustering algorithm with matplotlib graphics. This has several disadvantages:

  • Makes it difficult to use another GUI
  • Makes it difficult to use the clustering standalone (e.g. as a tool for analyses)
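One possible shape for such a separation is sketched below. The function names are hypothetical and this is not how clustering.py is currently organized. The computation returns plain data, and the plotting function is the only place that imports matplotlib.

```python
def assign_labels(points, centroids):
    """Pure computation: index of the nearest centroid for each point."""
    labels = []
    for x, y in points:
        # Squared distances are enough to find the nearest centroid.
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def plot_clusters(points, labels):
    """Presentation only; matplotlib is imported here so assign_labels stays GUI-free."""
    import matplotlib.pyplot as plt

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    plt.scatter(xs, ys, c=labels)  # colour each point by its cluster index
    plt.show()
```

With this split, an analysis script can call assign_labels alone, and another front end can replace plot_clusters without touching the algorithm.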

error showing pandas module not installed

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/jsnagavishnusai/Downloads/k_means-master/src/run.py", line 3, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
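The paths in the traceback point at the Command Line Tools Python 3.9 rather than a conda environment, which suggests the kmeans environment from the Requirements section was not active. A small check like the following can confirm which interpreter is running and whether pandas is visible to it. The helper name is hypothetical, and it is an assumption that environment.yml provides pandas.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Shows which Python is in use and whether pandas can be imported from it.
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["pandas"]))
```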

Calculating distance more quickly

Rather than using math functions (such as math.sqrt and math.pow) to calculate Euclidean distance, use the built-in operators: x**2 and x**0.5

The code object disassembly shows fewer instructions in the latter:

>>> def foper(x1, y1, x2, y2): #Using builtins
...     return ((x2-x1)**2 + (y2-y1)**2)**0.5
... 
>>> import dis
>>> import math
>>> def ffunc(x1, y1, x2, y2): #Using functions from math module
...     return math.sqrt(math.pow(x2-x1,2) + math.pow(y2-y1,2))
... 
>>> dis.dis(foper)
  2           0 LOAD_FAST                2 (x2)
              3 LOAD_FAST                0 (x1)
              6 BINARY_SUBTRACT     
              7 LOAD_CONST               1 (2)
             10 BINARY_POWER        
             11 LOAD_FAST                3 (y2)
             14 LOAD_FAST                1 (y1)
             17 BINARY_SUBTRACT     
             18 LOAD_CONST               1 (2)
             21 BINARY_POWER        
             22 BINARY_ADD          
             23 LOAD_CONST               2 (0.5)
             26 BINARY_POWER        
             27 RETURN_VALUE        
>>> dis.dis(ffunc)
  2           0 LOAD_GLOBAL              0 (math)
              3 LOAD_ATTR                1 (sqrt)
              6 LOAD_GLOBAL              0 (math)
              9 LOAD_ATTR                2 (pow)
             12 LOAD_FAST                2 (x2)
             15 LOAD_FAST                0 (x1)
             18 BINARY_SUBTRACT     
             19 LOAD_CONST               1 (2)
             22 CALL_FUNCTION            2
             25 LOAD_GLOBAL              0 (math)
             28 LOAD_ATTR                2 (pow)
             31 LOAD_FAST                3 (y2)
             34 LOAD_FAST                1 (y1)
             37 BINARY_SUBTRACT     
             38 LOAD_CONST               1 (2)
             41 CALL_FUNCTION            2
             44 BINARY_ADD          
             45 CALL_FUNCTION            1
             48 RETURN_VALUE        
>>> 

This can also be shown empirically by the timeit results:

>>> import timeit
>>> toper = timeit.Timer("((23-76)**2 + (45-43)**2)**0.5")
>>> tfunc = timeit.Timer("math.sqrt(math.pow(23-76,2) + math.pow(45-43,2))", "import math")
>>> toper.timeit()
0.11890888214111328
>>> tfunc.timeit()
0.4799330234527588

In this measurement the operator version takes about 75% less time, i.e. it is roughly four times faster.
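One caveat about the timing above: the timed statements use only literal operands, and CPython can fold constant arithmetic at compile time, so the operator measurement may not include the arithmetic itself. A sketch of a comparison that passes the coordinates as arguments is shown below. The function names are illustrative, math.hypot is included as a further standard-library option, and no timings are quoted because they vary by machine and Python version.

```python
import math
import timeit


def dist_oper(x1, y1, x2, y2):
    """Distance using the ** operator."""
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5


def dist_func(x1, y1, x2, y2):
    """Distance using math.sqrt and math.pow."""
    return math.sqrt(math.pow(x2 - x1, 2) + math.pow(y2 - y1, 2))


def dist_hypot(x1, y1, x2, y2):
    """Distance using math.hypot."""
    return math.hypot(x2 - x1, y2 - y1)


if __name__ == "__main__":
    args = (23, 45, 76, 43)
    for fn in (dist_oper, dist_func, dist_hypot):
        # Arguments are passed at call time, so nothing can be folded away.
        seconds = timeit.timeit(lambda: fn(*args), number=200_000)
        print(f"{fn.__name__}: {seconds:.3f} s")
```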

Hardcoded path to CSV input file

The file main.py has a hardcoded path to this input file:
/home/kazem/Downloads/Hackathon/drinkingFountains.csv

It should take the path as an input parameter.
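A minimal sketch of taking the path from the command line with argparse follows. The option names mirror the ones shown under "How to run" (--input, --clusters, --output). The build_parser helper and the help strings are hypothetical, not the project's actual runner.

```python
import argparse


def build_parser():
    """Command-line options for a k-means runner (illustrative only)."""
    parser = argparse.ArgumentParser(description="Run k-means on a CSV file of locations.")
    parser.add_argument("--input", required=True, help="CSV file containing the points")
    parser.add_argument("--clusters", type=int, required=True, help="number of clusters (k)")
    parser.add_argument("--output", default="output.csv", help="file to write the clusters to")
    return parser


# Example: parse an explicit argument list instead of a path baked into the source.
args = build_parser().parse_args(["--input", "drinkingFountains.csv", "--clusters", "3"])
print(args.input, args.clusters, args.output)
```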

clustering.py problem

The file clustering.py has these lines of code:

def __init__(self, geo_locs_, k_):
    self.geo_locations = geo_locs_
    self.k = k_
    self.clusters = []  # clusters of nodes
    self.means = []     # means of clusters
    self.debug = True   # debug flag

Here self.clusters = [] makes clusters a list.
Should self.clusters be a dict instead?
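The issue does not show how the rest of clustering.py uses self.clusters, so the comparison below only illustrates the two options with made-up data. A list of lists works when cluster ids are exactly 0..k-1, while a dict makes the ids explicit.

```python
from collections import defaultdict

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0)]
labels = [0, 0, 1]  # hypothetical cluster index for each point
k = 2

# List of lists: the position in the outer list is the cluster index.
clusters_list = [[] for _ in range(k)]
for point, label in zip(points, labels):
    clusters_list[label].append(point)

# Dict: keys are explicit cluster ids, so they need not be 0..k-1.
clusters_dict = defaultdict(list)
for point, label in zip(points, labels):
    clusters_dict[label].append(point)
```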
