kjahan / k_means

A Python implementation of k-means clustering algorithm

Home Page: http://www.kazemjahanbakhsh.com/codes/k-means.html

Python 100.00%

machine-learning clustering python3 kmeans-algorithm k-means-clustering k-means

k_means's Introduction

K-Means

General description

This project is a Python implementation of the k-means clustering algorithm.

Requirements

Set up the conda environment (named kmeans) from the environment.yml file:

conda env create -f environment.yml

Activate conda environment:

conda activate kmeans

(Run unset PYTHONPATH on Mac OS)

Input

A list of points in two-dimensional space where each point is represented by a latitude/longitude pair.

Output

The clusters of points. By default, the computed clusters are stored in a CSV file, output.csv. You can specify a different output filename with the --output argument.

How to run:

python -m src.run --input YOUR_LOC_FILE --clusters CLUSTERS_NO

Note that the runner expects the location file to be in the data folder.

Run tests

python -m pytest tests/

Technical details

This project implements the k-means algorithm. It starts with a random point and then successively chooses k-1 further points, each the farthest from those already chosen. It uses these k points as the initial cluster centroids and assigns each input point to the cluster with the closest centroid.
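The initialization described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that "farthest" means the point whose distance to its nearest already-chosen centroid is largest, and the names squared_distance and farthest_first_centroids are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_first_centroids(points, k, seed=None):
    """Pick k initial centroids: one random point, then k-1 farthest points.

    Assumption: "farthest" is read as maximizing the distance to the
    nearest centroid chosen so far.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # For each point, measure the distance to its nearest chosen centroid,
        # then take the point for which that distance is largest.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (-10.0, 5.0)]
initial = farthest_first_centroids(points, k=3, seed=0)
print(initial)
```

Squared distances are enough here because only the ordering of distances matters, not their actual values.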

Next, it recomputes the centroids as the means of the resulting clusters and repeats the assignment step, finding which cluster each point now belongs to.

The program repeats these two steps until the cluster centroids converge and no longer change. See the following link to read more about this project and see some real examples of running the k-means algorithm:

K-Means algorithm description
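The two alternating steps can be sketched as a short loop. This is a hedged illustration under stated assumptions rather than the code in src: points are treated as planar (x, y) pairs with Euclidean distance, an empty cluster keeps its previous centroid, and the function names and the max_iterations cap are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign_clusters(points, centroids):
    """Group points by the index of their closest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def recompute_centroids(clusters, old_centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = []
    for cluster, old in zip(clusters, old_centroids):
        if cluster:
            n = len(cluster)
            new_centroids.append((sum(p[0] for p in cluster) / n,
                                  sum(p[1] for p in cluster) / n))
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update until the centroids stop changing."""
    centroids = list(centroids)
    clusters = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        updated = recompute_centroids(clusters, centroids)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
        clusters = assign_clusters(points, centroids)
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 12.0)])
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

The iteration cap is only a safeguard for the sketch; the description above stops purely on convergence.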

To deactivate the conda environment:

conda deactivate

k_means's Issues

Separation of concerns?

It appears as if clustering.py inextricably combines the k-means clustering algorithm with matplotlib graphics. This has several disadvantages:

Makes it difficult to use another GUI
Makes it difficult to use the clustering on its own (e.g. as a tool for analyses)
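One possible shape for such a separation is sketched below. It does not reflect the current layout of clustering.py; cluster_points and plot_clusters are hypothetical names, and matplotlib is imported only inside the plotting function so the computation can run without it.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def cluster_points(points, centroids):
    """Pure computation: map each centroid index to its assigned points."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(clusters, key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def plot_clusters(clusters):
    """Optional presentation layer; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    for index, members in clusters.items():
        if members:
            xs, ys = zip(*members)
            plt.scatter(xs, ys, label=f"cluster {index}")
    plt.legend()
    plt.show()


# The computation can be used on its own, for example in an analysis script.
result = cluster_points([(0, 0), (1, 1), (9, 9)], centroids=[(0, 0), (10, 10)])
print(result)  # {0: [(0, 0), (1, 1)], 1: [(9, 9)]}
```

With this split, another front end only needs to consume the returned mapping, and plot_clusters stays one optional consumer among others.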

error showing pandas module not installed

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/jsnagavishnusai/Downloads/k_means-master/src/run.py", line 3, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

How to show the picture?

Pictures do not show up.

Calculating distance more quickly

Rather than using math functions (such as math.sqrt and math.pow) to calculate Euclidean distance, use the built-in operators: x**2 and x**0.5.

The code object disassembly shows fewer instructions in the latter:

>>> import dis
>>> import math
>>> def foper(x1, y1, x2, y2):  # Using built-in operators
...     return ((x2-x1)**2 + (y2-y1)**2)**0.5
... 
>>> def ffunc(x1, y1, x2, y2):  # Using functions from the math module
...     return math.sqrt(math.pow(x2-x1,2) + math.pow(y2-y1,2))
... 
>>> dis.dis(foper)
  2           0 LOAD_FAST                2 (x2)
              3 LOAD_FAST                0 (x1)
              6 BINARY_SUBTRACT     
              7 LOAD_CONST               1 (2)
             10 BINARY_POWER        
             11 LOAD_FAST                3 (y2)
             14 LOAD_FAST                1 (y1)
             17 BINARY_SUBTRACT     
             18 LOAD_CONST               1 (2)
             21 BINARY_POWER        
             22 BINARY_ADD          
             23 LOAD_CONST               2 (0.5)
             26 BINARY_POWER        
             27 RETURN_VALUE        
>>> dis.dis(ffunc)
  2           0 LOAD_GLOBAL              0 (math)
              3 LOAD_ATTR                1 (sqrt)
              6 LOAD_GLOBAL              0 (math)
              9 LOAD_ATTR                2 (pow)
             12 LOAD_FAST                2 (x2)
             15 LOAD_FAST                0 (x1)
             18 BINARY_SUBTRACT     
             19 LOAD_CONST               1 (2)
             22 CALL_FUNCTION            2
             25 LOAD_GLOBAL              0 (math)
             28 LOAD_ATTR                2 (pow)
             31 LOAD_FAST                3 (y2)
             34 LOAD_FAST                1 (y1)
             37 BINARY_SUBTRACT     
             38 LOAD_CONST               1 (2)
             41 CALL_FUNCTION            2
             44 BINARY_ADD          
             45 CALL_FUNCTION            1
             48 RETURN_VALUE        
>>>

This can also be shown empirically by the timeit results:

>>> import timeit
>>> toper = timeit.Timer("((23-76)**2 + (45-43)**2)**0.5")
>>> tfunc = timeit.Timer("math.sqrt(math.pow(23-76,2) + math.pow(45-43,2))", "import math")
>>> toper.timeit()
0.11890888214111328
>>> tfunc.timeit()
0.4799330234527588

With these timings, the operator version takes about 75% less time (roughly four times faster).
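One caveat about the measurement: the timed statements use only literal operands, and CPython is likely to fold such constant expressions at compile time, so the operator timing may mostly measure loading a precomputed value. A fairer comparison, sketched below, reads the coordinates from variables. The variable names, the repeat counts and the extra math.hypot candidate are choices made for this sketch, and the resulting numbers will differ by machine and Python version.

```python
import math
import timeit

# Coordinates are bound to names in the setup so that the timed expressions
# cannot be folded into constants at compile time.
setup = "import math; x1, y1, x2, y2 = 23.0, 45.0, 76.0, 43.0"

statements = {
    "operators": "((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5",
    "math.sqrt/math.pow": "math.sqrt(math.pow(x2 - x1, 2) + math.pow(y2 - y1, 2))",
    "math.hypot": "math.hypot(x2 - x1, y2 - y1)",
}

# Take the best of several runs to reduce noise from other processes.
timings = {
    name: min(timeit.repeat(stmt, setup=setup, number=200_000, repeat=3))
    for name, stmt in statements.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:20s} {seconds:.4f} s")

# Sanity check that the candidates compute the same distance.
x1, y1, x2, y2 = 23.0, 45.0, 76.0, 43.0
by_operators = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
by_hypot = math.hypot(x2 - x1, y2 - y1)
```

If only the nearest centroid is needed, the square root can be dropped entirely, since comparing squared distances gives the same ordering.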

def __init__(self, geo_locs_, k_):
    self.geo_locations = geo_locs_
    self.k = k_
    self.clusters = []  # clusters of nodes
    self.means = []     # means of clusters
    self.debug = True   # debug flag

self.clusters = [] initializes clusters as a list.
Should self.clusters be a dict instead?