baylorml
Machine learning library supporting convolutional neural networks.
Used to contain code for fast k-means, but that has moved to https://github.com/ghamerly/fast-kmeans
License: MIT License
I input a dataset and cluster it into k parts. Next, I want to know which data points belong to each cluster, and also the cluster centers. I couldn't find the output parameters I need, or the variables used to compute them.
Thank you very much.
Report from Peter Jaeckel:
I don't know if the following matters in the context of its application, but I believe the function void HamerlyKmeans::update_bounds(int startNdx, int endNdx) is not always guaranteed to find the second-largest center movement. Here is the first half of that function:
https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L143
Consider the case where element 0 is the largest value in the array centerMovement[]. In that case, secondLongest == longest at the end of the above loop. This may not matter much in practice: I suppose the lower bound for all data points assigned to the furthest-moving centre is then simply lower than it needs to be, and the algorithm recovers from that later on. Here is the second half of the function, where the subtraction of the longest instead of the second-longest movement happens in that case:
https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L157
To fix it, if you wish to do so, you could take the first two values, assign them (sorted by size) to longest and secondLongest, and start the loop at the third element of the array. In that case, of course, you may want to safeguard that the array has at least two elements.
I have a fairly big dataset (100M rows × 10 columns), and as I calculated, it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some tests I realised that:
- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187
I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads to make it run in parallel:
float sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    // here comes the loop
    return sum_distribution;
}
But those parallel-running functions would all have to read from the same dist2 array and from x. Maybe that is why one clustering loop takes 5-6 s and cannot be parallelised to speed things up. Before I start digging into the topic, I just wanted to ask your opinion.
One other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?
if (dist2[i].first > max_dist) {
    max_dist = dist2[i].first;
}
As far as I can see, max_dist isn't used anywhere.
I'm trying to use the code in my app, but I'm getting the following result from the data in smallDataset.txt:
cluster: 2 99.7490
cluster: 0 88.3680
cluster: 1 12.2960
cluster: 2 55.6740
cluster: 1 34.1620
cluster: 0 14.0850
cluster: 1 20.7000
cluster: 2 17.0020
cluster: 1 16.7000
cluster: 1 71.9680
cluster: 0 66.1370
cluster: 0 25.9600
cluster: 0 11.7440
cluster: 0 98.7920
99.7490, 88.3680 and 12.2960 belong to the same record, so why are they in different clusters?
Here is the code:
Dataset *x = NULL;

// Get the file name
std::string dataFileName;
std::cin >> dataFileName;

// Open the data file
std::ifstream input(dataFileName.c_str());
if (!input) {
    std::cerr << "Unable to open data file: " << dataFileName << std::endl;
    return 1;
}

// Read the parameters
int n, d;
input >> n >> d;

// Allocate storage
delete x;
x = new Dataset(n, d);

// Read the data values directly into the dataset
for (int i = 0; i < n * d; ++i) {
    input >> x->data[i];
}

// Print success message
std::cout << "loaded dataset " << dataFileName << ": n = " << n << ", d = " << d << std::endl;

int numThreads = 1;
int maxIterations = 80000;
Kmeans *algorithm = NULL;
unsigned short *assignment = NULL;

// Number of means
unsigned short k = 3;
algorithm = new HamerlyKmeans();

Dataset *c = NULL;
c = init_centers_kmeanspp_v2(*x, k);

delete [] assignment;
assignment = new unsigned short[x->n];
for (int i = 0; i < x->n; ++i) {
    assignment[i] = 0;
}
assign(*x, *c, assignment);
delete c;

// Make a working copy of the assignments
unsigned short *workingAssignment = new unsigned short[x->n];
std::copy(assignment, assignment + x->n, workingAssignment);

algorithm->initialize(x, k, workingAssignment, numThreads);
algorithm->run(maxIterations);

int cluster;
for (int i = 0; i < 14; ++i) {
    cluster = algorithm->getAssignment(i);
    printf("cluster: %d %.4f \n", cluster, x->data[i]);
}
I guess the purpose of this loop's https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L217 check is to prevent a point from being chosen more than once as a center.
Let's say we have 10 points in our dataset, and we init the centers with init_centers_kmeanspp_v2.
Suppose https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L218 results in unique = FALSE, so we have to do the 3rd draw again:
// unique is instantiated as TRUE, outside(!) the loop
unique = unique && chosen_pts[2] != chosen_pts[0]; // TRUE && FALSE (2 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1]; // FALSE && TRUE (2 != 5) = FALSE
// !FALSE -> new loop iteration
// unique is still FALSE from the previous iteration
unique = unique && chosen_pts[2] != chosen_pts[0]; // FALSE && TRUE (8 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1]; // FALSE && TRUE (8 != 5) = FALSE
// !FALSE -> new loop iteration
Maybe I'm getting it wrong, but with some mini datasets (where the chance of picking the same center again is larger) I often run into an endless loop.