baylorml
Machine learning library supporting convolutional neural networks.
Used to contain code for fast k-means, but that has moved to https://github.com/ghamerly/fast-kmeans
License: MIT License
I input a dataset and cluster it into k parts. Next, I want to know which data points belong to each cluster, and also the cluster centers. I couldn't find the output parameters I need, or the variables used to compute them.
Thank you very much.
Report from Peter Jaeckel:
I don't know if the following matters in the context of its application, but I believe the function void HamerlyKmeans::update_bounds(int startNdx, int endNdx) is not always guaranteed to find the second-largest center movement. Here is the first half of that function:
https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L143
Consider the case where element 0 is the largest value in the array centerMovement[]. In that case, secondLongest == longest at the end of the above loop. This may not matter much in practice: I suppose the lower bound for all data points assigned to the furthest-moving centre is then simply lower than it needs to be, and the algorithm recovers from that later on. Here is the second half of the function, where the subtraction of the longest instead of the second-longest movement happens in that case:
https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L157
To fix it, if you wish to do so, you could take the first two values, assign them (sorted by size) to longest and secondLongest, and start the loop at the third element of the array. In that case, of course, you may want to safeguard that the array has at least two elements.
I have a fairly big dataset (100M rows × 10 columns), and as I calculated, it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some tests I realised that:
- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187
I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads to make it run in parallel:
float sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    // here comes the loop
    return sum_distribution;
}
But those parallel-running functions would all have to read from the same dist2 array and from x. Maybe that is why one clustering loop takes 5-6 s and cannot be parallelised to speed things up. Before I start digging into the topic, I just wanted to ask your opinion.
One other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?
if (dist2[i].first > max_dist) {
    max_dist = dist2[i].first;
}
As far as I can see, max_dist isn't used anywhere.
I'm trying to use the code in my app, but I'm getting the following result from the data in smallDataset.txt:
cluster: 2 99.7490
cluster: 0 88.3680
cluster: 1 12.2960
cluster: 2 55.6740
cluster: 1 34.1620
cluster: 0 14.0850
cluster: 1 20.7000
cluster: 2 17.0020
cluster: 1 16.7000
cluster: 1 71.9680
cluster: 0 66.1370
cluster: 0 25.9600
cluster: 0 11.7440
cluster: 0 98.7920
99.7490, 88.3680 and 12.2960 belong to the same record, so why are they in different clusters?
Here is the code:
Dataset *x = NULL;

// Get the file name
std::string dataFileName;
std::cin >> dataFileName;

// Open the data file
std::ifstream input(dataFileName.c_str());
if (!input) {
    std::cerr << "Unable to open data file: " << dataFileName << std::endl;
    return 1;
}

// Read the parameters
int n, d;
input >> n >> d;

// Allocate storage
delete x;
x = new Dataset(n, d);

// Read the data values directly into the dataset
for (int i = 0; i < n * d; ++i) {
    input >> x->data[i];
}

// Print success message
std::cout << "loaded dataset " << dataFileName << ": n = " << n << ", d = " << d << std::endl;

int numThreads = 1;
int maxIterations = 80000;
Kmeans *algorithm = NULL;
unsigned short *assignment = NULL;

// Number of means
unsigned short k = 3;
algorithm = new HamerlyKmeans();

Dataset *c = NULL;
c = init_centers_kmeanspp_v2(*x, k);

delete [] assignment;
assignment = new unsigned short[x->n];
for (int i = 0; i < x->n; ++i) {
    assignment[i] = 0;
}
assign(*x, *c, assignment);
delete c;

// Make a working copy of the assignments
unsigned short *workingAssignment = new unsigned short[x->n];
std::copy(assignment, assignment + x->n, workingAssignment);

algorithm->initialize(x, k, workingAssignment, numThreads);
algorithm->run(maxIterations);

int cluster;
for (int i = 0; i < 14; ++i) {
    cluster = algorithm->getAssignment(i);
    printf("cluster: %d %.4f \n", cluster, x->data[i]);
}
I guess the purpose of this loop's https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L217 check is to prevent a point from being chosen more than once as a center.
Let's say we have 10 points in our dataset, and we init the centers with init_centers_kmeanspp_v2.
Suppose https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L218 results in unique = FALSE, so we have to do the 3rd draw again:
// unique is instantiated as TRUE, outside(!) the loop
unique = unique && chosen_pts[2] != chosen_pts[0]; // TRUE && FALSE (2 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1]; // FALSE && TRUE (2 != 5) = FALSE
// !FALSE -> new loop iteration
// unique is still FALSE from the previous iteration
unique = unique && chosen_pts[2] != chosen_pts[0]; // FALSE && TRUE (8 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1]; // FALSE && TRUE (8 != 5) = FALSE
// !FALSE -> new loop iteration
Maybe I'm getting it wrong, but with some mini datasets (where the chance of picking the same center again is larger) I often run into an endless loop.