baylorcs / baylorml

Machine learning library currently supporting convolutional neural networks and fast k-means clustering.

License: MIT License

Makefile 0.05% C++ 98.44% CMake 0.21% C 1.29%

baylorml's People

Contributors

acu192, ghamerly, kno10, petrrysavy

baylorml's Issues

how to get the cluster centers after clustering

I input a set of data and cluster it into k parts. Next, I want to know which data points belong to each cluster, and what the cluster centers are. I could not find the output parameters I need. Which variables hold them, or which are used to calculate them?

Thank you very much.
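
A possible starting point, offered as a sketch rather than as the library's API: the code in another issue on this page reads the membership of record i with algorithm->getAssignment(i). Given those assignments, the centers can be recomputed as the mean of the records in each cluster. The helper name compute_centers and the flat std::vector layout below are assumptions made for illustration. Whether the library also exposes the centers directly is not established here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of baylorml: recompute the k cluster centers
// as the mean of the records assigned to each cluster.
// `data` holds n records of d values each, stored record after record.
std::vector<double> compute_centers(const std::vector<double> &data,
                                    const std::vector<unsigned short> &assignment,
                                    std::size_t d, std::size_t k) {
    assert(d > 0 && data.size() == assignment.size() * d);
    std::vector<double> centers(k * d, 0.0);
    std::vector<std::size_t> counts(k, 0);

    // Accumulate the per-cluster sums and member counts.
    for (std::size_t i = 0; i < assignment.size(); ++i) {
        const std::size_t c = assignment[i];
        assert(c < k);
        ++counts[c];
        for (std::size_t j = 0; j < d; ++j) {
            centers[c * d + j] += data[i * d + j];
        }
    }
    // Turn the sums into means.
    for (std::size_t c = 0; c < k; ++c) {
        if (counts[c] == 0) continue;  // empty cluster: its center stays at zero
        for (std::size_t j = 0; j < d; ++j) {
            centers[c * d + j] /= static_cast<double>(counts[c]);
        }
    }
    return centers;
}
```

The members of cluster c are then simply the records i for which the assignment equals c.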

hamerly update_bounds: may not be as tight as possible

Report from Peter Jaeckel:


I don't know if the following matters in the context of its application, but I believe the function void HamerlyKmeans::update_bounds(int startNdx, int endNdx) is not always guaranteed to find the
second-furthest center movement. Here is the first half of that function:

https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L143

Consider the case that element 0 is the largest in the array centerMovement[]. In that case, secondLongest == longest at the end of the above loop. This may not matter much in the context. I guess the
lower bound for all data points assigned to the furthest moving centre is then simply lower than it need be, but the algorithm later on recovers from that. Here's the second half of the function where the
subtraction of the longest instead of the second-longest happens in that case:

https://github.com/BaylorCS/baylorml/blob/master/fast_kmeans/hamerly_kmeans.cpp#L157

To fix it, if you wish to do so, you could take the first two values, assign them (sorted by size) to longest and secondLongest, and start the loop at the third element of the array. You would then,
of course, want to safeguard against the array having fewer than two elements.
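
The proposed fix can be sketched as follows. This is a standalone illustration, not the code from hamerly_kmeans.cpp: the function name two_largest and the use of std::vector are assumptions, and only the variable names longest, secondLongest and centerMovement come from the report.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {longest, secondLongest} for an array with at least two elements.
std::pair<double, double> two_largest(const std::vector<double> &centerMovement) {
    assert(centerMovement.size() >= 2);  // the safeguard mentioned in the report

    // Seed both values from the first two elements, sorted by size.
    double longest = centerMovement[0];
    double secondLongest = centerMovement[1];
    if (secondLongest > longest) {
        std::swap(longest, secondLongest);
    }

    // Scan the rest of the array, starting at the third element.
    for (std::size_t j = 2; j < centerMovement.size(); ++j) {
        const double m = centerMovement[j];
        if (m > longest) {
            secondLongest = longest;
            longest = m;
        } else if (m > secondLongest) {
            secondLongest = m;
        }
    }
    return {longest, secondLongest};
}
```

With this seeding, an array whose largest value sits at element 0 no longer ends up with secondLongest == longest unless the two values really are equal.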

multithreading center init

I have a fairly big dataset (100m * 10), and by my calculation it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some testing I realised that:
- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

I have to admit I don't know much about multithreaded programming, but I think the loop could be split across several threads so that it runs in parallel.

float sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    //here comes the loop
    return sum_distribution;
}

But those functions, running in parallel, would all have to read from the same dist2 array and x. Maybe this is why one cluster loop takes 5-6 s, and why it cannot be run in parallel and sped up.
Before I start digging into the topic, I just wanted to ask your opinion.
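
On the concern about shared reads: several threads may read the same array concurrently without synchronisation as long as no thread writes to it during that time. A minimal sketch of the chunked approach follows, assuming the loop only needs a sum over the distances. The names sum_range and parallel_sum are hypothetical, and dist2 is simplified here to a std::vector<double> instead of the pair<double, int> array in the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sums dist2[from..to). It only reads the shared array and writes the result
// into a slot that belongs to exactly one thread.
void sum_range(const std::vector<double> &dist2, std::size_t from,
               std::size_t to, double &out) {
    double s = 0.0;
    for (std::size_t i = from; i < to; ++i) {
        s += dist2[i];
    }
    out = s;
}

// Splits the index range into numThreads chunks and sums them in parallel.
double parallel_sum(const std::vector<double> &dist2, std::size_t numThreads) {
    assert(numThreads > 0);
    std::vector<double> partial(numThreads, 0.0);  // one private slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (dist2.size() + numThreads - 1) / numThreads;

    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t from = std::min(t * chunk, dist2.size());
        const std::size_t to = std::min(from + chunk, dist2.size());
        workers.emplace_back(sum_range, std::cref(dist2), from, to,
                             std::ref(partial[t]));
    }
    for (std::thread &w : workers) {
        w.join();
    }

    // Combine the partial sums on the calling thread.
    double total = 0.0;
    for (double p : partial) {
        total += p;
    }
    return total;
}
```

Because the additions happen in a different order than in a single sequential loop, the floating-point result may differ in the last digits. If the real loop also updates dist2[i], that remains safe as long as each thread writes only to indices inside its own chunk.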

Another thing:
why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?

            if (dist2[i].first > max_dist) {
                max_dist = dist2[i].first;
            }

As far as I can see, max_dist is not used anywhere.

members of a record in different cluster

I am trying to use the code in my app, but I get the following result for the data in smallDataset.txt:

cluster: 2  99.7490 
cluster: 0  88.3680 
cluster: 1  12.2960 
cluster: 2  55.6740 
cluster: 1  34.1620 
cluster: 0  14.0850 
cluster: 1  20.7000 
cluster: 2  17.0020 
cluster: 1  16.7000 
cluster: 1  71.9680 
cluster: 0  66.1370 
cluster: 0  25.9600 
cluster: 0  11.7440 
cluster: 0  98.7920 

99.7490, 88.3680 and 12.2960 belong to the same record, so why are they in different clusters?

Here is the code:

   Dataset *x = NULL;

    // Get the file name
    std::string dataFileName;
    std::cin >> dataFileName;

    // Open the data file
    std::ifstream input(dataFileName.c_str());
    if (! input) {
        std::cerr << "Unable to open data file: " << dataFileName << std::endl;
    }

    // Read the parameters
    int n, d;
    input >> n >> d;

    // Allocate storage
    delete x;

    x = new Dataset(n, d);

    // Read the data values directly into the dataset
    for (int i = 0; i < n * d; ++i) {
        input >> x->data[i];
    }

    // Clean up and print success message
    std::cout << "loaded dataset " << dataFileName << ": n = " << n << ", d = " << d << std::endl;
    int numThreads = 1; 
    int maxIterations = 80000;

    Kmeans *algorithm = NULL;
    unsigned short *assignment = NULL;
    //Number of means
    unsigned short k = 3;
    algorithm = new HamerlyKmeans();
    Dataset *c = NULL;

    c = init_centers_kmeanspp_v2(*x, k);

    delete [] assignment;
    assignment = new unsigned short[x->n];
    for (int i = 0; i < x->n; ++i) {
        assignment[i] = 0;
    }
    assign(*x, *c, assignment);
    delete c;

    // Make a working copy of the initial assignments
    unsigned short *workingAssignment = new unsigned short[x->n];
    std::copy(assignment, assignment + x->n, workingAssignment);


    algorithm->initialize(x, k, workingAssignment, numThreads);
    algorithm->run(maxIterations);


    int cluster;
    for (int i = 0; i < 14; ++i)
    {

        cluster = algorithm->getAssignment(i);

        printf("cluster: %d  %.4f \n", cluster, x->data[i]);

    }
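
One likely explanation, stated as an assumption: the loading loop above reads n * d values one after another into x->data, so each record occupies d consecutive entries, while getAssignment(i) presumably expects a record index between 0 and n - 1. The print loop uses the same i for both, so each line pairs the cluster of record i with the i-th raw value, not with a value from record i. The sketch below prints one line per record instead. FlatDataset and format_records are hypothetical stand-ins, not library types.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's Dataset: n records of d values,
// stored record after record in one flat array.
struct FlatDataset {
    std::size_t n;
    std::size_t d;
    std::vector<double> data;
};

// Formats one line per record: the record's cluster, then all d of its values.
std::string format_records(const FlatDataset &x,
                           const std::vector<unsigned short> &assignment) {
    assert(assignment.size() == x.n && x.data.size() == x.n * x.d);
    std::ostringstream out;
    for (std::size_t i = 0; i < x.n; ++i) {      // i is a record index
        out << "cluster: " << assignment[i];
        for (std::size_t j = 0; j < x.d; ++j) {  // j is a dimension index
            out << ' ' << x.data[i * x.d + j];
        }
        out << '\n';
    }
    return out.str();
}
```

In the original program the equivalent change would be to loop i over the records, call getAssignment(i) once per record, and print x->data[i * d + j] for every j below d.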

k-means++ uniqueness check

I guess the purpose of this loop, https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L217, is to prevent a point from being chosen more than once as a center.

Let's say we have 10 points in our dataset and we initialise the centers with init_centers_kmeanspp_v2.

  • The first center is picked uniformly randomly: chosen_pts = {2}.
  • The 2nd point is picked in the while loop, let's say: 5. chosen_pts = {2, 5}.
  • The 3rd draw results: chosen_pts[2] = 2;

https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L218 then yields unique = FALSE, so we have to do the 3rd draw again.

// unique is instantiated as TRUE, outside(!) the loop
unique = unique && chosen_pts[2] != chosen_pts[0]  // TRUE && FALSE (2 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] //  FALSE && TRUE (2 !=5) = FALSE
// !FALSE -> new loop
  • 3rd draw (2nd do while loop): chosen_pts[2] = 8;
// unique is still FALSE from the previous loop
unique = unique && chosen_pts[2] != chosen_pts[0] // FALSE && TRUE(8 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] // FALSE && TRUE(8 !=5) = FALSE
// !FALSE -> new loop

Maybe I'm getting it wrong, but with some very small datasets (where the chance of picking the same center again is higher) I often run into an endless loop.
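
If the trace above is right, resetting the flag at the start of every draw would avoid the stuck FALSE. The following is a minimal standalone sketch of that idea, not the code from general_functions.cpp: draw_unique is a hypothetical name, and the draw callback stands in for the weighted k-means++ sampling step.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Draws candidates until one differs from every entry of chosen_pts.
// `draw` is a hypothetical stand-in for the weighted sampling step.
int draw_unique(const std::vector<int> &chosen_pts,
                const std::function<int()> &draw) {
    assert(static_cast<bool>(draw));  // the callback must be callable
    int candidate = 0;
    bool unique = true;
    do {
        candidate = draw();
        unique = true;  // reset for every draw so an earlier FALSE cannot stick
        for (std::size_t i = 0; i < chosen_pts.size(); ++i) {
            unique = unique && candidate != chosen_pts[i];
        }
    } while (!unique);
    return candidate;
}
```

With chosen_pts = {2, 5} and draws of 2 and then 8, the first draw is rejected and the second is accepted. The loop can still run forever if every point has already been chosen, so k must not exceed the number of distinct points.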
