Coder Social home page Coder Social logo

lab3-km's Introduction

Advanced Computing Systems 2020-2021 Lab 3 - Accelerating a clustering algorithm

This repository contains all files and explanations needed to perform the third lab of the EE4C07 - Advanced Computing Systems course at the Delft University of Technology.

What do I have to do for Lab 3?

  • Download the baseline project.
  • Improve the throughput of the project using any of the techniques learned in the course, as much as you can.
  • Write a report.
  • Turn in your report.
  • Rejoice with the knowledge that you have gained.

What does the baseline project do?

The baseline project implements a variant of the K-Means clustering algorithm. The K-means clustering algorithm takes a data set consisting of feature vectors. It then attempts to cluster feature vectors that are nearby each other into K clusters, where K is parameterizable.

For example, suppose we take a dataset of a group of students, where the feature vectors contain: {age, height, grade}. We then want to create two groups of students with the most overlapping features. The idea about K-means clustering is that we will create groups of the feature vectors that amongst each other have the smallest Euclidean distance.

There is ample literature available online about K-means clustering. The baseline example implements the pseudocode found in this wikibook. However, there is one difference to make this variant of K-means clustering a bit more unique.

Specifically, instead of supplying a maximum number of iterations, you can provide a threshold number of iterations. After the threshold, all Euclidean distances will be scaled for vectors wanting to switch to another cluster. This causes the clustering to converge more quickly, although the accuracy will drop.

Whether or not this change is totally insane or actually has some merit, is still undecided. Therefore, the algorithm variant is called KrazyMeans, just to be sure.

What should I do with the baseline project?

You should read the baseline source code and figure out how to program works.

Then, you must:

  • Improve the throughput of the project using any of the techniques learned in the course as much as possible.

    • You MUST use OpenCL or CUDA for the major run-time components.
    • You can use SIMD, OpenMP for some other parts.
    • You cannot use any external libraries unless explicit permission was given.
  • You can not change the way the algorithm works in a functional sense. E.g. you cannot change it to another variant of K-means. The distance scaling functionality MUST be included in your implementation. If you are unsure if some algorithmic change is allowed, ask the lab instructors - this is your responsibility.

What data set should I use for testing/debugging?

The code contains some functions to generate an example data set which can be visualized through a Python script. Using the -p option you can output the clustering context as some CSV files. The scripts will draw an image of those. Don't use this on large data sets. It only works for feature vectors with two dimensions.

You can run it with:

python3 plot_cluster.py

The script expects the points.csv and centroids.csv files in its working directory. It also requires some packages like Pandas and Matplotlib.

What data set should I use for benchmarking / report discussions?

Eventually you should benchmark your implementation using a generated data set using the -b option. As the feature vectors will be uniformly distributed throughout the feature space, any clustering of this data set will on average result in that every cluster contains about as many points.

You can use this to test your own implementation. There is no functional test supplied with this code, you need to write a test yourself to verify your results.

The figure of merit will be the run-time when running the improved program with ...

./krazymeans -k 24 -s 1e-5 -t 30 -i benchmark.kmd -o labels.kml

... where benchmark.kmd was generated using the -b option.

Can I share some of my knowledge with people outside my group?

There are some bonus points awarded to the groups with the fastest implementation.

You can decide whether sharing your industry secrets on optimizations of this algorithm with other groups is a smart thing to do ;-)

  • 1st place: +15% on lab 3 grade
  • 2nd place: +10% on lab 3 grade
  • 3rd place: +5% on lab 3 grade

What files can I change?

You can change any files in the project.

Can I fork this repository as a student?

Answer: NO, DO NOT FORK IT (publicly). See the Lab 1 statement about this.

Is there a template for the lab report?

You can use the templates from lab 1 and lab 2, and modify them according to what you want to convey through your report. However, these things MUST be included in the discussion of your report:

  1. An application profile of the baseline / intermediate versions.
  2. An estimation on what performance can be gained after improving specific parts, based on the profiles.
  3. A short discussion on how you've improved specific parts.
  4. Go back to 1 and repeat until you've squeezed out as much throughput as possible.

Other than that we will use exactly the same rubric as supplied in lab 1 to grade the lab work and report. So make sure you convey the information needed for the lab instructors to base a grade on.

lab3-km's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.