The Word Frequency program with CUDA project aims to develop an advanced solution for efficiently analyzing the frequency of words in large datasets. The project will leverage NVIDIA's CUDA framework to harness the power of GPUs, surpassing the performance of traditional CPU-based implementations.

cpp cuda cuda-kernels cuda-programming word-frequency word-frequency-count


worder

This project is an attempt to implement a GPU-augmented word frequency analyser using CUDA. As the following sections show, the GPU implementation delivers a significant speedup over the CPU baseline.

Input

  • Keywords

We use the 10K most common English words from Google as our keyword list and histogram subjects. The number and offset of keywords to read from this file are arbitrary, subject to an upper bound of 1489.

  • Data

We gathered text datasets of varying sizes, from 5 MB to 59 MB, for testing. The datasets consist of text drawn from articles and books. Proportional to the size of each dataset, the application loads a specific number of words into memory, as listed below.

Data     Number of Words to Read     Loaded Size (MB)
Small    131072                      4
Medium   393216                      12
Large    786432                      24
Huge     1572864                     48

Implementation

The application reads the keyword file and builds a list of keywords in memory. The data file to be processed is read in the same way. The application reserves a 32-byte chunk of memory for each word, so words are tokenised into fixed-width slots by default.
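A minimal sketch of what this fixed-width tokenisation could look like on the host side; the 32-byte slot width comes from the description above, while the function name, buffer layout, and truncation behaviour are illustrative assumptions rather than the project's actual code.

```cuda
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Width of each word slot in bytes, as described above.
constexpr int WORD_BYTES = 32;

// Hypothetical helper: reads up to `maxWords` whitespace-separated words and
// stores each one in its own 32-byte, zero-padded slot of a flat byte buffer.
std::vector<char> loadWords(const std::string& path, std::size_t maxWords) {
    std::vector<char> buffer(maxWords * WORD_BYTES, 0);
    std::ifstream in(path);
    std::string word;
    std::size_t count = 0;
    while (count < maxWords && in >> word) {
        // Truncate words longer than the slot (leaving room for a terminating NUL).
        std::strncpy(&buffer[count * WORD_BYTES], word.c_str(), WORD_BYTES - 1);
        ++count;
    }
    buffer.resize(count * WORD_BYTES);  // keep only the slots actually filled
    return buffer;
}
```

With this flat layout, word i always starts at byte offset i * 32, which makes the buffer straightforward to copy to the GPU and to index from individual threads.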

  • Plain

This version performs no preprocessing and processes the data directly on the CPU and GPU. It achieves the best compute throughput of the three versions.

(Figure: GPU compute throughput, plain version)

Data     CPU Time (ms)     GPU Time (ms)
Small    185               19.7651
Medium   579               54.6565
Large    1072              107.103
Huge     2017              208.281

As the table shows, the GPU achieves an average speedup of roughly 10x over the CPU.
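For context, the CPU side of such a comparison is essentially a sequential scan over the tokenised word slots with a linear search through the keyword list. The following is a hedged sketch under that assumption; the function and parameter names are illustrative, not the repository's actual code.

```cuda
#include <cstddef>
#include <cstring>
#include <vector>

constexpr int WORD_BYTES = 32;  // fixed slot width from the tokenisation step

// Hypothetical CPU baseline: for each 32-byte word slot, find the matching
// keyword (if any) and increment its histogram bin.
void cpuHistogram(const char* words, std::size_t numWords,
                  const char* keywords, std::size_t numKeywords,
                  std::vector<unsigned int>& histogram) {
    histogram.assign(numKeywords, 0);
    for (std::size_t w = 0; w < numWords; ++w) {
        const char* word = words + w * WORD_BYTES;
        for (std::size_t k = 0; k < numKeywords; ++k) {
            // Slots are zero-padded, so a full 32-byte comparison is enough.
            if (std::memcmp(word, keywords + k * WORD_BYTES, WORD_BYTES) == 0) {
                ++histogram[k];
                break;
            }
        }
    }
}
```

The GPU versions parallelise this outer loop over the words, one thread per word slot in the sketches that follow.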

  • Preprocess

This version performs preprocessing, namely lowercasing and punctuation removal, before processing the data on the CPU and GPU; a sketch of such a per-word normalisation step follows.
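A minimal sketch of the kind of per-word normalisation described above; the function name, its placement (host, device, or both), and the exact character classes kept are assumptions rather than the project's actual code.

```cuda
// Hypothetical per-word preprocessing: lowercase the word and strip
// punctuation in place, then re-pad the 32-byte slot with NULs.
__host__ __device__ void normaliseWord(char* word, int slotBytes) {
    int out = 0;
    for (int in = 0; in < slotBytes && word[in] != '\0'; ++in) {
        char c = word[in];
        if (c >= 'A' && c <= 'Z') c = c - 'A' + 'a';                   // lowercase
        bool keep = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');  // drop punctuation
        if (keep) word[out++] = c;
    }
    for (; out < slotBytes; ++out) word[out] = '\0';                   // re-pad the slot
}
```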

Data     CPU Time (ms)     GPU Time (ms)
Small    190               20.7706
Medium   613               57.7872
Large    1126              112.563
Huge     2106              218.822

As the results indicate, the average speedup is again close to 10x.

  • Streams

This version uses CUDA streams to pipeline data transfer and processing. The input data is decomposed into sections, and each section is preprocessed and processed on the GPU independently of the others, as sketched below.
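A hedged sketch of such stream-based pipelining under these assumptions: the input is already tokenised into 32-byte slots, the counting is done by a kernel here called wordHistogramKernel (sketched under Optimisation Strategies below), and the stream count, chunk size, and launch geometry are illustrative rather than the project's actual values.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

constexpr int NUM_STREAMS = 4;    // illustrative chunk/stream count
constexpr int WORD_BYTES  = 32;

// Counting kernel, sketched under "Optimisation Strategies" below.
__global__ void wordHistogramKernel(const char* words, std::size_t numWords,
                                    const char* keywords, std::size_t numKeywords,
                                    unsigned int* globalHist);

void processWithStreams(const char* h_words, std::size_t numWords,
                        const char* d_keywords, std::size_t numKeywords,
                        unsigned int* d_histogram) {
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamCreate(&streams[i]);

    char* d_words = nullptr;
    cudaMalloc(&d_words, numWords * WORD_BYTES);

    // Dynamic shared memory: keyword cache plus a block-local histogram.
    std::size_t smemBytes  = numKeywords * WORD_BYTES + numKeywords * sizeof(unsigned int);
    std::size_t chunkWords = (numWords + NUM_STREAMS - 1) / NUM_STREAMS;

    for (int i = 0; i < NUM_STREAMS; ++i) {
        std::size_t first = static_cast<std::size_t>(i) * chunkWords;
        if (first >= numWords) break;
        std::size_t count = (numWords - first < chunkWords) ? numWords - first : chunkWords;

        // The copy and kernel launch for each chunk go on that chunk's stream,
        // so one chunk's transfer can overlap with another chunk's compute.
        // For real overlap, h_words should be pinned (cudaMallocHost).
        cudaMemcpyAsync(d_words + first * WORD_BYTES, h_words + first * WORD_BYTES,
                        count * WORD_BYTES, cudaMemcpyHostToDevice, streams[i]);
        unsigned int threads = 256;
        unsigned int blocks  = static_cast<unsigned int>((count + threads - 1) / threads);
        wordHistogramKernel<<<blocks, threads, smemBytes, streams[i]>>>(
            d_words + first * WORD_BYTES, count, d_keywords, numKeywords, d_histogram);
    }

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_words);
}
```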

Data     GPU Time (ms)
Small    31.8311
Medium   69.9853
Large    127.618
Huge     236.412

This time, somewhat to our surprise, the speedup drops to an average of roughly 8x, lower than in the previous versions. The reason for this slow-down is kernel occupancy: the kernel already achieves about 98% occupancy, so kernels issued on different streams, although launched concurrently, effectively run in serial because there are almost no spare resources left to overlap them. Combined with the overhead of creating and launching many streams and kernels, the overall computation time deteriorates.

(Figure: GPU compute throughput, streams version)

Optimisation Strategies

The main optimisation steps, illustrated in the kernel sketch after this list, include:

  1. Use of shared memory for the keyword array; shared memory considerably speeds up the search for the right histogram index to update. When each block keeps its own copy of the keywords in shared memory, keyword accesses are far faster than repeated reads from global memory.
  2. Coalescing accesses to global memory; wherever possible, the application is written so that neighbouring threads access consecutive memory addresses. Their accesses then coalesce into fewer memory transactions, reducing the time spent reading from and writing to global memory.
  3. Use of block-local histograms to reduce conflicts; keeping a per-block histogram confines contention on histogram updates to the block level during the main counting phase, with the per-block counts merged into the global histogram afterwards, which increases performance.
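To make these strategies concrete, here is a hedged sketch of what such a kernel might look like; the kernel name, the one-word-per-thread mapping, and the 32-byte NUL-padded slot layout are assumptions carried over from the sketches above, not the repository's exact code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

constexpr int WORD_BYTES = 32;   // fixed word-slot width used throughout

// Hypothetical kernel combining the three strategies: keywords cached in
// shared memory, coalesced global reads, and a block-local histogram that is
// merged into the global histogram at the end.
__global__ void wordHistogramKernel(const char* __restrict__ words, std::size_t numWords,
                                    const char* __restrict__ keywords, std::size_t numKeywords,
                                    unsigned int* __restrict__ globalHist) {
    // Dynamic shared memory layout: [keyword cache | block-local histogram].
    extern __shared__ char smem[];
    char*         s_keywords = smem;
    unsigned int* s_hist =
        reinterpret_cast<unsigned int*>(smem + numKeywords * WORD_BYTES);

    // 1. Cooperatively stage the keywords and zero the local histogram
    //    (consecutive threads touch consecutive bytes: coalesced).
    for (std::size_t i = threadIdx.x; i < numKeywords * WORD_BYTES; i += blockDim.x)
        s_keywords[i] = keywords[i];
    for (std::size_t k = threadIdx.x; k < numKeywords; k += blockDim.x)
        s_hist[k] = 0;
    __syncthreads();

    // 2. Each thread classifies one 32-byte word slot.
    std::size_t w = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (w < numWords) {
        const char* word = words + w * WORD_BYTES;
        for (std::size_t k = 0; k < numKeywords; ++k) {
            const char* key = s_keywords + k * WORD_BYTES;
            bool match = true;
            for (int b = 0; b < WORD_BYTES; ++b) {               // slots are NUL-padded,
                if (word[b] != key[b]) { match = false; break; } // so a byte-wise
                if (word[b] == '\0') break;                      // compare suffices
            }
            if (match) {
                atomicAdd(&s_hist[k], 1u);   // contention stays inside the block
                break;
            }
        }
    }
    __syncthreads();

    // 3. Merge the block-local counts into the global histogram.
    for (std::size_t k = threadIdx.x; k < numKeywords; k += blockDim.x)
        if (s_hist[k] != 0) atomicAdd(&globalHist[k], s_hist[k]);
}
```

Note that with keyword counts near the 1489 upper bound, the keyword cache (about 47 KB) plus the block-local histogram can exceed the default 48 KB of shared memory per block, so a real implementation may need to raise the limit via cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, ...) or tile the keyword list.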

