Coder Social home page Coder Social logo

mohamedtaoufik / 1brc Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gunnarmorling/1brc

0.0 1.0 0.0 352 KB

1οΈβƒ£πŸπŸŽοΈ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

License: Apache License 2.0

Shell 11.65% Java 88.35%

1brc's Introduction

1οΈβƒ£πŸπŸŽοΈ The One Billion Row Challenge

Status Jan 1: This challenge is open for submissions!

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file. Grab all your (virtual) threads, reach out to SIMD, optimize your GC, or pull any other trick, and create the fastest implementation for solving this task!

1BRC

The text file contains temperature values for a range of weather stations. Each row is one measurement in the format <string: station name>;<double: measurement>. The following shows ten rows as an example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0

The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this (i.e. sorted alphabetically by station name, and the result values per station in the format <min>/<mean>/<max>, rounded to one fractional digit):

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, AbΓ©chΓ©=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...}

Submit your implementation by Jan 31 2024 and become part of the leaderboard!

Results

# Result (m:s:ms) Implementation Submitter
1. 00:23.366 link Roy van Rijn
2. 00:38.510 link Hampus Ram
3. 00:50.547 link Aurelian Tutuianu
4. 02:08.315 link itaske
5. 02:08.650 link Kuduwa Keshavram
6. 04:13.449 link (baseline) Gunnar Morling

See below for instructions how to enter the challenge with your own implementation.

Prerequisites

Java 21 must be installed on your system.

Running the Challenge

This repository contains two programs:

  • dev.morling.onebrc.CreateMeasurements (invoked via create_measurements.sh): Creates the file measurements.txt in the root directory of this project with a configurable number of random measurement values
  • dev.morling.onebrc.CalculateAverage (invoked via calculate_average.sh): Calculates the average values for the file measurements.txt

Execute the following steps to run the challenge:

  1. Build the project using Apache Maven:

    ./mvnw clean verify
    
  2. Create the measurements file with 1B rows (just once):

    ./create_measurements.sh 1000000000
    

    This will take a few minutes. Attention: the generated file has a size of approx. 12 GB, so make sure to have enough diskspace.

  3. Calculate the average measurement values:

    ./calculate_average.sh
    

    The provided naive example implementation uses the Java streams API for processing the file and completes the task in ~2 min on environment used for result evaluation. It serves as the base line for comparing your own implementation.

  4. Optimize the heck out of it:

    Adjust the CalculateAverage program to speed it up, in any way you see fit (just sticking to a few rules described below). Options include parallelizing the computation, using the (incubating) Vector API, memory-mapping different sections of the file concurrently, using AppCDS, GraalVM, CRaC, etc. for speeding up the application start-up, choosing and tuning the garbage collector, and much more.

The following rules and limits apply:

  • Any of these Java distributions may be used:
    • Any builds provided by SDKMan
    • Early access builds available on openjdk.net may be used (including EA builds for OpenJDK projects like Valhalla)
    • Builds on builds.shipilev.net If you want to use a build not available via these channels, reach out to discuss whether it can be considered.
  • No external library dependencies may be used
  • Implementations must be provided as a single source file
  • The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary

Entering the Challenge

To submit your own implementation to 1BRC, follow these steps:

  • Create a fork of the onebrc GitHub repository.
  • Create a copy of CalculateAverage.java, named CalculateAverage_<your_GH_user>.java, e.g. CalculateAverage_doloreswilson.java.
  • Make that implementation fast. Really fast.
  • Create a copy of calculate_average.sh, named calculate_average_<your_GH_user>.sh, e.g. calculate_average_doloreswilson.sh.
  • Adjust that script so that it references your implementation class name. If needed, provide any JVM arguments via the JAVA_OPTS variable in that script.
  • (Optional) If you'd like to use native binaries (GraalVM), adjust the pom.xml file so that it builds that binary.
  • Create a pull request against the upstream repository, clearly stating
    • The name of your implementation class.
    • The JDK build to use (if not specified, the latest OpenJDK 21 upstream build will be used).
    • The execution time of the program on your system and specs of the same (CPU, number of cores, RAM). This is for informative purposes only, the official runtime will be determined as described below.
  • I will run the program and determine its performance as described in the next section, and enter the result to the scoreboard.

Note: I reserve the right to not evaluate specific submissions if I feel doubtful about the implementation (I.e. I won't run your BitCoin miner ;).

If you'd like to discuss any potential ideas for implementing 1BRC with the community, you can use the GitHub Discussions of this repository. Please keep it friendly and civil.

The challenge runs until Jan 31 2024. Any submissions (i.e. pull requests) created after Jan 31 2024 23:59 UTC will not be considered.

Evaluating Results

Results are determined by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the results table above.

If you'd like to spin up your own box for testing on Hetzner Cloud, you may find these set-up scripts (based on Terraform and Ansible) useful. Note this will incur cost you are responsible for, I am not going to pay your cloud bill :)

Prize

If you enter this challenge, you may learn something new, get to inspire others, and take pride in seeing your name listed in the scoreboard above. Rumor has it that the winner may receive a unique 1οΈβƒ£πŸπŸŽοΈ t-shirt, too!

FAQ

Q: Can I use Kotlin or other JVM languages other than Java?
A: No, this challenge is focussed on Java only. Feel free to inofficially share implementations significantly outperforming any listed results, though.

Q: Can I use non-JVM languages and/or tools?
A: No, this challenge is focussed on Java only. Feel free to inofficially share interesting implementations and results though. For instance it would be interesting to see how DuckDB fares with this task.

Q: Can I use JNI?
A: Submissions must be implemented in Java; JNI requires glue code written in C/C++, so it cannot be used. You could use AOT compilation of Java code via GraalVM though, either by AOT-compiling the entire application, or by creating a native library which you then invoke using the new Java FFI (https://openjdk.org/jeps/442[JEP 442]).

Q: What is the encoding of the measurements.txt file?
A: The file is encoded with UTF-8.

Q: Can I make assumptions on the names of the weather stations showing up in the data set?
A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; character).

Q: Can I copy code from other submissions? A: Yes, you can. The primary focus of the challenge is about learning something new, rather than "winning". When you do so, please give credit to the relevant source submissions. Please don't re-submit other entries with no or only trivial improvements.

Q: Why 1οΈβƒ£πŸπŸŽοΈ ?
A: It's the abbreviation of the project name: One Billion Row Challenge.

License

This code base is available under the Apache License, version 2.

Code of Conduct

Be excellent to each other! More than winning, the purpose of this challenge is to have fun and learn something new.

1brc's People

Contributors

gunnarmorling avatar nipafx avatar lpmi-13 avatar bjhara avatar keshavram-apptware avatar royvanrijn avatar itaske avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.