
juditacs / wordcount


Counting words in different programming languages.

Python 8.40% C++ 2.50% PHP 5.06% Shell 9.05% JavaScript 2.32% Perl 0.31% Java 32.43% Batchfile 0.33% Julia 0.80% Go 4.03% C# 6.12% Haskell 1.33% Rust 2.08% Clojure 0.79% Scala 1.17% TypeScript 3.72% Lua 0.98% Elixir 5.97% D 1.67% C 10.92%

wordcount's People

Contributors

airbreather, bpatrik, coderdreams, crbelaus, daurnimator, davidnemeskey, eksperimental, flababah, gaborszabo88, gaebor, getzdan, hansbogert, juditacs, kpeu3i, kundralaci, larion, leventov, matias-te, nexor, nobbz, shedar, svanoort, szarnyasg, szelpe, timposey2, unipolar, xupwup, zseder


wordcount's Issues

Both Java versions fail on the first two tests

bash scripts/test.sh java -classpath java WordCount

---- java -classpath java WordCount ----
  test1 fails
  test2 fails
FAIL

or testing just one file:

cat data/test/test1.in | java -classpath java WordCount
aaa     3
abc     3
bbb     2
        1
ccc     1

It looks like empty lines are counted as words.
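Whatever language the fix lands in, the underlying bug is that splitting a blank line produces an empty token. A minimal Python sketch of the intended behaviour (the output ordering by descending count, then token, is assumed from the expected output above):

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated tokens; blank lines add nothing."""
    counts = Counter()
    for line in lines:
        # split() with no separator discards empty tokens, so the
        # empty "word" from the bug report never gets counted
        counts.update(line.split())
    return counts

def formatted(counts):
    # ordering assumed from the expected output: count desc, token asc
    return [f"{w}\t{n}" for w, n in
            sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

In the Java versions the equivalent fix is to skip zero-length tokens after `String.split`, or to tokenize with a pattern that cannot yield them.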

Concurrency or not concurrency?

There is that rule in the README:

  • single-thread is preferred but you can add multi-threaded or multicore versions too

This reads as if each language should have a single-threaded contribution, with multi-threaded contributions added as separate entries.

However, a recent cleanup removed quite a lot of programs from languages that had multiple entries.

Since Elixir is a language that bets on concurrency, I'd like to retry a concurrent version before starting this in Erlang.

So how do we handle this?

It would be easy to add a CLI switch or subcommand to the script that enables or disables concurrency, but the question remains: how should concurrent versions be handled during the benchmark runs?
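For the mechanics, such a toggle can be as small as a `workers` parameter dispatched from a hypothetical `--workers` CLI flag (no such convention exists in the repo yet). A Python sketch, thread-based for brevity; a real concurrent entry would differ per language:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def count_words(lines, workers=1):
    """workers=1 keeps the single-threaded default the README prefers;
    workers>1 splits the input and merges per-chunk counts."""
    if workers <= 1:
        return count_chunk(lines)
    # round-robin split keeps the chunks roughly equal in size
    chunks = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

An argparse flag defaulting to 1 could then expose the same toggle on the command line, so the benchmark runner can request either mode explicitly.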

Rename solutions

  • have one solution for each language except the two original baselines (Python2 and cpp)
  • every source file should be named wordcount (or whatever capitalization is the naming convention in that language)

wordcount.js fails on all tests

bash scripts/test.sh node javascript/wordcount.js 

---- node javascript/wordcount.js ----
  test1 fails
  test2 fails
  test3 fails
  test4 fails
FAIL

The test files are located in data/test/test*;
the ones ending in .in are the inputs, and the corresponding .out files contain the expected output.
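That check can be reproduced by hand; a minimal Python sketch of the same comparison (paths per the description above; this is not the actual scripts/test.sh logic):

```python
import subprocess
from pathlib import Path

def run_tests(cmd, test_dir="data/test"):
    """Run `cmd` (an argv list, e.g. ["node", "javascript/wordcount.js"])
    on every test*.in and compare its stdout to the matching .out file."""
    failures = []
    for infile in sorted(Path(test_dir).glob("test*.in")):
        expected = infile.with_suffix(".out").read_text()
        with infile.open("rb") as stdin:
            result = subprocess.run(cmd, stdin=stdin,
                                    capture_output=True, text=True)
        if result.stdout != expected:
            failures.append(infile.name)
    return failures
```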

Use unambiguous docker base image

Using "ubuntu" as the base image is not best practice, because the tag is ambiguous. In my case the build fails: I installed Docker a long time ago, and my cached ubuntu image was still 12.04.

Let's pin the version:

FROM ubuntu:14.04

AWS test environment + automated test/benchmark in Jenkins

I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.

Work progress:

  • Create an r3.large instance with ephemeral storage and assigned benchmarking-specific IAM role
  • Create setup script that installs docker, xz, git, starts docker, and pulls the ubuntu-16 allthelanguages docker image
  • Create and attach policy to IAM role for benchmarking that allows read of the dataset S3 bucket, and write of the results bucket
  • Compress huwiki, huwikisource, cleaned huwiki with xz -9 (smallest size) and upload to new S3 buckets (data is a private bucket, benchmark results is initially private, but later public).
  • Add script commands to setup script that will download data from S3 and decompress it
  • Set the AWS host to use ephemeral (instance) storage for /tmp folder
  • Run benchmark using docker
  • Upload first result to S3 - available here
  • Create scripting to grab instance + package info to metadata file
    • Git hash used in build
    • Timestamp
    • Host type, from aws cli
    • Hash of input file
  • Timeouts and resource limits on individual runs (Node.js, for example, hung on the instance and needed to be killed manually; another run ran out of RAM and broke the Docker session)
  • Create scripting to name results by run/host info individually
  • Jenkins: job to run tests (inside a resource-limited container) against main wordcount branch + PRs
  • Jenkins - role or similar to allow control of benchmarking host?
    • Public view-only access to builds now enabled on dynamic.codeablereason.com/jenkins
    • HTTPS access added to dynamic.codeablereason.com (with LetsEncrypt)
    • Enforce HTTPS for all but badges/static resources on Jenkins (for performance/access reasons)
    • Enable limited-access users for wordcount use
  • Jenkins - job to fire benchmarks (github triggering)

Hardware/specs:

  • Storage: use SSD instance storage to benchmark (limits instance types). General purpose EBS SSD storage is generally slower and would run out of I/O credits after 1/2 hour (benchmarks need several hours).
  • Memory: either 7 GB (small datasets or where memory is not needed) or 15 GB (large or high-memory datasets).
  • CPUs: 2 or 4 core.
  • Instance types: m3.large (2-core, 7.5 GB RAM) for the small datasets, and r3.large (2-core, 15.25 GB RAM) for big ones. If we do lots of parallelized implementations, add m3.xlarge (4-core, 15 GB RAM).
  • Cost: I am not spending more than $10-15/month on it, beyond my existing Jenkins host (reserved t2.micro) and domain/S3 hosting. Instances will be created to run a set of benchmarks and then terminated, with frequency to keep costs within limits.

Architecture:

  • Instances are spun up by my Jenkins host, with an appropriate IAM role or credentials to do this in a limited way.
  • Benchmark datasets will be self-hosted to not hit their sources hard. They won't be fully public unless small.
  • Instance gets an IAM role that allows uploading to a public (?) S3 results bucket.
  • Instance runs benchmarks on instance storage
  • Instance will upload each result to the S3 bucket as it completes, stamped with the git commit hash, timestamp run, language, etc.
  • All testing will use a reasonable timeout for both individual tests and the whole test set, if it hangs it is killed or skipped.
  • All testing uses the docker image, for reproducibility across hardware.
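The timeout item above can be handled with `subprocess.run`'s `timeout` argument, which kills the child when it expires. A minimal sketch (the 600 s default is an assumed placeholder, not a project value):

```python
import subprocess

def run_with_timeout(cmd, stdin_path, timeout_s=600):
    """Run one benchmark command, killing it if it exceeds timeout_s.
    Returns (returncode, stdout); (None, b"") signals a timed-out run.
    The 600 s default is an assumed placeholder, not a project value."""
    try:
        with open(stdin_path, "rb") as f:
            result = subprocess.run(cmd, stdin=f, capture_output=True,
                                    timeout=timeout_s)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return None, b""
```

Memory limits are better enforced from outside the process, e.g. `docker run --memory=4g`, since the runs already happen inside the Docker image.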

Two options for how to set it up:

  • EBS-based & on-demand instances:
    • Use an EBS volume containing benchmark data and preconfigured system, and just start/stop the instance.
    • When run, the git repo is cloned, the dataset is copied to the data folder, and tests are run & uploaded.
    • Easier to set up and run, but more expensive.
  • S3 based/spot instances:
    • Cheaper (about 1/4 the instance price) but more maintenance.
    • Submit spot bids, instances are configured using the "user data" field to submit a startup script which sets up and runs benchmarks.
    • Private S3 buckets host compressed corpus data, these are fetched and decompressed.

Open questions:

  • What to use for controlling instances?
    • AWS CLI is easy
    • Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from), but comes with performance overhead
    • Ansible is kind of amazing and easy to work with

Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O to the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot buy (bid at 2x the current spot price to avoid termination if the spot price rises).

Needs a better dataset for comparison with large data size

The full Hungarian wiki has ~4.3 GB of data, but ~2.5GB of unique string content:

cat data/huwiki-latest-pages-meta-current.xml | sed 's/[\t ]/\n/g' | grep -v ^$ | sort | uniq | wc -m

2507384541

There are ~25M unique tokens.

This means that we are generating gigantic hashtables with generally count = 1, and languages that store Unicode strings as 2-byte representations in memory suffer greatly due to memory overheads. Much of the memory used will simply be storing the unique strings.
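A rough back-of-envelope check of that claim. The 25M-token and ~2.5 GB figures are quoted from above; the ~100-char average follows from dividing them, and the per-entry overhead is an assumed, runtime-dependent number:

```python
# All inputs are assumptions or figures quoted from this issue; real
# memory use varies by runtime and hashtable implementation.
UNIQUE_TOKENS = 25_000_000        # ~25M unique tokens (from above)
AVG_CHARS = 100                   # ~2.5e9 unique chars / 25M tokens
PER_ENTRY_OVERHEAD = 48           # assumed bytes per hashtable entry

def estimated_bytes(bytes_per_char):
    """String storage plus entry overhead; the counts themselves
    (mostly 1) are negligible and ignored."""
    return UNIQUE_TOKENS * (AVG_CHARS * bytes_per_char + PER_ENTRY_OVERHEAD)

one_byte = estimated_bytes(1)     # UTF-8-ish runtime, mostly-ASCII text
two_byte = estimated_bytes(2)     # UTF-16 runtime (e.g. Java before compact strings)
print(f"{one_byte / 2**30:.1f} GiB vs {two_byte / 2**30:.1f} GiB")
```

Even under these generous assumptions the 2-byte representation costs several extra gigabytes just to hold the unique strings.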

Simple vs. optimized versions

Originally I wanted only one version per language, but simple/vanilla and optimized versions can reasonably differ (Java is a prime example), so we should support more than one version. Should it be one simple and one optimized version, or possibly more?

The project is looking for a new owner

Due to numerous other engagements I am unable to continue maintaining the project and am therefore looking for a new owner. I will try to fix the current issues (mostly due to bitrot), and after that the project will officially be discontinued unless someone is willing to take over.

Tidy up Dockerfile

The Dockerfile is a mess right now.
Install commands should be grouped by the language they are required for.

Any help would be welcome.

Reorganize setup

Reorganize the setup so that it is much easier to get started.

I struggled a lot during the setup phase, and I'm still unsure if everything is working as it should. Probably either the documentation or the workflow itself should be updated in a way that makes the process more clear for potential contributors.

Memory usage of Bash isn't comparable to rest

I think the memory usage shown on the overview page is not comparable to the other languages.

The bash script forks many subprocesses by piping output between binaries, and I suspect the memory usage of those separate processes is not accounted for.
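One cheap, if incomplete, way to see what a harness may be missing is to ask the kernel for the children's resource usage. A Python sketch (Linux semantics assumed; note the caveat in the docstring):

```python
import resource
import subprocess

def peak_child_rss_kb(pipeline):
    """Run a shell pipeline and report ru_maxrss over its waited-for
    children (kilobytes on Linux). Caveat: RUSAGE_CHILDREN reports the
    *maximum* over children, not the sum, so concurrently resident
    pipeline stages are still undercounted."""
    subprocess.run(pipeline, shell=True, check=True,
                   stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

A fuller accounting would place the whole pipeline in one cgroup and read its peak memory, a boundary that `docker run` already provides.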

Python 3 version is not working

When executed with a UTF-8 locale, it outputs completely different counts than e.g. the py2/Java versions. When executed with the C locale, it fails at the output phase with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xee' in position 0: ordinal not in range(128)
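One way to make the output phase locale-independent is to bypass the codec that sys.stdout derives from the locale and write UTF-8 bytes directly. A sketch (`emit` is a hypothetical helper, not code from the repo):

```python
import sys

def emit(word, count, out=None):
    """Write one result line as UTF-8 bytes, bypassing the codec that
    sys.stdout derives from the locale (the C locale implies ASCII,
    which is what raises the UnicodeEncodeError above).
    `emit` is a hypothetical helper, not code from the repo."""
    out = sys.stdout.buffer if out is None else out
    out.write(f"{word}\t{count}\n".encode("utf-8"))
```

Alternatively, setting the environment variable PYTHONIOENCODING=utf-8 forces the same encoding without code changes.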
