
juditacs / wordcount


Counting words in different programming languages.

Python 8.40% C++ 2.50% PHP 5.06% Shell 9.05% JavaScript 2.32% Perl 0.31% Java 32.43% Batchfile 0.33% Julia 0.80% Go 4.03% C# 6.12% Haskell 1.33% Rust 2.08% Clojure 0.79% Scala 1.17% TypeScript 3.72% Lua 0.98% Elixir 5.97% D 1.67% C 10.92%

wordcount's People

Contributors

airbreather, bpatrik, coderdreams, crbelaus, daurnimator, davidnemeskey, eksperimental, flababah, gaborszabo88, gaebor, getzdan, hansbogert, juditacs, kpeu3i, kundralaci, larion, leventov, matias-te, nexor, nobbz, shedar, svanoort, szarnyasg, szelpe, timposey2, unipolar, xupwup, zseder


wordcount's Issues

Both Java versions fail on the first two tests

bash scripts/test.sh java -classpath java WordCount

---- java -classpath java WordCount ----
  test1 fails
  test2 fails
FAIL

or testing just one file:

cat data/test/test1.in | java -classpath java WordCount
aaa     3
abc     3
bbb     2
        1
ccc     1

It looks like empty lines are counted as words.
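Whatever language the fix lands in, the underlying bug is that splitting a blank line produces an empty token. A minimal Python sketch of the intended behaviour (the output ordering by descending count, then token, is assumed from the expected output above):

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated tokens; blank lines add nothing."""
    counts = Counter()
    for line in lines:
        # split() with no separator discards empty tokens, so the
        # empty "word" from the bug report never gets counted
        counts.update(line.split())
    return counts

def formatted(counts):
    # ordering assumed from the expected output: count desc, token asc
    return [f"{w}\t{n}" for w, n in
            sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

In the Java versions the equivalent fix is to skip zero-length tokens after `String.split`, or to tokenize with a pattern that cannot yield them.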

Concurrency or not concurrency?

There is that rule in the README:

  • single-thread is preferred but you can add multi-threaded or multicore versions too

This reads as if each language should have a single-threaded contribution, with multi-threaded contributions added as separate entries.

However, a recent cleanup removed quite a lot of programs from languages that had multiple entries.

Since Elixir is a language that bets on concurrency, I'd like to retry a concurrent version before starting this in Erlang.

So how do we handle this?

It would be easy to add a CLI switch or subcommand to the script that enables or disables concurrency, but the question remains: how should concurrent versions be handled during the benchmark runs?
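For the mechanics, such a toggle can be as small as a `workers` parameter dispatched from a hypothetical `--workers` CLI flag (no such convention exists in the repo yet). A Python sketch, thread-based for brevity; a real concurrent entry would differ per language:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def count_words(lines, workers=1):
    """workers=1 keeps the single-threaded default the README prefers;
    workers>1 splits the input and merges per-chunk counts."""
    if workers <= 1:
        return count_chunk(lines)
    # round-robin split keeps the chunks roughly equal in size
    chunks = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

An argparse flag defaulting to 1 could then expose the same toggle on the command line, so the benchmark runner can request either mode explicitly.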

Rename solutions

  • have one solution for each language except the two original baselines (Python2 and cpp)
  • every source file should be named wordcount (or whatever capitalization is the naming convention in that language)

wordcount.js fails on all tests

bash scripts/test.sh node javascript/wordcount.js 

---- node javascript/wordcount.js ----
  test1 fails
  test2 fails
  test3 fails
  test4 fails
FAIL

The test files are located in data/test/test*;
the ones ending in .in are the inputs, and the corresponding .out files contain the expected output.
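That check can be reproduced by hand; a minimal Python sketch of the same comparison (paths per the description above; this is not the actual scripts/test.sh logic):

```python
import subprocess
from pathlib import Path

def run_tests(cmd, test_dir="data/test"):
    """Run `cmd` (an argv list, e.g. ["node", "javascript/wordcount.js"])
    on every test*.in and compare its stdout to the matching .out file."""
    failures = []
    for infile in sorted(Path(test_dir).glob("test*.in")):
        expected = infile.with_suffix(".out").read_text()
        with infile.open("rb") as stdin:
            result = subprocess.run(cmd, stdin=stdin,
                                    capture_output=True, text=True)
        if result.stdout != expected:
            failures.append(infile.name)
    return failures
```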

Use unambiguous docker base image

Using "ubuntu" as the base image is not best practice, because the tag is ambiguous. In my case the build fails: I installed Docker a long time ago, and my cached ubuntu image was still 12.04.

Let's pin the version:

FROM ubuntu:14.04

AWS test environment + automated test/benchmark in Jenkins

I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.

Work progress:

  • Create an r3.large instance with ephemeral storage and assigned benchmarking-specific IAM role
  • Create setup script that installs docker, xz, git, starts docker, and pulls the ubuntu-16 allthelanguages docker image
  • Create and attach policy to IAM role for benchmarking that allows read of the dataset S3 bucket, and write of the results bucket
  • Compress huwiki, huwikisource, cleaned huwiki with xz -9 (smallest size) and upload to new S3 buckets (data is a private bucket, benchmark results is initially private, but later public).
  • Add script commands to setup script that will download data from S3 and decompress it
  • Set the AWS host to use ephemeral (instance) storage for /tmp folder
  • Run benchmark using docker
  • Upload first result to S3 - available here
  • Create scripting to grab instance + package info to metadata file
    • Git hash used in build
    • Timestamp
    • Host type, from aws cli
    • Hash of input file
  • Timeouts and resource limits on individual runs (Node.js, for example, hung on the instance and needed to be killed manually; another run ran out of RAM and broke the Docker session)
  • Create scripting to name results by run/host info individually
  • Jenkins: job to run tests (inside a resource-limited container) against main wordcount branch + PRs
  • Jenkins - role or similar to allow control of benchmarking host?
    • Public view-only access to builds now enabled on dynamic.codeablereason.com/jenkins
    • HTTPS access added to dynamic.codeablereason.com (with LetsEncrypt)
    • Enforce HTTPS for all but badges/static resources on Jenkins (for performance/access reasons)
    • Enable limited-access users for wordcount use
  • Jenkins - job to fire benchmarks (github triggering)

Hardware/specs:

  • Storage: use SSD instance storage to benchmark (limits instance types). General purpose EBS SSD storage is generally slower and would run out of I/O credits after 1/2 hour (benchmarks need several hours).
  • Memory: either 7 GB (small datasets or where memory is not needed) or 15 GB (large or high-memory datasets).
  • CPUs: 2 or 4 core.
  • Instance types: m3.large (2-core, 7.5 GB RAM) for the small datasets, and r3.large (2-core, 15.25 GB RAM) for big ones. If we do lots of parallelized implementations, add m3.xlarge (4-core, 15 GB RAM).
  • Cost: I am not spending more than $10-15/month on it, beyond my existing Jenkins host (reserved t2.micro) and domain/S3 hosting. Instances will be created to run a set of benchmarks and then terminated, with frequency to keep costs within limits.

Architecture:

  • Instances are spun up by my Jenkins host, with an appropriate IAM role or credentials to do this in a limited way.
  • Benchmark datasets will be self-hosted to not hit their sources hard. They won't be fully public unless small.
  • Instance gets an IAM role that allows uploading to a public (?) S3 results bucket.
  • Instance runs benchmarks on instance storage
  • Instance will upload each result to the S3 bucket as it completes, stamped with the git commit hash, timestamp run, language, etc.
  • All testing will use a reasonable timeout for both individual tests and the whole test set, if it hangs it is killed or skipped.
  • All testing uses the docker image, for reproducibility across hardware.
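The timeout item above can be handled with `subprocess.run`'s `timeout` argument, which kills the child when it expires. A minimal sketch (the 600 s default is an assumed placeholder, not a project value):

```python
import subprocess

def run_with_timeout(cmd, stdin_path, timeout_s=600):
    """Run one benchmark command, killing it if it exceeds timeout_s.
    Returns (returncode, stdout); (None, b"") signals a timed-out run.
    The 600 s default is an assumed placeholder, not a project value."""
    try:
        with open(stdin_path, "rb") as f:
            result = subprocess.run(cmd, stdin=f, capture_output=True,
                                    timeout=timeout_s)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return None, b""
```

Memory limits are better enforced from outside the process, e.g. `docker run --memory=4g`, since the runs already happen inside the Docker image.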

Two options for how to set it up:

  • EBS-based & on-demand instances:
    • Use an EBS volume containing benchmark data and preconfigured system, and just start/stop the instance.
    • When run, the git repo is cloned, the dataset is copied to the data folder, and tests are run & uploaded.
    • Easier to set up and run, but more expensive.
  • S3 based/spot instances:
    • Cheaper (about 1/4 the instance price) but more maintenance.
    • Submit spot bids, instances are configured using the "user data" field to submit a startup script which sets up and runs benchmarks.
    • Private S3 buckets host compressed corpus data, these are fetched and decompressed.

Open questions:

  • What to use for controlling instances?
    • AWS CLI is easy
    • Jenkins AWS EC2 plugin will spin up Jenkins agents in EC2 (far easier to generate and report results from), but comes with performance overhead
    • Ansible is kind of amazing and easy to work with

Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O to the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot buy (bid at 2x the current spot price to avoid termination if the spot price rises).

Needs a better dataset for comparison with large data size

The full Hungarian wiki has ~4.3 GB of data, but ~2.5GB of unique string content:

cat data/huwiki-latest-pages-meta-current.xml | sed 's/[\t ]/\n/g' | grep -v ^$ | sort | uniq | wc -m

2507384541

There are ~25M unique tokens.

This means that we are generating gigantic hashtables with generally count = 1, and languages that store Unicode strings as 2-byte representations in memory suffer greatly due to memory overheads. Much of the memory used will simply be storing the unique strings.
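A rough back-of-envelope check of that claim. The 25M-token and ~2.5 GB figures are quoted from above; the ~100-char average follows from dividing them, and the per-entry overhead is an assumed, runtime-dependent number:

```python
# All inputs are assumptions or figures quoted from this issue; real
# memory use varies by runtime and hashtable implementation.
UNIQUE_TOKENS = 25_000_000        # ~25M unique tokens (from above)
AVG_CHARS = 100                   # ~2.5e9 unique chars / 25M tokens
PER_ENTRY_OVERHEAD = 48           # assumed bytes per hashtable entry

def estimated_bytes(bytes_per_char):
    """String storage plus entry overhead; the counts themselves
    (mostly 1) are negligible and ignored."""
    return UNIQUE_TOKENS * (AVG_CHARS * bytes_per_char + PER_ENTRY_OVERHEAD)

one_byte = estimated_bytes(1)     # UTF-8-ish runtime, mostly-ASCII text
two_byte = estimated_bytes(2)     # UTF-16 runtime (e.g. Java before compact strings)
print(f"{one_byte / 2**30:.1f} GiB vs {two_byte / 2**30:.1f} GiB")
```

Even under these generous assumptions the 2-byte representation costs several extra gigabytes just to hold the unique strings.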

Simple vs. optimized versions

Originally I wanted only one version per language, but simple/vanilla and optimized versions can reasonably differ (Java is a prime example), so we should support more than one version. Should it be one simple and one optimized version, or possibly more?

The project is looking for a new owner

Due to numerous other engagements I am unable to continue maintaining the project and am therefore looking for a new owner. I will try to fix the current issues (mostly due to bitrot), and after that the project will officially be discontinued unless someone is willing to take over.

Tidy up Dockerfile

The Dockerfile is a mess right now.
Install commands should be grouped by the language they are required for.

Any help would be welcome.

Reorganize setup

Reorganize the setup so that it is much easier to get started.

I struggled a lot during the setup phase, and I'm still unsure if everything is working as it should. Probably either the documentation or the workflow itself should be updated in a way that makes the process more clear for potential contributors.

Memory usage of Bash isn't comparable to rest

I think the memory usage shown on the overview page is not comparable to the other languages.

The bash script forks many subprocesses by piping output between binaries, and I suspect the memory usage of those separate processes is not accounted for.
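One cheap, if incomplete, way to see what a harness may be missing is to ask the kernel for the children's resource usage. A Python sketch (Linux semantics assumed; note the caveat in the docstring):

```python
import resource
import subprocess

def peak_child_rss_kb(pipeline):
    """Run a shell pipeline and report ru_maxrss over its waited-for
    children (kilobytes on Linux). Caveat: RUSAGE_CHILDREN reports the
    *maximum* over children, not the sum, so concurrently resident
    pipeline stages are still undercounted."""
    subprocess.run(pipeline, shell=True, check=True,
                   stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

A fuller accounting would place the whole pipeline in one cgroup and read its peak memory, a boundary that `docker run` already provides.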

Python 3 version is not working

When executed with a UTF-8 locale, it outputs completely different counts than e.g. the py2/Java versions. When executed with the C locale, it fails at the output phase with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xee' in position 0: ordinal not in range(128)
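One way to make the output phase locale-independent is to bypass the codec that sys.stdout derives from the locale and write UTF-8 bytes directly. A sketch (`emit` is a hypothetical helper, not code from the repo):

```python
import sys

def emit(word, count, out=None):
    """Write one result line as UTF-8 bytes, bypassing the codec that
    sys.stdout derives from the locale (the C locale implies ASCII,
    which is what raises the UnicodeEncodeError above).
    `emit` is a hypothetical helper, not code from the repo."""
    out = sys.stdout.buffer if out is None else out
    out.write(f"{word}\t{count}\n".encode("utf-8"))
```

Alternatively, setting the environment variable PYTHONIOENCODING=utf-8 forces the same encoding without code changes.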
