tweet_ingestion's Introduction

Tweet Ingestion - Insights Coding Challenge Submitted by Victor Chiapaikeo

Overview

Tweet Ingestion is an application designed to efficiently ingest a high volume of tweets and process them as follows:

  • Produce a file that lists each word found in the tweets together with the number of times it occurs. Example output is as follows:
	analytics       1
	bigdata         3
	kdn             1
	smb             1
  • Produce a running median of the number of unique words per tweet, updated as each tweet is read. Example output is as follows:
	11.0
	12.5
	14.0
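
As a rough illustration of the first output, the sketch below counts whitespace-separated words across a few tweets and renders them as sorted, aligned rows. The function names count_words and format_counts, the sample tweets, and the column width are assumptions made for this example and are not taken from src/words_tweeted.py.

```python
def count_words(lines):
    """Count how often each whitespace-separated word appears across all lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def format_counts(counts, width=16):
    """Render counts as alphabetically sorted rows; the column width is an assumption."""
    return ["{0:<{1}}{2}".format(word, width, counts[word])
            for word in sorted(counts)]


# Hypothetical input: one tweet per line, already stripped of newlines.
tweets = ["bigdata analytics bigdata", "kdn smb bigdata"]
rows = format_counts(count_words(tweets))
for row in rows:
    print(row)
```

Sorting the words before writing keeps the output deterministic, which makes it easier to compare against an expected file in a test.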

Setup

This code is portable across Linux distributions, Mac OS, and Windows. The scripts were written for Python 2.7 and have not been tested under Python 3.x.

You are encouraged to use a Python virtual environment created with virtualenv and to install dependencies with pip. NOTE (2015-07-18): As of now, the requirements file is empty because no modules outside the standard library are used.

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements

Description of imported modules and how the application uses them

  • os - operating system interface, used to portably join paths, remove files, etc.
  • sys - interpreter-related functions, used to read parameters from the command line
  • heapq - heap queue (priority queue) algorithm, used to efficiently obtain the root (smallest element) of a heap stored in a list
  • unittest - framework used to create and run unit tests
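
To show how heapq can support a running median, here is a minimal sketch of the common two-heap technique. It is not necessarily how src/median_unique.py is written: the class name RunningMedian and the sample counts (11, 14, 17) are assumptions chosen so that the results match the example output in the Overview.

```python
import heapq


class RunningMedian(object):
    """Two-heap running median (a sketch; names are hypothetical)."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half, stored as negated values
        self._high = []  # min-heap of the larger half

    def add(self, value):
        # Push onto the lower half, move its largest element to the upper
        # half, then rebalance so the lower half is never the smaller one.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values added yet")
        if len(self._low) > len(self._high):
            return float(-self._low[0])
        return (-self._low[0] + self._high[0]) / 2.0


rm = RunningMedian()
medians = []
for unique_words in [11, 14, 17]:  # hypothetical unique-word counts per tweet
    rm.add(unique_words)
    medians.append(rm.median())
print(medians)
```

Each insertion costs O(log n) and each median lookup O(1), at the price of keeping every count in memory.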

Check out and run

The two scripts can be run separately, or together from a shell script.

To run separately:

Both words_tweeted.py and median_unique.py accept the same two parameters:

  1. Input file: this is the file containing tweets separated by newlines and located in
  2. Output file: this is the file that is produced and located
$ git clone https://github.com/vchiapaikeo/tweet_ingestion.git
$ cd tweet_ingestion
$ python src/words_tweeted.py tweet_input/tweets.txt tweet_output/ft1.txt
$ python src/median_unique.py tweet_input/tweets.txt tweet_output/ft2.txt
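
For readers wondering how the two parameters might be read, the sketch below pulls an input path and an output path from sys.argv. The helper name parse_args and the usage message are assumptions for illustration and may differ from what the scripts do.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_path) from an argv-style list.

    argv[0] is the script name, so exactly three entries are expected.
    """
    if len(argv) != 3:
        raise ValueError("usage: python <script> <input_file> <output_file>")
    return argv[1], argv[2]


if __name__ == "__main__" and len(sys.argv) == 3:
    input_path, output_path = parse_args(sys.argv)
    print("reading %s, writing %s" % (input_path, output_path))
```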

To run from shell script:

This scenario is simpler and will execute both scripts back to back.

$ ./run.sh

Test

Unit tests have been created to test functions within the app. Tests should be run from the command line in the top-level directory. Two test modules are available, one for each module in src, and can be executed as follows:

$ cd tweet_ingestion
$ python -m src.tests.unit_words_tweeted
$ python -m src.tests.unit_median_unique

Each run should print output similar to the following:

..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK
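
For anyone adding tests, a unittest case in the same spirit might look like the sketch below. The function unique_word_count and the test class are hypothetical stand-ins, not the contents of src/tests; the function simply counts distinct whitespace-separated words in one line.

```python
import unittest


def unique_word_count(line):
    """Hypothetical helper: number of distinct whitespace-separated words."""
    return len(set(line.split()))


class UniqueWordCountTest(unittest.TestCase):
    def test_repeated_words_counted_once(self):
        self.assertEqual(unique_word_count("big data big"), 2)

    def test_empty_line_has_no_words(self):
        self.assertEqual(unique_word_count(""), 0)


def run_suite():
    """Run the test case programmatically and return the TestResult."""
    suite = unittest.TestLoader().loadTestsFromTestCase(UniqueWordCountTest)
    return unittest.TextTestRunner(verbosity=1).run(suite)


if __name__ == "__main__":
    run_suite()
```

With two passing tests, the runner prints two dots followed by a summary line and OK, in the same shape as the output shown above.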

Next Steps (how you can help!)

This project is a work in progress. A number of items could still be added to improve its functionality, performance, and robustness. A few of my favorite wish-list items are listed below.

  1. Reduce memory footprint in src/median_unique.py - the current implementation (submitted 2015-07-19) retains lists of streaming unique word counts per line in memory...
  2. Support multiple files in input dir using glob? Could this be something we want?
  3. Add more tests!
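
One possible direction for the first item is to keep a fixed-size histogram of counts instead of a list of every count. The sketch assumes the number of unique words in a tweet is bounded; the bound of 70 comes from assuming a 140-character tweet of single-character words separated by spaces. The class name HistogramMedian and the max_value parameter are hypothetical, and this is not the current implementation.

```python
class HistogramMedian(object):
    """Running median over small non-negative integers in constant memory."""

    def __init__(self, max_value=70):  # assumed upper bound on unique words
        self._bins = [0] * (max_value + 1)
        self._total = 0

    def add(self, value):
        if not 0 <= value < len(self._bins):
            raise ValueError("value outside the assumed range")
        self._bins[value] += 1
        self._total += 1

    def _kth(self, k):
        """Return the k-th smallest value seen so far (0-indexed)."""
        seen = 0
        for value, count in enumerate(self._bins):
            seen += count
            if seen > k:
                return value
        raise IndexError("k out of range")

    def median(self):
        if self._total == 0:
            raise ValueError("no values added yet")
        mid = self._total // 2
        if self._total % 2:
            return float(self._kth(mid))
        return (self._kth(mid - 1) + self._kth(mid)) / 2.0


hm = HistogramMedian()
hist_medians = []
for unique_words in [11, 14, 17]:  # hypothetical counts, one per tweet
    hm.add(unique_words)
    hist_medians.append(hm.median())
print(hist_medians)
```

Memory stays at max_value + 1 integers no matter how many tweets are read, and each median lookup scans at most that many bins.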
