Welcome to the Classification Bot codebase. Classification Bot is an attempt to simplify the collection, extraction, and preprocessing of data, and to provide an end-to-end pipeline for using that data to train large deep neural networks.
The system is composed of scrapers, data extractors, preprocessors, deep neural network models built with Keras (by Francois Chollet), and an easy-to-use deployment module.
Make sure you have a GPU, as training is very compute-intensive.
- (OSX) Install gcc:
brew install gcc
- Install CUDA Toolkit 7.5
- Install cuDNN 4
- Install Theano, using
sudo pip install git+git://github.com/Theano/Theano.git
- Install OpenCV
- Install hdf5 library (libhdf5-dev)
- Make sure you have Python 2.7.6 and virtualenv installed on your system
- Install Python dependencies
$ virtualenv --python=python2 --system-site-packages env
$ . env/bin/activate
$ pip install -r requirements.txt
Use google_image_scraper.py to download images. It takes a .csv file listing the categories you want and downloads a number of images for each line.
The first line of the .csv file is ignored.
The number of images per category is configurable with -n. We suggest a number between 200 and 1000:
$ google_image_scraper.py -n 200 yourfilehere.csv
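To make the input format concrete, here is a minimal sketch of how such a file can be read: the first line is skipped and each remaining line is treated as one category. This is an illustration only, not the scraper's actual code; the function name read_categories and the header text "category" are assumptions.

```python
import csv
import io


def read_categories(csv_text):
    """Return one category per row, skipping the first (header) line."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Drop the first line, then keep the first cell of each non-empty row.
    return [row[0].strip() for row in rows[1:] if row and row[0].strip()]


# Example file contents: a header line followed by one category per line.
sample = u"category\nGurren Lagann\nKill La Kill\nNaruto\n"
print(read_categories(sample))
```

Because the first line is dropped, a file without a header line would silently lose its first category, so it is worth checking that your file starts with one.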
(For users who already have a list of categories at hand):
- Create a .csv file with one category per line of what you want the scraper to search for.
- Now let's download some images! Run
python google_image_scraper.py yourfilehere.csv
(For users who know an online resource that lists their categories and want to fetch them, who have too many categories to enter by hand, or who would simply rather write code than copy and paste):
- Write a script that fetches your categories from Wikipedia or any other resource you like. For an example, look at
examples/anime_names.py
to see what we used to get our categories.
- Have your script create a .csv file with the categories you require.
- Then run
python google_image_scraper.py yourfilehere.csv
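As a rough sketch of the last step of such a script, the snippet below writes a header line (which the scraper ignores) followed by one category per line. The hard-coded list stands in for whatever your fetching code returns, and write_categories is a hypothetical helper, not part of this repo.

```python
import csv
import os
import tempfile


def write_categories(path, categories, header="category"):
    """Write a header line followed by one category per line."""
    with open(path, "w") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow([header])  # the scraper ignores this first line
        for name in categories:
            writer.writerow([name])


# Placeholder data; replace with the output of your own fetching code.
categories = ["Gurren Lagann", "Kill La Kill", "Naruto"]
out_path = os.path.join(tempfile.mkdtemp(), "categories.csv")
write_categories(out_path, categories)
print("Wrote {} categories to {}".format(len(categories), out_path))
```

Using the csv module rather than plain string joins means a category that contains a comma is quoted correctly.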
- Once you have your data ready, run
python train.py --extract_data
to process all of your data and save it in HDF5 files.
- Once all of the above is done, you are ready to train your network. Run
python train.py --run
to load data from the HDF5 files, or
python train.py --run --extract_data
to extract data and train in one step.
- The weights are saved after each epoch, so you can continue training an existing model. To do so, simply run
python train.py --run --continue
- Once your training has finished and a good model has been trained then you can deploy your model.
- To deploy a model on a single URL image use
python deploy.py --URL [URL_LINK]
- To deploy a model on a folder full of images use
python deploy.py --image-folder path/to/folder
- To deploy a model on a single file use
python deploy.py --image-path path/to/file
Once deployed, the model should return the top 5 predictions for each image as a formatted string, e.g.:
Image Name: Tengen.Toppa.Gurren-Lagann.full.174481.jpg
Categories:
0. Gurren Lagann: 0.999914288521
1. Kill La Kill: 7.29278544895e-05
2. Naruto: 4.92283288622e-06
3. Redline: 2.71744352176e-06
4. Cowboy Bebop: 1.41406655985e-06
_________________________________________________
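For readers curious how a report like this can be produced, here is a minimal sketch that ranks a mapping of category to probability and formats the top 5. It is an assumption about the formatting, not the code in deploy.py; format_predictions is a hypothetical name and the separator length is arbitrary.

```python
def format_predictions(image_name, probabilities, top_n=5):
    """Return a printable report of the top_n most probable categories."""
    # Sort (category, probability) pairs from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    lines = ["Image Name: {}".format(image_name), "Categories:"]
    for rank, (category, score) in enumerate(ranked[:top_n]):
        lines.append("{}. {}: {}".format(rank, category, score))
    lines.append("_" * 49)  # separator between images
    return "\n".join(lines)


# Probabilities taken from the sample output above.
scores = {
    "Naruto": 4.92283288622e-06,
    "Gurren Lagann": 0.999914288521,
    "Cowboy Bebop": 1.41406655985e-06,
    "Kill La Kill": 7.29278544895e-05,
    "Redline": 2.71744352176e-06,
}
print(format_predictions("Tengen.Toppa.Gurren-Lagann.full.174481.jpg", scores))
```

Note that the ranks start at 0, matching the sample output.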
- Create your own classifiers
- Try different model architectures (Hint: go to Google Scholar or arXiv and search for GoogLeNet, VGG-Net, AlexNet, ResNet and follow the waves :) )
deepanimebot/bot.py is a Twitter bot that provides an interface for querying the classifier.
- A classifier
- A Twitter app registered under the bot account
- Consumer key and secret for that app
- Your access token and secret for that app
Copy bot.ini.example to bot.ini and overwrite the placeholder values with your consumer key/secret and access token/secret.
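Before starting the bot, it can help to confirm that all four credentials are filled in. The sketch below does that with Python's standard ini parser. The section name "twitter" and the four key names are hypothetical, chosen only for illustration; check bot.ini.example for the names the bot actually expects.

```python
import os
import tempfile

try:
    import configparser                  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

# Hypothetical section and key names; see bot.ini.example for the real ones.
SECTION = "twitter"
REQUIRED_KEYS = ["consumer_key", "consumer_secret",
                 "access_token", "access_token_secret"]


def missing_credentials(path):
    """Return the required keys that are absent or empty in the ini file."""
    parser = configparser.RawConfigParser()
    parser.read(path)
    missing = []
    for key in REQUIRED_KEYS:
        if not parser.has_option(SECTION, key) or not parser.get(SECTION, key).strip():
            missing.append(key)
    return missing


# Demo with a throwaway file in which one value was left blank.
demo_path = os.path.join(tempfile.mkdtemp(), "bot.ini")
with open(demo_path, "w") as handle:
    handle.write("[twitter]\nconsumer_key = a\nconsumer_secret = b\n"
                 "access_token = c\naccess_token_secret =\n")
print(missing_credentials(demo_path))
```

An empty list means every required value is present.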
$ PYTHONPATH=. python deepanimebot/bot.py -c bot.ini --debug --classifier=local
python deepanimebot/bot.py --help
will list all available command line options.
deepanimebot/webapp.py is a Flask app for querying the classifier.
$ PYTHONPATH=. python deepanimebot/webapp.py
This repo comes with the necessary support files for deploying the Twitter bot and/or the web app to Google Cloud Platform.
- A classifier
- Twitter app credentials (see above)
- Docker tools and an account on a Docker registry
- Google Cloud SDK
- A Google Cloud Platform project
classificationbot/base:latest comes with all the dependencies installed.
If you've modified the code and added a new dependency,
make a new Docker image based on the dockerfiles in this repo.
This repo's base images are built with these commands:
$ docker build -t classificationbot/base:latest -f dockerfiles/base/Dockerfile .
$ docker push classificationbot/base:latest
$ docker build -t classificationbot/ci:latest -f dockerfiles/ci/Dockerfile .
$ docker push classificationbot/ci:latest
There are two options:
- (Not used anymore) Google Compute Engine, container-optimized instance, supervisord + tweepy: bot-standalone
- Google Container Engine, Kubernetes, gunicorn + flask + tweepy: follow this gist
Special thanks to Francois Chollet (fchollet) for building the superb Keras deep learning library. We couldn't have made this project usable by people outside machine learning if it weren't for Keras's ease of use.
Special thanks to https://github.com/shuvronewscred/ for building the image scraper we adapted for our project. Original source code can be found at https://github.com/shuvronewscred/google-search-image-downloader