TPU models for Open Images 2019 Object Detection Challenge | Kaggle

How-to

I recommend these tutorials to get started with TPUs:

https://cloud.google.com/tpu/docs/tutorials/resnet for image classification;
https://cloud.google.com/tpu/docs/tutorials/retinanet for object detection;
https://cloud.google.com/tpu/docs/tutorials/mask-rcnn for instance segmentation.

My changes to this repo

I added a scripts/ directory with my data processing and inference scripts.
I added many model configs to models/official/detection/configs/yaml/.
I added EfficientNet support into the RetinaNet code: https://github.com/artyompal/tpu_models/blob/master/models/official/detection/modeling/architecture/efficientnet_model.py
I also added SE-ResNext support, but it's incomplete and runs too slowly. To speed it up you would, at the very least, have to transpose channels: https://github.com/artyompal/tpu_models/blob/master/models/official/detection/modeling/architecture/seresnext.py
Here I fixed the train_and_eval loop not supporting restarts: tensorflow/tpu#496
I fixed loading of pretrained models. The standard tf.train.init_from_checkpoint() function fails if even a single variable is missing from the checkpoint, so it fails with pretrained EfficientNets whenever they were trained with a different optimizer than yours. I provided a more permissive function.
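The idea behind a permissive loader can be sketched in plain Python. This is not the code from this repo, just a minimal illustration: build_permissive_assignment_map is a hypothetical helper that keeps only the variables present in both the checkpoint and the model with identical shapes, and reports the rest instead of failing.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def build_permissive_assignment_map(
    checkpoint_vars: Dict[str, Shape],
    model_vars: Dict[str, Shape],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (assignment_map, skipped) for variables that are safe to restore.

    Hypothetical helper: a model variable is restored only when the checkpoint
    holds a variable with the same name and the same shape.
    """
    assignment_map: Dict[str, str] = {}
    skipped: List[str] = []
    for name, shape in sorted(model_vars.items()):
        ckpt_shape = checkpoint_vars.get(name)
        if ckpt_shape is None:
            # Typical case: optimizer slot variables absent from the checkpoint.
            skipped.append(name)
        elif tuple(ckpt_shape) != tuple(shape):
            # Same name but incompatible shape, e.g. a different class count.
            skipped.append(name)
        else:
            assignment_map[name] = name
    return assignment_map, skipped


if __name__ == "__main__":
    ckpt = {"conv/kernel": (3, 3, 3, 32)}
    model = {"conv/kernel": (3, 3, 3, 32), "conv/kernel/Momentum": (3, 3, 3, 32)}
    amap, skipped = build_permissive_assignment_map(ckpt, model)
    print("restore:", amap)
    print("skip:", skipped)
```

In TF 1.x the checkpoint side could come from tf.train.list_variables(), and the resulting map could be passed to tf.train.init_from_checkpoint(); the skipped variables simply keep their initial values.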

Training and inference process

Data preparation

My training pipeline for this dataset is fairly involved, and you may not need all of it. Anyway, this is what I do.

I remove all annotations marked IsGroupOf. Then I split the classes into 6 groups: five by frequency rank (1-100, 101-200, 201-300, 301-400, 401-432) plus Human Parts (11 classes; a special dataset is provided for these). Then I pick 5 images per class for validation (I handpicked correct samples by blacklisting incorrect ones). Finally, I generate a COCO-style validation.json. To do all of this, run:
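For readers who want to see the logic rather than run the script, here is a rough Python sketch of the steps described above. It is not the contents of prepare_datasets.sh: the function names are hypothetical, the rows are assumed to be dicts using the Open Images column names (ImageID, LabelName, IsGroupOf), and taking the first non-blacklisted images stands in for the manual selection.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List

Row = Dict[str, str]


def drop_group_boxes(rows: List[Row]) -> List[Row]:
    """Remove annotations flagged as IsGroupOf."""
    return [r for r in rows if str(r["IsGroupOf"]) != "1"]


def split_classes_by_frequency(rows: List[Row], group_size: int = 100) -> List[List[str]]:
    """Rank classes by annotation count and cut the ranking into chunks."""
    counts = Counter(r["LabelName"] for r in rows)
    # Most frequent first; ties broken by label name to keep the order stable.
    ranked = [label for label, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def pick_validation_images(
    rows: List[Row],
    per_class: int = 5,
    blacklist: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Pick up to per_class distinct images per class, skipping blacklisted ones."""
    images_by_class: Dict[str, List[str]] = defaultdict(list)
    for r in rows:
        image_id, label = r["ImageID"], r["LabelName"]
        if image_id not in blacklist and image_id not in images_by_class[label]:
            images_by_class[label].append(image_id)
    return {label: images[:per_class] for label, images in images_by_class.items()}


if __name__ == "__main__":
    rows = [
        {"ImageID": "a", "LabelName": "/m/cat", "IsGroupOf": "0"},
        {"ImageID": "b", "LabelName": "/m/cat", "IsGroupOf": "1"},
        {"ImageID": "c", "LabelName": "/m/dog", "IsGroupOf": "0"},
    ]
    kept = drop_group_boxes(rows)
    print(split_classes_by_frequency(kept))
    print(pick_validation_images(kept))
```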

cd scripts/
./prepare_datasets.sh

Then I build TFRecord files and upload them to GCS. Replace every gs://new_tpu_storage/ link with the path to your own GCS bucket and run this:

./prepare_tfrecords_v1.sh
./upload_files_to_gcs.sh

Create an instance

Open Cloud Shell from the Google Cloud Console page of your project and use the ctpu tool to create a VM instance with a TPU:

gcloud config set project $YOUR_TPU_PROJECT_ID

and then one of:
ctpu up --tpu-size=v3-8 --machine-type n1-standard-4 --zone ZONE --name TPU_NAME
or:
ctpu up --tpu-size=v2-8 --machine-type n1-highmem-2  --zone ZONE --name TPU_NAME
or:
ctpu up --tpu-size=v2-8 --machine-type n1-highmem-2 --zone ZONE --name TPU_NAME --preemptible
or when your TPU has been preempted and you need a new one:
ctpu up --tpu-size=v2-8 --zone ZONE --name TPU_NAME --preemptible --tpu-only --noconf

Optionally, open port 6006 for TensorBoard:
gcloud compute firewall-rules create tensorboard --allow tcp:6006 --source-tags=TPU_NAME --source-ranges=0.0.0.0/0

Train a model

SSH into an instance:

gcloud beta compute --project YOUR_TPU_PROJECT_ID ssh --zone YOUR_ZONE TPU_NAME

And then:

cd tpu_models/scripts/
git pull
./train_on_dataset.sh DATASET_PART VERSION
or
./train_on_fold.sh DATASET_PART VERSION FOLD_NUMBER

Export a model

Saved models are in a TPU-specific format. You can export them to a device-independent format like this:

./export_saved_model.sh PART VERSION STEP

Infer predictions

Once you have downloaded your model from the cloud, you can run inference on your GPU. I use Docker to take care of TensorFlow and CUDA, but it's not required:

./docker_run.sh python inference.py PREDICTIONS.PKL MODEL_DIR

Generate submission

Once you have a pickle file with predictions, you can generate a CSV submission:

# generate a csv file:
./gen_sub.py PREDICTIONS.PKL

# combine csv files from different parts (simply concatenates)
./join_subs.py sub.csv sub_part_0.csv sub_part_1.csv sub_part_2.csv sub_part_3.csv sub_part_4.csv sub_human_parts.csv

# combine predictions from several models (uses Soft-NMS)
./merge_subs.py ensemble.csv model1.csv model2.csv model3.csv ...

# drop predictions below threshold, e.g. 0.03
./trim_sub_by_threshold.py trimmed_sub.csv sub.csv 0.03

# keep only the top N predictions by confidence (the Kaggle scorer accepts about 150M predictions at most)
./trim_sub_by_num_of_predicts.py trimmed_sub.csv sub.csv 150000000
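Since merge_subs.py relies on Soft-NMS, a minimal sketch of the Gaussian variant of that technique may help. It is an illustration, not the code from merge_subs.py: the sigma and score_threshold values are assumptions, and in practice it would be applied per image and per class to the pooled boxes of all models.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # xmin, ymin, xmax, ymax


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(
    boxes: List[Box],
    scores: List[float],
    sigma: float = 0.5,              # assumed value, tune for your data
    score_threshold: float = 0.001,  # assumed value, tune for your data
) -> List[Tuple[Box, float]]:
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of dropping them."""
    candidates = list(zip(boxes, scores))
    kept: List[Tuple[Box, float]] = []
    while candidates:
        # Take the highest-scoring remaining box.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay every other box according to its overlap with the chosen one.
        rescored = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        candidates = rescored
    return kept


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
    for box, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
        print(box, round(score, 3))
```

Unlike hard NMS, a duplicate box is kept with a reduced score, which is why trimming by threshold afterwards is a natural follow-up step.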

Useful tips

My way of configuring an instance

sudo apt install -y mc htop python-tk

echo export PYTHONPATH=$HOME/tpu_models/models >>~/.bashrc
export PYTHONPATH=$HOME/tpu_models/models

git config --global core.editor "vim"
git config --global diff.tool "vimdiff"
git config --global --add difftool.prompt false
git config --global user.name "some name"
git config --global user.email "you@example.com"
git config --global alias.alias "config --get-regexp ^alias\."
git config --global alias.lg "log --graph --pretty=format:'%Cgreen(%h) -%Cblue(%ci) %C(yellow)<%an>%d%Creset %s' --abbrev-commit"

pip install --user Cython matplotlib opencv-python-headless pyyaml Pillow
pip install --user 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI'

git clone git@github.com:artyompal/tpu_models.git
cd tpu_models/scripts/

Print variables in a TensorFlow checkpoint

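import tensorflow as tf
from tensorflow.python.tools import inspect_checkpoint as chkp

# path prefix of the checkpoint to inspect (placeholder, replace with your own)
CHECKPOINT_PATH = 'model.ckpt'

# print the names of all nodes in the current default graph (TF 1.x API);
# call this from the place where the model restores its checkpoint
def print_graph_nodes():
  print([n.name for n in tf.get_default_graph().as_graph_def().node])

# print all tensors in the checkpoint file
chkp.print_tensors_in_checkpoint_file(CHECKPOINT_PATH, tensor_name='', all_tensors=True)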

Monitor training

Using TensorBoard for 10+ instances is a pain, so I use Bash scripting with some grepping:

# print all AP50 scores
cat training.log | grep -oE '(.AP50.:[ .0-9]+|Restoring.*)' | uniq

# take the freshest log file from the current directory
cat `ls -ct1 | head -n 1` | grep -oE '(.AP50.:[ .0-9]+|Restoring.*)' | uniq

# take the freshest log file and print the best score
cat `ls -ct1 | head -n 1` | grep -oE '(.AP50.:[ .0-9]+)' | uniq | sort -r | head -n 1
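The same extraction can be done in Python, which compares scores numerically rather than as text. This is a small sketch under one assumption: the log prints scores in the form 'AP50': 0.123, which is what the grep patterns above appear to match. The function names are hypothetical.

```python
import re
from typing import List, Optional

# Matches fragments such as 'AP50': 0.412 and captures the number.
AP50_PATTERN = re.compile(r"['\"]AP50['\"]:\s*([0-9]*\.?[0-9]+)")


def ap50_scores(log_text: str) -> List[float]:
    """Return every AP50 value found in the log, in order of appearance."""
    return [float(value) for value in AP50_PATTERN.findall(log_text)]


def best_ap50(log_text: str) -> Optional[float]:
    """Return the highest AP50 in the log, or None if no score was logged."""
    scores = ap50_scores(log_text)
    return max(scores) if scores else None


if __name__ == "__main__":
    sample = "eval {'AP': 0.2, 'AP50': 0.41}\nRestoring parameters\neval {'AP50': 0.5}\n"
    print(ap50_scores(sample))
    print(best_ap50(sample))
```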

Training ImageNet model

It works out of the box. Register on the ImageNet website and use tools/datasets/imagenet_to_gcs.py to download the data and convert it to TFRecords. Then run this:

export DEPTH=101 # 50, 101, 152, 200
python resnet_main.py --tpu=$HOSTNAME --data_dir=GS_SOURCE_STORAGE \
    --model_dir=GS_DESTINATION_STORAGE --resnet_depth=$DEPTH \
    --config_file=configs/resnet$DEPTH.yaml 2>&1 | tee -a ~/resnet$DEPTH.log

# print all validation results
cat ~/resnet$DEPTH.log | grep 'Saving dict' | grep -v INFO

# print the best validation scores
cat ~/resnet$DEPTH.log | grep 'Saving dict' | grep -v INFO | cut -d' ' -f 19- | sort -r | head

# print the best validation score with step number
cat ~/resnet$DEPTH.log | grep `cat ~/resnet$DEPTH.log | grep 'Saving dict' | grep -v INFO | cut -d' ' -f 19 | sort -r | head -n 1` | grep Saving | grep INFO
