talkingdata-adtracking-fraud-detection

flowlight's part of the 1st place solution for the TalkingData AdTracking Fraud Detection Challenge (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/)

For an overview of our solution, please see https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56475 or https://www.slideshare.net/TakanoriHayashi3/talkingdata-adtracking-fraud-detection-challenge-1st-place-solution (in Japanese).

Disclaimer: I did not prioritize the quality or readability of the code in this repository, and the following documentation may contain errors.

Setup

This repository assumes that all experiments are conducted in a Docker container on an AWS Docker host created by docker-machine, so you need to create a Docker host and image before running any experiments.

  • (Caution: running the command in this section automatically creates a Docker host on an AWS EC2 spot instance. If you don't want to create one, skip this section and set up your own environment.) You can create a Docker host by running the script docker_generate.sh under the docker/ directory (if you haven't installed docker-machine on your computer, install it before running this script). This script also creates a Docker image kaggle/flowlight, which contains some extra libraries on top of kaggle/python.
    • cd ./docker
    • AWS_VPC_ID=<your own AWS VPC id> ./docker_generate.sh
  • Start a container with this image by running the following commands:
    • eval $(docker-machine env talkingdata)
    • docker run -it kaggle/flowlight bash
  • After logging into the container, download the following datasets from Kaggle (if you are unsure how to fetch these files on a remote server, this is a good opportunity to learn the kaggle-api):
    • train.csv
    • test.csv
    • test_supplement.csv
  • Then, you have to convert the csv files into feather format to accelerate later processing by running python scripts/convert_csv_to_feather.py. During the competition, all the datasets and features were stored in feather files because reading feather files is much faster than reading csv files (a minimal sketch of this conversion appears after this list).
  • Next, you have to create a mapping from full test data (test_supplement.csv) click ids to submission test data (test.csv) click ids by running python scripts/click_id_mapper.py.
  • Last, build the C++ programs used in feature generation:
    • cd ./cpp
    • make -j
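
For reference, here is a minimal sketch of what scripts/convert_csv_to_feather.py might look like, assuming pandas with a feather backend (pyarrow); the actual script may choose dtypes and paths differently:

import pandas as pd

# Assumption: the Kaggle CSV files were downloaded into data/input/.
for name in ["train", "test", "test_supplement"]:
    df = pd.read_csv("data/input/{}.csv".format(name))
    # Feather preserves column dtypes and re-loads far faster than CSV.
    df.to_feather("data/input/{}.feather".format(name))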

Experiment Configuration

In this repository, all the experiments are configured with a JSON configuration file. Please see the JSON files under the configs directory; the following is an example:

{
  "features": [
    "app",
    "hour",
    "count",
    "duplicated_row_index_diff",
    "future_click_count_10",
    "future_click_count_80",
    "next_click_time_delta",
    "prev_click_time_delta",
    "all_click_count",
    "average_attributed_ratio",
    "komaki_lda_10_ip",
    "komaki_lda_20_no_device_ip",
    "komaki_lda_20_no_device_os",
    "komaki_lda_20_no_device_channel",
    "komaki_lda_20_no_device_app",
    "komaki_lda_5_no_device",
    "komaki_nmf_5_no_device",
    "komaki_pca_5_no_device"
  ],
  "model": {
    "name": "lightgbm",
    "model_params": {
      "boosting_type": "gbdt",
      "objective": "binary",
      "metric": "auc",
      "learning_rate": 0.01,
      "num_leaves": 255,
      "max_depth": 8,
      "min_child_samples": 200,
      "subsample": 0.9,
      "subsample_freq": 1,
      "colsample_bytree": 0.5,
      "min_child_weight": 0,
      "subsample_for_bin": 1000000,
      "min_split_gain": 0,
      "reg_lambda": 0,
      "verbose": 0
    },
    "train_params": {
      "num_boost_round": 5000,
      "early_stopping_rounds": 30
    }
  },
  "dataset": {
    "input_directory": "data/input/",
    "cache_directory": "data/working/",
    "files": {
      "train": "train.feather",
      "test": "old_test.feather"
    },
    "negative_down_sampling": {
      "enabled": true,
      "bagging_size": 5
    }
  },
  "rank_average": false,
  "test_hours": {
    "filter_validation": true,
    "higher_train_weight": false
  },
  "note": "100 with min_child_samples = 200"
}
  • features: You can specify the list of features to be used with this field. Each feature name is tied to an actual feature generator in run.py.
  • model: A model name and model definition/training parameters can be specified here. In practice, this repository only supports LightGBM (there is a lot of hard-coded LightGBM-specific logic).
  • dataset:
    • input_directory: Our script reads training/test files from this directory.
    • cache_directory: All the created features are cached under this directory.
    • files: You can specify the names of the train/test files here. We used this field to first run an experiment on very small datasets, switching to the full datasets only after features were successfully created from the small ones.
    • negative_down_sampling:
      • enabled: I am not sure experiments run correctly when this parameter is set to false, so you should always set it to true.
      • bagging_size: This parameter specifies the number of datasets created by negative down-sampling. Our script generates each down-sampled dataset with a different random seed (see the sketch after this list).
  • rank_average:
    • This field specifies how the final prediction is created from the predictions on the multiple down-sampled datasets. If rank_average = true, rank averaging (see the KAGGLE ENSEMBLING GUIDE) is used; otherwise the average of the prediction values is used.
  • test_hours:
    • filter_validation: If filter_validation = true, we filter the validation data by click hour, because the test dataset for submission covers only a small time range of one day.
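
To make negative_down_sampling and rank_average concrete, here is a minimal, hypothetical sketch of bagged training; the helper names are illustrative rather than the repository's actual API, and the LightGBM call assumes a version contemporary with the competition, in which lgb.train still accepted early_stopping_rounds directly:

import lightgbm as lgb
import numpy as np
import pandas as pd

def negative_down_sample(df, seed):
    # Keep all positive clicks and sample negatives with a per-bag seed.
    # (Assumption: a balanced ratio; the actual sampling rate may differ.)
    pos = df[df["is_attributed"] == 1]
    neg = df[df["is_attributed"] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg]).sort_index()

def combine(predictions, rank_average):
    preds = np.vstack(predictions)  # shape: (bagging_size, n_test)
    if rank_average:
        # Rank averaging: replace each bag's scores with their ranks before
        # averaging, which is robust to differently scaled score distributions.
        return preds.argsort(axis=1).argsort(axis=1).mean(axis=0)
    return preds.mean(axis=0)

def run_bagged(config, train_df, X_valid, y_valid, X_test, features):
    model_cfg = config["model"]
    preds = []
    for seed in range(config["dataset"]["negative_down_sampling"]["bagging_size"]):
        bag = negative_down_sample(train_df, seed)
        dtrain = lgb.Dataset(bag[features], bag["is_attributed"])
        booster = lgb.train(
            model_cfg["model_params"],
            dtrain,
            valid_sets=[lgb.Dataset(X_valid, y_valid, reference=dtrain)],
            **model_cfg["train_params"],  # num_boost_round, early_stopping_rounds
        )
        preds.append(booster.predict(X_test, num_iteration=booster.best_iteration))
    return combine(preds, config["rank_average"])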

Experiment Execution

You just need to run python run.py --config <configuration file>. It will create two files, data/output/<configuration file>.result.json and data/output/<configuration file>.submission.csv: the former contains statistics of the experiment and the latter contains predictions for the specified test dataset. Note that you can skip prediction on the test dataset by specifying the --train_only option.
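
For example (configs/example.json is a placeholder; substitute one of the actual JSON files under the configs directory):

  • python run.py --config configs/example.json
  • python run.py --config configs/example.json --train_only (skips prediction on the test dataset)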
