This project forked from gaussalgo/mlp_2017_workshop

Machine Learning Prague 2017 - Advanced data analysis on Hadoop clusters workshop details.

MLP_2017_workshop

Introduction

This is the accompanying site for the MLPrague 2017 workshop Advanced data analysis on Hadoop clusters. It provides the source code for the machine learning part of the workshop, together with a description of the data used (see below).

The source code is written for Spark.

Problem Statement

The practical machine learning part is divided into two tasks:

  1. Community detection in telecommunication networks
  2. Churn prediction in telecommunication industry

Because the churn prediction part relies on the results of community detection, the code described in the Community Detection section must be run first. The churn prediction part can then be executed by running the main.py script.

Community Detection

Given phone call records, the task is to find communities in the network built from these calls. Customers are the vertices of the network, and an edge links two customers who called each other.

The presented solution creates a graph from one month of call records. Only customers with at least 10 calls are linked together. The Label Propagation Algorithm is used for community detection.

The Scala source code for Spark can be found in the phase_0_community_detection/ directory. The Scala script assumes that the mlp_sampled_cdr_records.parquet data is available.

This script will create two new data files: lpa_20160301_20160401.parquet and lpa_20160401_20160501.parquet.
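The graph construction and label propagation described above can be illustrated with a small plain-Python sketch. This is not the workshop's Scala code: the function names are hypothetical, calls are assumed to be already reduced to (caller, callee) pairs, and the 10-call threshold is interpreted here as "at least 10 calls between the two customers", which is an assumption.

```python
from collections import Counter


def build_edges(calls, min_calls=10):
    """Keep an undirected edge only for pairs with at least min_calls calls.

    calls is an iterable of (caller, callee) pairs (hypothetical input shape).
    """
    pair_counts = Counter()
    for a, b in calls:
        if a != b:  # ignore self-calls
            pair_counts[frozenset((a, b))] += 1
    return [tuple(sorted(pair)) for pair, n in pair_counts.items() if n >= min_calls]


def label_propagation(edges, max_iter=20):
    """Simplified, deterministic label propagation.

    Every vertex starts with its own label and repeatedly adopts the most
    frequent label among its neighbours (ties go to the smallest label).
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = {v: v for v in neighbors}
    for _ in range(max_iter):
        changed = False
        for v in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[v])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:  # stop once no label moved in a full pass
            break
    return labels
```

Vertices that end up with the same label form one community. The deterministic tie-breaking is chosen only to keep the sketch reproducible; distributed implementations typically update all vertices in parallel.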

Churn Prediction

In this part, the task is to predict which customers are likely to churn. All source code for this part is written in Python and is meant to be run with PySpark. The machine learning model uses features extracted from one month to predict potential churners in the following month. For example, it takes phone call records from March and predicts which customers are likely to churn in April. Features are built from the input data described below.
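The month shift between features and labels can be sketched as follows. This is a minimal illustration in plain Python rather than PySpark; the function name is hypothetical, and date_key values are assumed to be strings that sort chronologically (for example 'YYYYMMDD'), which the data description does not confirm.

```python
def churn_labels(customers, churn_records, month_start, month_end):
    """Label each customer 1 if they churned in [month_start, month_end), else 0.

    customers      - iterable of msisdn values with features from the previous month
    churn_records  - iterable of (msisdn, date_key) pairs
    month_start/end - date keys bounding the prediction month (assumed format)
    """
    churned = {
        msisdn
        for msisdn, date_key in churn_records
        if month_start <= date_key < month_end
    }
    return {msisdn: int(msisdn in churned) for msisdn in customers}
```

For features built from March records, the labels would be taken from April, e.g. `churn_labels(customers, churn_records, "20160401", "20160501")`.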

This part is divided into three phases:

  1. Data preparation - creates various features from the input data
  2. Data preprocessing - imputes and transforms features; it also adds some new derived features
  3. Classification - trains a classification model on a train dataset and applies it on a test dataset
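As an illustration of the preprocessing phase, the sketch below fills missing values of one feature with the median of the observed values. The imputation strategy actually used in the workshop code is not stated here, so median imputation and the function name are assumptions.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the column median.

    rows is a list of dicts (hypothetical in-memory stand-in for a DataFrame).
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if observed else 0  # fall back to 0 if nothing is observed
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]
```

The input rows are left unchanged, which keeps the raw features available for comparison with the preprocessed ones.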

The model is evaluated outside these phases so that the evaluation can be illustrated in detail.

Input Data Description

mlp_sampled_cdr_records.parquet - phone call records from two months

  • record_type: string - type of voice records
  • date_key: string - date of the call
  • duration: integer - duration of the call in seconds
  • frommsisdn_prefix: string - operator prefix
  • frommsisdn: long - home operator number (either receiving or calling - according to the record type)
  • tomsisdn_prefix: string - operator prefix
  • tomsisdn: long - number of the second customer (can be either of the home operator or not)

mlp_sampled_ebr_base_20160401.parquet, mlp_sampled_ebr_base_20160501.parquet - information about home operator customers

  • msisdn: long - number of the customer
  • customer_type: string - either private or business
  • commitment_from_key: string - date of the commitment start
  • commitment_to_key: string - date of the commitment end
  • rateplan_group: long - name of the rateplan group
  • rateplan_name: long - name of the rateplan

mlp_sampled_ebr_churners_20151201_20160630.parquet - list of churned customers from two months

  • msisdn: long - number of the customer
  • date_key: string - date of the churn

Other Information

The scripts directory contains various Python scripts for data exploration. The move_data.py script illustrates how to copy parquet data from a remote AWS S3 repository to a local one.

Description of Features

NOTE: 'callcenters' are numbers that behave like call centers, i.e. they call a huge number of phone numbers. We select the top 12 such 'callcenters' from the data.
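One way to pick such numbers is to rank callers by how many distinct numbers they call. The sketch below is a plain-Python illustration under that assumption; the function name is hypothetical and calls are assumed to be already resolved into (caller, callee) pairs.

```python
from collections import defaultdict


def top_callcenters(calls, top_n=12):
    """Return the top_n callers ranked by the number of distinct callees."""
    callees = defaultdict(set)
    for caller, callee in calls:
        callees[caller].add(callee)
    # Sort by distinct-callee count (descending), then by number for determinism.
    ranked = sorted(callees, key=lambda c: (-len(callees[c]), c))
    return ranked[:top_n]
```

Counting distinct callees rather than total calls keeps a customer who calls one number very often from being mistaken for a call center.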

  • churned - binary label attribute
  • msisdn
  • customer_type
  • rateplan_group
  • rateplan_name
  • committed - whether the customer is committed at this point
  • committed_days - for how long is the customer committed
  • commitment_remaining - how many days till the end of the commitment
  • callcenter_calls_count - count of phone calls with so called 'callcenters'
  • callcenter_calls_duration - total duration of phone calls with so called 'callcenters'
  • cc_cnt_X1 - count of phone calls with call center X1, where X1 is the number of the callcenter
  • cc_dur_X1 - duration of phone calls with call center X1, where X1 is the number of the callcenter
  • cc_avg_X1 - average duration of phone calls with call center X1, where X1 is the number of the callcenter
  • cc_std_X1 - standard deviation of duration of phone calls with call center X1, where X1 is the number of the callcenter
  • com_degree
  • com_degree_total
  • com_count_in_group
  • com_degree_in_group
  • com_score
  • com_group_leader
  • com_group_follower
  • com_churned_cnt
  • com_leader_churned_cnt
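The commitment and per-callcenter features could be derived roughly as in the following sketch. It is plain Python rather than PySpark, the function names are hypothetical, date keys are assumed to use the 'YYYYMMDD' format, and the population standard deviation is used for cc_std_X1; none of these choices is confirmed by the workshop code.

```python
from datetime import datetime
from statistics import mean, pstdev


def parse_key(key):
    """Parse a date key, assuming the 'YYYYMMDD' format."""
    return datetime.strptime(key, "%Y%m%d").date()


def commitment_features(commitment_from_key, commitment_to_key, reference_key):
    """Compute committed, committed_days and commitment_remaining at a reference date."""
    if not commitment_from_key or not commitment_to_key:
        return {"committed": 0, "committed_days": 0, "commitment_remaining": 0}
    ref = parse_key(reference_key)
    start, end = parse_key(commitment_from_key), parse_key(commitment_to_key)
    committed = int(start <= ref <= end)
    return {
        "committed": committed,
        "committed_days": (ref - start).days if committed else 0,
        "commitment_remaining": (end - ref).days if committed else 0,
    }


def callcenter_features(durations, callcenter_id):
    """Compute cc_cnt, cc_dur, cc_avg and cc_std for one callcenter.

    durations is the list of call durations (seconds) between one customer
    and the callcenter identified by callcenter_id.
    """
    suffix = str(callcenter_id)
    if not durations:
        return {f"cc_{name}_{suffix}": 0 for name in ("cnt", "dur", "avg", "std")}
    return {
        f"cc_cnt_{suffix}": len(durations),
        f"cc_dur_{suffix}": sum(durations),
        f"cc_avg_{suffix}": mean(durations),
        f"cc_std_{suffix}": pstdev(durations),  # population std; an assumption
    }
```

Customers without a commitment or without calls to a given callcenter get zeros here; whether the original pipeline uses zeros or nulls in those cases is not specified.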
