
hxsylzpf / feature-selection


This project forked from loyalzc/feature-selection


Features selection algorithm based on self selected algorithm, loss function and validation method

Home Page: https://pypi.org/project/MLFeatureSelection/

License: MIT License

Python 100.00%

feature-selection's Introduction

Features Selection

This code performs general feature selection based on a machine learning algorithm and an evaluation method of your choice.

More feature selection methods will be included in the future!

More examples have been added to the example folder, including:

  • A simple Titanic example with 5-fold validation, evaluated by accuracy

  • A demo of S1 score improvement in the JData 2018 purchase-time prediction competition

New features

  • Set a sample ratio for large datasets

  • Set the maximum number of features

  • Set the maximum running time

  • Specify a particular feature library

To run the demo, please install via pip3:

pip3 install MLFeatureSelection

Version update

  • Version 0.0.4.1: fixed a sampling bug that occurred when the sample ratio equals 1

Demo is here!

How to run

This demo is based on the IJCAI-2018 data mining competition

  • Import the library from FeatureSelection.py along with the other required libraries
from MLFeatureSelection import FeatureSelection as FS 
from sklearn.metrics import log_loss
import lightgbm as lgbm
import pandas as pd
import numpy as np
  • Prepare the dataset
def prepareData():
    df = pd.read_csv('data/train/trainb.csv')
    # keep only the rows that have a label
    df = df[~pd.isnull(df.is_trade)].copy()
    # encode each distinct category list as an integer code
    item_category_list_unique = list(np.unique(df.item_category_list))
    df['item_category_list'] = df['item_category_list'].replace(item_category_list_unique, list(np.arange(len(item_category_list_unique))))
    return df
  • Define your loss function
def modelscore(y_test, y_pred):
    return log_loss(y_test, y_pred)
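
For reference, the quantity that modelscore returns can be written out by hand. The sketch below is a plain-Python version of binary log loss. The name manual_log_loss and the eps clipping value are illustrative choices, not part of MLFeatureSelection or scikit-learn.

```python
import math


def manual_log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; hypothetical stand-in for sklearn's log_loss."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # clip the probability so that log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


good = manual_log_loss([1, 0], [0.9, 0.1])  # confident and correct
bad = manual_log_loss([1, 0], [0.4, 0.6])   # leaning the wrong way
print(good < bad)
```

A lower value means better predictions, which is presumably why the loss is registered with direction = 'descend' in the steps that follow.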
  • Define the way to validate
def validation(X,y, features, clf,lossfunction):
    totaltest = 0
    for D in [24]:
        T = (X.day != D)
        X_train, X_test = X[T], X[~T]
        X_train, X_test = X_train[features], X_test[features]
        y_train, y_test = y[T], y[~T]
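        # the fit call must match your selected algorithm; LightGBM 4.0 and later drop the verbose and early_stopping_rounds arguments of fit() in favour of callbacks
        clf.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_test, y_test)], eval_metric='logloss', verbose=False, early_stopping_rounds=200)
        totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:,1])
    totaltest /= 1.0 # divide by the number of validation days (one here)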
    return totaltest
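
The function above holds out a single day. A k-fold variant with the same five leading arguments might look like the sketch below, which assumes X is a pandas DataFrame and y a pandas Series. The name kfold_validation and the n_splits and seed parameters are hypothetical, and early stopping is left out to keep the example short.

```python
import numpy as np
import pandas as pd


def kfold_validation(X, y, features, clf, lossfunction, n_splits=5, seed=0):
    """Average the loss over n_splits shuffled folds (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # shuffled row positions
    folds = np.array_split(order, n_splits)  # near-equal sized folds
    total = 0.0
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        X_train = X.iloc[train_idx][features]
        X_test = X.iloc[test_idx][features]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        clf.fit(X_train, y_train)
        # score on the probability of the positive class, as in the demo
        total += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
    return total / n_splits
```

Because the extra parameters have defaults, such a function could be passed to sf.run in place of validation.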
  • Define the cross method (required when Cross = True)
def add(x,y):
    return x + y

def substract(x,y):
    return x - y

def times(x,y):
    return x * y

def divide(x,y):
    return (x + 0.001)/(y + 0.001)

CrossMethod = {'+':add,
               '-':substract,
               '*':times,
               '/':divide,}
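
To show what such a dictionary makes possible, here is a minimal sketch that applies every operation to every pair of columns. It is an illustration only, not the library's internal procedure. The helper cross_features, the lowercase cross_method dictionary, and the column naming scheme are all assumptions.

```python
from itertools import combinations

import pandas as pd


def divide(x, y):
    # the small offsets avoid division by zero
    return (x + 0.001) / (y + 0.001)


cross_method = {'*': lambda x, y: x * y, '/': divide}


def cross_features(df, columns, methods):
    """Return a copy of df with one new column per column pair and operation."""
    out = df.copy()
    for left, right in combinations(columns, 2):
        for symbol, func in methods.items():
            # hypothetical naming scheme for the generated column
            out[f"({left}{symbol}{right})"] = func(df[left], df[right])
    return out


df = pd.DataFrame({'price': [1.0, 2.0], 'sales': [4.0, 0.0]})
crossed = cross_features(df, ['price', 'sales'], cross_method)
print(list(crossed.columns))
```

The second row has sales equal to 0, so a plain division would produce an infinite value there. The 0.001 offsets in divide keep the result finite.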
  • Initialize the searcher with a customized procedure (sequence + random + cross)
sf = FS.Select(Sequence = False, Random = True, Cross = False) #select the way you want to process searching
  • Import loss function
sf.ImportLossFunction(modelscore,direction = 'descend')
  • Import dataset
sf.ImportDF(prepareData(),label = 'is_trade')
  • Import cross method (required when Cross = True)
sf.ImportCrossMethod(CrossMethod)
  • Define non-trainable features
sf.InitialNonTrainableFeatures(['used','instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade'])
  • Define initial features' combination
sf.InitialFeatures(['item_category_list', 'item_price_level','item_sales_level','item_collected_level', 'item_pv_level'])
  • Generate the feature library; you can specify a keyword and a selection step
sf.GenerateCol(key = 'mean', step = 2) #can iterate different features set
  • Set maximum features quantity
sf.SetFeaturesLimit(40) #maximum number of features
  • Set maximum time limit (in minutes)
sf.SetTimeLimit(100) #maximum running time in minutes
  • Set the sample ratio of the total dataset. When samplemode equals 0, the same subset is used on every run; when samplemode equals 1, a different subset is drawn each time
sf.SetSample(0.1, samplemode = 0)
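
The difference between the two sample modes can be illustrated with pandas alone. The sketch below is an assumption about the behaviour described above, not the library's code. The helper draw_sample and its seed parameter are hypothetical.

```python
import pandas as pd


def draw_sample(df, ratio, samplemode=0, seed=0):
    """Hypothetical helper mirroring the description of SetSample."""
    if ratio >= 1:
        return df  # nothing to sample when the full dataset is requested
    # a fixed seed repeats the same subset; None draws a fresh one each call
    state = seed if samplemode == 0 else None
    return df.sample(frac=ratio, random_state=state)


df = pd.DataFrame({'x': range(100)})
first = draw_sample(df, 0.1, samplemode=0)
second = draw_sample(df, 0.1, samplemode=0)
print(first.index.equals(second.index))
```

With samplemode = 1 the two draws would usually differ, which adds noise to the scores but exposes the search to more of the data.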
  • Define algorithm
sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves = 6, n_estimators=5000, max_depth=3, learning_rate = 0.05, n_jobs=8)
  • Define log file name
sf.SetLogFile('record.log')
  • Run with the self-defined validation method
sf.run(validation)

See the complete code in demo.py

  • This code takes a while to run. You can stop it at any time and restart by passing the current best feature combination to sf.InitialFeatures()

This features selection method achieved

Algorithm details

Procedure

feature-selection's People

Contributors

duxuhao, jiaqiangbandongg
