gopy's Issues

Transfer Learning

Understanding the decision tree structure

The decision tree structure can be analysed to gain further insight into the relationship between the features and the target to predict. In this example, we show how to retrieve:

  • the binary tree structure;
  • the depth of each node and whether or not it’s a leaf;
  • the nodes that were reached by a sample using the decision_path method;
  • the leaf that was reached by a sample using the apply method;
  • the rules that were used to predict a sample;
  • the decision path shared by a group of samples.

Out:

The binary tree structure has 5 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 3] <= 0.800000011921 else to node 2.
        node=1 leaf node.
        node=2 test node: go to node 3 if X[:, 2] <= 4.94999980927 else to node 4.
                node=3 leaf node.
                node=4 leaf node.

Rules used to predict sample 0:
decision id node 4 : (X_test[0, -2] (= 5.1) > -2.0)

The following samples [0, 1] share the node [0 2] in the tree
It is 40.0 % of all nodes.
------
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_ which stores the entire
# tree structure and allows access to low-level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes; in that
# case the values for nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node
#

# Using those arrays, we can parse the tree structure:

n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold


# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

# First let's retrieve the decision path of each sample. The decision_path
# method returns the node indicator matrix: a non-zero element at position
# (i, j) indicates that sample i goes through node j.

node_indicator = estimator.decision_path(X_test)

# Similarly, we can also obtain the ids of the leaves reached by each sample.

leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's do it for a single sample.

sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    if leave_id[sample_id] != node_id:
        continue

    if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
        threshold_sign = "<="
    else:
        threshold_sign = ">"

    print("decision id node %s : (X_test[%s, %s] (= %s) %s %s)"
          % (node_id,
             sample_id,
             feature[node_id],
             X_test[sample_id, feature[node_id]],
             threshold_sign,
             threshold[node_id]))

# For a group of samples, we can find the nodes that all of them pass through.
sample_ids = [0, 1]
common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                len(sample_ids))

common_node_id = np.arange(n_nodes)[common_nodes]

print("\nThe following samples %s share the node %s in the tree"
      % (sample_ids, common_node_id))
print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))

Ref

Analysis code for Kobe Bryant shot data

This code analyses Kobe Bryant's shot data; the author's data-visualization steps are omitted.
It moves smoothly from data processing and cleaning, through feature selection, to model selection.

    import warnings
    warnings.filterwarnings('ignore')
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.cross_validation import KFold, cross_val_score
    from sklearn.metrics import make_scorer
    from sklearn.grid_search import GridSearchCV
    from sklearn.feature_selection import VarianceThreshold, RFE, SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.ensemble import (BaggingClassifier,
                                  ExtraTreesClassifier,
                                  GradientBoostingClassifier,
                                  VotingClassifier,
                                  RandomForestClassifier,
                                  AdaBoostClassifier)
    ################################### Data Processing ################################
    ######################################
    # Data preprocessing
    ######################################
    pd.set_option('display.max_columns', None)
    data = pd.read_csv('./data/data.csv')
    data.set_index('shot_id', inplace=True)
    data["action_type"] = data["action_type"].astype('object')
    data["combined_shot_type"] = data["combined_shot_type"].astype('category')
    data["game_event_id"] = data["game_event_id"].astype('category')
    data["game_id"] = data["game_id"].astype('category')
    data["period"] = data["period"].astype('object')
    data["playoffs"] = data["playoffs"].astype('category')
    data["season"] = data["season"].astype('category')
    data["shot_made_flag"] = data["shot_made_flag"].astype('category')
    data["shot_type"] = data["shot_type"].astype('category')
    data["team_id"] = data["team_id"].astype('category')
    unknown_mask = data['shot_made_flag'].isnull()
    data_cl = data.copy() # create a copy of data frame
    target = data_cl['shot_made_flag'].copy()
    # Remove some columns
    data_cl.drop('team_id', axis=1, inplace=True) # Always one number
    data_cl.drop('lat', axis=1, inplace=True) # Correlated with loc_x
    data_cl.drop('lon', axis=1, inplace=True) # Correlated with loc_y
    data_cl.drop('game_id', axis=1, inplace=True) # Independent
    data_cl.drop('game_event_id', axis=1, inplace=True) # Independent
    data_cl.drop('team_name', axis=1, inplace=True) # Always LA Lakers
    data_cl.drop('shot_made_flag', axis=1, inplace=True)
    data_cl['seconds_from_period_end'] = 60 * data_cl['minutes_remaining'] + data_cl['seconds_remaining']
    data_cl['last_5_sec_in_period'] = data_cl['seconds_from_period_end'] < 5
    data_cl.drop('minutes_remaining', axis=1, inplace=True)
    data_cl.drop('seconds_remaining', axis=1, inplace=True)
    data_cl.drop('seconds_from_period_end', axis=1, inplace=True)
    ## Matchup - (away/home)
    data_cl['home_play'] = data_cl['matchup'].str.contains('vs').astype('int')
    data_cl.drop('matchup', axis=1, inplace=True)
    # Game date
    data_cl['game_date'] = pd.to_datetime(data_cl['game_date'])
    data_cl['game_year'] = data_cl['game_date'].dt.year
    data_cl['game_month'] = data_cl['game_date'].dt.month
    data_cl.drop('game_date', axis=1, inplace=True)
    # Loc_x, and loc_y binning
    data_cl['loc_x'] = pd.cut(data_cl['loc_x'], 25)
    data_cl['loc_y'] = pd.cut(data_cl['loc_y'], 25)
    # Replace 20 least common action types with value 'Other'
    rare_action_types = data_cl['action_type'].value_counts().sort_values().index.values[:20]
    data_cl.loc[data_cl['action_type'].isin(rare_action_types), 'action_type'] = 'Other'
    categorial_cols = [
        'action_type', 'combined_shot_type', 'period', 'season', 'shot_type',
        'shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'game_year',
        'game_month', 'opponent', 'loc_x', 'loc_y']
    for cc in categorial_cols:
        dummies = pd.get_dummies(data_cl[cc])
        dummies = dummies.add_prefix("{}#".format(cc))
        data_cl.drop(cc, axis=1, inplace=True)
        data_cl = data_cl.join(dummies)
    # Outlier detection method
    def detect_outliers(series, whis=1.5):
        q75, q25 = np.percentile(series, [75 ,25])
        iqr = q75 - q25
        return ~((series - series.median()).abs() <= (whis * iqr))
    # Separate dataset for validation
    data_submit = data_cl[unknown_mask]
    # Training data
    X = data_cl[~unknown_mask]
    Y = target[~unknown_mask]
    ################################### Feature Selection ###################
    #################################
    # Variance threshold, then RandomForestClassifier feature importance
    ###############################
    threshold = 0.90
    vt = VarianceThreshold().fit(X)
    feat_var_threshold = data_cl.columns[vt.variances_ > threshold * (1-threshold)]
    feat_var_threshold
    model = RandomForestClassifier()
    model.fit(X, Y)
    feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])
    feat_imp_20 = feature_imp.sort_values("importance", ascending=False).head(20).index
    #################################
    # Univariate feature selection
    #################################
    X_minmax = MinMaxScaler(feature_range=(0,1)).fit_transform(X)
    X_scored = SelectKBest(score_func=chi2, k='all').fit(X_minmax, Y)
    feature_scoring = pd.DataFrame({
            'feature': X.columns,
            'score': X_scored.scores_
        })
    feat_scored_20 = feature_scoring.sort_values('score', ascending=False).head(20)['feature'].values
    feat_scored_20
    #################################
    # Recursive Feature Elimination
    #################################
    rfe = RFE(LogisticRegression(), 20)
    rfe.fit(X, Y)
    feature_rfe_scoring = pd.DataFrame({
            'feature': X.columns,
            'score': rfe.ranking_
        })
    feat_rfe_20 = feature_rfe_scoring[feature_rfe_scoring['score'] == 1]['feature'].values
    feat_rfe_20
    ###############################
    # Combine the results of all feature-selection methods
    ################################
    features = np.hstack([
            feat_var_threshold,
            feat_imp_20,
            feat_scored_20,
            feat_rfe_20
        ])
    features = np.unique(features)
    print('Final features set:\n')
    for f in features:
        print("\t-{}".format(f))
    ################################
    # Clean data (keep only the selected features)
    ###############################
    data_cl = data_cl.ix[:, features]
    data_submit = data_submit.ix[:, features]
    X = X.ix[:, features]
    print('Clean dataset shape: {}'.format(data_cl.shape))
    print('Submittable dataset shape: {}'.format(data_submit.shape))
    print('Train features shape: {}'.format(X.shape))
    print('Target label shape: {}'. format(Y.shape))
    #################################
    # PCA
    #################################
    components = 8
    pca = PCA(n_components=components).fit(X)
    pca_variance_explained_df = pd.DataFrame({
        "component": np.arange(1, components+1),
        "variance_explained": pca.explained_variance_ratio_
        })
    ax = sns.barplot(x='component', y='variance_explained', data=pca_variance_explained_df)
    ax.set_title("PCA - Variance explained")
    plt.show()
    ###################################
    # Evaluation setup
    ###################################
    seed = 7
    processors=1
    num_folds=3
    num_instances=len(X)
    scoring='log_loss'
    kfold = KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    ################################# Model Selection ##############################
    #################################
    # Common baseline models
    #################################
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('K-NN', KNeighborsClassifier(n_neighbors=5)))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    #models.append(('SVC', SVC(probability=True)))
    # Evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
        results.append(cv_results)
        names.append(name)
        print("{0}: ({1:.3f}) +/- ({2:.3f})".format(name, cv_results.mean(), cv_results.std()))
    ##################################
    # Bootstrap Aggregation
    ###################################
    cart = DecisionTreeClassifier()
    num_trees = 100
    model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)
    results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    #####################################
    # Random Forest
    #####################################
    num_trees = 100
    num_features = 10
    model = RandomForestClassifier(n_estimators=num_trees, max_features=num_features)
    results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    #####################################
    # extra tree
    #######################################
    num_trees = 100
    num_features = 10
    model = ExtraTreesClassifier(n_estimators=num_trees, max_features=num_features)
    results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    #######################################
    # AdaBoost
    ######################################
    model = AdaBoostClassifier(n_estimators=100, random_state=seed)
    results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    ############################################
    # Stochastic Gradient Boosting
    #############################################
    model = GradientBoostingClassifier(n_estimators=100, random_state=seed)
    results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    ############################ Parameter Search ################################
    #####################################
    # Logistic Regression parameter search
    ######################################
    lr_grid = GridSearchCV(
        estimator = LogisticRegression(random_state=seed),
        param_grid = {
            'penalty': ['l1', 'l2'],
            'C': [0.001, 0.01, 1, 10, 100, 1000]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    lr_grid.fit(X, Y)
    print(lr_grid.best_score_)
    print(lr_grid.best_params_)
    #########################################
    # LinearDiscriminant
    ########################################
    lda_grid = GridSearchCV(
        estimator = LinearDiscriminantAnalysis(),
        param_grid = {
            'solver': ['lsqr'],
            'shrinkage': [0, 0.25, 0.5, 0.75, 1],
            'n_components': [None, 2, 5, 10]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    lda_grid.fit(X, Y)
    print(lda_grid.best_score_)
    print(lda_grid.best_params_)
    #######################################
    # KNN
    ############################################
    knn_grid = GridSearchCV(
        estimator = Pipeline([
            ('min_max_scaler', MinMaxScaler()),
            ('knn', KNeighborsClassifier())
        ]),
        param_grid = {
            'knn__n_neighbors': [25],
            'knn__algorithm': ['ball_tree'],
            'knn__leaf_size': [2, 3, 4],
            'knn__p': [1]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    knn_grid.fit(X, Y)
    print(knn_grid.best_score_)
    print(knn_grid.best_params_)
    ###############################################
    # Random Forest parameter search
    ##############################################
    rf_grid = GridSearchCV(
        estimator = RandomForestClassifier(warm_start=True, random_state=seed),
        param_grid = {
            'n_estimators': [100, 200],
            'criterion': ['gini', 'entropy'],
            'max_features': [18, 20],
            'max_depth': [8, 10],
            'bootstrap': [True]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    rf_grid.fit(X, Y)
    print(rf_grid.best_score_)
    print(rf_grid.best_params_)
    ############################################
    # AdaBoost parameter search
    ##############################################
    ada_grid = GridSearchCV(
        estimator = AdaBoostClassifier(random_state=seed),
        param_grid = {
            'algorithm': ['SAMME', 'SAMME.R'],
            'n_estimators': [10, 25, 50],
            'learning_rate': [1e-3, 1e-2, 1e-1]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    ada_grid.fit(X, Y)
    print(ada_grid.best_score_)
    print(ada_grid.best_params_)
    #################################################
    # GradientBoosting  
    #################################################
    gbm_grid = GridSearchCV(
        estimator = GradientBoostingClassifier(warm_start=True, random_state=seed),
        param_grid = {
            'n_estimators': [100, 200],
            'max_depth': [2, 3, 4],
            'max_features': [10, 15, 20],
            'learning_rate': [1e-1, 1]
        },
        cv = kfold,
        scoring = scoring,
        n_jobs = processors)
    gbm_grid.fit(X, Y)
    print(gbm_grid.best_score_)
    print(gbm_grid.best_params_)
    #################################################
    # Combine the models selected above
    #################################################
    estimators = []
    estimators.append(('lr', LogisticRegression(penalty='l2', C=1)))
    estimators.append(('gbm', GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, max_features=15, warm_start=True, random_state=seed)))
    estimators.append(('rf', RandomForestClassifier(bootstrap=True, max_depth=8, n_estimators=200, max_features=20, criterion='entropy', random_state=seed)))
    estimators.append(('ada', AdaBoostClassifier(algorithm='SAMME.R', learning_rate=1e-2, n_estimators=10, random_state=seed)))
    # create the ensemble model
    ensemble = VotingClassifier(estimators, voting='soft', weights=[2,3,3,1])
    results = cross_val_score(ensemble, X, Y, cv=kfold, scoring=scoring,n_jobs=processors)
    print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
    ###############################################
    # Prediction
    ###############################################
    model = ensemble
    model.fit(X, Y)
    preds = model.predict_proba(data_submit)
    submission = pd.DataFrame()
    submission["shot_id"] = data_submit.index
    submission["shot_made_flag"]= preds[:,0]
    submission.to_csv("sub.csv",index=False)

how to extract the decision rules from scikit-learn decision-tree?

Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree
as a textual list? Something like: "if A > 0.4 then if B < 0.2 then if C > 0.8 then class = 'X'", etc.

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    """Print the fitted tree as Python-like if/else pseudocode."""
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)
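
For instance, the function could be called with the iris DecisionTreeClassifier fitted in the decision-tree example higher up this page (a hypothetical invocation; estimator and iris are the names used there):

tree_to_code(estimator, iris.feature_names)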
  • paulkernfeld

print "{}return {}".format(indent, tree_.value[node]) should be changed to print "{}return {}".format(indent, np.argmax(tree_.value[node][0])) for the function to return the class index.

  • soupault

ref

understanding the gradient boosting trees in a fitted model

Gradient Boosting learns a function that looks something like this:

F(X) = W1*T1(X) + W2*T2(X) + ... + Wi*Ti(X)

where Wi are weights and Ti are weak learners (decision trees).
I know how to extract the individual Ti (estimators_ property) from a fitted gradient boosting model in scikit-learn, but is there a way to extract the Wi?

  • dood
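
As far as I can tell, scikit-learn does not store per-tree weights as a separate attribute: every stage is scaled by the same learning_rate, so each Wi simply equals the learning rate (plus the initial prediction F0 from init_). A minimal sketch that checks this, using the regression case for simplicity (the dataset choice is arbitrary):

import numpy as np
from sklearn.datasets import load_diabetes  # any regression dataset works here
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)
gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                random_state=0).fit(X, y)

# F(X) = F0(X) + learning_rate * sum_i T_i(X): every W_i is the learning rate.
f0 = gbr.init_.predict(X).ravel()                      # initial prediction F0
tree_sum = sum(tree.predict(X) for tree in gbr.estimators_[:, 0])
manual = f0 + gbr.learning_rate * tree_sum

print(np.allclose(manual, gbr.predict(X)))             # True

For a classifier the same relationship should hold per class on the raw scores returned by decision_function rather than on predict.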

how to get the leaf node for every record in a data frame, for every tree in a gradient boosting classifier, and how to implement the method from the referenced paper

just reading this great paper and trying to implement this:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1- of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree ...

Anyone know how I can predict a bunch of rows and for each of those rows get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index really. Had a look at the source and I could not quickly see anything obvious. I can see that I need to iterate the trees and do something like this:

for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
        ...

The following function goes beyond identifying the selected leaf of each decision tree and implements the application described in the referenced paper: using the GBC for feature engineering.

def makeTreeBins(gbc, X):
    '''
    Takes in a GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row
    represents the set of terminal nodes that the record X[i] falls into across
    all estimators in the GBC.  Note, each tree produces at most 2^max_depth terminal nodes.
    I append a prefix to the terminal node id in each incremental estimator so that
    I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        prefix = (i + 2)*100 #Must be an integer
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)        
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))
    return nd_mat
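
To get all the way to the paper's 1-of-K coding, the per-tree leaf ids can be one-hot encoded and fed to a linear classifier. A hypothetical follow-up, reusing the gbc name from the question and assuming X_train / y_train / X_test / y_test splits exist (newer scikit-learn releases also provide gbc.apply(X), which returns the same leaf indices directly):

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

leaf_ids = makeTreeBins(gbc, X_train)          # shape (n_samples, n_estimators)
encoder = OneHotEncoder(handle_unknown='ignore').fit(leaf_ids)
linear_clf = LogisticRegression(max_iter=1000).fit(encoder.transform(leaf_ids), y_train)

# Score held-out data with the same encoding
print(linear_clf.score(encoder.transform(makeTreeBins(gbc, X_test)), y_test))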

Ref

Generate code for sklearn's GradientBoostingClassifier

I want to generate code (Python for now, but ultimately C) from a trained gradient boosted classifier (from sklearn). As far as I understand it, the model takes an initial predictor, and then adds predictions from sequentially trained regression trees (scaled by the learning factor). The chosen class is then the class with the highest output value.

This is the code I have so far:

import numpy as np

def recursep_gbm(left, right, threshold, features, node, depth, value, out_name, scale):
    # Functions for spacing
    tabs = lambda n: (' ' * n * 4)[:-1]
    def print_depth():
        if depth: print(tabs(depth), end=' ')
    def print_depth_b():
        if depth:
            print(tabs(depth), end=' ')
            if (depth-1): print(tabs(depth-1), end=' ')

    if (threshold[node] != -2):
        print_depth()
        print("if " + features[node] + " <= " + str(threshold[node]) + ":")
        if left[node] != -1:
            recursep_gbm(left, right, threshold, features, left[node], depth+1, value, out_name, scale)
        print_depth()
        print("else:")
        if right[node] != -1:
            recursep_gbm(left, right, threshold, features, right[node], depth+1, value, out_name, scale)
    else:
        # This is an end node, add results
        print_depth()
        print(out_name + " += " + str(scale) + " * " + str(value[node][0, 0]))

def print_GBM_python(gbm_model, feature_names, X_data, l_rate):
    print("PYTHON CODE")

    # Get trees
    trees = gbm_model.estimators_

    # F0
    f0_probs = np.mean(gbm_model.predict_log_proba(X_data), axis=0)
    probs    = ", ".join([str(prob) for prob in f0_probs])
    print("# Initial probabilities (F0)")
    print("scores = np.array([%s])" % probs)
    print()

    print("# Update scores for each estimator")
    for j, tree_group in enumerate(trees):
        for k, tree in enumerate(tree_group):
            left      = tree.tree_.children_left
            right     = tree.tree_.children_right
            threshold = tree.tree_.threshold
            features  = [feature_names[i] for i in tree.tree_.feature]
            value = tree.tree_.value

            recursep_gbm(left, right, threshold, features, 0, 0, value, "scores[%i]" % k, l_rate)
        print()

    print("# Get class with max score")
    print("return np.argmax(scores)")

This is an example of what it generates (with 3 classes, 2 estimators, 1 max depth and 0.1 learning rate):

# Initial probabilities (F0)
scores = np.array([-0.964890, -1.238279, -1.170222])

# Update scores for each estimator
if X1 <= 57.5:
    scores[0] += 0.1 * 1.60943587225
else:
    scores[0] += 0.1 * -0.908433703247
if X2 <= 0.000394500006223:
    scores[1] += 0.1 * -0.900203054177
else:
    scores[1] += 0.1 * 0.221484425933
if X2 <= 0.0340005010366:
    scores[2] += 0.1 * -0.848148803219
else:
    scores[2] += 0.1 * 1.98100820717

if X1 <= 57.5:
    scores[0] += 0.1 * 1.38506104792
else:
    scores[0] += 0.1 * -0.855930587354
if X1 <= 43.5:
    scores[1] += 0.1 * -0.810729087535
else:
    scores[1] += 0.1 * 0.237980820334
if X2 <= 0.027434501797:
    scores[2] += 0.1 * -0.815242297324
else:
    scores[2] += 0.1 * 1.69970863021

# Get class with max score
return np.argmax(scores)

I used the log probability as F0, based on this.

For one estimator it gives me the same predictions as the predict method on the trained model. However, when I add more estimators the predictions start to deviate. Am I supposed to incorporate the step length (described here)? Also, is my F0 correct? Should I be taking the mean? And should I convert the log-probabilities to something else?

  • Pokey McPokerson
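
A hedged way to sanity-check the F0 guess: scikit-learn exposes the raw additive scores through decision_function, so the constant initial score can be recovered by subtracting the hand-rebuilt tree part (gbm_model and X_data are the names from the question; a multiclass fit like the 3-class example is assumed):

import numpy as np

raw = gbm_model.decision_function(X_data)          # per-class raw scores F(X)

# Rebuild the tree part by hand: stage j holds one regression tree per class k.
tree_part = np.zeros_like(raw)
for stage in gbm_model.estimators_:                # shape (n_estimators, n_classes)
    for k, tree in enumerate(stage):
        tree_part[:, k] += gbm_model.learning_rate * tree.predict(X_data)

f0 = raw - tree_part
print(f0[:3])   # identical rows: F0 is a per-class constant, not sample-dependent

If those constant rows differ from the mean of predict_log_proba used above, that alone could explain the drift as more estimators are added; and since argmax over the raw scores already picks the predicted class, no conversion of the log-probabilities should be needed for classification.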
