
deepgbm's People

Contributors

guolinke, motefly


deepgbm's Issues

A question about the PyTorch implementation

For work I recently ported the PyTorch version of DeepGBM to TensorFlow, following the source code exactly. Testing on the Criteo dataset (only a tenth of it) with GPU training, I found that the PyTorch source keeps both GPU and CPU utilization very high, while the TensorFlow port's CPU and GPU utilization is much lower. As a result, with the same parameters and the same GPU, PyTorch is about 6x faster. I also noticed TensorFlow-related comments in the source, e.g. around tf.summary(), so the authors seem familiar with TensorFlow as well. May I ask why you chose PyTorch over TensorFlow at the time?

Feature importance and SHAP

Can the model output feature importances?
Can the model use Shapley Additive Explanations (SHAP) for interpretability analysis?
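
For what it's worth, since DeepGBM distills a LightGBM model, one option is to inspect the underlying GBDT component; the distilled NN itself has no built-in importance measure. A minimal standalone sketch, assuming you keep a handle to the trained `lightgbm.Booster`:

```python
import numpy as np
import lightgbm as lgb
import shap  # pip install shap

# Stand-in data and a tiny booster; in practice use the GBDT that DeepGBM distills.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)
booster = lgb.train({'objective': 'binary', 'verbose': -1},
                    lgb.Dataset(X, y), num_boost_round=20)

print(booster.feature_importance(importance_type='gain'))   # gain-based importance
shap_values = shap.TreeExplainer(booster).shap_values(X)    # per-sample SHAP values
```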

A question about split_gain

Hello,
While reading your code I noticed a possible issue: the line `self.gain = getItemByTree(self, 'split_gain')` should fetch the information gain of each split node, but there appears to be no corresponding handling inside `getFeature` within `getItemByTree`:
```python
def getItemByTree(tree, item='split_feature'):
    root = tree.raw['tree_structure']
    split_nodes = tree.split_nodes
    res = np.zeros(split_nodes + tree.raw['num_leaves'], dtype=np.int32)
    if 'value' in item or 'threshold' in item or 'split_gain' in item:
        res = res.astype(np.float64)
    def getFeature(root, res):
        if 'child' in item:
            if 'split_index' in root:
                node = root[item]
                if 'split_index' in node:
                    res[root['split_index']] = node['split_index']
                else:
                    res[root['split_index']] = node['leaf_index'] + split_nodes  # need to check
            else:
                res[root['leaf_index'] + split_nodes] = -1
        elif 'value' in item:
            if 'split_index' in root:
                res[root['split_index']] = root['internal_' + item]
            else:
                res[root['leaf_index'] + split_nodes] = root['leaf_' + item]
        else:
            if 'split_index' in root:
                res[root['split_index']] = root[item]
            else:
                res[root['leaf_index'] + split_nodes] = -2
        if 'left_child' in root:
            getFeature(root['left_child'], res)
        if 'right_child' in root:
            getFeature(root['right_child'], res)
    getFeature(root, res)
    return res
```
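
For context, `split_gain` appears to be picked up by the generic final `else` branch (`res[root['split_index']] = root[item]`), since the trees in LightGBM's `dump_model()` output store a `split_gain` key on every internal node. A standalone sketch to verify this (not the repo's code):

```python
import numpy as np
import lightgbm as lgb

# Train a one-tree booster on stand-in data and inspect its dumped structure.
X = np.random.rand(200, 5)
y = np.random.randint(0, 2, 200)
booster = lgb.train({'objective': 'binary', 'verbose': -1},
                    lgb.Dataset(X, y), num_boost_round=1)

tree = booster.dump_model()['tree_info'][0]['tree_structure']
print(tree['split_gain'])                   # gain stored on the root (an internal node)
print('split_gain' in tree['left_child'])   # children repeat the same keys if internal
```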

Two issues in the code

Hi! Two issues in the code:
1. In the if statement at lines 36-40 of models/deepgbm.py, self.gbdtnn is initialized depending on the value of the num_model parameter. In the else branch at line 62 of the forward function, self.gbdtnn is None at that point, so line 63 should raise an error. This generally doesn't matter much, though.
2. What I really want to ask is this: at line 67 of deepgbm.py, should the != be ==? According to the paper, when num_model == 'gbdt2nn', shouldn't the model output combine the gbdt2nn and deepfm parts? Why does the branch at line 70 ignore the deepfm output?
Thanks!

Cannot find module 'category_encoders'

In preprocess/encoding_cate.py, the line

import category_encoders as ce

fails with:

No module named 'category_encoders'
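
(The package is published on PyPI, so `pip install category_encoders` in your environment should resolve the import error.)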

how to modify the network for multi-classification?

I'm implementing a version for a multi-class classification task and am not sure what to change.
Is it right to change the BatchDense part's out_features parameter from 1 in the original code to n_classes in my case?
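
For illustration, a minimal sketch of the usual PyTorch pattern for widening a one-output head to n_classes; `head` here is a generic stand-in, not the repo's actual BatchDense module:

```python
import torch
import torch.nn as nn

n_features, n_classes = 32, 5

# Generic stand-in for the final dense layer: out_features goes from 1 to n_classes.
head = nn.Linear(n_features, n_classes)

# Multi-class also needs the binary loss swapped for cross-entropy, which expects
# raw logits of shape (batch, n_classes) and integer class targets.
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, n_features)
targets = torch.randint(0, n_classes, (8,))
loss = criterion(head(x), targets)
```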

Leaf Embedding Model Performance

Hi, I am trying to replicate this work on my own dataset, a corpus of about 0.2 million samples, and I trained the GBDT2NN model for a 5-category classification task. I found that the leaf embedding model does not perform well on the test set (accuracy is far lower than GBDT), and as a result the GBDT2NN model distilled from this pretrained leaf embedding performs even worse (a 30-40% drop in accuracy).

Since the paper does not present any evaluation of this leaf embedding learning, I wonder if you could clarify a few things:

  1. The paper presents results only for binary classification and regression; the number of trees will be much larger for multi-category classification tasks. Will a large tree count hurt performance? Since the initial leaf embedding size is [n_clusters, max_leaves, num_classes, leaf_emb_size], could a larger number of trees lead to too much variance in the embedding?

  2. Also, if the trees are deeper, leading to very complex tree structures, what adjustments would you recommend for tuning the model for better performance?

  3. For all your datasets, the leaf embedding model trains for at most 10 epochs (often at most 2). Did all of these models outperform their GBDT counterparts by only learning to predict leaf values? Did these models actually converge in so few epochs? Or does more training actually hurt performance (this happens in my experiments)?

Dataset About Flight

Hello,

I have downloaded the flight data from http://stat-computing.org/dataexpo/2009/the-data.html.
I find that this dataset has 29 fields, but only 12 fields are used in your paper. I don't know which specific fields were used; could you share the details of how this dataset should be used?

Thanks.

Model saving and data prediction

Hello,
I've noticed that the train_DEEPGBM method returns three variables,
return deepgbm_model, opt, metric
but the main function doesn't handle these outputs:

```python
elif args.model == "deepgbm":
    num_data = dh.load_data(args.data + '_num')
    cate_data = dh.load_data(args.data + '_cate')
    # designed for faster cateNN
    cate_data = dh.trans_cate_data(cate_data)
    train_DEEPGBM(args, num_data, cate_data, plot_title)
```

Could you tell me how to save the model for inference after training (a sketch of the usual pattern follows below)?
I would also like to know how to preprocess new data and feed it to the saved model.

Thank you very much!
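
For reference, the standard PyTorch pattern for persisting and reloading a model looks roughly like this, under the assumption that train_DEEPGBM's first return value is an ordinary nn.Module; the constructor arguments below are hypothetical placeholders, not the repo's exact API:

```python
import torch

# After training: keep the returned model and persist its parameters.
deepgbm_model, opt, metric = train_DEEPGBM(args, num_data, cate_data, plot_title)
torch.save(deepgbm_model.state_dict(), 'deepgbm.pt')

# At inference time: rebuild the model with the same constructor arguments
# (hypothetical placeholders here), load the weights, and switch to eval mode.
model = DeepGBM(...)  # assumption: same arguments as used during training
model.load_state_dict(torch.load('deepgbm.pt'))
model.eval()
with torch.no_grad():
    preds = model(inputs)  # inputs preprocessed the same way as the training data
```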

Add a .cpu() call for the variable 'outputs'

```
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main()
  File "main.py", line 83, in main
    train_cateModels(args, cate_data, plot_title, key="")
  File "C:\Users\ying\Desktop\DeepGBM\experiments\train_models.py", line 116, in train_cateModels
    opt, args.max_epoch, args.batch_size, 1, key)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 168, in TrainWithLog
    test_loss, pred_y = EvalTestset(test_x, test_y, model, args.test_batch_size, test_x_opt)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 89, in EvalTestset
    return sum_loss / test_len, np.concatenate(y_preds, 0)
  File "C:\Users\ying\Miniconda3\lib\site-packages\torch\tensor.py", line 458, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

Solution:

```python
import math
import numpy as np
import torch
# `device` and `model.true_loss` are assumed to be defined as in the repo.

def EvalTestset(test_x, test_y, model, test_batch_size, test_x_opt=None):
    test_len = test_x.shape[0]
    test_num_batch = math.ceil(test_len / test_batch_size)
    sum_loss = 0.0
    y_preds = []
    model.eval()
    with torch.no_grad():
        for jdx in range(test_num_batch):
            tst_st = jdx * test_batch_size
            tst_ed = min(test_len, tst_st + test_batch_size)
            inputs = torch.from_numpy(test_x[tst_st:tst_ed].astype(np.float32)).to(device)
            if test_x_opt is not None:
                inputs_opt = torch.from_numpy(test_x_opt[tst_st:tst_ed].astype(np.float32)).to(device)
                outputs = model(inputs, inputs_opt)
            else:
                outputs = model(inputs)
            targets = torch.from_numpy(test_y[tst_st:tst_ed]).to(device)
            if isinstance(outputs, tuple):
                outputs = outputs[0]
            # Fix: move the CUDA tensor to host memory before np.concatenate.
            y_preds.append(outputs.cpu())  # was: y_preds.append(outputs)
            loss_tst = model.true_loss(outputs, targets).item()
            sum_loss += (tst_ed - tst_st) * loss_tst
    return sum_loss / test_len, np.concatenate(y_preds, 0)
```


A question about the test set

It seems the test set is first used for early stopping while training the GBDT, so it effectively serves as a validation set; yet it is also used as the test set when training the whole DeepGBM. Shouldn't these be two different datasets?
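
For comparison, the conventional setup keeps the early-stopping data and the final evaluation data separate; a minimal sketch with stand-in data (hypothetical variable names, using scikit-learn):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)  # stand-in data

# Three-way split: `valid` drives early stopping, `test` is only touched once.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```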

The predicted y output for the nips_A dataset is not binary

I tried to replicate the experiment using the nipsA_deepgbm_offline script, but when I print the predictions (i.e., pred_y), the predicted values are real numbers, as follows:

```
pred_y: [0.34316275 0.36292914 0.3552964  0.35970888 0.35918367 0.3515028
 0.36127064 0.34485954 0.37243855 0.3597494  0.37358335 0.35961217
 0.3446463  0.3496978  0.3566822  0.35322982 0.36830378 0.37577894
 0.3774116  0.361733   0.34249693 0.36554664 0.36565682 0.3529494
 0.3586624  0.35545474 0.35036924 0.37629476 0.36555907 0.37944567
 0.37318298 0.37047318 0.36648336 0.3657822  0.36535558 0.37917492
 0.3766928  0.37294027 0.365851   0.36476082 0.3654274  0.37421528
 0.35136348 0.3707069  0.37431964 0.38356563 0.35120273 0.37153876
 0.3950685  0.3748232  0.36651853 0.35673445 0.36918446 0.36931384
 0.36340126 0.3641296  0.38208184 0.3779632  0.36068133 0.34996226
 0.34128729 0.3601104  0.35272548 0.34229857 0.35786942 0.352486
 0.34367353 0.34292746 0.3950783  0.36609793 0.3616757  0.38065642
```

Isn't the predicted output supposed to be 0 or 1? Can you please advise?
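
For what it's worth, binary classifiers typically emit scores or probabilities, and hard labels come from thresholding. A minimal sketch, assuming pred_y holds probabilities of the positive class:

```python
import numpy as np

pred_y = np.array([0.34316275, 0.36292914, 0.3552964])  # first few scores from above
labels = (pred_y > 0.5).astype(int)                     # hard 0/1 labels via a 0.5 threshold
```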

A question about online learning and prediction for binary classification

When I run binary-classification online training and prediction, I get the following error:
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert
triggered
It indicates that when BCELoss is used, the tensor values fall outside [0, 1]. I searched online for many fixes and modified both the outputs in the TrainWithLog function in https://github.com/motefly/DeepGBM/blob/master/experiments/helper.py and the out in the true_loss function in https://github.com/motefly/DeepGBM/blob/master/experiments/models/components.py, but I still get the same error. Could you tell me what is actually going wrong?
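
One common remedy for this class of error (a generic sketch, not the repo's exact fix) is to either squash the model output into (0, 1) before BCELoss, or switch to BCEWithLogitsLoss, which applies the sigmoid internally and is numerically safer:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1)                      # raw model outputs, unbounded
targets = torch.randint(0, 2, (8, 1)).float()   # 0/1 labels as floats

# Option 1: squash into (0, 1) before BCELoss.
loss1 = nn.BCELoss()(torch.sigmoid(logits), targets)

# Option 2 (usually preferred): BCEWithLogitsLoss takes raw logits directly.
loss2 = nn.BCEWithLogitsLoss()(logits, targets)
```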

max_ntree_per_split is 0

It seems max_ntree_per_split is not properly initialized; it is always 0.

```
Traceback (most recent call last):
  File "main.py", line 102, in <module>
    main()
  File "main.py", line 94, in main
    train_DEEPGBM(args, num_data, cate_data, plot_title, key="")
  File "/Users/binrong/Desktop/Code/DeepGBM/experiments/train_models.py", line 165, in train_DEEPGBM
    emb_model = EmbeddingModel(n_models, max_ntree_per_split, args.embsize, args.maxleaf+1, n_output, group_average, task=args.task).to(device)
  File "/Users/binrong/Desktop/Code/DeepGBM/experiments/models/components.py", line 230, in __init__
    stdv = math.sqrt(1.0 /(max_ntree_per_split))
ZeroDivisionError: float division by zero
```
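
Until the root cause is found, a defensive check makes the failure easier to diagnose; a generic sketch around the line from the traceback (the stand-in value and message are hypothetical, not the repo's code):

```python
import math

max_ntree_per_split = 0  # stand-in for the value computed in train_DEEPGBM

# Hypothetical guard: fail with a descriptive message instead of a bare
# ZeroDivisionError when the tree grouping produces empty groups.
if max_ntree_per_split <= 0:
    raise ValueError(
        f"max_ntree_per_split={max_ntree_per_split}; expected > 0. "
        "Check that the number of boosted trees is at least the number of tree groups.")
stdv = math.sqrt(1.0 / max_ntree_per_split)
```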

catboost2nn

If I wanted to implement catboost2nn, how large would the code changes be? Would it mostly be a matter of replacing the GBDT-related parts with CatBoost? (I'm a beginner student; any guidance would be appreciated.)
