lamda-nju / deep-forest
An Efficient, Scalable and Optimized Python Framework for Deep Forest (2021.2.1)
Home Page: https://deep-forest.readthedocs.io
License: Other
The latest version of deep forest appears to have a bug in its "sklearn" backend. I don't know the exact cause, but the package seems to transform the data inconsistently, so the feature dimensions differ between the fitting and prediction steps:
ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 13
The forest model excels at the regression problem, and tends to have a much smaller variance than GBDTs. Therefore, it would be nice if deep forest could further support univariate and multivariate regression.
Related issue
#3
Possible steps on this feature request:
sklearn.RandomForestRegressor to sklearn.ExtraTreesRegressor in estimator.py
layer.py
CascadeForestRegressor in cascade.py
CascadeForestRegressor
This issue collects all feature requests. Anyone is welcome to work on the issues listed below; do not forget to add your contribution and name to CHANGELOG.rst.
If you want to work on a requested feature, please re-open the linked issue, and leave a comment below to let us know that you want to work on it.
CascadeForestRegressor class for regression problem (#4)
export_graphviz method on visualizing decision trees in deep forest (#12)
CascadeForestSurvAnalyzer class for survival analysis (#71)
Hello, I am a new user of deep-forest.
I've read the intro slide. Could you tell me whether the sub-partition method of distributed representation learning has been implemented yet?
Thank you so much.
@xuyxu
mod=model.get_layer_feature_importances(layer_idx=0)
print(mod)
RuntimeError: Please use the sklearn backend to get the feature importances property for each cascade layer.
I want to get the layer feature importances, but I do not know how.
Can DF produce a feature importance ranking the way RF does?
How does DF perform feature selection?
When I was using this package, I experienced the following problem. According to my observation, there is still a lot of available memory. Thus, what's the problem?
File "deepforest/tree/_tree.pyx", line 123, in deepforest.tree._tree.DepthFirstTreeBuilder.build
File "deepforest/tree/_tree.pyx", line 256, in deepforest.tree._tree.DepthFirstTreeBuilder.build
File "deepforest/tree/_tree.pyx", line 480, in deepforest.tree._tree.Tree._resize_node_c
File "deepforest/tree/_utils.pyx", line 34, in deepforest.tree._utils.safe_realloc
MemoryError: could not allocate 0 bytes
I am glad to see that such a method has achieved promising results on many machine learning tasks. However, in real-world scenarios, we often tune the hyper-parameters of a classifier with cross-validation. Currently, I am constructing a machine learning benchmark, and I believe that a proper parameter grid is vital for comparing different algorithms fairly. Could you provide a recommended parameter grid for deep-forest, or at least a guideline for tuning its hyper-parameters?
I have a question: if I want to build completely-random tree forests, should I use ExtraTreesClassifier, and how should I set max_features?
Thanks for your help
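For reference, the scikit-learn documentation notes that setting max_features to 1 in an extra-tree amounts to building a totally random tree. The sketch below shows that setting with plain scikit-learn on synthetic data. It is only an illustration of the scikit-learn side; whether and how deep-forest's own forests expose the same option is not covered here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_features=1: a single randomly chosen feature is considered at each
# split, and extra-trees also draw the split threshold at random.
forest = ExtraTreesClassifier(n_estimators=50, max_features=1, random_state=0)
forest.fit(X, y)

proba = forest.predict_proba(X)
print(proba.shape)
```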
ERROR: Could not find a version that satisfies the requirement deep-forest (from versions: none)
ERROR: No matching distribution found for deep-forest
system: mac
python version: 3.8.5
pip version: 20.2.4
Installed in Anaconda, with the required library versions confirmed to be correct. Running the example gives "numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject". Why does this happen?
Hi. I want to set the size of the sliding window, but I did not find the code of multi-grained scanning part in DF21. Does the function of multi-grained scanning part exist in DF21? If so, where can I find the corresponding file? Also, I wonder, do subcascades of each cascade still exist, e.g., Level 1A, Level 1B, Level 1C. Look forward to your help. Thank you!
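Is GPU supported now? Also, is sklearn's GridSearchCV supported?
I need to run a grid search, so I need to construct CascadeForestClassifier dynamically, which means passing a dictionary of parameters to CascadeForestClassifier. However, CascadeForestClassifier does not seem to have a function for this.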
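In Python, a dictionary can be passed to any constructor that accepts keyword arguments by unpacking it with `**`. A minimal sketch of a manual grid expansion follows. `StandInModel` and `expand_grid` are hypothetical names made up for this illustration, and the stand-in class only mimics an estimator whose constructor takes `n_estimators` and `n_trees`.

```python
from itertools import product


def expand_grid(grid):
    """Yield one parameter dict per combination in the grid."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


class StandInModel:
    """Hypothetical stand-in for an estimator with keyword parameters."""

    def __init__(self, n_estimators=2, n_trees=100):
        self.n_estimators = n_estimators
        self.n_trees = n_trees


grid = {"n_estimators": [2, 4], "n_trees": [50, 100]}

# Build one model per parameter combination via dictionary unpacking.
models = [StandInModel(**params) for params in expand_grid(grid)]
print(len(models))
```

The same `Model(**params)` pattern should apply to any class whose parameters are ordinary keyword arguments of `__init__`.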
The CascadeForestClassifier class cannot be used in sklearn's cross_val_score directly. Maybe it could inherit from BaseEstimator?
Something like this:
from deepforest import CascadeForestClassifier
from sklearn.base import BaseEstimator
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
X, y = load_breast_cancer(return_X_y=True)
class CFC(CascadeForestClassifier, BaseEstimator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def score(self, X, y):
        return accuracy_score(y, self.predict(X))
score = cross_val_score(CFC(random_state=10), X, y)
print(score)
Trying the basic tutorial:
X, y = load_digits(return_X_y=True)
y += 100 # This is what I changed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
# Train
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("Testing Accuracy: {:.3f} %".format(acc))
And get:
Testing Accuracy: 0.000 %
This is because the class labels become "101", "102", "103", ..., instead of "1", "2", "3", ....
The predict() function (or the model itself) cannot deal with these labels.
Is it possible to let CascadeForestClassifier.predict() return the original class labels (e.g., "101" instead of "1")? This is a basic feature of sklearn models. Alternatively, the documentation could explain how CascadeForestClassifier maps the original class labels to integers. Right now it is a bit confusing and not very convenient to use.
BTW, big fan of your work! :))) I've been waiting for it ever since it was published at IJCAI.
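Until the library handles arbitrary labels itself, one possible workaround is to encode the labels to 0..n_classes-1 before fitting and decode the predictions afterwards. The sketch below uses only NumPy. The array `y_pred_encoded` is made-up data standing in for a model's integer predictions, not output from deep-forest.

```python
import numpy as np

# Original labels that do not start at zero.
y_train = np.array([101, 103, 102, 101, 103])

# classes_ holds the sorted original labels; y_encoded holds the
# corresponding indices in the range 0..n_classes-1.
classes_, y_encoded = np.unique(y_train, return_inverse=True)

# Made-up integer predictions standing in for a model's predict() output.
y_pred_encoded = np.array([0, 2, 1])

# Map the predicted indices back to the original labels.
y_pred = classes_[y_pred_encoded]
print(y_pred)
```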
Does this implementation support multi-grained scanning?
Note: multi-grained scanning (M.G.S.) is the first part of gcForest.
Got error with code:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
TypeError Traceback (most recent call last)
in
6
7 model = CascadeForestClassifier(random_state=1)
----> 8 model.fit(X_train, y_train.values.ravel())
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in fit(self, X, y, sample_weight)
1395 y = self._encode_class_labels(y)
1396
-> 1397 super().fit(X, y, sample_weight)
1398
1399 def predict_proba(self, X):
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in fit(self, X, y, sample_weight)
754
755 # Bin the training data
--> 756 X_train_ = self.bin_data(binner, X, is_training_data=True)
757 X_train_ = self.buffer_.cache_data(0, X_train_, is_training_data=True)
758
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in _bin_data(self, binner, X, is_training_data)
665 tic = time.time()
666 if is_training_data:
--> 667 X_binned = binner.fit_transform(X)
668 else:
669 X_binned = binner.transform(X)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
697 if y is None:
698 # fit method of arity 1 (unsupervised transformation)
--> 699 return self.fit(X, **fit_params).transform(X)
700 else:
701 # fit method of arity 2 (supervised transformation)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/_binner.py in fit(self, X)
128 self.validate_params()
129
--> 130 self.bin_thresholds = _find_binning_thresholds(
131 X,
132 self.n_bins - 1,
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/_binner.py in _find_binning_thresholds(X, n_bins, bin_subsample, bin_type, random_state)
75 if n_samples > bin_subsample:
76 subset = rng.choice(np.arange(n_samples), bin_subsample, replace=False)
---> 77 X = X.take(subset, axis=0)
78
79 binning_thresholds = []
TypeError: take() got an unexpected keyword argument 'axis'
The dataset is loaded with vaex. Is this problem specific to vaex?
As stated in the documentation, the goal of this package is to:
Provide users with an effective and powerful option to traditional tree-based ensemble models such as random forest and gradient boosting decision tree.
To promote the use of deep forest, and to help this package become another popular option when you are considering tree-based ensemble models, we would like to call for user reports on using deep forest.
We are particularly interested in:
In a future release, we will set up another webpage in our documentation, and your contributions will be posted there. There is no strict limitation on the form of your contribution: it could be your winning solution in a competition, a link to your published articles, and more.
Please feel free to comment below or send me an e-mail if you are willing to share your achievements with us. Thanks!
Hi! I am learning deep forest and have a question about the sliding window. Say we have 400-dimensional raw input features and generate 301 instances of size 100. We train a forest on the first instance; when the second instance comes, is it fed into that same forest or used to train another forest? In other words, in the multi-grained scanning part, do we train 1 forest or 301 forests?
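The window arithmetic in this question can be checked with NumPy: a window of size 100 sliding over 400 features with stride 1 gives 400 - 100 + 1 = 301 instances per sample. The sketch below is not code from this package. On my reading of the gcForest paper (an assumption worth checking against the paper itself), the windows from all training samples are pooled into one training set per window size, which is what the final reshape illustrates.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_samples, n_features, window = 5, 400, 100
X = np.random.default_rng(0).normal(size=(n_samples, n_features))

# Slide a window of 100 features along the feature axis with stride 1.
windows = sliding_window_view(X, window_shape=window, axis=1)
print(windows.shape)  # (n_samples, 301, 100)

# Pool the windows of every sample into one set of 100-dim instances.
instances = windows.reshape(-1, window)
print(instances.shape)
```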
Documentation pages such as https://deep-forest.readthedocs.io/en/latest/how_to_get_started.html are not available.
Describe the bug
CascadeForestRegressor somehow cannot be inserted into a DataFrame.
To Reproduce
import pandas as pd
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor
ngr = NGBRegressor() # ngboost regressor for example. xgb, lgb should also be no problem.
cforest = CascadeForestRegressor()
df = pd.DataFrame()
# somehow OK
df.insert(0, "ngr", [ngr])
# somehow error
df.insert(0, "cf", [cforest])
Expected behavior
No error
Additional context
ValueError Traceback (most recent call last)
<ipython-input-32-ab0139d10254> in <module>
----> 1 df.insert(0, "cf", [cforest])
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
3760 )
3761 self._ensure_valid_index(value)
-> 3762 value = self._sanitize_column(column, value, broadcast=False)
3763 self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
3764
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3900 if not isinstance(value, (np.ndarray, Index)):
3901 if isinstance(value, list) and len(value) > 0:
-> 3902 value = maybe_convert_platform(value)
3903 else:
3904 value = com.asarray_tuplesafe(value)
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in maybe_convert_platform(values)
110 """ try to do platform conversion, allow ndarray or list here """
111 if isinstance(values, (list, tuple, range)):
--> 112 values = construct_1d_object_array_from_listlike(values)
113 if getattr(values, "dtype", None) == np.object_:
114 if hasattr(values, "_values"):
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in construct_1d_object_array_from_listlike(values)
1636 # making a 1D array that contains list-likes is a bit tricky:
1637 result = np.empty(len(values), dtype="object")
-> 1638 result[:] = values
1639 return result
1640
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
518
519 def __getitem__(self, index):
--> 520 return self._get_layer(index)
521
522 def _get_n_output(self, y):
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
561 logger.debug("self.n_layers_ = "+ str(self.n_layers_))
562 logger.debug("layer_idx = "+ str(layer_idx))
--> 563 raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
564
565 layer_key = "layer_{}".format(layer_idx)
ValueError: The layer index should be in the range [0, 1], but got 2 instead.
This bug can be simply fixed if we change
if not 0 <= layer_idx < self.n_layers_:
to
if not 0 <= layer_idx <= self.n_layers_:
but I still don't know the cause of this error or whether this fix is correct.
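One plausible cause, offered here as a hypothesis rather than a confirmed diagnosis: an object that defines `__len__` and `__getitem__` is treated as a sequence, and Python's sequence iteration protocol stops only on `IndexError`. If `__getitem__` raises `ValueError` for an out-of-range index, the error propagates instead. The sketch below demonstrates that protocol with a hypothetical stand-in class `Layers`; it is not the deep-forest code.

```python
class Layers:
    """Hypothetical stand-in container holding two layers."""

    def __init__(self, error_type):
        self._layers = ["layer_0", "layer_1"]
        self._error_type = error_type

    def __len__(self):
        return len(self._layers)

    def __getitem__(self, index):
        if not 0 <= index < len(self._layers):
            msg = "layer index out of range: {}".format(index)
            raise self._error_type(msg)
        return self._layers[index]


# Raising IndexError ends the iteration cleanly.
ok = list(Layers(IndexError))
print(ok)

# Raising ValueError escapes the iteration and reaches the caller.
try:
    list(Layers(ValueError))
    failed = False
except ValueError:
    failed = True
print(failed)
```

If this is indeed the cause, raising `IndexError` for out-of-range indices would be a more targeted change than widening the range check.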
I have a dataset with ~500 variables, some of which are boolean. I got this error when trying to fit the model:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
And when I fit the model without these boolean variables, it worked.
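This error message is what NumPy raises when `np.isnan` is applied to an object-dtype array, which is a common result of mixing boolean and numeric columns. A possible workaround, assuming every column can be represented as a number, is to cast the data to a float dtype before fitting. The sketch below uses made-up data to reproduce the error and show the cast.

```python
import numpy as np

# Mixed boolean and float values stored with object dtype (made-up data).
X_mixed = np.array([[1.5, True], [2.0, False]], dtype=object)

# np.isnan is not defined for object arrays and raises TypeError.
try:
    np.isnan(X_mixed)
    raised = False
except TypeError:
    raised = True
print(raised)

# Casting to float64 turns True/False into 1.0/0.0.
X_float = X_mixed.astype(np.float64)
print(np.isnan(X_float).any())
```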
Training on my dataset raises an error:
File "deepforest/_cutils.pyx", line 59, in deepforest._cutils._map_to_bins
File "deepforest/_cutils.pyx", line 76, in deepforest._cutils._map_to_bins
ValueError: Buffer dtype mismatch, expected 'const X_DTYPE_C' but got 'long'
Hi maintainer,
I am wondering whether it is possible to cascade random survival forests (maybe an sksurv model) instead of RF in your deep forest model. As in #48, it seems that the supported model types are classification and regression (or did I miss some part of the tutorial docs?).
Thanks.
Maybe we can add a formatter like black to the Makefile and GitHub Actions to make life easier for developers.
What do you think?
Or we need to extract feature with hog/sift etc...
When I fit the models, an error is raised:
Check failed: weights_.Size() == num_row_ (385683 vs. 308546) : Size of weights must equal to number of rows.
I have checked the source code of kfoldwrapper, and I find that:
Maybe "sample_weight" should be "sample_weight[train_idx]"? Otherwise the shape of sample_weight cannot match that of X.
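The two sizes in the error message are consistent with this reading: 308546 is about 80% of 385683, which is the training share of a 5-fold split. The sketch below is a NumPy-only illustration of the proposed fix on made-up data, not the package's kfoldwrapper code. It shows the weights being indexed with the same training indices as the rows.

```python
import numpy as np

n_samples, n_splits = 10, 5
X = np.arange(n_samples * 2, dtype=float).reshape(n_samples, 2)
sample_weight = np.linspace(0.1, 1.0, n_samples)

indices = np.arange(n_samples)
folds = np.array_split(indices, n_splits)

train_sizes = []
for val_idx in folds:
    # Training rows are all rows outside the current validation fold.
    train_idx = np.setdiff1d(indices, val_idx)
    X_train = X[train_idx]
    # Index the weights with the same training indices as the rows.
    w_train = sample_weight[train_idx]
    train_sizes.append((X_train.shape[0], w_train.shape[0]))

print(train_sizes)
```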
Hey,
Thanks for your awesome repo.
I have a question, if you don't mind: could you please give me an example of how to replace RandomForestClassifier and ExtraTreesClassifier in the CascadeForestClassifier?
I found that the macOS version is not yet available on PyPI. Are there plans to provide a macOS version, or is there any way to install it manually?
Describe the bug
Cannot correctly clone a CascadeForestClassifier/CascadeForestRegressor object with sklearn.base.clone when using customized estimators.
To Reproduce
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.base import clone
from deepforest import CascadeForestRegressor
import xgboost as xgb
import lightgbm as lgb
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestRegressor(random_state=1)
# set estimator
n_estimators = 4 # the number of base estimators per cascade layer
estimators = [lgb.LGBMRegressor(random_state=i) for i in range(n_estimators)]
model.set_estimator(estimators)
# set predictor
predictor = xgb.XGBRegressor()
model.set_predictor(predictor)
# clone model
model_new = clone(model)
# try to fit the cloned model
model_new.fit(X_train, y_train)
Expected behavior
No error
Additional context
~/miniconda3/envs/pycaret/lib/python3.8/site-packages/deep_forest-0.1.5-py3.8-linux-x86_64.egg/deepforest/cascade.py in fit(self, X, y, sample_weight)
1004 if not hasattr(self, "predictor_"):
1005 msg = "Missing predictor after calling `set_predictor`"
-> 1006 raise RuntimeError(msg)
1007
1008 binner_ = Binner(
RuntimeError: Missing predictor after calling `set_predictor`
This bug occurs because, when a model with a customized predictor or customized estimators is cloned, predictor='custom' is cloned while self.predictor_ / self.dummy_estimators are not, which introduces the bug described above.
I think this bug can be easily fixed by making the predictor and the list of estimators constructor parameters of CascadeForestClassifier/CascadeForestRegressor, as meta-estimators (e.g. ngboost) do, but the corresponding APIs may have to change.
For example, the API parameters could be:
model = CascadeForestRegressor(
    estimators=[lgb.LGBMRegressor(random_state=i) for i in range(n_estimators)],
    predictor=xgb.XGBRegressor(),
)
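The mechanism behind this report can be reproduced with a toy estimator: `sklearn.base.clone` builds a new object from the constructor parameters returned by `get_params` and does not copy any other attributes. In the sketch below, `ToyCascade` is a hypothetical class written for this illustration, not the deep-forest implementation.

```python
from sklearn.base import BaseEstimator, clone


class ToyCascade(BaseEstimator):
    """Hypothetical estimator used only to illustrate clone()."""

    def __init__(self, predictor="forest"):
        self.predictor = predictor

    def set_predictor(self, predictor):
        # The object itself is stored outside the constructor parameters.
        self.predictor = "custom"
        self.predictor_ = predictor
        return self


model = ToyCascade().set_predictor(object())
cloned = clone(model)

# The constructor parameter survives cloning ...
print(cloned.predictor)
# ... but the attribute set by set_predictor does not.
print(hasattr(cloned, "predictor_"))
```

This is why passing the predictor and estimators through the constructor, as proposed above, would let them survive cloning.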
For my experiment, I set the parameters like this:
parameters = [
    {
        'n_estimators': [2, 5, 8, 10],
        'n_trees': [50, 100, 150, 200, 250, 300],
        'predictors': ['xgboost', 'lightgbm', 'forest'],
        'max_layers': [20, 50, 80, 120, 150],
        'use_predictor': [True]
    },
    {
        'n_estimators': [2, 5, 8, 10, 13],
        'n_trees': [50, 100, 150, 200, 250, 300, 400],
        'max_layers': [20, 50, 80, 120, 150, 200],
    },
]
In the end, the experiment gives the same result for different predictors when use_predictor is True, and different max_layers values also give the same result.
Is this behavior expected?
If use_predictor=True and predictor='lightgbm' are set via model.set_params(dict), the predictor in the model is actually still forest.
Does this library support regression problems?
Thanks to the contributors, many new features have been developed. As a result, the current version of documentation could be ambiguous, and requires more explanation or demonstration.
This issue collects suggestions on the documentation. Anyone is welcome to improve the readability of the documentation. Contributors unfamiliar with our workflow for building the documentation can refer to the instructions below.
git clone https://github.com/LAMDA-NJU/Deep-Forest.git
cd Deep-Forest/docs
pip install -r requirements.txt
.rst file. (Wiki of rst)
make html
The generated html files are available in the directory _build/html/, and the homepage is index.html.
Readthedocs has been integrated into our CI, and you can also view the documentation after creating your PR, available in the last row of GitHub Checks on your PR page.
Full list available at Contributors.
Hello!
Does DF21's CascadeForestRegressor or CascadeForestClassifier support multiple outputs (e.g., multiple inputs => model => multi-dimensional output)?
Hello!
I am very glad to find your great and useful work, but I want to know how to implement a multi-input multi-output model in DF21. I really need it in my environment. Can you help me, please?
Maybe at some point we can refactor and allow users to supply custom models for the layers.
Hi, I would like to know whether deep forest could support mini-batch training like DL?
I keep getting this error:
Traceback (most recent call last):
File "classification.py", line 18, in
model.fit(X_train, y_train)
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/cascade.py", line 1418, in fit
super().fit(X, y, sample_weight)
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/cascade.py", line 811, in fit
X_train_, y, sample_weight=sample_weight
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/_layer.py", line 222, in fit_transform
sample_weight,
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/_layer.py", line 40, in _build_estimator
X_aug_train = estimator.fit_transform(X, y, sample_weight)
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/estimator.py", line 212, in fit_transform
self.estimator.fit(X, y, sample_weight)
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/forest.py", line 479, in fit
for i, t in enumerate(trees)
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/parallel.py", line 921, in call
if self.dispatch_one_batch(iterator):
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in init
self.results = batch()
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/parallel.py", line 225, in call
for func, args, kwargs in self.items]
File "/afs/crc.nd.edu/user/a/alaguna/.local/lib/python3.7/site-packages/joblib/parallel.py", line 225, in
for func, args, kwargs in self.items]
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/forest.py", line 119, in _parallel_build_trees
tree.random_state, n_samples, n_samples_bootstrap
File "/afs/crc.nd.edu/user/a/alaguna/Documents/OngoingResearch/Forest/Deep-Forest/deepforest/forest.py", line 98, in _generate_sample_mask
sample_mask = _LIB._c_sample_mask(sample_indices, n_samples)
File "deepforest/_cutils.pyx", line 38, in deepforest._cutils._c_sample_mask
cpdef np.ndarray _c_sample_mask(const INT32_t [:] indices,
File "deepforest/_cutils.pyx", line 46, in deepforest._cutils._c_sample_mask
np.ndarray[BOOL, ndim=1] sample_mask = np.zeros((n_samples,),
ValueError: Does not understand character buffer dtype format string ('?')
TypeError: unhashable type: 'slice'
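The input data is purely numeric and the number of samples matches the number of labels, so I don't know why this error occurs. Looking forward to your reply, and thank you.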
Considering the number of aggregations from each model in each layer, it would be nice to train the models faster
Is it possible that deep forest-based transfer learning will appear in this library in the future? I find that the transfer ability of a model is very important in practical engineering tasks and in many academic papers. If deep-forest could also be applied to transfer learning, it would be very competitive with neural networks. Thank you very much!
We know that DF automatically splits off part of the training set as a validation set. What is the split ratio?