Comments (5)
It's not just the split gain that is different on the left root child: it's also not splitting on the same feature.
Here is another example, this time with pre-binned data.
I can't explain why the left root child has a different split gain. When I print the split gain values considered by LightGBM, no candidate split has a gain equal to 0.923.
The discrepancy may not come from the binning strategy itself here; it could be due to how the bins are treated afterwards. Some of them may not be considered, or they may be merged, I don't know.
from sklearn.model_selection import train_test_split
from pygbm import GradientBoostingMachine
from pygbm.plotting import plot_tree
from pygbm.binning import BinMapper
from lightgbm import LGBMClassifier
import numpy as np

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 4
n_trees = 1
lr = 1.
min_samples_leaf = 1
max_bins = 255
n_samples = 100

# Synthetic binary target that depends on the first two features only.
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0) & (X[:, 1] > .5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

# Pre-bin the training data so both libraries see the same integer bins.
X_train = BinMapper().fit_transform(X_train)

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)

plot_tree(pygbm_model, lightgbm_model, view=True)
Ok, I made some small progress on this. I still don't know the details of LightGBM's binning, but I can explain the two previous comments.
For the first comment (#39 (comment)), it looks like LightGBM forces -1e-35 and 1e-35 as binning thresholds, regardless of the binning strategy (see here and here). Now I understand why the function is called FindBinWithZeroAsOneBin... It will also add 0 as one of the 'unique values' (see here), but that is not directly related here.
Do we want to do such a thing as well?
For the binning threshold, something like midpoints = np.insert(midpoints, np.searchsorted(midpoints, 0), 0) would do it, but that would make midpoints bigger than 256 entries in most cases.
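A minimal sketch of that idea, with made-up midpoints standing in for the thresholds that BinMapper actually computes:
import numpy as np

# Made-up bin thresholds straddling zero (BinMapper would compute the real ones).
midpoints = np.array([-2.5, -0.7, 0.3, 1.8])

# Insert 0 at its sorted position so it becomes an explicit threshold,
# similar to LightGBM's forced bin boundary around zero.
midpoints = np.insert(midpoints, np.searchsorted(midpoints, 0), 0)
print(midpoints)  # [-2.5 -0.7  0.   0.3  1.8]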
For my second comment (#39 (comment)), the discrepancy comes from the min_data_in_bin parameter of LightGBM, which is 3 by default. Setting it to 1 gives the same trees (sketched below). I should have seen this sooner :s
Side note: when debugging, it's helpful to set enable_bundle to False, because the bundling (of mutually exclusive features) changes the internal order of the features in LightGBM, which makes it harder to debug: feature 0 of LightGBM is not feature 0 of pygbm, etc. Regardless of the debugging, we should probably set it to False all the time in our checks, just in case.
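For reference, here is a sketch of how the LightGBM call from the script above would look with both changes applied (min_data_in_bin and enable_bundle are regular LightGBM parameters, forwarded by the scikit-learn wrapper):
from lightgbm import LGBMClassifier

# Assumes n_trees, max_bins, n_leaf_nodes, lr, min_samples_leaf, X_train and
# y_train are defined as in the reproduction script above.
lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf,
    min_data_in_bin=1,     # LightGBM default is 3; 1 matches pygbm's binning
    enable_bundle=False)   # keep LightGBM's feature order aligned with pygbm's
lightgbm_model.fit(X_train, y_train)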
> Do we want to do such a thing as well?
Maybe we should ask the LightGBM developers to explain why this is useful.
> Regardless of the debugging, we should probably set it to False all the time in our checks, just in case.
+1, and we can re-enable it the day we implement feature bundling (hopefully).
> For my second comment (#39 (comment)), the discrepancy comes from the min_data_in_bin parameter of LightGBM which is 3 by default. Setting it to 1 gives the same trees. I should have seen this sooner :s
Nice catch.