Comments (2)
Because EBMs are a restricted model class, which is what keeps them intelligible, they do not need as much data as some other model types such as neural nets or boosted decision trees. In practice their data requirements are closer to those of linear and logistic regression than to those of deep neural nets, but they often need, or at least benefit from, more data than a linear model would. The more features in the dataset, the more data you need to learn an accurate model, and the more complex the function needed for each feature, the more data is needed to shape those functions accurately, so it is difficult to give numbers without knowing more about the data and the problem. Our experience is that useful models with a few dozen features can be trained on data with 500 or more cases if the data is not too imbalanced.

I like to look at the size of the smallest important class when I think about data size. If there are 10k training cases but only 1% are positives, then there are only 100 positive cases, and the data no longer behaves like a large 10k sample. There is also a difference between classification and regression: regression can often work with fewer samples because the label of each sample carries more information than a Boolean classification label, which is only 0 or 1.
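As a quick way to check the size of the smallest class before training, here is a minimal sketch using only the Python standard library. The helper name `smallest_class_size` is made up for illustration, and the labels are synthetic, built to match the 10k cases with 1% positives mentioned in the comment.

```python
from collections import Counter


def smallest_class_size(labels):
    """Return (class_label, count) for the least frequent class in `labels`."""
    counts = Counter(labels)
    label, count = min(counts.items(), key=lambda item: item[1])
    return label, count


# Synthetic labels: 10,000 training cases with 1% positives.
labels = [1] * 100 + [0] * 9900

label, count = smallest_class_size(labels)
print(f"{len(labels)} cases, smallest class {label!r} has {count} cases")
```

The count of the smallest class, rather than the total row count, is the number to compare against the rough sample sizes discussed in this thread.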
In summary, EBMs are reasonably sample efficient: they need somewhat more data than linear methods, but usually not as much as more complex black-box methods such as neural nets and unrestricted boosted trees. EBMs often work well with sample sizes of about 1000 cases or more. If there are very few samples for training, it sometimes helps to adjust the EBM hyperparameters to use more outer bagging, fewer bins, and even shorter trees.
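The hyperparameter adjustments for very small datasets might look like the sketch below. It assumes the `outer_bags`, `max_bins`, and `max_leaves` arguments of `ExplainableBoostingClassifier` in the `interpret` package. The specific values are illustrative guesses, not recommendations from this thread, and the helper `make_small_data_ebm` is hypothetical.

```python
# Illustrative values only; they are assumptions, not tuned recommendations.
small_data_params = {
    "outer_bags": 25,  # more outer bagging to average out small-sample noise
    "max_bins": 32,    # fewer bins per feature, so each bin holds more cases
    "max_leaves": 2,   # shorter trees: a single split per boosting step
}


def make_small_data_ebm(**overrides):
    """Build an unfitted EBM classifier, or return None if interpret is missing."""
    try:
        from interpret.glassbox import ExplainableBoostingClassifier
    except ImportError:
        return None
    params = {**small_data_params, **overrides}
    return ExplainableBoostingClassifier(**params)


ebm = make_small_data_ebm()
# Next step (data not shown here): ebm.fit(X_train, y_train)
```

Whether these settings actually help depends on the dataset, so it is worth comparing them against the defaults with cross-validation.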
from interpret.
Thank you very much sir, I really liked your EBM model. So if we have 3 or 4 features, we can still get insights from an EBM using less data. Some research papers using EBMs have used around 300 cases or fewer. From research papers alone we can't really say how much data is actually required for a particular model, because the data in each paper varies from 50 to 100 cases and more.