Comments (10)
Hi @nilslacroix, thanks for your feedback! This is indeed a problem that can arise when the test set has a large number of samples. As explained in the theoretical description in the documentation (see the CV+ figure), MAPIE needs to compute the distribution of residuals and predictions over all training samples for each test point. MAPIE does this in a vectorized way with two-dimensional arrays of size (n_train_samples, n_test_samples), hence exceeding the available memory when the numbers of training and test samples are both high.
We will fix the problem in a later PR by dividing the test set into batches. Meanwhile, we invite you to split your test set explicitly and call mapie.predict() in a loop.
Hey @nilslacroix, I don't think that MAPIE builds matrices of size n_train_samples * n_test_samples, but only of size n_train_samples * n_estimators, with n_estimators=10 in case cv=10. Am I right @vtaquet? As for the number of features, it is not a scaling parameter of MAPIE, only of the internal model you provide.
I thought so too, but when I use the parameters above I get an error because the matrix is too big, and the reported size is exactly what I describe. So even though I use the CV+ method (via the parameters described in the docs), an n_train_samples * n_test_samples matrix is constructed.
Maybe a size argument is not working properly?
For X_test = 149689 and X_train = 152363 samples, I got the error below during prediction. As you can see, the method is "plus" and cv=5 for a default LGBM regressor.
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Input In [72], in <cell line: 3>()
1 mapie = MapieRegressor(grid_search.best_estimator_, method="plus", cv=5, n_jobs=multiprocessing.cpu_count()-1)
2 mapie.fit(X_train, y_train)
----> 3 y_pred, y_interval = mapie.predict(X_test, alpha = 0.20)
4 y_low, y_up = y_interval[:, 0, :], y_interval[:, 1, :]
6 score_coverage = regression_coverage_score(np.expm1(y_test), np.expm1(y_low), np.expm1(y_up))
File ~\miniconda3\envs\Master_ML\lib\site-packages\mapie-0.3.2-py3.9.egg\mapie\regression.py:652, in MapieRegressor.predict(self, X, ensemble, alpha)
639 y_pred_multi = np.column_stack([e.predict(X) for e in self.estimators_])
641 # At this point, y_pred_multi is of shape
642 # (n_samples_test, n_estimators_).
643 # If ``method`` is "plus":
(...)
649 # ``aggregate_with_mask`` fits it to the right size
650 # thanks to the shape of k_.
--> 652 y_pred_multi = self.aggregate_with_mask(y_pred_multi, self.k_)
654 if self.method == "plus":
655 if self.residual_score_.sym:
File ~\miniconda3\envs\Master_ML\lib\site-packages\mapie-0.3.2-py3.9.egg\mapie\regression.py:439, in MapieRegressor.aggregate_with_mask(self, x, k)
437 if self.agg_function in ["mean", None]:
438 K = np.nan_to_num(k, nan=0.0)
--> 439 return np.matmul(x, (K / (K.sum(axis=1, keepdims=True))).T)
440 raise ValueError("The value of self.agg_function is not correct")
MemoryError: Unable to allocate 170. GiB for an array with shape (149689, 152363) and data type float64
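The 170 GiB figure in the traceback is consistent with materializing a dense float64 matrix of the reported shape; a quick back-of-the-envelope check:

```python
# Size of a dense float64 matrix of shape (149689, 152363),
# as reported in the MemoryError above.
n_test, n_train = 149689, 152363
bytes_needed = n_test * n_train * 8  # 8 bytes per float64 element
gib = bytes_needed / 2**30           # convert bytes to GiB
print(f"{gib:.0f} GiB")              # ~170 GiB
```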
This is a significant barrier to using this package in my opinion/experience, and it seems to be avoidable. Could you not calculate the quantiles from self.conformity_scores_, then add those to the original predictions directly? Something like:
bounds = np_nanquantile(self.conformity_scores_, 1 - alpha_np)  # shape: (len(alpha),)
lower_bounds = np.add(y_pred[:, np.newaxis], -bounds)  # shape: (n_test_samples, len(alpha))
upper_bounds = np.add(y_pred[:, np.newaxis], bounds)  # shape: (n_test_samples, len(alpha))
This may only work with naive and base; I admit I don't fully understand how plus and minmax operate.
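A minimal numpy sketch of the split/naive-style interval construction this comment proposes. The conformity_scores and y_pred arrays below are dummies, and np.nanquantile stands in for MAPIE's internal quantile helper; this is an illustration of the shape arithmetic, not MAPIE's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
conformity_scores = np.abs(rng.normal(size=1000))  # dummy calibration residuals
y_pred = rng.normal(size=50)                       # dummy point predictions
alpha_np = np.array([0.05, 0.20])

# One quantile of the conformity scores per alpha level.
bounds = np.nanquantile(conformity_scores, 1 - alpha_np)  # shape: (2,)
lower_bounds = y_pred[:, np.newaxis] - bounds             # shape: (50, 2)
upper_bounds = y_pred[:, np.newaxis] + bounds             # shape: (50, 2)

print(lower_bounds.shape, upper_bounds.shape)  # (50, 2) (50, 2)
```

Note that memory here scales with n_test_samples * len(alpha), not n_test_samples * n_train_samples, which is the point of the suggestion.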
I still have this issue in mapie==0.6.1
I have this same problem in mapie==0.6.5. It's trying to allocate over 50 GB. I like MAPIE a lot, but I can't use this library if it takes such a naive approach. Please advise whether this will be fixed any time soon.
Hello @scottee, I recommend that you consult other issues that mention this problem. Without context, it's difficult for us to understand your particular case. But here are the answers I've been able to provide: #328 #326
TL;DR: A priori, this is not a problem with prefit mode. This is a problem that can arise when the calibration set and the test set have a large number of samples. This behavior is unintended, as the predict method is generally used with a smaller number of test samples during inference.
TL;DR: This is a problem that can arise when the calibration set and the test set have a large number of samples. This behavior is unintended, as the predict method, called in the fit method of MapieTimeSeriesRegressor, is generally used with a smaller number of test samples during inference.
Recommendation: prefer a smaller calibration set. MAPIE will still be just as effective, but will run faster (200k samples is unreasonably large). The cv feature should be used when you don't have many samples; otherwise, use the prefit or split features.
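A minimal sketch of the recommendation above: cap the calibration set with a random subsample before handing it to MAPIE. The 5000-sample cap, the function name, and the variable names are illustrative assumptions, not MAPIE API:

```python
import numpy as np

def subsample_calibration(X_calib, y_calib, max_samples=5000, seed=0):
    """Randomly cap the calibration set size before passing it to MAPIE."""
    n = len(X_calib)
    if n <= max_samples:
        return X_calib, y_calib
    idx = np.random.default_rng(seed).choice(n, size=max_samples, replace=False)
    return X_calib[idx], y_calib[idx]

# Dummy arrays standing in for a 200k-sample calibration set.
X = np.arange(200_000, dtype=float).reshape(-1, 1)
y = np.zeros(200_000)
X_small, y_small = subsample_calibration(X, y)
print(X_small.shape)  # (5000, 1)
```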
Actually, I would recommend implementing the looping process described by @vtaquet directly within the predict methods of all MAPIE classes, essentially by inheritance from a base class. This class would typically have a predict method with an additional argument called batch_size (like in e.g. TensorFlow).
For example:
alpha = [0.01, 0.05]
model = MapieRegressor(base_model, method="plus", cv=5)
model.fit(X_train, y_train)
y_preds, y_pis = model.predict(X_test, alpha=alpha, batch_size=100)
which would be roughly equivalent to:
alpha = [0.01, 0.05]
model = MapieRegressor(base_model, method="plus", cv=5)
model.fit(X_train, y_train)
# code that initializes y_preds and y_pis
n_batches = len(X_test) // batch_size
for X_test_batch in np.array_split(X_test, n_batches):
    y_preds_batch, y_pis_batch = model.predict(X_test_batch, alpha=alpha)
    # code that populates y_preds and y_pis with the results of the current batch
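The sketch above can be made concrete as a standalone helper that works with any object exposing a MAPIE-style predict(X, alpha=...) returning (y_pred, y_pis). The DummyModel below is a stand-in used only to exercise the batching logic; it is not MAPIE:

```python
import numpy as np

def predict_in_batches(model, X_test, alpha, batch_size=100):
    """Call model.predict on chunks of X_test and stitch the results together."""
    n_batches = max(1, len(X_test) // batch_size)
    preds, pis = [], []
    for X_batch in np.array_split(X_test, n_batches):
        y_pred_batch, y_pis_batch = model.predict(X_batch, alpha=alpha)
        preds.append(y_pred_batch)
        pis.append(y_pis_batch)
    return np.concatenate(preds), np.concatenate(pis)

class DummyModel:
    """Mimics MAPIE's predict signature: point predictions plus
    intervals of shape (n_samples, 2, len(alpha))."""
    def predict(self, X, alpha):
        y = X[:, 0]
        width = np.asarray(alpha)  # one interval half-width per alpha level
        pis = np.stack([y[:, None] - width, y[:, None] + width], axis=1)
        return y, pis

X_test = np.arange(1000, dtype=float).reshape(-1, 1)
y_pred, y_pis = predict_in_batches(DummyModel(), X_test, alpha=[0.01, 0.05])
print(y_pred.shape, y_pis.shape)  # (1000,) (1000, 2, 2)
```

Since each chunk is processed independently, peak memory scales with batch_size rather than with the full test set.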
I'm seeing this as well. I believe batching the predictions works around the large memory requirement, but it would be great if this were handled within the library rather than outside it; it can be quite a surprise when memory blows up in the predict method. If the library owners think this is a good idea, I can try to implement it.