yuenshingyan / missforest
Arguably the best missing values imputation method.
License: MIT License
Thanks for sharing the implementation. I am getting ValueError: at least one array or dtype is required when I run mfe = mfe.impute(data, rfc, rfr). It works fine when I read the data with fish = pd.read_csv('Fish.csv'), but when I read a different file it raises this error, even though my DataFrame looks fine ("[699 rows x 10 columns]", type DataFrame). Could you please check?
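A small diagnostic sketch may help narrow this down. In NumPy, this exact message is raised by np.result_type when it is called with no operands, so one thing worth checking (an assumption, not a confirmed cause) is whether some step ends up with an empty set of columns or dtypes. The helper name summarize and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

# np.result_type needs at least one operand; with none it raises the
# ValueError quoted in the report above.
try:
    np.result_type()
    message = None
except ValueError as exc:
    message = str(exc)


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })


# Illustrative data only; replace with the DataFrame that fails.
data = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
report = summarize(data)
print(message)
print(report)
```

Comparing this report for Fish.csv and for the failing file may show which columns differ in dtype or missingness.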
Thanks for sharing the implementation. I haven't figured out how you deal with categorical variables. Could you please tell me what form the input categorical variables should take? It seems to me they could be string labels, and that you apply one-hot encoding to them before imputation. I'm not sure I understand this correctly. Thanks in advance for your help.
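Until this is documented, a minimal sketch for collecting candidate categorical column names with pandas is shown here. It only builds the list of names; whether the library expects string labels or pre-encoded values is not confirmed by anything on this page. The column names and values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative frame; the columns are assumptions, not taken from Fish.csv.
df = pd.DataFrame({
    "species": ["Bream", "Perch", None, "Pike"],
    "weight": [242.0, None, 340.0, 363.0],
})

# Treat every non-numeric column as a candidate categorical column.
cat_col = [c for c in df.columns if not is_numeric_dtype(df[c])]
print(cat_col)
```

Another report on this page passes such a list as fit_transform(X_train, categorical=cat_col).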
Import statement 'from missforest.missforest import MissForest' in README.md was incorrect. Correct import statement should be 'from missforest.miss_forest import MissForest'.
Importing MissForest with:
from missforest.missforest import MissForest
leads to an error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[2], line 23
15 from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, LabelEncoder
17 # Small fix to make missingpy forward compatible
18 # import sklearn.neighbors._base
19 # sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
20
21 # from missingpy import MissForest
---> 23 from missforest.missforest import MissForest
ModuleNotFoundError: No module named 'missforest.missforest'
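Since both module paths appear on this page (missforest.miss_forest in the README fix, missforest.missforest in other tracebacks), a tolerant import could look like the sketch below. The function name load_missforest is hypothetical, and which path exists depends on the installed version.

```python
import importlib

# Both module paths are mentioned in the reports on this page.
CANDIDATES = ("missforest.miss_forest", "missforest.missforest")


def load_missforest():
    """Return the MissForest class from the first importable path, else None."""
    for name in CANDIDATES:
        try:
            module = importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            continue
        cls = getattr(module, "MissForest", None)
        if cls is not None:
            return cls
    return None


MissForest = load_missforest()
print("MissForest found" if MissForest is not None else "missforest is not installed")
```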
The README.md file currently doesn't include any information about how MissForest deals with categorical variables. A new section or paragraph could be added to provide more information on that.
I'll post updates as I make progress. If anyone has any suggestions or insights, feel free to share!
Running fit_transform raises ValueError: Input data must be 2 dimensional and non empty.
The input data is 2-dimensional and non-empty.
Code to reproduce:
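import numpy as np
import pandas as pd
from scipy.stats import binom, norm
# Import path matches the traceback below; it may differ in other versions.
from missforest.missforest import MissForest

np.random.seed(1234)
N = np.arange(1000)
vary = lambda v: np.random.choice(np.arange(v))
a = 2
b = 1/2
eps = np.array([norm(0, vary(50)).rvs() for n in N])
y = (a + b*N + eps) / 100
x = (N + norm(10, vary(250)).rvs(len(N))) / 100
# Set roughly 40% of y to missing; x stays fully observed
y[binom(1, 0.4).rvs(len(N)) == 1] = np.nan
# Convert to a DataFrame
df = pd.DataFrame({"y": y, "x": x})
df.head()
mf = MissForest()
df_imputed = mf.fit_transform(df)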
Error:
ValueError Traceback (most recent call last)
Cell In[87], line 3
1 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
2 mf = MissForest()
----> 3 df_imputed = mf.fit_transform(df)
File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:531, in MissForest.fit_transform(self, X, categorical)
512 """
513 Class method 'fit_transform' calls class method 'fit' and 'transform'
514 on 'X'.
(...)
527 Imputed dataset (features only).
528 """
530 self.fit(X, categorical)
--> 531 X = self.transform(X)
533 return X
File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:457, in MissForest.transform(self, X)
455 X_missing = X_imp.loc[miss_index]
456 X_missing = X_missing.drop(c, axis=1)
--> 457 y_pred = estimator.predict(X_missing)
458 y_pred = pd.Series(y_pred)
459 y_pred.index = self._miss_row[c]
File /opt/conda/lib/python3.10/site-packages/lightgbm/sklearn.py:918, in LGBMModel.predict(self, X, raw_score, start_iteration, num_iteration, pred_leaf, pred_contrib, validate_features, **kwargs)
915 predict_params = _choose_param_value("num_threads", predict_params, self.n_jobs)
916 predict_params["num_threads"] = self._process_n_jobs(predict_params["num_threads"])
--> 918 return self._Booster.predict( # type: ignore[union-attr]
919 X, raw_score=raw_score, start_iteration=start_iteration, num_iteration=num_iteration,
920 pred_leaf=pred_leaf, pred_contrib=pred_contrib, validate_features=validate_features,
921 **predict_params
922 )
File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:4220, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features, **kwargs)
4218 else:
4219 num_iteration = -1
-> 4220 return predictor.predict(
4221 data=data,
4222 start_iteration=start_iteration,
4223 num_iteration=num_iteration,
4224 raw_score=raw_score,
4225 pred_leaf=pred_leaf,
4226 pred_contrib=pred_contrib,
4227 data_has_header=data_has_header,
4228 validate_features=validate_features
4229 )
File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:1004, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features)
995 _safe_call(
996 _LIB.LGBM_BoosterValidateFeatureNames(
997 self._handle,
(...)
1000 )
1001 )
1003 if isinstance(data, pd_DataFrame):
-> 1004 data = _data_from_pandas(
1005 data=data,
1006 feature_name="auto",
1007 categorical_feature="auto",
1008 pandas_categorical=self.pandas_categorical
1009 )[0]
1011 predict_type = _C_API_PREDICT_NORMAL
1012 if raw_score:
File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:677, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
670 def _data_from_pandas(
671 data: pd_DataFrame,
672 feature_name: _LGBM_FeatureNameConfiguration,
673 categorical_feature: _LGBM_CategoricalFeatureConfiguration,
674 pandas_categorical: Optional[List[List]]
675 ) -> Tuple[np.ndarray, List[str], List[str], List[List]]:
676 if len(data.shape) != 2 or data.shape[0] < 1:
--> 677 raise ValueError('Input data must be 2 dimensional and non empty.')
679 # determine feature names
680 if feature_name == 'auto':
ValueError: Input data must be 2 dimensional and non empty.
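One plausible reading of this traceback, not confirmed by the maintainer, is that transform also calls predict for a column with no missing values (x in the reproduction), so the row selection handed to the estimator is empty. The sketch below reproduces only that selection step with pandas; missing_rows_by_column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Same shape of problem as the report: y has gaps, x is fully observed.
df = pd.DataFrame({
    "y": [1.0, np.nan, 3.0, np.nan],
    "x": [0.1, 0.2, 0.3, 0.4],
})


def missing_rows_by_column(frame):
    """Map each column name to the index labels where it is missing."""
    return {c: frame.index[frame[c].isna()] for c in frame.columns}


miss = missing_rows_by_column(df)

# Selecting the "missing" rows of a fully observed column gives 0 rows,
# which is the kind of input LightGBM rejects as empty.
empty_selection = df.loc[miss["x"]].drop("x", axis=1)
print(empty_selection.shape)
```

Skipping columns whose missing-row index is empty before calling predict would avoid passing an empty frame.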
Hi,
I am using MissForest to impute a data frame consisting of two columns, of which only the second one has missing values. I get an error message:
492 if (
493 n_iter >= 2 and
494 len(self.categorical) > 0 and
495 all_gamma_cat[-1] > all_gamma_cat[-2]
496 ):
497 break
499 if (
500 n_iter >= 2 and
501 len(self.numerical) > 0 and
--> 502 all_gamma_num[-1] > all_gamma_num[-2]
503 ):
504 break
506 # mapping the encoded values back to its categories.
IndexError: list index out of range
The code works well as long as I have missing values in both columns (tested by artificially adding a nan in the first column), but not if the first column is completely observed.
Can this be fixed?
Thank you!
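The traceback suggests that all_gamma_num holds fewer than two entries when it is indexed with [-2]. A guard along these lines would avoid the IndexError; this is a sketch of the idea with a hypothetical helper name, not the library's code.

```python
def criterion_increased(history):
    """Return True only if the last recorded value exceeds the previous one.

    With fewer than two recorded values there is nothing to compare,
    so the answer is False instead of an IndexError.
    """
    if len(history) < 2:
        return False
    return history[-1] > history[-2]


print(criterion_increased([]))          # no values recorded yet
print(criterion_increased([0.4, 0.5]))  # criterion went up
```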
Hi, I'm getting the following error and have been unable to debug it.
Thanks in advance!
kf = KFold(n_splits=5, shuffle=True, random_state=seed)
for fold, (train_index, test_index) in enumerate(kf.split(df)):
    print(f"Processing fold {fold + 1}")
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    clf = RandomForestClassifier(n_jobs=-1)  # for categorical
    rgr = RandomForestRegressor(n_jobs=-1)  # for numerical
    imputer = MissForest(clf, rgr)
    X_train_imputed = imputer.fit_transform(X_train, categorical=cat_col)
    X_test_imputed = imputer.transform(X_test)
    # Save imputed datasets
    train = pd.concat([X_train_imputed, y_train], axis=1)
    test = pd.concat([X_test_imputed, y_test], axis=1)
    train.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_train.feather')
    test.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_test.feather')
KeyError Traceback (most recent call last)
Cell In[9], line 15
12 imputer = MissForest(clf,rgr)
14 X_train_imputed = imputer.fit_transform(X_train, categorical = cat_col)
---> 15 X_test_imputed = imputer.transform(X_test)
17 # Save imputed datasets
18 train = pd.concat([X_train_imputed, y_train], axis=1)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\missforest\missforest.py:475, in MissForest.transform(self, x)
473 # Predict the missing column with the trained estimator
474 miss_index = self._missing_row[c]
--> 475 x_missing = x_imp.loc[miss_index]
476 x_missing = x_missing.drop(c, axis=1)
477 y_pred = estimator.predict(x_missing)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1192, in _LocationIndexer.__getitem__(self, key)
1190 maybe_callable = com.apply_if_callable(key, self.obj)
1191 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1192 return self._getitem_axis(maybe_callable, axis=axis)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1421, in _LocIndexer._getitem_axis(self, key, axis)
1418 if hasattr(key, "ndim") and key.ndim > 1:
1419 raise ValueError("Cannot index with multidimensional key")
-> 1421 return self._getitem_iterable(key, axis=axis)
...
-> 6248 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6250 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
6251 raise KeyError(f"{not_found} not in index")
KeyError: "None of [Index([382], dtype='int64')] are in the [index]"
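The traceback shows transform looking up self._missing_row[c] with .loc on the new frame. If those labels were recorded during fit on X_train (an assumption based on the traceback alone), they will not exist in X_test after a KFold split. The sketch below reproduces the same KeyError with plain pandas; the frames and index labels are illustrative.

```python
import pandas as pd

# Train and test keep their original, non-overlapping index labels,
# as they do after X.iloc[train_index] and X.iloc[test_index].
train = pd.DataFrame({"a": [1.0, None, 3.0]}, index=[10, 382, 500])
test = pd.DataFrame({"a": [None, 2.0]}, index=[7, 42])

# Labels of rows missing in the training frame.
train_missing = train.index[train["a"].isna()]

try:
    test.loc[train_missing]  # label 382 is absent from the test frame
    error = None
except KeyError as exc:
    error = exc

# Recomputing the labels on the frame being transformed works.
test_missing = test.index[test["a"].isna()]
rows = test.loc[test_missing]
print(type(error).__name__, rows.shape)
```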