yuenshingyan / missforest

Arguably the best missing values imputation method.

License: MIT License

Python 100.00%

missforest's Issues

Not Handling Categorical Variables unseen by training data

I get this error when transforming my dataset, which contains a UK postal districts column. I think the error comes from categorical values that were not seen in the training set. Is there a mechanism in place for handling this? Thanks.

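mf.transform(x=test) raises the error:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_jobs=-1)  # assumed definition, for categorical columns
rgr = RandomForestRegressor(n_jobs=-1)  # for numerical columns

mf = MissForest(clf, rgr)
mf.fit(X_missing)

X_imputed = mf.transform(X_missing)

x_test_imputed = mf.transform(x=test)  # this generates the error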
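
One possible workaround, sketched here under the assumption that an unseen category can simply be treated as missing and left to the imputer: mask every test value that never occurred in the corresponding training column before calling transform. The helper `mask_unseen_categories`, the column name `postal_district`, and the sample values are made up for illustration and are not part of missforest.

```python
import numpy as np
import pandas as pd


def mask_unseen_categories(train, test, categorical):
    """Return a copy of `test` where categories absent from `train` become NaN."""
    masked = test.copy()
    for col in categorical:
        seen = set(train[col].dropna().unique())
        # Keep NaN as NaN; flag only real values that training never saw.
        unseen = masked[col].notna() & ~masked[col].isin(seen)
        masked.loc[unseen, col] = np.nan
    return masked


# Hypothetical data: "SW1" never appears in the training frame.
train = pd.DataFrame({"postal_district": ["E1", "N7", np.nan, "E1"]})
test = pd.DataFrame({"postal_district": ["E1", "SW1", np.nan]})

test_masked = mask_unseen_categories(train, test, ["postal_district"])
print(test_masked)
```

The masked frame can then be passed to transform in place of the raw test frame. Whether imputing an unseen category is acceptable depends on the data; for a high-cardinality column such as postal districts it may discard a lot of information.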

ValueError: at least one array or dtype is required

Thanks for sharing the implementation. I get ValueError: at least one array or dtype is required when I run mfe = mfe.impute(data, rfc, rfr). It works fine when I read fish = pd.read_csv('Fish.csv').

But when I read some other file, it gives the error, although my DataFrame looks fine ("[699 rows x 10 columns]", type DataFrame). Could you please check?
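
The message does not say which column is the problem. A small diagnostic, sketched here with plain pandas, can show which columns are non-numeric or entirely empty before the frame is handed to the imputer. Whether either condition causes this particular error is an assumption, not something confirmed in this thread; `summarize_columns` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Per-column dtype, missing count, and whether the column is entirely missing."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "all_missing": df.isna().all(),
    })


# Hypothetical frame standing in for the file that fails.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [np.nan, np.nan, np.nan],
})
report = summarize_columns(data)
print(report)
```

Comparing this report for Fish.csv and for the failing file may narrow down what differs between them.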

How to deal with categorical variables

Thanks for sharing the implementation. I haven't figured out how you deal with categorical variables. Could you please tell me what form the input categorical variables should take? It seems to me they could be string labels, and that you apply one-hot encoding to them before imputation. I'm not sure I understand it correctly. Thanks in advance for your help.
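
For readers with the same question, here is a sketch of one common way to prepare string labels for a tree-based imputer: map each label to an integer code, keep missing entries as NaN, and map the codes back afterwards. This illustrates the general idea only; it is not taken from the missforest source, and the frame, column names, and `mappings` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],
    "size": [1.0, np.nan, 3.0, 4.0],
})
categorical = ["color"]

# Encode: one float code per category, NaN stays NaN.
mappings = {}
encoded = df.copy()
for col in categorical:
    cat = pd.Categorical(df[col])
    mappings[col] = {float(i): label for i, label in enumerate(cat.categories)}
    codes = pd.Series(cat.codes, index=df.index, dtype="float64")
    encoded[col] = codes.where(codes >= 0)  # pandas marks missing values with code -1

# Decode: map the codes back to the original labels.
decoded = encoded.copy()
for col in categorical:
    decoded[col] = encoded[col].map(mappings[col])

print(encoded)
print(decoded)
```

Other reports on this page call `fit_transform(X, categorical=...)`, which suggests that the categorical column names are passed explicitly; the exact input types accepted should be checked against the installed version.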

Fix Typo in README.md

Import statement 'from missforest.missforest import MissForest' in README.md was incorrect. Correct import statement should be 'from missforest.miss_forest import MissForest'.

Import MissForest / Typo in readme

Importing MissForest with:

from missforest.missforest import MissForest

leads to an error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 23
     15 from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, LabelEncoder
     17 # Small fix to make missingpy forward compatible
     18 # import sklearn.neighbors._base
     19 # sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
     20
     21 # from missingpy import MissForest
---> 23 from missforest.missforest import MissForest

ModuleNotFoundError: No module named 'missforest.missforest'
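
Both module paths appear on this page (`missforest.miss_forest` in the README fix, `missforest.missforest` in several tracebacks), so the right one seems to depend on the installed version. A minimal sketch of an import that tolerates either layout follows; `import_first` is a hypothetical helper, and the candidate paths are taken from the reports above rather than verified against any particular release.

```python
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first importable module path in `candidates`."""
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
        except ModuleNotFoundError:
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {candidates}")


# Intended use (requires the missforest package to be installed):
# MissForest = import_first(
#     ["missforest.miss_forest", "missforest.missforest"], "MissForest"
# )

# Demonstration with standard-library modules so the snippet runs anywhere.
OrderedDict = import_first(["no_such_pkg.no_such_mod", "collections"], "OrderedDict")
print(OrderedDict)
```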

Explain How Categorical Variables Are Handled in README.md.

The README.md file currently doesn't include any information about how MissForest deals with categorical variables. A new section or paragraph should be added to provide more information on that.

I'll post updates as I make progress. If anyone has any suggestions or insights, feel free to share!

Do I have to fit on the whole dataset? Why is the dataset split in the example?
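
A common reason for splitting is to keep information from the test rows out of the fitted imputer: the imputer is fitted on the training rows only and then applied to the test rows. The sketch below shows that pattern with scikit-learn's `SimpleImputer` as a simple stand-in, not with MissForest itself, and the arrays are made up. Whether the README example splits for this reason is an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns the column mean from the training rows only

# The test rows are filled with the statistic learned from the training rows.
X_test_imputed = imputer.transform(X_test)
print(imputer.statistics_, X_test_imputed)
```

If there is no downstream model to evaluate, fitting and transforming the whole dataset in one step is the simpler option.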

ValueError: Input data must be 2 dimensional and non empty.

Running fit_transform raises ValueError: Input data must be 2 dimensional and non empty.
The input data is 2-dimensional and non-empty.

Code to reproduce:

import numpy as np
import pandas as pd
from scipy.stats import binom, norm
from missforest.missforest import MissForest  # module path as shown in the traceback

# seed to follow along
np.random.seed(1234)

# generate 1000 data points
N = np.arange(1000)

# helper function for this data
vary = lambda v: np.random.choice(np.arange(v))

# create correlated, random variables
a = 2
b = 1/2
eps = np.array([norm(0, vary(50)).rvs() for n in N])
y = (a + b*N + eps) / 100
x = (N + norm(10, vary(250)).rvs(len(N))) / 100

# add missing values
y[binom(1, 0.4).rvs(len(N)) == 1] = np.nan

# convert to dataframe
df = pd.DataFrame({"y": y, "x": x})
df.head()

mf = MissForest()
df_imputed = mf.fit_transform(df)

Error:

ValueError Traceback (most recent call last)
Cell In[87], line 3
1 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
2 mf = MissForest()
----> 3 df_imputed = mf.fit_transform(df)

File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:531, in MissForest.fit_transform(self, X, categorical)
512 """
513 Class method 'fit_transform' calls class method 'fit' and 'transform'
514 on 'X'.
(...)
527 Imputed dataset (features only).
528 """
530 self.fit(X, categorical)
--> 531 X = self.transform(X)
533 return X

File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:457, in MissForest.transform(self, X)
455 X_missing = X_imp.loc[miss_index]
456 X_missing = X_missing.drop(c, axis=1)
--> 457 y_pred = estimator.predict(X_missing)
458 y_pred = pd.Series(y_pred)
459 y_pred.index = self._miss_row[c]

File /opt/conda/lib/python3.10/site-packages/lightgbm/sklearn.py:918, in LGBMModel.predict(self, X, raw_score, start_iteration, num_iteration, pred_leaf, pred_contrib, validate_features, **kwargs)
915 predict_params = _choose_param_value("num_threads", predict_params, self.n_jobs)
916 predict_params["num_threads"] = self._process_n_jobs(predict_params["num_threads"])
--> 918 return self._Booster.predict( # type: ignore[union-attr]
919 X, raw_score=raw_score, start_iteration=start_iteration, num_iteration=num_iteration,
920 pred_leaf=pred_leaf, pred_contrib=pred_contrib, validate_features=validate_features,
921 **predict_params
922 )

File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:4220, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features, **kwargs)
4218 else:
4219 num_iteration = -1
-> 4220 return predictor.predict(
4221 data=data,
4222 start_iteration=start_iteration,
4223 num_iteration=num_iteration,
4224 raw_score=raw_score,
4225 pred_leaf=pred_leaf,
4226 pred_contrib=pred_contrib,
4227 data_has_header=data_has_header,
4228 validate_features=validate_features
4229 )

File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:1004, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features)
995 _safe_call(
996 _LIB.LGBM_BoosterValidateFeatureNames(
997 self._handle,
(...)
1000 )
1001 )
1003 if isinstance(data, pd_DataFrame):
-> 1004 data = _data_from_pandas(
1005 data=data,
1006 feature_name="auto",
1007 categorical_feature="auto",
1008 pandas_categorical=self.pandas_categorical
1009 )[0]
1011 predict_type = _C_API_PREDICT_NORMAL
1012 if raw_score:

File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:677, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
670 def _data_from_pandas(
671 data: pd_DataFrame,
672 feature_name: _LGBM_FeatureNameConfiguration,
673 categorical_feature: _LGBM_CategoricalFeatureConfiguration,
674 pandas_categorical: Optional[List[List]]
675 ) -> Tuple[np.ndarray, List[str], List[str], List[List]]:
676 if len(data.shape) != 2 or data.shape[0] < 1:
--> 677 raise ValueError('Input data must be 2 dimensional and non empty.')
679 # determine feature names
680 if feature_name == 'auto':

ValueError: Input data must be 2 dimensional and non empty.
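
The traceback ends in LightGBM's check `data.shape[0] < 1`, so the frame passed to `predict` appears to have had no rows. In the reproduction only `y` has missing values, which would be consistent with `predict` being called for a column that has nothing to impute; this is a reading of the traceback, not a confirmed diagnosis. The sketch below shows the kind of guard that avoids the call. `impute_column` and `MeanEstimator` are hypothetical stand-ins, not missforest or LightGBM code.

```python
import numpy as np
import pandas as pd


class MeanEstimator:
    """Tiny stand-in for a fitted regressor; predicts a constant."""

    def __init__(self, value):
        self.value = value
        self.calls = 0

    def predict(self, X):
        self.calls += 1
        if len(X) < 1:
            raise ValueError("Input data must be 2 dimensional and non empty.")
        return np.full(len(X), self.value)


def impute_column(estimator, X, col):
    """Fill NaNs in `col` with predictions, skipping columns with nothing to fill."""
    miss_index = X.index[X[col].isna()]
    if len(miss_index) == 0:
        return X  # nothing missing: do not call predict on an empty frame
    X = X.copy()
    features = X.loc[miss_index].drop(columns=col)
    X.loc[miss_index, col] = estimator.predict(features)
    return X


df = pd.DataFrame({"y": [1.0, np.nan, 3.0], "x": [0.1, 0.2, 0.3]})
est = MeanEstimator(2.0)
out = impute_column(est, df, "x")   # "x" is fully observed, so predict is skipped
out = impute_column(est, out, "y")  # "y" has one missing row, so predict runs once
print(out)
```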

Problem when column has no missingness

Hi,

I am using MissForest to impute a data frame consisting of two columns, of which only the second one has missing values. I get an error message:

    492     if (
    493             n_iter >= 2 and
    494             len(self.categorical) > 0 and
    495             all_gamma_cat[-1] > all_gamma_cat[-2]
    496     ):
    497         break
    499     if (
    500             n_iter >= 2 and
    501             len(self.numerical) > 0 and
--> 502             all_gamma_num[-1] > all_gamma_num[-2]
    503     ):
    504         break
    506 # mapping the encoded values back to its categories.

IndexError: list index out of range

The code works well as long as I have missing values in both columns (tested by artificially adding a nan in the first column), but not if the first column is completely observed.

Can this be fixed?

Thank you!
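
The excerpt guards on `n_iter >= 2` but then indexes `all_gamma_num[-2]`, so the IndexError implies the list held fewer than two values at that point. One plausible cause, which is an assumption here, is that no value was appended in an iteration where the relevant columns had nothing to impute. A stopping check that depends on the list's own length avoids the problem; `should_stop` is a hypothetical helper, not the library's code.

```python
def should_stop(history):
    """True once the latest value is larger than the previous one.

    Returns False while fewer than two values have been recorded,
    so it never indexes past the start of the list.
    """
    return len(history) >= 2 and history[-1] > history[-2]


print(should_stop([]))          # no values recorded yet
print(should_stop([0.5]))       # only one value recorded
print(should_stop([0.5, 0.4]))  # still decreasing
print(should_stop([0.4, 0.5]))  # increased, so iteration can stop
```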

"None of [Index([382], dtype='int64')] are in the [index]"

Hi, I'm getting the following error and have been unable to debug it.

Thanks in advance!

kf = KFold(n_splits=5, shuffle=True, random_state=seed)

for fold, (train_index, test_index) in enumerate(kf.split(df)):
    print(f"Processing fold {fold + 1}")
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()

    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

    clf = RandomForestClassifier(n_jobs=-1) #for categorical
    rgr = RandomForestRegressor(n_jobs=-1) #for numerical
    imputer = MissForest(clf,rgr)

    X_train_imputed = imputer.fit_transform(X_train, categorical = cat_col)
    X_test_imputed = imputer.transform(X_test)

    # Save imputed datasets
    train = pd.concat([X_train_imputed, y_train], axis=1)
    test = pd.concat([X_test_imputed, y_test], axis=1)
    train.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_train.feather')
    test.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_test.feather')

KeyError                                  Traceback (most recent call last)
Cell In[9], line 15
     12 imputer = MissForest(clf,rgr)
     14 X_train_imputed = imputer.fit_transform(X_train, categorical = cat_col)
---> 15 X_test_imputed = imputer.transform(X_test)
     17 # Save imputed datasets
     18 train = pd.concat([X_train_imputed, y_train], axis=1)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\missforest\missforest.py:475, in MissForest.transform(self, x)
    473 # Predict the missing column with the trained estimator
    474 miss_index = self._missing_row[c]
--> 475 x_missing = x_imp.loc[miss_index]
    476 x_missing = x_missing.drop(c, axis=1)
    477 y_pred = estimator.predict(x_missing)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1192, in _LocationIndexer.__getitem__(self, key)
   1190 maybe_callable = com.apply_if_callable(key, self.obj)
   1191 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1192 return self._getitem_axis(maybe_callable, axis=axis)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1421, in _LocIndexer._getitem_axis(self, key, axis)
   1418     if hasattr(key, "ndim") and key.ndim > 1:
   1419         raise ValueError("Cannot index with multidimensional key")
-> 1421     return self._getitem_iterable(key, axis=axis)
...
-> 6248         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6250     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6251     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([382], dtype='int64')] are in the [index]"
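
The traceback shows `miss_index = self._missing_row[c]` followed by `x_imp.loc[miss_index]`. If `_missing_row` was recorded during `fit`, its row labels belong to the training fold and need not exist in the test fold, which would explain the KeyError; that reading is an assumption based on the traceback alone. The pandas behaviour itself can be reproduced without missforest, as in this sketch with made-up data:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0]})
X_train, X_test = X.iloc[[0, 1, 2, 4]], X.iloc[[3, 5]]

# Labels of the rows with missing values in the training fold.
train_missing = X_train.index[X_train["a"].isna()]

# Looking those labels up in the test fold fails, because label 1 is not there.
try:
    X_test.loc[train_missing]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Labels computed from the frame being transformed always exist in it.
test_missing = X_test.index[X_test["a"].isna()]
rows_to_impute = X_test.loc[test_missing]
print(rows_to_impute)
```

Resetting the index of each fold changes which labels exist but does not guarantee that the training labels are present in a smaller test fold, so it may not be a reliable workaround.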