midaspy's Introduction

MIDASpy


Overview

MIDASpy is a Python package for multiply imputing missing data using deep learning methods. The MIDASpy algorithm offers significant accuracy and efficiency advantages over other multiple imputation strategies, particularly when applied to large datasets with complex features. In addition to implementing the algorithm, the package contains functions for processing data before and after model training, running imputation model diagnostics, generating multiple completed datasets, and estimating regression models on these datasets.

For an implementation in R, see our rMIDAS repository (https://github.com/MIDASverse/rMIDAS).

Background and suggested citations

For more information on MIDAS, the method underlying the software, see:

Lall, Ranjit, and Thomas Robinson. 2022. "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Political Analysis 30, no. 2: 179-196. doi:10.1017/pan.2020.49.

Lall, Ranjit, and Thomas Robinson. 2023. "Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS." Journal of Statistical Software 107, no. 9: 1-38. doi:10.18637/jss.v107.i09.

Installation

To install via pip, enter the following command into the terminal:
pip install MIDASpy

The latest development version (potentially unstable) can be installed via the terminal with:
pip install git+https://github.com/MIDASverse/MIDASpy.git

MIDAS requires:

  • Python (>=3.6; <3.11)
  • Numpy (>=1.5)
  • Pandas (>=0.19)
  • TensorFlow (<2.12)
  • Matplotlib
  • Statsmodels
  • Scipy
  • TensorFlow Addons (<0.20)

TensorFlow itself has a number of additional requirements, particularly if GPU acceleration is desired. See https://www.tensorflow.org/install/ for details.

Examples

For a simple demonstration of MIDASpy, see our Jupyter Notebook examples.
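For orientation, here is a minimal end-to-end sketch (the file name and parameter values are illustrative only, and missing cells are assumed to be encoded as np.nan):

import pandas as pd
import MIDASpy as md

data = pd.read_csv("your_data.csv")  # hypothetical input; missing cells must be np.nan

imputer = md.Midas(layer_structure=[256, 256], seed=89)
imputer.build_model(data)
imputer.train_model(training_epochs=20)

# Draw 10 completed datasets
imputations = imputer.generate_samples(m=10).output_list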

Contributing to MIDASpy

Interested in contributing to MIDASpy? We are looking to hire a research assistant to work part-time (flexibly) to help us build out new features and integrate our software with existing machine learning pipelines. You would be paid the standard research assistant rate at the University of Oxford. To apply, please send your CV (or a summary of relevant skills/experience) to [email protected].

Version 1.3.1 (October 2023)

  • Minor update to reflect publication of the accompanying article in the Journal of Statistical Software
  • Further updates to make documentation and URLs consistent, including removing unused metadata

Version 1.2.4 (August 2023)

  • Adds support for Python 3.9 and 3.10
  • Addresses deprecation warnings and other minor bug fixes
  • Resolves dependency issues and includes an updated setup.py file
  • Adds GitHub Actions workflows that trigger automatic tests on the latest Ubuntu, macOS, and Windows for Python versions 3.7 to 3.10 each time a push or pull request is made to the main branch
  • Adds a Jupyter Notebook example demonstrating the core functionalities of MIDASpy

Version 1.2.3 (December 2022)

v1.2.3 adds support for installation on Apple Silicon hardware (i.e. M1 and M2 Macs).

Version 1.2.2 (July 2022)

v1.2.2 makes minor efficiency changes to the codebase. Full details are available in the Release logs.

Version 1.2.1 (January 2021)

v1.2.1 adds new pre-processing functionality and a multiple imputation regression function.

Users can now automatically preprocess binary and categorical columns prior to running the MIDAS algorithm using binary_conv() and cat_conv().
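A sketch of that preprocessing step, hedging on the exact signatures (binary_conv() is assumed to map a two-category pandas Series to 0/1, and cat_conv() to one-hot encode a DataFrame of categorical columns, returning the encoded columns plus a list of the generated column-name groups):

import pandas as pd
import MIDASpy as md

# toy data; the column roles are illustrative
data = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Male", "Female", "Female"],
    "workclass": ["Private", "State-gov", "Private"],
})

data_bin = data[["sex"]].apply(md.binary_conv)               # binary column -> 0/1
data_cat, cat_cols_list = md.cat_conv(data[["workclass"]])   # one-hot encode
data_in = pd.concat([data[["age"]], data_bin, data_cat], axis=1)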

The new combine() function allows users to run regression analysis across the completed datasets, following Rubin's combination rules.
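A sketch of that analysis step; the argument names follow the package's published examples but should be checked against the current API:

# 'imputations' is the list returned by generate_samples(m=...).output_list,
# and "y", "x1", "x2" are hypothetical column names
pooled = md.combine(y_var="y", X_vars=["x1", "x2"], df_list=imputations)
print(pooled)  # pooled estimates and standard errors via Rubin's rules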

Previous versions

Version 1.1.1 (October 2020)

Key changes:

  • Update adds full Tensorflow 2.X support:

    • Users can now run the MIDAS algorithm in TensorFlow 2.X (TF1 support retained)

    • Tidier handling of random seed setting across both TensorFlow and NumPy

  • Fixes a minor dependency bug

  • Other minor bug fixes

Version 1.0.2 (September 2020)

Key changes:

  • Minor, mainly cosmetic, changes to the underlying source code.
  • Renamed ‘categorical_columns’ argument in build_model() to ‘binary_columns’ to avoid confusion
  • Added plotting arguments to overimputation() method to suppress intermediary overimputation plots (plot_main) and all plots (skip_plot).
  • Changed overimputation() plot titles, labels and legends
  • Added tensorflow 2.0 version check on import
  • Fixed seed-setting bug in earlier versions

Alpha 0.2:

Variational autoencoder enabled. More flexibility in model specification, although defaulting to a simple mirrored system. Deeper analysis tools within .overimpute() for checking fit on continuous values. Constructor code deconflicted. Individual output specification enabled for very large datasets.

Key added features:

  • Variational autoencoder capacity added, including encoding to and sampling from latent space

Planned features:

  • Time dependence handling through recurrent cells
  • Improving the pipeline methods for very large datasets
  • Tensorboard integration
  • Dropout scaling
  • A modified constructor that can generate embeddings for better interpolation of features
  • R support

Wish list:

  • Smoothing for time series (LOESS?)
  • Informative priors?

Alpha 0.1:

  • Basic functionality feature-complete.
  • Support for mixed categorical and continuous data types
  • An “additional data” pipeline, allowing data that may be relevant to the imputation to be included (without being included in error generating statistics)
  • Simplified calibration for model complexity through the “overimputation” function, including visualization of reconstructed features
  • Basic large dataset functionality

midaspy's People

Contributors

david-woroniuk, edvinskis, jackewiebohne, oracen, ranjitlall, tsrobinson


midaspy's Issues

values not imputed

I'm essentially running the demo code, but with my own input data (all numeric data), and the data frames generated by imputer.generate_samples(m=10).output_list still have the same missing values as in the input.

Example input table:

Feature        feat1  feat2  feat3  ...  feat30  feat31  feat32
ERS2551628      65.0    0.0  101.0  ...   105.0   230.0    27.0
SRS143466       43.0    NaN   34.0  ...    98.0     0.0    26.0
SRS023715        0.0   54.0    0.0  ...    33.0    55.0     NaN
SRS580227        0.0    0.0   10.0  ...    67.0    22.0     0.0
DRS091214   327457.0    0.0    NaN  ...     NaN     0.0    24.0
...              ...    ...    ...  ...     ...     ...     ...
ERS2551594      74.0   15.0   21.0  ...    93.0    40.0     0.0
ERS634957        0.0   12.0    0.0  ...     0.0    45.0     0.0
DRS087574        0.0   80.0   43.0  ...   209.0     NaN    12.0
ERS634952       33.0   56.0   11.0  ...     NaN  1032.0     0.0
SRS1820544      49.0  102.0   12.0  ...    13.0    27.0    49.0

...and the output:

Feature        feat1  feat2  feat3  ...  feat30  feat31  feat32
ERS2551628      65.0    0.0  101.0  ...   105.0   230.0    27.0
SRS143466       43.0    NaN   34.0  ...    98.0     0.0    26.0
SRS023715        0.0   54.0    0.0  ...    33.0    55.0     NaN
SRS580227        0.0    0.0   10.0  ...    67.0    22.0     0.0
DRS091214   327457.0    0.0    NaN  ...     NaN     0.0    24.0
...              ...    ...    ...  ...     ...     ...     ...
ERS2551594      74.0   15.0   21.0  ...    93.0    40.0     0.0
ERS634957        0.0   12.0    0.0  ...     0.0    45.0     0.0
DRS087574        0.0   80.0   43.0  ...   209.0     NaN    12.0
ERS634952       33.0   56.0   11.0  ...     NaN  1032.0     0.0
SRS1820544      49.0  102.0   12.0  ...    13.0    27.0    49.0

Any idea on why the missing values are not imputed?

conda env

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_tflow_select             2.3.0                       mkl
absl-py                   0.15.0                   pypi_0    pypi
aiohttp                   3.8.1            py39h3811e60_0    conda-forge
aiosignal                 1.2.0              pyhd8ed1ab_0    conda-forge
astor                     0.8.1              pyh9f0ad1d_0    conda-forge
astunparse                1.6.3              pyhd8ed1ab_0    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attrs                     21.4.0             pyhd8ed1ab_0    conda-forge
blas                      1.1                    openblas    conda-forge
blinker                   1.4                        py_1    conda-forge
brotlipy                  0.7.0           py39h3811e60_1003    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2021.10.26           h06a4308_2
cachetools                4.2.4              pyhd8ed1ab_0    conda-forge
certifi                   2021.10.8        py39hf3d152e_1    conda-forge
cffi                      1.15.0           py39h4bc2ebd_0    conda-forge
charset-normalizer        2.0.9              pyhd8ed1ab_0    conda-forge
click                     8.0.3            py39hf3d152e_1    conda-forge
cryptography              36.0.0           py39h9ce1e76_0
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
dataclasses               0.8                pyhc8e2a94_3    conda-forge
flatbuffers               1.12                     pypi_0    pypi
freetype                  2.11.0               h70c0345_0
frozenlist                1.2.0            py39h3811e60_1    conda-forge
gast                      0.3.3                    pypi_0    pypi
google-auth               1.35.0                   pypi_0    pypi
google-auth-oauthlib      0.4.1                      py_2    conda-forge
google-pasta              0.2.0              pyh8c360ce_0    conda-forge
grpcio                    1.32.0                   pypi_0    pypi
h5py                      2.10.0          nompi_py39h98ba4bc_106    conda-forge
hdf5                      1.10.6          nompi_h3c11f04_101    conda-forge
idna                      3.3                pyhd3eb1b0_0
importlib-metadata        4.10.0           py39hf3d152e_0    conda-forge
jbig                      2.1               h7f98852_2003    conda-forge
joblib                    1.1.0                    pypi_0    pypi
jpeg                      9d                   h516909a_0    conda-forge
keras-preprocessing       1.1.2              pyhd8ed1ab_0    conda-forge
kiwisolver                1.3.2            py39h1a9c180_1    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
lerc                      3.0                  h9c3ff4c_0    conda-forge
libblas                   3.9.0           1_h6e990d7_netlib    conda-forge
libcblas                  3.9.0           3_h893e4fe_netlib    conda-forge
libdeflate                1.8                  h7f98852_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgfortran-ng            7.5.0               h14aa051_19    conda-forge
libgfortran4              7.5.0               h14aa051_19    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
liblapack                 3.9.0           3_h893e4fe_netlib    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.13               h4367d64_0
libpng                    1.6.37               hed695b0_2    conda-forge
libprotobuf               3.19.2               h780b84a_0    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libtiff                   4.3.0                h6f004c6_2    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libwebp-base              1.2.1                h7f98852_0    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
markdown                  3.3.6              pyhd8ed1ab_0    conda-forge
matplotlib                3.3.2                         0    conda-forge
matplotlib-base           3.3.2            py39h98787fa_1    conda-forge
midaspy                   1.2.1                    pypi_0    pypi
multidict                 5.2.0            py39h3811e60_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.19.5                   pypi_0    pypi
oauthlib                  3.1.1              pyhd8ed1ab_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openblas                  0.3.4             h9ac9557_1000    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   3.0.0                h7f98852_2    conda-forge
opt_einsum                3.3.0              pyhd8ed1ab_1    conda-forge
pandas                    1.3.5            py39hde0f152_0    conda-forge
patsy                     0.5.2              pyhd8ed1ab_0    conda-forge
pillow                    8.4.0            py39ha612740_0    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.19.2           py39he80948d_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.8                      py_0
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pyjwt                     2.3.0              pyhd8ed1ab_1    conda-forge
pyopenssl                 21.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.6              pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1            py39hf3d152e_4    conda-forge
python                    3.9.9           h543edf9_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.27.0             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.0              pyh9f0ad1d_0    conda-forge
rsa                       4.8                pyhd8ed1ab_0    conda-forge
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.1            py39hc65b3f8_2
setuptools                60.2.0           py39hf3d152e_0    conda-forge
six                       1.15.0                   pypi_0    pypi
sqlite                    3.37.0               h9cd32fc_0    conda-forge
statsmodels               0.13.1           py39hce5d2b2_0    conda-forge
tensorboard               2.6.0                      py_0
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
tensorflow                2.4.1           mkl_py39h4683426_0
tensorflow-addons         0.15.0                   pypi_0    pypi
tensorflow-base           2.4.1           mkl_py39h43e0292_0
tensorflow-estimator      2.4.0                    pypi_0    pypi
termcolor                 1.1.0                      py_2    conda-forge
threadpoolctl             3.0.0                    pypi_0    pypi
tk                        8.6.11               h27826a3_1    conda-forge
tornado                   6.1              py39h3811e60_2    conda-forge
typeguard                 2.13.3                   pypi_0    pypi
typing-extensions         3.7.4.3                  pypi_0    pypi
tzdata                    2021e                he74cb21_0    conda-forge
urllib3                   1.26.7             pyhd8ed1ab_0    conda-forge
werkzeug                  2.0.2              pyhd3eb1b0_0
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h516909a_1    conda-forge
yarl                      1.7.2            py39h3811e60_1    conda-forge
zipp                      3.6.0              pyhd8ed1ab_0    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
zstd                      1.5.1                ha95c52a_0    conda-forge

Torch/TF2 version

MIDASpy is currently implemented using TF1 logic via compatibility layers. As TF2 matures and more graph-based features become deprecated (see e.g. #21), we will need to plan a larger-scale update of the codebase.

We could try rebuilding natively in TF2, or alternatively pivot to a PyTorch implementation, which has a more "pythonic" feel.

MIDASpy sometimes gets an error

As recommended, I have installed all the packages, but I sometimes get an error message. Interestingly, when I ran exactly this code on another Google Colab account, I got no errors:
!pip install numpy pandas matplotlib statsmodels scipy
!pip install tensorflow==2.11
!pip install "tensorflow-addons<0.20"  # quoted so "<" is not treated as shell redirection
!pip install MIDASpy

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import sys
import MIDASpy as md

/usr/local/lib/python3.10/dist-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.9.0 and strictly below 2.12.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.13.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
warnings.warn(

ImportError Traceback (most recent call last)
in <cell line: 6>()
4 from sklearn.preprocessing import MinMaxScaler
5 import sys
----> 6 import MIDASpy as md

14 frames
/usr/local/lib/python3.10/dist-packages/keras/engine/base_layer_utils.py in <module>
22
23 from keras import backend
---> 24 from keras.dtensor import dtensor_api as dtensor
25 from keras.utils import control_flow_util
26 from keras.utils import tf_inspect

ImportError: cannot import name 'dtensor_api' from 'keras.dtensor' (/usr/local/lib/python3.10/dist-packages/keras/dtensor/__init__.py)

Impute new data using trained model.

Looking at the codebase, I could not locate a function where the trained model can be used to impute new data after training. There seem to be a couple of functions that could be used to do this indirectly, but I am surprised it is not included as a separate function.

How to reverse One hot encoding

Hello,

How do we get the data back in its original form (i.e. reverse the dummies)? We receive the imputed dataset in one-hot encoded form, but how do we convert it back into the original (categorical) dataset?
Thank you
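One workaround sketch, assuming the one-hot columns were produced by md.cat_conv() (so cat_cols_list holds each variable's dummy-column names) and that dummy names follow an "origname_category" pattern:

# 'imputed' is one completed DataFrame from generate_samples()
for col_group in cat_cols_list:
    orig = col_group[0].rsplit("_", 1)[0]  # assumed naming convention
    # pick the dummy with the largest value, then strip the "origname_" prefix
    imputed[orig] = imputed[col_group].idxmax(axis=1).str[len(orig) + 1:]
imputed = imputed.drop(columns=[c for group in cat_cols_list for c in group])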

Compatibility with compositional data

Sometimes we know that a set of variables should add up to a given total. Measurements involving proportions, percentages, probabilities, concentrations are compositional data. These data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.

The complication with compositional data is that the features are inherently mathematically related, leading to spurious correlation coefficients when conventional statistical or ML approaches are applied (e.g., calculating Euclidean distance metrics). However, the use of K-L distance is potentially a way to avoid this issue, so MIDAS might offer a nice deep-learning solution to imputation problems involving compositional data.

However, in some preliminary experiments using classic compositional-data imputation datasets, MIDASpy hasn't performed as well as I expected, and I was wondering if you'd be able to comment.

For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset, and compared the known vs imputed samples against each other. You can see a marked linear trend to the imputed values.

If you are interested in taking a look, here is a recent paper which references the Kola datasets, along with a copy of the data:
Paper and two datasets

VAE deprecation warning from tf.distributions

Running MIDAS using VAE leads to deprecation warning re. tf.compat.v1.distributions.

E.g.

>>> tf.compat.v1.distributions.Normal()
WARNING:tensorflow:From <stdin>:1: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.

Migrating the affected code to tfp.distributions is not straightforward, as tfp.distributions is not designed for TF1's graph-oriented model. We should investigate solutions to safeguard the codebase in the medium term.

UnboundLocalError: local variable 'train_rng' referenced before assignment

If no seed is given when initialising the Midas object, then no seed is passed to Midas.train_model(), so the variable train_rng is left unassigned (line 748); this causes an error on line 759, where a value for train_rng is expected.

I suspect the same issue will arise anywhere if self.seed is not None: is used without a corresponding else statement (e.g. line 1184 in Midas.over_impute()).

I suspect this can be fixed by simply adding an else statement that generates a random seed and uses it to assign a value to train_rng, as sketched below.
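A minimal sketch of that fix (the RNG constructor shown is an assumption; the codebase may seed differently):

if self.seed is not None:
    train_rng = np.random.default_rng(self.seed)
else:
    # added else branch: fall back to an unseeded generator so train_rng is always bound
    train_rng = np.random.default_rng()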

Interpreter settings:
Python 3.9

numpy~=1.22.1
pandas~=1.3.5

scipy==1.8.0
matplotlib~=3.5.1
scikit-learn~=1.0.1
tensorflow==2.8.0
keras~=2.6.0
graphviz~=0.19
MIDASpy~=1.2.1
statsmodels~=0.13.2

Heuristics on choosing a model structure

Hi,

I was wondering whether there are any heuristics for choosing a model structure for different types/sizes of datasets. For instance, if I had a standard corporate dataset with 20,000 rows and 15 columns, are there any sure-fire methods/parameters I should be using? Are there any clear do's and don'ts in certain situations?

Overimpute legend

Related to #7, shift legend to below plotting area.

Need to account for clipping of the legend when saving, and for varying numbers of legend items depending on the input data.

Use of isinstance instead of type

Firstly, a great package.

I noticed that the package uses if type(var) == float:, and thought it may be useful to modify the behaviour to be more Pythonic.

To summarise, isinstance caters for inheritance (where an instance of a derived class is an instance of a base class), while checking for equality of type does not. This instead demands identity of types and rejects instances of subclasses.

Typical Python code should support inheritance, so isinstance is preferable to checking type equality. Strictly speaking, "duck typing" (try/except, catching the exceptions associated with an incorrect type, i.e. TypeError) would be even more idiomatic.

I refer to lines 142-153, whereby the list type is evaluated:

    if type(layer_structure) == list:
      self.layer_structure = layer_structure
    else:
      raise ValueError("Layer structure must be specified within a list")

which could be achieved more elegantly using:

if not isinstance(layer_structure, list):
    raise TypeError("Layer structure must be specified within a list.")

181-187:

    if weight_decay == 'default':
      self.weight_decay = 'default'
    elif type(weight_decay) == float:
      self.weight_decay = weight_decay
    else:
      raise ValueError("Weight decay argument accepts either 'standard' (string) "\
                       "or floating point")

whereby the accepted type (or types) could be hinted to the user within the __init__ method, and evaluated through:

if isinstance(weight_decay, str):
    if weight_decay != 'default':
        raise ValueError("A warning that the value must be 'default' or a float type")
    self.weight_decay = weight_decay
elif isinstance(weight_decay, float):
    self.weight_decay = weight_decay

Depending on the python versions supported, I would also recommend using typehints, and using the below:

from typing import List

abc_var: List[int]

More than happy to submit a PR with the proposed changes.

Pyplot rewrite

It would be useful to simplify the pyplot usage in the overimpute function to remove interactive plotting -- the ideal behaviour is a single plot at the end of imputation, using the "agg" backend if possible.
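For reference, a sketch of the standard way to force non-interactive rendering in Matplotlib (not MIDASpy-specific):

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... draw the single end-of-imputation summary plot here ...
fig.savefig("overimpute_summary.png")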

Results are not perfectly consistent

I have tried running the Python example notebook and noticed that the final loss changes slightly from run to run (e.g., from 73446.1 to 73355.3) despite setting the same seed. Does this have to do with unaccounted-for randomness in the algorithm, or is it just rounding?

Another question: does it generally make a difference to scale the continuous data before inputting it to the algorithm? I assumed the answer is no because it's done internally anyway; however, I noticed that in the R example the data was explicitly scaled but not in the Python example.

Optimizing MIDAS on very large/complex datasets

In very large datasets (~30,000 samples x 1,000,000 features) with complex relationships (e.g. cancer omics data), MIDAS can take a very long time to run (days?), even on a single GPU. I would nevertheless like to take advantage of the 'overimpute' feature for hyperparameter tuning, which is prohibitive since this very useful feature runs the algorithm multiple times to evaluate various settings.

Would random downsampling of samples (columns) and/or features (rows) generalize the optimal hyperparameters to the larger dataset? For instance, a random subset of 500-1,000 samples with 5,000-10,000 features. This would be specifically to determine the optimal number of nodes and layers, the learning rate, and the number of training epochs. I would think batch size (which can speed up training) is a function of dataset size, so it would not generalize.

Any help would be great

Minimum and maximum value arguments (constraints)

I'm working with Dirichlet distributions and the compositional data simplex, and am really enjoying MIDASpy's flexibility when dealing with this data (related to K-L divergence in the decoder). However, there is a tendency to produce negative values in the numerical feature data I have been using.

In the case of compositional data, there is a hard minimum of zero. Other imputation approaches allow setting maximum and minimum value arguments (e.g., scikit-learn), and importantly these can be set per feature (autoimpute). Is this an argument that could be added to the package? It would be a major help to people working in several disciplines.
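Until such arguments exist, one post-hoc workaround is to clip the completed datasets with standard pandas (a sketch; per-feature bounds could be passed as a Series instead of a scalar):

imputations = imputer.generate_samples(m=10).output_list
imputations = [df.clip(lower=0) for df in imputations]  # enforce the zero minimum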

Error with multiple GPUs: Do not use tf.reset_default_graph() to clear nested graphs

I am trying to utilize two GPUs with MIDASpy. However, I get the following error during set-up:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md

data_0 = pd.read_csv('/home/comp/Documents/file.txt', sep = "\t")
data_0.columns = data_0.columns.str.strip()  # assign the result; str.strip() is not in-place

data_0 = data_0.set_index('Unnamed: 0')
data_0.index.names = [None]

np.random.seed(441)

na_loc = data_0.isnull()
data_0[na_loc] = np.nan

imputer = md.Midas(layer_structure= [256, 256, 256],
                   learn_rate= 1e-4,
                   input_drop= 0.9,
                   train_batch = 50,
                   savepath= '/home/comp/Documents/save',
                   seed= 89)

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    imputer.build_model(data_0)

AssertionError: Do not use tf.reset_default_graph() to clear nested graphs. If you need a cleared graph, exit the nesting and create a new graph.

Deprecation warnings to fix

Getting the following warning as part of training cycle:

FutureWarning: Passing a dict as an indexer is deprecated and will raise in a future version. Use a list instead.
  data_1 = data[subset]

We should update to future-proof asap.
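The likely fix is to index with an explicit list of keys rather than the dict itself (a sketch; 'subset' is whatever dict the training loop currently uses):

# deprecated: indexing a DataFrame with a dict
data_1 = data[subset]

# future-proof: pass the dict's keys as a list
data_1 = data[list(subset)]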

Improve TensorFlow 2.X compatibility

Current behaviour allows MIDASpy to be loaded under TF 2.X, but logs an error informing users that imputation is only possible in TF 1.X.

Looks like all TF1 components can be updated to TF 2.X -- this just requires an additional tensorflow-addons package dependency for the AdamW optimiser.
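For reference, the tensorflow-addons optimiser mentioned above (a minimal sketch; the hyperparameter values are illustrative):

import tensorflow_addons as tfa

optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)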

Train data

When I try to train on the "adult" data, this message shows up:

Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Imputation target contains no missing values. Please ensure missing values are encoded as type np.nan

I tried to replace the missing values with np.nan, but the same message came up.
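For what it's worth, the adult dataset commonly encodes missing entries as the string "?", which MIDASpy will not detect as missing; a conversion sketch (file name hypothetical):

import numpy as np
import pandas as pd

data = pd.read_csv("adult.csv")
data = data.replace("?", np.nan)  # MIDASpy requires missing cells to be np.nan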
