mitmedialab / sherlock-project

This repository provides data and scripts to use Sherlock, a DL-based model for semantic data type detection: https://sherlock.media.mit.edu.

Home Page: https://sherlock.media.mit.edu

License: MIT License

Jupyter Notebook 69.89% Python 30.11%

tables semantic-table-interpretation deep-learning semantic-type-detection

sherlock-project's Introduction

Sherlock: code, data, and trained model.

Sherlock is a deep-learning approach to semantic data type detection, i.e. labeling tables with column types such as name, address, etc. This is helpful for, among others, data validation, processing and integration. This repository provides data and code to guide usage of Sherlock, retraining the model, and replication of results. Visit https://sherlock.media.mit.edu for more background on this project.

Installation of package

You can install Sherlock by cloning this repository and running pip install . from its root directory.
Install the dependencies using pip install -r requirements.txt (or requirements38.txt, depending on your Python version).

Demonstration of usage

The 00-use-sherlock-out-of-the-box.ipynb notebook demonstrates usage of the readily trained model for a given table.

The notebooks 01-data-preprocessing.ipynb and 02-1-train-and-test-sherlock.ipynb in notebooks/ can be used to reproduce the results and demonstrate the usage of Sherlock (from data preprocessing to model training and evaluation).

Data

The raw data (corresponding to annotated table columns) can be downloaded using the download_data() function in the helpers module. This downloads roughly 500MB of data into the data directory. Use the 01-data-preprocessing.ipynb notebook to preprocess this data. Each column is then represented by a feature vector of dimensions 1x1588. The extracted features per column are based on "paragraph" embeddings (full column), word embeddings (aggregated from each column cell), character count statistics (e.g. average number of "." in a column's cells) and column-level statistics (e.g. column entropy).
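To make the last two feature families concrete, here is a minimal sketch of one character count statistic and one column-level statistic. It is not the code in sherlock/features; the function names mean_char_count and column_entropy are hypothetical and only illustrate the kind of quantity described above.

```python
import math
from collections import Counter

def mean_char_count(cells, char):
    """Average number of occurrences of `char` per cell of a column."""
    if not cells:
        return 0.0
    return sum(cell.count(char) for cell in cells) / len(cells)

def column_entropy(cells):
    """Shannon entropy (in bits) of the distribution of cell values."""
    if not cells:
        return 0.0
    total = len(cells)
    counts = Counter(cells)
    # p * log2(1 / p) summed over the distinct values of the column
    return sum((n / total) * math.log2(total / n) for n in counts.values())

column = ["a.b", "c.d.e", "f", "f"]
print(mean_char_count(column, "."))  # 0.75
print(column_entropy(column))        # 1.5
```

A column with a single repeated value has entropy 0, while a column of all-distinct values reaches the maximum of log2(number of cells).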

The Sherlock model

The SherlockModel class is specified in the sherlock.deploy.model module. The model is a multi-input neural network: it defines a separate subnetwork for each feature set (e.g. the word embedding features), concatenates their outputs, and finally adds a few shared layers. Interaction with the model follows the scikit-learn interface, with methods fit, predict and predict_proba.
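The data flow of such a multi-input network can be sketched in plain Python. This is an illustration under stated assumptions, not the SherlockModel implementation: the layer sizes, the random untrained weights, the three toy classes and all helper names are made up for the example.

```python
import math
import random

def random_layer(rng, n_in, n_out):
    """Random weights and zero biases; illustration only, nothing is trained."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, subnets, shared, output):
    # 1. Run each feature set through its own subnetwork.
    # 2. Concatenate the subnetwork outputs into one vector.
    merged = []
    for name, layer in subnets.items():
        merged.extend(relu(linear(features[name], layer)))
    # 3. Apply the shared layer, then a softmax over the classes.
    hidden = relu(linear(merged, shared))
    return softmax(linear(hidden, output))

rng = random.Random(0)
sizes = {"char": 6, "word": 4, "par": 5}   # toy input widths (assumed)
subnets = {name: random_layer(rng, n, 3) for name, n in sizes.items()}
shared = random_layer(rng, 3 * len(sizes), 4)
output = random_layer(rng, 4, 3)           # 3 toy classes

column = {name: [rng.random() for _ in range(n)] for name, n in sizes.items()}
probs = predict_proba(column, subnets, shared, output)
print(len(probs), round(sum(probs), 6))  # 3 1.0
```

Keeping one subnetwork per feature set lets each input be compressed separately before the shared layers see the combined representation.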

Making predictions

The originally trained SherlockModel can be used for generating predictions for a dataset. First, extract features using the features.preprocessing module. The original weights of Sherlock are provided in the repository in the model_files directory and can be loaded using the initialize_model_from_json method of the model. The procedure for making predictions (on the data) is demonstrated in the 02-1-train-and-test-sherlock.ipynb notebook.

Retraining Sherlock

The notebook 02-1-train-and-test-sherlock.ipynb also illustrates how Sherlock can be retrained. The model infers the number of unique classes from the training labels. The exception is a model loaded from a JSON file: in that case the number of classes is 78.
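What inferring the number of classes from the labels amounts to can be sketched as follows. The helper encode_labels is hypothetical and is not part of the sherlock package; it only shows that the width of the one-hot targets, and hence of the output layer, follows from the distinct labels in the training set.

```python
def encode_labels(labels):
    """Map string labels to one-hot rows whose width is the number of unique labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    one_hot = [
        [1 if index[label] == j else 0 for j in range(len(classes))]
        for label in labels
    ]
    return classes, one_hot

y_train = ["name", "address", "name", "city"]
classes, one_hot = encode_labels(y_train)
print(len(classes))  # 3
print(one_hot[0])    # [0, 0, 1]
```

With a model loaded from a JSON specification the output width is already fixed, so targets encoded this way only fit if the training labels cover exactly the same set of classes.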

Citing this work

To cite this work, please use the BibTeX entry below:

@inproceedings{Hulsebos:2019:SDL:3292500.3330993,
 author = {Hulsebos, Madelon and Hu, Kevin and Bakker, Michiel and Zgraggen, Emanuel and Satyanarayan, Arvind and Kraska, Tim and Demiralp, {\c{C}}a{\u{g}}atay and Hidalgo, C{\'e}sar},
 title = {Sherlock: A Deep Learning Approach to Semantic Data Type Detection},
 booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
 year={2019},
 publisher = {ACM},
}

Project structure

├── data   <- Placeholder directory to download data into.

├── docs   <- Files for https://sherlock.media.mit.edu landing page.

├── model_files  <- Files with trained model weights and specification.
    ├── sherlock_model.json
    └── sherlock_weights.h5

├── notebooks   <- Notebooks demonstrating data preprocessing and train/test of Sherlock.
    └── 00-use-sherlock-out-of-the-box.ipynb
    └── 01-data-preprocessing.ipynb
    └── 02-1-train-and-test-sherlock.ipynb
    └── 02-2-train-and-test-sherlock-rf-ensemble.ipynb
    └── 03-train-paragraph-vector-features-optional.ipynb

├── sherlock  <- Package.
    ├── deploy  <- Code for (re)training Sherlock, as well as model specification.
        └── helpers.py
        └── model.py
    ├── features     <- Files to turn raw data, storing raw data columns, into features.
        ├── feature_column_identifiers   <- Directory with feature names categorized by feature set.
        └── bag_of_characters.py
        └── bag_of_words.py
        └── par_vec_trained_400.pkl
        └── paragraph_vectors.py
        └── preprocessing.py
        └── word_embeddings.py
    ├── helpers.py     <- Supportive modules.

sherlock-project's Issues

gensim version

Hi, I'm using gensim 3.8.0 but still get this error when I run extract_features().

'Doc2Vec' object has no attribute 'neg_labels'

Is there a way to avoid this?

Julio reyes

Tensorflow not included in requirements

In the Python 3.8 requirements file, Tensorflow is not included. Furthermore, it doesn't appear that there's a version of Tensorflow that is compatible with all the other packages that are listed there. Switching to something like Pipenv or Poetry would likely make such things much easier to manage.

Attempting to reproduce results: feature vector normalization

I've been able to run custom data through your build_features function (with quite a few code changes), and I have the 1588 features that you have in your paper. However, many of these features need normalization, e.g. n_[a]-agg-sum: 2081. You don't describe how you normalized in your paper, you just say:

At a high-level, we train subnetworks for each feature category except the statistical features, which consist of only 27 features. These subnetworks “compress” input features to an output of fixed dimension.

Your results are not possible to reproduce without knowledge of how you normalized your values. Any pointers?

`extract_features()` function unable to find '../sherlock/features/par_vec_trained_400.pkl'

Preparing feature extraction by downloading 2 files:
        
 ../sherlock/features/glove.6B.50d.txt and 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy.
        
All files for extracting word and paragraph embeddings are present.
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-47-f863e274c8b1> in <module>
      1 y_test_subset = y_test[:100]
----> 2 X_test = extract_features(test_samples_converted.head(n=100))

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/sherlock/features/preprocessing.py in extract_features(data)
    158         features_list.append(f)
    159 
--> 160         df_par = df_par.append(infer_paragraph_embeddings_features(raw_sample, vec_dim))
    161 
    162     return pd.concat(

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/sherlock/features/paragraph_vectors.py in infer_paragraph_embeddings_features(data, dim)
     37 
     38     # Load pretrained paragraph vector model
---> 39     model = Doc2Vec.load('../sherlock/features/par_vec_trained_{}.pkl'.format(dim))
     40 
     41     f = pd.DataFrame()

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/gensim/models/doc2vec.py in load(cls, *args, **kwargs)
    793         """
    794         try:
--> 795             return super(Doc2Vec, cls).load(*args, rethrow=True, **kwargs)
    796         except AttributeError as ae:
    797             logger.error(

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/gensim/models/word2vec.py in load(cls, rethrow, *args, **kwargs)
   1920         """
   1921         try:
-> 1922             model = super(Word2Vec, cls).load(*args, **kwargs)
   1923             if not isinstance(model, Word2Vec):
   1924                 rethrow = True

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/gensim/utils.py in load(cls, fname, mmap)
    484         compress, subname = SaveLoad._adapt_by_suffix(fname)
    485 
--> 486         obj = unpickle(fname)
    487         obj._load_specials(fname, mmap, compress, subname)
    488         obj.add_lifecycle_event("loaded", fname=fname)

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/gensim/utils.py in unpickle(fname)
   1455 
   1456     """
-> 1457     with open(fname, 'rb') as f:
   1458         return _pickle.load(f, encoding='latin1')  # needed because loading from S3 doesn't support readline()
   1459 

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/smart_open/smart_open_lib.py in open(uri, mode, buffering, encoding, errors, newline, closefd, opener, ignore_ext, transport_params)
    172         transport_params = {}
    173 
--> 174     fobj = _shortcut_open(
    175         uri,
    176         mode,

~/anaconda3/envs/content-automation-env/lib/python3.9/site-packages/smart_open/smart_open_lib.py in _shortcut_open(uri, mode, ignore_ext, buffering, encoding, errors, newline)
    344         open_kwargs['errors'] = errors
    345 
--> 346     return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
    347 
    348 

FileNotFoundError: [Errno 2] No such file or directory: '../sherlock/features/par_vec_trained_400.pkl'

I have the following files in sherlock/features folder:

What am I doing wrong?

Parquet files invalid

I get errors reading the Parquet files produced by Sherlock when using certain other tools to read them. It turns out that some of the characters used are invalid as part of a column name. Perhaps character features could simply use their ASCII value instead of the character itself?

KeyError: "['n_[^]-agg-skewness', 'n_[\\\\]-agg-all', 'n_[\\\\]-agg-max', 'n_[\\\\]-agg-mean', 'n_[^]-agg-median', 'n_[\\\\]-agg-var', 'n_[^]-agg-all', 'n_[^]-agg-sum', 'n_[\\\\]-agg-median', 'n_[^]-agg-mean', 'n_[^]-agg-any', 'n_[\\\\]-agg-min', 'n_[^]-agg-min', 'n_[\\\\]-agg-any', 'n_[^]-agg-max', 'n_[\\\\]-agg-skewness', 'n_[\\\\]-agg-sum', 'n_[\\\\]-agg-kurtosis', 'n_[^]-agg-var', 'n_[^]-agg-kurtosis'] not in index"
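The renaming suggested above could look like the sketch below. It assumes character features are named n_[<char>]-agg-<statistic>, as in the KeyError; the helper ascii_safe_name and the example name some_other_feature are hypothetical and not part of Sherlock.

```python
import re

# Matches names such as "n_[^]-agg-sum": one literal character between brackets.
_CHAR_FEATURE = re.compile(r"^n_\[(.)\](-agg-.+)$", re.DOTALL)

def ascii_safe_name(column):
    """Replace the literal character in a feature name by its code point."""
    match = _CHAR_FEATURE.match(column)
    if match is None:
        return column  # leave all other feature names untouched
    char, suffix = match.groups()
    return f"n_[{ord(char)}]{suffix}"

print(ascii_safe_name("n_[^]-agg-sum"))       # n_[94]-agg-sum
print(ascii_safe_name("n_[\\]-agg-max"))      # n_[92]-agg-max
print(ascii_safe_name("some_other_feature"))  # some_other_feature
```

The same mapping would have to be applied both when writing the features and when the model selects its input columns, otherwise the lookup fails in the same way as above.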

Running notebook fails with "ValueError: numpy.ndarray size changed..."

Steps to reproduce

Installing the requirements using pip install -r requirements.txt
Run the 00-use-sherlock-out-of-the-box.ipynb notebook

Actual result

The notebook execution fails with
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
This issue seems similar to piskvorky/gensim#3097

Expected result

Notebook execution should run smoothly

How to change the number of epochs ?

I have tried changing the value of epochs in model.py; however, it still runs for 100 epochs.

Is the data in sherlock/data/raw the full dataset?

This repository has no data in /data. However, the previous repository does, at https://github.com/madelonhulsebos/sherlock/tree/master/data/raw. Is that the full dataset used to train the model? The paper describes the dataset as the following:

Then, we use exact matching between semantic types and column headers to extract 686,765 data columns from the VizNet corpus [14], a large-scale repository of real world datasets collected from the web, popular visualization systems, and open data portals.

Seeing as how test_values.csv only contains 1,000 columns, I assume this is only a very small sample of the full dataset, and that the full dataset still needs to be compiled if we are to reproduce your results. Is that right?

Varsha

Why is the accuracy of the pretrained model only 56% on raw data, while it is 89% on preprocessed data?

List of semantic data types detected by Sherlock

Hi! I'm building a public registry of semantic data types, similar to PRONOM for data formats.

Is there any document or code file with a list of the semantic data types supported by Sherlock? I would like to include it in the registry, and this registry will probably be helpful for your project too.

How to train Sherlock?

I see that the jupyter notebook, "demo_usage_sherlock.ipynb" is included to be able to see how to retrain the model. However, in the jupyter notebook, when "train_sherlock(X_train, y_train, X_val, y_val, nn_id='retrained_sherlock');" is called, none of the parameters that are being passed in have been declared before this, therefore the function obviously fails. I've tried several ways to create these missing variables for X_train, y_train, X_val, and y_val, but keep getting errors. I was wondering how you create these X_train, y_train, X_val, and y_val to pass into the train_sherlock() function.

Number of Epochs are not changing

I updated epochs to 20 and early_stopping to categorical crossentropy in the model.py file, but when training the Sherlock model from scratch it still runs for 100 epochs, even though I imported the Sherlock model file after making this change. Please help me solve this as soon as possible.

Generation of paragraph vector files

Hi,

Can you please help me understand how you generate the first, second and third paragraph vector numpy files mentioned in the 'preprocessing.py' file?

par_vec_trained_400.pkl.trainables.syn1neg.npy
par_vec_trained_400.pkl.docvecs.vectors_docs.npy
par_vec_trained_400.pkl.wv.vectors.npy

Feature preparation is not optimized (word_embeddings)

Feature preparation is slow and not optimized in the "word_embeddings" step

Is the model in "src/model" already trained?

Hi. I am really excited to see the new approach to detect semantic datatypes using deep learning :)
I wonder if the model provided in "src/model" is already trained using the 686,765 columns described in the paper.

Rules for selecting columns

Hi,

Thank you for developing this wonderful project and making it publicly accessible! Would it be possible to have access to the code that you used to select columns from the Web Tables corpus? You briefly mentioned the process in your paper:

"We then match data columns from VizNet that have headers corresponding to our 275 types. To accomodate variation in casing and formatting, single word types matched case-altered modifications (e.g., name = Name = NAME) and multi-word types included concatenations of constituent words (e.g., release date = releaseDate)."

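The matching rule quoted above can be sketched as follows. This is an assumption about how such matching might be done, not the authors' selection code: normalize_header and match_type are hypothetical helpers, and only case changes and whitespace concatenation are covered.

```python
import re

def normalize_header(header):
    """Lower-case and drop whitespace so 'releaseDate' and 'release date' collide."""
    return re.sub(r"\s+", "", header).lower()

def match_type(header, semantic_types):
    """Return the semantic type whose normalized form equals the header's, else None."""
    lookup = {normalize_header(t): t for t in semantic_types}
    return lookup.get(normalize_header(header))

types = ["name", "release date"]
print(match_type("NAME", types))         # name
print(match_type("releaseDate", types))  # release date
print(match_type("surname", types))      # None
```

Because the comparison is exact after normalization, headers such as "surname" or "first name" do not match the type "name".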
`sherlock` model gives strange results

When trying predictions using the sherlock model, I get everything in the sample data predicted as weight or symbol. Using retrain_minimal_sample gives more reasonable results. It would be helpful to have an explanation of what these two models are.

licence?

What licence is this released under? The MIT licence?

KeyError when running model.predict(X_test) in 02-1-train-and-test-sherlock.ipynb

Hello!

I am trying to use the pre-built 'sherlock' model to make predictions. As suggested in the readme, I have run some of the cells in the 02-1-train-and-test-sherlock.ipynb file but get a KeyError when model.predict(X_test) is run.

Code to Reproduce:

model_id = 'sherlock'

from ast import literal_eval
from collections import Counter
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn.metrics import f1_score, classification_report

from sherlock.deploy.model import SherlockModel

start = datetime.now()
print(f'Started at {start}')

X_test = pd.read_parquet('../data/processed/X_test.parquet')
y_test = pd.read_parquet('../data/raw/test_labels.parquet').values.flatten()

y_test = np.array([x.lower() for x in y_test])

print(f'Finished at {datetime.now()}, took {datetime.now() - start} seconds')

start = datetime.now()
print(f'Started at {start}')

model = SherlockModel();
model.initialize_model_from_json(with_weights=True, model_id="sherlock");

print('Initialized model.')
print(f'Finished at {datetime.now()}, took {datetime.now() - start} seconds')

predicted_labels = model.predict(X_test)
predicted_labels = np.array([x.lower() for x in predicted_labels])

When model.predict(X_test) is run the following KeyError occurs:

KeyError                                  Traceback (most recent call last)
/var/folders/66/cbb21km104n7d7t9qf61q8rmrsjdc8/T/ipykernel_21846/2316637303.py in <module>
----> 1 predicted_labels = model.predict(X_test)
      2 predicted_labels = np.array([x.lower() for x in predicted_labels])

~/ebsco_repos/sherlock-project/sherlock/deploy/model.py in predict(self, X, model_id)
    118         Array with predictions for X.
    119         """
--> 120         y_pred = self.predict_proba(X, model_id)
    121         y_pred_classes = helpers._proba_to_classes(y_pred, model_id)
    122 

~/ebsco_repos/sherlock-project/sherlock/deploy/model.py in predict_proba(self, X, model_id)
    141         y_pred = self.model.predict(
    142             [
--> 143                 X[feature_cols_dict["char"]].values,
    144                 X[feature_cols_dict["word"]].values,
    145                 X[feature_cols_dict["par"]].values,

KeyError: "['n_[^]-agg-sum', 'n_[^]-agg-max', 'n_[\\\\]-agg-kurtosis', 'n_[^]-agg-var', 'n_[\\\\]-agg-median', 'n_[^]-agg-kurtosis', 'n_[\\\\]-agg-mean', 'n_[\\\\]-agg-all', 'n_[^]-agg-min', 'n_[\\\\]-agg-sum', 'n_[^]-agg-median', 'n_[^]-agg-mean', 'n_[^]-agg-all', 'n_[\\\\]-agg-min', 'n_[\\\\]-agg-max', 'n_[^]-agg-any', 'n_[\\\\]-agg-var', 'n_[\\\\]-agg-any', 'n_[^]-agg-skewness', 'n_[\\\\]-agg-skewness'] not in index"

Is there something that I am missing or need to do prior to running the above code?

Appreciate the help!

How do we use our custom-trained par_vec_trained_400.pkl to train our own model?

I have created the paragraph vectors as described in the notebook, but how do we use them for data preprocessing? prepare_feature_extraction() still downloads the pretrained paragraph vectors (3 files).

nn_id=retrain_sherlock does not generate loss value

Hi, I am trying to use the Sherlock model for schematic field type prediction. I tried both large and small amounts of data, but whenever I train the model with nn_id=retrain_sherlock it runs for only 5 of the 100 epochs and then stops, and the loss value is NaN from the first epoch. The prediction for every field type is email. I tried different things to solve the issue, such as checking for null values, but nothing worked and I still face the same error. Please let me know if I am missing something and help me resolve this issue. Thanks

improved F1 score and pre-processing performance

Hello,

Firstly, thank you for this great project - I learned a lot from this :-)

I have a (substantially) re-written version of this code in a private fork for which I achieved the following results:

F1 score > 90
Pre-processing time for all data completes in 25 mins (on a 2013 MBP)

If you're interested, then it would be good to contribute these changes back but I don't want to dump a huge PR on you, so we might need to work out an approach that reintroduces the changes in stages.

Cheers,

Chris.

How to test on new data

Thanks for creating such an amazing project. I am able to replicate the output using the dataset shared by you.

However, if I now want to try it on my own dataset, how do I do it? What is the process for converting my raw dataset into the preprocessed form? From the code and comments, I understand that the code below takes X_test in the preprocessed form and not the raw form.
predicted_labels = predict_sherlock(X_test, nn_id='retrain_sherlock')

Wrong predictions while testing new data

I have trained a Sherlock model and it performs well on test data. But when I test the model by passing data to it as in the 'Sherlock-out-of-the-box' notebook, it gives wrong predictions (even passing the training data in the same way results in wrong predictions). Does a separate approach need to be taken for testing the data?
Note: I created my own paragraph vectors from my data and used them for training the Sherlock model as well.

Gensim Doc2Vec Error when loading para_vec_trained_400.pkl

Getting a numpy error when loading the model.

ValueError: cannot reshape array of size 28232988 into shape (412059,400)

Any suggestions to fix this issue?

Cannot reproduce already processed features

Hi, thanks for open-sourcing this tool :)

I tried to reproduce the processed/X_test.parquet dataframes from the raw/test_values.parquet values.
Unfortunately, I was not able to retrieve the same results using the extract_features function.

As we can see in the screenshot below, we don't get the same features at all, as if the dataset used to create the processed/X_test.parquet was not the raw/test_values.parquet one?

I attach a pdf file of my simple notebook: demo_usage_sherlock.pdf

Maybe I'm not using the provided data the right way, I'm not sure.

P.S: I'm using more recent versions of python dependencies than the ones in the requirements.txt file but I don't think the issue comes from here.

[Question] Get embeddings for each column

Hi, how can I get embeddings instead of predictions for each column? Thanks!

Nan probabilities prediction on datasets with (almost) constant data

When training a new model or even using the pretrained one, trying to obtain predictions leads to NaN for all probabilities.

This strange behavior was firstly observed during predictions of specific semantic data types where many labels had a bias towards the first defined label. Digging deeper, when using predict_proba, a full set of nan probabilities was observed. I believe this is a bug.

Digging deeper, I found out that probably skewness & kurtosis for character level statistics are having nan as actual values. As these metrics have the standard deviation in the denominator of calculations this is a valid concern and issue.

This can be fixed in the code by adding fixed min/max values for computational reasons but I believe that this is something that has to be also taken into account when deriving complex features from metrics. This issue is not described I believe in the corresponding paper (https://arxiv.org/pdf/1905.10688.pdf) and probably it is an edge case that was missed by the authors.
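The zero-variance case described above can be reproduced with a small sketch. It is written from the textbook definition of skewness and is not the feature code in Sherlock; safe_skewness and its default of 0.0 are one possible guard, assumed here for illustration.

```python
import math

def skewness(values):
    """Population skewness; NaN when the variance is zero (the 0/0 case)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return float("nan")
    return m3 / m2 ** 1.5

def safe_skewness(values, default=0.0):
    """Same statistic, with a fallback value for constant inputs."""
    result = skewness(values)
    return default if math.isnan(result) else result

constant_counts = [2, 2, 2]  # e.g. the same character count in every cell
print(skewness(constant_counts))       # nan
print(safe_skewness(constant_counts))  # 0.0
```

A single NaN in the input vector is enough to turn every output probability of a neural network into NaN, which would be consistent with the behavior reported here.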

This may also be the root cause behind issue#47 (#47).

Thanks a lot for the great model, OS code and contributions.

Below is an example of the aforementioned behavior, obtained just by changing the examples in the provided example notebooks. This is the minimum reproducible example.
https://gist.github.com/stranger-codebits/6074b5fe2d02ac9db9f2750dbad9a24f

Where to find the global statistics features?

The paper mentions that it uses 'global statistics' features. I looked at the code and found only other text features, but cannot find the global statistics features.

question - general id columns

Is there a means to identify general id columns using sherlock ?

Retrain sherlock

Hello,

Thank you for this project!

In the notebook, I read:

Retrain sherlock

The model can be retrained using the code below. The model is currently restricted to be trained on 78 classes, please check the README for more details and a work-around.

Could you please advise where this workaround is?
I can't find it in the README, except that:

To retrain Sherlock, you are currently restricted to using 78 classes to comply with the original model architecture.

I don't really understand if I can re-train sherlock with my own training data?

I tried, and I have an issue, but maybe not related...
I share it just in case.

Thanks.

File "retrain.py", line 40, in
train_sherlock(X_train_feature, Y_train, X_val_feature, Y_val, nn_id='dwc_sherlock');
File "/workspace/ML/sherlock/sherlock/deploy/train_sherlock.py", line 120, in train_sherlock
callbacks=callbacks, epochs=100, batch_size=256
File "/home/sylmorin/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 709, in fit
shuffle=shuffle)
File "/home/sylmorin/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 2692, in _standardize_user_data
y, self._feed_loss_fns, feed_output_shapes)
File "/home/sylmorin/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_utils.py", line 549, in check_loss_and_target_compatibility
' while using as loss ' + loss_name + '. '
ValueError: A target array with shape (2, 2) was passed for an output of shape (None, 78) while using as loss categorical_crossentropy. This loss expects targets to have the same shape as the output.

Any scope for re-training with new classes outside pre-trained 78 classes?

Thank you so much for this wonderful work.

Do you guys have any plan for re-training the model with new classes apart from 78 classes?

Do you guys have any plan for providing an option for changing the word-embedding technique?

275 DBpedia properties??

Hi, Thank you for sharing your great work. I am currently interested in 275 DBpedia properties and am trying to extract them from T2Dv2 Gold Standard data. However, as far as I examined, there is no extraction method that shows exactly 275 different properties. For example, if I try to extract the dbpedia URLs from all the files in the property folder, it ends up with 118 types. Please let me know how you extracted them.

Multiprocessing issue with python 3.7.9 in Windows

On Windows the default multiprocessing context is spawn; fork is not even an option. I am getting a "model not defined" error in extract_features_to_csv(X_test_filename_csv, values) while running 01-data-preprocessing.ipynb.

Exact error is :
_check_not_importing_main()
File "C:\Users\costrategix\AppData\Local\Programs\Python\Python37\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

unable to import 'smart_open.gcs', disabling that module
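The idiom the error message refers to looks like this in a standalone script. It is a generic sketch of the standard-library pattern, not a patch for the notebook; column_length is a hypothetical stand-in for the per-column feature extraction.

```python
import multiprocessing as mp

def column_length(column):
    """Stand-in for per-column work; must live at module level to be picklable."""
    return len(column)

def main():
    columns = [["a", "b"], ["c"], ["d", "e", "f"]]
    # Worker processes are only created inside main(), never at import time.
    with mp.Pool(processes=2) as pool:
        print(pool.map(column_length, columns))

if __name__ == "__main__":
    # Under the spawn start method each child re-imports this module;
    # the guard keeps the children from starting pools of their own.
    main()
```

Under spawn, anything a worker needs (such as a loaded model) has to be created or loaded inside the worker process, since module-level state set up in the parent at run time is not inherited as it is with fork.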

Doubt about Impute Nan values

I am referring to the code in 01-data-preprocessing.ipynb, regarding the paragraph Impute NaN values with feature means.

Currently, the nan values in extracted features are imputed with the average of the train sample column.
It means calculating the average considering all vectors, of different classes.

train_columns_means = pd.DataFrame(X_train.mean()).transpose()
X_train.fillna(train_columns_means.iloc[0], inplace=True)
X_validation.fillna(train_columns_means.iloc[0], inplace=True)
X_test.fillna(train_columns_means.iloc[0], inplace=True)

Wouldn't it be a better option to calculate the averages for each class and replace any nan values with the values of the specific class?
We could append the train_labels.parquet types to the data, group by type and compute the averages per class, saving the results in train_columns_means.
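The per-class variant proposed above could be sketched with pandas as follows. The toy frame, the label values and the column names f1 and f2 are made up for illustration; this is not code from the notebook.

```python
import pandas as pd

X_train = pd.DataFrame({
    "f1": [1.0, None, 3.0, 5.0],
    "f2": [None, 2.0, 4.0, None],
})
y_train = pd.Series(["name", "name", "city", "city"])

# Mean of each feature within each class, broadcast back to the row shape.
class_means = X_train.groupby(y_train).transform("mean")

# Fill from the class mean first, then fall back to the global feature mean
# for classes in which a feature is NaN in every row.
X_imputed = X_train.fillna(class_means).fillna(X_train.mean())
print(X_imputed)
```

One caveat with this approach: class labels are only known for the training data, so validation and test columns, and any new column at prediction time, would still have to be imputed with statistics that do not depend on the label, such as the global training means.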

Am I missing some theoretical concept or would this actually be an improvement to the system?