muhammad4hmed / gml

138.0 15.0 33.0 44.48 MB

Auto Data Science - Python Library.

License: MIT License

Python 19.39% Jupyter Notebook 15.02% HTML 64.25% JavaScript 1.35%

pypi python gml

gml's Introduction

Trying to be the master of all trades

🔭 Currently working as a Research Engineer (Machine Learning and Computer Vision) at Retrocausal.AI
🏆 Winner & Top ranks in 28 competitions/hackathons (Goal: 20 before graduation).
Ranked Competitions Master on Kaggle 😎 Kaggle Profile
👯 Always up to collaborate in Kaggle Competitions.
🤔 In a fight between pytorch and tensorflow, you will find me standing with pytorch 😂
📫 How to reach me: Linkedin
⚡ I'm making my code base, you can find it here: http://gist.github.com/Muhammad4hmed (Under construction 😂)

gml's Issues

regression issue

Creating New Features with Features Selection

[GML] The 1 step feature engineering process could generate up to 49 features.
[GML] With 5429 data points this new feature matrix would use about 0.00 gb of space.
[FEATURE_ENGINEERING] Step 1: transformation of original features
[FEATURE_ENGINEERING] Generated 14 transformed features from 7 original features - done.
[FEATURE_ENGINEERING] Generated altogether 17 new features in 1 steps
[FEATURE_ENGINEERING] Removing correlated features, as well as additions at the highest level

AttributeError Traceback (most recent call last)
in
3 numeric_cols =['session_id','session_number','client_agent','device_details']
4
----> 5 fe = FeatureEngineering(train,'time_spent',fill_missing_data=True, method_cat='Mode',cat_cols = cat_cols,numeric_cols = numeric_cols,
6 method_num='Mean',encode_data=True,normalize=True, remove_outliers=False,new_features=True,feateng_steps=1,task ='regression')
7

~\Anaconda3\lib\site-packages\GML\FEATURE_ENGINEERING.py in __init__(self, data, label, fill_missing_data, method_cat, method_num, drop, cat_cols, numeric_cols, thresh_cat, thresh_numeric, encode_data, method, thresh, normalize, method_transform, thresh_numeric_transform, remove_outliers, qu_fence, new_features, task, test_data, verbose, feateng_steps)
166 except:
167 pass
--> 168 X = afc.fit_transform(X, y)
169 if not test_data == None:
170 test_data = afc.transform(test_data)

~\Anaconda3\lib\site-packages\GML\AUTO_FEATURE_ENGINEERING\autofeat.py in fit_transform(self, X, y)
294 target_sub = target.copy()
295 # generate features
--> 296 df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
297 self.feateng_steps, self.transformations, self.verbose)
298 # select predictive features

~\Anaconda3\lib\site-packages\GML\AUTO_FEATURE_ENGINEERING\feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
339 print("[FEATURE_ENGINEERING] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
340 print("[FEATURE_ENGINEERING] Removing correlated features, as well as additions at the highest level")
--> 341 feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
342 cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns] # categorical cols not in feature_pool
343 if cols:

~\Anaconda3\lib\site-packages\GML\AUTO_FEATURE_ENGINEERING\feateng.py in <dictcomp>(.0)
339 print("[FEATURE_ENGINEERING] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
340 print("[FEATURE_ENGINEERING] Removing correlated features, as well as additions at the highest level")
--> 341 feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
342 cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns] # categorical cols not in feature_pool
343 if cols:

AttributeError: module 'sympy' has no attribute 'add'
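
A possible workaround, sketched under assumptions: the failing line compares each feature's top-level operation against `sympy.add.Add`, and the installed sympy does not expose `add` at the top level. sympy does export the addition class as `sympy.Add`, so the same filter can be written against that name. The `feature_pool` contents below are illustrative only, and the filter is a simplified version of line 341 in the traceback, not a patch tested against GML.

```python
import sympy

x, y = sympy.symbols("x y")

# Illustrative feature pool: feature names mapped to sympy expressions.
feature_pool = {
    "x": x,
    "x**2": x**2,
    "x + y": x + y,
    "log(x)*y": sympy.log(x) * y,
}

# sympy.Add is the public name of the addition class, so no
# `sympy.add` submodule lookup is needed.
kept = {name: expr for name, expr in feature_pool.items()
        if expr.func != sympy.Add}

print(sorted(kept))
```

Only the expression whose outermost operation is an addition is dropped; products, powers, and plain symbols stay in the pool.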

How can one predict on new data using gml?

Thank you

Problem with NLP

tokenizer_name is not defined in tokenize.
AttributeError Traceback (most recent call last)
in ()
----> 1 tokenizedX = nlp.tokenize(cleanX.values)

/usr/local/lib/python3.6/dist-packages/GML/NLP.py in tokenize(self, text)
580 Return: Tokenized
581 """
--> 582 tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
583 encoded = tokenizer.batch_encode_plus(
584 text,

AttributeError: 'AutoNLP' object has no attribute 'tokenizer_name'
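
The traceback shows `tokenize` reading `self.tokenizer_name`, which apparently was never assigned on this object. The sketch below reproduces that failure mode with a hypothetical stand-in class (`TokenizerConfig` and `set_params` are made-up names, not the GML API) and shows the attribute being set before the call. Whether assigning `tokenizer_name` directly on a real `AutoNLP` object is enough is an assumption that has not been tested.

```python
class TokenizerConfig:
    """Hypothetical stand-in, not the GML AutoNLP class."""

    def set_params(self, tokenizer_name):
        # The attribute only exists after this method has been called.
        self.tokenizer_name = tokenizer_name

    def tokenize(self, texts):
        # Reading the attribute directly raises AttributeError
        # if it was never assigned.
        return [(self.tokenizer_name, t.split()) for t in texts]


nlp = TokenizerConfig()
message = ""
try:
    nlp.tokenize(["hello world"])
except AttributeError as err:
    message = str(err)

# Assumed model name, used purely for illustration.
nlp.tokenizer_name = "bert-base-uncased"
tokens = nlp.tokenize(["hello world"])
```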

Issue with tokenization in NLP

Hello All,

I am facing an issue with the tokenization method. I googled the issue but found no results. Kindly help me resolve the bug.

Apply the feature engineering to new data to make predictions?

I'm using the auto feature engineering feature to create the new_features and then train my model. Now that I have new data that I want to predict on, how would I apply that process to the new data so that its dimensions match what the trained model expects?
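
The traceback in the regression issue shows `FeatureEngineering` calling `afc.transform(test_data)` when `test_data` is supplied, which suggests that passing the new data as `test_data` applies the already fitted transformation to it. The general pattern is fit on training data, then transform new data with the same fitted object. Below is a minimal sketch of that pattern in plain Python; `StandardizeAndSquare` is a toy class invented for illustration and is not part of GML.

```python
from statistics import mean, pstdev


class StandardizeAndSquare:
    """Toy transformer: learns column statistics on training data only."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [mean(c) for c in columns]
        # Guard against zero variance so transform never divides by zero.
        self.stds_ = [pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            scaled = [(v - m) / s
                      for v, m, s in zip(row, self.means_, self.stds_)]
            # Engineered features: the square of each scaled column.
            out.append(scaled + [z * z for z in scaled])
        return out


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
new_data = [[4.0, 40.0]]

fe = StandardizeAndSquare().fit(train)
X_train = fe.transform(train)
X_new = fe.transform(new_data)  # same columns, in the same order, as X_train
```

Because the statistics and the list of generated columns come from the fitted object, the new rows always end up with the same width as the training matrix.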

how to pass data into AUTOML

from GML import AutoML

gml_ml = AutoML()
gml_ml.GMLRegressor(X, y, metric =mean_squared_error, folds =10)

KeyError Traceback (most recent call last)
in
2
3 gml_ml = AutoML()
----> 4 gml_ml.GMLRegressor(X, y, metric =mean_squared_error, folds =5)

~\Anaconda3\lib\site-packages\GML\ML.py in GMLRegressor(self, X, y, metric, folds)
199 for i,model in enumerate(self.reg_models):
200 name = str(model.__class__.__name__)
--> 201 scores = self.cross_val(X, y, model, metric, folds)
202
203 print('{} got score of {} in {} folds'.format(model.__class__.__name__,scores,folds))

~\Anaconda3\lib\site-packages\GML\ML.py in cross_val(self, X, y, model, metric, folds)
165 for tr_in, val_in in KFold(n_splits = folds).split(X, y):
166 model_fold = model
--> 167 X_train, y_train, X_val, y_val = X.iloc[tr_in,:], y[tr_in], X.iloc[val_in,:], y[val_in]
168 model_fold.fit(X_train, y_train)
169 y_hat = model.predict(X_val)

~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
904 return self._get_values(key)
905
--> 906 return self._get_with(key)
907
908 def _get_with(self, key):

~\Anaconda3\lib\site-packages\pandas\core\series.py in _get_with(self, key)
939 # (i.e. self.iloc) or label-based (i.e. self.loc)
940 if not self.index._should_fallback_to_positional():
--> 941 return self.loc[key]
942 else:
943 return self.iloc[key]

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
877
878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
880
881 def _is_scalar_access(self, key: Tuple):

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1097 raise ValueError("Cannot index with multidimensional key")
1098
-> 1099 return self._getitem_iterable(key, axis=axis)
1100
1101 # nested tuple slicing

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1035
1036 # A collection of keys
-> 1037 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1038 return self.obj._reindex_with_indexers(
1039 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1313
1314 with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1315 raise KeyError(
1316 "Passing list-likes to .loc or [] with any missing labels "
1317 "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([1204, 1211, 1219, 1239, 1243,\n ...\n 5861, 5868, 5881, 5904, 5935],\n dtype='int64', length=312). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
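
A likely cause, judging from the traceback: `cross_val` indexes `X` by position (`X.iloc[tr_in,:]`) but `y` by label (`y[tr_in]`), so the lookup fails whenever the index of `y` has gaps, for example after rows were filtered out. The sketch below reproduces this with a small made-up dataset and shows two workarounds: positional lookup with `.iloc`, or resetting the index before calling `GMLRegressor`. Resetting the index on the caller's side is a suggestion and has not been verified against GML itself.

```python
import numpy as np
import pandas as pd

# Rows were dropped earlier, so the index labels have gaps (no 2 or 4).
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 3, 5])
y = pd.Series([10.0, 20.0, 30.0, 40.0], index=X.index)

# Positional indices, the kind KFold.split yields.
positions = np.array([2, 3])

label_lookup_failed = False
try:
    y.loc[positions]  # label lookup: label 2 does not exist
except KeyError:
    label_lookup_failed = True

# Positional lookup does not depend on the index labels.
y_by_position = y.iloc[positions]

# Alternative: make labels equal positions before cross-validation.
X_reset = X.reset_index(drop=True)
y_reset = y.reset_index(drop=True)
y_after_reset = y_reset.loc[positions]
```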

feature engineering problems

from GML import FeatureEngineering

fe = FeatureEngineering(train,'Selling_Price', fill_missing_data=True,method_cat='Mode',cat_cols=3,
numeric_cols=11,
method_num='Mean', encode_data=True,
normalize=True, remove_outliers=True,
new_features=True, feateng_steps=2 ) # feateng_steps = 0 for features selection without feature creation

X_new, y, test = fe.get_new_data()

==============================
Handling Missing Data

There is missing data
'int' object is not iterable

==============================
Encoding Data

Success
Data Encoded

==============================
Transforming Data

Data Transformed

==============================
Handling Outliers

Before outlier removal
interquartile range: 0.919066837284749
upper_inner_fence: 10.106504611810337
lower_inner_fence: 6.430237262671341
upper_outer_fence: 11.48510486773746
lower_outer_fence: 5.051637006744217
percentage of records out of inner fences: 5.62
percentage of records out of outer fences: 0.22
length of input dataframe: 6313
length of new dataframe after outlier removal: 5958
After outlier removal

==============================
Creating New Features with Features Selection

ValueError Traceback (most recent call last)
in
1 from GML import FeatureEngineering
2
----> 3 fe = FeatureEngineering(train,'Selling_Price', fill_missing_data=True,method_cat='Mode',cat_cols=3,
4 numeric_cols=11,
5 method_num='Mean', encode_data=True,

~\Anaconda3\lib\site-packages\GML\FEATURE_ENGINEERING.py in __init__(self, data, label, fill_missing_data, method_cat, method_num, drop, cat_cols, numeric_cols, thresh_cat, thresh_numeric, encode_data, method, thresh, normalize, method_transform, thresh_numeric_transform, remove_outliers, qu_fence, new_features, task, test_data, verbose, feateng_steps)
156 except:
157 pass
--> 158 X = afc.fit_transform(X, y)
159 if not test_data == None:
160 test_data = afc.transform(test_data)

~\Anaconda3\lib\site-packages\GML\AUTO_FEATURE_ENGINEERING\autofeat.py in fit_transform(self, X, y)
245 cols = [str(c) for c in X.columns] if isinstance(X, pd.DataFrame) else []
246 # check input variables
--> 247 X, target = check_X_y(X, y, y_numeric=self.problem_type == "regression", dtype=None)
248 if not cols:
249 # the additional zeros in the name are because of the variable check in _generate_features,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
793 raise ValueError("y cannot be None")
794
--> 795 X = check_array(X, accept_sparse=accept_sparse,
796 accept_large_sparse=accept_large_sparse,
797 dtype=dtype, order=order, copy=copy,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
642
643 if force_all_finite:
--> 644 _assert_all_finite(array,
645 allow_nan=force_all_finite == 'allow-nan')
646

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
102 elif X.dtype == np.dtype('object') and not allow_nan:
103 if _object_dtype_isnan(X).any():
--> 104 raise ValueError("Input contains NaN")
105
106

ValueError: Input contains NaN
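
One plausible reading of the log, stated as an assumption: the call passes `cat_cols=3` and `numeric_cols=11` as integers, the missing-data step reports "'int' object is not iterable", the NaN values are therefore never filled, and `check_X_y` later rejects the input. The other report on this page passes lists of column names for both arguments. The sketch below builds such lists with pandas and checks that no NaN remains; the column names other than `Selling_Price` are made up, and the mode/mean filling only mirrors the `method_cat='Mode'` and `method_num='Mean'` options rather than reproducing GML's implementation.

```python
import numpy as np
import pandas as pd

# Illustrative frame; Brand and Mileage are invented column names.
train = pd.DataFrame({
    "Selling_Price": [5.0, 7.5, 6.8, 6.1],
    "Brand": ["a", None, "b", "a"],
    "Mileage": [10.0, np.nan, 30.0, 20.0],
})
label = "Selling_Price"

features = train.drop(columns=[label])
# Lists of column names rather than counts.
numeric_cols = features.select_dtypes(include=["number"]).columns.tolist()
cat_cols = features.select_dtypes(exclude=["number"]).columns.tolist()

filled = train.copy()
for col in cat_cols:
    # Most frequent value per categorical column.
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
for col in numeric_cols:
    # Column mean per numeric column.
    filled[col] = filled[col].fillna(filled[col].mean())

remaining_nans = int(filled.isna().sum().sum())
```

If `remaining_nans` is zero, the "Input contains NaN" check should no longer be triggered by missing values in the raw columns.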

Can I limit the number of generated features?

Hi,

First of all, thanks for your awesome work. The issue I bumped into today while using it is memory usage. I am using it in Google Colab with the IEEE Fraud dataset from Kaggle, and it reports that about 10 TB of space will be used, which is far more memory than I have.

So I was wondering whether there is a limit on the number of features generated, or whether you could generate features in small batches and keep the best ones, because every time I try to run it my Colab session crashes.
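
For a rough sense of scale, the size of a dense feature matrix can be estimated before running the generation step. The sketch below assumes 4 bytes per value (32-bit floats), which is an assumption and may differ from how GML computes its own estimate; the row count and memory budget in the second call are hypothetical. The `feateng_steps` argument seen in the other reports is the documented knob here: per the comment in the feature engineering issue, `feateng_steps = 0` runs feature selection without feature creation.

```python
def feature_matrix_gb(n_rows, n_features, bytes_per_value=4):
    """Approximate size in GB of a dense n_rows x n_features matrix."""
    return n_rows * n_features * bytes_per_value / 1e9


def max_features_for_budget(n_rows, budget_gb, bytes_per_value=4):
    """Largest feature count whose dense matrix fits in budget_gb."""
    return int(budget_gb * 1e9 // (n_rows * bytes_per_value))


# Numbers from the regression log on this page: 5429 rows, up to 49 features.
small = feature_matrix_gb(5429, 49)
print(f"{small:.2f} gb")

# Hypothetical larger case: 500,000 rows and an 8 GB budget.
cap = max_features_for_budget(500_000, budget_gb=8)
print(cap)
```

With these assumptions the first call rounds to the "0.00 gb" figure printed in the regression log, and the second shows how quickly the feature count has to be capped as the row count grows.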

No requirements.txt

Hello!

I would like to set up this project locally in a virtual environment with all the dependencies, but there is no requirements.txt here. Please create one. Thanks!

regression new features sympy.add error

The sympy module has no attribute 'add'.

GML installation issue

ERROR: Command errored out with exit status 1:
command: 'C:\Users\prash\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py'"'"'; __file__='"'"'C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\prash\AppData\Local\Temp\pip-record-uf1syz3q\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\prash\Anaconda3\Include\torch'
cwd: C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch
Complete output (23 lines):
running install
running build_deps
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py", line 225, in <module>
setup(name="torch", version="0.1.2.post2",
File "C:\Users\prash\Anaconda3\lib\site-packages\setuptools\__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "C:\Users\prash\Anaconda3\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "C:\Users\prash\Anaconda3\lib\distutils\dist.py", line 966, in run_commands
self.run_command(cmd)
File "C:\Users\prash\Anaconda3\lib\distutils\dist.py", line 985, in run_command
cmd_obj.run()
File "C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py", line 99, in run
self.run_command('build_deps')
File "C:\Users\prash\Anaconda3\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "C:\Users\prash\Anaconda3\lib\distutils\dist.py", line 985, in run_command
cmd_obj.run()
File "C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py", line 51, in run
from tools.nnwrap import generate_wrappers as generate_nn_wrappers
ModuleNotFoundError: No module named 'tools.nnwrap'

ERROR: Command errored out with exit status 1: 'C:\Users\prash\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py'"'"'; __file__='"'"'C:\Users\prash\AppData\Local\Temp\pip-install-9foi2r2a\torch\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\prash\AppData\Local\Temp\pip-record-uf1syz3q\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\prash\Anaconda3\Include\torch' Check the logs for full command output.
