
TerminatedWorkerError about skope-rules (closed, 38 comments)

AlJohri avatar AlJohri commented on June 6, 2024 6
TerminatedWorkerError


Comments (38)

upendra431 avatar upendra431 commented on June 6, 2024 30

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}

Why am I getting this error?


anacarolinarocha avatar anacarolinarocha commented on June 6, 2024 16

I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!

You can take a look at the OS log to see if you happen to be having such a problem.

Hope it helps you guys!


sarajcev avatar sarajcev commented on June 6, 2024 16

I have the same issue on Ubuntu 18.04 with 16GB RAM and Anaconda (Python 3.7 and scikit-learn 0.21) on this simple example:

from sklearn.linear_model import LogisticRegression as LR
# Logistic Regression (with fixed hyper-parameters)
lreg = LR(C=100.,  # fixed "C" hyper-parameter
          multi_class='ovr', solver='newton-cg', class_weight='balanced', n_jobs=4)
lreg.fit(X_train, y_train)  # fit model to data
y_lr = lreg.predict_proba(X_test)  # predict on new data

The code fails at the fit line with the following message:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}

When I use n_jobs=1 the code runs just fine. With any other value for n_jobs, including -1, it fails with the same message.

I know that this code was running without errors on this dataset with n_jobs=-1 until now (maybe I updated some Anaconda packages in the meantime; I don't remember).


qmilangowin avatar qmilangowin commented on June 6, 2024 10

Getting this now on an Ubuntu EC2 instance (c4.xlarge) with GridSearchCV in a Jupyter notebook:

param_grid=[{
    'vect__ngram_range':[(1,1),(1,2),(1,3)],
    'clf__alpha':(1e-2,1e-3)}]

gs_clf=GridSearchCV(text_clf_NB,param_grid,n_jobs=-1)
gs_clf=gs_clf.fit(X_train,y_train)

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}


Hockwell avatar Hockwell commented on June 6, 2024 8

scikit-learn v0.22.1.
Similar situation. The program consumes little RAM.

self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
                              n_jobs=-1, bootstrap=True, max_samples=0.95)
self._clf.fit(X, y)

Mitigation: n_jobs=1.
Important: the base estimators I pass to BaggingClassifier use n_jobs=1, not -1.


pnmartinez avatar pnmartinez commented on June 6, 2024 6

I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!

You can take a look at the OS log to see if you happen to be having such a problem.

Hope it helps you guys!

This is what did it for me. It turns out that allocating all CPUs can be unstable, especially when other independent programs are running that can suddenly have an uncontrolled spike in memory usage.

The full case with n_jobs:

n_jobs = -1  # parallelize across all CPUs
n_jobs = -2  # parallelize across all CPUs but one
...
n_jobs = 1   # parallelization deactivated (single worker)

So n_jobs = -2 did it for me and should be enough, and clearly more efficient than n_jobs = 1.

EDIT: This is, however, only a nice workaround, not a fix, as @seanlseymour says below.
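
To check what a given n_jobs value actually maps to on a particular machine, joblib exposes a helper; a minimal sketch (assuming a reasonably recent joblib):

from joblib import cpu_count, effective_n_jobs

print(cpu_count())           # CPUs joblib can see
print(effective_n_jobs(-1))  # all CPUs
print(effective_n_jobs(-2))  # all CPUs but one
print(effective_n_jobs(1))   # a single worker, i.e. no process-based parallelism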


mdashkezari avatar mdashkezari commented on June 6, 2024 5

Updating matplotlib did it for me:
pip install -U matplotlib

macOS Catalina 10.15.6
sklearn: 0.23.2
numpy: 1.19.1
scipy: 1.4.1
Cython: 0.29.21
pandas: 1.0.5
matplotlib: 3.3.1
joblib: 0.16.0
threadpoolctl: 2.1.0


ishanuc avatar ishanuc commented on June 6, 2024 4

This issue still exists as of 2022. Closing the issue and pretending it went away (or using n_jobs=1 for "parallelization") does not fix it. Demanding "minimal examples" when the issue shows up in complicated working code is also unreasonable. I understand this is a hard-to-track bug, but the above "solutions" are not solutions.


AlJohri avatar AlJohri commented on June 6, 2024 2

@ogrisel the code is fairly intertwined at the moment, so creating a minimal reproduction will be difficult. If you have some debugging strategies for this type of issue, I may be able to narrow it down first.

Here is what my code looks like. This is a multi-class, multi-label problem transformed into multiple single-class problems:

neg_to_pos_ratio = 1.0

all_training_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]
all_test_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]

def process(topic):

    # find tagged data for topic (positive) and the remaining data ("negative")
    positive_data = [row for row in all_training_data if has_topic(topic, row)]
    negative_data = [row for row in all_training_data if not has_topic(topic, row)]

    # sample negative data to balance positive data
    sampled_positive_data = sample_or_all(positive_data, num_pos_training_data)
    sampled_negative_data = sample_or_all(negative_data, len(sampled_positive_data) * neg_to_pos_ratio)
    
    # create balanced training data
    training_data = sampled_positive_data + sampled_negative_data

    training_data_labels = [has_topic(topic, row) for row in training_data]
    training_data_stories = [get_text(story) for story in training_data]

    featurizer = CountVectorizer(
        stop_words='english',
        max_df=0.9,
        min_df=0.01, binary=True, analyzer='word')
    features = featurizer.fit_transform(training_data_stories).toarray()

    clf = SkopeRules(max_depth_duplication=2,
                    n_estimators=10,
                    precision_min=0.5,
                    recall_min=0.1,
                    verbose=2,
                    n_jobs=-1,
                    feature_names=["w_" + x.replace(' ', '_') for x in featurizer.get_feature_names()])

    clf.fit(features, training_data_labels)

    # .... more code here

for topic in topics:
    result = process(topic)

The error triggers on clf.fit. It always crashes after the same number of topics have been processed (2 or 3), and I watched the process with top; the memory usage seems fine. I'm running on an EC2 instance with 32 GB of memory and 8 cores.

If I remove n_jobs, the script runs to completion.


upendra431 avatar upendra431 commented on June 6, 2024 2

I got this from Stack Overflow; it resolved my issue:
https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l

I figured out that my scipy module was incompatible with my Windows 10 C++ redistributable version.

All I did was download the latest Visual Studio and install the C++ redistributable update that is listed in the "individual components" section.

Once I installed that, I restarted my computer and ran:

import scipy
scipy.test()

Once that was actually running, I tried my code block above again and it was fixed.

I think what this boils down to is running an old build of Windows 10 with a brand-new version of Python and scipy.

This took a LONG time to solve and debug. Hopefully it helps.


anacarolinarocha avatar anacarolinarocha commented on June 6, 2024 2

GridSearchCV works fine for me with SVC, LinearSVC, MultinomialNB, and RandomForest. I'm facing this problem only with the multilayer perceptron. All attempts with all algorithms use n_jobs > 1.


IloBe avatar IloBe commented on June 6, 2024 2

I have the same issue with GridSearchCV for RandomForestClassifier and n_jobs=-1 in a Jupyter notebook, running on Paperspace with a GPU+ container. The dataset is a cleaned disaster-messages dataset from Figure Eight. The code is:

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
            ('tfidf', TfidfTransformer(sublinear_tf=True)),
        ]))
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                                         n_jobs=-1, random_state=FIXED_SEED)))
])

rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'clf__estimator__n_estimators': [10, 100, 500, 1000],
    'clf__estimator__max_depth': [None, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample']
}

grid_cv = GridSearchCV(pipeline, param_grid=rfc_param_grid, n_jobs=-1, cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
As expected, it does not happen if the pipeline is used alone, without GridSearchCV.


ybagdasa avatar ybagdasa commented on June 6, 2024 2

A workaround for now:

with parallel_backend('threading', n_jobs=8):
    fitGridSearchDecisionTree(data, clf_args)  # my code that calls an instance of GridSearchCV.fit with n_jobs=None

This uses multithreading rather than multiprocessing (if I understand correctly), but it still results in a much faster grid search.
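
For context, a self-contained sketch of this threading-backend workaround (synthetic data and a hypothetical parameter grid, not the original code):

from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}

# Threads instead of loky worker processes: no child process for the OS to kill,
# at the cost of the GIL limiting pure-Python parts of the work.
with parallel_backend("threading", n_jobs=8):
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
    search.fit(X, y)  # GridSearchCV's own n_jobs left at its default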


mayujie avatar mayujie commented on June 6, 2024 2

scikit-learn v0.22.1.
Similar situation. The program consumes little RAM.

self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
                              n_jobs=-1, bootstrap=True, max_samples=0.95)
self._clf.fit(X, y)

Mitigation: n_jobs=1.
Important: the base estimators I pass to BaggingClassifier use n_jobs=1, not -1.

Very helpful!!! Thank you


ogrisel avatar ogrisel commented on June 6, 2024 2

I think we should close this issue. joblib workers can crash for a variety of reasons (e.g. not enough memory on the system to use parallelism, installation problems and so on) and we should open one issue per problem, provided we have enough information to reproduce the problem.

In the comments above, most reports are unrelated to the skope-rules library and do not actually use it at all.

If you face such a problem in your code without importing skope-rules, please:

  • make sure your version of joblib is up to date: https://pypi.org/project/joblib/
  • check the memory usage of your system when you execute the code that crashes first (use the task manager on Windows, Activity Monitor on macOS or top / htop on Linux for instance): if the RAM is exhausted, it's normal to get a crash when using too many workers. Try using n_jobs=2 instead of n_jobs=-1 and monitor RAM usage again before growing the n_jobs value;
  • if you still get the problem, open an issue on the github repo of the library you actually use in your code (for instance scikit-learn or directly joblib);
  • mention the versions of joblib and scikit-learn installed in the Python environment where you get the crash. For instance you can use: python -c "import sklearn; sklearn.show_versions()"
  • please include a minimal reproducing code snippet, including all the import statements and code to generate random data, for instance using https://scikit-learn.org/stable/datasets/sample_generators.html or the functions of the numpy.random module. If you do not make the effort to provide us with a minimal reproducer, it's very likely that nobody will be able to help you.

A minimal reproducer should be small (e.g. no more than 20 lines of Python) and stand-alone: anyone should be able to execute the code, for instance by copying and pasting the code snippet into an IPython or Jupyter session.
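
For illustration, a sketch of what such a stand-alone reproducer could look like (synthetic data via make_classification; the estimator, grid, and n_jobs value here are placeholders, not taken from any report above):

# Hypothetical minimal reproducer: synthetic data, a tiny grid search,
# and the version report requested above.
import sklearn
sklearn.show_versions()

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=5, n_jobs=2)
search.fit(X, y)  # if this crashes, the snippet above is the whole reproducer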


ngoix avatar ngoix commented on June 6, 2024 1

This is probably unrelated to skope-rules as n_jobs is just passed to sklearn ensemble estimators


seanlseymour avatar seanlseymour commented on June 6, 2024 1

I'm seeing that I can avoid this issue for some classifiers by setting n_jobs to -2, but not all. For example, LogisticRegression produces this error, as does Bagging; RandomForest, SVC, KNeighborsClassifier, and XGBoost work. The tracebacks on failures don't always point to the same place, consistent with the lack of consistency cited in this thread. Sometimes the issue is at cross_validate, sometimes at learning_curve, sometimes at GridSearchCV or RandomizedSearchCV; all seem to come from sklearn.model_selection. The only other common theme I see is that all the tracebacks hit python3.7/site-packages/joblib/parallel.py.

I'm sure this issue did not happen before switching to Catalina, but I'm not sure it was triggered immediately, so perhaps something else, or a combination, is the problem. I'm really hoping someone who understands this much more deeply than I do will dig into this for a real fix. Even if n_jobs = -2 always worked, that's still just a workaround, not a fix, right? Any progress here greatly appreciated!

My config:
OS Catalina 10.15.5
Python 3.7
Anaconda 4.4.7 (reinstalled per suggestions, no effect)
scikit-learn 0.23.1
matplotlib 3.2.1
16 GB RAM (free RAM is never the actual issue as far as I can tell)


ogrisel avatar ogrisel commented on June 6, 2024

Could you please provide a minimal reproduction case?


 avatar commented on June 6, 2024

Facing a similar issue... I have 50k data points and ample memory.

CODE:

%%time

auc_cv_dict = {}
auc_tr_dict = {}

for i in range(3, 50, 4):
    knn = KNeighborsClassifier(n_neighbors=i, algorithm='brute', weights='uniform', n_jobs=-1)
    knn.fit(xtr, dtrain['numeric_score'])

    # performance metrics for cv data:
    y_pred_cv = knn.predict_proba(xcv)
    fpr_cv, tpr_cv, thresholds_cv = roc_curve(ycv, y_pred_cv[:, 1])
    auc_cv_dict[i] = auc(fpr_cv, tpr_cv)

    # performance metrics for training data:
    y_pred_tr = knn.predict_proba(xtr)
    fpr_tr, tpr_tr, thresholds_tr = roc_curve(dtrain['numeric_score'], y_pred_tr[:, 1])
    auc_tr_dict[i] = auc(fpr_tr, tpr_tr)

ERROR:

TerminatedWorkerError Traceback (most recent call last)
in

~/.local/lib/python3.5/site-packages/sklearn/neighbors/classification.py in predict_proba(self, X)
191 X = check_array(X, accept_sparse='csr')
192
--> 193 neigh_dist, neigh_ind = self.kneighbors(X)
194
195 classes_ = self.classes_

~/.local/lib/python3.5/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
433 X, self.fit_X, reduce_func=reduce_func,
434 metric=self.effective_metric_, n_jobs=n_jobs,
--> 435 **kwds))
436
437 elif self._fit_method in ['ball_tree', 'kd_tree']:

~/.local/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in pairwise_distances_chunked(X, Y, reduce_func, metric, n_jobs, working_memory, **kwds)
1300 X_chunk = X[sl]
1301 D_chunk = pairwise_distances(X_chunk, Y, metric=metric,
-> 1302 n_jobs=n_jobs, **kwds)
1303 if ((X is Y or Y is None)
1304 and PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None)

~/.local/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434

~/.local/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1071 ret = Parallel(n_jobs=n_jobs, verbose=0)(
1072 fd(X, Y[s], **kwds)
-> 1073 for s in gen_even_slices(_num_samples(Y), effective_n_jobs(n_jobs)))
1074
1075 return np.hstack(ret)

~/.local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
928
929 with self._backend.retrieval_context():
--> 930 self.retrieve()
931 # Make sure that we get a last message telling us we are done
932 elapsed_time = time.time() - self._start_time

~/.local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())

~/.local/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()

/usr/lib/python3.5/concurrent/futures/_base.py in result(self, timeout)
403 raise CancelledError()
404 elif self._state == FINISHED:
--> 405 return self.__get_result()
406 else:
407 raise TimeoutError()

/usr/lib/python3.5/concurrent/futures/_base.py in __get_result(self)
355 def __get_result(self):
356 if self._exception:
--> 357 raise self._exception
358 else:
359 return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
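
As a side note on this report: the traceback ends in the chunked pairwise-distance computation used by brute-force kNN, so one possible mitigation (a sketch, not from the original comment) is to lower n_jobs and cap scikit-learn's working_memory, which bounds the size of each distance chunk:

# Sketch only (synthetic data): shrink pairwise-distance chunks and the worker
# count so brute-force kNN does not exhaust RAM during predict_proba.
import sklearn
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

sklearn.set_config(working_memory=256)  # MiB per pairwise-distance chunk; default is 1024

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute',
                           weights='uniform', n_jobs=2)  # instead of n_jobs=-1
knn.fit(X, y)
proba = knn.predict_proba(X)  # chunk size is now bounded by working_memory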


maninekkalapudi avatar maninekkalapudi commented on June 6, 2024

I'm facing the following error on a Debian-based GCP server:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}

I'm facing the above error at clf.fit(x_train_multilabel, y_train). I certainly don't know anything about C++ packages and I try not to change anything.

start = datetime.now()
hyper_param = {'estimator__C': [10**-5,10**-4, 10**-3, 10**-2, 10**-1, 1, 10**1, 10**2, 10**3, 10**4,10**5]}

classifier = OneVsRestClassifier(LogisticRegression(penalty='l1'))

clf = GridSearchCV(classifier, hyper_param, scoring = 'f1_micro', cv=10, n_jobs=-1)

clf.fit(x_train_multilabel, y_train)

print("Time taken to run this cell :", datetime.now() - start)


arindam2007b avatar arindam2007b commented on June 6, 2024

I am facing the exact same issue while running GridSearchCV. Has anyone found a solution yet?


AlJohri avatar AlJohri commented on June 6, 2024

Does anyone here have a small dataset they would be willing to share to create a reproducible example?


wayneli215 avatar wayneli215 commented on June 6, 2024

I solved this problem by reinstalling Anaconda.
I use Jupyter Notebook on Ubuntu.


vbaryshev4 avatar vbaryshev4 commented on June 6, 2024

I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!

You can take a look at the OS log to see if you happen to be having such a problem.

Hope it helps you guys!

Yes, and as evidence of this runtime error on Ubuntu you can watch swap memory being allocated in the Swp bar in htop.


kid3night avatar kid3night commented on June 6, 2024

Facing the same issue when I try to run RandomizedSearchCV with n_jobs larger than 1.
Is there any way to solve this problem now?
I'm running it on macOS 10.15.1.

My sklearn version is '0.21.3'.

A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}


trewaite avatar trewaite commented on June 6, 2024

Encountered the same issue using RandomizedSearchCV when passing a MultiOutputRegressor-wrapped XGBRegressor.

sklearn version is '0.20.4'

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

mo_jobs = 1
grid_jobs = 40

cv = TimeSeriesSplit(3)
estimator = xg.XGBRegressor()
mo_estimator = MultiOutputRegressor(estimator,n_jobs=mo_jobs)

param_grid = {'estimator__silent': [True],
            'estimator__max_depth': [6, 10, 15, 20],
            'estimator__learning_rate': [0.01, 0.1],
            'estimator__subsample': [0.7, 0.8, 0.9, 1.0],
            'estimator__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            'estimator__colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            'estimator__min_child_weight': [0.1, 0.5, 1.0, 3.0, 5.0, 7.0, 10.0, 13.0],
            'estimator__gamma': [0, 0.1, 0.25, 0.5],
            'estimator__reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0],
            'estimator__n_estimators': [100]}


grid = RandomizedSearchCV(estimator=mo_estimator,
                          cv=cv,
                          param_distributions=param_grid,
                          n_iter=10,
                          verbose=2,
                          scoring='neg_mean_squared_error',
                          n_jobs=int(grid_jobs/mo_jobs),
                          pre_dispatch=int(grid_jobs/mo_jobs))

grid.fit(X_train,y_train)

Note that my cluster has 64 cores; I am hitting this error while using only 40 cores and only n_iter=10 in RandomizedSearchCV.


pavel-kalmykov avatar pavel-kalmykov commented on June 6, 2024

I am also having this SIGABRT(-6) error as many have already posted here, but when I run the same notebook in Google Colab, I get the following:

/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning


wanpingDou avatar wanpingDou commented on June 6, 2024

scikit-learn v0.22.1.
Similar situation. The program consumes little RAM.

self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
                              n_jobs=-1, bootstrap=True, max_samples=0.95)
self._clf.fit(X, y)

Mitigation: n_jobs=1.
Important: the base estimators I pass to BaggingClassifier use n_jobs=1, not -1.

Very useful! Thanks!


ybagdasa avatar ybagdasa commented on June 6, 2024

I'm encountering the error

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.                                                                       
                                                    
The exit codes of the workers are {EXIT(1)} 

when running an instance of GridSearchCV on a DecisionTreeClassifier with n_jobs!=1. I tried updating sklearn and matplotlib with conda, but the problem persists. I am able to run RandomForestClassifier with n_jobs!=1 without any issue.


gtg472b avatar gtg472b commented on June 6, 2024

I kept getting this error even with n_jobs=1. Turns out I found a hidden error:

--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/steve/anaconda3/envs/rapidsai-0.17/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5

My only workaround was to set LOKY_PICKLER='pickle'
https://buildmedia.readthedocs.org/media/pdf/joblib/latest/joblib.pdf

I can't seem to find much info on this... Anyone know why the default cloudpickle is using protocol 5? It appears to have to do with Python 3.8, but I have 3.7.8 installed.
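
For reference, a hedged sketch of that workaround: loky reads the LOKY_PICKLER environment variable, so it needs to be set before joblib/scikit-learn spin up their workers (everything below other than the variable name is illustrative):

# Sketch: force loky to use the stdlib pickle instead of cloudpickle.
# Set the variable before importing scikit-learn (which imports joblib),
# since loky may read LOKY_PICKLER at import time.
import os
os.environ["LOKY_PICKLER"] = "pickle"

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(), {"max_depth": [3, 5]}, cv=3, n_jobs=2)
search.fit(X, y)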


vss888 avatar vss888 commented on June 6, 2024

If it helps, I am having this problem while trying to run multiple XGBoost models in parallel: I use joblib to read multiple copies of an XGBoost model from disk, which then consume incoming MQ messages to make predictions. I do not see high RAM usage in the system monitor (15-20% of RAM is used). The models start and run fine for some time, but at some point I get a crash with the same error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}

In one test reproducing the problem, if I run 40 models in parallel I get the crash, but if I run 30 models in parallel the crash does not occur.


dharathakkar5 avatar dharathakkar5 commented on June 6, 2024

Is this issue fixed?
I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs = 4.
Number of cores=8
2 million rows of data.


mayujie avatar mayujie commented on June 6, 2024

Is this issue fixed?
I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs = 4.
Number of cores=8
2 million rows of data.

What kind of model are you searching over, a Keras model or an sklearn model?
If Keras, I suggest using the Keras Tuner package for that.


pplonski avatar pplonski commented on June 6, 2024

Got a similar issue in the AutoML package that I'm working on. The solution was to update joblib to 1.0.1:

pip install -U joblib==1.0.1


paulmattheww avatar paulmattheww commented on June 6, 2024

I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!

You can take a look at the OS log to see if you happen to be having such a problem.

Hope it helps you guys!

I tried something similar, where I set my regressor to n_jobs=4 while the grid search is set to use almost all the CPUs available. Is this similar to what you did?


surfablebot avatar surfablebot commented on June 6, 2024

Got the same error. There is a bug; hope the following helps:

192vCPU
786 GB Memory
Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2022-06-09

scikit-learn==1.1.2
joblib==1.1.0
catboost==1.0.6
lightgbm==3.3.2
scipy==1.9.0
scikit-learn==1.1.2
scikit-optimize==0.9.0
filelock==3.8.0
progressbar2==4.0.0
numpy==1.23.2
pandas==1.4.3
tabulate==0.8.10
pycoingecko==2.2.0
jinja2==3.1.2
tables==3.7.0
blosc==1.10.6
joblib==1.1.0
python==3.10

File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 1056, in call
self.retrieve()
File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}


vbyan avatar vbyan commented on June 6, 2024

Had the same issue while running sklearn.model_selection.cross_validate in PyCharm. I resolved it by increasing the heap memory of the IDE. For PyCharm it's 750 MiB by default, which can trigger the TerminatedWorkerError, especially when working with huge datasets.

Hope this is helpful.


ibraym avatar ibraym commented on June 6, 2024

I had the same problem today. I was running RandomizedSearchCV with n_jobs=5 on an m5.2xlarge AWS instance (8 cores, 32 GB RAM). I solved it by changing n_jobs to 4 and adding the following:

import gc
gc.set_threshold(0)

For more info about this trick, read this article

