lucianolorenti / ceruleo

CeRULEo: Comprehensive utilitiEs for Remaining Useful Life Estimation methOds

Home Page: https://lucianolorenti.github.io/ceruleo/
License: MIT License
In many datasets, run-to-failure information is provided in different formats. It would be handy to have methods for splitting the data into cycles and for computing the RUL column from either a timestamp column or a cycle column of the dataset. It would also be handy to incorporate into the dataset classes a way of automatically handling different devices based on some identifier column.
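A minimal sketch of what such a utility could look like, assuming a pandas DataFrame with hypothetical device_id and timestamp columns (none of these names come from the current API):

import pandas as pd

def compute_rul_per_device(df: pd.DataFrame,
                           device_col: str = "device_id",
                           time_col: str = "timestamp") -> pd.DataFrame:
    """Split rows by device and compute RUL as the time remaining
    until each device's last observation (assumed to be the failure)."""
    def add_rul(group: pd.DataFrame) -> pd.DataFrame:
        group = group.sort_values(time_col)
        failure_time = group[time_col].iloc[-1]
        # For datetime columns this yields a timedelta; divide by the
        # sampling period to express the RUL in cycles instead.
        group["RUL"] = failure_time - group[time_col]
        return group

    return df.groupby(device_col, group_keys=False).apply(add_rul)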
I am trying to use the sample_weight input parameter of a TimeSeriesWindowTransformer inside a CeruleoRegressor, in order to weight the training samples with the inverse of the RUL, in this way:
regressor = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=1,
        sample_weight=RULInverseWeighted(),
        padding=True,
        step=1),
    Ridge(alpha=15))

regressor.fit(train_dataset)
Unfortunately I get a KeyError. The main part of the error concerns the definition of the RULInverseWeighted class:
class RULInverseWeighted(AbstractSampleWeights):
    """Weight each sample by the inverse of the RUL."""

    def __call__(self, y, i: int, metadata):
        return 1 / (y[i, 0] + 1)
In particular, there is a KeyError: (0, 0), so it seems like the index [i, 0] it is trying to access in order to compute the RUL of a certain sample does not exist.
The same error is also raised if I create a WindowedDatasetIterator and try to inspect its elements with next():
iterator = WindowedDatasetIterator(
    transformed_dataset,
    window_size=150,
    step=15,
    horizon=5,
    sample_weight=RULInverseWeighted(),
    iteration_type=IterationType.FORECAST
)

X, y, sw = next(iterator)
(X.shape, y.shape, sw.shape)
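A possible workaround, assuming the y handed to the sample-weight callable is a pandas DataFrame rather than a numpy array (which would explain the KeyError, since y[(0, 0)] on a DataFrame is a label lookup): convert to an array before indexing. This is only a guess at the cause, not a confirmed fix:

import numpy as np

class RULInverseWeighted(AbstractSampleWeights):
    """Weight each sample by the inverse of the RUL."""

    def __call__(self, y, i: int, metadata):
        # np.asarray is a no-op for arrays and turns a DataFrame into
        # an array, so positional [i, 0] indexing works in both cases.
        y = np.asarray(y)
        return 1 / (y[i, 0] + 1)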
Additional testing is needed on setting the parameters of a CeRULEo pipeline. Some basic functionality works, but it would be nice to have full integration with the sklearn stack.
Does it make sense to add some graphics modules regarding control charts?
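As a rough illustration of what such a module could offer, here is a minimal Shewhart-style control chart sketch in matplotlib; the function name and API are hypothetical, not part of CeRULEo:

import matplotlib.pyplot as plt
import numpy as np

def plot_control_chart(signal: np.ndarray, n_sigma: float = 3.0):
    """Plot a signal with its center line and +/- n_sigma control
    limits, highlighting out-of-control points."""
    mean, std = signal.mean(), signal.std()
    upper, lower = mean + n_sigma * std, mean - n_sigma * std
    fig, ax = plt.subplots()
    ax.plot(signal, label="sensor value")
    ax.axhline(mean, color="black", label="center line")
    ax.axhline(upper, color="red", linestyle="--", label="control limits")
    ax.axhline(lower, color="red", linestyle="--")
    out = (signal > upper) | (signal < lower)
    ax.plot(np.where(out)[0], signal[out], "ro", label="out of control")
    ax.legend()
    return fig, ax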
It would also be interesting to handle time series augmentation. Maybe we can have an augmented windowed iterator, or a boolean flag in the current one indicating whether we want to alter the time series. Something like time_series_augmentation could be used; a sketch follows the references below.
References:
Time Series Data Augmentation for Neural Networks by Time Warping with a Discriminative Teacher
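A minimal sketch of two common augmentations (jittering and window warping) that such an iterator could apply to each window; the function names are illustrative and this is not the time_series_augmentation API:

import numpy as np

def jitter(window: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Add Gaussian noise to every sample of a (time, features) window."""
    return window + np.random.normal(0.0, sigma, size=window.shape)

def window_warp(window: np.ndarray, scale: float = 1.5) -> np.ndarray:
    """Stretch a random quarter of the window in time, then resample
    the whole series back to its original length (simplified warping)."""
    n = window.shape[0]
    seg = max(n // 4, 1)
    start = np.random.randint(0, n - seg + 1)
    end = start + seg
    # Oversample the chosen segment by `scale` on the time axis
    warped_idx = np.concatenate([
        np.arange(0, start),
        np.linspace(start, end - 1, int(seg * scale)),
        np.arange(end, n),
    ])
    target = np.linspace(0, warped_idx.shape[0] - 1, n)
    out = np.empty_like(window)
    for j in range(window.shape[1]):
        warped = np.interp(warped_idx, np.arange(n), window[:, j])
        out[:, j] = np.interp(target, np.arange(warped_idx.shape[0]), warped)
    return out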
See line 12 in commit 4e95296.

Command:

pip install ceruleo

Error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.21.6 which is incompatible.
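A possible local workaround until the pinned requirement is adjusted, assuming nothing else in the environment needs numpy below 1.22, is to move numpy into the range tensorflow expects:

pip install "numpy>=1.22,<1.24"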
The GridSearchCV class, which works together with CeruleoMetricWrapper, gives an error when a dataset whose Transformer includes a Scaler is used.

Let's consider the following example. First, load the CMAPSSDataset and define FEATURES as the most relevant sensor measurement features:

# Load the dataset
train_dataset = CMAPSSDataset(train=True, models='FD001')
test_dataset = CMAPSSDataset(train=False, models='FD001')[15:30]

# Define the list of sensor measurement features
# (sensor_indices is defined elsewhere)
FEATURES = [train_dataset[0].columns[i] for i in sensor_indices]
Then define the Transformer:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1))
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)
Note that here I am using the MinMaxScaler to scale the data into the range (-1, 1). Finally, use GridSearchCV to compare different regression models:

regressor_gs = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=32,
        padding=True,
        step=1),
    Ridge(alpha=15))

grid_search = GridSearchCV(
    estimator=regressor_gs,
    param_grid={
        'ts_window_transformer__window_size': [5, 10],
        'regressor': [Ridge(alpha=15), RandomForestRegressor(max_depth=5)]
    },
    scoring=CeruleoMetricWrapper('neg_mean_absolute_error')
)

grid_search.fit(train_dataset)
The output returned after the grid_search.fit(train_dataset) command is launched is:

There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
...

And then the final error message is:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

So there are probably two operands that should be subtracted from one another, but since both are strings this results in an error.
Looking at additional details in the long error message, we can find:

ValueError: All the 20 fits failed. It is very likely that your model is misconfigured. You can try to debug the error by setting error_score='raise'.
As suggested in the error message, I added the error_score='raise' input argument to GridSearchCV to get a more detailed error explanation. Looking at the new error message, I think that the source of the error is in the transform method of the MinMaxScaler class contained in ceruleo.transformation.features.scalers:

def transform(self, X: pd.DataFrame) -> pd.DataFrame:
    try:
        divisor = self.data_max - self.data_min
        ...

The subtraction shown above is where the two strings are found, in self.data_max and self.data_min, thus creating the TypeError: unsupported operand type(s) for -: 'str' and 'str' reported above.
I tried to run the code again after placing import ipdb; ipdb.set_trace() inside the transform function, but for some reason the code did not stop as it is supposed to when using ipdb.

I was also able to access the data_max and data_min attributes with transformer.pipelineX.final_step.data_max and transformer.pipelineX.final_step.data_min, and I was also able to run:

transformer.pipelineX.final_step.data_max - transformer.pipelineX.final_step.data_min

without any error. So I do not really have a clue why this bug appears.
Obviously, running the code without MinMaxScaler, i.e. with:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)

it works without any errors.
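One small diagnostic that might help narrow this down, assuming the attribute path used above is correct: the scaler fitted inside the grid search clones may be in a different state than the one inspected interactively, so printing the types right before the subtraction could confirm whether strings are sneaking in. This is only a debugging sketch, not a fix:

# Hypothetical check on the fitted scaler, reached through the same
# path used interactively above.
scaler = transformer.pipelineX.final_step
print(type(scaler.data_max), type(scaler.data_min))
# If either prints as str (or an object-dtype Series), the subtraction
# in transform() would raise the TypeError reported above.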
Two comments were made in the JOSS review process: one about clarifying the scope of the library within the current ecosystem, and one about the mention of Industry 4.0, which, with the advent of its evolution, Industry 5.0, is a bit outdated.

The PR covering this is #28.
Right now we are relying on sklearn.model_selection.train_test_split for splitting the run-to-failure cycles into train and test sets. If the variance in the duration of the cycles is high, this may cause an imbalance between the sets. It would be handy to split the cycles taking each cycle's length into account.

For example, if we have the following cycle lengths:

>>> lengths = [5, 10, 15, 25, 50, 60, 40, 30, 9, 8]

we can end up with the following split:

>>> train_set, test_set = train_test_split(lengths, train_size=0.8)
[[15, 10, 8, 5, 9, 40, 25], [60, 50, 30]]

But in terms of total samples the two sets are almost equal (the test set is even larger), despite the 80/20 split we requested:

>>> sum(train_set), sum(test_set)
(112, 140)

It would be nice to handle this situation with a split that takes this issue into account, as sketched below.
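A minimal sketch of a length-aware split, using a greedy heuristic that walks the cycles from longest to shortest and puts each one into the train set while it is still under its sample budget; the function name and signature are hypothetical, not CeRULEo's:

def split_by_length(lengths, train_size=0.8):
    """Split cycle indices so that the train set holds roughly
    train_size of the total number of samples."""
    total = sum(lengths)
    budget = train_size * total
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    train_idx, test_idx = [], []
    train_sum = 0
    for i in order:
        if train_sum + lengths[i] <= budget:
            train_idx.append(i)
            train_sum += lengths[i]
        else:
            test_idx.append(i)
    return train_idx, test_idx

With the lengths above, this yields a split much closer to the requested proportion:

>>> train_idx, test_idx = split_by_length(lengths, train_size=0.8)
>>> sum(lengths[i] for i in train_idx), sum(lengths[i] for i in test_idx)
(200, 52)

A shuffle of equal-length cycles (or a randomized assignment) would avoid always sending the longest cycles to the train set; this sketch only illustrates balancing by sample count.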
I just found this repo and wanted to know if it is still actively developed. I just published a similar package for RUL datasets. Do you want to have a talk about collaborating? I couldn't find your contact info anywhere else.