lucianolorenti / ceruleo

CeRULEo: Comprehensive utilitiEs for Remaining Useful Life Estimation methOds

Home Page: https://lucianolorenti.github.io/ceruleo/
License: MIT License
In many datasets, run-to-failure information is provided in different formats. It would be handy to have methods for splitting the data into cycles and for computing the RUL column from either a timestamp column or a cycle column of the dataset. It would also be handy to incorporate into the dataset classes a way of automatically handling different devices based on some identifier column.
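A minimal sketch of what such a utility could look like, assuming a pandas DataFrame with hypothetical device_id and timestamp columns (none of these names come from the current API):

import pandas as pd

def compute_rul_per_device(df: pd.DataFrame,
                           device_col: str = "device_id",
                           time_col: str = "timestamp") -> pd.DataFrame:
    """Split rows by device and compute RUL as the time remaining
    until each device's last observation (assumed to be the failure)."""
    def add_rul(group: pd.DataFrame) -> pd.DataFrame:
        group = group.sort_values(time_col)
        failure_time = group[time_col].iloc[-1]
        # For datetime columns this yields a timedelta; divide by the
        # sampling period to express the RUL in cycles instead.
        group["RUL"] = failure_time - group[time_col]
        return group

    return df.groupby(device_col, group_keys=False).apply(add_rul)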
I am trying to use the sample_weight input parameter of a TimeSeriesWindowTransformer inside a CeruleoRegressor, in order to weight the training samples with the inverse of the RUL, in this way:
regressor = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=1,
        sample_weight=RULInverseWeighted(),
        padding=True,
        step=1),
    Ridge(alpha=15))

regressor.fit(train_dataset)
Unfortunately I get a KeyError. The main part of the error concerns the definition of the RULInverseWeighted class:
class RULInverseWeighted(AbstractSampleWeights):
    """Weight each sample by the inverse of the RUL."""

    def __call__(self, y, i: int, metadata):
        return 1 / (y[i, 0] + 1)
In particular, there is a KeyError: (0, 0), so it seems like the index [i, 0] it is trying to access in order to compute the RUL of a certain sample does not exist.
The same error is also raised if I create a WindowedDatasetIterator and try to inspect its elements with next():
iterator = WindowedDatasetIterator(
    transformed_dataset,
    window_size=150,
    step=15,
    horizon=5,
    sample_weight=RULInverseWeighted(),
    iteration_type=IterationType.FORECAST
)

X, y, sw = next(iterator)
(X.shape, y.shape, sw.shape)
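A possible workaround, assuming the y handed to the sample-weight callable is a pandas DataFrame rather than a numpy array (which would explain the KeyError, since y[(0, 0)] on a DataFrame is a label lookup): convert to an array before indexing. This is only a guess at the cause, not a confirmed fix:

import numpy as np

class RULInverseWeighted(AbstractSampleWeights):
    """Weight each sample by the inverse of the RUL."""

    def __call__(self, y, i: int, metadata):
        # np.asarray is a no-op for arrays and turns a DataFrame into
        # an array, so positional [i, 0] indexing works in both cases.
        y = np.asarray(y)
        return 1 / (y[i, 0] + 1)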
Additional testing is needed on setting the parameters of a CeRULEo pipeline. Some basic functionality works, but it would be nice to have full integration with the sklearn stack.
Does it make sense to add some graphics modules regarding control charts?
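As a rough illustration of what such a module could offer, here is a minimal Shewhart-style control chart sketch in matplotlib; the function name and API are hypothetical, not part of CeRULEo:

import matplotlib.pyplot as plt
import numpy as np

def plot_control_chart(signal: np.ndarray, n_sigma: float = 3.0):
    """Plot a signal with its center line and +/- n_sigma control
    limits, highlighting out-of-control points."""
    mean, std = signal.mean(), signal.std()
    upper, lower = mean + n_sigma * std, mean - n_sigma * std
    fig, ax = plt.subplots()
    ax.plot(signal, label="sensor value")
    ax.axhline(mean, color="black", label="center line")
    ax.axhline(upper, color="red", linestyle="--", label="control limits")
    ax.axhline(lower, color="red", linestyle="--")
    out = (signal > upper) | (signal < lower)
    ax.plot(np.where(out)[0], signal[out], "ro", label="out of control")
    ax.legend()
    return fig, ax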
It would also be interesting to handle time series augmentation. Maybe we can have an augmented windowed iterator, or a boolean flag in the current one indicating whether we want to alter the time series. Something like time_series_augmentation could be used; a sketch follows the references below.
References:
Time Series Data Augmentation for Neural Networks by Time Warping with a Discriminative Teacher
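A minimal sketch of two common augmentations (jittering and window warping) that such an iterator could apply to each window; the function names are illustrative and this is not the time_series_augmentation API:

import numpy as np

def jitter(window: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Add Gaussian noise to every sample of a (time, features) window."""
    return window + np.random.normal(0.0, sigma, size=window.shape)

def window_warp(window: np.ndarray, scale: float = 1.5) -> np.ndarray:
    """Stretch a random quarter of the window in time, then resample
    the whole series back to its original length (simplified warping)."""
    n = window.shape[0]
    seg = max(n // 4, 1)
    start = np.random.randint(0, n - seg + 1)
    end = start + seg
    # Oversample the chosen segment by `scale` on the time axis
    warped_idx = np.concatenate([
        np.arange(0, start),
        np.linspace(start, end - 1, int(seg * scale)),
        np.arange(end, n),
    ])
    target = np.linspace(0, warped_idx.shape[0] - 1, n)
    out = np.empty_like(window)
    for j in range(window.shape[1]):
        warped = np.interp(warped_idx, np.arange(n), window[:, j])
        out[:, j] = np.interp(target, np.arange(warped_idx.shape[0]), warped)
    return out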
See line 12 in commit 4e95296.

Command:

pip install ceruleo

Error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.21.6 which is incompatible.
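A possible local workaround until the pinned requirement is adjusted, assuming nothing else in the environment needs numpy below 1.22, is to move numpy into the range tensorflow expects:

pip install "numpy>=1.22,<1.24"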
The GridSearchCV class, which works together with CeruleoMetricWrapper, gives an error when a dataset whose Transformer includes a Scaler is used.

Let's consider the following example. First, load the CMAPSSDataset and define FEATURES as the most relevant sensor measurement features:

# Load the dataset
train_dataset = CMAPSSDataset(train=True, models='FD001')
test_dataset = CMAPSSDataset(train=False, models='FD001')[15:30]

# Define the list of sensor measurement features
# (sensor_indices is defined elsewhere)
FEATURES = [train_dataset[0].columns[i] for i in sensor_indices]
Then define the Transformer:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1))
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)
Note that here I am using the MinMaxScaler to scale the data into the range (-1, 1). Finally, use GridSearchCV to compare different regression models:

regressor_gs = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=32,
        padding=True,
        step=1),
    Ridge(alpha=15))

grid_search = GridSearchCV(
    estimator=regressor_gs,
    param_grid={
        'ts_window_transformer__window_size': [5, 10],
        'regressor': [Ridge(alpha=15), RandomForestRegressor(max_depth=5)]
    },
    scoring=CeruleoMetricWrapper('neg_mean_absolute_error')
)

grid_search.fit(train_dataset)
The output returned after the grid_search.fit(train_dataset) command is launched is:

There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
...

And then the final error message is:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

So there are probably two operands that should be subtracted from one another, but since both are strings this results in an error.
Looking at additional details in the long error message, we can find:

ValueError: All the 20 fits failed. It is very likely that your model is misconfigured. You can try to debug the error by setting error_score='raise'.
As suggested in the error message, I added the error_score='raise' input argument to GridSearchCV to get a more detailed error explanation. Looking at the new error message, I think that the source of the error is in the transform method of the MinMaxScaler class contained in ceruleo.transformation.features.scalers:

def transform(self, X: pd.DataFrame) -> pd.DataFrame:
    try:
        divisor = self.data_max - self.data_min
        ...

The subtraction shown above is where the two strings are found, in self.data_max and self.data_min, thus creating the TypeError: unsupported operand type(s) for -: 'str' and 'str' reported above.
I tried to run the code again after placing import ipdb; ipdb.set_trace() inside the transform function, but for some reason the code did not stop as it is supposed to when using ipdb.

I was also able to access the data_max and data_min attributes with transformer.pipelineX.final_step.data_max and transformer.pipelineX.final_step.data_min, and I was also able to run:

transformer.pipelineX.final_step.data_max - transformer.pipelineX.final_step.data_min

without any error. So I do not really have a clue why this bug appears.
Obviously, running the code without MinMaxScaler, i.e. with:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)

it works without any errors.
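One small diagnostic that might help narrow this down, assuming the attribute path used above is correct: the scaler fitted inside the grid search clones may be in a different state than the one inspected interactively, so printing the types right before the subtraction could confirm whether strings are sneaking in. This is only a debugging sketch, not a fix:

# Hypothetical check on the fitted scaler, reached through the same
# path used interactively above.
scaler = transformer.pipelineX.final_step
print(type(scaler.data_max), type(scaler.data_min))
# If either prints as str (or an object-dtype Series), the subtraction
# in transform() would raise the TypeError reported above.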
Two comments were made in the JOSS review process: one about clarifying the scope of the library within the current ecosystem, and one about the mention of Industry 4.0, which, with the advent of its evolution, Industry 5.0, is a bit outdated.

The PR covering this is #28.
Right now we are relying on sklearn.model_selection.train_test_split for splitting the run-to-failure cycles into train and test sets. If the variance in the duration of the cycles is high, this may cause an imbalance between the sets. It would be handy to split the cycles taking each cycle's length into account.

For example, if we have the following cycle lengths:

>>> lengths = [5, 10, 15, 25, 50, 60, 40, 30, 9, 8]

we can end up with the following split:

>>> train_set, test_set = train_test_split(lengths, train_size=0.8)
[[15, 10, 8, 5, 9, 40, 25], [60, 50, 30]]

But in terms of total samples the two sets are almost equal (the test set is even larger), despite the 80/20 split we requested:

>>> sum(train_set), sum(test_set)
(112, 140)

It would be nice to handle this situation with a split that takes this issue into account, as sketched below.
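A minimal sketch of a length-aware split, using a greedy heuristic that walks the cycles from longest to shortest and puts each one into the train set while it is still under its sample budget; the function name and signature are hypothetical, not CeRULEo's:

def split_by_length(lengths, train_size=0.8):
    """Split cycle indices so that the train set holds roughly
    train_size of the total number of samples."""
    total = sum(lengths)
    budget = train_size * total
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    train_idx, test_idx = [], []
    train_sum = 0
    for i in order:
        if train_sum + lengths[i] <= budget:
            train_idx.append(i)
            train_sum += lengths[i]
        else:
            test_idx.append(i)
    return train_idx, test_idx

With the lengths above, this yields a split much closer to the requested proportion:

>>> train_idx, test_idx = split_by_length(lengths, train_size=0.8)
>>> sum(lengths[i] for i in train_idx), sum(lengths[i] for i in test_idx)
(200, 52)

A shuffle of equal-length cycles (or a randomized assignment) would avoid always sending the longest cycles to the train set; this sketch only illustrates balancing by sample count.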
I just found this repo and wanted to know if it is still actively developed. I just published a similar package for RUL datasets. Do you want to have a talk about collaborating? I couldn't find your contact info anywhere else.