
deepvats's Introduction

Deep VATS

Deep learning Visual Analytics for Time Series

The main objective of DeepVATS is to combine cutting-edge research in neural networks and visual analytics of time series. It is inspired by projects such as Timecluster and TensorFlow's Embedding Projector, in which tools are created to interpret the content of neural networks trained with visual and textual data. This makes it possible to verify how the internal content of a neural network reveals high-level abstraction patterns present in the data (for example, semantic similarity between words in a language model).

General scheme of DeepVATS. Visualizing the embeddings can help in easily detecting outliers, change points, and regimes.

Given a set of time series data, DeepVATS will allow three basic tasks to be carried out:

  1. Train neural networks to search for representations that contain, in a compressed way, meaningful patterns of that data.
  2. Project and visualize the content of the latent space of the neural network in a way that allows the search for patterns and anomalies (see the sketch after this list).
  3. Provide interactive visualizations to explore different perspectives of the latent space.
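As an illustration of task 2, here is a minimal sketch of the projection step, assuming the latent representations of the sliding windows are already available as a NumPy array (the names and shapes below are illustrative; in DeepVATS they come from the DL and storage modules):

import numpy as np
import umap  # pip install umap-learn

# Stand-in for the latent output of a trained encoder over sliding windows.
latent = np.random.rand(500, 32)   # (n_windows, latent_dim)

# Project the latent space to 2D, one point per window, ready for plotting.
points_2d = umap.UMAP(n_components=2).fit_transform(latent)
print(points_2d.shape)             # (500, 2)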

Currently, DeepVATS is recommended for time series data with the following properties:

  • Univariate and multivariate time series
  • With or without natural timesteps
  • Regular timestamps
  • A single series at a time
  • Suitable for long time series that present cyclical patterns

Structure

The tool consists of three modules. The DL (Deep Learning) module sets up the pipeline for training the 'backbone' neural network model. The second module, the storage module, provides an API to save the datasets and the encoder models produced by the DL module, and to load them into the Visual Analytics module for inference. That module, the VA (Visual Analytics) module, allows the trained models to be used in an exploratory way through a Graphical User Interface (GUI).

DeepVATS architecture

How it works

The tool can be used for different time series data mining tasks, such as segmentation or detection of repetitive patterns (motifs) and anomalies (outliers). This example shows the use of the tool with a pre-trained model for segmentation.

Using the DeepVATS GUI

Deploy

To run the notebooks and the app, install Docker, docker-compose and the NVIDIA Container Toolkit on your system.

Note: Your system needs an NVIDIA GPU

Then, create a new .env file inside the docker folder of the project, adding the configuration variables listed below.

Note: You need to have an account in Weights & Biases (wandb).

# The name of the docker-compose project
COMPOSE_PROJECT_NAME=your_project_name
# The user ID you are using to run docker-compose
USER_ID=your_numeric_id
# The group ID you are using to run docker-compose (you can get it with id -g in a terminal)
GROUP_ID=your_numeric_id
# The user name assigned to the user id
USER_NAME=your_user_name
# The port from which you want to access Jupyter lab
JUPYTER_PORT=
# The token used to access Jupyter lab (acts like a password)
JUPYTER_TOKEN=
# The path to your data files to train/test the models
LOCAL_DATA_PATH=/path/to/your/data
# The W&B entity
WANDB_ENTITY=
# The W&B project
WANDB_PROJECT=
# The W&B personal API key (see https://wandb.ai/authorize)
WANDB_API_KEY=your_wandb_api_key
# List of comma separated GPU indices that will be available in the container (by default only 0, the first one)
CUDA_VISIBLE_DEVICES=0
# Github PAT (see https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and>
GH_TOKEN=your_github_pat
# Port in which you want Rstudio server to be deployed (for developing in the front end)
RSTUDIO_PORT=
# Password to access the Rstudio server
RSTUDIO_PASSWD=

Finally, in a terminal located in the folder docker of this repository, run:

docker-compose up -d

Then go to localhost:{{JUPYTER_PORT}} to run/edit the notebooks (backend), or go to localhost:{{RSTUDIO_PORT}} to edit the visualization module (frontend).

Note: In case you are working in a remote server, replace localhost with the IP of your remote server.

To run the GUI, enter the visualization service in localhost:{{RSTUDIO_PORT}}, and then run, in the R console:

shiny::runApp("app")

Contribute to the backend

The backend of the project has been created using nbdev, a library that allows creating Python projects directly from Jupyter Notebooks. Please refer to this library when adding new functionality to the project, in order to keep its structure.

We recommend using the following procedure to contribute and resolve issues in the repository:

  1. Because the project uses nbdev, we need to run nbdev_install_git_hooks the first time after the repo is cloned and deployed; this ensures that our notebooks are automatically cleaned and trusted whenever we push to GitHub/GitLab. The command has to be run from within the container. It can also be run from outside the container if you pip install nbdev on your local machine.

  2. Create a local branch in your development environment to solve issue XX (or to add a new functionality), with the name you want to give your merge request (use something that will be easy for you to remember in the future if you need to update your request):

    git checkout -b issueXX
    
  3. Make whatever changes you want to make in the code and notebooks, and remember to run nbdev_build_lib when you're done to ensure that the libraries are built from your notebook changes (unless you only changed markdown, in which case that's not needed). It's also a good idea to check the output of git diff to ensure that you haven't accidentally made more changes than you planned.

  4. Commit the changes:

    git commit -am "Fix issue #XX"
    
  5. Check that there are no merge problems in the Jupyter Notebooks with the command nbdev_fix_merge

  6. Push your local branch to a branch in the GitLab repository with an identifying name:

    git push -u origin HEAD
    
  7. When the push is made, a link will appear in the terminal to create a merge request. Click on it.

    remote:
    remote: To create a merge request for test_branch, visit:
    remote:   https://gitlab.geist.re/pml/x_timecluster_extension/-/merge_requests/new?merge_request%5Bsource_branch%5D=issueXX_solved
    remote:
    
  8. On the GitLab website:

    • Write in the description what problem your branch solves, using a hyperlink to the issue (just use the hash symbol "#" followed by the issue number).
    • Click on the option "Delete source branch when merge request is accepted" and assign the merge request to your profile.
    • Click on the button "Create merge request".
  9. Wait for the merge to be accepted. In case you're solving an issue, we recommend moving it to the "In review" field (in the Issue Board). To keep your branch up to date with the changes to the main repo, run:

    git pull upstream master

  10. If there are no problems, the merge request will be accepted and the issue will be closed. Once your merge request has been merged or rejected, you can delete your branch if you don't need it any more:

    git branch -d issueXX

Cite

If you use DeepVATS, please cite the paper:

@article{RODRIGUEZFERNANDEZ2023110793,
title = {DeepVATS: Deep Visual Analytics for Time Series},
journal = {Knowledge-Based Systems},
volume = {277},
pages = {110793},
year = {2023},
issn = {0950-7051},
doi = {10.1016/j.knosys.2023.110793},
author = {Victor Rodriguez-Fernandez and David Montalvo-Garcia and Francesco Piccialli and Grzegorz J. Nalepa and David Camacho}
}

deepvats's People

Contributors

david-montalvoo, luisgasco, misantamaria, vrodriguezf


deepvats's Issues

Review visualization app warnings

In GitLab by @dmt on Jan 21, 2021, 19:36

Several warnings occur during the execution of the app, which may be due to incompatibilities with the hdbscan library update.

docker-compose and gpus

We have to find a way of deploying the project with docker-compose so that the GPUs of the server (if any) are usable by the container. In classic docker, it is as easy as adding the flag --gpus all to the docker run command. However, I do not know how to do this in compose.

This will be important in a few weeks because we are installing more GPU servers in the group and thus, this project should be moved to a GPU-based machine. A possible route is sketched below.
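For reference, newer versions of docker-compose (v1.28+) added native GPU reservations in the service definition; a sketch of what this could look like (the service name is illustrative, not the project's actual one):

services:
  jupyter:
    # ... image, ports, volumes, etc.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or a specific number of GPUs
              capabilities: [gpu]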

Add more data to the experiment

In GitLab by @vrodriguezf on Mar 18, 2020, 18:04

Only one day of logs is being used in the nb 01_timecluster_replication. It is likely that better patterns will be found if many days are taken into account.

Fix issue #18 - [merged]

In GitLab by @lgs on Oct 26, 2020, 10:23

Merges issue18_solved -> master

Fix issue #18 :

  • I have created the function plot_validation_ts_ae() in the visualization module
  • I use that function to visualize the predictions made by the autoencoder vs. the original data
  • The notebook logs that image to wandb

Update Dockerfile. - [merged]

In GitLab by @lgs on Oct 15, 2020, 13:19

Merges patch-1 -> master

Update Dockerfile. Remove the lines that share the data folder before switching to the user (it's no longer necessary)

plot_top_losses

In GitLab by @vrodriguezf on Nov 2, 2020, 10:55

Define a function plot_top_losses (either in the DCAE module or in utils) with this header:

from fastcore.basics import patch
from tensorflow.keras import Model

@patch
def plot_top_losses(self: Model, k, largest=True, **kwargs):
    "Take the validation data of model `self`, compute the model losses for every item there, and sort the results. If `largest` is True, the validation losses will be sorted from larger to lower. Once they are sorted, take the k first items based on this order and plot the predictions."

My idea is to call it like model.plot_top_losses given a Keras model in the variable model, hence the @patch. @patch is a decorator from fastcore that adds methods to a class on the fly.

Ensure the validation set is properly selected for the DCAE

In GitLab by @vrodriguezf on Mar 31, 2020, 13:46

Ensure that the validation set is properly selected, in terms of respecting the time indices. With time series, you cannot split your data into training and test sets randomly; you have to take one part of the series (for example, the last part) as validation and leave the rest for training.
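A minimal sketch of such a chronological split, in contrast with a random one (names are illustrative):

import numpy as np

def time_split(X, val_pct=0.2):
    # Hold out the last val_pct fraction of the series for validation,
    # preserving the time order instead of shuffling.
    cut = int(len(X) * (1 - val_pct))
    return X[:cut], X[cut:]

series = np.arange(100)
train, valid = time_split(series)
assert train.max() < valid.min()  # validation strictly follows training in time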

issueXX_solved - [closed]

In GitLab by @lgs on Oct 20, 2020, 16:55

Merges test_branch -> master

This branch aims to fix issue #15.

THIS IS JUST A TEST. DON'T MERGE

Study the variability of UMAP depending on its parameterization

In GitLab by @vrodriguezf on Oct 14, 2020, 13:34

Execute, for a given DR artifact, the UMAP reducer (notebook DR) multiple times with different configurations.

To do this automatically, the papermill library has to be used, in combination with a wandb_group, to visualize the results of the experiment at a separate URL in wandb. We should discuss how to do this further.

Another option could be to create a sweep, just as with the DCAE case. However, I don't know if sweeps can be run without a metric to optimize... that's the case here: we are just exploring UMAP parameters, without a clear optimization metric.
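A minimal sketch of the papermill option, assuming the DR notebook declares a "parameters" cell that papermill can inject into (file names and parameter names below are illustrative):

import papermill as pm

configs = [
    {"umap_n_neighbors": 15, "umap_min_dist": 0.1},
    {"umap_n_neighbors": 50, "umap_min_dist": 0.5},
]
for i, params in enumerate(configs):
    pm.execute_notebook(
        "dr.ipynb",              # input notebook (name assumed)
        f"output/dr_{i}.ipynb",  # one executed copy per configuration
        parameters={**params, "wandb_group": "umap_variability"},
    )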

Fix plot_validation_ts_ae function from the module visualization

In GitLab by @lgs on Oct 29, 2020, 10:57

The plot_validation_ts_ae function takes the first element of each window and plots it; this way it is possible to see the whole time series, because we have only used a stride of 1 during the DCAE training. If we use a stride higher than 1, the graph will not make any sense.

We need to change the current code to plot only one window, selected through a function parameter (i.e. window_num):

    for i, ax in zip(range(original.shape[2]), fig.axes):
        ax.plot(original[:,0,i], label='Original Data') # Change this
        ax.plot(prediction[:,0,i], label='Prediction')  # Change this
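A possible rewrite, assuming the arrays are shaped (n_windows, window_len, n_vars) as implied by the indexing above (window_num is the parameter proposed in this issue):

import matplotlib.pyplot as plt

def plot_validation_ts_ae(original, prediction, window_num=0):
    # Plot a single window (chosen via window_num) for every variable,
    # instead of stitching together the first timestep of each window.
    n_vars = original.shape[2]
    fig, _ = plt.subplots(n_vars, 1, sharex=True)
    for i, ax in zip(range(n_vars), fig.axes):
        ax.plot(original[window_num, :, i], label='Original Data')
        ax.plot(prediction[window_num, :, i], label='Prediction')
        ax.legend()
    return fig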

Make a function that checks that the DR artifact to use matches, in terms of metadata, the DCAE one

In GitLab by @vrodriguezf on Oct 14, 2020, 13:30

There can be problems if, for example, the DR wants to use an artifact with 15 variables, but the associated DCAE has been trained with an artifact with 10 variables. The inputs wouldn't match.

There's already a scaffolding of the function check_compatibility in the DR notebook:

%nbdev_export
def check_compatibility(dr_ar:TSArtifact, dcae_ar:TSArtifact):
    "TODO: Function to check that the artifact used by the DCAE and the artifact that is \
    going to be passed through the DR are compatible"
    ret = dr_ar.metadata['TS']['vars'] == dcae_ar.metadata['TS']['vars']
    # Check that the dr artifact is not normalized
    return ret

For now it just checks that the variables used are the same (in number and names). There are other things that have to be checked here (see the sketch after this list), such as:

  • the resample period used in both artifacts (that is in TS.freq inside the metadata).
  • whether the DR artifact has missing values or not (that is in TS.has_missing_values; it has to be false).
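A sketch of the extended check under those assumptions (metadata keys as described above; has_missing_values is assumed to be stored as a boolean):

def check_compatibility(dr_ar, dcae_ar):
    "Check that the DCAE artifact and the DR artifact are compatible"
    # Same variables, in number and names
    same_vars = dr_ar.metadata['TS']['vars'] == dcae_ar.metadata['TS']['vars']
    # Same resampling period in both artifacts
    same_freq = dr_ar.metadata['TS']['freq'] == dcae_ar.metadata['TS']['freq']
    # The DR artifact must not contain missing values
    no_missing = dr_ar.metadata['TS']['has_missing_values'] is False
    return same_vars and same_freq and no_missing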

zoom in embeddings plot

In GitLab by @vrodriguezf on Nov 6, 2020, 11:14

If possible, using the mouse wheel would be great for this; otherwise, a couple of +/- buttons can do the trick.

Automatic clustering through HDB-SCAN

In GitLab by @vrodriguezf on Mar 30, 2020, 15:55

From the Timecluster paper:
"For validation purpose, we also compare an automatic clustering approach. A hierarchical clustering method (HDB- SCAN) [12,20] is used to generate the most significant clusters as a density-based clustering algorithm. It requires only one parameter which represents the minimum size of the cluster. We use the sklearn package, and we use the hdbscan package as available on PyPi in order to determine the number of clusters"

Implementing this is useful to provide automatic clusters to other PACMEL projects such as the ones devoted to shapelet generation.
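A minimal sketch of that approach on the 2D embeddings (min_cluster_size is the single parameter mentioned in the quote; the input array is a stand-in):

import numpy as np
import hdbscan  # pip install hdbscan

points_2d = np.random.rand(1000, 2)        # stand-in for the DR embeddings
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
labels = clusterer.fit_predict(points_2d)  # label -1 marks noise points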

Add automatic project-name creation to avoid docker-compose creation problems.

In GitLab by @lgs on Oct 13, 2020, 17:55

Currently, when several users use the docker-compose up -d command within a single server, inconsistencies appear and containers are overwritten.

To avoid this it is necessary to define a project_name when docker-compose is invoked. To solve this and to automate the process it is proposed:

Change mark type of embeddings plot

In GitLab by @vrodriguezf on Oct 16, 2020, 16:19

In the Timecluster paper, the 2D embedding plot is a scattered connected plot, where each point is marked as a hollow square.

In the DR notebook it is more or less the same, although the plot is wrongly logged into Weights & Biases: as can be seen when comparing the plot in the notebook with the one in wandb, the wandb points are filled with a colour. The plot in shiny also has filled marks.

I'd like to have both the wandb plots and the shiny ones with hollow marks as well, so that the plot scales better when it has thousands of points.

Change the validation splitting strategy in the encoder

In GitLab by @vrodriguezf on Oct 24, 2020, 16:57

Right now, the validation set used during the encoder training is just a random 20% split of the input artifact, made automatically by Keras using the argument val_pct. This implies that the validation data is used for the normalization of the training data, which is a bad practice. The validation data must be normalized independently, using the means/stds of the training data.

The easiest way to solve this problem is to use a separate artifact as validation set, instead of splitting the one used for training. As an example, we can use days 1-10 for training, days 11-13 for validation, and days 14-15 for test. Note that, in the call to model.fit, the argument validation_split is no longer applicable, and we have to use the argument validation_data (see docs).
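A sketch of the proposed change, assuming model is the Keras autoencoder and X_train/X_valid hold the raw training and validation artifacts (all names illustrative):

import numpy as np

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_n = (X_train - mu) / sigma
X_valid_n = (X_valid - mu) / sigma   # training stats, never validation stats

model.fit(X_train_n, X_train_n,     # autoencoder reconstructs its input
          validation_data=(X_valid_n, X_valid_n),  # replaces validation_split
          epochs=10)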

Issue29 solved - [merged]

In GitLab by @lgs on Dec 15, 2020, 18:06

Merges issue29_solved -> master

  • Clustering functionality has been added using reticulate and hdbscan (Python). We have also added functionality to modify its parameters.

  • We have added reactivity to the embeddings value filters: a renderUI element has been created to define the maximum number of embedding points to show.

  • The zoom functionality has been updated (now it works by clicking on a button).

  • The dyShading plotting process has been updated. Previously there was a large overlap of rectangles when they were drawn in the dygraph; this is now resolved.

Show precomputed clusters

Right now, the only way of showing clusters in the visualization app is to click on the button "calculate and show clusters", which is what is done in the original Timecluster paper. However, in order to integrate some of the work of our colleagues in Poland, it would be interesting to add the possibility of getting the clusters from a logged artifact.

The artifact would be a ReferenceArtifact logged by the dr_run, just like it is done with the embeddings, and it could be as simple as an array of cluster labels.
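A minimal sketch of logging such labels with the plain wandb API (artifact and project names are illustrative; in practice the project's ReferenceArtifact wrapper would be used, as described above):

import numpy as np
import wandb

labels = np.array([0, 0, 1, 1, -1])   # one cluster label per embedding point
np.save("cluster_labels.npy", labels)

run = wandb.init(project="deepvats", job_type="dr")
artifact = wandb.Artifact("cluster_labels", type="clusters")
artifact.add_file("cluster_labels.npy")
run.log_artifact(artifact)
run.finish()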

Updated docker-compose - [merged]

In GitLab by @lgs on Oct 15, 2020, 13:24

Merges lgs-master-patch-09882 -> master

Updated docker-compose to fix the problem of /data/PACMEL-2019 permissions

Test visually the quality of the DCAE with test data

In GitLab by @lgs on Oct 22, 2020, 17:13

Use data from the test_data artifact to visually check the performance of the DCAE.

  1. Create a section in the Jupyter Notebook (_02.DCAE)
  2. Load the test artifact.
  3. Transform the test artifact.
  4. Take a sample of the dataset (some windows)
  5. Print and log the image to W&B (see the sketch below)
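A minimal sketch of steps 4-5, assuming model is the trained DCAE, X_test holds the transformed test windows shaped (n_windows, window_len, n_vars), and a wandb run is already active (all names illustrative):

import numpy as np
import matplotlib.pyplot as plt
import wandb

idx = np.random.choice(len(X_test), size=3, replace=False)  # sample some windows
preds = model.predict(X_test[idx])

fig, axes = plt.subplots(len(idx), 1, sharex=True)
for ax, orig, pred in zip(np.atleast_1d(axes), X_test[idx], preds):
    ax.plot(orig[:, 0], label="original")        # first variable of the window
    ax.plot(pred[:, 0], label="reconstruction")
    ax.legend()
wandb.log({"test_reconstructions": wandb.Image(fig)})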

Webapp to visualize results

In GitLab by @vrodriguezf on Mar 18, 2020, 18:16

Just in the same way that the authors of the Timecluster paper show in this video, having a webapp of the replication and extension coded in this project would allow for easy sharing and insights.

Possible options for implementation include: voila (voila-ipyvuetify), R shiny

Normalize the artifact used by the DR (test artifact) with the metadata of the associated DCAE artifact (training artifact)

In GitLab by @vrodriguezf on Oct 14, 2020, 13:06

Right now, the test artifact used in the Dimensionality Reduction (DR) is a subset of 1000 rows of the same artifact used for training the associated DCAE. Therefore, the data loaded from the artifact is already normalized correctly.

However, if the DR artifact is different from the DCAE one (the real and useful case), the DR one has to be normalized according to the means and stds contained in the metadata of the DCAE one. This implies that, for convenience, artifacts that are going to be used for DR should be logged without normalization. Otherwise, when using them in the DR, they should first be denormalized and normalized again with the DCAE metadata.
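A sketch of that denormalize-then-renormalize step, assuming the per-variable means/stds are available from each artifact's metadata (the key names are illustrative):

import numpy as np

def renormalize(dr_data, dr_stats, dcae_stats):
    # Undo the DR artifact's own normalization...
    raw = dr_data * np.asarray(dr_stats["stds"]) + np.asarray(dr_stats["means"])
    # ...and re-apply it with the DCAE training statistics.
    return (raw - np.asarray(dcae_stats["means"])) / np.asarray(dcae_stats["stds"])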

Fix issues #19 and #21 - [merged]

In GitLab by @lgs on Oct 29, 2020, 11:19

Merges issue1921_solved -> master

I have solved issues #19 and #21

I have also changed the Dockerfile so as not to have problems in the future with papermill (get the most updated version of the library)

BUG: Shiny app only considers the first artifact used by the run

In GitLab by @vrodriguezf on Oct 20, 2020, 18:21

Modify server.R, more specifically the computation of the reactive tsdf, so that when calling used_artifacts, the correct artifact is looked for instead of assuming that the correct one is the first. The question is: how do we find the correct one?

  1. The correct one should not be normalized. This can be checked in the metadata. This is a bit of a botch, but it is simple.
  2. Get the run that logged the artifact somehow, and check that that run is a DCAE run... too complicated IMHO.

Refactor the contents of draftsheet to a new notebook `00_dataset_artifacts.ipynb`

In GitLab by @vrodriguezf on Oct 14, 2020, 12:47

It makes no sense that something as important as the creation of the dataset artifacts lives in a draftsheet notebook. The easiest way I can think of to move this is to create a notebook 00_dataset_artifacts.ipynb. We could even create a function that is exported to the library and gathers the functionality of the notebook, i.e., creating a TSArtifact from a given dataframe with a specified configuration.

On a side note, I would rather remove the nomenclature 00_, 01_, for notebooks that don't belong explicitly to the project pipeline. Therefore, things like 00_load should be renamed as load.

Ask about anomalies in the JNK dataset

In GitLab by @lgs on Nov 17, 2020, 13:08

We have observed some anomalies in the JNK dataset. One of them is shown in the figure below. You can see that there is an abrupt change in all sensors for a minute (all signals except SM_ShearerLocation have the value 0).

We wonder whether this stop is due to an error in data acquisition or to small machine stops for some operational reason (it is important to remark that the sensors that measure temperature are also set to 0).


[Figure: one-minute drop to 0 in all signals except SM_ShearerLocation]

DCAE Hyperparameter optimization

In GitLab by @vrodriguezf on Mar 18, 2020, 18:14

According to the paper: "The number of feature maps, size of filter and depth of the model are set based on the reconstruction error on validation set."

Use Weights & Biases to analyse which hyperparameters are best for the mining data.

Create a function to calculate a naive baseline model

In GitLab by @lgs on Oct 30, 2020, 15:25

Create a new Jupyter Notebook that:

  • Loads the train dataset
  • Transforms the dataset into autoencoder input tensors
  • Computes the mean (or mode, or median) for each window
  • Calculates the MSE and saves it into a list
  • Computes the average MSE
  • Logs the values into wandb (see the sketch below)
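A minimal sketch of this baseline, assuming the input tensors are shaped (n_windows, window_len, n_vars) and using the per-window mean as the naive prediction (the data below is a stand-in):

import numpy as np

windows = np.random.rand(200, 48, 3)          # stand-in for the training tensors
preds = windows.mean(axis=1, keepdims=True)   # per-window, per-variable mean
mse_per_window = ((windows - preds) ** 2).mean(axis=(1, 2))
baseline_mse = mse_per_window.mean()          # value to log into wandb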

Project roadmap and follow-up

In GitLab by @vrodriguezf on Mar 12, 2020, 18:57

Hi,

This issue is a way to keep track of the work done in this project and be aware of what is left.

I just made the initial commit which contains, as its most important file, the notebook nbs/01_Timecluster_replication.ipynb. In this notebook I explore the methodology proposed in the Timecluster paper with one piece of data from one of the longwalls.

In the last section of that notebook, I want to start exploring how to extend that methodology so that the expert can not only see which time window corresponds to each point in the 2D space, but also the most important variables of that time window.

Also, I still have to find a way to validate the replication, since the data from the paper is not available.
