cmstrackerdpg / mlplayground

Web application for collaborating to build ML models for Tracker DQM/DC

Python 14.35% CSS 0.13% HTML 6.23% Shell 0.37% Jupyter Notebook 34.82% JavaScript 43.94% Dockerfile 0.16%

mlplayground's Issues

Histogram manager breaks when histogram name filter contains a forward slash `/`

Hi,

I just discovered that the histogram manager breaks when a histogram name containing a forward slash / is selected in a filter. The URL below demonstrates the error message.

https://ml4dqm-playground.web.cern.ch/histograms/lumisections_2D/list/?title=EcalBarrel%2FEBOccupancyTask%2FEBOT+digi+occupancy+EB%2B10

My initial explanation is that, since each row of the histogram manager links to the visualiser through the URL pattern ['visualize/(?P<runnr>[0-9]+)/(?P<lumisection>[0-9]+)/(?P<title>[^/]+)/\\Z'], a histogram title containing a forward slash breaks the `[^/]+` group.

To fix this, we need some way to "sanitise" histogram titles so that they do not contain forward slashes that break the URL pattern defined in visualize_histogram/urls.py.

Thanks,
Vichayanun
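One possible fix, sketched below under the assumption that titles can be encoded reversibly (the `SLASH_TOKEN` placeholder and function names are invented for illustration, not project code): replace forward slashes before building the link and restore them in the visualiser view. An alternative would be Django's `path:` converter, which matches slashes.

```python
# Hypothetical title-sanitising helpers; SLASH_TOKEN is an invented
# placeholder, not something that exists in MLplayground.
SLASH_TOKEN = "__SLASH__"


def sanitize_title(title: str) -> str:
    """Encode '/' so the title matches the [^/]+ group in the URL pattern."""
    return title.replace("/", SLASH_TOKEN)


def desanitize_title(sanitized: str) -> str:
    """Invert sanitize_title() inside the visualiser view."""
    return sanitized.replace(SLASH_TOKEN, "/")
```

The round trip is lossless as long as real titles never contain the placeholder itself.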

Expose functionality to initiate discovery of DQMIO files

Currently, the only way to initiate the discovery of new DQMIO files is to run the discover_dqm_files management command.

This should be executable via the API as well.

This functionality could be limited to staff.

Optionally, a button may also be shown in the Histogram File Manager page.

Combine Lumisection Histogram apps

  • Common fields for LumisectionHisto1D and LumisectionHisto2D are to be combined into a common base model (LumisectionHistoBase) that both LumisectionHisto1D and LumisectionHisto2D will inherit from.
    • lumisection
    • date
    • title
    • data
    • entries
    • x_min
    • x_max
    • x_bin
    • source_data_file

Create a Task model to aggregate sets of data used for training and testing models

The Task will be a "container" which will group data so that it can be serialized and used by users for training and testing their models.

It should include:

  • Possibly a ManyToMany relationship with:
    • 1D Lumisection Histograms and/or
    • 2D Lumisection Histograms and/or
    • Run Histograms

The main point to keep in mind is that a Task should contain multiple histograms, and a histogram can belong to many different Tasks.

@Abhit03 @XavierAtCERN please verify whether a Task should aggregate a single type of histogram or a mix of them

Refactor frontend

The goal is to have a frontend stable enough to start developing functional tests.

  • clean landing page
  • make access to tabs conditional on user authentication
  • add major apps to navbar (with dropdown for Run/Lumisection?)
  • add API link at the bottom of apps

Find alternative ways to offload resource-heavy and time-consuming tasks

Description

Currently, on each parsing command issued for a HistogramDataFile, a new thread is spawned (from within Django) to parse the file.
Several issues with this approach:

  • Django carries the burden of parsing files, which it should not
  • The pod gets overloaded, sometimes even leading to pod restarts
  • There is no way to prevent multiple threads from being spawned for the same file, which wastes resources

Next steps:

  • Think about implementing a separate process which parses files, maybe even deployed on a separate pod, with its own API.
  • Create a new DB model where a queue can be implemented for the separate process to check for files to parse. This has the added advantage of being able to resume file parsing after a crash, as the entry in the queue has not been cleared.

Create new DB models for storing parsed histogram files

Histogram files (currently only in CSV format but also keeping future nanoDQM format in mind) are parsed via a management command.

For huge (>2GB) files, this may take a lot of time or may even crash. Having appropriate models for storing file parsing history/progress can help a lot with the app's bookkeeping.

Find out the cause of long DRF response delays on the `api/histogram_data_files/` endpoint

Description

Upon deployment, the ML Playground displays very long delays in rendering responses through the DRF API.

Initially, we were getting lots of Broken pipe messages, caused by the proxy cutting off the connection because the server did not respond in time. After increasing the proxy timeout to 180 s, we were able to proceed with debugging the issue.

Example 1

  1. Deploy the project on PaaS.
  2. Visit this endpoint
  3. Wait for it to load. It might take from 20 seconds up to 3 minutes to render (or even return a 504 error due to timeout).

Example 2

  1. Run the project locally, using the DBoD database (just like the deployed project does).
  2. Visit this endpoint
  3. The page is rendered in (at most) 2 seconds.

Things we've tried

  • Use different wsgi/asgi servers (no difference, always the same delay):
    • daphne with asgi under settings.py
    • gunicorn with wsgi under settings.py
    • Plain development server (the one that is not recommended for deployment) with manage.py runserver
  • Running the project locally, using the DBoD database that deployment uses. This works correctly, even when running it with daphne. Responses take up to 2 seconds, maximum, meaning that the database is not the problem.
  • Pod CPU usage only peaks in the range of 0.03 to 0.06 while rendering the response, so it's not a matter of CPU speed.
  • Disabling DEBUG. No change.
  • Deployed with extra debug logging in settings.py, so that we can see what's going on in Django's mind while rendering that. You can see below that the queries themselves are fast. A long time is spent between the last SELECT query and the start of the HTTP 200 response (~2.5 min).
10.76.10.1:34256 - - [11/Apr/2022:13:27:40] "GET /api/histogram_data_files/?format=json&page=7" 200 17015
DEBUG - 2022-04-11 13:27:40,170 - http_protocol - HTTP response complete for ['10.76.10.1', 34256]
DEBUG - 2022-04-11 13:34:22,979 - http_protocol - HTTP b'GET' request for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:34:22,990 - selector_events - Using selector: EpollSelector
DEBUG - 2022-04-11 13:34:23,118 - utils - (0.009) SELECT "django_session"."session_key", "django_session"."session_data", "django_session"."expire_date" FROM "django_session" WHERE ("django_session"."expire_date" > '2022-04-11T13:34:22.999654+00:00'::timestamptz AND "django_session"."session_key" = 'f3nubvfv7t7k6eltpk5e54ovzaijv89s') LIMIT 21; args=(datetime.datetime(2022, 4, 11, 13, 34, 22, 999654, tzinfo=datetime.timezone.utc), 'f3nubvfv7t7k6eltpk5e54ovzaijv89s'); alias=default
DEBUG - 2022-04-11 13:34:23,127 - utils - (0.005) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1 LIMIT 21; args=(1,); alias=default
DEBUG - 2022-04-11 13:34:23,136 - utils - (0.006) SELECT COUNT(*) AS "__count" FROM "histogram_file_manager_histogramdatafile"; args=(); alias=default
DEBUG - 2022-04-11 13:34:23,138 - utils - (0.001) SELECT "histogram_file_manager_histogramdatafile"."id", "histogram_file_manager_histogramdatafile"."filepath", "histogram_file_manager_histogramdatafile"."filesize", "histogram_file_manager_histogramdatafile"."data_dimensionality", "histogram_file_manager_histogramdatafile"."data_era", "histogram_file_manager_histogramdatafile"."entries_total", "histogram_file_manager_histogramdatafile"."entries_processed", "histogram_file_manager_histogramdatafile"."granularity", "histogram_file_manager_histogramdatafile"."created", "histogram_file_manager_histogramdatafile"."modified" FROM "histogram_file_manager_histogramdatafile" LIMIT 50 OFFSET 300; args=(); alias=default
DEBUG - 2022-04-11 13:37:01,528 - http_protocol - HTTP 200 response started for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,530 - http_protocol - HTTP close for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,531 - http_protocol - HTTP response complete for ['10.76.10.1', 34540]

Other notes

Even after a 504 error, the very next reload of the same page loads instantly (thanks to the 1-minute response cache), meaning the response had already been cached by the first request even though it was never returned to the client.

Summary

We need to 👽probe deeper👽 into what's going on inside Django between the end of the DB query and the start of the HTTP response.
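One way to probe that gap (a sketch, not project code): a profiling middleware that wraps each request in cProfile, which would show whether the missing ~2.5 minutes are spent in serialization, pagination, or somewhere else entirely.

```python
import cProfile
import io
import pstats


class ProfilingMiddleware:
    """Profile each request and print the top cumulative-time calls."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        profiler = cProfile.Profile()
        profiler.enable()
        response = self.get_response(request)  # the whole view + rendering
        profiler.disable()
        stream = io.StringIO()
        pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(20)
        print(stream.getvalue())  # in production, send this to the logger instead
        return response
```

Adding it (temporarily) to the top of `MIDDLEWARE` would profile everything below it, including DRF's serializer and renderer.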

Histogram visualizer breaks when histogram does not have x_min and x_max information

Hi,

The MLP frontend redesign has been live for a few hours, but I have already noticed a bug in the visualizer, with the following error log:

TypeError at /visualize/316218/941/num_clusters_ontrack_PXBarrel/
unsupported operand type(s) for +: 'NoneType' and 'int'

This error is caused by some histograms not having an x_bin attribute, which is required in this line: https://github.com/CMSTrackerDPG/MLplayground/blob/master/visualize_histogram/views.py#L62 Furthermore, those histograms also lack the x_min and x_max attributes, which are required for plotting as well. I checked the production DB and found that 268 136 out of 447 754 1D histograms are affected, more than half of the 1D lumisection database.

For reference, here is an example link that breaks the backend: https://ml4dqm-playground.web.cern.ch/visualize/316218/941/chargeInner_PXLayer_1/

Upon further investigation, I found that these histograms are parsed from three 2018 CSV files, all parsed on 30 Sep 2022. The exact addresses for these files are as follows:

/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_1.csv
/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_10.csv
/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_11.csv

According to the histogram file manager, these three files are the only ones parsed so far in the 2018A era. For now, I will re-parse them using the manager and see whether that fixes the issue.

Thanks,
Vichayanun
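Independently of re-parsing, the view could be made defensive about missing metadata. A hedged sketch (the function and the unit-bin fallback are illustrative, not the actual views.py code): treat missing x_min/x_max/x_bin as one unit-width bin per data point instead of raising a TypeError.

```python
# Hypothetical guard for missing binning metadata; attribute names come from
# the issue, the fallback policy is an assumption.


def bin_edges(x_min, x_max, x_bin, data):
    """Return bin edges, falling back to unit bins when metadata is missing."""
    if x_min is None or x_max is None or x_bin is None:
        # Fallback: one unit-width bin per data point.
        x_min, x_max, x_bin = 0.0, float(len(data)), len(data)
    step = (x_max - x_min) / x_bin
    return [x_min + i * step for i in range(x_bin + 1)]
```
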

Unknown users can sign up for a new account via "Login with CERN" button

Hi,

I have found a vulnerability in the MLP login page: unknown users can sign up for a new account as follows:

  1. Unknown user clicks on "Login with CERN" button.
  2. Instead of clicking "Continue" at the bottom, the user clicks on "Sign Up" link.
  3. The user sets up a username and password.
  4. The user now has an account for MLP.

Instead of taking users directly to the CERN single sign-on, the "Login with CERN" button redirects to an intermediate page provided by django-allauth, which offers CERN sign-on, GitHub sign-on, and local-account sign-in or sign-up. We have to find a way to configure django-allauth so that local sign-up is closed (for example, via a custom account adapter whose is_open_for_signup returns False) or so that the intermediate page is skipped entirely.

Thanks,
Vichayanun

Add minimal plots

Add the following minimal plots (either static or using altair)

Runs

  • distribution of the mean of a given histogram for all runs
  • time series of the mean for all runs

Lumisections

  • 1D histograms
  • 2D histograms
  • ...
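The first run-level plot boils down to a per-run mean computation; a pure-Python stand-in (the data shape is an assumption — in the app, bin contents would come from the Run Histogram models):

```python
# Sketch: per-run mean of a given histogram, ready to feed into a static
# plot or an altair chart. Input shape is illustrative.


def per_run_means(histograms_by_run):
    """histograms_by_run: {run_number: [bin contents]} -> {run_number: mean}."""
    return {
        run: sum(bins) / len(bins)
        for run, bins in sorted(histograms_by_run.items())  # sorted = time order
    }
```

Plotting the dict's values gives the distribution; plotting them against the sorted run numbers gives the time series.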

Complete the url patterns / corresponding views

The goal is to have a sensible URL pattern for browsing through runs / lumisections.

The preliminary pattern could be:

  • data_taking_objects/
    • runs/ [lists all runs available - ALREADY EXISTS]
      • run/id [shows information relative to a given run]
    • lumisections/ [lists all lumisections available]
      • run/lumisection/id [shows information relative to a given lumisection]

Add link to histogram urls from navbar

The goal is to be able to access some information about run and lumisection histograms from the navbar. For now, a simple list of the available variables is enough; more information and visualizations (time series, ...) will be added later.

Switch from Travis CI to GitHub Actions

The goal is to move away from Travis CI and have a more modular CI setup using GitHub Actions.

Preliminary version should run:

  • Python linting using flake8
  • unit tests
  • a dummy functional test

[Histogram Data Files] Add caching to viewsets to improve performance

Currently, the front-end queries the API every 5 seconds, leading to constant use of the ModelSerializer, which involves a lot of background function calls and high CPU usage.
Instead of writing custom serializers, caching the replies for 1 minute should be a reasonable tradeoff of latency vs performance.

Should we aggregate "/api" endpoints?

Currently, each app has a separate /api endpoint mounted on its URL (/lumisectionHistos1D/API, /lumisectionHistos2D/API).

Wouldn't it be clearer if all endpoints were under a common /api/ part of the URL?

E.g. (/api/lumisectionHistos1D/, /api/lumisectionHistos2D/).

Create a simple interface to interact with available CSV files

Description

Find a way to render an HTML page which:

  • Displays all CSV files found in the root directory where the DQM files reside (see how the FilePathField in admin creates a dropdown)
  • Displays status of files in regards to the Database (have they been stored? Have they been parsed to completion?)
  • Allows the user to initiate the management command to parse them

Technical details

  • forms.py does not seem like a good fit for this; however, the FilePathField seems useful (https://docs.djangoproject.com/en/4.0/ref/forms/fields/#filepathfield)
  • In order to check the status of each data file in the database, on each page refresh the file list should be cross-checked with the entries in the HistogramDataFile table. Only files that have been detected in a specific directory will be displayed. To avoid excessive file operations, there could be a separate management command which scans the root directory and fills the HistogramDataFile table automatically (e.g. python manage.py discover_dqm_files)
  • Each listed file should present its parse status, and buttons to start its parsing. An API for this could prove useful, and could be queried periodically to show the current status of the file list. A simpler method would be a Django form.

Concept: [mockup image attached to the original issue]
