cmstrackerdpg / mlplayground

Web application for collaborating to build ML models for Tracker DQM/DC

Python 14.35% CSS 0.13% HTML 6.23% Shell 0.37% Jupyter Notebook 34.82% JavaScript 43.94% Dockerfile 0.16%

mlplayground's Issues

Histogram manager breaks when histogram name filter contains a forward slash `/`

Hi,

I just discovered that the histogram manager breaks when a histogram name containing a forward slash / is selected in a filter. The URL below demonstrates the error message.

https://ml4dqm-playground.web.cern.ch/histograms/lumisections_2D/list/?title=EcalBarrel%2FEBOccupancyTask%2FEBOT+digi+occupancy+EB%2B10

My initial explanation is that, since each row of the histogram manager links to the visualiser through the URL pattern ['visualize/(?P<runnr>[0-9]+)/(?P<lumisection>[0-9]+)/(?P<title>[^/]+)/\\Z'], a histogram title containing a forward slash breaks the `[^/]+` group.

To fix this, we need some way to "sanitise" histogram titles so that they do not contain forward slashes that break the URL pattern defined in visualize_histogram/urls.py.

Thanks,
Vichayanun
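One possible fix, sketched below under the assumption that titles can be encoded reversibly (the `SLASH_TOKEN` placeholder and function names are invented for illustration, not project code): replace forward slashes before building the link and restore them in the visualiser view. An alternative would be Django's `path:` converter, which matches slashes.

```python
# Hypothetical title-sanitising helpers; SLASH_TOKEN is an invented
# placeholder, not something that exists in MLplayground.
SLASH_TOKEN = "__SLASH__"


def sanitize_title(title: str) -> str:
    """Encode '/' so the title matches the [^/]+ group in the URL pattern."""
    return title.replace("/", SLASH_TOKEN)


def desanitize_title(sanitized: str) -> str:
    """Invert sanitize_title() inside the visualiser view."""
    return sanitized.replace(SLASH_TOKEN, "/")
```

The round trip is lossless as long as real titles never contain the placeholder itself.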

Expose functionality to initiate discovery of DQMIO files

Currently, the only way to initiate the discovery of new DQMIO files is to run the discover_dqm_files management command.

This should be executable via the API as well.

This functionality could be limited to staff.

Optionally, a button may also be shown in the Histogram File Manager page.

Combine Lumisection Histogram apps

  • Common fields for LumisectionHisto1D and LumisectionHisto2D are to be combined into a common base model (LumisectionHistoBase) that both LumisectionHisto1D and LumisectionHisto2D will inherit from.
    • lumisection
    • date
    • title
    • data
    • entries
    • x_min
    • x_max
    • x_bin
    • source_data_file

Create a Task model to aggregate sets of data used for training and testing models

The Task will be a "container" which will group data so that it can be serialized and used by users for training and testing their models.

It should include:

  • Possibly a ManyToMany relationship with:
    • 1D Lumisection Histograms and/or
    • 2D Lumisection Histograms and/or
    • Run Histograms

The main point to keep in mind is that a Task should contain multiple histograms, and a histogram can belong to many different Tasks.

@Abhit03 @XavierAtCERN please verify whether a Task should aggregate a single type of histogram or a mix of them

Refactor frontend

The goal is to have a frontend stable enough to start developing functional tests.

  • clean landing page
  • make access to tabs conditional on user authentication
  • add major apps to navbar (with dropdown for Run/Lumisection?)
  • add API link at the bottom of apps

Find alternative ways to offload resource-heavy and time-consuming tasks

Description

Currently, on each parsing command issued for a HistogramDataFile, a new thread is spawned (from within Django) to parse the file.
Several issues with this approach:

  • Django carries the burden of parsing files, which it should not
  • The pod gets overloaded, sometimes even leading to pod restarts
  • There is no way to prevent multiple threads from being spawned for the same file, which wastes resources

Next steps:

  • Think about implementing a separate process which parses files, maybe even deployed on a separate pod, with its own API.
  • Create a new DB model where a queue can be implemented for the separate process to check for files to parse. This has the added advantage of being able to resume file parsing after a crash, as the entry in the queue has not been cleared.

Create new DB models for storing parsed histogram files

Histogram files (currently only in CSV format but also keeping future nanoDQM format in mind) are parsed via a management command.

For huge (>2GB) files, this may take a lot of time or may even crash. Having appropriate models for storing file parsing history/progress can help a lot with the app's bookkeeping.

Find out the cause of long DRF response delays on the `api/histogram_data_files/` endpoint

Description

Upon deployment, the ML Playground displays very long delays in rendering responses through the DRF API.

Initially, we were getting lots of Broken pipe messages, caused by the proxy cutting off the connection because the server did not respond in time. After increasing the proxy timeout to 180 s, we were able to proceed with debugging the issue.

Example 1

  1. Deploy the project on PaaS.
  2. Visit this endpoint
  3. Wait for it to load. It might take from 20 seconds up to 3 minutes to render (or even return a 504 error due to timeout).

Example 2

  1. Run the project locally, using the DBoD database (just like the deployed project does).
  2. Visit this endpoint
  3. The page is rendered in (at most) 2 seconds.

Things we've tried

  • Use different wsgi/asgi servers (no difference, always the same delay):
    • daphne with asgi under settings.py
    • gunicorn with wsgi under settings.py
    • Plain development server (the one that is not recommended for deployment) with manage.py runserver
  • Running the project locally, using the DBoD database that deployment uses. This works correctly, even when running it with daphne. Responses take up to 2 seconds, maximum, meaning that the database is not the problem.
  • Pod CPU usage only peaks in the range of 0.03 to 0.06 while rendering the response, so it's not a matter of CPU speed.
  • Disabling DEBUG. No change.
  • Deployed with extra debug logging in settings.py, so that we can see what's going on in Django's mind while rendering that. You can see below that the queries themselves are fast. A long time is spent between the last SELECT query and the start of the HTTP 200 response (~2.5 min).
10.76.10.1:34256 - - [11/Apr/2022:13:27:40] "GET /api/histogram_data_files/?format=json&page=7" 200 17015
DEBUG - 2022-04-11 13:27:40,170 - http_protocol - HTTP response complete for ['10.76.10.1', 34256]
DEBUG - 2022-04-11 13:34:22,979 - http_protocol - HTTP b'GET' request for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:34:22,990 - selector_events - Using selector: EpollSelector
DEBUG - 2022-04-11 13:34:23,118 - utils - (0.009) SELECT "django_session"."session_key", "django_session"."session_data", "django_session"."expire_date" FROM "django_session" WHERE ("django_session"."expire_date" > '2022-04-11T13:34:22.999654+00:00'::timestamptz AND "django_session"."session_key" = 'f3nubvfv7t7k6eltpk5e54ovzaijv89s') LIMIT 21; args=(datetime.datetime(2022, 4, 11, 13, 34, 22, 999654, tzinfo=datetime.timezone.utc), 'f3nubvfv7t7k6eltpk5e54ovzaijv89s'); alias=default
DEBUG - 2022-04-11 13:34:23,127 - utils - (0.005) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1 LIMIT 21; args=(1,); alias=default
DEBUG - 2022-04-11 13:34:23,136 - utils - (0.006) SELECT COUNT(*) AS "__count" FROM "histogram_file_manager_histogramdatafile"; args=(); alias=default
DEBUG - 2022-04-11 13:34:23,138 - utils - (0.001) SELECT "histogram_file_manager_histogramdatafile"."id", "histogram_file_manager_histogramdatafile"."filepath", "histogram_file_manager_histogramdatafile"."filesize", "histogram_file_manager_histogramdatafile"."data_dimensionality", "histogram_file_manager_histogramdatafile"."data_era", "histogram_file_manager_histogramdatafile"."entries_total", "histogram_file_manager_histogramdatafile"."entries_processed", "histogram_file_manager_histogramdatafile"."granularity", "histogram_file_manager_histogramdatafile"."created", "histogram_file_manager_histogramdatafile"."modified" FROM "histogram_file_manager_histogramdatafile" LIMIT 50 OFFSET 300; args=(); alias=default
DEBUG - 2022-04-11 13:37:01,528 - http_protocol - HTTP 200 response started for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,530 - http_protocol - HTTP close for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,531 - http_protocol - HTTP response complete for ['10.76.10.1', 34540]

Other notes

Even after a 504 error, the very next reload of the same page loads instantly (thanks to the 1-minute response cache), meaning the response had already been cached by the first request even though it was never returned to the client.

Summary

We need to 👽probe deeper👽 into what's going on inside Django between the end of the DB query and the start of the HTTP response.
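One way to probe that gap (a sketch, not project code): a profiling middleware that wraps each request in cProfile, which would show whether the missing ~2.5 minutes are spent in serialization, pagination, or somewhere else entirely.

```python
import cProfile
import io
import pstats


class ProfilingMiddleware:
    """Profile each request and print the top cumulative-time calls."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        profiler = cProfile.Profile()
        profiler.enable()
        response = self.get_response(request)  # the whole view + rendering
        profiler.disable()
        stream = io.StringIO()
        pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(20)
        print(stream.getvalue())  # in production, send this to the logger instead
        return response
```

Adding it (temporarily) to the top of `MIDDLEWARE` would profile everything below it, including DRF's serializer and renderer.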

Histogram visualizer breaks when histogram does not have x_min and x_max information

Hi,

The MLP frontend redesign has been live for a few hours, but I have already noticed a bug in the visualizer, with the following error log:

TypeError at /visualize/316218/941/num_clusters_ontrack_PXBarrel/
unsupported operand type(s) for +: 'NoneType' and 'int'

This error is caused by some histograms not having an x_bin attribute, which is required in this line: https://github.com/CMSTrackerDPG/MLplayground/blob/master/visualize_histogram/views.py#L62 Furthermore, those histograms also lack the x_min and x_max attributes, which are required for plotting as well. I checked the production DB and found that 268 136 out of 447 754 1D histograms are affected, more than half of the 1D lumisection database.

For reference, here is an example link that breaks the backend: https://ml4dqm-playground.web.cern.ch/visualize/316218/941/chargeInner_PXLayer_1/

Upon further investigation, I found that these histograms are parsed from three 2018 CSV files, all parsed on 30 Sep 2022. The exact addresses for these files are as follows:

/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_1.csv
/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_10.csv
/eos/project/c/cmsml4dc/ML_2020/UL2018_Data/DF2018A_1D_Complete/ZeroBias_2018A_DataFrame_1D_11.csv

According to the histogram file manager, these three files are the only ones parsed so far in the 2018A era. For now, I will re-parse them using the manager and see whether that fixes the issue.

Thanks,
Vichayanun
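Independently of re-parsing, the view could be made defensive about missing metadata. A hedged sketch (the function and the unit-bin fallback are illustrative, not the actual views.py code): treat missing x_min/x_max/x_bin as one unit-width bin per data point instead of raising a TypeError.

```python
# Hypothetical guard for missing binning metadata; attribute names come from
# the issue, the fallback policy is an assumption.


def bin_edges(x_min, x_max, x_bin, data):
    """Return bin edges, falling back to unit bins when metadata is missing."""
    if x_min is None or x_max is None or x_bin is None:
        # Fallback: one unit-width bin per data point.
        x_min, x_max, x_bin = 0.0, float(len(data)), len(data)
    step = (x_max - x_min) / x_bin
    return [x_min + i * step for i in range(x_bin + 1)]
```
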

Unknown users can sign up for a new account via "Login with CERN" button

Hi,

I have found a vulnerability in the MLP login page: unknown users can sign up for a new account as follows:

  1. Unknown user clicks on "Login with CERN" button.
  2. Instead of clicking "Continue" at the bottom, the user clicks on "Sign Up" link.
  3. The user sets up a username and password.
  4. The user now has an account for MLP.

Instead of taking users directly to the CERN single sign-on, the "Login with CERN" button redirects to an intermediate page provided by django-allauth, which offers CERN sign-on, GitHub sign-on, and local-account sign-in or sign-up. We have to find a way to configure django-allauth so that local sign-up is closed (for example, via a custom account adapter whose is_open_for_signup returns False) or so that the intermediate page is skipped entirely.

Thanks,
Vichayanun

Add minimal plots

Add the following minimal plots (either static or using altair)

Runs

  • distribution of the mean of a given histogram for all runs
  • time series of the mean for all runs

Lumisections

  • 1D histograms
  • 2D histograms
  • ...
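The first run-level plot boils down to a per-run mean computation; a pure-Python stand-in (the data shape is an assumption — in the app, bin contents would come from the Run Histogram models):

```python
# Sketch: per-run mean of a given histogram, ready to feed into a static
# plot or an altair chart. Input shape is illustrative.


def per_run_means(histograms_by_run):
    """histograms_by_run: {run_number: [bin contents]} -> {run_number: mean}."""
    return {
        run: sum(bins) / len(bins)
        for run, bins in sorted(histograms_by_run.items())  # sorted = time order
    }
```

Plotting the dict's values gives the distribution; plotting them against the sorted run numbers gives the time series.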

Complete the url patterns / corresponding views

The goal is to have a sensible URL pattern for browsing through runs / lumisections.

The preliminary pattern could be:

  • data_taking_objects/
    • runs/ [lists all runs available - ALREADY EXISTS]
      • run/id [shows information relative to a given run]
    • lumisections/ [lists all lumisections available]
      • run/lumisection/id [shows information relative to a given lumisection]

Add link to histogram urls from navbar

The goal is to be able to access some information about run and lumisection histograms from the navbar. For now, a simple list of the available variables is enough; more information and visualizations (time series, ...) will be added later.

Switch from Travis CI to GitHub Actions

The goal is to move away from Travis CI and have a more modular CI setup using GitHub Actions.

Preliminary version should run:

  • Python linting using flake8
  • unit tests
  • a dummy functional test

[Histogram Data Files] Add caching to viewsets to improve performance

Currently, the front-end queries the API every 5 seconds, leading to constant use of the ModelSerializer, which involves a lot of background function calls and high CPU usage.
Instead of writing custom serializers, caching the replies for 1 minute should be a reasonable tradeoff of latency vs performance.

Should we aggregate "/api" endpoints?

Currently, each app has a separate /api endpoint mounted on its URL (/lumisectionHistos1D/API, /lumisectionHistos2D/API).

Wouldn't it be clearer if all endpoints were under a common /api/ part of the URL?

E.g. (/api/lumisectionHistos1D/, /api/lumisectionHistos2D/).

Create a simple interface to interact with available CSV files

Description

Find a way to render an HTML page which:

  • Displays all CSV files found in the root directory where the DQM files reside (see how the FilePathField in admin creates a dropdown)
  • Displays status of files in regards to the Database (have they been stored? Have they been parsed to completion?)
  • Allows the user to initiate the management command to parse them

Technical details

  • forms.py does not seem like a good fit for this; however, the FilePathField seems useful (https://docs.djangoproject.com/en/4.0/ref/forms/fields/#filepathfield)
  • In order to check the status of each data file in the database, on each page refresh the file list should be cross-checked with the entries in the HistogramDataFile table. Only files that have been detected in a specific directory will be displayed. To avoid excessive file operations, there could be a separate management command which scans the root directory and fills the HistogramDataFile table automatically (e.g. python manage.py discover_dqm_files)
  • Each listed file should present its parse status, and buttons to start its parsing. An API for this could prove useful, and could be queried periodically to show the current status of the file list. A simpler method would be a Django form.

Concept: [mockup image attached to the original issue]
