
elpiscloud's People

Contributors

aviraljain99, dependabot[bot], harrykeightley, mattchrlw, nicklambourne


elpiscloud's Issues

Make frontend api calls fail gracefully.

For example, within lib/api/files.ts, getSignedUploadURLs and getUserFiles are async methods that make network calls such as fetch and getDocs. These operations can fail, and the resulting errors (e.g. HTTP errors) need to be handled properly.

Similarly, the call sites of lib/api/files.ts, such as the dataset-loading logic in components/datasets/DatasetViewer.tsx, depend on this error handling being done correctly.

`libsndfile` C library missing in cloud functions, which causes the soundfile package to fail.

  • A new sound library needs to be found for performing conversions/audio processing in the cloud functions.
  • Currently we're using soundfile, but it relies on the underlying system having libsndfile installed (which cloud functions don't have, as far as I can see).
  • Librosa also uses soundfile for its audio I/O, so probably neither of these will work.
  • After this, the audio utilities in the cloud functions folder will need to be rewritten.
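Until a replacement library is chosen, one interim mitigation is to guard the import so a missing libsndfile surfaces as a clear runtime error when the audio utilities are called, rather than crashing the whole function at import/deploy time (as in the log dump below). This is only a sketch; `read_audio` is a hypothetical wrapper, not an existing elpiscloud function.

```python
# Guard the soundfile import: a missing libsndfile should produce a clear
# error at call time, not break module import for the entire deployment.
try:
    import soundfile as sf
except (ImportError, OSError):  # soundfile raises OSError('sndfile library not found')
    sf = None


def read_audio(path):
    """Hypothetical wrapper around soundfile.read with a clearer failure mode."""
    if sf is None:
        raise RuntimeError(
            "soundfile/libsndfile is unavailable in this runtime; "
            "audio processing needs a replacement library"
        )
    return sf.read(path)
```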

Log dump:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/bin/functions-framework", line 8, in <module>
    sys.exit(_cli())
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/functions_framework/_cli.py", line 37, in _cli
    app = create_app(target, source, signature_type)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/functions_framework/__init__.py", line 288, in create_app
    spec.loader.exec_module(source_module)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/workspace/main.py", line 4, in <module>
    from functions.datasets.process_file import process_dataset_file
  File "/workspace/functions/datasets/process_file.py", line 9, in <module>
    import utils.audio as audio
  File "/workspace/utils/audio.py", line 3, in <module>
    import soundfile as sf
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/soundfile.py", line 142, in <module>
    raise OSError('sndfile library not found')
OSError: sndfile library not found

`process_model` can't serialize the Model status enum.

The process_model cloud function fails when trying to serialize the model status enum, as seen in the logs below.

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/functions_framework/__init__.py", line 171, in view_func
    function(data, context)
  File "/workspace/functions/process_model.py", line 31, in process_model
    publish_to_topic(PUBSUB_TOPIC, [model.to_dict()])
  File "/workspace/utils/pubsub.py", line 33, in publish_to_topic
    serialized = json.dumps(obj)
  File "/opt/python3.10/lib/python3.10/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/opt/python3.10/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/python3.10/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/python3.10/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type TrainingStatus is not JSON serializable
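A common fix for the error above is to pass a `default` hook to `json.dumps` that converts enum members to their underlying values before publishing. The sketch below assumes illustrative `TrainingStatus` values, not the actual elpiscloud enum:

```python
import json
from enum import Enum


class TrainingStatus(Enum):
    # Illustrative values; the real elpiscloud enum may differ.
    WAITING = "waiting"
    TRAINING = "training"
    FINISHED = "finished"


def serialize_default(obj):
    """json.dumps hook: represent enum members by their underlying value."""
    if isinstance(obj, Enum):
        return obj.value
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = json.dumps({"status": TrainingStatus.TRAINING}, default=serialize_default)
# → '{"status": "training"}'
```

Alternatively, declaring the enum as `class TrainingStatus(str, Enum)` makes its members plain strings that `json.dumps` handles natively, with no custom hook needed.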

Trainer topic subscription idempotency.

The trainer subscription can fire the same training event multiple times, which would start multiple training runs on Cloud Run (wasting resources, overwriting models, etc.).

This is not a priority to fix at this point, but something to think about.

Copying over a reply below from Nick.

Maybe I just put it as a flag on the firestore Model that training has begun?

If Firebase has atomic transactions this might work; otherwise you're just going to have the same race condition in a different place. The ideal scenario is to have idempotent training jobs, where the end state isn't affected by multiple runs (either because the job realises that a training run with the given inputs has already occurred and short-circuits, or it does the training again in a non-destructive way; both of these typically involve hash comparisons between inputs/outputs). This isn't easy to pull off, and probably isn't a priority at the moment, but something to think about.

Originally posted by @nicklambourne in #89 (comment)
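As a sketch of the hash-comparison idea above, a training job could compute a deterministic fingerprint of its inputs and short-circuit when a completed job with the same fingerprint already exists. `job_fingerprint` is a hypothetical helper; the Firestore lookup that would use it is left out.

```python
import hashlib
import json


def job_fingerprint(dataset_files, options):
    """Deterministic hash of a training job's inputs (hypothetical helper).

    Two invocations with the same files and options yield the same
    fingerprint regardless of file ordering, so a worker can check for
    an existing completed job with this fingerprint and skip retraining.
    """
    payload = json.dumps(
        {"files": sorted(dataset_files), "options": options},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Firestore does support transactions, so the simpler "flag on the Model" approach could also be made atomic by reading and setting the flag inside one transaction.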

Sign in should redirect to the home page

Router improperly redirects on sign-in.

Currently, upon signing in to elpis.cloud, the router pops the last page off the stack, which can have the effect of navigating off the site.

We want the router to redirect to the home page instead after users log in.

Front-end view for creating a training job

TODO

  • When users visit /train, they should first of all be prompted to select a dataset to train with.
  • If the user has datasets but they are not yet processed, they should still see datasets, but there should be something to inform the user that they cannot select this until processing has completed.
  • If there are no datasets at all for a certain user, there should be an option to navigate to the datasets page to create one.
  • After selecting a dataset, the user should be provided with training options similar to the desktop version of elpis.
  • Once the options and dataset have been selected, the frontend should create a new training job in Firestore with information about the date, the dataset involved, and the selected options.

Model duplication across frontend, services and cloud functions.

Essentially we'll have four representations of models like Dataset, Model, etc. that could get out of sync in the future.

  • We could write some integration tests.
  • Or we could find some way of writing the model definitions once and propagating them through all parts of elpiscloud.

Pin cloud function dependencies

The cloud function dependencies aren't currently pinned at any version, which has the potential to create problems down the track if there are breaking changes (as outlined by Nick in a previous PR).

So essentially we just need to go through the requirements.txt file and find a set of compatible versions to freeze at.

Change tagging CSS to use Tailwind CSS

globals.css contains CSS for the tagging functionality; this needs to be changed to use Tailwind CSS. The tagging UI also needs to be changed to use a cross icon rather than the rotated plus sign it is using at the moment.

Trainer subscription broke during deploy with incompatible options

Arose from #86

This wasn't caught at the Terraform planning stage, but exactly-once delivery and push config are apparently incompatible options on a Pub/Sub subscription, and the build pipeline failed with:

│ Error: Error updating Subscription "projects/elpiscloud/subscriptions/trainer_subscription": googleapi: Error 400: A subscription cannot have push config or bigquery config set with exactly once delivery configured.
│ 
│   with module.trainer.google_pubsub_subscription.subscription,
│   on ../../modules/service/main.tf line 40, in resource "google_pubsub_subscription" "subscription":
│   40: resource "google_pubsub_subscription" "subscription" {

Resampling done improperly during dataset preprocessing.

After issues with the soundfile library (#71), I tried to rewrite the audio utilities using the Python built-in wave module.

When I did so, I assumed that resampling just meant modifying the sample_rate metadata on the audio file, which has caused some odd bugs with the processed dataset files.

The processed dataset files are nothing like what was expected, and instead of writing a custom resampling algorithm, we should find a better external audio utility library.
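For reference, actual resampling has to interpolate new sample values at the target rate, not just rewrite the header. A minimal linear-interpolation sketch (assuming mono float samples) illustrates the difference; it is only a stopgap compared to a proper library, since real resamplers low-pass filter first to avoid aliasing when downsampling:

```python
import numpy as np


def resample(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler for mono audio (sketch only).

    Computes how many output samples the same duration needs at the new
    rate, then interpolates sample values at the new time points.
    """
    samples = np.asarray(samples, dtype=np.float64)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    src_times = np.arange(len(samples)) / src_rate
    dst_times = np.arange(n_out) / dst_rate
    return np.interp(dst_times, src_times, samples)
```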

Header navbar links are not proper links

  • Header links are <li> elements wrapped in a next/link component.
  • When right-clicking one of the nav links, we don't get the traditional browser options like "open in new tab", etc.

Show user a view of all files in a given dataset

Currently, DatasetViewer.tsx shows the user a list of all their datasets. Each row should contain a button to 'view' that dataset, which should open a simple view listing all of the user's files that belong to the dataset.

Refactor the elan processing/cleaning scripts taken from desktop elpis.

Not high priority, but the scripts taken from the desktop version could, and probably should, be improved in the following ways (stolen from @nicklambourne):

  • Dataclasses for utterances
  • Smaller functions
  • Significantly fewer parameters per function
  • Removing questionable default parameters ("", {None})
  • Replacing prints with calls to logging
  • Actioning or removing TODOs
  • Simplifying the "clean" function logic
  • Separating out the English-removal functionality.

Trainer pubsub subscription retries thousands of times before getting a response.

There's an error that occurs while running the trainer service, where some file isn't found where it's expected.

  • Pub/Sub messages are 'outstanding' until the subscriber sends an ack response.
  • If no ack response is received within a set time (the ack deadline, which defaults to 10 seconds), delivery is retried until it succeeds.
  • Currently the trainer service waits for the entire training run to complete before returning a response.
  • This means that although the Cloud Run service was running, the Pub/Sub subscription had no way to verify that its request had been received, and kept resending requests to the Cloud Run server.
  • Obviously this is a massive issue if we're paying for big compute resources to run the Docker images, and they continue to run forever because they get spammed with requests.
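One mitigation is to acknowledge the push request immediately and run training in the background; the stdlib-only sketch below shows the shape of that (`handle_push` and `run_training` are hypothetical). Two caveats: Cloud Run may throttle CPU for background threads once the response is returned, so handing the job to a queue such as Cloud Tasks may be a better fit, and simply raising the subscription's ack deadline is another (bounded) option.

```python
import threading


def run_training(job):
    # Placeholder for the long-running training work.
    job["done"] = True


def handle_push(job):
    """Ack the Pub/Sub push immediately; train in the background (sketch)."""
    worker = threading.Thread(target=run_training, args=(job,), daemon=True)
    worker.start()
    # Returning a 2xx promptly acknowledges the push delivery,
    # so Pub/Sub stops retrying while training continues.
    return 204, worker


job = {}
status, worker = handle_push(job)
worker.join(timeout=1)
```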

Visible user uploaded files at /files should not be limited to wav and elan.

  • Currently the files viewer at /files only shows uploaded Elan and wav files, which can be confusing if you've also uploaded txt files such as pronunciation dictionaries.
  • We should make the uploaded files section look similar to the dataset file selector, showing every file in a list view instead of segmenting it into multiple sections.

Refactoring resource names in Terraform scripts

Instead of using hyphens in resource names in Terraform, underscores should be used. For example, in architecture/modules/frontend_bucket/main.tf, google_storage_bucket is named static-site. This will be changed to static_site.

Dataset cloud function preprocessing should handle improper datasets.

Currently the preprocessing workflow blindly attempts to create processing jobs from the files supplied in a dataset.
This would fail if the user selected improper files, not enough files etc.

Instead, after forming a dataset object from the incoming event, the process_dataset function should check dataset validity before batching and pushing jobs to the pubsub queue.

If the dataset is found to be invalid, we should indicate this on the dataset model in Firestore (e.g. by setting its status to an error value).
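A sketch of what such a validity check might look like, assuming (purely for illustration) that a valid dataset pairs each .eaf transcription with a same-named .wav recording; the real rules would live in process_dataset:

```python
from pathlib import Path


def validate_dataset(file_names):
    """Return a list of validity errors; an empty list means valid.

    The pairing rule (each .eaf needs a same-named .wav and vice versa)
    is an illustrative assumption, not elpiscloud's actual rule set.
    """
    errors = []
    if not file_names:
        errors.append("dataset contains no files")
    stems = {}
    for name in file_names:
        path = Path(name)
        stems.setdefault(path.stem, set()).add(path.suffix.lower())
    for stem, suffixes in sorted(stems.items()):
        if ".eaf" in suffixes and ".wav" not in suffixes:
            errors.append(f"{stem}.eaf has no matching .wav")
        if ".wav" in suffixes and ".eaf" not in suffixes:
            errors.append(f"{stem}.wav has no matching .eaf")
    return errors
```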

On the files page, users should have the option to add tags to their files.

Tags should be used to organise the user uploaded files in firestore.

Each user file in firestore currently has an empty list of tags that a user should be able to update from the frontend. These tags should appear when viewing the files a user has uploaded, e.g. from the files page or from the create-new-dataset page. They allow users to filter files not only by name but also by other criteria, should they so choose.
