kernelci / kcidb Goto Github PK

kernelci.org common database tools

License: GNU General Public License v2.0

Python 83.09% Dockerfile 0.03% Jinja 2.67% Shell 14.21%

kcidb's Introduction

KCIDB

KCIDB is a package for submitting and querying Linux Kernel CI reports, coming from independent CI systems, and for maintaining the service behind that.

See the collected results on our dashboard. Write to [email protected] if you want to start submitting results from your CI system, or if you want to receive automatic notifications of arriving results.

See our guides for more information:

kcidb's People

Contributors

Stargazers

Watchers

kcidb's Issues

Invalid messages arrive into the new queue

It seems there's a percentage of invalid messages being pushed to the "kcidb_new" queue. And it looks like they're empty. This needs investigation. The source needs to be identified and notified.

jsonschema ignores "format": "date-time"

BigQuery doesn't seem to understand date-only timestamps when importing "JSON" using Python libraries. This manifests in the following entry in Google Cloud Function logs:

{
  "textPayload": "Traceback (most recent call last):\n  File \"/user_code/kcidb/db/__init__.py\", line 380, in load\n    job.result()\n  File \"/env/local/lib/python3.7/site-packages/google/cloud/bigquery/job.py\", line 812, in result\n    return super(_AsyncJob, self).result(timeout=timeout)\n  File \"/env/local/lib/python3.7/site-packages/google/api_core/future/polling.py\", line 130, in result\n    raise self._exception\ngoogle.api_core.exceptions.BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v1.py\", line 402, in run_background_function\n    _function_handler.invoke_user_function(event_object)\n  File \"/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v1.py\", line 222, in invoke_user_function\n    return call_user_function(request_or_event)\n  File \"/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v1.py\", line 219, in call_user_function\n    event_context.Context(**request_or_event.context))\n  File \"/user_code/main.py\", line 172, in kcidb_load_queue\n    DB_CLIENT.load(data)\n  File \"/user_code/kcidb/db/__init__.py\", line 384, in load\n    ])) from exc\nException: ERROR: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.\nERROR: Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0\nERROR: Error while reading data, error message: JSON parsing error in row starting at position 0: Couldn't convert value to timestamp: Could not parse '2020-09-13' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]] or YYYY/MM/DD HH:MM[:SS[.SSSSSS]] Field: discovery_time; Value: 2020-09-13\n\n",
  "insertId": "000000-34da1c79-def7-4674-a282-502b43a45b8f",
  "resource": {
    "type": "cloud_function",
    "labels": {
      "region": "us-central1",
      "project_id": "kernelci-production",
      "function_name": "playground_kcidb_load_queue"
    }
  },
  "timestamp": "2020-09-18T15:21:01.906Z",
  "severity": "ERROR",
  "labels": {
    "execution_id": "j41pbhj561th"
  },
  "logName": "projects/kernelci-production/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
  "trace": "projects/kernelci-production/traces/1374c54dd63048dfc81848d99ac79ed1",
  "receiveTimestamp": "2020-09-18T15:21:02.260092087Z"
}

A possible fix might be either requiring a full timestamp or padding incomplete timestamps in KCIDB.

Upgrade google-cloud-secret-manager dependency

The latest release of the google-cloud-secret-manager package breaks the existing KCIDB code, and was pinned to versions <2.0.0dev.

Fix the incompatibilities and unpin the package. Or perhaps re-pin to <3.0.0dev.

Consider supporting streaming JSON input and output

As the database grows it would soon become unfeasible to load it completely into memory when dumping/loading. For that matter consider supporting one of the JSON streaming protocols to allow incremental loading/dumping, as well as incremental submission.

The latter could help us with #88, as the submitter would then be able to start a single kcidb-submit and just keep piping the objects there, as it keeps the authentication and connection active.

Do not upgrade data in `kcidb.io.merge()` and `kcidb-merge`

Instead of upgrading the data to the latest schema in kcidb.io.merge() and kcidb-merge, move merging to methods in corresponding schema version objects.

OO: Switch to storing objects in lists

Right now the OO data representation stores objects in ID->object dictionaries. This is unnecessary past creation time, loses data (even though spurious), and loses test order.

Consider switching to storing OO objects in lists, in the same order as the source I/O data.

Consider supporting validating against a specific version with kcidb-validate

Handle BigQuery loading failures in Cloud Functions

BigQuery loading can fail sometimes. E.g. due to insufficient schema validation described in #108 data could be accepted into the queue, but rejected by BigQuery. In this case it could be beneficial to log the problem and acknowledge the broken pub/sub message, so it's removed from the queue. Otherwise it will keep getting picked up by the Cloud Functions and keep failing forever.

One way to do this could be by dividing the pulled list of messages into two, in case of such a failure, and retrying the loading until the culprit(s) are found, logged and ACKed, and the rest of the messages is loaded and ACKed. This should observe the load job flow control. I.e. dividing should only be done until the first load job succeeds, and the rest of retrying should be left to the queue mechanisms.

Remove unnecessary data upgrading

Leave upgrading to the user and specify in the function documentation whenever the data is supposed to adhere to the latest schema (most of the time, likely). Add assertions verifying it's the latest version. Remove upgrading from those functions.

Provide a way to install I/O schema module only

The KCIDB package now has a lot of dependencies, among which jq.py dependency is the most problematic, as it requires building (at least until, and if, upstream adopts stream parsing).

To minimize the dependency impact on programs which only care about validating the schema, provide a way to install only the modules required for that. Either provide a way to install a sub-package from the same repo, or create a separate repo and package with those.

Add more schema documentation

Even though the schema is well-commented, it's still not describing some key design elements. Consider either embedding more documentation into the schema itself, or writing a separate document. Particularly including:

Test classification tree
Summarizing test runs
Avoiding record duplicates
Dealing with duplicates

Consider making kcidb-validate output validated reports

It could be useful to make kcidb-validate output reports after they were validated, similar to other validation tools. This could help the user see which exact report was valid and which wasn't, when it aborts on validation failure.

Make kcidb-count and kcidb.io.get_obj_num() work with any schema version

At the moment kcidb.io.get_obj_num() and as a consequence kcidb-count only accept data complying with the latest schema version. Make them accept any version. E.g. by adding the functionality as a schema version's count() method.

Wait until messages publish before exiting kcidb-submit

To make sure kcidb-submit returns correct status wait until all Pub/Sub messages have published and check their status.

See https://googleapis.dev/python/pubsub/latest/publisher/index.html#batching

Pick up abandoned notifications

At the moment posted notifications are picked up with a "new Google Cloud Firestore document" trigger to a Google Cloud Function. That is only triggered once upon posting the notification. If for whatever reason the kcidb_send_notification function failed after picking up that notification, and before sending it, and the next retry was within the pick up timeout, the function is going to succeed and is never going to be retried.

Consider adding a "cron job" going over all unsent notifications and sending them to wipe that up.

An alternative could be perhaps making the kcidb_send_notification function fail if it detects a picked-up notification, which wasn't sent, so that it could be retried until either the pick up timeout expires or the notification is sent.

KCIDB exceeds BigQuery interactive query limits

BigQuery has a limit of 100 concurrent "interactive" queries per Google Cloud project. We've been exceeding that, now that syzbot is sending its results to the playground instance. This was especially seen during notification generation, so as a temporary measure I disabled retrials there, to give up notification generation after first failure and thus avoid continuing to spam the database with queries.

The proper fix for this would be adding a less-scalable, but more-interactive database as our notification and dashboard data source, with recent data only, while keeping pushing all data to BigQuery.

Other temporary measures could be minimizing the number of queries in the notification generation. E.g. by pushing bigger, bundled, and not smaller, original messages there. We could also review the number of queries we're doing there.

Verify all submitters comply with the fully-enforced schema

Since we're actually enforcing JSON schema's format fields, it could be that some submitters actually weren't complying. In fact, KernelCI seems to be sending invalid timestamps.

Make sure all incoming data complies before deploying v8 release.

Implement coherent summarizing of revision/build/test status

Right now we have disparate status values across revisions/builds/tests, and we don't have a uniform way to summarize them onto one value, saying e.g.: "this revision is OK", accumulating all the build and test statuses.

Come up with the logic and the terminology to summarize those.

Provide feedback on failed submissions

We need to let submitters know if their submission failed to process. One way to do that could be to accept an e-mail address in queue message metadata and send the failures there in the processing step.

Implement integration tests for command-line tools

We've changed the command-line tools considerably since the last release. Implement basic integration tests for them, or at least for the most important ones.

Consider supporting upgrading to a specific version with kcidb-upgrade

The kcidb-upgrade tool can be used for upgrading the I/O data schema. However, it only supports upgrading from any known version to the latest version. Sometimes, though, it could be useful to upgrade to an older version. E.g. to see what an older version of KCIDB would do, or to debug an upgrading issue.

Consider adding support for a command-line option accepting the major version number of a target schema to kcidb-upgrade. The option should default to the latest version, if unspecified.

Generate notification messages with HTML version

Some e-mail viewing software may choose to display our notification messages in variable-width fonts, messing up the formatting. The most prominent example is groups.io which we use to host the maillist with test reports. Consider adding an HTML version of the report, being just the text version wrapped in <pre></pre> element.

Figure out a way to correlate revisions coming from different CI systems

More than one reporting CI system can discover the same revision, but report them with its own ID. This leads to builds and tests attached to separate revision objects, prevents correlation, and would produce duplicate notifications for developers.

The likely solution is to agree on a common way of formatting revision IDs, and moving the origin from it to a separate field (again).

Add submit/query tests

Implement tests submitting and querying certain data to/from BigQuery. Consider using dynamically-generated, unique (as in UUID) dataset names or table name prefixes to avoid clashes with parallel-running tests.

The tests should ensure that valid data could be both submitted and queried and stays the same afterwards.

Reduce overhead when submitting small amounts of data to KCIDB

Submitting test results from one job currently takes about 1s. For example, from the kernelci-backend Celery log:

[ INFO/MainThread] Task kcidb-tests[44910df2-393d-457d-8c56-198d2bfcf54c] succeeded in 1.19923461694s: ObjectId('5ecfb49e5841ebffd3c9164f')

These tasks all appear to last between 0.8s and 1.4s. That seems like quite a long time given the small amount of data involved in each set of results, and it doesn't vary with the size of the data. Some test jobs only have 5 test case results and take the same amount of time.

While it's possible to do things in kernelci-backend, such as buffering the data to submit less often or keep a process alive with a connection always open, it seems like any new submitter would face the same problem. So it would be beneficial to have it solved once in KCIDB rather than in each system that submits some data.

Add complete HOWTO for starting to submit data

To make kcidb and common database more appealing, add a complete HOWTO to go from zero to submitting a minimal amount of data, to adding more data. The HOWTO should help companies start submitting data with minimum effort.

Implement a tool generating notification messages for given new data

Implement a command-line tool generating report messages provided new data, and existing data (either in the database or in JSON). This will help us test and debug report generation.

Consider implementing interactive submission/publishing

To improve performance, currently messages are batched before submitting to a message queue. Consider adding support for "interactive" submission so they're sent as soon as a report is read. This could be useful for KernelCI, provided performance is good enough.

As an alternative, implement a completion callback printing the message submission ID, but ensuring order. So at least we don't build up a huge array of futures and only print the IDs on exit.

Upgrade to kcidb-io v2

The kcidb-io v2 release should fix a number of issues. Upgrade when it's completed.

Consider replacing kcidb-upgrade and kcidb-merge with kcidb-cat

Since kcidb tools always output upgraded schema, consider simply renaming kcidb-merge to kcidb-cat (similar to the cat command) and dropping kcidb-upgrade.

Consider letting subscriptions generate the whole message

Right now subscriptions generate only the beginning of both the message Subject and the body, and the library adds the object summary and the description. This makes it a little inflexible, and more importantly unclear how the complete message is formed. Since getting the summary and the description of a report object is so easy, consider just letting the subscription take them and put them wherever they like.

Upgrade google-cloud-pubsub dependency

The latest release of the google-cloud-pubsub package breaks the existing KCIDB code, and was pinned to versions <2.0.0dev.

Fix the incompatibilities and unpin the package. Or perhaps re-pin to <3.0.0dev.

Consider never copying I/O data in functions, require user to copy when necessary

At the moment many functions taking I/O data copy it by default and have an optional argument disabling that. While that is good for correctness, it's also potentially wasteful in memory and CPU, and requires more code, considering we mostly don't need copying.

Consider instead stating on each function interface that data can/will be modified in place, and providing an official copying function.

Make kcidb-schema output older schema versions on request

Warn or optionally abort on encountering unknown test names

To solicit usage of common test identifiers, implement producing a warning or aborting on encountering unknown test names when submitting reports.

Abort (and possibly warn) only when a special option (value) was supplied to kcidb-submit command.

Require patch_mboxes field

To avoid ambiguity, consider requiring revisions in I/O data to always have the patch_mboxes field (even if empty).

Disable internal schema validation by default

At the moment KCIDB code is peppered with assertions validating supplied JSON data against the schema. While good for maintaining correctness, that has a large performance toll. Disable the internal schema validation by default, but provide a way to enable it. E.g. with an environment variable.

The assertions could then look like this:

assert not kcidb.misc.extra_assertions or io.schema.is_valid(data)

And the environment variable could be set like KCIDB_EXTRA_ASSERTIONS=True

Requires #113.

kcidb.io.merge() is too slow

The kcidb.io.merge() function is inexcusably slow, most likely due to a lot of schema validation happening.
See how we can reduce number of validations, both with assertions enabled and disabled.

Support pulling more than one message with kcidb-mq-subscriber-pull

At the moment, the kcidb-mq-subscriber-pull tool supports pulling one message per execution only. Now that we have JSON streaming support, implement support for pulling a specified number of messages within a specified timeout, and outputting them as a JSON stream.

Make OO representation more straight-forward

The current implementation of OO representation is trying to be smart and save some code lines, but to make it better documented and obvious it needs to be simpler and more straight-forward.

One approach could be explicitly defining a class for each JSON object node, and explicitly specifying and copying each field, as well as documenting them, at the expense of duplicating schema information (similar to the DB schema), compensated by tooling verifying nothing is misspelled or anything extra is introduced, yet allowing for the OO representation to lag behind the I/O schema.

Rename git_repository_commit_hash to git_commit_hash

Rename the revision's git_repository_commit_hash property to just have git_commit_hash to have easier time typing and reading it.

Quoting @gctucker:

Each hash is really bound to each commit rather than the repository, and commits only exist in repositories.

db_schema: Consider making test status an integer

We need a simple way for selecting and grouping tests based on status, and strings don't help with sorting. At the moment the Grafana dashboard prototype has to convert strings to numbers before doing any operations and that not likely to be efficient.

Consider switching the tests status in the database schema to numbers and converting on submission/querying.

Add explicit schema validation upon reading data in command-line tools

At the moment some command-line tools don't have explicit schema validation and rely on specific functions to do it, such as kcidb.io.schema.upgrade(). Instead, add explicit validation to those tools, thus allowing the mandatory validation to be removed from other functions.

Explicitly mention how to omit data from submission

The documentation might not be clear enough on how to omit data from submissions. We need to make sure people understand that they need to simply omit properties, instead of sending null, or "". A good place for that could be the SUBMISSION_HOWTO.md.

Consider renaming "subscriptions" to "subs"

Really, "subscriptions" might be a little too long.

Explain how to deploy extra Cloud Functions with modified environments

Generate "virtual tests" for missing test tree nodes in OO representation

To simplify summarizing test results, generating reports, and matching test results in subscriptions, generate "virtual tests" with summarized status for missing test tree nodes, when generating the OO representation of report data.

E.g. generate the root (empty path) node, taking stock of all test results, or e.g. generate the "kselftest" status, if only "kselftest.kvm" is reported (the case for CKI results). This is especially important for summarizing testing in progress.

Document the raw PubSub submission interface

Not everyone who wants to submit their reports uses Python, or is able to run the command-line tools, as conversation with Dmitry Vyukov has shown. Document the raw PubSub submission interface in the SUBMISSION_HOWTO.md to support such cases.

Standardize architecture names

At the moment architecture names are not standardized, and we have both aarch64 and arm64. Decide which names we should use, and document that in the schema.

Implement bundling new data before loading into the database

Implement bundling new data coming from the message queue together up to a time/size limited extent to avoid exceeding load job limits.

We can use Firestore or something else to store the incoming data until it could be loaded. Note that Firestore has its own limits on document size and ultimately we might need to pare the incoming data down for Firestore. E.g. down to a single object per document.

kernelci / kcidb Goto Github PK

kcidb's Introduction

KCIDB

kcidb's People

Contributors

Stargazers

Watchers

Forkers

kcidb's Issues

Recommend Projects

Recommend Topics

Recommend Org