

Experimental TUID Project

TUID is an acronym for "temporally unique identifiers". These are numbers that effectively track "blame" throughout the source code.

Branch Status
master Build Status
dev Build Status

Overview

This is an attempt to provide a high-speed cache for TUIDs. It is intended for use by CodeCoverage, mapping code coverage by TUID rather than by (revision, file, line) triples.

More details can be gleaned from the motivational document.

Running tests

Running any tests requires access to an Elasticsearch cluster for mo_hg on localhost:9200. This requires Elasticsearch version 6.2.4. To inspect the Elasticsearch cluster, you can use Elasticsearch-head, found here. The steps to run Elasticsearch differ by operating system; on Windows, do the following:

  1. Install elasticsearch.
  2. Now, you might have to copy the contents of elasticsearch-6.2.4.yml to <ES-INSTALLATION>/config/elasticsearch.yml. The default config should work, though.
  3. Open a command prompt and go to the bin folder in the elasticsearch installation.
  4. Run elasticsearch.bat to start the service - you should now be able to run the tests.
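
Before running the tests, a quick way to confirm the cluster is reachable is to hit the root endpoint. This is a minimal check, assuming the default port 9200 and Python 3:

# Sanity check that the local Elasticsearch (default port 9200) is up.
# Assumes Python 3; the version reported should be 6.2.4.
import json
from urllib.request import urlopen

with urlopen("http://localhost:9200") as response:
    info = json.load(response)

print(info.get("version", {}).get("number"))  # expect 6.2.4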

After cloning the repo into ~/TUID:

Linux

cd ~/TUID
pip install -r ./tests/requirements.txt
pre-commit install
export PYTHONPATH=.:vendor
export TUID_CONFIG=tests/travis/config.json
python -m pytest -m first_run --capture=no ./tests
python -m pytest -m 'not first_run' --capture=no ./tests

Windows

cd %userprofile%\TUID
pip install -r .\tests\requirements.txt
pre-commit install
set PYTHONPATH=.;vendor
set TUID_CONFIG=tests\travis\config.json
python -m pytest -m first_run --capture=no tests
python -m pytest -m "not first_run" --capture=no tests

Just one test

Some tests take a long time, and you may want to run just one of them. Here is an example:

For Linux

python -m pytest tests/test_basic.py::test_one_http_call_required

For Windows

python -m pytest tests\test_basic.py::test_one_http_call_required

If issues arise concerning a private.json file, you may need to set the following environment variable: TUID_CONFIG=tests/travis/config.json

Running the web application for development

You can run the web service locally with

cd ~/TUID
export PYTHONPATH=.:vendor
python tuid/app.py

The config.json file has a flask property which is sent to the Flask service constructor. Notice the service is set to listen on port 5000.

"flask": {
    "host": "0.0.0.0",
    "port": 5000,
    "debug": false,
    "threaded": true,
    "processes": 1,
}

The web service was designed to be part of a larger service. You can assign a route that points to the tuid_endpoint() method, and avoid the Flask server construction.
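
For example, here is a minimal sketch of mounting the endpoint on an existing Flask app; the import path and the view signature of tuid_endpoint() are assumptions, so check tuid/app.py before wiring it up this way:

# Sketch only: assumes tuid.app exposes tuid_endpoint() as a
# Flask-compatible view function.
from flask import Flask
from tuid.app import tuid_endpoint

existing_app = Flask(__name__)
existing_app.add_url_rule("/tuid", "tuid", tuid_endpoint, methods=["GET", "POST"])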

Deploying the web service

First, the server needs to be set up, which can be done by running the server setup script resources/scripts/setup_server.sh; then the app can be set up using resources/scripts/prod_app.sh. If an error is encountered when running sudo supervisorctl, try restarting it by re-running the relevant commands from the server setup script.

Using the web service

The app.py sets up a Flask application with an endpoint at /tuid. This endpoint models a database: it has one table called files, and it accepts queries on that table. The set of supported queries is extremely limited:

{
    "from":"files"
    "where": {"and": [
        {"eq": {"branch": "<BRANCH>"}},
        {"eq": {"revision": "<REVISION>"}},
        {"in": {"path": ["<PATH1>", "<PATH2>", "...", "<PATHN>"]}}
    ]}
}

Here is an example curl:

curl -XGET http://localhost:5000/tuid -d "{\"from\":\"files\", \"where\":{\"and\":[{\"eq\":{\"branch\":\"mozilla-central\"}}, {\"eq\":{\"revision\":\"9cb650de48f9\"}}, {\"eq\":{\"path\":\"modules/libpref/init/all.js\"}}]}}"

After some time (70sec as of March 23, 2018) we get a response (formatted and clipped for clarity):

{
    "format":"table",
    "header":["path","tuids"],
    "data":[[
        "modules/libpref/init/all.js",
        [
            242488,
            245829,
            ...<snip>...
            243144
        ]
    ]]
}

Using the client

This repo includes a client (in ~/TUID/tuid/client.py) that will send the necessary query to the service and cache the results in a local Sqlite database. This TuidClient was made for the ActiveData-ETL pipeline, so it has methods specifically suited for that project; but one method, called get_tuid(), you may find useful.
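
A minimal usage sketch follows; the constructor arguments and the get_tuid() parameters shown here are assumptions, so check tuid/client.py for the actual signatures:

# Sketch only: argument names are illustrative, not the confirmed API.
from tuid.client import TuidClient

client = TuidClient(endpoint="http://localhost:5000/tuid", db="tuid_client.sqlite")
tuids = client.get_tuid(
    branch="mozilla-central",
    revision="9cb650de48f9",
    file="modules/libpref/init/all.js",
)
print(tuids)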

Examples using this service

  1. Web-extension for Phabricator. See the README in that repo for installation instructions.

TUID Service Improvements as part of GSoC 2019

The service was ported from SQLite to Elasticsearch, and caching was added to make the service faster. Details can be found here.

Riot Matrix Channel

We've moved away from IRC. You can find us in the public code-coverage Riot channel instead.

tuid's People

Contributors

ajupazhamayil, gmierz, jyothisjagan, klahnakoski, mozilla-github-standards, natj212, nishikeshkardak, rv404674


tuid's Issues

revisions and changesets are 1-1 in hg

Conceptually, revisions and changesets are different, but hg does not make the distinction. The two tables going by those names should be merged into one.

Improve sqs consumer

The sqs_consumer was written to consume the recorded tuid requests, and make those same calls to a service. Unfortunately, it does not send the requests at the rate they arrived, because it is single-threaded.

Begin by determining the playback time offset, and ensure the messages are consumed (and sent) according to the timestamps on those messages. Use multiple threads to do so.
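
A sketch of such a playback loop is below; the message format and the send_to_service() helper are assumptions, not the existing sqs_consumer API:

# Sketch: replay recorded requests on multiple threads, preserving the
# original cadence. messages is an iterable of (timestamp, payload) pairs
# ordered by timestamp; send_to_service is a hypothetical helper that posts
# one payload to the TUID service.
import time
from concurrent.futures import ThreadPoolExecutor

def replay(messages, send_to_service, max_workers=8):
    messages = list(messages)
    if not messages:
        return
    offset = time.time() - messages[0][0]  # playback time offset
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for timestamp, payload in messages:
            delay = (timestamp + offset) - time.time()
            if delay > 0:
                time.sleep(delay)  # wait until the original send time
            pool.submit(send_to_service, payload)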

Add a facade for logging

Almost all methods should be wrapped in a facade, including builtins like print:

 print(url)

Make a facade that uses print, or make a facade that uses a logging library:

    class Log:

        @classmethod
        def note(cls, message):
            print(message)

then you can replace the above with

Log.note(url)

Facades provide an API that best matches your application's calling patterns, and make it easier to change libraries later.

Reduce warnings

There are too many warnings while running in prod. Each warning sends an email.

ip-172-31-1-12 (pid 8972) - 2018-05-05 11:53:38 - Main Thread - "service.py:784" (get_tuids) - WARNING: xpcom/ds/nsArrayUtils.h does not exist in the revision 334aed1c3232

Warnings are for unusual situations that the machine can handle, but that really should be fixed.

Cache the diffs

Caching the diffs may be useful; at the very least, knowing which files are touched by which revisions will help us store a sparse table of tuids:

CREATE TABLE diffs (
    file VARCHAR,
    changeset VARCHAR,
    diff VARCHAR
)

The diff column is not the regular diff; rather, it is some serialization of a structure that tells us which lines are new (+), deleted (-), or unchanged (|). Maybe a string with one character per line of the source file:

||||||+++--||||||||||||||||||+|||||-

or a run-length-encoded version of the same

[("|", 6), ("+", 3), ("-", 2), ("|", 18), ("+", 1), ("|", 5), ("-", 1)]

or something smarter.
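
As an illustration, run-length encoding and decoding such a string is only a few lines of Python:

# Sketch: run-length encode/decode the per-line diff string described above.
from itertools import groupby

def rle_encode(marks):
    # "||||||+++--..." -> [("|", 6), ("+", 3), ("-", 2), ...]
    return [(ch, len(list(run))) for ch, run in groupby(marks)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

assert rle_decode(rle_encode("||||||+++--|||")) == "||||||+++--|||"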

Even so, this diff may not be needed since it should have been applied to the subsequent changeset, and stored in the tuid table already.

Add facade to sqlite

Another example of add-a-facade-to-everything; here is a common calling pattern you use with sqlite:

cursor = self.conn.execute(self._grabTIDQuery,(file,current_changeset))
cs_list = cursor.fetchall()

Make a facade over the sqlite library to simplify these calls:

class Sqlite(object):

    def __init__(self, db):
        self.db = db  # actually a connection object

    def get(self, sql, params):
        cursor = self.db.execute(sql, params)
        return cursor.fetchall()

then your calls look like

cs_list = db.get(self._grabTIDQuery,(file,current_changeset))

Cache get_tuids() responses

TUID requests will be significantly more numerous than the number of files ingested. Please cache the responses of get_tuid() in the database so they can be retrieved faster, and with a single query. Also consider pre-caching these responses so all the transformation logic is in one place.

Right now, this block does a lot of work for every request:

for line_num in range(1, len(line_origins) + 1):
:

Add check for active transactions

The TUID service easily gets overwhelmed by requests, especially when there are about 300 machines making those requests (as was the case this morning). The TUID service must be more aggressive about how it responds to these requests.

The app.py methods could check the database for the number of pending transactions, and return a 503 Service Unavailable if the count is too high.
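
A sketch of that guard is below; pending_transactions() is a hypothetical helper that counts open transactions in the database, and the threshold is illustrative only:

# Sketch only: the helper and the threshold are placeholders.
from flask import Flask

app = Flask(__name__)
MAX_PENDING = 50  # illustrative threshold

def pending_transactions():
    # hypothetical: query the database for the number of open transactions
    return 0

@app.route("/tuid", methods=["GET", "POST"])
def tuid_endpoint():
    if pending_transactions() > MAX_PENDING:
        return "service overloaded", 503  # Service Unavailable
    return "ok", 200  # placeholder for the normal request handling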

Add doc for class and methods for TID API

I would like to see a small doc that points to the main class responsible for TID management, and the (public) methods that can be called on it. This will help me write more test cases, and provide you with a piece of code that turns your class into a web API.

Send fewer bad requests to HG

Operations noticed my deployment test on Saturday, and banned the IP for sending too many requests that result in 404. In our context, 404 is a perfectly legitimate response; it tells us what files do not exist. Unfortunately, these 404s are more suspicious than an equivalent number of successful requests.

I will be fixing the ETL pipeline to send fewer requests for files that do not exist. Plus, we can do better: the ETL machines mark up files with is_firefox, and the client can be made to use that to bring the number of requests down to near zero.

Solve the transaction problem

In a call with @gmierz this morning we decided to use transactions to roll back changes caused by program errors. This requires that sql.py implement transactions, and implement them properly.

In this light, I have pushed a facade that allows transactions using a Python "context manager" (66b1525):

Here is example usage:

with db.transaction() as t:
    t.execute(
        "INSERT INTO temporal (tuid, revision, file, line) VALUES (?, ?, ?, ?)",
        (-1, rev[:12], file_name, 0)
    )
    t.execute(
        "INSERT INTO annotations (revision, file, annotation) VALUES (?, ?, ?)",
        (rev[:12], file_name, '')
    )

Notice the use of t rather than self.db: Technically, either one can be used.

Both commands will be completed, or none will: The transaction is committed when the block of code completes successfully, and the transaction is rolled back if an exception occurs inside the block.

The facade does not implement transactions yet, but at least the service.py code can start using the idiom.

Move database file to larger drive

The boot drive is only 7GB, so it will not be able to hold the tuid database for long. Move it to /data1, which is a little bigger.

[ec2-user@ip-172-31-1-12 TUID]$ cd resources/
[ec2-user@ip-172-31-1-12 resources]$ ls -al
total 1231852
drwxrwxr-x 4 ec2-user ec2-user       4096 Jun 23 01:37 .
drwxrwxr-x 8 ec2-user ec2-user       4096 Jun 19 03:12 ..
drwxrwxr-x 2 ec2-user ec2-user       4096 Jun 19 03:12 config
-rw-rw-r-- 1 ec2-user ec2-user        949 Apr 14 19:43 example_client.py
drwxrwxr-x 2 ec2-user ec2-user       4096 May 29 13:57 scripts
-rw-rw-r-- 1 ec2-user ec2-user      43724 Apr 14 19:43 stressfiles.json
-rw-r--r-- 1 ec2-user ec2-user 1261322240 Jun 23 01:37 tuid_app.db
-rw-r--r-- 1 ec2-user ec2-user      24248 Jun 23 01:37 tuid_app.db-journal

Measure % of files marked with TUIDs

The TUID service places priority on fast response rather than returning all TUIDs; it will return 202 if not all TUIDs can be found quickly. This is good because we maintain a quality of service that the ETL machines can rely on.

We should measure what percent of files are being marked with TUIDs. This will inform the accuracy of the TUID queries; it can be used to track the progression of this project; and we can use it for alerts in the future.

Do not use "is"

This code will not do what you expect:

if mozobj['diff'] is []:

instead write

if not mozobj['diff']:

Merge conflicts

Currently, the TUID service completely ignores merge diffs because they contain the same changes done by past patches/diffs that have already been applied.

However, there exists a case where merge conflicts can occur, leading to changes being done in the merge patch which would never be in the TUID DB with what we currently have. Furthermore, those changes applied between the previous merge, and a future merge which has a conflict, will be incorrect because of the conflict. Now, since diffs cannot be applied within that range, only annotations will be usable in these cases.

These conflicts occur anywhere from 1-9% of the time, so they are uncommon but not negligible. Currently, once a merge conflict occurs in one file, any changes applied to it afterwards will result in incorrect annotations. Because this doesn't happen very often (at most 9% of the time), we will have no choice but to return an annotation for a file at a given revision.

The solution to this problem is to use the upcoming Clogger to:

  1. Store which revisions are merges by setting a merge field to 1.
  2. Store a new table of (merge_revision, file) to denote all files that were actually modified by the merge:
     i. Get all the files the merge modifies.
     ii. Get the json-log for each file; if the merge_revision is in the entries list, then the merge modifies it.
  3. When we encounter a merge revision in the service (searching for the closest revnum in the future with the merge field set), check whether any requested files are modified by the merge, and return a new annotation for them.

Remove loop

In

existing_tuids[int(node['lineno'])] = tuid_tmp[0]

the missing_tuids set is filled in a loop (along with another set); this should be done in a single query: find everything that exists, and use set subtraction to find what is missing (see the sketch below).
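
A sketch of the single-query approach, using the temporal table columns shown elsewhere in this repo (the exact query is illustrative):

# Sketch: one query for the existing lines, then set subtraction for the rest.
# conn is an open sqlite3 connection.
def split_existing_and_missing(conn, revision, file_name, wanted_lines):
    rows = conn.execute(
        "SELECT line, tuid FROM temporal WHERE revision = ? AND file = ?",
        (revision, file_name),
    ).fetchall()
    existing_tuids = {line: tuid for line, tuid in rows}
    missing_lines = set(wanted_lines) - set(existing_tuids)  # set subtraction
    return existing_tuids, missing_lines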

Inline methods that have only one caller

_makeTIDsFromChangeset has only one caller; please in-line it.

test_addChangsetToRev fails because it calls _makeTIDsFromChangeset, which should only be called under certain conditions that _grabChangeset checks for. The mere existence of _makeTIDsFromChangeset influenced you to use it, and use it wrong. Its existence increases the surface area you are compelled to test, even though it should not be tested directly; testing an implementation detail is brittle. There is also the cognitive load of remembering the method names; fewer names are better.

In the same vein, please inline any other single-use methods. Add a comment line that describes what's being done if you wish.

Faster changelog?

Does hg accept a parameter to increase the number of results in a changelog? More results with fewer requests will be faster.

add version table to database

It would be nice for the service to verify the database has the correct version. If not, then either migrate the data, or make a new database.
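
A sketch of such a check at startup; the table name, the column, and the rebuild-vs-migrate decision are illustrative only:

# Sketch: verify the schema version before using the database.
import sqlite3

EXPECTED_VERSION = 1  # illustrative

def check_version(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS version (number INTEGER)")
    row = conn.execute("SELECT number FROM version").fetchone()
    if row is None:
        conn.execute("INSERT INTO version (number) VALUES (?)", (EXPECTED_VERSION,))
        conn.commit()
    elif row[0] != EXPECTED_VERSION:
        # migrate the data here, or make a new database
        raise Exception("database is version %s, expected %s" % (row[0], EXPECTED_VERSION))

check_version(sqlite3.connect(":memory:"))  # example usage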

Ensure columns are explicit

This code is a bit dangerous:

cursor = self.conn.execute("select * from revision where date<=? and file=?", (date, file,))
old_rev = cursor.fetchall()
if old_rev == [] or old_rev[0][0] == revision:
    return self._grab_revision(file,revision)

Do not use select * unless your code uses cursor.description to assign names to columns; the ordering of the columns may change on you. Let your SQL be explicit:

cursor = self.conn.execute("select REV, FILE, DATE, CHILD, TID, LINE from revision where date<=? and file=?", (date, file,))

Be sure to assign to named variables as soon as possible, so there are no magic numbers littered in your code:

old_rev = cursor.fetchall()
if old_rev:
    old_rev_id, current_changeset, current_date, _, _, _ = old_rev[0]

if not old_rev or old_rev_id == revision:
    return self._grab_revision(file,revision)

old_rev = self._grab_revision(file,old_rev_id)

Faster changeset application?

This is only to be implemented if we find changeset application must be faster.

If the (tuid, revision, file, line) tuples are rows in the database, then a changeset can be converted to SQL statements that operate on those rows to update the database. Sending SQL to the database should be faster than pulling the raw data out, operating on it with Python, and pushing it back in.
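
A sketch of what those statements could look like for single-line operations against the (tuid, revision, file, line) table; it ignores revision scoping and is illustrative only:

# Sketch: express a deletion and an insertion as SQL so the rows never leave
# the database. conn is an open sqlite3 connection; revision scoping is
# deliberately omitted to keep the example short.
def apply_delete(conn, file_name, deleted_line):
    conn.execute("DELETE FROM temporal WHERE file = ? AND line = ?", (file_name, deleted_line))
    conn.execute("UPDATE temporal SET line = line - 1 WHERE file = ? AND line > ?", (file_name, deleted_line))

def apply_insert(conn, file_name, new_line, new_tuid, revision):
    conn.execute("UPDATE temporal SET line = line + 1 WHERE file = ? AND line >= ?", (file_name, new_line))
    conn.execute(
        "INSERT INTO temporal (tuid, revision, file, line) VALUES (?, ?, ?, ?)",
        (new_tuid, revision, file_name, new_line),
    )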

Do not parse diffs, use existing code

Concerning this code

    if line['t'] == '@':
        m = re.search('(?<=\+)\d+', line['l'])
        curline = m.group(0)

It looks like you are parsing diffs by hand. Please use a diff parsing library, or use mo_hg.parse.diff_to_json() from the mo-hg library.

Apply diffs to TUID arrays rather than pulling blame from hg

A diff touching multiple files will be faster to pull from mo-hg than pulling every touched file from hg's json-annotate. Therefore, deltas from known TUIDs should be faster to calculate.

The only time json-annotate should be used is when we are asking about files we do not know about, or revisions that precede the database knowledge, or when the number of changesets from the target revision to the last known revision is too great.

Move to new server

The TUID service is demanding; it should run on its own server.

Catch exceptions early and often, and chain them

An uncaught exception will make the program exit. Instead of

            try:
                self.conn = sqlite3.connect(self.config['database']['name'])
            except Exception:
                print("Could not connect to database")
                exit(-1)

raise an error:

            try:
                self.conn = sqlite3.connect(self.config['database']['name'])
            except Exception as e:
                raise Exception("Could not connect to database") from e

Actually, the whole __init__() function should be surrounded by try/raise to catch a plethora of other problems that could occur:

def __init__(self,conn=None): #pass in conn for testing purposes
    try:
        with open('config.json', 'r') as f:
            self.config = json.load(f, encoding='utf8')

        if conn is None:
            self.conn = sqlite3.connect(self.config['database']['name'])
        else:
            self.conn = conn
        cursor = self.conn.cursor()
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
        if cursor.fetchone() is None:
            self.initDB()
    except Exception as e:
        raise Exception("can not setup service") from e

You do this to help explain what the problem is ("can not setup service") and you add the original exception to the causal chain (from e), so you get a detailed reason for failure.

Pack database

The annotations table stores the tuid<->line mapping as a string of many lines, with a (tuid, linenum) pair on each line. Storing an ordered list of tuids should be half the size.
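
A sketch of the packing, where the line number becomes implicit in the position of the tuid; the "tuid,linenum per line" layout assumed here is illustrative:

# Sketch: drop the line numbers and keep only an ordered list of tuids.
def pack(annotation):
    # "242488,1\n245829,2\n..." -> "242488,245829,..."
    pairs = (line.split(",") for line in annotation.splitlines() if line)
    ordered = sorted(pairs, key=lambda p: int(p[1]))
    return ",".join(tuid for tuid, _ in ordered)

def unpack(packed):
    # the position in the list recovers the line number
    return {line: int(tuid) for line, tuid in enumerate(packed.split(","), start=1)}

assert unpack(pack("242488,1\n245829,2")) == {1: 242488, 2: 245829}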

Format structured blocks

Every line of code should have balanced parentheses, quotes, or any other syntactic containment. If code does not fit on a line, then the opening tokens should end the line, with a line of matching closing tokens after the enclosed block.

Instead of

    self.conn.execute('''CREATE TABLE Temporal
             (TID INTEGER PRIMARY KEY     AUTOINCREMENT,
             REVISION CHAR(12)		  NOT NULL,
             FILE TEXT,
    		 LINE INT,
    		 OPERATOR INTEGER,
    		 UNIQUE(REVISION,FILE,LINE,OPERATOR));''')

format it as

    self.conn.execute('''
        CREATE TABLE Temporal (
            TID INTEGER PRIMARY KEY     AUTOINCREMENT,
            REVISION CHAR(12)           NOT NULL,
            FILE TEXT,
            LINE INT,
            OPERATOR INTEGER,
            UNIQUE(REVISION,FILE,LINE,OPERATOR)
        );
    ''')

Notice the first line has two opening tokens (( and ''') at the end of the line, and the matching ones are together on the last line. The second line also ends with an opening token, and its match is on its own line. This formatting is easier to read and easier to cut-and-paste; notice the column names are able to line up, and we can paste this code into a database session for testing.

Python method names should not be CamelCase

Change the method names (and names in general) from camelCase (test_applyChangesetsToRev) to underscore-separated (test_apply_changesets_to_rev)

The one exception is class names; they should be CamelCase.

Use with context clause for files

Instead of

    f=open('config.json', 'r',encoding='utf8')
    self.config = json.load(f)
    f.close();

use the with clause to close the file no matter how you exit the block:

        with open('config.json', 'r') as f:
            self.config = json.load(f, encoding='utf8')

Oh, and the encoding parameter moved.

Alert on thread limit

I suspect the machine failure over the weekend was caused by too many threads being generated by the TUID service. We should have an alert when too many threads get created. This may be hard since we do not control the creation of threads; the Flask app does. Still, we should be able to id() threads and count them.
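
A sketch of such a check, counting live threads with threading.enumerate(); the threshold and the alert callback are illustrative only:

# Sketch: periodically count the live threads and alert when over a limit.
import threading
import time

MAX_THREADS = 200  # illustrative

def watch_threads(alert, interval=60):
    while True:
        count = len(threading.enumerate())
        if count > MAX_THREADS:
            alert("too many threads: %d" % count)
        time.sleep(interval)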

Database closing?

ip-172-31-1-12 (pid 4466) - 2018-05-10 03:54:19 - Main Thread - "app.py:97" (tuid_endpoint) - WARNING: could not handle request
        File "tuid/app.py", line 97, in tuid_endpoint
        File "/home/ec2-user/TUID/vendor/pyLibrary/env/flask_wrappers.py", line 52, in output
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1997, in __call__
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 258, in execute
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 270, in run_wsgi
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 328, in handle_one_request
        File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
        File "/usr/lib64/python2.7/SocketServer.py", line 652, in __init__
        File "/usr/lib64/python2.7/SocketServer.py", line 331, in finish_request
        File "/usr/lib64/python2.7/SocketServer.py", line 596, in process_request_thread
        File "/usr/lib64/python2.7/threading.py", line 757, in run
        File "/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
        File "/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
caused by
        ERROR: database is closed
        File "/home/ec2-user/TUID/vendor/pyLibrary/sql/sqlite.py", line 132, in execute
        File "/home/ec2-user/TUID/tuid/sql.py", line 25, in rollback
        File "/home/ec2-user/TUID/tuid/sql.py", line 49, in __exit__
        File "/home/ec2-user/TUID/tuid/service.py", line 328, in get_tuids_from_files
        File "tuid/app.py", line 82, in tuid_endpoint
        File "/home/ec2-user/TUID/vendor/pyLibrary/env/flask_wrappers.py", line 52, in output
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
        File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1997, in __call__
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 258, in execute
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 270, in run_wsgi
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 328, in handle_one_request
        File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
        File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 293, in handle
        File "/usr/lib64/python2.7/SocketServer.py", line 652, in __init__
        File "/usr/lib64/python2.7/SocketServer.py", line 331, in finish_request
        File "/usr/lib64/python2.7/SocketServer.py", line 596, in process_request_thread
        File "/usr/lib64/python2.7/threading.py", line 757, in run
        File "/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
        File "/usr/lib64/python2.7/threading.py", line 777, in __bootstrap

Fix Log usage

  1. Rename the Log.py file to log.py (all lowercase).
  2. Do not assign a log object (self.log = Log.Log); you can use it without creating an instance.
  3. Instead of self.log.note(), call the class directly: Log.note().

Changesets table to track changesets and their parents

By tracking every changeset, its parent, and a total ordering number we can find the latest changeset for every file we are interested in.

SELECT
    file,
    max(ordering)
FROM
    changesets c
LEFT JOIN
    diffs d ON d.changeset = c.id
WHERE
    d.file IN (file_list)
GROUP BY
    file

I do not intend for this resultset to be pulled from the database; rather, it should be used as part of a larger query to get all relevant tuids.

Use global constants

Instead of

_grabTIDQuery = "SELECT * from Temporal WHERE file=? and substr(revision,0,13)=substr(?,0,13);"

define a module-level constant

GRAB_TID_QUERY = "SELECT * from Temporal WHERE file=? and substr(revision,0,13)=substr(?,0,13);"

Fix comparison with None, and fix formatting

This syntax should be raising a warning in your IDE:

if changesets == None:

Python prefers

if changesets is None:

Furthermore, there are many formatting problems with the code. Pycharm has a key-combo that will auto-format your code. Find out what that combo is, and use it often.

Client may be sending empty body

The client is complaining about the server's complaint!

ip-172-31-2-132 (pid 10484) - 2018-05-24 20:23:51 - ETL Loop 0 - "client.py:131" (get_tuids) - WARNING: TUID service has problems.
	File "/home/ubuntu/ActiveData-ETL/vendor/tuid/client.py", line 131, in get_tuids
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 41, in _annotate_sources
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 63, in tuid_batches
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 83, in line_gen
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/aws/s3.py", line 348, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/sinks/s3_bucket.py", line 91, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 98, in process_grcov_artifact
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/cov_to_es.py", line 102, in process
	File "activedata_etl/etl.py", line 177, in _dispatch_work
	File "activedata_etl/etl.py", line 304, in loop
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_threads/threads.py", line 268, in _run
caused by
	ERROR: Can not decode JSON:
 e  x  p  e  c  t  i  n  g     q  u  e  r  y
65 78 70 65 63 74 69 6E 67 20 71 75 65 72 79

	File "/home/ubuntu/ActiveData-ETL/vendor/mo_json/__init__.py", line 339, in json2value
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/env/http.py", line 236, in post_json
	File "/home/ubuntu/ActiveData-ETL/vendor/tuid/client.py", line 115, in get_tuids
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 41, in _annotate_sources
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 63, in tuid_batches
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 83, in line_gen
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/aws/s3.py", line 348, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/sinks/s3_bucket.py", line 91, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 98, in process_grcov_artifact
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/cov_to_es.py", line 102, in process
	File "activedata_etl/etl.py", line 177, in _dispatch_work
	File "activedata_etl/etl.py", line 304, in loop
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_threads/threads.py", line 268, in _run
caused by
	ERROR: can not decode
expecting query
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_json/__init__.py", line 300, in json2value
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/env/http.py", line 236, in post_json
	File "/home/ubuntu/ActiveData-ETL/vendor/tuid/client.py", line 115, in get_tuids
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 41, in _annotate_sources
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 63, in tuid_batches
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 83, in line_gen
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/aws/s3.py", line 348, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/sinks/s3_bucket.py", line 91, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 98, in process_grcov_artifact
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/cov_to_es.py", line 102, in process
	File "activedata_etl/etl.py", line 177, in _dispatch_work
	File "activedata_etl/etl.py", line 304, in loop
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_threads/threads.py", line 268, in _run
caused by
	ERROR: No JSON object could be decoded
	File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
	File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
	File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_json/__init__.py", line 298, in json2value
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/env/http.py", line 236, in post_json
	File "/home/ubuntu/ActiveData-ETL/vendor/tuid/client.py", line 115, in get_tuids
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 41, in _annotate_sources
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/imports/coverage_util.py", line 63, in tuid_batches
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 83, in line_gen
	File "/home/ubuntu/ActiveData-ETL/vendor/pyLibrary/aws/s3.py", line 348, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/sinks/s3_bucket.py", line 91, in write_lines
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/grcov_to_es.py", line 98, in process_grcov_artifact
	File "/home/ubuntu/ActiveData-ETL/activedata_etl/transforms/cov_to_es.py", line 102, in process
	File "activedata_etl/etl.py", line 177, in _dispatch_work
	File "activedata_etl/etl.py", line 304, in loop
	File "/home/ubuntu/ActiveData-ETL/vendor/mo_threads/threads.py", line 268, in _run

Daemon to work ahead of TUID requests

This is to track work on the "clogger" (changeset logger) that pre-fills the database with TUIDs before the requests for those TUIDs come in.

First draft is here #47

Add daemon to pre-load tuids into database for coverage revisions

We want to ensure that requests for tuids on coverage revisions have low latency. I would hope that unified diff application** would be fast enough, but maybe not. We can predict the future revisions that will be requested by the ETL pipeline machines far in advance: Query ActiveData for the recent coverage tasks, and the build revision used.

A daemon should monitor the future revisions, and ensure the database is pre-loaded with tuids for all known files on those revisions. This same daemon can then be responsible for loading the initial database.

** The request-and-apply for all the diffs from the known revision to the target revision may take longer than the 30sec QoS we desire.

Pull less branches during Travis testing

The mo-hg library fills an empty ES instance with branch information. This takes a while, and generates lots of logs.

Add a configuration option to mo-hg that limits the branches it is interested in, making loading and testing faster.
