
Utility to download near real time weather data and insert it into PCIC's database

License: GNU General Public License v3.0


crmprtd's Introduction

crmprtd


Utility to download near-real-time weather data and insert it into PCIC's PCDS-type databases (e.g., CRMP, Metnorth).

Documentation

Creating a production release

  1. Modify tool.poetry.version in pyproject.toml: First remove any suffix to the version number, as our convention is to reserve those for test builds (e.g., 1.2.3 is a release build, 1.2.3.dev7 is a test build). Then increment the release build version.
  2. Summarize release changes in NEWS.md
  3. Commit these changes, then tag the release
    git add pyproject.toml NEWS.md
    git commit -m"Bump to version X.Y.Z"
    git tag -a -m"X.Y.Z" X.Y.Z
    git push --follow-tags
  4. Our GitHub Actions workflow will build and release the package on our PyPI server.

Creating a dev/test release

The process is very similar to a production release, but uses a different version number convention, and omits any notice in NEWS.md.

  1. Modify tool.poetry.version in pyproject.toml: Add or increment the suffix in the pattern .devN, where N is any number of numeric digits (e.g., 1.2.3.dev11). Our convention is to reserve those for test releases (e.g., 1.2.3 is a release build, 1.2.3.dev11 is a test build).
  2. Commit changes and tag the release:
    git add pyproject.toml
    git commit -m"Create test version X.Y.Z.devN"
    git tag -a -m"X.Y.Z.devN" X.Y.Z.devN
    git push --follow-tags
  3. Our GitHub Actions workflow will build and release the package on our PyPI server.


crmprtd's Issues

Allow requests longer than 7 days for MoTI infilling

MoTI's "web API" doesn't allow requests longer than 7 days, but that doesn't mean that our interface can't allow them (and manually chunk up the requests). This would be a nice feature to add to the command line tools for when we need to do some infilling.
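
As a rough illustration (not existing crmprtd code), a chunking helper could split a requested range into MoTI-sized windows. The 7-day limit comes from this issue; everything else below is assumed:

from datetime import datetime, timedelta


def chunk_time_range(start, end, max_days=7):
    """Yield (start, end) windows no longer than max_days that cover [start, end]."""
    step = timedelta(days=max_days)
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)
        yield window_start, window_end
        window_start = window_end


# A 20-day infill request becomes three MoTI-sized requests:
for s, e in chunk_time_range(datetime(2020, 4, 1), datetime(2020, 4, 21)):
    print(s, e)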

Installer is broken w/o requirements

A recent change broke the PyPI Publishing Action.

The cause is that the action doesn't actually install any of the package requirements, but setup.py then makes an import that pulls in third-party packages.

It turns out that the 3rd party packages imported in crmprtd/__init__.py are only used in a function that's not actually used anywhere in the code base anymore. Should we just remove it? The other option would be to only import 3rd party packages at run-time (inside the functions in which they are used) in this module. Since the code in question isn't run from setup.py, this would let us work around the existing problem.
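
For illustration, a deferred (run-time) import would look roughly like the sketch below; the function and the yaml dependency are placeholders, not the actual contents of crmprtd/__init__.py:

# Sketch only: defer the third-party import to call time so that
# "import crmprtd" (and therefore setup.py) works without the requirements installed.
def load_config(path):
    import yaml  # placeholder third-party dependency, imported only when needed

    with open(path) as f:
        return yaml.safe_load(f)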

@nikola-rados may want to weigh in here, as he and I are the only contributors to this function.

Infill not reflected in database

This is a continuation of the problems from #64, trying to corroborate data from the logs against entries in the database itself.

We will focus on the outage event of April 4th, 2020.

Here is the outage as seen in the database before running the infill:

crmp=> select date_trunc('hour', obs_time) as hour, count(*) from obs_raw natural join meta_vars natural join meta_network where mod_time < '2020-04-09:10:00:00' and obs_time > '2020-04-04:05:00:00' and obs_time < '2020-04-05:05:00:00' and network_name = 'MoTIe' group by date_trunc('hour', obs_time) order by hour;
        hour         | count
---------------------+-------
 2020-04-04 06:00:00 |  1698
 2020-04-04 07:00:00 |  1668
 2020-04-04 08:00:00 |  1675
 2020-04-04 09:00:00 |  1666
 2020-04-04 10:00:00 |  1643
 2020-04-04 11:00:00 |  1668
 2020-04-04 12:00:00 |  1612
 2020-04-04 15:00:00 |  1630
 2020-04-04 16:00:00 |  1697
 2020-04-04 17:00:00 |  1689
 2020-04-04 18:00:00 |  1682
 2020-04-04 19:00:00 |  1591
 2020-04-04 21:00:00 |  1683
 2020-04-04 22:00:00 |  1682
 2020-04-04 23:00:00 |  1674
 2020-04-05 00:00:00 |  1671
 2020-04-05 01:00:00 |  1661
 2020-04-05 02:00:00 |  1577
(18 rows)

Focusing on one hour to run infill_all over, with the following parameters:

crmprtd_infill_all -S '2020/04/04 05:00:00' -E '2020/04/04 06:00:00' -a /auth/path -c "connection-string" -L /logging/path -N moti

and re-checking the database for these insertions reveals

crmp=> select date_trunc('hour', obs_time) as hour, count(*) from obs_raw natural join meta_vars natural join meta_network where obs_time > '2020-04-04:05:00:00' and obs_time < '2020-04-05:05:00:00' and network_name = 'MoTIe' group by date_trunc('hour', obs_time) order by hour;
        hour         | count
---------------------+-------
 2020-04-04 06:00:00 |  2722
 2020-04-04 07:00:00 |  1668
 2020-04-04 08:00:00 |  1675
 2020-04-04 09:00:00 |  1666
 2020-04-04 10:00:00 |  1643
 2020-04-04 11:00:00 |  1668
 2020-04-04 12:00:00 |  1612
 2020-04-04 15:00:00 |  1630
 2020-04-04 16:00:00 |  1697
 2020-04-04 17:00:00 |  1689
 2020-04-04 18:00:00 |  1682
 2020-04-04 19:00:00 |  1591
 2020-04-04 21:00:00 |  1683
 2020-04-04 22:00:00 |  1682
 2020-04-04 23:00:00 |  1674
 2020-04-05 00:00:00 |  1671
 2020-04-05 01:00:00 |  1661
 2020-04-05 02:00:00 |  1577
(18 rows)

This appears to have run successfully, and ordinarily we wouldn't have looked twice. But checking only the recent inserts (i.e., the difference between the last two queries) reveals:

crmp=> select date_trunc('hour', obs_time) as hour, count(*) from obs_raw natural join meta_vars natural join meta_network where mod_time > '2020-04-09:10:00:00' and obs_time > '2020-04-04:05:00:00' and obs_time < '2020-04-05:05:00:00' and network_name = 'MoTIe' group by date_trunc('hour', obs_time) order by hour;
        hour         | count
---------------------+-------
 2020-04-04 06:00:00 |  1024
(1 row)

This makes sense within the database. However, examining the logs for each tag reveals 'skips:' 5975, 'failures:' 0, 'successes:' 1210. This discrepancy (we expect 1024 successes, but the logs show 1210) makes it difficult to establish a baseline for assessing missing data, particularly when trying to isolate a particular error that causes it.

It should be noted that the "Could not detect the station id" error from #64 occurred 205 times, yet produced no failures in the logs. So it is unlikely to be the culprit for our missing data, and if it were, it is not reflected consistently between the database and the logs.

We believe the discrepancy we see in this simple April 4th example is happening for the other infills as well, since it cannot be reconciled between the logs and the database.

$ cat 04042020.txt | grep -c "Could not detect"
205

We are seeing a similar discrepancy here, as well as for the 30th, 23rd, and 12th outages.

Block secrets from log files

The crmprtd_infill script has some debug logging that outputs the full arguments of the subprocesses that it's invoking. This includes a PG connection string that could feasibly include a secret password if a user were to unwittingly use this syntax instead of the much preferred .pgpass file.

Write a few LOCs to anonymize the secret before logging.
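
A minimal sketch of such a scrubber (regex-based, assuming URL-style connection strings of the form postgresql://user:password@host/db):

import re

# Mask the password component of a URL-style connection string before it hits the logs.
PASSWORD_RE = re.compile(r"(?<=://)([^:/@]+):([^@]+)(?=@)")


def scrub_connection_string(dsn):
    return PASSWORD_RE.sub(r"\1:*****", dsn)


print(scrub_connection_string("postgresql://crmp:hunter2@db.example.org/crmp"))
# postgresql://crmp:*****@db.example.org/crmp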

Improve DBMetrics usage

The DBMetrics class in insert.py needs to have a third category for "skips" (observations that it has been asked to insert, but already exist in the database).

Additionally, the code should be refactored so that DBMetrics objects cannot be modified in delegate functions. Results should be loudly and explicitly returned, not quietly modified.
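
One possible shape for this (a sketch, not the current insert.py implementation): make DBMetrics a small value object with a skips field, and have delegates return fresh instances that callers combine explicitly.

from dataclasses import dataclass


@dataclass
class DBMetrics:
    successes: int = 0
    skips: int = 0      # asked to insert, but the observation already exists
    failures: int = 0

    def __add__(self, other):
        return DBMetrics(
            self.successes + other.successes,
            self.skips + other.skips,
            self.failures + other.failures,
        )

# Delegates return their own metrics; the caller combines them explicitly:
#     metrics = metrics + insert_chunk(sesh, chunk)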

The align phase needs to use caching from the database

In the dev branch, the align phase does a database lookup for every single variable and every single station that it receives in an obs_tuple. When it gets thousands of obs_tuples, this is really slow!

In the master branch, in the wamr module we had a strategy to deal with this by creating a cache mapping of both variables and histories at the beginning of the process, so that we only had to go to the database once, and then we could rely on the cache for the rest of the process.

We should implement this for align as well. It would save a ton of time.
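
A sketch of the cache-once approach (the SQLAlchemy calls in the comments are illustrative, and the model names are placeholders):

def build_cache(rows, key):
    """Build an in-memory lookup once so align() avoids a DB round trip per obs_tuple."""
    return {key(row): row for row in rows}

# At the start of the align phase, one query per table (illustrative):
#     history_cache = build_cache(sesh.query(History), key=lambda h: h.station_id)
#     variable_cache = build_cache(sesh.query(Variable), key=lambda v: (v.network_name, v.name))
# Then every obs_tuple lookup is a dictionary hit instead of a query:
#     hist = history_cache.get(obs.station_id)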

Write log files compressed

Our log files are filling up the root filesystem. There should be an option to crmprtd scripts to write log files with compression.
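
One way to do this with the standard library (a sketch, not an existing crmprtd option): use a RotatingFileHandler and gzip each rotated file via the handler's rotator/namer hooks.

import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the rotated log file and drop the uncompressed copy."""
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)


handler = logging.handlers.RotatingFileHandler(
    "crmprtd.log", maxBytes=10 * 1024 * 1024, backupCount=5
)
handler.namer = lambda name: name + ".gz"
handler.rotator = gzip_rotator
logging.getLogger("crmprtd").addHandler(handler)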

crmprtd incorrectly assumes Pacific time

Currently, crmprtd assumes that all timestamps are local and in Pacific time. For NE BC this is incorrect; in general, the timezone should be determined based on the lat/lon/time of any given observation. Time is needed because timezones have changed historically.
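
A sketch of location-based timezone resolution; it assumes the third-party timezonefinder package (not a current dependency) plus the standard-library zoneinfo, and the coordinates are illustrative:

from datetime import datetime
from zoneinfo import ZoneInfo          # Python 3.9+

from timezonefinder import TimezoneFinder  # assumed third-party dependency

tf = TimezoneFinder()


def localize_observation(naive_time, lat, lon):
    """Attach the timezone implied by the station location; the tz database
    handles historical changes once the datetime is zone-aware."""
    tz_name = tf.timezone_at(lng=lon, lat=lat)   # e.g. 'America/Fort_Nelson' in NE BC
    return naive_time.replace(tzinfo=ZoneInfo(tz_name))


print(localize_observation(datetime(2020, 4, 4, 6, 0), lat=58.8, lon=-122.7))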

Fix PEP8 violations and add flake8 checking to CI

Pull request #16 had a number of changes in it that were fixing up PEP8 Python Style Guide violations. This is great! I'm glad that @nikola-rados added these! However, the changes were a little distracting from the actual content during the code review.

Let's squash all of the PEP8 violations in a single go. Use flake8 to identify all of the violations or use autopep8 to fix as many as possible automatically. Then fix the rest. Then ensure that flake8 is added to the test_requirements.txt and run as a test under TravisCI.

Let's branch this off of master as a hotfix branch, and then we can merge it into both master and dev.

Update Meteorological Service of Canada URL

From ECCC Datamart:

======================

Dear MSC Datamart users,

Please note that as of October 13, 2020, the following MSC Datamart access URLs will be decommissioned:

The URL is hardcoded here and needs to be either parameterized or at least changed.

moti_hourly.py is broken in Python 3

(env) [crmprtd@crmprtd ~]$ moti_hourly.py -c 'postgresql://dbuser@dbhost/crmp' --auth=$AUTH_FILE --auth_key=moti -l ~/moti/logs/crmprtd.log --log_level=ERROR -C ~/moti/cache/
Traceback (most recent call last):
  File "/storage/home/crmprtd/env/bin/moti_hourly.py", line 151, in <module>
    main(args)
  File "/storage/home/crmprtd/env/bin/moti_hourly.py", line 92, in main
    f.write(req.content)
TypeError: must be str, not bytes
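
The likely fix (a sketch; the function and argument names are placeholders for the code around line 92): req.content is bytes, so the cache file must be opened in binary mode, or the content decoded explicitly.

def save_response(req, cache_filename):
    # was: open(cache_filename, "w") ... f.write(req.content)  -> TypeError under Python 3
    with open(cache_filename, "wb") as f:
        f.write(req.content)
    # or, keeping a text-mode file:
    #     f.write(req.content.decode("utf-8"))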

Clarify interface between download and normalize stages

The interface between the download and normalize stages is nominally supposed to be a "file stream", but that's very loosely defined.

The process script reads from sys.stdin and generates a generator of lines of bytes, whereas most of the tests use BytesIO objects or something else.

There is a clear lack of continuity between the tests and real-world input, which has, on numerous occasions, resulted in bugs going out (even with passing tests). We need to make these usages consistent.

closest_stns_within_threshold() function breaks when bad data exists

The crmprtd.ec module relies on a PL/pgSQL function named closest_stns_within_threshold(). On one hand, it's kind of convenient to put this functionality in the database, but on the other, if it breaks, everything goes wrong.

We had two bad records in the crmp database, such that X and Y (lon and lat) were reversed:

crmp=> select network_name, native_id, station_name, history_id, station_id, st_y(the_geom) as y, st_x(the_geom) as x from meta_history natural join meta_station natural join meta_network where the_geom is not null and (st_y(the_geom) < -90 or st_y(the_geom) > 90 or st_x(the_geom) < -180 or st_x(the_geom) > 180);
 network_name | native_id |            station_name            | history_id | station_id |     y     |    x    
--------------+-----------+------------------------------------+------------+------------+-----------+---------
 ENV-AQN      | E289310   | Annacis Island Metro Van Inst Shop |      14136 |      12080 | -122.9606 | 49.1658
 ENV-AQN      | E206167   | Pemberton Clover Road              |      14269 |      12214 |  122.7897 | 50.3225
(2 rows)

When we tried to run the Environment Canada script real_time_ec.py, it would error out every time as follows:

InternalError: (psycopg2.InternalError) current transaction is aborted, commands ignored until end of transaction block
 [SQL: 'SELECT history_id from closest_stns_within_threshold(%(lon)s, %(lat)s, %(threshold)s)'] [parameters: {'threshold': '1000', 'lat': 54.15914, 'lon': -131.661326}]

...even though we were running it on data completely unrelated to the bad entries.

We should find a way to either remove this function (do it in Python), make it robust against bad data, or put database constraints in meta_history such that bad data cannot exist.

Add functionality for --diag

In the current data pipeline the --diag flag is not being used. The purpose of this flag is to run the pipeline in diagnostic mode. It should be passed into the align and insert stages, since they interact with the database.

Write a `download` module for BC Hydro

Our BC Hydro feed is currently being downloaded with some one-off bash scripts. It's not super complicated (basically just a bunch of files off of an FTP site), so we should be able to easily recreate this within the crmprtd ecosystem.

See the crmprtd.wmb.download module for a similar approach.
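
A rough sketch of what such a module could look like with the standard library's ftplib (host, directory, and file suffix are placeholders; the real feed details live in the existing bash scripts):

import sys
from ftplib import FTP


def download(host, directory, suffix=".txt"):
    """Fetch matching files from the FTP site and stream them to stdout
    for the normalize stage, as the other crmprtd download modules do."""
    ftp = FTP(host)
    ftp.login()                      # or credentials from the auth file
    ftp.cwd(directory)
    for name in ftp.nlst():
        if name.endswith(suffix):
            ftp.retrbinary(f"RETR {name}", sys.stdout.buffer.write)
    ftp.quit()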

Reorganize/Refactor Data Normalization

Since the download code has been isolated in crmprtd the next stage in the pipeline can begin. The normalize stage should receive the output of the download stage (a file stream) as input and output a stream of named tuples. Place the code into the [network name]/normalize.py file since each module's data normalization will be unique. Branch this off of dev.
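
A minimal sketch of the intended shape (the Row fields mirror the wamr example quoted later on this page; parse_line is a hypothetical, network-specific helper):

from collections import namedtuple

Row = namedtuple(
    "Row", "time val variable_name unit network_name station_id lat lon"
)


def normalize(file_stream):
    """Consume the download stage's byte stream and yield one Row per observation."""
    for line in file_stream:
        fields = parse_line(line)   # hypothetical network-specific parser returning a dict
        yield Row(**fields)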

Create tests for crmprtd

The test coverage for crmprtd is just above 50%; moving forward, this coverage will need to be more complete. The goal is to bring the coverage above 90%.

Document and/or handle the heterogeneous time handling across networks

Different networks post their data at different frequencies, containing different time ranges and for different durations. For example, Environment Canada posts one hour's worth of met data every hour, with the previous month of data available. Contrast that with BC's ENV-AQN network, which posts a single file containing the previous month's worth of data, updated daily. Or BC's Wildfire Management Branch, which posts a single file that is updated hourly containing the previous day's data.

At the very least, we need to document the differences for each network, preferably in the help text of their respective "download" scripts.

One step further would be to implement consistent time arguments in the download scripts that error out if a time range is selected that is incompatible with its network.

One step further would be to implement a time range selection in the process script that could filter observations that are outside of the range. This would cut down on the number of skips that we receive on the networks that post rolling windows (and drastically speed up our insertion phase).
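
Such a filter could be as small as the sketch below (names are illustrative), dropping out-of-range rows before the align/insert phases ever see them:

def filter_by_time(rows, start, end):
    """Keep only observations whose time falls in [start, end)."""
    return (row for row in rows if start <= row.time < end)

# e.g. rows = filter_by_time(norm_mod.normalize(download_iter), start, end)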

AttributeError in wamr.normalize when running infill_all script

While running the infill script for the most recent outage, the new version 3.1.3 gives the following error when running crmprtd_infill_all:

2020-07-17 11:03:36,971:INFO:crmprtd.wamr.normalize - Starting WAMR data normalization
Traceback (most recent call last):
  File "/storage/home/nannau/crmp/crmpvenv/bin/crmprtd_process", line 33, in <module>
    sys.exit(load_entry_point('crmprtd', 'console_scripts', 'crmprtd_process')())
  File "/storage/home/nannau/crmp/crmprtd_3.1.3/crmprtd/crmprtd/process.py", line 91, in main
    process(args.connection_string, args.sample_size, args.network, args.diag)
  File "/storage/home/nannau/crmp/crmprtd_3.1.3/crmprtd/crmprtd/process.py", line 61, in process
    rows = [row for row in norm_mod.normalize(download_iter)]
  File "/storage/home/nannau/crmp/crmprtd_3.1.3/crmprtd/crmprtd/process.py", line 61, in <listcomp>
    rows = [row for row in norm_mod.normalize(download_iter)]
  File "/storage/home/nannau/crmp/crmprtd_3.1.3/crmprtd/crmprtd/wamr/normalize.py", line 27, in normalize
    reader = csv.DictReader(file_stream.getvalue().decode('utf-8')
AttributeError: 'generator' object has no attribute 'getvalue'
/storage/home/nannau/crmp/crmprtd_3.1.3/crmprtd/scripts/infill_all.py:193: UserWarning: WMB cannot be infilled since the period of infilling is outside the currently offered data (the previous day).
  warn(warning_msg['disjoint'].format("WMB", "day"))

It was run with the following command:

crmprtd_infill_all -S '2020/07/11 05:00:00' -E '2020/07/13 00:00:00' -a auth -c connection_string -L log_config -l logfile -N wamr wmb

Might this have something to do with #75?

wamr normalize returns station name as station_id, causing downstream issues, especially when lat/lon are missing from source data

In trying to use the crmprtd tools to patch a hole in the station data from the ENV-AQN/wamr network, I've run into a problem that boils down to normalize not grabbing the native ID from the source file, getting the station_name instead, and using it in the tuple as station_id.

Essentially, normalize_wamr looks for these field names in the flat files:

keys_of_interest = (
"DATE_PST",
"STATION_NAME",
"UNIT",
"UNITS",
"PARAMETER",
"REPORTED_VALUE",
"LONGITUDE",
"LATITUDE",
)

But yields:

    yield Row(
        time=dt,
        val=value,
        variable_name=variable_name,
        unit=unit,
        network_name="ENV-AQN",
        station_id=station_id,
        lat=lat,
        lon=lon,
    )

Based on these keys_of_interest, the station_id is being populated from the STATION_NAME field in the flat file. Ideally, it would be associated with the EMS_ID.

When a station history search yields nothing, crmprtd creates a new station which, in this case, puts the STATION_NAME into the native_id field. Again, this should be the EMS_ID. I think when the flat files have a lat/lon, this issue is avoided because a nearest station search gives a hit. When they don't, align searches for a native_id in the database equivalent to the value it gets from STATION_NAME. This fails, so a new station is created.
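
A sketch of the proposed change, assuming the ENV-AQN flat files carry an EMS_ID column (to be confirmed against the actual feed):

keys_of_interest = (
    "DATE_PST",
    "EMS_ID",          # new: the true native ID
    "STATION_NAME",
    "UNIT",
    "UNITS",
    "PARAMETER",
    "REPORTED_VALUE",
    "LONGITUDE",
    "LATITUDE",
)

# ...and the yielded Row would then use it:
#     station_id=row["EMS_ID"],   # instead of the STATION_NAME value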

Datetime offset problem when running infill_all.py

While running infill_all.py:

python3 infill_all.py -S '2020/03/12 03:00:00' -E '2020/03/12 10:30:00' -a path-to-config.yaml -c "connection-string" -L path-to-logger.yaml

(With default parameters for everything else) gives:

Traceback (most recent call last):
  File "infill_all.py", line 299, in <module>
    main()
  File "infill_all.py", line 77, in main
    log_args)
  File "infill_all.py", line 164, in infill
    if not interval_overlaps((start_time, end_time), last_month):
  File "infill_all.py", line 284, in interval_overlaps
    return max(a[0], b[0]) <= min(a[1], b[1])
TypeError: can't compare offset-naive and offset-aware datetimes

I'm running Python 3.6.8, GCC 4.8.5, and CentOS Linux 7.

Prior to the Python error, the logs since the last successful insert are as follows:

2020-03-17 14:10:21,478:INFO:crmprtd.moti.download - Starting MOTIe rtd
2020-03-17 14:10:21,487:INFO:crmprtd.moti.download - Starting manual run using timestamps 2020-03-12 00:00:00 2020-03-12 10:30:00
2020-03-17 14:10:21,490:INFO:crmprtd.moti.download - Downloading https://prdoas2.apps.th.gov.bc.ca/saw-data/sawr7110
2020-03-17 14:10:21,855:INFO:crmprtd.moti.download - 200: https://prdoas2.apps.th.gov.bc.ca/saw-data/sawr7110?request=historic&station=28072&from=2020-03-12%2F00&to=2020-03-12%2F10
2020-03-17 14:10:29,527:INFO:crmprtd.moti.normalize - Starting MOTI data normalization
2020-03-17 14:10:31,167:INFO:crmprtd.insert - Using Chunk + Bisection Strategy
2020-03-17 14:10:31,300:INFO:crmprtd.insert - Successfully inserted observations
2020-03-17 14:10:31,352:INFO:crmprtd.insert - Successfully inserted observations
2020-03-17 14:10:31,379:INFO:crmprtd.insert - Successfully inserted observations
2020-03-17 14:10:31,440:INFO:crmprtd.insert - Data insertion complete
2020-03-17 14:10:31,441:INFO:crmprtd - Data insertion results
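
A likely fix (sketch): make both interval endpoints timezone-aware before comparing them. The helper below assumes UTC is an acceptable default for naive values; interval_overlaps mirrors the function in the traceback.

from datetime import timezone


def ensure_aware(dt, tz=timezone.utc):
    """Attach a timezone to naive datetimes so comparisons don't raise TypeError."""
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=tz)


def interval_overlaps(a, b):
    a = tuple(ensure_aware(t) for t in a)
    b = tuple(ensure_aware(t) for t in b)
    return max(a[0], b[0]) <= min(a[1], b[1])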

`download_moti` script has faulty time logic

MoTI's data download application has some strange properties.

By default (no parameters), the app will return all of the data for all of the available stations for the previous hour. You can optionally specify a time range that is less than 7 days and no farther in the past than one month prior. However, if you do specify a time range, you can only do so for a single station and you have to specify it. Therefore, you can't specify a time range without a priori information.

Unfortunately, the download_moti script doesn't pass this logic through correctly. If one specifies a time range, but not a station, it silently throws all of the parameters away. This makes the script really hard to use effectively.

This should be fixed. The script should raise an error if the given parameters are not in line with the MoTI SAWR parameter requirements.
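
A sketch of the kind of up-front validation this implies (argument names are illustrative, not the script's actual options):

from datetime import timedelta


def validate_moti_args(start_time, end_time, station_id):
    """Fail loudly instead of silently discarding parameters MoTI's SAWR app can't honour."""
    if (start_time or end_time) and not station_id:
        raise ValueError(
            "MoTI only accepts a time range for a single, explicitly specified station"
        )
    if start_time and end_time and end_time - start_time > timedelta(days=7):
        raise ValueError("MoTI requests cannot span more than 7 days")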

bisect_insert_strategy doesn't accumulate results

Just did a run of the moti_insert and found an interesting result:

{"asctime": "2018-08-24 14:59:20,814", "levelname": "INFO", "name": "crmprtd.insert", "message": "Using Chunk + Bisection Strategy"}
{"asctime": "2018-08-24 14:59:21,111", "levelname": "INFO", "name": "crmprtd.insert", "message": "Successfully inserted observations", "num_obs": 512}
{"asctime": "2018-08-24 14:59:21,187", "levelname": "INFO", "name": "crmprtd.insert", "message": "Successfully inserted observations", "num_obs": 128}
{"asctime": "2018-08-24 14:59:21,209", "levelname": "INFO", "name": "crmprtd.insert", "message": "Successfully inserted observations", "num_obs": 32}
{"asctime": "2018-08-24 14:59:21,224", "levelname": "INFO", "name": "crmprtd.insert", "message": "Successfully inserted observations", "num_obs": 16}
{"asctime": "2018-08-24 14:59:21,230", "levelname": "INFO", "name": "crmprtd.insert", "message": "Successfully inserted observations", "num_obs": 2}
{"asctime": "2018-08-24 14:59:21,234", "levelname": "INFO", "name": "crmprtd.insert", "message": "Data insertion complete"}
{"asctime": "2018-08-24 14:59:21,235", "levelname": "INFO", "name": "crmprtd", "message": "Data insertion results", "results": {"insertions_per_sec": 4.76, "skips": 0, "successes": 2, "failures": 0}}

As you can see, the end "result" is 2 successful insertions, even though the logs show that it's successfully inserting larger chunks before that. If you look at the code:

        with Timer() as tmr:
            for chunk in chunks(observations):
                dbm = bisect_insert_strategy(sesh, chunk)

You have forgotten to accumulate the results from each chunk insert, throwing away the results from before. This should be a quick fix:

dbm += bisect_insert_strategy(sesh, chunk)

As long as dbm is initialized prior to the beginning of the loop.
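
For clarity, the corrected loop would look roughly like this (assuming DBMetrics can be constructed empty and supports +=, per the suggestion above):

        dbm = DBMetrics()                       # initialize before the loop
        with Timer() as tmr:
            for chunk in chunks(observations):
                dbm += bisect_insert_strategy(sesh, chunk)   # accumulate, don't overwrite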

Add black to project

A lot of our repos have moved over to the strict black formatting. This update will:

  • Update format of code around repo
  • Add black code check to actions
  • Add pre-commit hook to repo and to Makefile

Refactor ArgParse arguments in the scripts

There are a number of command line arguments that are common to all of the download scripts that we have (or at least should be). Examples include cache_file, logging options, diagnostic mode, etc.

These arguments get repeated in all of the scripts. We should pull them all up into a common function that takes a parser as an argument and adds them all. As follows:

# In crmprtd/__init__.py
def common_script_arguments(parser):
    parser.add_argument('-D', '--diag', ...)
    parser.add_argument(...)
    return parser

Then we can use this common code in all of the scripts and reduce the volume of repetitive code by quite a bit.

Implement time range selection in process script

Implement a time range selection in the process script that could filter observations that are outside of the range. This would cut down on the number of skips that we receive on the networks that post rolling windows (and drastically speed up our insertion phase).

See #50

Investigate SQLAlchemy's nested transactions

There is some evidence to suggest that our nested transactions are not behaving properly. The moti_infill.py script reports that it's making insertions; however, those insertions never actually appear in the database.

The transaction management in this package is "complicated". We haven't pinned our version of SQLAlchemy in the requirements.txt so there is a real possibility that the behaviour of nested transactions has changed out from under us. We need to:

  1. Upgrade and pin to the latest stable version of SQLAlchemy and
  2. Ensure that the nested transactions are behaving the way that they should

Handle additional WAMR info

The network for BC Ministry of Environment's Air Quality Network (known in our code as "WAMR") has started adding lat/lon information into their CSV data feeds.

To date, we have manually tokenized their data, ignoring the headers. Thus, additional fields have caused processing of the entire feed to fail. This should be fixed, ideally in a backward-compatible manner (e.g., utilize the header information and then integrate the lat/lon fields if available).

Adjust MOTI normalization log level

An error logged in moti's normalization module reports when it cannot find obs-series in the XML file. It was discovered that this variable is missing mainly due to user permissions for different stations (see the note here). This is expected behaviour, so we should log a missing obs-series as a warning rather than an error.

Standardize logging

Now that we're actually doing something with our logs, it will become important for them to have a bit more structure than the free flowing, laissez faire approach that we have taken to date.

  1. Logs should be parseable: we need to use a log formatter that can be easily consumed by logstash and can handle multiline messages and stacktraces. Most advice says to use a JSON log formatter.
  2. Logs should provide metrics: data points that get inserted vs. skipped (due to being duplicates) vs. errors are all extremely important metrics for this package. Provide these values as specifically attributed values in the log entries at the conclusion of each run. Insertions / second gives us good information on database code performance. We can get creative on measuring other things, but these are the most important.
  3. Logs (and information levels) should be consistent. Right now our logs are a little scattered about what type of information belongs at which level of importance (DEBUG, INFO, WARNING, etc.). We should define the levels at which certain kinds of information should be logged and make that consistent across all of our modules.

We're already using Python's logging module which will make this relatively easy. We just need to define an application-wide logging config and use it consistently across modules. There are only 99 logging statements in the code, so this shouldn't be too bad.

james@basalt:~/code/git/crmprtd/crmprtd$ grep "log\." *.py | wc -l
99

Development wise, we should work on this as a hotfix branch off of master.
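
As a starting point (a sketch, not a decided design), a single dictConfig with a JSON formatter would cover points 1 and 2; it assumes the third-party python-json-logger package, which is one common choice:

import logging
import logging.config

LOGGING = {
    "version": 1,
    "formatters": {
        "json": {"()": "pythonjsonlogger.jsonlogger.JsonFormatter"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "json"},
    },
    "loggers": {"crmprtd": {"handlers": ["console"], "level": "INFO"}},
}

logging.config.dictConfig(LOGGING)
log = logging.getLogger("crmprtd.insert")
# Metrics become structured, parseable fields rather than free text:
log.info("Data insertion results", extra={"successes": 512, "skips": 0, "failures": 0})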

Materialized view functions not in user's search path.

When running the crmprtd_process script, it's possible to come across the following error:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedFunction) function matview_queue_refresh_row(unknown, integer) does not exist
LINE 1: SELECT matview_queue_refresh_row('collapsed_vars_mv', NEW.hi...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT matview_queue_refresh_row('collapsed_vars_mv', NEW.history_id)
CONTEXT:  PL/pgSQL function "collapsed_vars_mv_obs_raw_insert" line 3 at PERFORM

This error shows up if you connect to the database with a user that doesn't have the materialized view functions (e.g. matview_queue_refresh_row) in its search path.

We should catch this Exception, and provide a more actionable and informative error message for the user.
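
A sketch of the kind of wrapper that could do this (assuming psycopg2 >= 2.8 for psycopg2.errors, and a SQLAlchemy session named sesh):

from psycopg2.errors import UndefinedFunction
from sqlalchemy.exc import ProgrammingError


def commit_with_hint(sesh):
    """Commit, but turn the search_path failure into an actionable message."""
    try:
        sesh.commit()
    except ProgrammingError as e:
        if isinstance(e.orig, UndefinedFunction):
            raise RuntimeError(
                "Materialized view functions (e.g. matview_queue_refresh_row) are not "
                "in this database user's search_path; connect as a user that can see "
                "them, or adjust search_path."
            ) from e
        raise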

Address requests TLS warning

When running the MoTIe downloader, the following warnings appear:

/home/bveerman/crmprtd_env/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/home/bveerman/crmprtd_env/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning

This is apparently (?) due to using an old version of Python (2.7.3!).

Either the Python version should be upgraded, or the error addressed. Possible solution on Stack Overflow.

Add coverage tests to continuous integration testing

While test coverage isn't a perfect metric, it can be a good indicator as to how completely the code base is tested. The crmprtd test coverage is very poor:

----------- coverage: platform linux, python 3.5.2-final-0 -----------
Name                        Stmts   Miss  Cover
-----------------------------------------------
crmprtd/__init__.py            20     17    15%
crmprtd/compat.py               4      1    75%
crmprtd/db.py                  44      3    93%
crmprtd/ec.py                 180     30    83%
crmprtd/moti.py               149     37    75%
crmprtd/wamr.py               165     48    71%
crmprtd/wmb.py                235    235     0%
crmprtd/wmb_exceptions.py      15     15     0%
-----------------------------------------------
TOTAL                         812    386    52%

For this issue, please enable coverage testing in our TravisCI continuous integration config. Configure coverage to measure only modules in the crmprtd package directory, and configure it to fail our test suite under a generous threshold. Let's start with 50% (so that our present tests pass) and then crank that up to 80-90% as we slowly add tests to cover the rest of the code base.

Station ID not found logged in normalize.py

After running the infill script for recent outages, we noticed a shortage of data from MoTI for March 12th and March 23rd, 2020.

The logs reveal the following, and we suspect it points to the skipped entries:

Could not detect the station id: xpath search '//observation-series/origin/id[@type='client']' return no results

The exception raised is

list index out of range

A raw example from the logs:

{"asctime": "2020-03-17 23:41:07,675", "levelname": "INFO", "name": "crmprtd.moti.normalize", "message": "Starting MOTI data normalization"}
{"asctime": "2020-03-17 23:41:07,677", "levelname": "ERROR", "name": "crmprtd.moti.normalize", "message": "Could not detect the station id: xpath search '//observation-series/origin/id[@type='client']' return no results", "exception": "list index out of range"}

This is logged by the normalize() function in normalize.py, on line 36:

except IndexError as e:
    log.error("Could not detect the station id: xpath search "
                  "'//observation-series/origin/id[@type='client']' "
                  "return no results", extra={'exception': e})
    continue

crmprtd_infill (the March 23rd event as an example) was run using the following arguments:

-S '2020/03/23 02:00:00'
-E '2020/03/23 08:30:00'
-L <path/to/logging/config>, -a <path/to/auth>, and -c <connection string>
-N moti

Examining the counts in the crmp database for these times reveals:

For the 12th:

crmp=> select date_trunc('hour', obs_time) as hour, count(*) from obs_raw natural join meta_vars natural join meta_network where obs_time > '2020-03-12:01:00:00' and obs_time < '2020-03-12:04:00:00' and network_name = 'MoTIe' group by date_trunc('hour', obs_time) order by hour;
        hour         | count
---------------------+-------
 2020-03-12 02:00:00 |  1025
 2020-03-12 03:00:00 |  2649
(2 rows)

For the 23rd:

crmp=> select date_trunc('hour', obs_time) as hour, count(*) from obs_raw natural join meta_vars natural join meta_network where obs_time > '2020-03-23:01:00:00' and obs_time < '2020-03-23:09:00:00' and network_name = 'MoTIe' group by date_trunc('hour', obs_time) order by hour;
        hour         | count
---------------------+-------
 2020-03-23 02:00:00 |  1030
 2020-03-23 03:00:00 |  1032
 2020-03-23 04:00:00 |  1028
 2020-03-23 05:00:00 |  1032
 2020-03-23 06:00:00 |  1030
 2020-03-23 07:00:00 |  2671
 2020-03-23 08:00:00 |  2687
(7 rows)

Curiously, or perhaps coincidentally, the shortage of data occurs at the 02:00:00 mark, roughly the same time both original outages began.

We are just looking to account for these skips, which recur in the logs during infilling for both the 12th and the 23rd.

Make insert usage consistent in insert.py

We have two insertion strategies in insert.py: bisect and one-by-one. Both have slightly different usages: the former takes a series of Obs objects, while the latter only takes one. Make these consistent. Move the loop into single_insert_obs, handle the uniqueness checks inside the function, and get rid of any usage of the UniquenessError exception which we don't need (it's expected behaviour, which we can easily test for, so it's not exceptional).

`download_moti` script has incorrect time formatting

From @faronium:

There is an error in how time ranges are being built into the URL. The web app that we are requesting data from requires timestamps of format:

%y-%m-%d/%H

note the slash and lack of :%M:%S

Currently, the code builds a URL with %Y-%m-%d %H:%M:%S

So downloading date ranges from individual stations doesn't work.

At the URL side of things, what gets delivered is something like this:

https://prdoas2.apps.th.gov.bc.ca/saw-data/sawr7110?request=historic&station=33099&from=2020-01-22+20%3A51%3A39.422198&to=2020-01-22+21%3A51%3A39.422198

And that fails with HTTP 400, returning a web page that says:

"Invalid date format: 2020-01-22 20:51:39.422198 - Should be yyyy-MM-dd/HH"

Whereas the following succeeds:

https://prdoas2.apps.th.gov.bc.ca/saw-data/sawr7110?request=historic&station=33099&from=2020-01-01/12&to=2020-01-06/13
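
The fix is presumably a one-line formatting change; a sketch (given the working URL above, a four-digit year, i.e. %Y, appears to be what the app wants):

from datetime import datetime


def moti_timestamp(dt):
    """Format a timestamp the way the SAWR app expects: yyyy-MM-dd/HH."""
    return dt.strftime("%Y-%m-%d/%H")


print(moti_timestamp(datetime(2020, 1, 22, 20, 51)))   # 2020-01-22/20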

Reorganize modules

I've started a dev branch for reorganizing the code into a four-stage pipeline. See crmprtd/__init__.py for a description of the stages.

For this issue, start with a checkout of the dev branch and begin pulling almost all code out of the executable scripts, except for command line argument handling and delegation to external functions. Basically, this lets us ignore test coverage in scripts/ because there should be no business logic there.

Move download specific code into the [network_name]/download.py modules and try to get a handle on how to separate the downloading code from the normalization code (in some networks, they are a bit intertwined at the moment).

Add input file support

With the introduction of the dev-branch refactor of crmprtd, some of the old functionality was lost. The script can no longer take in a data file and use it to make insertions into the database. A script needs to be created that takes in a file and runs it through the appropriate stages of the new pipeline.
