
eyalroz / usdt-ontime-tools


Tools for working with the US Department of Transportation on-time data and MonetDB

License: BSD 3-Clause "New" or "Revised" License

Topics: monetdb, benchmark, flight-data, dbms, transport, database, usdt-ontime, benchmarking, sql, analytics-database

usdt-ontime-tools's Introduction

'usdt-ontime' benchmark tools

This project facilitates the use of on-time and delayed flight data, available from the US Department of Transportation's Bureau of Transportation Statistics, for DBMS testing and benchmarking.

Specifically, it comprises:

  • A script for automating the tasks of downloading the zipped tabular data (available from transtats.bts.gov), mildly cleaning it, and loading it into a database with a (mostly) proper schema.
  • A set of queries used as the benchmark, to be executed on all data for the years 2000-2008, with correct query result files included as reference.
  • Miscellaneous additional potentially useful scripts and SQL queries.

This work is inspired (or rather, necessitated) by a post from several years ago on Percona's performance blog, comparing MonetDB with InfoBright on the set of queries included here. They are perhaps not the most thorough or insightful benchmark queries one could devise for this data, but that's what we're starting with.

Currently, only MonetDB is supported as the DBMS into which data is to be loaded.

Requirements

  • Internet connection (specifically HTTP)
  • The Bourne Again Shell - bash
  • Various typical Unix-ish command-line tools: unzip, wget, sed, echo and so on
  • MonetDB installed and running
  • Enough disk space for the data you want

Getting started

  1. Set up a MonetDB 'Database Farm' (see the MonetDB tutorial if you're not sure how to do that).
  2. Make sure you have your MonetDB credentials (username and password) in ~/.monetdb (otherwise you will be prompted for credentials for each of the dozens of files when loading the data).
  3. Invoke scripts/setup-usdt-ontime-db to create and populate the DB with data from 2000 through 2008 (an example invocation appears after this list); the script's command-line options are as follows:
Options:
  -r, --recreate           If the benchmark database exists, recreate it, dropping all
                           existing data. (If neither recreate nor append are set, the
                           database must not already exist.)
  -a, --append             If the benchmark database exists, append to it rather than
                           recreating it. (If neither recreate nor append are set, the
                           database must not already exist.)
  -d, --db-name NAME       Name of the database holding on-time performance data
  -f, --db-farm PATH       Filesystem path for the root directory of the MonetDB
                           database farm in which to place the database
  -p, --port NUMBER        IP port on which to connect to the database
  -k, --keep-downloads     Keep the zipped CSV files downloaded from the US department
                           of transport's website after loading them into the DB
  -s, --fix-sort-order     The order of appearance of records in the CSV files is
                           non-uniform; with this option (on by default), the CSVs
                           will be re-sorted, all in the same fixed order.
  -i, --ignore-sort-order  The order of appearance of records in the CSV files is
                           non-uniform; with this option, the CSVs will be loaded as-is.
  -v, --verbose            Be verbose about actions taken and current status
  -h, --help               Print usage information
  -D, --download-dir       Directory into which individual monthly data file will be
                           downloaded and decompressed before being loaded into the DB
                           (default: /export/scratch1/home/eyalroz/src/usdt-ontime-tools/usdt-ontime-downloads)
  --first-year YEAR        First year for which to download data
  --first-month MONTH      First month in first year for which to download data
  --last-year YEAR         Last year for which to download data
  --last-month MONTH       Last month in last year for which to download data
  4. Execute scripts/run_benchmark_queries.sh -v as a sanity check, to make sure you get results that look like the expected answers (you can also diff-compare the results produced by scripts/run_benchmark_queries.sh -w against the reference results in expected_results/).
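
For example, a first-time setup and sanity check might look as follows (the database name and farm path are placeholders for your own choices, and the results directory written by -w is assumed here rather than documented):

  # Create a fresh 'ontime' database in the given farm and load 2000 through 2008
  scripts/setup-usdt-ontime-db --recreate --db-name ontime \
      --db-farm /path/to/dbfarm --verbose

  # Run the benchmark queries, write the results, and compare with the references
  scripts/run_benchmark_queries.sh -w
  diff -r results/ expected_results/   # 'results/' is a hypothetical output path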

Questions? Requests? Feedback? Bugs?

Feel free to open an issue or write me.


usdt-ontime-tools's Issues

Estimate the free space requirement more accurately

We currently assume that every month of data requires up to 200 MB, uniformly - and estimate the necessary free space on that basis. This is an overestimate which we can correct. Also, we're not accounting for the fact that only the last month's file needs to be on disk outside the DB, while the rest of the data exists only in loaded form.
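
For reference, a minimal sketch of the current uniform estimate (the variable names are illustrative, not the script's actual ones; GNU df is assumed):

  # Assume every month takes up to 200 MB, uniformly
  num_months=$(( (last_year - first_year) * 12 + last_month - first_month + 1 ))
  required_mb=$(( num_months * 200 ))
  available_mb=$(df --output=avail --block-size=M "$download_dir" | tail -n 1 | tr -dc '0-9')
  (( available_mb >= required_mb )) || { echo "Not enough free space" >&2; exit 1; }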

Support recovery from an interrupted DB setup

Sometimes, the DB setup might get interrupted for some reason. Upon restarting, we should be able to determine the last 'full' month, clear the subsequent partial month, and start again from that month, eventually loading all the data as if the process had completed successfully the first time - rather than recreating the whole thing.
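
A hedged sketch of the probe for the resume point (the column names are assumed to follow the source data's Year/Month fields; the actual schema may differ):

  # Find the most recent - possibly partial - month already loaded
  mclient -d "$db_name" -f csv -s \
      'SELECT "year", "month", COUNT(*) FROM ontime
       GROUP BY "year", "month"
       ORDER BY "year" DESC, "month" DESC LIMIT 1'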

Logging issues

At the moment we log everything to the standard output stream, but don't redirect stderr. Should we? Perhaps we should also log to a file?
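
If we decide to do both, a couple of lines near the top of the script would suffice (a sketch, relying on bash process substitution, which we already require):

  # Mirror stdout and stderr to a log file while still printing to the terminal
  exec > >(tee -a usdt-ontime-setup.log) 2>&1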

Skip downloading & preprocessing if a CSV file is available

For each month of data, if we notice a CSV file with the appropriate name, we should skip downloading the ZIP file, decompressing and preprocessing it - and just go ahead and load from the CSV. This is particularly suitable for when we keep CSVs but not ZIPs.
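
A minimal sketch of the intended check ($year, $month and $download_dir are illustrative; the file naming follows the monthly files seen elsewhere on this page):

  csv="$download_dir/On_Time_On_Time_Performance_${year}_${month}.csv"
  if [[ -f "$csv" ]]; then
      echo "Found $csv - skipping download and preprocessing"
  else
      download_and_preprocess "$year" "$month"   # hypothetical helper
  fi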

Download website HTTPS certificate is invalid

When downloading, we get:

Connecting to transtats.bts.gov (transtats.bts.gov)|204.68.194.70|:443... connected.
ERROR: The certificate of ‘transtats.bts.gov’ is not trusted.
ERROR: The certificate of ‘transtats.bts.gov’ doesn't have a known issuer.

which messes up our work. Let's ignore this state of affairs and just download the file.
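
With wget, that amounts to adding --no-check-certificate, at the cost of forgoing certificate validation for this site:

  # Download despite the untrusted/unknown-issuer certificate
  wget --no-check-certificate "$monthly_zip_url"   # $monthly_zip_url: the month's ZIP URL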

Problem with the wheelsoff field in April 2003

The current script fails to load the CSV for April 2003:

MAPI  = monetdb@localhost:50000
QUERY = COPY OFFSET 2 INTO ontime FROM '/ufs/eyalroz/src/usdt-ontime-tools/usdt-ontime-downloads/On_Time_On_Time_Performance_2003_4.csv' USING DELIMITERS ',','\n','"'  NULL AS '';
ERROR = !Failed to import table line 64382 field wheelsoff 'int' expected in '0-90'

and indeed, we have that value in WheelsOff. We'll probably need to make it a string.

When unsure about invalid characters, nullify the field

At the moment, if we encounter an invalid character (within a certain range) which we're not sure how to correct, we replace it with an underscore '_'. That's not a good idea. Instead, in this case, we should nullify the whole field.

This can easily be implemented by changing the data-cleaning sed script we pipe the CSV through before loading, as sketched below.
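
A hedged sketch of such a rule (byte-oriented matching via LC_ALL=C; the real cleaning script's quoting logic may differ). Since the COPY statement uses NULL AS '', an emptied field is loaded as NULL:

  # Empty out any quoted field containing a byte outside printable ASCII
  LC_ALL=C sed -E 's/"[^"]*[^ -~][^"]*"/""/g'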

Handle MonetDB authentication somehow

At the moment, the DB creation script completely ignores the issue of authentication to the MonetDB server. What this means is that either the user has taken care of that in his/her ~/.monetdb file, or s/he will be prompted for credentials whenever mclient is called. The latter option is incredibly annoying when we have dozens and dozens of files to load.

So, we should probably do something about it.
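
One simple option is for the setup script to generate the file when it is absent (a sketch; monetdb/monetdb are MonetDB's stock default credentials):

  # mclient reads credentials from ~/.monetdb, avoiding per-invocation prompts
  if [[ ! -f "$HOME/.monetdb" ]]; then
      printf 'user=monetdb\npassword=monetdb\n' > "$HOME/.monetdb"
      chmod 600 "$HOME/.monetdb"
  fi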

Parallelize pre-load work

While we don't (?) want to mess around with how MonetDB does loading, specifically with its internal capacity for parallelism - there's no problem in parallelizing the processing of data before loading, i.e. the unzipping, filtering with sed, sorting and writing of the CSV.

Perhaps we can use GNU parallel or some other facility to achieve this?
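
For instance, with GNU parallel and a per-month worker function (preprocess_month and clean.sed are hypothetical names; the real script may not factor this way):

  # Unzip, clean and sort several months' CSVs concurrently
  preprocess_month() {
      local zip="$1" csv="${1%.zip}.csv"
      unzip -p "$zip" | sed -f clean.sed | sort > "$csv"
  }
  export -f preprocess_month
  parallel -j "$(nproc)" preprocess_month ::: *.zip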

Support removal of perfect dupes before loading into DB

The ontime data set has quite a lot of dirt. Some of it requires more careful analysis - which is probably best done in a DBMS proper. However, the CSV files also have some perfect, consecutive, duplicate records. It should be possible to remove these before loading.
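
Since the duplicates in question are consecutive, a plain uniq in the preprocessing pipeline would do (a sketch; clean.sed stands in for the existing cleaning step):

  # Drop consecutive identical records before loading
  unzip -p "$zip" | sed -f clean.sed | uniq > "$csv"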

Import errors with the WAC and UniqueCarrier fields

Great scripts! However, I was trying to run the import and got

ERROR = !Failed to import table 'ontime', line 1105 field originwac 'tinyint' expected in '826'
ERROR = !Failed to import table 'ontime', line 1001 field destwac 'tinyint' expected in '268'

on a bunch of files; I suggest changing these types to SMALLINT.

Error loading data from 2001 - non-ASCII characters

When trying to load monthly data files from 2001, I get the following error:

ERROR = !Failed to import table line 21 field tailnum 'varchar(50)' expected in '<E4>NKNO<E6>'

The actual string does indeed have the non-ASCII characters ä and æ (that is, the octets 0xE4 and 0xE6 respectively). That probably makes it an invalid UTF-8 string - and the column is created as UTF-8. I guess we'll need to change that.
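
Alternatively, since the offending octets look like Latin-1, the stream could be coerced into valid UTF-8 before loading (a sketch; it assumes the dirty bytes really are Latin-1):

  # Reinterpret the file as Latin-1 and emit valid UTF-8
  iconv -f LATIN1 -t UTF-8 "$csv" > "${csv%.csv}.utf8.csv"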
