
skills-airflow's People

Contributors

pyup-bot, rayidghani, thcrock, tweddielin


skills-airflow's Issues

Write full integration test

We should have a basic integration test that goes through the whole DAG, with a simple dataset of a handful of job postings across a couple of different quarters from a couple of different 'partners'.

The assertions at the end can probably be simple (check that the API database has data in the right tables and that there is some research output), but it would help catch the kind of bugs that currently make it to production.
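A rough sketch of what such a test could look like, assuming pytest and hypothetical dag/table names (the real ids live in the repo's dags/ and api_sync code):

# Hypothetical sketch -- dag_id, database URL, and table names are assumptions.
import sqlalchemy
from airflow.models import DagBag

def test_full_dag_end_to_end():
    # Every DAG file should at least import cleanly.
    dagbag = DagBag(dag_folder='dags/', include_examples=False)
    assert not dagbag.import_errors

    dag = dagbag.get_dag('simple_machine')  # hypothetical dag_id
    assert dag is not None

    # ... run the DAG here against a tiny fixture dataset: a handful of
    # job postings across two quarters from two different partners ...

    # Simple end-state assertions: the API database has rows in the expected tables.
    engine = sqlalchemy.create_engine('postgresql://localhost/test_api_db')
    for table in ('jobs_master', 'skills_master'):  # hypothetical table names
        assert engine.execute('SELECT COUNT(*) FROM {}'.format(table)).scalar() > 0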

Allow switch from boto to boto3

We want to switch from boto2 to boto3, but the Airflow S3Hook.get_conn() returns a boto connection. We should abstract this out a bit and control the boto version through a feature switch in config, so we can update skills-ml and skills-utils independently.
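A minimal sketch of the abstraction, assuming a hypothetical boto_version key in config (none of this exists yet):

# Sketch only: the boto_version config key and this helper are hypothetical.
import boto
import boto3

def get_s3_connection(config):
    """Return an S3 connection according to the configured boto version."""
    if config.get('boto_version', 2) == 3:
        return boto3.resource('s3')
    # Default to the legacy boto2 connection that Airflow's S3Hook currently returns.
    return boto.connect_s3()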

Fix DAG integration test Heisenbug

This test https://github.com/workforce-data-initiative/skills-airflow/blob/master/tests/api_sync_v1/test_dag.py

occasionally deadlocks on Travis; the Base.metadata.create_all() in ensure_db sometimes tries to create a table which is already there. testing.postgresql should make sure that this doesn't happen between tests (it nukes the old database file!), but it's possible that parallel processes within one test could be clobbering each other.

The fix for this might be schema migrations.
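If migrations turn out to be overkill, another option is to serialize create_all() across processes; here is a sketch, with a made-up helper name and an arbitrary advisory-lock key:

# Sketch: guard table creation against parallel processes racing each other.
from sqlalchemy import text

def create_tables_safely(engine, metadata):
    with engine.connect() as conn:
        # A Postgres advisory lock serializes CREATE TABLE across processes;
        # the key is arbitrary but must match everywhere this runs.
        conn.execute(text('SELECT pg_advisory_lock(72777)'))
        try:
            # checkfirst=True skips tables that already exist.
            metadata.create_all(bind=conn, checkfirst=True)
        finally:
            conn.execute(text('SELECT pg_advisory_unlock(72777)'))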

Convert Aggregator tasks to one-aggregator-per-task

Our aggregation tasks are too big. They are currently scoped to everything needed to produce a particular dataset, which sometimes ends up being multiple classifiers, a skill extractor, etc. This can end up using a lot of memory. And if we want to output multiple datasets that use the same aggregation task, all that work ends up being done again by each dataset generator.

We should really split each of these aggregation tasks into its own Airflow task, which can then be merged together afterwards.

We can do most of this here in the Airflow repository, by splitting tasks like GeoTitleCount into smaller ones that output smaller CSVs, and adding a new merge task that combines the CSVs into the output currently produced.
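A rough sketch of the shape this could take (dag id, task ids, and callables are hypothetical; the real aggregator names live in skills-ml):

# Sketch only -- Airflow 1.x style; dag id, task ids, and callables are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_single_aggregator(aggregator_name, **kwargs):
    """Run exactly one aggregator and write its partial CSV (stub)."""

def merge_partial_counts(**kwargs):
    """Combine the partial CSVs into the output GeoTitleCount produces today (stub)."""

dag = DAG('geo_title_count_split', schedule_interval=None,
          start_date=datetime(2017, 1, 1))

merge = PythonOperator(task_id='merge_geo_title_counts',
                       python_callable=merge_partial_counts, dag=dag)

for name in ('title_count', 'geo_count'):  # one Airflow task per aggregator
    partial = PythonOperator(task_id='aggregate_{}'.format(name),
                             python_callable=run_single_aggregator,
                             op_kwargs={'aggregator_name': name}, dag=dag)
    partial >> merge  # each small task feeds the single merge step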

Setup skills-airflow and skills-api

Hello everyone!

The Problem

First of all, thanks for the fantastic work. I would like to get skills-airflow and the skills-api up and running. However, the instructions provided don't seem to be enough for me to make them run. Maybe we can clarify things and improve the documentation together as well.

What I did so far for skills-airflow

  1. set up the virtual environment in the skills-airflow repo (Python 3.6.0) and pip installed requirements.txt and requirements_dev.txt
  2. installed PostgreSQL and created a database I called daw_db
  3. updated config/api_v1_db_config.yaml to
PGPORT: 5432
PGHOST: localhost
PGDATABASE: daw_db
PGUSER: daw_db
PGPASSWORD:
  4. the alembic upgrade head command fails for me; not sure whether that's important?

  5. set up the following S3 buckets, right now empty

my-geo-bucket
my-job-postings 
my-labeled-postings
my-model-cache 
my-onet
my-output-tables
  6. copied example_config.yaml to config.yaml
  7. ran the airflow scheduler, which gives the following output
[2019-03-28 12:01:06,790] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/

[2019-03-28 12:01:07,490] {jobs.py:1477} INFO - Starting the scheduler
[2019-03-28 12:01:07,490] {jobs.py:1485} INFO - Running execute loop for -1 seconds
[2019-03-28 12:01:07,491] {jobs.py:1486} INFO - Processing each file at most -1 times
[2019-03-28 12:01:07,491] {jobs.py:1489} INFO - Searching for files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,504] {jobs.py:1491} INFO - There are 19 files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,506] {jobs.py:1534} INFO - Resetting orphaned tasks for active dag runs
[2019-03-28 12:01:07,517] {dag_processing.py:453} INFO - Launched DagFileProcessorManager with pid: 34311
[2019-03-28 12:01:07,536] {settings.py:51} INFO - Configured default timezone <Timezone [UTC]>
[2019-03-28 12:01:07,568] {dag_processing.py:663} ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1
[2019-03-28 12:01:08,002] {jobs.py:1559} INFO - Harvesting DAG parsing results
[2019-03-28 12:01:09,630] {jobs.py:1559} INFO - Harvesting DAG parsing results
...

What I did so far for skills-api

  1. set up a virtual env (Python 2.7.11) in the skills-api repo and installed requirements.txt
  2. ran bin/make_config.sh specifying postgresql://localhost/daw_db
  3. ran python server.py runserver, which starts a server running on http://127.0.0.1:5000/v1/jobs

I get the error:
ProgrammingError: (psycopg2.ProgrammingError) relation "jobs_alternate_titles" does not exist
LINE 3: FROM jobs_alternate_titles) AS anon_1

The Question

  1. What exactly do I have to place into the S3 buckets, and in what format / with what naming conventions?
  2. Did I miss anything else?

Some help would be greatly appreciated!
Cheers

Move partner nightly to own dag

It seems that the quarterly schedule of our main dag stops the partner nightly subdag from actually being run nightly, even though it has its own nightly schedule. This prevents us from syncing USAJobs.

We should just move the nightly sync to its own dag; it will be more reliable that way. The quarterly dag should just work with whatever nightly results are there.
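Roughly, the standalone nightly DAG could look like this (ids and the callable are hypothetical):

# Sketch: a standalone nightly DAG so the partner sync no longer depends on
# the quarterly DAG's schedule. Ids and the callable are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def sync_partner_postings(**kwargs):
    """Pull the latest postings from each nightly partner, e.g. USAJobs (stub)."""

nightly_dag = DAG('partner_nightly', schedule_interval='@daily',
                  start_date=datetime(2017, 1, 1), catchup=False)

PythonOperator(task_id='partner_nightly_sync',
               python_callable=sync_partner_postings, dag=nightly_dag)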

Create example Airflow DAG

Create a placeholder (non-functional) DAG showing the entire flow. It should encompass ETL, ML processing, and API syncing (this might involve subdags).

Visualize this in the webserver and post screenshots
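A minimal placeholder along these lines (dag and task ids are illustrative; the real stages may differ):

# Placeholder sketch of the end-to-end flow with no-op tasks; ids are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('simple_machine_placeholder', schedule_interval=None,
          start_date=datetime(2017, 1, 1))

etl = DummyOperator(task_id='partner_etl', dag=dag)
ml = DummyOperator(task_id='ml_processing', dag=dag)
api_sync = DummyOperator(task_id='api_sync', dag=dag)

etl >> ml >> api_sync  # shows up as a three-stage chain in the webserver Graph view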

Leftover bugs from restructuring and python3 upgrade

Some problems that didn't get caught:

  • New job postings generate interface
  • Python3 tempfile changes
  • title_count DAG now checks for empty files

Also (newer, but worth fixing here):

  • New skills_ml NLP interface

Allow turning off of VA partner sync using config

Right now partner_quarterly gracefully skips if there is no config for any raw job postings, but assumes that VA is present if there is. Since VA is currently broken and has to be rewritten, we should also be able to skip it via config for now.
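One way to do it, assuming a new (hypothetical) va_sync_enabled key in config.yaml:

# Sketch: gate the VA portion of partner_quarterly behind a config flag.
# The va_sync_enabled key is hypothetical and defaults to off.
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f) or {}

if config.get('va_sync_enabled', False):
    # only build the VA sync tasks when explicitly enabled
    pass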

Use different queues for SubDAGs

The task scheduling seems to be problematic. It seems to favor SubDAG operators, so right now there are eight tasks running, all of them SubDAGs, which means no real work is being done. Effectively, Airflow deadlocks with itself. In theory you can get past this by bringing more and more workers online, but we have a lot of subdags and quarters and this can get ridiculous. Hundreds of workers online at once just to get anything done?

We can fix this with queues. I think if SubDAG operators go on their own queue, we can have one worker, maybe on the airflow host, working that queue with a concurrency of one or two. Then the beefy machines only watch the other queue.
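A sketch of what that could look like, with illustrative dag/task ids and queue names:

# Sketch: route SubDagOperators to their own Celery queue so they can't occupy
# every worker slot. Dag ids, task ids, and queue names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def make_subdag(parent_dag_id, child_id, start_date):
    subdag = DAG('{}.{}'.format(parent_dag_id, child_id),
                 schedule_interval=None, start_date=start_date)
    DummyOperator(task_id='do_real_work', queue='work', dag=subdag)
    return subdag

start = datetime(2017, 1, 1)
main_dag = DAG('main_quarterly', schedule_interval=None, start_date=start)

SubDagOperator(
    task_id='aggregation',
    subdag=make_subdag('main_quarterly', 'aggregation', start),
    queue='subdags',  # worked by one low-concurrency worker on the airflow host
    dag=main_dag,
)

# The beefy machines then only watch the 'work' queue:
#   airflow worker -q work
# while one small worker handles the 'subdags' queue:
#   airflow worker -q subdags -c 2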

Incorporate DAGs from skills-ml

Copy over the DAGs from skills-ml and incorporate them into the current SubDAG form. partner_etl can remain a placeholder for now (the data already exists, so the tasks temporarily being no-ops won't pose a huge problem; the later tasks can just get the data from S3 as per usual).
