
skills-airflow's People

Contributors

pyup-bot, rayidghani, thcrock, tweddielin


skills-airflow's Issues

Write full integration test

We should have a basic integration test that goes through the whole DAG, with a simple dataset of a handful of job postings across a couple of different quarters from a couple of different 'partners'.

The assertions at the end can probably be simple (check that the API database has data in the right tables and that there is some research output), but it would help catch the kind of bugs that currently make it to production.
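A rough sketch of what such a test could look like, assuming pytest and hypothetical dag/table names (the real ids live in the repo's dags/ and api_sync code):

# Hypothetical sketch -- dag_id, database URL, and table names are assumptions.
import sqlalchemy
from airflow.models import DagBag

def test_full_dag_end_to_end():
    # Every DAG file should at least import cleanly.
    dagbag = DagBag(dag_folder='dags/', include_examples=False)
    assert not dagbag.import_errors

    dag = dagbag.get_dag('simple_machine')  # hypothetical dag_id
    assert dag is not None

    # ... run the DAG here against a tiny fixture dataset: a handful of
    # job postings across two quarters from two different partners ...

    # Simple end-state assertions: the API database has rows in the expected tables.
    engine = sqlalchemy.create_engine('postgresql://localhost/test_api_db')
    for table in ('jobs_master', 'skills_master'):  # hypothetical table names
        assert engine.execute('SELECT COUNT(*) FROM {}'.format(table)).scalar() > 0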

Allow switch from boto to boto3

We want to switch from boto2 to boto3, but the Airflow S3Hook.get_conn() returns a boto connection. We should abstract this out a bit and control the boto version through a feature switch in config, so we can update skills-ml and skills-utils independently.
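A minimal sketch of the abstraction, assuming a hypothetical boto_version key in config (none of this exists yet):

# Sketch only: the boto_version config key and this helper are hypothetical.
import boto
import boto3

def get_s3_connection(config):
    """Return an S3 connection according to the configured boto version."""
    if config.get('boto_version', 2) == 3:
        return boto3.resource('s3')
    # Default to the legacy boto2 connection that Airflow's S3Hook currently returns.
    return boto.connect_s3()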

Fix DAG integration test Heisenbug

This test https://github.com/workforce-data-initiative/skills-airflow/blob/master/tests/api_sync_v1/test_dag.py

occasionally deadlocks on Travis; the Base.metadata.create_all() in ensure_db sometimes tries to create a table which is already there. testing.postgresql should make sure that this doesn't happen between tests (it nukes the old database file!), but it's possible that parallel processes within one test could be clobbering each other.

The fix for this might be schema migrations.
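If migrations turn out to be overkill, another option is to serialize create_all() across processes; here is a sketch, with a made-up helper name and an arbitrary advisory-lock key:

# Sketch: guard table creation against parallel processes racing each other.
from sqlalchemy import text

def create_tables_safely(engine, metadata):
    with engine.connect() as conn:
        # A Postgres advisory lock serializes CREATE TABLE across processes;
        # the key is arbitrary but must match everywhere this runs.
        conn.execute(text('SELECT pg_advisory_lock(72777)'))
        try:
            # checkfirst=True skips tables that already exist.
            metadata.create_all(bind=conn, checkfirst=True)
        finally:
            conn.execute(text('SELECT pg_advisory_unlock(72777)'))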

Convert Aggregator tasks to one-aggregator-per-task

Our aggregation tasks are too big. They are currently scoped to everything needed to produce a particular dataset, which sometimes ends up being multiple classifiers, a skill extractor, etc. This can end up using a lot of memory. And if we want to output multiple datasets that use the same aggregation task, all that work ends up being done again by each dataset generator.

We should really split each of these aggregation tasks into its own Airflow task, which can then be merged together afterwards.

We can do most of this here in the Airflow repository, by splitting tasks like GeoTitleCount into smaller ones that output smaller CSVs, and adding a new merge task that combines the CSVs into the output currently produced.
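A rough sketch of the shape this could take (dag id, task ids, and callables are hypothetical; the real aggregator names live in skills-ml):

# Sketch only -- Airflow 1.x style; dag id, task ids, and callables are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_single_aggregator(aggregator_name, **kwargs):
    """Run exactly one aggregator and write its partial CSV (stub)."""

def merge_partial_counts(**kwargs):
    """Combine the partial CSVs into the output GeoTitleCount produces today (stub)."""

dag = DAG('geo_title_count_split', schedule_interval=None,
          start_date=datetime(2017, 1, 1))

merge = PythonOperator(task_id='merge_geo_title_counts',
                       python_callable=merge_partial_counts, dag=dag)

for name in ('title_count', 'geo_count'):  # one Airflow task per aggregator
    partial = PythonOperator(task_id='aggregate_{}'.format(name),
                             python_callable=run_single_aggregator,
                             op_kwargs={'aggregator_name': name}, dag=dag)
    partial >> merge  # each small task feeds the single merge step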

Setup skills-airflow and skills-api

Hello everyone!

The Problem

First of all, thanks for the fantastic work. I would like to get skills-airflow and the skills-api up and running. However, the instructions provided don't seem to be enough for me to make them run. Maybe we can clarify things and improve the documentation together as well.

What I did so far for skills-airflow

  1. set up the virtual environment in the skills-airflow repo (Python 3.6.0) and pip installed requirements.txt and requirements_dev.txt
  2. installed PostgreSQL and created a database I called daw_db
  3. updated config/api_v1_db_config.yaml to
PGPORT: 5432
PGHOST: localhost
PGDATABASE: daw_db
PGUSER: daw_db
PGPASSWORD:
  4. the alembic upgrade head command fails for me; not sure whether that's important?

  5. set up the following S3 buckets, right now empty

my-geo-bucket
my-job-postings 
my-labeled-postings
my-model-cache 
my-onet
my-output-tables
  6. copied example_config.yaml to config.yaml
  7. ran the airflow scheduler, which gives the following output
[2019-03-28 12:01:06,790] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/

[2019-03-28 12:01:07,490] {jobs.py:1477} INFO - Starting the scheduler
[2019-03-28 12:01:07,490] {jobs.py:1485} INFO - Running execute loop for -1 seconds
[2019-03-28 12:01:07,491] {jobs.py:1486} INFO - Processing each file at most -1 times
[2019-03-28 12:01:07,491] {jobs.py:1489} INFO - Searching for files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,504] {jobs.py:1491} INFO - There are 19 files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,506] {jobs.py:1534} INFO - Resetting orphaned tasks for active dag runs
[2019-03-28 12:01:07,517] {dag_processing.py:453} INFO - Launched DagFileProcessorManager with pid: 34311
[2019-03-28 12:01:07,536] {settings.py:51} INFO - Configured default timezone <Timezone [UTC]>
[2019-03-28 12:01:07,568] {dag_processing.py:663} ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1
[2019-03-28 12:01:08,002] {jobs.py:1559} INFO - Harvesting DAG parsing results
[2019-03-28 12:01:09,630] {jobs.py:1559} INFO - Harvesting DAG parsing results
...

What I did so far for skills-api

  1. set up a virtual env (Python 2.7.11) in the skills-api repo and installed requirements.txt
  2. ran bin/make_config.sh specifying postgresql://localhost/daw_db
  3. ran python server.py runserver, which starts a server running on http://127.0.0.1:5000/v1/jobs

I get the error:
ProgrammingError: (psycopg2.ProgrammingError) relation "jobs_alternate_titles" does not exist
LINE 3: FROM jobs_alternate_titles) AS anon_1

The Question

  1. What exactly do I have to place into the S3 buckets, and in what format / with what naming conventions?
  2. Did I miss anything else?

Some help would be greatly appreciated!
Cheers

Move partner nightly to own dag

It seems that the quarterly schedule of our main dag stops the partner nightly subdag from actually being run nightly, even though it has its own nightly schedule. This prevents us from syncing USAJobs.

We should just move the nightly sync to its own dag; it will be more reliable that way. The quarterly dag should just work with whatever nightly results are there.
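Roughly, the standalone nightly DAG could look like this (ids and the callable are hypothetical):

# Sketch: a standalone nightly DAG so the partner sync no longer depends on
# the quarterly DAG's schedule. Ids and the callable are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def sync_partner_postings(**kwargs):
    """Pull the latest postings from each nightly partner, e.g. USAJobs (stub)."""

nightly_dag = DAG('partner_nightly', schedule_interval='@daily',
                  start_date=datetime(2017, 1, 1), catchup=False)

PythonOperator(task_id='partner_nightly_sync',
               python_callable=sync_partner_postings, dag=nightly_dag)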

Create example Airflow DAG

Create a placeholder (non-functional) DAG showing the entire flow. It should encompass ETL, ML processing, and API syncing (this might involve subdags).

Visualize this in the webserver and post screenshots
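A minimal placeholder along these lines (dag and task ids are illustrative; the real stages may differ):

# Placeholder sketch of the end-to-end flow with no-op tasks; ids are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('simple_machine_placeholder', schedule_interval=None,
          start_date=datetime(2017, 1, 1))

etl = DummyOperator(task_id='partner_etl', dag=dag)
ml = DummyOperator(task_id='ml_processing', dag=dag)
api_sync = DummyOperator(task_id='api_sync', dag=dag)

etl >> ml >> api_sync  # shows up as a three-stage chain in the webserver Graph view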

Leftover bugs from restructuring and python3 upgrade

Some problems that didn't get caught:

  • New job postings generate interface
  • Python3 tempfile changes
  • title_count DAG now checks for empty files

Also (newer, but worth fixing here):

  • New skills_ml NLP interface

Allow turning off of VA partner sync using config

Right now partner_quarterly gracefully skips if there is no config for any raw job postings, but assumes that VA is present if there is. Since VA is currently broken and has to be rewritten, we should also be able to skip it via config for now.
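One way to do it, assuming a new (hypothetical) va_sync_enabled key in config.yaml:

# Sketch: gate the VA portion of partner_quarterly behind a config flag.
# The va_sync_enabled key is hypothetical and defaults to off.
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f) or {}

if config.get('va_sync_enabled', False):
    # only build the VA sync tasks when explicitly enabled
    pass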

Use different queues for SubDAGs

The task scheduling seems to be problematic. It seems to favor SubDAG operators, so right now there are eight tasks running, all of them SubDAGs, which means no real work is being done. Effectively, Airflow deadlocks with itself. In theory you can get past this by bringing more and more workers online, but we have a lot of subdags and quarters and this can get ridiculous. Hundreds of workers online at once just to get anything done?

We can fix this with queues. I think if SubDAG operators go on their own queue, we can have one worker, maybe on the airflow host, working that queue with a concurrency of one or two. Then the beefy machines only watch the other queue.
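A sketch of what that could look like, with illustrative dag/task ids and queue names:

# Sketch: route SubDagOperators to their own Celery queue so they can't occupy
# every worker slot. Dag ids, task ids, and queue names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def make_subdag(parent_dag_id, child_id, start_date):
    subdag = DAG('{}.{}'.format(parent_dag_id, child_id),
                 schedule_interval=None, start_date=start_date)
    DummyOperator(task_id='do_real_work', queue='work', dag=subdag)
    return subdag

start = datetime(2017, 1, 1)
main_dag = DAG('main_quarterly', schedule_interval=None, start_date=start)

SubDagOperator(
    task_id='aggregation',
    subdag=make_subdag('main_quarterly', 'aggregation', start),
    queue='subdags',  # worked by one low-concurrency worker on the airflow host
    dag=main_dag,
)

# The beefy machines then only watch the 'work' queue:
#   airflow worker -q work
# while one small worker handles the 'subdags' queue:
#   airflow worker -q subdags -c 2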

Incorporate DAGs from skills-ml

Copy over the DAGs from skills-ml and incorporate them into the current SubDAG form. partner_etl can remain a placeholder for now (the data already exists, so the tasks temporarily being no-ops won't pose a huge problem; the later tasks can just get the data from S3 as per usual).
