Dataatwork.org - the Data@Work project website.
See http://dataatwork.org/about/edit-site/
This uses the OKFN Handbook Theme:
https://github.com/okfn/handbook-theme
See the README there for details of configuration options, layouts etc.
Orchestration of data processing tasks to power the Open Skills Project
License: Other
To compute the representativeness of our dataset (and/or the accuracy of our classifier), we need to compute counts per CBSA.
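For illustration, a minimal sketch of computing those counts with pandas, assuming postings arrive in a DataFrame with a hypothetical cbsa column (the column names are assumptions, not our actual schema):

import pandas as pd

# Hypothetical postings frame; in the pipeline these would come from S3.
postings = pd.DataFrame({
    'job_id': [1, 2, 3, 4],
    'cbsa': ['31080', '31080', '35620', '16980'],  # CBSA codes
})

# Posting counts per CBSA; comparing these against an external benchmark
# would tell us how representative the dataset is by metro area.
counts_per_cbsa = postings.groupby('cbsa').size().rename('posting_count')
print(counts_per_cbsa)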
We should have a basic integration test that goes through the whole DAG, using a simple dataset of a handful of job postings across a couple of quarters from a couple of different 'partners'.
The assertions at the end can be simple (check that the API database has rows in the right tables and that there is some research output), but this would help catch bugs before they make it to production.
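A rough sketch of the shape such a test could take; the helper, fixture path, table name, and database URL below are all hypothetical stand-ins, not our real schema:

import sqlalchemy

def test_full_dag_produces_output(tmp_path):
    # Hypothetical helper that runs the whole DAG against a tiny fixture
    # set of postings spanning two quarters and two partners.
    run_full_dag(
        postings_path='tests/fixtures/sample_postings',
        output_path=str(tmp_path),
    )

    engine = sqlalchemy.create_engine('postgresql://localhost/test_api_db')
    with engine.connect() as conn:
        # The API database should have rows in the right tables...
        count = conn.execute(
            sqlalchemy.text('SELECT count(*) FROM jobs_master')  # hypothetical table
        ).scalar()
    assert count > 0

    # ...and there should be some research output.
    assert any(tmp_path.iterdir())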
We want to switch from boto2 to boto3, but the Airflow S3Hook.get_conn() returns a boto (v2) connection. We should abstract this out a bit and control the boto version through a feature switch in config, so we can update skills-ml and skills-utils independently.
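A minimal sketch of what that feature switch could look like; the use_boto3 config key and wrapper function are assumptions about how we'd structure it, not existing code:

import boto
import boto3

def get_s3_connection(config):
    # Hypothetical config-driven switch, so skills-ml and skills-utils
    # can each migrate to boto3 on their own schedule.
    if config.get('use_boto3', False):
        return boto3.resource('s3')
    # Default to the boto (v2) connection that S3Hook currently returns.
    return boto.connect_s3()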
Since alembic requires a command-line script to be run, it should live here rather than in skills-ml.
occasionally deadlocks on Travis; the Base.metadata.create_all() in ensure_db sometimes tries to create a table which is already there. testing.postgresql should make sure this doesn't happen between tests (it nukes the old database file!), but parallel processes within one test may be clobbering each other.
The fix for this might be schema migrations.
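Until migrations land, one possible stopgap (a sketch, assuming a Postgres backend) is to serialize table creation behind a Postgres advisory lock so parallel processes can't race each other through create_all():

from sqlalchemy import text

ENSURE_DB_LOCK_ID = 42  # arbitrary application-level lock id (assumption)

def ensure_db(engine):
    # Base is the project's existing declarative base, as in ensure_db today.
    with engine.connect() as conn:
        # Only one process at a time gets past this; the rest block until
        # the lock holder finishes creating tables.
        conn.execute(text('SELECT pg_advisory_lock(:id)'), {'id': ENSURE_DB_LOCK_ID})
        try:
            Base.metadata.create_all(bind=conn, checkfirst=True)
        finally:
            conn.execute(text('SELECT pg_advisory_unlock(:id)'), {'id': ENSURE_DB_LOCK_ID})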
A new skill extractor class exists in skills-ml, motivation here: workforce-data-initiative/skills-ml#95
We should use this skill extractor in the DAG to provide a higher-confidence list of skills.
Once workforce-data-initiative/skills-ml#43 is merged, we can implement the orchestration of the new job listings ETL here.
Our aggregation tasks are too big. They are currently scoped to everything needed to produce a particular dataset, which sometimes means multiple classifiers, a skill extractor, etc. This can use a lot of memory, and if we want to output multiple datasets that share an aggregation, all that work gets redone by each dataset generator.
We should split each of these aggregations into its own Airflow task, and merge the results together afterwards.
We can do most of this here in the Airflow repository, by splitting tasks like GeoTitleCount into smaller ones that output smaller CSVs, and adding a new merge task that combines the CSVs into the output currently produced; a sketch follows.
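A sketch of that split, with hypothetical helpers, slice names, and file paths (none of these task ids or loaders exist yet):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import pandas as pd

def count_one_piece(piece, **kwargs):
    # Hypothetical: count a single partner/quarter slice and write a
    # small CSV, instead of holding every classifier and quarter in memory.
    df = load_piece(piece)  # hypothetical loader for one slice
    counts = df.groupby(['cbsa', 'title']).size()
    counts.to_csv('/tmp/geo_title_count_{}.csv'.format(piece))

def merge_counts(pieces, **kwargs):
    # New merge task: combine the small CSVs into the output that
    # GeoTitleCount produces today.
    merged = pd.concat(
        pd.read_csv('/tmp/geo_title_count_{}.csv'.format(p)) for p in pieces
    )
    merged.to_csv('/tmp/geo_title_count.csv', index=False)

dag = DAG('geo_title_count_split', start_date=datetime(2017, 1, 1),
          schedule_interval='@quarterly')
pieces = ['partner_a_2017q1', 'partner_a_2017q2']  # hypothetical slices

count_tasks = [
    PythonOperator(task_id='count_' + p, python_callable=count_one_piece,
                   op_kwargs={'piece': p}, dag=dag)
    for p in pieces
]
merge_task = PythonOperator(task_id='merge_counts', python_callable=merge_counts,
                            op_kwargs={'pieces': pieces}, dag=dag)
for t in count_tasks:
    t >> merge_task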
Hello everyone!
First of all, thanks for the fantastic work. I would like to get skills-airflow and skills-api up and running, but the instructions provided don't seem to be enough for me to make them run. Maybe we can clarify things and improve the documentation together.
I created a Postgres database daw_db and set config/api_v1_db_config.yaml to:
PGPORT: 5432
PGHOST: localhost
PGDATABASE: daw_db
PGUSER: daw_db
PGPASSWORD:
The alembic upgrade head command fails for me; not sure whether that's important?
I set up the following S3 buckets, empty right now:
my-geo-bucket
my-job-postings
my-labeled-postings
my-model-cache
my-onet
my-output-tables
Then I ran airflow scheduler, which gives the following output:
[2019-03-28 12:01:06,790] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2019-03-28 12:01:07,490] {jobs.py:1477} INFO - Starting the scheduler
[2019-03-28 12:01:07,490] {jobs.py:1485} INFO - Running execute loop for -1 seconds
[2019-03-28 12:01:07,491] {jobs.py:1486} INFO - Processing each file at most -1 times
[2019-03-28 12:01:07,491] {jobs.py:1489} INFO - Searching for files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,504] {jobs.py:1491} INFO - There are 19 files in /Users/matthausheer/airflow/dags
[2019-03-28 12:01:07,506] {jobs.py:1534} INFO - Resetting orphaned tasks for active dag runs
[2019-03-28 12:01:07,517] {dag_processing.py:453} INFO - Launched DagFileProcessorManager with pid: 34311
[2019-03-28 12:01:07,536] {settings.py:51} INFO - Configured default timezone <Timezone [UTC]>
[2019-03-28 12:01:07,568] {dag_processing.py:663} ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1
[2019-03-28 12:01:08,002] {jobs.py:1559} INFO - Harvesting DAG parsing results
[2019-03-28 12:01:09,630] {jobs.py:1559} INFO - Harvesting DAG parsing results
...
For the API, I ran bin/make_config.sh, specifying postgresql://localhost/daw_db, then python server.py runserver, which starts a server running on http://127.0.0.1:5000. When I request /v1/jobs I get the error:
ProgrammingError: (psycopg2.ProgrammingError) relation "jobs_alternate_titles" does not exist
LINE 3: FROM jobs_alternate_titles) AS anon_1
Some help would be greatly appreciated!
Cheers
It seems that the quarterly schedule of our main DAG stops the partner nightly subdag from actually being run nightly, even though it has its own nightly schedule. This prevents us from syncing USAJobs.
We should move the nightly work into its own DAG; it will be more reliable that way, and the quarterly DAG can just work with whatever nightly results are present. A sketch follows.
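A minimal sketch of the standalone nightly DAG (the DAG id, task id, and sync function are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def sync_usajobs(**kwargs):
    # Hypothetical: pull the latest USAJobs postings and store them on S3.
    pass

# Its own top-level DAG, so the @daily schedule actually fires nightly
# instead of being gated by the quarterly parent.
nightly_dag = DAG('partner_nightly', start_date=datetime(2017, 1, 1),
                  schedule_interval='@daily')

PythonOperator(task_id='sync_usajobs', python_callable=sync_usajobs,
               dag=nightly_dag)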
When workforce-data-initiative/skills-ml#60 is merged, we should include these changes in the aggregation and tabular upload DAGs.
Create a placeholder (non-functional) DAG showing the entire flow. It should encompass ETL, ML processing, and API syncing (this might involve subdags); see the sketch below.
Visualize this in the webserver and post screenshots.
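A sketch of what that skeleton could look like with DummyOperator no-ops (the DAG id and task ids are placeholders; the real version might swap some of these for subdags):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Every task is a no-op, but the graph renders the intended flow in
# the webserver's Graph View.
dag = DAG('open_skills_flow', start_date=datetime(2017, 1, 1),
          schedule_interval='@quarterly')

etl = DummyOperator(task_id='partner_etl', dag=dag)
ml = DummyOperator(task_id='ml_processing', dag=dag)
api_sync = DummyOperator(task_id='api_sync', dag=dag)

etl >> ml >> api_sync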
We collect useful stats when ETLing data, such as how many postings there are, in which quarters, and from which partners. We should also expose this data in the main README in the Research Hub.
Now that workforce-data-initiative/skills-ml#78 has been closed, we can parallelize the aggregation DAGs in a map-reduce fashion.
Some problems that didn't get caught:
Also (newer, but worth fixing here):
Right now partner_quarterly gracefully skips if there is no config for any raw job postings, but when config is present it assumes that VA is there. Since VA is currently broken and has to be rewritten, we should be able to skip it as well for now.
When workforce-data-initiative/skills-ml#61 is closed, the PartnerETL superclass should call it for runs.
The task scheduling seems to be problematic. It seems to favor SubDAG operators, so right now there are eight tasks running, all of them SubDAGs, which means no real work is being done. Effectively, Airflow deadlocks with itself. In theory you can get past this by bringing more and more workers online, but we have a lot of subdags and quarters, and that gets ridiculous: hundreds of workers online at once just to get anything done?
We can fix this with queues. If SubDAG operators go on their own queue, one worker, maybe on the Airflow host, can work that queue with concurrency one or two. Then the beefy machines only watch the other queue. A sketch follows.
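A sketch of that setup; the queue name and task id are illustrative, but the queue argument and the -q worker flag are standard Airflow:

from airflow.operators.subdag_operator import SubDagOperator

# Pin SubDAG operators to a dedicated queue so the subdag 'shells'
# can't occupy every slot on the real workers.
subdag_task = SubDagOperator(
    task_id='partner_etl_subdag',   # illustrative task id
    subdag=partner_etl_subdag,      # the existing subdag factory's output
    queue='subdag',
    dag=main_dag,                   # the existing main DAG
)

One lightweight worker (perhaps on the Airflow host) then runs airflow worker -q subdag with concurrency one or two, while the beefy machines run airflow worker -q default and only see tasks that do real work.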
Copy over the DAGs from skills-ml and incorporate them into the current SubDAG form. partner_etl can remain a placeholder for now (the data already exists, so the tasks temporarily being no-ops won't pose a huge problem; the later tasks can just get them from S3 as usual).