codeforafrica-scrapers / healthtools_ke
[morph] Scrapers for the HealthTools Kenya data.
Home Page: https://morph.io/CodeForAfrica-SCRAPERS/healthtools_ke
License: MIT License
The project currently has a relatively ad hoc structure. We should instead follow the HealthTools.API structure and improve our scrapers overall by borrowing OpenSanctions' approach of building a shared library like their libsanctions ("libhealthtools"?).
We should allow selecting which scrapers to run by passing a --scraper argument with the different "doc types", e.g. --scraper doctors:
python scraper.py
python scraper.py --scraper doctors
python scraper.py --scraper doctors,clinical_officers
The Clinical Officers website has changed the validity date, which is breaking our scraper.
This scraper started out as a simple script but has since evolved into a more complex system that we intend to "grow". We should therefore replace our use of print with Python's logging module.
"Good logging practice in Python" by @fangpenlin - https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/
We are currently using sys.argv directly, but we should adopt argparse as the standard as our options grow.
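A sketch of the same options with argparse; only --scraper comes from the examples above, the rest is illustrative:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="HealthTools Kenya scrapers")
    parser.add_argument(
        "--scraper",
        help="comma-separated doc types to run, e.g. doctors,clinical_officers",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # No --scraper means run everything, matching `python scraper.py` above.
    selected = args.scraper.split(",") if args.scraper else None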
After scraping, we should store stats in S3 in a stats.json file that will be used for display on an HTML page. This should include:
For Debate: This info should also be pushed to Google Analytics at a later stage.
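A minimal sketch of the upload step with boto3; the bucket name and the shape of the stats are assumptions, not the project's actual config:

import json
import boto3

def upload_stats(stats, bucket="healthtools-ke", key="stats.json"):
    # Bucket name and stats fields here are illustrative assumptions.
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(stats),
        ContentType="application/json",
    )

upload_stats({"doctors": {"records": 1200, "duration_secs": 340}})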
Continuous integration to help run tests on PRs.
When a scraping session takes longer than 30 minutes, we should send a warning notification to Slack.
Travis is our standard testing service, so we should use it instead of CircleCI.
Morph considers the scraper to have failed if no SQLite database is created:
Scraper didn't create an SQLite database in your current working directory called
data.sqlite. If you've just created your first scraper and not edited the code yet
this is to be expected.
Other than solving that error, it would be nice to make the data we scrape available on Morph too.
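Morph provides the scraperwiki library, and scraperwiki.sqlite.save writes to data.sqlite in the working directory, which is exactly the file Morph checks for. A minimal sketch (the record fields are illustrative):

import scraperwiki

# Saving through scraperwiki creates data.sqlite in the current working
# directory, which satisfies Morph and also exposes the data on morph.io.
record = {"name": "Jane Doe", "doc_type": "doctors"}  # illustrative fields
scraperwiki.sqlite.save(unique_keys=["name"], data=record)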
Currently we're outputting too many logs. We should only log successful scrapes as a whole (doctors, clinical officers, etc.) instead of per page. The remaining logs would be failed scrapes, which should be reported as errors rather than normal print output.
Currently we upload all the data to Elasticsearch at once, but this fails when there is a lot of data. For example, with Health Facilities we get the following error:
TransportError(413, u'{"Message":"Request size exceeded 10485760 bytes"}')
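One way around the 10 MB request limit is to send the documents in batches with the elasticsearch bulk helper; the index name and document shape here are assumptions:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def index_in_chunks(docs, index="health-facilities"):
    # Stream documents in batches of 500 instead of one oversized request.
    es = Elasticsearch()
    actions = (
        # "_type" is only needed on older Elasticsearch versions that
        # still require a mapping type.
        {"_index": index, "_type": "facility", "_source": doc}
        for doc in docs
    )
    bulk(es, actions, chunk_size=500)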
Currently, with regard to the AWS S3 storage where we archive data, the running assumption is that anyone installing the project already has the S3 directory structure the scraper expects. This should not be the case. The scraper should check that the AWS S3 bucket exists and, if it does, that it has the expected structure; if not, the scraper should create the expected structure itself.
DISCLAIMER: The AWS S3 bucket must have been created beforehand; the structure is what the scraper should create.
This might bring about speed and other benefits.
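A sketch of that check with boto3; the bucket name and the folder layout are assumptions:

import boto3
from botocore.exceptions import ClientError

EXPECTED_FOLDERS = ["data/", "data/archive/"]  # illustrative layout

def ensure_bucket_structure(bucket="healthtools-ke"):
    s3 = boto3.client("s3")
    # The bucket itself must already exist (see the disclaimer above);
    # head_bucket raises ClientError if it is missing or inaccessible.
    s3.head_bucket(Bucket=bucket)
    for folder in EXPECTED_FOLDERS:
        try:
            s3.head_object(Bucket=bucket, Key=folder)
        except ClientError:
            # S3 has no real directories; an empty object whose key ends
            # in "/" gives the scraper the structure it expects.
            s3.put_object(Bucket=bucket, Key=folder)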
Presently we send a Slack notification when a scraping session takes more than 30 minutes. It would be nice if we could also send a notification of how long the scraping took.
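A sketch of the timing and notification, assuming a Slack incoming-webhook URL and a hypothetical run_scrapers() entry point:

import time
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_slack(text):
    # Slack incoming webhooks accept a simple JSON payload.
    requests.post(SLACK_WEBHOOK_URL, json={"text": text})

start = time.time()
run_scrapers()  # hypothetical entry point for the full scraping session
minutes = (time.time() - start) / 60
if minutes > 30:
    notify_slack("Warning: scraping session took %.1f minutes" % minutes)
else:
    notify_slack("Scraping session finished in %.1f minutes" % minutes)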
We should create the S3 folders if they don't exist, similar to how we do it for local file storage (see the bucket-structure sketch above).
This should probably be done in a Python module instead of in config.
NB: Include in tests as mentioned here