
codeforafrica-scrapers / healthtools_ke


[morph] Scrapers for the HealthTools Kenya data.

Home Page: https://morph.io/CodeForAfrica-SCRAPERS/healthtools_ke

License: MIT License

Python 100.00%
scraper elasticsearch doctor data-scraping slack-notifications health-facilities healthtools-api healthtools doctors dodgy-doctors

healthtools_ke's People

Contributors

andela-mabdussalam, andela-mmakinde, andela-ookoro, celelstine, davidlemayian, gathondu, ryansept, tinamurimi


healthtools_ke's Issues

Allow selection of what to scrape

We should allow selecting which scrapers to run by passing a --scraper argument with the desired "doc types", e.g. --scraper doctors:

  • Scrape all (default): python scraper.py
  • Single: python scraper.py --scraper doctors
  • Multiple: python scraper.py --scraper doctors,clinical_officers
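
The selection above could be parsed with argparse; a minimal sketch (the SCRAPERS registry and its callables are placeholders, not the project's actual modules):

```python
import argparse

# Hypothetical registry mapping doc types to scraper callables.
SCRAPERS = {
    "doctors": lambda: "scraped doctors",
    "clinical_officers": lambda: "scraped clinical officers",
    "health_facilities": lambda: "scraped health facilities",
}

def parse_selection(argv=None):
    """Return the list of doc types to scrape (all by default)."""
    parser = argparse.ArgumentParser(description="HealthTools KE scraper")
    parser.add_argument(
        "--scraper",
        help="comma-separated doc types, e.g. doctors,clinical_officers",
    )
    args = parser.parse_args(argv)
    if not args.scraper:
        return list(SCRAPERS)  # scrape all by default
    selected = args.scraper.split(",")
    unknown = [s for s in selected if s not in SCRAPERS]
    if unknown:
        parser.error("unknown scraper(s): %s" % ", ".join(unknown))
    return selected
```

Unknown doc types fail fast via parser.error, which also prints the usage string.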

Store stats of scraping

After scraping, we should store stats in S3 in a stats.json file that will be used for display on an HTML page. This should include:

  1. Record count for each data type scraped.
  2. Date of the last successful scrape for each data type, and of the last successful run as a whole.
  3. How long each scrape took, individually and for all scrapers combined.

For Debate: This info should also be pushed to Google Analytics at a later stage.
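
A rough sketch of building such a payload before uploading it (the field names, results shape, and bucket name are assumptions, not an agreed schema):

```python
def build_stats(results):
    """Build the stats.json payload from per-scraper results.

    `results` maps each doc type to a dict with `count`, `last_scraped`
    (ISO date string) and `duration_seconds` -- a hypothetical shape.
    """
    return {
        "scrapers": results,
        "last_successful_run": max(r["last_scraped"] for r in results.values()),
        "total_duration_seconds": sum(r["duration_seconds"] for r in results.values()),
    }

# Uploading it (assumes boto3 is configured; the bucket name is illustrative):
# import boto3, json
# boto3.client("s3").put_object(
#     Bucket="healthtools-ke",
#     Key="stats.json",
#     Body=json.dumps(build_stats(results)),
#     ContentType="application/json",
# )
```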

Store data in SQLite as suggested by Morph

Morph considers the scraper to have failed if no SQLite database is created:

Scraper didn't create an SQLite database in your current working directory called
data.sqlite. If you've just created your first scraper and not edited the code yet
this is to be expected.

Other than resolving that error, it would be nice to make the data we scrape available there too.

https://morph.io/documentation
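
One way to satisfy Morph, sketched with the stdlib sqlite3 module (Morph's documentation also offers the scraperwiki helper library for this; the table and column names here are illustrative):

```python
import sqlite3

def save_records(records, db_path="data.sqlite"):
    """Write scraped records to the data.sqlite file Morph expects
    in the current working directory."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS data (name TEXT, doc_type TEXT)")
    conn.executemany(
        "INSERT INTO data (name, doc_type) VALUES (?, ?)",
        [(r["name"], r["doc_type"]) for r in records],
    )
    conn.commit()
    conn.close()
```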

Improve debug logs

Currently we're outputting too many logs. We should only log successful scrapes as a whole (doctors, clinical officers, etc.) instead of per page.

The other logs would be failed scrapes, which should be reported as errors instead of normal print statements.
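
A sketch of the intended behaviour using the stdlib logging module (the logger name and the report helper are hypothetical, not existing project code):

```python
import logging

log = logging.getLogger("healthtools.scraper")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

def report(doc_type, count=0, error=None):
    """Emit one summary line per scraper instead of per-page prints.

    Successes go to INFO, failures to ERROR. Returns the message for testing.
    """
    if error:
        msg = "%s scrape failed: %s" % (doc_type, error)
        log.error(msg)
    else:
        msg = "%s scrape succeeded (%d records)" % (doc_type, count)
        log.info(msg)
    return msg
```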

Limit Elasticsearch upload size

Currently we upload all the data to Elasticsearch at once, but this fails when there is a lot of data. For example, in the case of Health Facilities we get the following error:

TransportError(413, u'{"Message":"Request size exceeded 10485760 bytes"}')
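
One way to stay under the limit is to batch the bulk body by serialized size before uploading. A hypothetical helper (elasticsearch-py's helpers.bulk exposes a similar max_chunk_bytes parameter, which may be the simpler fix):

```python
import json

def chunk_bulk_body(docs, max_bytes=10 * 1024 * 1024):
    """Split documents into batches whose serialized size stays under
    max_bytes (the error above reports a 10485760-byte cap)."""
    batches, batch, size = [], [], 0
    for doc in docs:
        doc_size = len(json.dumps(doc).encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            batches.append(batch)
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        batches.append(batch)
    return batches
```

Each batch can then be sent as a separate bulk request.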

Handling of AWS S3 Data Directories and Keys

Currently, the scraper assumes that anyone installing the project already has the AWS S3 directory structure it expects for archiving data. This should not be the case. The scraper should check that the AWS S3 bucket exists and, if it does, that it has the expected structure; if it doesn't, the scraper should create the structure itself.

DISCLAIMER: The AWS S3 bucket itself must have been created beforehand; the structure within it is what the scraper should create.

Create S3 folders if they don't exist

We should create the S3 folders if they don't exist, similar to how we do it for local file storage.

This should probably be done in a Python module instead of in config.

NB: Include in tests as mentioned here
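
Such a module could be sketched as follows (the expected prefixes and bucket name are assumptions; S3 has no real directories, so a "folder" is just a zero-byte object whose key ends in "/"):

```python
def missing_prefixes(existing_keys, expected=("data/", "archive/", "test/")):
    """Return the S3 'folder' prefixes not yet present in the bucket.

    The expected layout here is illustrative, not the project's actual one.
    """
    return [p for p in expected
            if not any(k.startswith(p) for k in existing_keys)]

# Creating them (assumes boto3 is configured; the bucket must already exist):
# import boto3
# s3 = boto3.client("s3")
# listing = s3.list_objects_v2(Bucket="healthtools-ke").get("Contents", [])
# for prefix in missing_prefixes([o["Key"] for o in listing]):
#     s3.put_object(Bucket="healthtools-ke", Key=prefix)
```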
