Comments (5)

chendaniely avatar chendaniely commented on July 25, 2024

wow this is pretty cool.

The last time I scraped tables from PDFs, I redirected the output of less to a file and went through it line by line with regex. It was not pretty... nor fun.
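For anyone curious what that line-by-line regex approach looks like, here is a minimal sketch. It assumes the PDF text has already been dumped to plain text and that each table row is a label followed by columns of integers separated by two or more spaces. The sample rows, the `ROW_RE` pattern, and the `parse_rows` name are all made up for illustration; real reports will need their own pattern.

```python
import re

# Hypothetical row shape: a text label, a gap of 2+ spaces, then integer columns.
ROW_RE = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z .'-]*?)\s{2,}"
    r"(?P<nums>\d[\d,]*(?:\s+\d[\d,]*)*)\s*$"
)

def parse_rows(text):
    """Return [label, int, int, ...] for every line that looks like a table row."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue  # skip headers, blank lines, page footers
        # Strip thousands separators before converting, e.g. "1,204" -> 1204.
        nums = [int(n.replace(",", "")) for n in m.group("nums").split()]
        rows.append([m.group("label").strip()] + nums)
    return rows

# Made-up sample of what dumped PDF text might look like.
SAMPLE = """\
County          Suspected   Probable   Confirmed
Example A       12          3          1,204
Example B       0           7          15

Page 1 of 2
"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

The fragile part is the pattern itself: any change in column spacing or a label containing digits breaks it, which is probably why it was not fun.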

from ebola.

donpdonp avatar donpdonp commented on July 25, 2024

How is new data currently getting from the PDFs into the _data/ CSV files? Is there any automation, and are those scripts in the repo?


donpdonp avatar donpdonp commented on July 25, 2024

I just tried Tabula after a tip-off from someone at CodeForPortland last night.
http://tabula.nerdpower.org/

The tables are selected manually with the mouse, but the data in them is read into CSV automatically. It seems to work fairly well. I'm not sure if that's any better than what's being done already.


srinivvenkat avatar srinivvenkat commented on July 25, 2024

I think these tools work fine as long as the text in the PDF is machine-readable. I don't think they would help with scanned pages (which, at least in the Guinea dataset, are the majority). I suspect most of those are transcribed manually. (@caitlin, please clarify)

I tried a crude crowdsourced initiative to do it: bit.ly/ebola_guinea. A few files were digitized, but for lack of incentives and publicity the 'crowd' didn't come. The key reasons for using a Google Doc are to lower the tech barrier (no GitHub account needed) and to allow partial contributions (fill in a few columns until you're bored).

If anybody has ideas on extending this idea, shoot!


cmrivers avatar cmrivers commented on July 25, 2024

I use a combination of Tabula, a few little scripts to reformat the data,
and some manual work. The broad overview is here:
http://www.caitlinrivers.com/blog/the-setup-tools-i-use-to-track-emerging-infectious-diseases.
I keep meaning to make a screencast of my workflow, but...no time.

Each Liberia file takes about 3-5 minutes from download to upload, and
Sierra Leone takes a bit longer.
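To give a rough idea of what a "little script to reformat the data" can look like, here is a minimal sketch. This is not the actual script used for this repo; it only assumes that the CSV exported from Tabula may contain stray whitespace, thousands separators, and empty rows. The `clean_tabula_csv` name and the sample data are hypothetical.

```python
import csv
import io

def clean_tabula_csv(raw_csv):
    """Tidy a CSV string: trim cells, drop empty rows, strip thousands separators."""
    cleaned = []
    for row in csv.reader(io.StringIO(raw_csv)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # drop rows that are entirely empty
        # Turn purely numeric cells like "1,204" into "1204"; leave text alone.
        cells = [
            c.replace(",", "") if c.replace(",", "").isdigit() else c
            for c in cells
        ]
        cleaned.append(cells)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()

# Made-up sample of a messy export.
RAW = 'Variable,Example A,Example B\n" Confirmed cases ","1,204",15\n,,\n'

if __name__ == "__main__":
    print(clean_tabula_csv(RAW), end="")
```

Anything beyond this kind of mechanical cleanup (renaming columns to match the repo's conventions, checking totals) would still be the manual part.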

(In reply to srinivvenkat's comment above, sent Wed, Oct 15, 2014 at 11:28 PM.)


Reply to this email directly or view it on GitHub
https://github.com/cmrivers/ebola/issues/29#issuecomment-59309406.[image:
Web Bug from
https://github.com/notifications/beacon/1302262__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcyOTA0OTI4OCwiZGF0YSI6eyJpZCI6NDQxNzY4NTl9fQ==--d66722f2a22b3ea35926f18e416190cd46d220be.gif]

from ebola.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.