enemkaywun / data-cleaning

This project forked from keep-the-receipts/data-extraction

This represents data extracted according to this format https://github.com/South-Africa-Government-Procurement/project-docs/wiki/Data-models-and-standards#abstract-records-of-amounts

Home Page: https://join.keepthereceipts.org.za/

data-cleaning's Introduction

Procurement Data Cleaning

Thanks for offering to help out! For more info, join http://join.keepthereceipts.org.za/ or the #keep-the-receipts Slack channel on https://zatech.co.za/

Basic steps to get started:

  • Download and install Tabula: https://tabula.technology/
  • Download a copy of the PDF from the GitHub issue that you will be processing.
  • Load the PDF into Tabula.
  • Highlight/select the tables in Tabula, export to CSV.
  • Open the CSV files in a spreadsheet app (Excel / Google Sheets / LibreOffice)
  • Examine the CSV, and make any adjustments:
    • DON'T fix any spelling mistakes or typos. The CSV should match the original document as closely as possible.
    • DON'T remove the headings for the table.
    • DO remove any totals rows - we are interested in individual line items, not totals.
    • DO remove any empty lines that aren't needed.
    • DO make sure that everything that is in one row on the PDF is one row on the CSV (more info here: keep-the-receipts#104 (comment))
  • Save the resulting CSV file, which you will use for creating the Pull Request. Use the same name as the source PDF file for the CSV (naturally replacing the .pdf extension with .csv).
  • Raise a PR using the GitHub UI.
    • Include a screenshot of the table in the PDF and of the CSV table in Excel/Calc/Google Sheets. That makes it a lot easier for us to spot issues quickly and discuss them. See the example in keep-the-receipts#127
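
The row clean-up rules above can also be sketched as a small script, for anyone who prefers that to a spreadsheet app. This is only an illustrative sketch, not part of the project's tooling: the `clean_rows` helper and the sample data are hypothetical, and it assumes a totals row can be recognised by its first non-empty cell starting with "total", which will not hold for every document (check the result against the PDF). Note that it leaves typos such as "Suplies" untouched, as the steps require.

```python
import csv
import io


def clean_rows(rows):
    """Drop empty rows and totals rows; keep the heading row and all text as-is."""
    header, *body = rows
    cleaned = [header]
    for row in body:
        # Skip rows where every cell is blank.
        if not any(cell.strip() for cell in row):
            continue
        # ASSUMPTION: a totals row has a first non-empty cell starting with "total".
        first = next(cell.strip() for cell in row if cell.strip())
        if first.lower().startswith("total"):
            continue
        cleaned.append(row)
    return cleaned


# Hypothetical sample data standing in for a Tabula export.
sample = (
    "Supplier,Item,Amount\n"
    "Acme Suplies,Masks,1000\n"
    ",,\n"
    "Beta Traders,Gloves,500\n"
    "Total,,1500\n"
)
rows = list(csv.reader(io.StringIO(sample)))
cleaned = clean_rows(rows)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
```

A supplier whose name happens to begin with "Total" would be dropped by this rule, so treat the output as a starting point for manual review rather than a finished file.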

If you're already familiar with Git, some extra tips:

F.A.Q.:

  1. Should we put different tables into different CSVs?

If the column headings are the same, they can be in one CSV. But if the tables relate to different departments or entities, add a column to specify who that part of the CSV relates to. If the column headings are different, they should be separate CSVs.
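
As a rough illustration of that rule, the sketch below merges tables with identical headings and adds a column naming the entity each row belongs to. The `combine_tables` helper, the "Entity" column name, and the sample departments are all hypothetical; the project does not prescribe a column name.

```python
def combine_tables(tables, entity_column="Entity"):
    """Merge tables that share the same headings, tagging each row with its entity.

    `tables` maps an entity name to that table's rows (heading row first).
    """
    combined = []
    expected_header = None
    for entity, rows in tables.items():
        header, *body = rows
        if expected_header is None:
            expected_header = header
            combined.append([entity_column] + header)
        elif header != expected_header:
            # Different headings belong in a separate CSV.
            raise ValueError(f"Headings for {entity!r} differ; use a separate CSV")
        combined.extend([entity] + row for row in body)
    return combined


# Hypothetical tables extracted from one PDF.
tables = {
    "Department A": [["Supplier", "Amount"], ["Acme", "1000"]],
    "Department B": [["Supplier", "Amount"], ["Beta", "500"]],
}
combined = combine_tables(tables)
for row in combined:
    print(row)
```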

  2. What should I do if there are merged cells that should be split?

Instructions for managing merged cells are here: keep-the-receipts#119 (comment)

  3. What should I do if Tabula splits cells that should be on a single row?

Instructions for this are here: keep-the-receipts#104 (comment)

  4. Should I do 'pass 2' of a file?

Preferably do pass 1 first, so that we can easily keep track and ensure that everything gets a first pass. Once pass 1 is done, you can do pass 2, unless it was you who did pass 1. The idea of two passes is to identify errors by looking at the differences between passes done by two different people.
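
To make the idea concrete, here is a minimal sketch of comparing two passes cell by cell. The `diff_passes` helper and the sample rows are hypothetical, and this is not necessarily how the project itself compares passes; it simply reports every position where the two CSVs disagree.

```python
import csv
import io
from itertools import zip_longest


def diff_passes(pass1_rows, pass2_rows):
    """Return (row_number, column_number, pass1_value, pass2_value) for each mismatch."""
    differences = []
    paired_rows = zip_longest(pass1_rows, pass2_rows, fillvalue=[])
    for r, (row1, row2) in enumerate(paired_rows, start=1):
        # A cell missing from one pass is compared as an empty string.
        paired_cells = zip_longest(row1, row2, fillvalue="")
        for c, (cell1, cell2) in enumerate(paired_cells, start=1):
            if cell1 != cell2:
                differences.append((r, c, cell1, cell2))
    return differences


# Hypothetical pass 1 and pass 2 of the same table.
pass1 = list(csv.reader(io.StringIO("Supplier,Amount\nAcme,1000\nBeta,500\n")))
pass2 = list(csv.reader(io.StringIO("Supplier,Amount\nAcme,1000\nBeta,5000\n")))
differences = diff_passes(pass1, pass2)
for row, col, left, right in differences:
    print(f"row {row}, column {col}: pass 1 has {left!r}, pass 2 has {right!r}")
```

Each reported difference still needs to be checked against the original PDF to see which pass is right.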

  5. Tabula gives me the following error for the PDF: "Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information."

For PDFs like this, we need to use Optical Character Recognition (OCR) software to convert the image-based PDF to a text-based one. This is easiest on Linux or macOS, and it involves downloading a bash script and running it on the PDF. If any of that sounds scary or foreign to you, feel free to reach out on the issue or in Slack, and someone else will be happy to convert the PDF for you in the meantime.

If you want to download the bash script and run it on the PDF yourself, here is what you need to do:

  1. Make sure you have tesseract 4 installed (brew install tesseract on macOS)
  2. Download the pdf-ocr.sh bash script from this gist: https://gist.githubusercontent.com/zoidbergwill/e48ddeab1552c868a4c140fd14c4aeb2/raw/bc44ee1a0a132d83b945fdacda479d52ac3dc1ed/pdf-ocr.sh
  3. Make it executable: chmod +x pdf-ocr.sh
  4. Run it on your scanned PDF, e.g. ./pdf-ocr.sh "HOME AFFAIRS - COVID SUMMARY.pdf"

Pull Request Review checks

You will need to open the original PDF file to do comparisons and spot-checks.

  • Does it capture all the data? (Or is there another file or pull request for other tables, e.g. when the tables have different column headings?)
  • Is the data in the correct column? Sometimes some rows are shifted and not aligned with the respective heading.
  • Is each "record" - one supplier, one buyer, one order amount - in one row? Sometimes Tabula splits multiline cells into multiple rows - these must be single rows (with multiple lines as in the PDF table) in the CSV.
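
Some of these checks can be partly automated before the manual comparison. The sketch below is an assumption-laden aid, not a replacement for opening the PDF: the `review_rows` helper and the sample data are hypothetical, and it guesses that a row with a different cell count than the heading may be shifted, and that a row with an empty first cell may be the tail of a record Tabula split in two.

```python
import csv
import io


def review_rows(rows):
    """Return (row_number, message) pairs for rows worth checking against the PDF."""
    header, *body = rows
    findings = []
    for number, row in enumerate(body, start=2):  # row 1 is the heading row
        if len(row) != len(header):
            findings.append((number, f"has {len(row)} cells, heading has {len(header)}"))
        # ASSUMPTION: a row with a blank first cell may be the tail of a split record.
        elif not row[0].strip():
            findings.append((number, "first cell is empty; possibly a split row"))
    return findings


# Hypothetical CSV with one split row and one row with an extra cell.
sample = (
    "Supplier,Buyer,Amount\n"
    "Acme,Health,1000\n"
    ",Department,\n"
    "Beta,Home Affairs,500,extra\n"
)
findings = review_rows(list(csv.reader(io.StringIO(sample))))
for number, message in findings:
    print(f"row {number}: {message}")
```

A clean result here does not show that the data is complete or in the correct columns; those checks still need the original PDF.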

data-cleaning's People

Contributors

schalkventer, jbothma, martink-rsa, capesean, schalkventer-demo, georgezee, jonathanaspeling, rmaclean, sacheen, zoidyzoidzoid, runningdeveloper, camgreen, lali-sed, narothamsai, theodowling
