kenya-power-pdf-extract's Introduction

Kenya Power Interruptions PDF Extract

Parses Kenya Power interruption data from their PDF files into JSON format.

๐Ÿง lemme tinker with the pdf file, see if I can parse the data

โ€” collins muriuki (@collinsmuriuki_) July 12, 2022

Steps

The first step is to extract the text content of the PDF file into a string. Luckily, the Rust crate pdf-extract handles this for us via its extract_text function. PS: holding all of this data in a single String is not the most memory-efficient approach, since memory usage grows with the size of the PDF text; that is an acceptable compromise for this short demo.

The next bit is where the "fun" begins: making something meaningful out of the junky text that we get back. First we filter out what I consider junk, i.e. text that doesn't hold any meaningful data. This is handled by the extract_text_from_pdf function.
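As a rough illustration of that filtering step, here is a minimal, dependency-free sketch. It is not the crate's actual code: the `strip_junk` name, the idea of matching "junk markers" line by line, and the marker strings in the usage note are all assumptions made for the example.

```rust
/// Drops blank lines and any line containing one of the given junk markers,
/// then joins the surviving lines with single spaces.
/// `strip_junk` and `junk_markers` are hypothetical names for this sketch.
fn strip_junk(raw: &str, junk_markers: &[&str]) -> String {
    raw.lines()
        .map(str::trim)
        // Blank lines carry no data.
        .filter(|line| !line.is_empty())
        // A line is junk if it contains any of the markers.
        .filter(|line| !junk_markers.iter().any(|m| line.contains(*m)))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `strip_junk(text, &["Notice of", "page "])` would drop header and footer lines containing those (made-up) markers and leave one flat string for the later steps to work on.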

The next step is to break the massive string down into smaller chunks, each containing the outage information for a single area. The approach is pretty simple: we split the huge string at "AREA:". See the FromStr implementation of OutagesList.
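The splitting idea can be sketched with nothing but the standard library. The `AreaChunks` type below is a hypothetical stand-in for OutagesList that only keeps the raw chunks; skipping the text before the first "AREA:" and trimming each chunk are assumptions, not necessarily what the crate does.

```rust
use std::convert::Infallible;
use std::str::FromStr;

/// Hypothetical stand-in for `OutagesList`: it only keeps the raw chunks.
#[derive(Debug, PartialEq)]
struct AreaChunks(Vec<String>);

impl FromStr for AreaChunks {
    // Splitting cannot fail, so this sketch uses an error type with no values.
    type Err = Infallible;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let chunks = s
            .split("AREA:")
            // Whatever precedes the first "AREA:" is preamble, not an outage.
            .skip(1)
            .map(|chunk| chunk.trim().to_string())
            .filter(|chunk| !chunk.is_empty())
            .collect();
        Ok(AreaChunks(chunks))
    }
}
```

Implementing FromStr is what makes `text.parse::<AreaChunks>()` available, mirroring the `parse::<OutagesList>()` call used in main.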

Now that we have a list of strings, we can work out how to handle a single string from the list. The main goal is to establish breakpoints in the remaining string. This is done with two regex objects, stored as lazy static variables:

  • DATE_RE - matches the date of the outage. With this match we can derive the date as well as the text that comes before it; at this point we have the region and the date.
  • TIME_RE - matches the time range in which the outage will occur as well as the affected areas, i.e. the text that comes after the date; at this point we have the time and the areas.
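To make the breakpoint idea concrete, here is a minimal sketch that swaps the two regexes for plain string searches so that it needs no dependencies. It is not the crate's implementation: the `Pieces` struct, the `split_chunk` function, the presence of literal `DATE:` and `TIME:` labels in the text, and the rule that the time range ends after its second "M." are all assumptions about the layout.

```rust
/// Pieces recovered from one chunk (a hypothetical stand-in for `OutagesItem`).
#[derive(Debug, PartialEq)]
struct Pieces {
    region: String,
    date: String,
    time: String,
    areas: Vec<String>,
}

/// Splits a chunk at two breakpoints. Returns `None` if a breakpoint is missing.
/// Assumes "<region> DATE: <date> TIME: <time range> <comma-separated areas>".
fn split_chunk(chunk: &str) -> Option<Pieces> {
    // Breakpoint 1: everything before the date label is the region.
    let (region, rest) = chunk.split_once("DATE:")?;
    // Breakpoint 2: the date sits between the two labels.
    let (date, rest) = rest.split_once("TIME:")?;
    // Assumed rule: the time range ends after its second "M." ("A.M." / "P.M.").
    let first = rest.find("M.")? + 2;
    let end = first + rest[first..].find("M.")? + 2;
    let (time, areas) = rest.split_at(end);
    Some(Pieces {
        region: region.trim().to_string(),
        date: date.trim().to_string(),
        time: time.trim().to_string(),
        areas: areas
            .split(',')
            .map(|a| a.trim().to_string())
            .filter(|a| !a.is_empty())
            .collect(),
    })
}
```

A regex is the more robust choice when the labels or spacing vary between PDFs; the string searches here only show how each match position divides the chunk into the text before it and the text after it.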

What is left is to put everything together by creating two structs, OutagesList and OutagesItem, with their respective FromStr trait implementations, so that we finally have this in our main function:

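use kenya_power_pdf_extract::{extract_text_from_pdf, OutagesList};

fn main() -> Result<(), anyhow::Error> {
    // The first CLI argument is the path to the PDF file;
    // return an error instead of panicking when it is missing.
    let path = std::env::args()
        .nth(1)
        .ok_or_else(|| anyhow::anyhow!("usage: cargo run <path-to-pdf>"))?;
    // Extract the text and strip the junk.
    let pdf_text = extract_text_from_pdf(&path)?;
    // Parse the cleaned text via the FromStr implementation.
    let outages_list = pdf_text.parse::<OutagesList>()?;
    println!("{:#?}", outages_list);
    Ok(())
}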

Output snippet:

OutagesList {
    data: [
        OutagesItem {
            region: "PART OF KILIMANI, MILIMANI",
            date: "Monday 18.07.2022",
            time: "9.00 A.M. – 5.00 P.M.",
            areas: [
                "Part  of  Jabavu  Rd",
                "Woodlands",
                "DoD  Headquarters",
                "Woodlands  Mosque",
                "Part  ofHurlingum S/Centre",
                "Jabavu Court",
                "Chinese Embassy",
                "Russian Embassy",
                "Sri LankaEmbassy",
                "Jakaya  Kikwete  Rd",
                "Delamere  Flats",
                "Sagret  Hotel",
                "Comfort  Hotel",
                "SwizzHotel",
                "Ralph Bunch Rd",
                "Integrity Centre",
                "Middle East Bank",
                "Heron Portico",
                "PITMAN,Telkom  Plaza",
                "Adak  House  Nairobi  Central  SDA",
                "Nairobi  Area  Police",
                "Medical  &Dentist Board",
                "Lenana Rd & adjacent customers.",
            ],
        },
//...

Local Development

Requires Rust and Cargo to be installed.

Once that's done, run:

cargo run ./files/kenya_power.pdf

Check the output folder for the captured stdout output of both the kenya_power_latest.pdf and kenya_power.pdf files in the files directory.

Caveats

  • Only tested with 4 PDF files from kplc.co.ke, so some edge cases might not be covered
  • Data is only grouped by AREA rather than REGION - this can be fixed, but I decided to keep things simple for now
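One possible shape for the REGION grouping mentioned above, as a hedged sketch only. It assumes each outage has already been tagged with its region, which the crate does not currently do; the `group_by_region` function and the `(region, area)` pair representation are hypothetical.

```rust
use std::collections::BTreeMap;

/// Groups `(region, area)` pairs by region, keeping areas in input order.
/// Hypothetical helper: the crate itself does not extract regions.
fn group_by_region(pairs: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut grouped: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (region, area) in pairs {
        grouped
            .entry(region.to_string())
            // Create the region's list on first sight, then append to it.
            .or_default()
            .push(area.to_string());
    }
    grouped
}
```

A BTreeMap keeps the regions sorted by name, which makes the printed output stable between runs.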

Authored by Collins Muriuki

This project is MIT licensed
