Parsing Kenya Power interruption data from their PDF files into JSON format
lemme tinker with the pdf file, see if I can parse the data
- collins muriuki (@collinsmuriuki_) July 12, 2022
The first step is to extract the text content of the PDF file into a string. Luckily, the Rust crate pdf-extract handles this for us via its extract_text function. PS: storing this data in a String is not the most memory-efficient approach, since memory usage grows with the size of the PDF text; we can accept this compromise for a short demo.
The next bit is where the "fun" begins: making something meaningful from the junky text that we get back. First, we filter out what I consider junk, i.e. text that doesn't hold any meaningful data. This is handled by the extract_text_from_pdf function.
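The exact junk rules live in the repository and are not reproduced here. As a rough sketch only, assuming "junk" means blank lines plus lines containing known header or footer phrases, the filtering could look like the following. The helper name strip_junk and the marker strings passed to it are hypothetical, not part of the crate:

```rust
/// Hypothetical helper (not the crate's code): drops blank lines and any line
/// containing one of the caller-supplied junk markers, then joins what is
/// left into a single space-separated string.
fn strip_junk(raw: &str, junk_markers: &[&str]) -> String {
    raw.lines()
        .map(str::trim)
        // Blank lines carry no data.
        .filter(|line| !line.is_empty())
        // Assumed definition of junk: the line contains a known marker phrase.
        .filter(|line| !junk_markers.iter().any(|marker| line.contains(marker)))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling strip_junk(text, &["some header phrase"]) returns the text with matching lines removed; which phrases count as junk depends on the PDFs at hand.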
The next step is to break the massive string down into smaller chunks, each containing the outage information for a given area. The approach is pretty simple: we split the huge string at "AREA:". See the FromStr implementation of OutagesList.
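To illustrate the splitting step, here is a minimal sketch of a FromStr implementation that splits at "AREA:". It uses a simplified stand-in type, RawOutagesList (a hypothetical name), that keeps each chunk as raw text instead of a parsed OutagesItem; the error type and the handling of text before the first marker are assumptions as well:

```rust
use std::str::FromStr;

/// Simplified stand-in for the crate's OutagesList: each entry is the raw
/// text chunk for one area rather than a parsed item.
#[derive(Debug, PartialEq)]
struct RawOutagesList {
    chunks: Vec<String>,
}

impl FromStr for RawOutagesList {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let chunks: Vec<String> = s
            .split("AREA:")
            // Assumption: text before the first "AREA:" belongs to no area.
            .skip(1)
            .map(|chunk| chunk.trim().to_string())
            .filter(|chunk| !chunk.is_empty())
            .collect();

        if chunks.is_empty() {
            Err("no \"AREA:\" marker found".to_string())
        } else {
            Ok(RawOutagesList { chunks })
        }
    }
}
```

Implementing FromStr is what lets the caller write text.parse::<RawOutagesList>(), the same shape as the parse call in the main function further down.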
Now that we have a list of strings, we can figure out how to handle a single string from the list. The main goal is to establish breakpoints in the remaining string. This was achieved through two regex objects, stored as lazy static variables:
- DATE_RE: matches the date of the outage. With this we can derive the date of the outage as well as the text that comes before the match; at this point we have the region and the date.
- TIME_RE: matches the time range during which the outage will occur, as well as the affected areas, i.e. the text that occurs after the date; at this point we have the time and the areas.
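The breakpoint idea can be sketched without the regexes. The version below deliberately swaps DATE_RE and TIME_RE for plain string searches so that it has no dependencies, and it assumes each chunk has the shape "REGION DATE: day dd.mm.yyyy TIME: from - to areas", with "DATE:" and "TIME:" labels and a time range whose two halves each end in "M." (as in "A.M." and "P.M."). The struct and function names are hypothetical; this is not the crate's implementation:

```rust
/// Hypothetical container for the four pieces pulled out of one chunk.
#[derive(Debug, PartialEq)]
struct OutageParts {
    region: String,
    date: String,
    time: String,
    areas: Vec<String>,
}

/// Splits one area chunk at three breakpoints. Returns None when the chunk
/// does not have the assumed layout.
fn split_chunk(chunk: &str) -> Option<OutageParts> {
    // Breakpoint 1: everything before "DATE:" is the region.
    let (region, rest) = chunk.split_once("DATE:")?;
    // Breakpoint 2: everything between "DATE:" and "TIME:" is the date.
    let (date, rest) = rest.split_once("TIME:")?;
    // Breakpoint 3: the time range ends right after the second "M.".
    let first_end = rest.find("M.")? + "M.".len();
    let second_end = first_end + rest[first_end..].find("M.")? + "M.".len();
    let (time, areas) = rest.split_at(second_end);

    Some(OutageParts {
        region: region.trim().to_string(),
        date: date.trim().to_string(),
        time: time.trim().to_string(),
        // Whatever follows the time range is the comma-separated area list.
        areas: areas
            .split(',')
            .map(|area| area.trim().to_string())
            .filter(|area| !area.is_empty())
            .collect(),
    })
}
```

A regex is the more robust choice for real PDFs, since it can also validate the date and time formats instead of trusting fixed labels.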
What is left is to put everything together by creating two structs, OutagesList and OutagesItem, with their respective FromStr trait implementations, so that we finally have this in our main function:
use kenya_power_pdf_extract::{extract_text_from_pdf, OutagesList};

fn main() -> Result<(), anyhow::Error> {
    let args = std::env::args().collect::<Vec<_>>();
    let pdf_text = extract_text_from_pdf(&args[1])?;
    let outages_list = pdf_text.parse::<OutagesList>()?;
    println!("{:#?}", outages_list);
    Ok(())
}
Output snippet:
OutagesList {
    data: [
        OutagesItem {
            region: "PART OF KILIMANI, MILIMANI",
            date: "Monday 18.07.2022",
            time: "9.00 A.M. – 5.00 P.M.",
            areas: [
                "Part of Jabavu Rd",
                "Woodlands",
                "DoD Headquarters",
                "Woodlands Mosque",
                "Part ofHurlingum S/Centre",
                "Jabavu Court",
                "Chinese Embassy",
                "Russian Embassy",
                "Sri LankaEmbassy",
                "Jakaya Kikwete Rd",
                "Delamere Flats",
                "Sagret Hotel",
                "Comfort Hotel",
                "SwizzHotel",
                "Ralph Bunch Rd",
                "Integrity Centre",
                "Middle East Bank",
                "Heron Portico",
                "PITMAN,Telkom Plaza",
                "Adak House Nairobi Central SDA",
                "Nairobi Area Police",
                "Medical &Dentist Board",
                "Lenana Rd & adjacent customers.",
            ],
        },
        //...
Requires a Rust and Cargo installation. Once that's done, run:
cargo run ./files/kenya_power.pdf
Check the output folder for the resulting stdout output for both the kenya_power_latest.pdf and kenya_power.pdf files in the files directory.
- Only tested with 4 PDF files derived from kplc.co.ke
- Some edge cases might not be covered
- Data is only grouped by AREA rather than REGION; this can be fixed, but I decided to keep things simple for now
Authored by Collins Muriuki
This project is MIT licensed