Coder Social home page Coder Social logo

vladimir-ionita / cetatenie-romana-stadiu-dosar Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 106 KB

๐Ÿ‡ฒ๐Ÿ‡ฉ๐Ÿ‡ท๐Ÿ‡ด A dossier web parser for http://oldcetatenie.just.ro

Python 99.46% Roff 0.54%
python webparser tesseract-ocr pdf2image pytesseract

cetatenie-romana-stadiu-dosar's Introduction

About

A dossier web parser for http://oldcetatenie.just.ro.

Motive

Based on the common history of Romania and the Republic of Moldova, the citizens of the Republic of Moldova can apply for, and obtain Romanian citizenship. Annually, circa 80.000-100.000 people apply for Romanian citizenship. Documents examination can take from half a year to 2-3 years and sometimes even more. People can check the status of their application, but the process is really tedious and error-prone. The orders are published on http://oldcetatenie.just.ro/index.php/ro/ordine/articol-11 as PDF files. A person must access this web address, and check every order for his dossier number.

This tool tends to make the process easier, by automating the whole process.

Installation

The project makes use of several libraries, some of which require some manual setup.

1. The pdf2image library

The pdf2image library requires poppler to be installed, according to its README file.

You can install it using homebrew: brew install poppler.

2. The pytesseract library

The pytesseract library requires tesseract to be installed, according to its README file.

You can install it using homebrew: brew install tesseract.

3. Install dependencies

To install the project dependencies, you can use pip and the requirements.txt file:

pip install -r requirements.txt

Usage

In the main.py search for the lookups. Set your dossier number according to the example and run the project.

Configurations

The pytesseract library

For tesseract to properly read the PDF images, I've configured it's page segmentation mode to 4 (single column of text of variable sizes).

You can read more here: https://ai-facets.org/tesseract-ocr-best-practices.

Storage Path

The default storage path for the tools is set to $HOME/dev/romanian-citizenship-status-files.

You can change it by changing the FILES_STORAGE_PATH constant.

Known Issues

Sometimes the PDF conversion to TXT results in badly formatted data. Unfortunately, I don't have much experience with tesseract and couldn't configure it better.

Solutions: Manually convert the malformatted files using https://pdftotext.com/ or other tools.

Notes

I've also included all the PDF and TXT files until November 2020.

The tool can download and convert them by itself, but it might take a very long time, as there are more than 2000 files until now. This will help a lot.

cetatenie-romana-stadiu-dosar's People

Contributors

vladimir-ionita avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.