If you are using macOS, first install the system dependencies:
brew install poppler tesseract
- Create and activate a Python virtual environment.
- Run
pip install -r requirements.txt
to install dependencies.
- Run
download.sh
to get the minimal set of files required to run inference.
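Before running inference, it can help to confirm that the setup steps above took effect. The following is a minimal sketch and not part of the repository. It assumes that the poppler package provides the pdftoppm executable and that the tesseract package provides the tesseract executable on PATH. The helper names missing_tools and in_virtualenv are hypothetical.

```python
import shutil
import sys

# Assumption: these are the executable names installed by the system packages
# (pdftoppm ships with poppler; tesseract ships with tesseract).
REQUIRED_TOOLS = ["pdftoppm", "tesseract"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    if not in_virtualenv():
        print("Warning: no virtual environment appears to be active.")
    if not missing and in_virtualenv():
        print("Environment looks ready.")
```

The check only looks for executables and an active environment; it does not verify that the Python dependencies or the downloaded inference files are present.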
Run the pipeline on a single PDF document
python -m table_extractor.run run-sequentially <path-to-pdf> <results-output-dir> --verbose <true/false> --paddle_on <true/false>
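To process several PDFs, the command above can be wrapped in a short script. This is a sketch under stated assumptions, not a feature of the repository: it assumes the two flags accept the literal strings true and false, and it writes each document's results into its own subfolder, which is a layout chosen here for illustration. The names build_command and run_on_directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_dir, verbose=False, paddle_on=False):
    """Build the argument list for one run-sequentially invocation."""
    return [
        # sys.executable keeps the call inside the active virtual environment.
        sys.executable, "-m", "table_extractor.run", "run-sequentially",
        str(pdf_path), str(output_dir),
        # Assumption: the CLI expects lowercase "true"/"false" values.
        "--verbose", str(verbose).lower(),
        "--paddle_on", str(paddle_on).lower(),
    ]


def run_on_directory(pdf_dir, results_root, **flags):
    """Run the pipeline once per PDF, each into its own results subfolder."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        output_dir = Path(results_root) / pdf_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if the pipeline exits non-zero.
        subprocess.run(build_command(pdf_path, output_dir, **flags), check=True)
```

For example, run_on_directory("pdfs", "results", verbose=True) would invoke the pipeline for every PDF found in a pdfs folder and stop at the first failing document.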
The results folder will have the following structure:
Run the pipeline on an Excel document
python -m table_extractor.excel_run <path-to-excel> <output-path>
Note: for Excel extraction to work correctly, set File -> Export as PDF -> Structure -> Whole Sheet Export.