alephdata / pdflib
Binary Python bindings for poppler utils for content extraction
Maybe we should have a check to prevent this.
Hi, is it possible to install this on a Mac? I'm getting errors when I try to pip install pdflib.
Collecting pdflib
Using cached pdflib-0.1.2.tar.gz (49 kB)
ERROR: Command errored out with exit status 1:
command: /usr/local/anaconda3/envs/chapter-extraction/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py'"'"'; __file__='"'"'/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-pip-egg-info-a9e9_nrp
cwd: /private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/
Complete output (11 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py", line 54, in <module>
ext_modules=cythonize([poppler_ext]),
File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 965, in cythonize
module_list, module_metadata = create_extension_list(
File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 815, in create_extension_list
for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 114, in nonempty
raise ValueError(error_msg)
ValueError: 'pdflib/poppler.pyx' doesn't match any files
I have poppler installed and $POPPLER_ROOT exported.
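The ValueError above says Cython could not find pdflib/poppler.pyx in the unpacked source tree. One possible cause is that the source archive was built without the .pyx file. The following is a minimal sketch for checking that, assuming you have a local copy of the downloaded archive. The helper name sdist_has_member is hypothetical and is not part of pdflib or pip.

```python
import tarfile


def sdist_has_member(sdist_path: str, member_suffix: str) -> bool:
    """Return True if any entry in the archive ends with member_suffix.

    Hypothetical helper for illustration; "r:*" lets tarfile detect the
    compression (gz, bz2, xz) on its own.
    """
    with tarfile.open(sdist_path, "r:*") as archive:
        return any(name.endswith(member_suffix) for name in archive.getnames())
```

Calling sdist_has_member with the path to pdflib-0.1.2.tar.gz and "pdflib/poppler.pyx" would show whether the file Cython is looking for was shipped. If it returns False, the problem is in the published archive rather than in the local Poppler setup.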
To make tesseract happy.
I've found that text extracted from PDFs in right-to-left languages comes out backwards (reversed). I can reproduce the problem by calling the pdflib package directly, but Poppler's pdftotext utility produces correct output. It looks like Aleph/pdflib takes a different approach to extracting text page by page than Poppler does: compare https://github.com/alephdata/aleph/blob/master/services/ingest-file/ingestors/support/pdf.py#L13 with https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/utils/pdftotext.cc#L400 (in Poppler, the HTML and plain-text output are handled together in one function, so you have to read through both).
https://github.com/alephdata/pdflib/blob/master/pdflib/poppler.pyx#L379-L388 appears to read the text right-to-left and then write it out left-to-right.
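To illustrate the symptom, here is a small sketch of what happens when characters are emitted in visual (on-page, left-to-right) order instead of logical (reading) order. It is not pdflib's or Poppler's code. The function name visual_to_logical is hypothetical, and the fix is deliberately naive: it only handles lines with no left-to-right letters, and it would misplace digits and combining marks. Correct handling of mixed text needs the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def visual_to_logical(line: str) -> str:
    """Naively restore reading order for a purely right-to-left line.

    Hypothetical helper for illustration only. If the line contains strong
    RTL characters (bidi class "R" or "AL") and no strong LTR characters
    (class "L"), reverse it; otherwise return it unchanged.
    """
    classes = [unicodedata.bidirectional(ch) for ch in line]
    has_rtl = any(c in ("R", "AL") for c in classes)
    has_ltr = any(c == "L" for c in classes)
    if has_rtl and not has_ltr:
        return line[::-1]
    return line


# Hebrew "shalom" in logical order, written with escapes to avoid
# editor-side bidi reordering.
logical = "\u05e9\u05dc\u05d5\u05dd"
# What a left-to-right sweep over the rendered glyphs would produce.
visual = logical[::-1]
```

Here visual_to_logical(visual) returns the original logical string, while a left-to-right line such as "hello" passes through unchanged.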