Comments (29)
This problem should soon be resolved by the way, as I have moved archive-hocr-tools away from lxml entirely. So it simply won't be required anymore. See internetarchive/archive-hocr-tools#5
With the next release of archive-pdf-tools, I'll require that specific version of archive-hocr-tools (or higher), and then we can close this issue.
from archive-pdf-tools.
A few things:
- I don't think you need to get leptonica, openjpeg, libxml, libxslt, jbig2enc for basic functionality (Pillow can compress JPEG2000 and the wheel comes with it I guess, leptonica is only for jbig2, libxml/libxslt I think will just come with pip for python)
- The pip3 install ... line seems to attempt to build archive-pdf-tools from source rather than just install the binary wheel. I think that might be because you're on arm64; I don't know if we already build a binary for that.
For completeness' sake, can you share your OS and Python version with me? (It seems to be Python 3.11, but I would like to check.)
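One way to gather those details in a single step is a small script like the one below. This is only a sketch using the standard library; the function name describe_environment is made up for illustration and is not part of archive-pdf-tools.

```python
import platform


def describe_environment() -> dict:
    """Collect the details that determine which binary wheel pip can use.

    Hypothetical helper for this thread, not part of archive-pdf-tools.
    """
    return {
        "python": platform.python_version(),            # e.g. "3.11.2"
        "implementation": platform.python_implementation(),  # e.g. "CPython"
        "machine": platform.machine(),                   # e.g. "arm64" or "x86_64"
        "system": platform.system(),                     # "Darwin" on macOS
        "macos": platform.mac_ver()[0],                  # empty string when not on macOS
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue should be enough to tell which wheel, if any, matches the machine.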
Also, searching for cython #include "longintrepr.h" clang on Google seems to suggest this is an error that happens for many Python packages on macOS/clang, so there might be some hints there. Let me see what I can get done in the next few days.
It looks like just upgrading to a newer Cython version will solve the problem, but I will still need to see if I can make the CI build these releases.
I will see if I can take care of it on my end with upgrades etc, I'll try a bit harder, and get back to you. Thanks for the attention!
I may have just gotten this to install on a different Mac that has the latest OS -- it turns out my Mac that I was having trouble on is still on MacOS 12 instead of 13.
I'm not sure if other things that are significant may differ between them too. I need to spend more time with it. But it may be a common python issue and not something special to this code, indeed, I'm not really sure.
I made a test branch for arm wheels for mac. Can you download the artifact.zip from here and try it?
https://github.com/internetarchive/archive-pdf-tools/actions/runs/4602441350
$ ls | grep arm | grep mac
archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl
archive_pdf_tools-1.5.3-cp38-cp38-macosx_11_0_arm64.whl
archive_pdf_tools-1.5.3-cp39-cp39-macosx_11_0_arm64.whl
I can also build for other mac os x versions if 11 is not the right one. I was building for macOS-10.15 before.
https://github.com/internetarchive/archive-pdf-tools/actions/runs/4608630138/jobs/8144663900
This build contains wheels for macOS 10.15 and 12 as well.
(Doesn't look like the macos 12 wheels actually made it)
Thank you! I'm sorry if I'm sending you on a distraction here, this may not be a priority. Some things:
- MacOS 13 is the latest MacOS
- I have two laptops, one with MacOS 12 and one with MacOS 13
- pip install archive-pdf-tools on my MacOS 13 laptop did appear to succeed, it turns out. When I filed this ticket I hadn't tried that yet, and didn't realize my main laptop wasn't on the latest MacOS 13.
- pip install archive-pdf-tools on my MacOS 12 laptop did not succeed, as above. (This is not the newest OS.)
- I am not very experienced at python, I may not have set up either/both of these machines properly for python, or there may be other differences between them than just MacOS version
Since I have demonstrated it installing successfully on one MacOS laptop, I'm inclined to think the problem might be mine, not yours. Although there may be things you can do to make it install more reliably, I'm no expert here.
Yeah, I am not able to find the artifact.zip on that Github Actions build page -- I'm not sure I'd know what to do with it even if I did. Not very python-comfortable here. If you'd like me to test a build artifact, and it's not totally obvious how, please provide instructions -- but I'm wondering if this is actually my problem, not yours.
I need to find more time to update my laptop to MacOS 13 (it's not old on purpose), and re-install dependencies etc, and maybe make sure I am setting up python in a best-practices way on MacOS, and see what happens.
It looks like the zip from the action is not visible for others, please find it attached in this message.
Other than that, there are a few things to mention:
- You can install the wheel files in the zip like this: pip install --force-reinstall -U /path/to/wheel
- pip install pkgnamehere will typically try to fetch an online binary package based on your OS and architecture, and if no binary package exists, it will fall back to building from source. Before you opened this issue, I was not building any arm64 macOS wheels (aka Apple Silicon), and I still haven't uploaded these wheels to the place that pip gets them from.
- If you can verify that these wheels work on macOS, then I can upload them to PyPI (where pip gets them from).
The problem you were encountering before was definitely caused by an older Cython version, which I have raised in a separate branch where I am trying to build these wheels. When I know that the wheels work, I can merge those changes to the master branch, and include them in a new release.
For testing various versions, you could also consider setting up a virtualenv: https://docs.python.org/3/library/venv.html
It might be easier.
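If it helps, a throwaway environment can also be created from Python itself using the standard library venv module described at that link. This is a minimal sketch; the helper name create_test_env and the directory name are placeholders chosen for illustration.

```python
import venv
from pathlib import Path


def create_test_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a fresh virtual environment at env_dir and return its path.

    Hypothetical helper; clear=True wipes any existing environment at that
    path, so each test starts from a clean slate.
    """
    builder = venv.EnvBuilder(with_pip=with_pip, clear=True)
    builder.create(env_dir)
    return Path(env_dir)


if __name__ == "__main__":
    env = create_test_env("env-test")  # placeholder directory name
    print(f"Created {env}; activate it with: source {env}/bin/activate")
```

The environment uses whichever Python interpreter runs the script, so running it with python3.10 or python3.11 gives one environment per version to test against.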
Hi! Trying to spend more time on this to give you feedback!
OK, I am now using venv.
I'm sorry I'm new to python, so not totally sure how to test what you'd like me to test. Thank you for your tips earlier.
You can install the wheel files in the zip like this: pip install --force-reinstall -U /path/to/wheel
I have unzipped artifact.zip... I get a bunch of .whl files. I am supposed to manually identify which one is appropriate to my system?
I have an M1 Pro MacBook, I am running MacOS 12.6.3 (note this is still not the latest MacOS, the latest is MacOS 13).
If I understand the conventional naming right, it looks like you have wheels in the artifact.zip for macosx_10_9 and macosx_11_0 -- if those numbers are version numbers, neither of those is me, but maybe I'll try the most recent one, so _11_0?
I believe my M1 Pro MacBook is arm64 rather than x86_64. I still see three candidates and I don't know how to choose among them. What's the difference between cp39, cp38, and cp310?
- archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl
- archive_pdf_tools-1.5.3-cp38-cp38-macosx_11_0_arm64.whl
- archive_pdf_tools-1.5.3-cp39-cp39-macosx_11_0_arm64.whl
On the theory that bigger is better, maybe I'll try 310. So in an activated venv:
pip install --force-reinstall -U wheel_artifact/archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl
ERROR: archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl is not a supported wheel on this platform.
OK, not that one. Try the other two? Nope, same result.
Maybe the problem is that I'm on MacOS 12? Sorry, I'm really flying by touch here; I don't know what I'm doing. If you want me to choose a different .whl file, just let me know which one and I can try it!
I'm still not convinced there is necessarily anything wrong you had to fix, the problem might have been my system from the start?
Now that I am more intentional about exactly what version of python I am using (3.11.2) and I'm using a venv, let me try an official release again:
pip install archive-pdf-tools
Hm, alas that one still failed, on I think the same error, #include "longintrepr.h".
I did get the install to work on my personal MacBook though -- which was on MacOS 13. I wonder whether it would just work if I upgraded this laptop to MacOS 13. (Sorry, I will not be downgrading to MacOS 11 or 10!) Or maybe there's something else that differs between this laptop and my personal one, where I think pip install archive-pdf-tools worked. I'm sorry, I don't have time this week to try every possible combination of everything (or to upgrade my laptop this week), but I can try a few more things if you'd like!
Can you tell me the Python version you are using on MacOS 12? python --version will tell you. The cp3x part of the wheel name corresponds to the CPython version.
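To make the naming concrete, here is a rough sketch that splits a wheel filename into its parts and compares the Python tag with the running interpreter. The layout (name, version, optional build tag, Python tag, ABI tag, platform tag) follows the standard wheel filename convention; the helper names are made up for illustration, and the interpreter check only makes sense on CPython.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str    # e.g. "cp310" = CPython 3.10
    abi_tag: str       # e.g. "cp310"
    platform_tag: str  # e.g. "macosx_11_0_arm64"


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def running_cpython_tag() -> str:
    """Return the cpXY tag of the running interpreter (assumes CPython)."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


if __name__ == "__main__":
    wheel = parse_wheel_filename(
        "archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl"
    )
    print(wheel)
    print("matches this interpreter:", wheel.python_tag == running_cpython_tag())
```

So a cp310 wheel is only accepted by Python 3.10, which is why none of the cp38, cp39, or cp310 wheels install under Python 3.11.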
Ah, sorry, I just saw that you told me what version you are using. I don't think I build wheels for 3.11 yet, let me see if I can do that.
Thanks! This is all just me trying out demos, please know that whatever python version I am using is just what I happen to be using right now to try things out, it's not a commitment to using it forever or what have you!
That was just me installing the "latest" python because I had to pick one and it seemed like a good idea?
If you really need to build a wheel for every possible version of python (combined with OS etc!), that seems pretty untenable!
I can also go back to python 3.10 for the purpose of testing if it's easier for you! I don't totally understand what we are testing! I didn't pick 3.11 with intention, I just installed the latest version thinking that was the thing to do!
Yeah, it gets a little tedious, but 3.7 - 3.11 is not too bad. I'm almost ready to get a build for 3.11, but unfortunately a bug in lxml has had me pin specific versions of lxml for archive-hocr-tools, and these are not available for 3.11, so I need to figure out how to make this work.
If you could give it a try on 3.10, if that is not too much work, that would be great. You would use archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl with that.
(btw, I am running on the assumption that macosx_11_0 would work on 11.0+)
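That assumption can be written down as a small check: the platform tag names a minimum macOS version and an architecture, and the wheel should be usable on that version or anything newer with the same architecture. The function below is a simplified sketch with a hypothetical name; pip's real matching covers more cases (universal2 wheels, for example) that are ignored here.

```python
import re


def macos_tag_matches(platform_tag: str, os_version: str, machine: str) -> bool:
    """Return True if a macosx_<major>_<minor>_<arch> tag fits this machine.

    Simplified illustration only: exact architecture match, and the OS
    version must be at least the minimum named in the tag.
    """
    match = re.fullmatch(r"macosx_(\d+)_(\d+)_(\w+)", platform_tag)
    if match is None:
        return False
    tag_version = (int(match.group(1)), int(match.group(2)))
    if match.group(3) != machine:
        return False
    pieces = os_version.split(".")
    os_major = int(pieces[0])
    os_minor = int(pieces[1]) if len(pieces) > 1 else 0
    return (os_major, os_minor) >= tag_version


if __name__ == "__main__":
    # macOS 12.6.3 on Apple Silicon against the wheel from the test build.
    print(macos_tag_matches("macosx_11_0_arm64", "12.6.3", "arm64"))
```

Under this reading, MacOS 12.6.3 on an M1 satisfies macosx_11_0_arm64, so the earlier "not a supported wheel" error points at the Python version tag rather than the OS version.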
OK, thanks!
I had to start over in a new directory with a new venv, cause I didn't know how else to do it (not sure if there is another way to do it!)
Then... it seems to have installed!
It did warn:
DEPRECATION: lxml is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559
It installed enough to run recode_pdf --version and get 1.5.3 anyway! (Took almost 3 seconds for it to be able to print the version number, I guess it just had to load a lot of code first, and this is expected).
MacOS 12.6.3, Python 3.10.10, Apple M1 Pro chip, archive_pdf_tools-1.5.3-cp310-cp310-macosx_11_0_arm64.whl
Great news, thanks for testing the MacOS ARM64 version on Python 3.10.
I didn't raise the version further, so getting 1.5.3 makes sense for this test. I have tried to build a version for Python 3.11 here, but it doesn't depend on archive-hocr-tools, so you will have to install that with pip manually, if you'd be up for another test.
(From the above archive, you'd need archive_pdf_tools-1.5.3-cp311-cp311-macosx_11_0_arm64.whl)
Okay! In a venv using python 3.11.2. Still on an M1 Pro MacBook running MacOS 12.6.3.
pip install archive-hocr-tools
pip install --force-reinstall -U wheel_artifacts/archive_pdf_tools-1.5.3-cp311-cp311-macosx_11_0_arm64.whl
I'm afraid that did not install.
console output
clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -DCYTHON_CLINE_IN_TRACEBACK=0 -Isrc -Isrc/lxml/includes -I/Users/jrochkind/code/archive-pdf-tools-311/env/include -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/lxml/etree.c -o build/temp.macosx-12-arm64-cpython-311/src/lxml/etree.o -w -flat_namespace
src/lxml/etree.c:261877:23: error: no member named 'exc_type' in 'struct _err_stackitem'
while ((exc_info->exc_type == NULL || exc_info->exc_type == Py_None) &&
~~~~~~~~ ^
src/lxml/etree.c:261877:53: error: no member named 'exc_type' in 'struct _err_stackitem'
while ((exc_info->exc_type == NULL || exc_info->exc_type == Py_None) &&
~~~~~~~~ ^
src/lxml/etree.c:261891:23: error: no member named 'exc_type' in 'struct _err_stackitem'
*type = exc_info->exc_type;
~~~~~~~~ ^
src/lxml/etree.c:261893:21: error: no member named 'exc_traceback' in 'struct _err_stackitem'
*tb = exc_info->exc_traceback;
~~~~~~~~ ^
src/lxml/etree.c:261907:26: error: no member named 'exc_type' in 'struct _err_stackitem'
tmp_type = exc_info->exc_type;
~~~~~~~~ ^
src/lxml/etree.c:261909:24: error: no member named 'exc_traceback' in 'struct _err_stackitem'
tmp_tb = exc_info->exc_traceback;
~~~~~~~~ ^
src/lxml/etree.c:261910:15: error: no member named 'exc_type' in 'struct _err_stackitem'
exc_info->exc_type = type;
~~~~~~~~ ^
src/lxml/etree.c:261912:15: error: no member named 'exc_traceback' in 'struct _err_stackitem'
exc_info->exc_traceback = tb;
~~~~~~~~ ^
src/lxml/etree.c:261994:30: error: no member named 'exc_type' in 'struct _err_stackitem'
tmp_type = exc_info->exc_type;
~~~~~~~~ ^
src/lxml/etree.c:261996:28: error: no member named 'exc_traceback' in 'struct _err_stackitem'
tmp_tb = exc_info->exc_traceback;
~~~~~~~~ ^
src/lxml/etree.c:261997:19: error: no member named 'exc_type' in 'struct _err_stackitem'
exc_info->exc_type = local_type;
~~~~~~~~ ^
src/lxml/etree.c:261999:19: error: no member named 'exc_traceback' in 'struct _err_stackitem'
exc_info->exc_traceback = local_tb;
~~~~~~~~ ^
src/lxml/etree.c:262185:26: error: no member named 'exc_type' in 'struct _err_stackitem'
tmp_type = exc_info->exc_type;
~~~~~~~~ ^
src/lxml/etree.c:262187:24: error: no member named 'exc_traceback' in 'struct _err_stackitem'
tmp_tb = exc_info->exc_traceback;
~~~~~~~~ ^
src/lxml/etree.c:262188:15: error: no member named 'exc_type' in 'struct _err_stackitem'
exc_info->exc_type = *type;
~~~~~~~~ ^
src/lxml/etree.c:262190:15: error: no member named 'exc_traceback' in 'struct _err_stackitem'
exc_info->exc_traceback = *tb;
~~~~~~~~ ^
src/lxml/etree.c:264391:20: error: no member named 'exc_type' in 'struct _err_stackitem'
t = exc_state->exc_type;
~~~~~~~~~ ^
src/lxml/etree.c:264393:21: error: no member named 'exc_traceback' in 'struct _err_stackitem'
tb = exc_state->exc_traceback;
~~~~~~~~~ ^
src/lxml/etree.c:264394:16: error: no member named 'exc_type' in 'struct _err_stackitem'
exc_state->exc_type = NULL;
~~~~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
Compile failed: command '/usr/bin/clang' failed with exit code 1
creating var
creating var/folders
creating var/folders/_1
creating var/folders/_1/89lqv5550mx2tggl22z27_p18516fz
creating var/folders/_1/89lqv5550mx2tggl22z27_p18516fz/T
cc -I/usr/include/libxml2 -c /var/folders/_1/89lqv5550mx2tggl22z27_p18516fz/T/xmlXPathInit6_cfwx5a.c -o var/folders/_1/89lqv5550mx2tggl22z27_p18516fz/T/xmlXPathInit6_cfwx5a.o
cc var/folders/_1/89lqv5550mx2tggl22z27_p18516fz/T/xmlXPathInit6_cfwx5a.o -lxml2 -o a.out
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> lxml
(I think what I'm learning is it's best not to use the very latest python release maybe?)
Thanks for testing. I will get Python 3.11.x installed on my laptop and see if with the latest lxml the grave bugs I was seeing are gone. If that is the case, then we increase the requirement for hocr tools and then we should be all set for 3.11.x.
Support wise, it's probably also a matter of this project not having that many users on Python 3.11. :)
I just checked, with lxml 4.9.2 the bug is still there: https://bugs.launchpad.net/lxml/+bug/1970741 - I'll see what I can do, but meanwhile, yeah, probably better to use 3.10.
Hm, bug reported to lxml a year ago, there doesn't seem to be anyone in a hurry to fix it.
The bug on lxml doesn't mention python 3.11 specifically... is the issue that in order to use lxml on python 3.11, you need to use a newer version of lxml that exhibits the bug, while on 3.10 you can use an older version of lxml that does not?
This is a bit irritating indeed!
That's right, there does not seem to be an lxml 4.6.5 build for Python 3.11, and all the newer releases are currently broken.
The latest 1.4.x branch and master now ought to work with Python 3.11 as well. Please give it a go if you can.