Comments (2)
You might like to try using the Windows Subsystem for Linux, @knana1662.
from textract.
Hi, I am also facing the same issue. Below is my code snippet using textract:
import textract

doc = textract.process(f"Attention is All You Need.pdf")
doc
Then, it shows this error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\utils.py:87, in ShellParser.run(self, args)
86 try:
---> 87 pipe = subprocess.Popen(
88 args,
89 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
90 )
91 except OSError as e:
File ~\AppData\Local\Programs\Python\Python310\lib\subprocess.py:971, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
968 self.stderr = io.TextIOWrapper(self.stderr,
969 encoding=encoding, errors=errors)
--> 971 self._execute_child(args, executable, preexec_fn, close_fds,
972 pass_fds, cwd, env,
973 startupinfo, creationflags, shell,
974 p2cread, p2cwrite,
975 c2pread, c2pwrite,
976 errread, errwrite,
977 restore_signals,
978 gid, gids, uid, umask,
979 start_new_session)
980 except:
981 # Cleanup if the child failed starting.
File ~\AppData\Local\Programs\Python\Python310\lib\subprocess.py:1440, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_gid, unused_gids, unused_uid, unused_umask, unused_start_new_session)
1439 try:
-> 1440 hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
1441 # no special security
1442 None, None,
1443 int(not close_fds),
1444 creationflags,
1445 env,
1446 cwd,
1447 startupinfo)
1448 finally:
1449 # Child is launched. Close the parent's copy of those pipe
1450 # handles that only the child should have open. You need
(...)
1453 # pipe will not close when the child process exits and the
1454 # ReadFile will hang.
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
ShellError Traceback (most recent call last)
Cell In[9], line 1
----> 1 doc = textract.process(f"Attention is All You Need.pdf")
2 doc
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\__init__.py:79, in process(filename, input_encoding, output_encoding, extension, **kwargs)
76 # do the extraction
78 parser = filetype_module.Parser()
---> 79 return parser.process(filename, input_encoding, output_encoding, **kwargs)
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\utils.py:46, in BaseParser.process(self, filename, input_encoding, output_encoding, **kwargs)
36 """Process ``filename`` and encode byte-string with ``encoding``. This
37 method is called by :func:`textract.parsers.process` and wraps
38 the :meth:`.BaseParser.extract` method in `a delicious unicode
39 sandwich `_.
40
41 """
42 # make a "unicode sandwich" to handle dealing with unknown
43 # input byte strings and converting them to a predictable
44 # output encoding
45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
47 unicode_string = self.decode(byte_string, input_encoding)
48 return self.encode(unicode_string, output_encoding)
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\pdf_parser.py:29, in Parser.extract(self, filename, method, **kwargs)
27 return self.extract_pdfminer(filename, **kwargs)
28 else:
---> 29 raise ex
31 elif method == 'pdfminer':
32 return self.extract_pdfminer(filename, **kwargs)
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\pdf_parser.py:21, in Parser.extract(self, filename, method, **kwargs)
19 if method == '' or method == 'pdftotext':
20 try:
---> 21 return self.extract_pdftotext(filename, **kwargs)
22 except ShellError as ex:
23 # If pdftotext isn't installed and the pdftotext method
24 # wasn't specified, then gracefully fallback to using
25 # pdfminer instead.
26 if method == '' and ex.is_not_installed():
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\pdf_parser.py:44, in Parser.extract_pdftotext(self, filename, **kwargs)
42 else:
43 args = ['pdftotext', filename, '-']
---> 44 stdout, _ = self.run(args)
45 return stdout
File c:\Users\ILLEGEAR\OneDrive\Desktop\PDF Chatbot\pdfcb_env\lib\site-packages\textract\parsers\utils.py:95, in ShellParser.run(self, args)
91 except OSError as e:
92 if e.errno == errno.ENOENT:
93 # File not found.
94 # This is equivalent to getting exitcode 127 from sh
---> 95 raise exceptions.ShellError(
96 ' '.join(args), 127, '', '',
97 )
98 else: raise #Reraise the last exception unmodified
100 # pipe.wait() ends up hanging on large files. using
101 # pipe.communicate appears to avoid this issue
ShellError: The command `pdftotext Attention is All You Need.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
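Reading the traceback, textract tries to launch an external `pdftotext` executable, and `[WinError 2]` suggests Windows could not find that program on `PATH` (the PDF file itself is not what is missing). A minimal sketch of a pre-flight check follows. The helper names `find_pdftotext` and `extract_pdf` are hypothetical and not part of textract; only `shutil.which` and `textract.process` are real calls.

```python
import shutil


def find_pdftotext(which=shutil.which):
    """Return the full path of the pdftotext executable, or None if it is not on PATH."""
    return which("pdftotext")


def extract_pdf(path):
    """Call textract on a PDF, failing early with a readable message (hypothetical helper)."""
    if find_pdftotext() is None:
        raise RuntimeError(
            "pdftotext was not found on PATH; install it (it ships with "
            "Poppler and Xpdf) or add its folder to PATH before calling textract."
        )
    # Imported here so the PATH check above works even without textract installed.
    import textract

    return textract.process(path)


if __name__ == "__main__":
    # Prints the resolved executable path, or None when pdftotext is missing.
    print(find_pdftotext())
```

If `find_pdftotext()` prints `None`, installing the tool or fixing `PATH` is probably the first thing to try. The traceback also shows a `method == 'pdfminer'` branch in `Parser.extract`; whether that path works on this Windows setup is not confirmed anywhere in this thread.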