Comments (4)
Thanks for creating an issue.
If you want to apply the same options to all pages, I would suggest calling tabula.template.load_template directly.
Here is an example:
>>> import tabula
>>> fname = "./tests/resources/data.tabula-template.json"
>>> o = tabula.template.load_template(fname)
>>> o
[TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=2, guess=True, area=[[123.999, 154.0, 210.444, 453.88], [410.996, 154.0, 497.441, 487.54]], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=3, guess=True, area=[123.999, 154.0, 322.899, 235.855], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)]
>>> o[0]
TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)
>>> o[0].pages
1
>>> o[0].pages="all"
>>> tabula.read_pdf(pdf_path, options=" ".join(o[0].build_option_list()))
'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug. 22, 2023 9:08:53 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 808 fonts
[ Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2, Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa, Unnamed: 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 145 6.7 3.3 5.7 2.5 virginica
1 146 6.7 3.0 5.2 2.3 virginica
2 147 6.3 2.5 5.0 1.9 virginica
3 148 6.5 3.0 5.2 2.0 virginica
4 149 6.2 3.4 5.4 2.3 virginica
5 150 5.9 3.0 5.1 1.8 virginica, len supp dose
0 4.2 VC 0.5
1 11.5 VC 0.5
2 7.3 VC 0.5
3 5.8 VC 0.5
4 6.4 VC 0.5
5 10.0 VC 0.5
6 11.2 VC 0.5
7 11.2 VC 0.5
8 5.2 VC 0.5
9 7.0 VC 0.5
10 16.5 VC 1.0
11 16.5 VC 1.0
12 15.2 VC 1.0
13 17.3 VC 1.0
14 22.5 VC 1.0]
Of course, there is room for improvement, such as allowing a TabulaOption to be passed to tabula.read_pdf directly, but before that, I'd love to hear your feedback.
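The session above can be wrapped in a small helper. This is only a sketch: the names `apply_to_all_pages` and `read_with_template` are hypothetical, and it assumes the loaded option object can be shallow-copied and has a writable `pages` attribute, as the REPL session suggests. It copies the option instead of mutating the loaded template in place.

```python
import copy


def apply_to_all_pages(option):
    """Return a shallow copy of a template option with pages set to "all".

    The original option is left untouched, so the loaded template
    still reflects what is stored in the JSON file.
    """
    widened = copy.copy(option)
    widened.pages = "all"
    return widened


def read_with_template(pdf_path, template_path):
    """Hypothetical wrapper mirroring the REPL session above."""
    # Imported lazily so apply_to_all_pages works without tabula-py or Java.
    import tabula

    first = tabula.template.load_template(template_path)[0]
    option = apply_to_all_pages(first)
    # Same call as in the session: the options are passed as one string.
    return tabula.read_pdf(pdf_path, options=" ".join(option.build_option_list()))
```

Joining with a single space follows the session above; whether that is safe for option values containing spaces has not been checked here.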
from tabula-py.
Closing since there has been no response.
from tabula-py.
Uhh... sorry I didn't reply sooner, but this is a hobby project I'm working on.
While I understand your suggestion, it means that the templates are no longer defined only in the JSON file, but are also explicitly manipulated in code... I think that for the moment I'll stick with multiple templates and some simple logic to choose which one to use for the extraction.
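For illustration, "simple logic" along these lines could look like the sketch below. The file-name patterns, template paths, and the `choose_template` function are all made up for the example; the actual selection criteria are not described in this thread.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical rules: (file-name pattern, template JSON path).
TEMPLATE_RULES = [
    ("invoice_*.pdf", "templates/invoice.tabula-template.json"),
    ("report_*.pdf", "templates/report.tabula-template.json"),
]


def choose_template(pdf_path, rules=TEMPLATE_RULES, default=None):
    """Return the template path of the first rule matching the PDF's file name."""
    name = Path(pdf_path).name
    for pattern, template in rules:
        if fnmatch(name, pattern):
            return template
    return default
```

The chosen path can then be handed to tabula.read_pdf_with_template(pdf_path, template_path), so each template stays fully defined in its JSON file.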
from tabula-py.
Thanks for your response.
Unfortunately, tabula-py also doesn't know how many pages a PDF has, so the pages="all" option is the only way to handle an unknown number of pages.
from tabula-py.