ledongthuc / pdf

This project forked from rsc/pdf

PDF reader

License: BSD 3-Clause "New" or "Revised" License

Go 100.00%

golang pdf pdf-viewer

PDF Reader

A simple Go library which enables reading PDF files. Forked from https://github.com/rsc/pdf

Features

Get plain text content (without formatting)
Get Content (including all font and formatting information)

Install:

go get -u github.com/ledongthuc/pdf

Read plain text

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true
	content, err := readPdf("test.pdf") // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
}

func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	// Close the file when done; defer only after Open has succeeded.
	defer f.Close()

	b, err := r.GetPlainText()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(b); err != nil {
		return "", err
	}
	return buf.String(), nil
}

Read all text with styles from PDF

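func readPdf2(path string) (string, error) {
	f, r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	// Close the file when done; defer only after Open has succeeded.
	defer f.Close()
	totalPage := r.NumPage()

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}
		var lastTextStyle pdf.Text
		texts := p.Content().Text
		for _, text := range texts {
			// isSameSentence is not defined in this README; supply your own
			// comparison (for example on font, size and position).
			if isSameSentence(text, lastTextStyle) {
				lastTextStyle.S = lastTextStyle.S + text.S
				continue
			}
			// Skip the empty initial value, then start a new run.
			if lastTextStyle.S != "" {
				fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", lastTextStyle.Font, lastTextStyle.FontSize, lastTextStyle.X, lastTextStyle.Y, lastTextStyle.S)
			}
			lastTextStyle = text
		}
		// Print the final run of the page, which the loop above never reaches.
		if lastTextStyle.S != "" {
			fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", lastTextStyle.Font, lastTextStyle.FontSize, lastTextStyle.X, lastTextStyle.Y, lastTextStyle.S)
		}
	}
	return "", nil
}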
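
The snippet above calls isSameSentence, which this README does not define. The following is a minimal sketch of one way to write it, assuming that "same sentence" means same font, same font size and same baseline (Y). The Text struct is a local stand-in for pdf.Text that carries only the fields printed above, and mergeRuns is a hypothetical helper name; neither comes from the library.

```go
package main

import (
	"fmt"
	"math"
)

// Text is a local stand-in for pdf.Text, limited to the fields the README
// example prints. With the real library you would use pdf.Text instead.
type Text struct {
	Font     string
	FontSize float64
	X, Y     float64
	S        string
}

// isSameSentence is a hypothetical helper: two fragments belong together
// when font, font size and baseline match (an assumption, not library logic).
func isSameSentence(cur, last Text) bool {
	const eps = 0.01
	return cur.Font == last.Font &&
		math.Abs(cur.FontSize-last.FontSize) < eps &&
		math.Abs(cur.Y-last.Y) < eps
}

// mergeRuns joins consecutive fragments that isSameSentence accepts and
// returns one Text per run, including the final run.
func mergeRuns(texts []Text) []Text {
	var runs []Text
	for _, t := range texts {
		if n := len(runs); n > 0 && isSameSentence(t, runs[n-1]) {
			runs[n-1].S += t.S
			continue
		}
		runs = append(runs, t)
	}
	return runs
}

func main() {
	texts := []Text{
		{Font: "Helvetica", FontSize: 12, X: 10, Y: 700, S: "Hel"},
		{Font: "Helvetica", FontSize: 12, X: 28, Y: 700, S: "lo"},
		{Font: "Helvetica-Bold", FontSize: 14, X: 10, Y: 680, S: "Title"},
	}
	for _, r := range mergeRuns(texts) {
		fmt.Printf("Font: %s, Font-size: %.1f, y: %.1f, content: %s\n", r.Font, r.FontSize, r.Y, r.S)
	}
}
```

Collecting runs in a slice avoids the need to flush a trailing "last" value after the loop. The tolerance eps is arbitrary and may need tuning per document.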

Read text grouped by rows

package main

import (
	"fmt"
	"os"

	"github.com/ledongthuc/pdf"
)

func main() {
	content, err := readPdf(os.Args[1]) // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
}

func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	// Close the file when done; defer only after Open has succeeded.
	defer func() {
		_ = f.Close()
	}()
	totalPage := r.NumPage()

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}

		rows, err := p.GetTextByRow()
		if err != nil {
			return "", err
		}
		for _, row := range rows {
			fmt.Println(">>>> row:", row.Position)
			for _, word := range row.Content {
				fmt.Println(word.S)
			}
		}
	}
	return "", nil
}

pdf's Issues

Is this content stream reachable with the API?

How do I reach the content stream with "/Im0 Do" (0 4 obj) in this example?

I tried:

package main

import (
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true

	f, r, err := pdf.Open("test.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	page := r.Page(1)
	//fmt.Println(r)
	fmt.Println(page)
	fmt.Println(page.Content())
	return
}

Which outputs this:

{<</Contents 4 0 R /MediaBox [0 0 198.75 235.5] /Parent 1 0 R /Resources <</XObject <</Im0 5 0 R>>>> /Type /Page /pdftk_PageNum 1>>}
{[] []}

So it doesn't seem to work even though the code has token handling for everything you'd expect from a graphic stack content stream.

So, is it a bug or am I missing something obvious?

Thanks.

p.s. the sample came out of img2pdf and was uncompressed with pdftk to make it more convenient to open with a text editor and debug.

Empty rows when parsing PDF

I'm using GetTextByRow() to get the text of every row, but I get empty rows:

2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []

These are the details of the PDF:

It works on another PDF, though. It only fails (returns empty rows) for this PDF.

GetTextByRow issue

GetTextByRow doesn't save the X and Y coordinates of the child text elements properly.

They are all set to 0.

Calling pdf.Page.GetTextByRow returns disordered text for some PDF files

I found that calling pdf.Page.GetTextByRow returns disordered text for some PDF files. For example, I got "761" where it should be "176".
The cause is that page.go sorts with sort.Sort, which is not stable; replacing it with sort.Stable solves the problem.
pdf.Page.GetTextByColumn needs the same change.
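
To illustrate the point about stability: sort.Sort makes no promise about the relative order of elements that compare equal, while sort.Stable keeps them in their original order. The sketch below is not the code from page.go; fragment, byY and rowText are hypothetical names, and the comparison looks at Y only, so digits on the same baseline compare equal.

```go
package main

import (
	"fmt"
	"sort"
)

// fragment is a hypothetical stand-in for one piece of text on a page.
type fragment struct {
	Y float64
	S string
}

// byY orders fragments top to bottom (larger Y first) and ignores
// everything else, so fragments on the same row compare equal.
type byY []fragment

func (f byY) Len() int           { return len(f) }
func (f byY) Swap(i, j int)      { f[i], f[j] = f[j], f[i] }
func (f byY) Less(i, j int) bool { return f[i].Y > f[j].Y }

// rowText sorts a copy of frags with sort.Stable and joins the strings.
// Equal-Y fragments keep the order in which they were read.
func rowText(frags []fragment) string {
	sorted := append([]fragment(nil), frags...)
	sort.Stable(byY(sorted))
	out := ""
	for _, f := range sorted {
		out += f.S
	}
	return out
}

func main() {
	frags := []fragment{{650, "b"}, {700, "1"}, {700, "7"}, {700, "6"}}
	fmt.Println(rowText(frags)) // prints "176b"
}
```

With sort.Sort in place of sort.Stable, the three digits could come out in any order, which matches the "761" symptom described above.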

Infinite loop in Page()

I have a PDF (which unfortunately I am not allowed to share) that causes an infinite loop in the Page() function.

Basically, neither of the two if checks in the kid handling returns true, so the outer loop just keeps repeating.

I'm not a Go developer and have no idea about the inner workings of the PDF format, so I don't know how to address this.

Here are two screenshots from the debugger showing the state within the function:

PS I realize that this is probably a problem with the PDF, but I would prefer an error instead of an infinite loop.

Parsing fails on Empty pages (without Contents)

While looking into the issue described in
tmc/langchaingo#348

I found that the library fails when parsing pages without Contents.

strm := p.V.Key("Contents")

returns a null Value, which then leads to an error.

So when parsing, one either has to check that p.V.Key("Contents").Kind() is not pdf.Null:

		if p.V.IsNull() || p.V.Key("Contents").Kind() == pdf.Null {
			continue
		}

or, I think, the functions using p.V.Key("Contents") should check the returned Kind (or recover from the error).

How to load (get Reader) from bytes instead of file path?

Stream not present

I got the error "stream not present" when calling the p.V.Reader() function. Any advice? Thank you.

parse pdf error

When I parse a PDF, this error sometimes occurs: malformed PDF: reading at offset 0: stream not present

unknown encoding UniGB-UCS2-H

What is this code?
// See PDF 32000-1:2008, Table D.2
I cannot find this document.
Please help.

not a PDF file: missing %%EOF

163123725.pdf

How can I get the size in bytes of each Page of the pdf?

Hi, I'm trying to get the size of each page of the file. I know there is a function that returns the Page, but I can't really access the attributes inside V Value since they are not exported, only methods. Could this be a new feature perhaps?

Thanks

crash when encountering some CJK text amongst English

I get a panic when trying to read the attached PDF file.

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).readHexString(0xc0002a1720)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:203 +0x2f6
github.com/ledongthuc/pdf.(*buffer).readToken(0xc0002a1720)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:157 +0x165
github.com/ledongthuc/pdf.Interpret({0xc0001301e0?, {0x1138a0?, 0xc0?}, {0x4f8bc0?, 0xc000128120?}}, 0xc0002a18c0)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x145
github.com/ledongthuc/pdf.Page.Content({{0xc0001301e0?, {0x1?, 0x0?}, {0x4f1ec0?, 0xc000120a80?}}})
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2b5
main.getChars(0x53cb78?, 0x1, 0x5126f3?)
	/home/mark/app/go/pdfdump/text.go:14 +0x85
main.outputWords(0xc0002a1f00, 0x40c41e?, 0x400000?, 0xc0000666b0?)
	/home/mark/app/go/pdfdump/output.go:75 +0x1ca
main.output(0xc000113f00)
	/home/mark/app/go/pdfdump/output.go:33 +0x1a5
main.main()
	/home/mark/app/go/pdfdump/pdfdump.go:12 +0x99

99.pdf

Won't open some PDFs

While the package opens most PDF files normally, it encounters problems opening some files, instead returning a "panic: malformed PDF: reading at offset 0: stream not present" error.

For example, the file "SP 10-2019 Relatório Analítico de Composições de Custos.pdf" (which you can get in the url "https://www.gov.br/dnit/pt-br/assuntos/planejamento-e-pesquisa/custos-e-pagamentos/custos-e-pagamentos-dnit/sistemas-de-custos/sicro/sudeste/espirito-santo/2019/outubro-1/es-outubro-2019.zip", after extracting the zip file) won't open with your "github.com/ledongthuc/pdf" package, but opens normally with any PDF reader (like Adobe Reader, for instance).

FWIW, the entire error message that I get while trying to open the file is:

panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).reload(0xc04c7db790, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:95 +0x1fe
github.com/ledongthuc/pdf.(*buffer).readByte(0xc04c7db790, 0xc0003ff9d0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:71 +0x67
github.com/ledongthuc/pdf.(*buffer).readToken(0xc04c7db790, 0xc0732d6260, 0x1000)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:135 +0x47
github.com/ledongthuc/pdf.Interpret(0x0, 0x0, 0x0, 0x0, 0xc04c7db930)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x1ae
github.com/ledongthuc/pdf.Page.Content(0xc04f7395c0, 0x48, 0x4dad60, 0xc073356000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2db
main.extraiPDFAnalitico(0x539921, 0x49, 0x0)
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/pdf_analitico.go:50 +0x165
main.main()
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/main.go:18 +0xa8
exit status 2
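
As long as a damaged or unsupported file makes the call panic instead of returning an error, one possible workaround on the caller's side is to turn the panic into an error with recover. This is a minimal sketch using only the standard library; safely is a hypothetical helper name, and the panicking closure merely stands in for a call such as p.Content().

```go
package main

import (
	"errors"
	"fmt"
)

// safely runs fn and converts a panic raised inside it into an error,
// so that one bad file does not terminate the whole program.
func safely(fn func() (string, error)) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			text = ""
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	// Stand-in for a library call that panics on a malformed file.
	_, err := safely(func() (string, error) {
		panic(errors.New("malformed PDF: reading at offset 0: stream not present"))
	})
	fmt.Println("error:", err)
}
```

This only hides the symptom: the file still cannot be read, but the caller can log the error and continue with the next file.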

Handle space after header

PDF files produced by libtiff/tiff2pdf have an extra space after the file header. (In other words, the header is "%PDF-1.1 \n" instead of "%PDF-1.1\n".) They work in most PDF viewers, but this package rejects them with the error "not a PDF file: invalid header". Would it be a good idea to relax the header check and allow the extra space?
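
A relaxed check along these lines could look like the sketch below. It is not the package's actual header check; validHeader is a hypothetical function that accepts "%PDF-1." followed by one digit, then optional spaces or tabs, then a line ending.

```go
package main

import (
	"bytes"
	"fmt"
)

// validHeader reports whether b starts with a "%PDF-1.N" header line,
// allowing spaces or tabs between the version digit and the end of line.
// Hypothetical sketch, not the library's implementation.
func validHeader(b []byte) bool {
	const prefix = "%PDF-1."
	if !bytes.HasPrefix(b, []byte(prefix)) || len(b) <= len(prefix) {
		return false
	}
	if d := b[len(prefix)]; d < '0' || d > '9' {
		return false
	}
	// Skip trailing blanks, then require a carriage return or line feed.
	rest := bytes.TrimLeft(b[len(prefix)+1:], " \t")
	return len(rest) > 0 && (rest[0] == '\r' || rest[0] == '\n')
}

func main() {
	for _, h := range []string{"%PDF-1.1\n", "%PDF-1.1 \n", "%PDF-1.1x\n", "hello"} {
		fmt.Printf("%q -> %v\n", h, validHeader([]byte(h)))
	}
}
```

The sketch still rejects headers with other characters after the version, so it only loosens the check for trailing whitespace.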