
pdf's Introduction

PDF Reader


A simple Go library which enables reading PDF files. Forked from https://github.com/rsc/pdf

Features

  • Get plain text content (without format)
  • Get Content (including all font and formatting information)

Install:

go get -u github.com/ledongthuc/pdf

Read plain text

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true
	content, err := readPdf("test.pdf") // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
	return
}

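func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	// Remember to close the file. Close on a nil *os.File returns an error
	// instead of panicking, so deferring before the error check is safe.
	defer f.Close()
	if err != nil {
		return "", err
	}
	b, err := r.GetPlainText()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(b); err != nil {
		return "", err
	}
	return buf.String(), nil
}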

Read all text with styles from PDF

func readPdf2(path string) (string, error) {
	f, r, err := pdf.Open(path)
	// remember to close the file
	defer f.Close()
	if err != nil {
		return "", err
	}
	totalPage := r.NumPage()

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}
		var lastTextStyle pdf.Text
		texts := p.Content().Text
		for _, text := range texts {
			// isSameSentence is not defined in this README; supply your own comparison.
			if isSameSentence(text, lastTextStyle) {
				lastTextStyle.S = lastTextStyle.S + text.S
			} else {
				printText(lastTextStyle)
				lastTextStyle = text
			}
		}
		// Print the final run, which the loop above never reaches.
		printText(lastTextStyle)
	}
	return "", nil
}

// printText prints one run of text with its style, skipping empty runs.
func printText(t pdf.Text) {
	if t.S == "" {
		return
	}
	fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", t.Font, t.FontSize, t.X, t.Y, t.S)
}
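
The example above calls isSameSentence, which this README does not define. The following is a minimal sketch of one possible comparison, not the library's own helper. It assumes that two text items belong together when they share a font, a font size and a baseline (Y). The local Text struct is a stand-in that mirrors only the fields printed above, so that the sketch compiles without the library; the tolerance value is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"math"
)

// Text is a local stand-in mirroring the pdf.Text fields used in the README
// example (Font, FontSize, X, Y, S). It is not the library type.
type Text struct {
	Font     string
	FontSize float64
	X, Y     float64
	S        string
}

// isSameSentence is a hypothetical helper: it treats two items as part of the
// same run when font, font size and baseline match within a small tolerance.
func isSameSentence(current, last Text) bool {
	const tolerance = 0.01 // assumed; tune for your documents
	return current.Font == last.Font &&
		math.Abs(current.FontSize-last.FontSize) < tolerance &&
		math.Abs(current.Y-last.Y) < tolerance
}

func main() {
	a := Text{Font: "Helvetica", FontSize: 12, X: 10, Y: 700, S: "Hel"}
	b := Text{Font: "Helvetica", FontSize: 12, X: 28, Y: 700, S: "lo"}
	c := Text{Font: "Helvetica-Bold", FontSize: 12, X: 10, Y: 680, S: "Next"}

	fmt.Println(isSameSentence(b, a)) // same font, size and baseline
	fmt.Println(isSameSentence(c, b)) // different font and baseline
}
```

With the real library the parameters would be pdf.Text values instead of the stand-in type.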

Read text grouped by rows

package main

import (
	"fmt"
	"os"

	"github.com/ledongthuc/pdf"
)

func main() {
	content, err := readPdf(os.Args[1]) // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
	return
}

func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	defer func() {
		_ = f.Close()
	}()
	if err != nil {
		return "", err
	}
	totalPage := r.NumPage()

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}

		rows, err := p.GetTextByRow()
		if err != nil {
			return "", err
		}
		for _, row := range rows {
			fmt.Println(">>>> row: ", row.Position)
			for _, word := range row.Content {
				fmt.Println(word.S)
			}
		}
	}
	return "", nil
}

Demo

Run example

pdf's People

Contributors

dangquyitt, dayfine, dcu, ivinpolosony, josharian, karust, ledongthuc, liron-l, louise-jones, luchoman08, nenormalka, odeke-em, qingmo, rikvanmechelen, rsc, t-yuki, victron


pdf's Issues

Is this content stream reachable with the API?

How do I reach the content stream with "/Im0 Do" (0 4 obj) in this example?

I tried:

package main

import (
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true

	f, r, err := pdf.Open("test.pdf")
	defer f.Close()
	if err != nil {
		panic(err)
	}

	page := r.Page(1)
	//fmt.Println(r)
	fmt.Println(page)
	fmt.Println(page.Content())
	return
}

Which outputs this:

{<</Contents 4 0 R /MediaBox [0 0 198.75 235.5] /Parent 1 0 R /Resources <</XObject <</Im0 5 0 R>>>> /Type /Page /pdftk_PageNum 1>>}
{[] []}

So it doesn't seem to work even though the code has token handling for everything you'd expect from a graphic stack content stream.

So, is it a bug or am I missing something obvious?

Thanks.

p.s. the sample came out of img2pdf and was uncompressed with pdftk to make it more convenient to open with a text editor and debug.

Empty rows when parsing PDF

I'm using GetTextByRow() to get the text of every row, but I get empty rows:

2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []
2023/01/26 16:53:58 rows []

Here are the details of the PDF:
[image]

It works on other PDFs, though. It only fails (empty rows) on this PDF.

GetTextByRow issue

GetTextByRow doesn't save the X and Y of the child text elements properly.

They are all set to 0.

pdf.Page.GetTextByRow returns out-of-order text for some PDF files

I discovered that pdf.Page.GetTextByRow returns text in the wrong order for some PDF files. For example, I got "761" where the text should be "176".
The cause is that page.go sorts with sort.Sort, which is not stable. Replacing it with sort.Stable solves the problem.
pdf.Page.GetTextByColumn needs the same change.
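
To illustrate why stability matters here, the sketch below sorts glyphs by their Y coordinate only. When several glyphs share the same Y, sort.Stable guarantees that they keep their original relative order, whereas sort.Sort makes no such guarantee. The glyph type and joinRow function are hypothetical stand-ins, not the code in page.go.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// glyph is a hypothetical stand-in for one piece of positioned text.
type glyph struct {
	Y float64
	S string
}

// byY orders glyphs from the top of the page (largest Y) downwards.
type byY []glyph

func (g byY) Len() int           { return len(g) }
func (g byY) Swap(i, j int)      { g[i], g[j] = g[j], g[i] }
func (g byY) Less(i, j int) bool { return g[i].Y > g[j].Y }

// joinRow sorts a copy of the glyphs with sort.Stable and concatenates them.
// Glyphs with equal Y keep the order in which they were passed in.
func joinRow(glyphs []glyph) string {
	sorted := append([]glyph(nil), glyphs...)
	sort.Stable(byY(sorted))
	var sb strings.Builder
	for _, g := range sorted {
		sb.WriteString(g.S)
	}
	return sb.String()
}

func main() {
	row := []glyph{{Y: 700, S: "1"}, {Y: 700, S: "7"}, {Y: 700, S: "6"}}
	fmt.Println(joinRow(row)) // prints "176"
}
```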

Infinite loop in Page()

I have a PDF (which unfortunately I am not allowed to share) that causes an infinite loop in the Page() function.

Basically, neither of the two if checks in the kid handling returns true, and the outer loop just keeps repeating.

I'm not a go developer and have no idea about the inner workings of the PDF format, so I don't know how to address this.

Here are two screenshots from the debugger showing the state within the function:

2023-06-20-102905_387x551_scrot

2023-06-20-103316_262x121_scrot

PS I realize that this is probably a problem with the PDF, but I would prefer an error instead of an infinite loop.
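
The issue does not establish why the loop never terminates, so the following is only a general sketch of how a tree walk can report an error instead of looping forever. It assumes the problem is a page tree whose nodes refer back to an ancestor. The node type, countPages and errCycle are hypothetical and are not part of the library.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for an entry in a PDF page tree.
type node struct {
	isPage bool
	kids   []*node
}

var errCycle = errors.New("page tree contains a cycle")

// countPages walks the tree depth-first. It returns errCycle if it reaches a
// node that is already on the current path, instead of recursing forever.
func countPages(n *node, onPath map[*node]bool) (int, error) {
	if n == nil {
		return 0, nil
	}
	if onPath[n] {
		return 0, errCycle
	}
	onPath[n] = true
	defer delete(onPath, n) // leave the path clean for sibling subtrees

	if n.isPage {
		return 1, nil
	}
	total := 0
	for _, kid := range n.kids {
		c, err := countPages(kid, onPath)
		if err != nil {
			return 0, err
		}
		total += c
	}
	return total, nil
}

func main() {
	ok := &node{kids: []*node{{isPage: true}, {isPage: true}}}
	fmt.Println(countPages(ok, map[*node]bool{}))

	loop := &node{}
	loop.kids = []*node{loop} // a node that lists itself as a kid
	fmt.Println(countPages(loop, map[*node]bool{}))
}
```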

Parsing fails on Empty pages (without Contents)

While looking into the issue described in tmc/langchaingo#348, I found that the library fails when parsing pages without Contents.

strm := p.V.Key("Contents")

returns a Null Value, which then leads to an error.

So when parsing, one either has to check that p.V.Key("Contents").Kind() is not pdf.Null:

		if p.V.IsNull() || p.V.Key("Contents").Kind() == pdf.Null {
			continue
		}

or, I think, the functions that use p.V.Key("Contents") should check the returned Kind (or recover from the error).

Stream not present

I get a "stream not present" error when calling the p.V.Reader() function. Any advice? Thank you.

parse pdf error

When I parse a PDF, this error sometimes occurs: malformed PDF: reading at offset 0: stream not present

How can I get the size in bytes of each page of the PDF?

Hi, I'm trying to get the size of each page of the file. I know there is a function that returns the Page, but I can't access the attributes inside the V Value, since only its methods are exported. Could this be a new feature, perhaps?

Thanks

crash when encountering some CJK text amongst English

I get a panic when trying to read the attached PDF file.

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).readHexString(0xc0002a1720)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:203 +0x2f6
github.com/ledongthuc/pdf.(*buffer).readToken(0xc0002a1720)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:157 +0x165
github.com/ledongthuc/pdf.Interpret({0xc0001301e0?, {0x1138a0?, 0xc0?}, {0x4f8bc0?, 0xc000128120?}}, 0xc0002a18c0)
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x145
github.com/ledongthuc/pdf.Page.Content({{0xc0001301e0?, {0x1?, 0x0?}, {0x4f1ec0?, 0xc000120a80?}}})
	/home/mark/opt/gows/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2b5
main.getChars(0x53cb78?, 0x1, 0x5126f3?)
	/home/mark/app/go/pdfdump/text.go:14 +0x85
main.outputWords(0xc0002a1f00, 0x40c41e?, 0x400000?, 0xc0000666b0?)
	/home/mark/app/go/pdfdump/output.go:75 +0x1ca
main.output(0xc000113f00)
	/home/mark/app/go/pdfdump/output.go:33 +0x1a5
main.main()
	/home/mark/app/go/pdfdump/pdfdump.go:12 +0x99

99.pdf

Won't open some PDFs

While the package opens most PDF files normally, it fails on some files and panics with "panic: malformed PDF: reading at offset 0: stream not present".

For example, the file "SP 10-2019 Relatório Analítico de Composições de Custos.pdf" (which you can get from the URL "https://www.gov.br/dnit/pt-br/assuntos/planejamento-e-pesquisa/custos-e-pagamentos/custos-e-pagamentos-dnit/sistemas-de-custos/sicro/sudeste/espirito-santo/2019/outubro-1/es-outubro-2019.zip", after extracting the zip file) won't open with the "github.com/ledongthuc/pdf" package, but opens normally with any PDF reader (Adobe Reader, for instance).

FWIW, the entire error message that I get while trying to open the file is:

panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).reload(0xc04c7db790, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:95 +0x1fe
github.com/ledongthuc/pdf.(*buffer).readByte(0xc04c7db790, 0xc0003ff9d0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:71 +0x67
github.com/ledongthuc/pdf.(*buffer).readToken(0xc04c7db790, 0xc0732d6260, 0x1000)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:135 +0x47
github.com/ledongthuc/pdf.Interpret(0x0, 0x0, 0x0, 0x0, 0xc04c7db930)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x1ae
github.com/ledongthuc/pdf.Page.Content(0xc04f7395c0, 0x48, 0x4dad60, 0xc073356000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2db
main.extraiPDFAnalitico(0x539921, 0x49, 0x0)
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/pdf_analitico.go:50 +0x165
main.main()
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/main.go:18 +0xa8
exit status 2

Handle space after header

PDF files produced by libtiff/tiff2pdf have an extra space after the file header. (In other words, the header is "%PDF-1.1 \n" instead of "%PDF-1.1\n".) They work in most PDF viewers, but this package rejects them with the error "not a PDF file: invalid header". Would it be a good idea to relax the header check and allow the extra space?
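
As a rough illustration of what a relaxed check could look like, the sketch below accepts a header of the form "%PDF-<digit>.<digit>", followed by optional spaces or tabs and then an end-of-line byte. hasPDFHeader and isDigit are hypothetical names; this is not the check the package currently performs.

```go
package main

import (
	"bytes"
	"fmt"
)

// isDigit reports whether b is an ASCII digit.
func isDigit(b byte) bool { return b >= '0' && b <= '9' }

// hasPDFHeader reports whether buf starts with "%PDF-<digit>.<digit>",
// optionally followed by spaces or tabs, and then '\n' or '\r'.
func hasPDFHeader(buf []byte) bool {
	const prefix = "%PDF-"
	if !bytes.HasPrefix(buf, []byte(prefix)) {
		return false
	}
	rest := buf[len(prefix):]
	if len(rest) < 3 || !isDigit(rest[0]) || rest[1] != '.' || !isDigit(rest[2]) {
		return false
	}
	// Tolerate trailing blanks such as the one tiff2pdf writes.
	rest = bytes.TrimLeft(rest[3:], " \t")
	return len(rest) > 0 && (rest[0] == '\n' || rest[0] == '\r')
}

func main() {
	fmt.Println(hasPDFHeader([]byte("%PDF-1.1\n")))  // strict form
	fmt.Println(hasPDFHeader([]byte("%PDF-1.1 \n"))) // extra space
	fmt.Println(hasPDFHeader([]byte("%PDF-1.1x\n"))) // junk after version
}
```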
