
pdfkitten's Introduction

Kurt the PDFKitten

A Proof-of-Concept framework for searching PDF documents on iOS.

PLEASE NOTE THAT THIS SOFTWARE IS EXPERIMENTAL AND WILL MOST LIKELY NEVER BE FINISHED.

It was created to show how PDF search and highlighting could be done in third-party apps. Alas, it will likely never reach completeness or cover all use cases.

Why?

iOS, up to and including the current fifth version, does not provide any public APIs for searching PDF documents, or determining where on a page a given word is drawn. Any developer aiming to provide these features in an app must use low-level Core Graphics APIs, and keep track of the stateful process of laying out the content of the page.

This project is meant to facilitate this by implementing a complete workflow that takes a PDF document and a keyword string as input and returns a set of selections that can be drawn on top of the PDF document.

How?

First, create a new instance of the scanner.

	CGPDFPageRef page = CGPDFDocumentGetPage(document, 1);
	Scanner *scanner = [Scanner scannerWithPage:page];

Set a keyword (case-insensitive) and scan a page.

	NSArray *selections = [scanner select:@"happiness"];

Finally, draw the selections.

	for (Selection *selection in selections)
	{
		// draw selection
	}

Limitations

The PDF specification is huge, allowing for different fonts, text encodings et cetera. This means strict design is a must, and thorough testing is needed. At this point, this project is not fully compatible with all font types, and support for non-Latin characters in particular will require further development.

Offering a complete solution for processing any PDF document would apparently require the inclusion of a complete library of font files. We currently do not intend to include more than the bare essentials for a proof-of-concept application.

Only Latin character sets are currently supported.

License and Warranty

This software is provided under the MIT license, see License.txt.

pdfkitten's People

Contributors

chrisjrn, keeshux, kurtcode, tarunbatta

pdfkitten's Issues

how is search done?

I know it's not case sensitive: it converts the text to lower case and then searches for the word. But how is it done?
Does it search on the basis that

  1. the keyword is contained in a word
  2. the word starts with the keyword
  3. the word exactly matches the keyword

Also, how can I highlight the text? Are we supposed to call a method to enable highlighting?

Is it in any way possible to search for and highlight words with spaces? I need to highlight the whole line, but it doesn't; it just highlights words without spaces.
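The three interpretations listed in this question can be told apart with a small, self-contained C sketch. It does not describe what the project's detector actually does; `MatchMode`, `equal_prefix_ci` and `word_matches` are hypothetical names, and the comparison is limited to ASCII.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical match modes, one per option in the question. */
typedef enum { MATCH_CONTAINS, MATCH_PREFIX, MATCH_EXACT } MatchMode;

/* Case-insensitive comparison of the first n bytes (ASCII only). */
static bool equal_prefix_ci(const char *a, const char *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i])) {
            return false;
        }
    }
    return true;
}

/* Returns true if `word` matches `keyword` under the given mode. */
static bool word_matches(const char *word, const char *keyword, MatchMode mode) {
    size_t wlen = strlen(word);
    size_t klen = strlen(keyword);
    if (klen == 0 || klen > wlen) {
        return false; /* an empty keyword never matches */
    }
    switch (mode) {
    case MATCH_EXACT:
        return wlen == klen && equal_prefix_ci(word, keyword, klen);
    case MATCH_PREFIX:
        return equal_prefix_ci(word, keyword, klen);
    case MATCH_CONTAINS:
        for (size_t start = 0; start + klen <= wlen; start++) {
            if (equal_prefix_ci(word + start, keyword, klen)) {
                return true;
            }
        }
        return false;
    }
    return false;
}
```

With these definitions, `word_matches("Unhappy", "happ", MATCH_CONTAINS)` is true while the same call with `MATCH_PREFIX` is false.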

Crash when scanning a keyword of ""

You can replicate this in the sample app by tapping in the search box in the Simulator and hitting return. This searches for an empty keyword string.

I do this to retrieve all of the strings back from a PDF page. This used to work fine.

The crash is in the appendString: method of StringDetector:

unichar expectedCharacter = [keyword characterAtIndex:keywordPosition];

The error is:

-[NSCFString characterAtIndex:]: Range or index out of bounds'

Thank you!
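One way to avoid indexing into an empty keyword is to check its length before reading the expected character. The following is only a sketch in plain C, not the project's StringDetector; `Detector`, `detector_init` and `detector_append` are hypothetical names, and the restart after a mismatch is deliberately naive.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical streaming keyword detector (ASCII, case-insensitive). */
typedef struct {
    const char *keyword;   /* not owned by the detector */
    size_t keyword_length;
    size_t position;       /* keyword characters matched so far */
} Detector;

static void detector_init(Detector *d, const char *keyword) {
    d->keyword = keyword ? keyword : "";
    d->keyword_length = strlen(d->keyword);
    d->position = 0;
}

/* Feeds one character; returns true when the keyword has just been completed. */
static bool detector_append(Detector *d, char c) {
    if (d->keyword_length == 0) {
        return false; /* guard: there is no expected character to read */
    }
    int got = tolower((unsigned char)c);
    int expected = tolower((unsigned char)d->keyword[d->position]);
    if (got != expected) {
        /* Naive restart: c may still begin a new match. */
        d->position = (got == tolower((unsigned char)d->keyword[0])) ? 1 : 0;
        return false;
    }
    d->position++;
    if (d->position == d->keyword_length) {
        d->position = 0;
        return true;
    }
    return false;
}
```

With the guard in place, an empty keyword simply never reports a match. Returning every string on the page for an empty keyword, as described above, would need a separate code path.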

Search can't find strings with small spaces?

In this file http://dl.dropbox.com/u/39382628/test2.pdf, the word "test" is not found, even though the file contains just two words and they are both "test".

It seems that there are some spaces between some of the characters in the array passed to the TJ handler. These spaces cause the Scanner to reset its stringDetector, which prevents it from finding the complete string.

Here is what I'm seeing:

2011-11-09 17:00:45.563 neuAnnotatePlus[12749:1f303] didScanString: t
2011-11-09 17:00:45.563 neuAnnotatePlus[12749:1f303] didScanSpace: value=111.00000000, width=0.11100001
2011-11-09 17:00:45.580 neuAnnotatePlus[12749:1f303] didScanString: e
2011-11-09 17:00:45.580 neuAnnotatePlus[12749:1f303] didScanSpace: value=-0.20000000, width=-0.00020000
2011-11-09 17:00:45.581 neuAnnotatePlus[12749:1f303] didScanString: st
2011-11-09 17:00:45.585 neuAnnotatePlus[12749:1f303] didScanSpace: value=0.20000000, width=0.00020000
2011-11-09 17:00:45.585 neuAnnotatePlus[12749:1f303] didScanString:
2011-11-09 17:00:45.586 neuAnnotatePlus[12749:1f303] didScanString:
2011-11-09 17:00:45.587 neuAnnotatePlus[12749:1f303] didScanString: t
2011-11-09 17:00:45.587 neuAnnotatePlus[12749:1f303] didScanSpace: value=0.20000000, width=0.00020000
2011-11-09 17:00:45.588 neuAnnotatePlus[12749:1f303] didScanString: e
2011-11-09 17:00:45.588 neuAnnotatePlus[12749:1f303] didScanSpace: value=-0.20000000, width=-0.00020000
2011-11-09 17:00:45.589 neuAnnotatePlus[12749:1f303] didScanString: st
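A possible approach is to treat a TJ adjustment as a word break only when it opens a gap comparable to the width of a space. In the PDF specification, the numbers in a TJ array are expressed in thousandths of a unit of text space and are subtracted from the current position. The sketch below assumes 100% horizontal scaling; `WORD_GAP_FRACTION` and both function names are hypothetical, and the threshold of half a space is an arbitrary choice.

```c
#include <stdbool.h>

/* Assumed threshold: a gap narrower than half a space does not split a word. */
#define WORD_GAP_FRACTION 0.5

/* Displacement caused by one TJ number, in text space units.
   Positive results move the next glyph further along the line. */
static double tj_displacement(double tj_value, double font_size) {
    return -tj_value / 1000.0 * font_size;
}

/* Returns true only when the adjustment opens a gap that is at least
   WORD_GAP_FRACTION of the font's space width. */
static bool is_word_gap(double tj_value, double font_size, double space_width) {
    return tj_displacement(tj_value, font_size) >= WORD_GAP_FRACTION * space_width;
}
```

With a hypothetical space width of 0.25, the adjustments from the log (111, -0.2 and 0.2) would all be treated as kerning rather than as word breaks.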

Use this library

I'm trying to use this library, but I can't find any tutorial on using this framework in another project. Please give me some suggestions.

Search for PDF in Landscape mode

When I try to use the search function for a PDF in landscape mode, the found words are not highlighted as they are in portrait mode. Is there a way to fix this bug?

Best regards

Unable to search in my sample pdf.

Thanks for writing such nice code. It helps me a lot. I downloaded a PDF from Apple's site; you can find it by searching for "Event Handling guides for iOS". If I search for "Event", the application finds nothing on the first page. I know it might be a font issue, but I'm not sure what to do. It would be really great if you could point out what to do to make it work.

Forgot to init "content" in "operatorWithStartingToken"?

NSString *content = nil;
NSCharacterSet *newLineSet = [NSCharacterSet newlineCharacterSet];
NSCharacterSet *tagSet = [NSCharacterSet characterSetWithCharactersInString:@"<>"];
NSString *separatorString = @"> <";

chars = [[NSMutableDictionary alloc] init];    
NSScanner *rangeScanner = [NSScanner scannerWithString:content];

This produces the warning "NSScanner: nil string argument" and makes this method meaningless.

PDF Search & Text Highlighting

Hi !

I don't know how else to contact you:

Would you be available/interested in doing contract work around PDF technology on iOS?
If yes, then please send me a mail at alexander [dot] marktl [at] gmail [dot] com

Thanks
Alex

Issues found when looking for specific words...

If you search for 'finally' or 'find' in "Kurt the cat.pdf", there will be no result.
Actually, there is a 'finally' on line 3 and a 'find' on the last line.

I inserted some logging into the operator callbacks and found that the strings returned were actually 'fnally' and 'fnd'; the 'i' is missing.

When zoomed in on the two words, you can see that the 'i' is rendered extremely close to the 'f'.

How should we fix this 'fnally' issue?
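One possible cause, which this report does not confirm, is that the PDF draws "fi" as a single ligature glyph. If the decoded text contains the Unicode ligature code points (U+FB00 to U+FB04), they could be expanded before matching. A minimal C sketch under that assumption, with hypothetical function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the ASCII expansion for a Latin ligature code point, or NULL. */
static const char *ligature_expansion(uint32_t code_point) {
    switch (code_point) {
    case 0xFB00: return "ff";
    case 0xFB01: return "fi";
    case 0xFB02: return "fl";
    case 0xFB03: return "ffi";
    case 0xFB04: return "ffl";
    default:     return NULL;
    }
}

/* Writes the expanded text for `count` code points into `out`, whose capacity
   `cap` includes the terminator. Other non-ASCII code points become '?'.
   Returns the number of bytes written, excluding the terminator. */
static size_t expand_ligatures(const uint32_t *cps, size_t count,
                               char *out, size_t cap) {
    size_t n = 0;
    if (cap == 0) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        const char *expansion = ligature_expansion(cps[i]);
        char single[2] = { (char)(cps[i] < 0x80 ? cps[i] : '?'), '\0' };
        const char *piece = expansion ? expansion : single;
        for (; *piece != '\0' && n + 1 < cap; piece++) {
            out[n++] = *piece;
        }
    }
    out[n] = '\0';
    return n;
}
```

If the font instead maps the glyph to a bare "f", the fix would have to happen earlier, in the font's character mapping.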

Highlighting Matches

Hey,
I wasn't sure how else to contact you, but I am in the middle of a pdf reader project plugin for phonegap. I currently have vfr reader running and it lacks search. What I was hoping to do was import your project into that, or vice versa. Could you please explain how the highlighting works? Is it an overlay of a transparent view on top of the pdf? Or do you actually alter the content stream of the pdf with the yellow rectangle?

Thanks

Standard 14 fonts (type1 fonts)

I'm (trying) to write a "PDF to plain text parser" and I'm using pdfKitten as an example.

I have a PDF which uses courier and courier-bold. Courier is included in what the PDF specs call the "standard 14 fonts". These fonts do not need to specify values for the "Widths" "FirstChar" "LastChar" keys in the font dictionary.

This causes a problem in the Scanner/didScanSpace function, which now always indicates that a space was scanned because the value of [font widthOfSpace] is 0.

Also, the function Font/stringWithPDFString: cannot transform the pdfString into an NSString.

I'm willing to fix this but I need some pointers to get me started. For instance, because they are called the "standard 14 fonts" I assume the iOS framework has the values for "Widths" "FirstChar" "LastChar" stored somewhere but where? How can I obtain these values from the iOS framework?
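For the Courier case specifically, a fallback is simple because every glyph in the four standard Courier faces is 600 units wide on a 1000-unit em. The sketch below covers only that family and says nothing about where iOS might store metrics for the other standard fonts; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when base_font names one of the four standard Courier faces. */
static bool is_standard_courier(const char *base_font) {
    static const char *const names[] = {
        "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (strcmp(base_font, names[i]) == 0) {
            return true;
        }
    }
    return false;
}

/* Glyph width in text space units for a font without a Widths array.
   Courier glyphs are 600/1000 em wide; any other font gets default_width. */
static double fallback_glyph_width(const char *base_font, double default_width) {
    return is_standard_courier(base_font) ? 600.0 / 1000.0 : default_width;
}
```

The proportional standard fonts (Helvetica, Times and so on) would need per-glyph tables, which this sketch does not attempt.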

Can you help me for Chinese text search?

Hello, thanks for your amazing work on PDF. I want to implement Chinese text search in PDFs. I have worked on it for several days, but I don't have any idea how to proceed. Can you share some thoughts? Thanks again.

PDF Text scanner missing line breaks and space

Hello,

Thank you for providing such a beautiful framework for handling PDFs. Your framework saved me a lot of time and helped me a lot. I noticed some things in the framework while creating a custom text highlighting feature, where highlighting follows text-to-speech as it reads aloud. For that I am using NSRange to determine which part of the string to highlight. Everything has worked very well so far and I am able to highlight. But there are some issues with the scanned PDF text: some spaces between words are missing, and so are line breaks.

I have never worked with PDF before and I don't know much about it, but I am now looking into how things work. I found that you are using CGPDFScannerRef to scan text from the PDF, so there must be something I can do to get better text. Can you please give me some guidance on where I should look, and whether there is any tutorial about CGPDFScannerRef?

Thank you!

Wrong width calculation

Hi Kurt,

First, thank you very much for your contribution. It's nice to see that complex projects still get crafted.

I have a problem which I already explained on Stack Overflow ( http://stackoverflow.com/questions/12914479/pdfkitten-is-highlighting-on-wrong-position ).

The highlighting frame is sometimes in the wrong position. Usually it is off by only a small padding, but in my case by up to 100px-200px. I can't see a pattern right now, and I tried different PDFs with the same fonts as in my complex ones. You can reproduce a wrong calculation by searching for the string "in" in your example PDF.

Can you give me any direction or tip on where to look? In slightly more complex PDFs, 40% of the selections go wrong. I can send you the screenshots by email because I can't post them publicly.

Thank you

Text selection?

Has anybody successfully implemented UITextField-like text selection in PDFs based on the PDFKitten sources? I don't want to reinvent the wheel. :-)

a fix for FontDescriptor.m

In FontDescriptor.m, - (id)initWithPDFDictionary:(CGPDFDictionaryRef)dict, CGPDFDictionaryGetName(dict, "Type", &type) sometimes returns a nil in type.

Testing as follows works around the problem:
if (type == nil)
{
	[self release];
	return nil;
}

Laurent.

leaks

Thanks for the project, this has helped me a lot.

I'm not so great with git so I will list a few leaks here:

CMap.m, line 112: initWithPDFStream
text should be released

Scanner.m, line 498: dealloc
selections should be released (this one wouldn't be picked up on in the example project as the PDFPage is reused and never released), this one leaks the most amount of info

StringDetector.m, line 183: dealloc
unicodeContent should be released

StringDetector.m, line 12: initWithKeyword
line 16 in here calls [self setKeyword:](self.keyword = str;) which calls [self reset] which will allocate and set unicodeContent, directly after self.keyword = str; on line 17 unicodeContent is set again (unicodeContent = [[NSMutableString alloc] init];) overwriting the old one set inside [self reset]. The method should be:

- (id)initWithKeyword:(NSString *)str
{
	if ((self = [super init]))
	{
		self.keyword = str;

		if (unicodeContent == nil)
			unicodeContent = [[NSMutableString alloc] init];
	}
	return self;
}

Again, thanks heaps! :)

Pdf searching

Hi,
I'm currently in the process of integrating the search functionality into my project.
From what I gather, each time a search term is entered, the search is performed for every page drawn and the keyword is highlighted. This works fine when I swipe through the pages. How do I search the entire document and list the results in a table view with the page number and some text? For example, if I searched for the word "the", I would get a string like "enter 'the' dragon" along with the page number. I'm looking to jump to a page based on the table row tapped.
Pardon me if this isn't the place to ask for help. I'm just starting out, and any help on this would be much appreciated.
Thanks!
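A document-wide search can be reduced to looping over the extracted text of every page and recording where each match occurs. The C sketch below assumes the page texts are already available as plain strings (extracting them is the scanner's job and is not shown); `SearchHit`, `starts_with_ci` and `search_document` are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* One search result: enough to build a table row and jump to the page. */
typedef struct {
    size_t page_number;  /* 1-based */
    size_t offset;       /* byte offset of the match within the page text */
} SearchHit;

/* Case-insensitive test (ASCII only) for keyword at the start of text. */
static int starts_with_ci(const char *text, const char *keyword) {
    for (; *keyword != '\0'; text++, keyword++) {
        if (*text == '\0' ||
            tolower((unsigned char)*text) != tolower((unsigned char)*keyword)) {
            return 0;
        }
    }
    return 1;
}

/* Scans every page text and stores up to max_hits results in hits.
   Returns the total number of occurrences, which may exceed max_hits. */
static size_t search_document(const char *const *pages, size_t page_count,
                              const char *keyword,
                              SearchHit *hits, size_t max_hits) {
    size_t total = 0;
    if (keyword[0] == '\0') {
        return 0;
    }
    for (size_t p = 0; p < page_count; p++) {
        for (size_t i = 0; pages[p][i] != '\0'; i++) {
            if (starts_with_ci(pages[p] + i, keyword)) {
                if (total < max_hits) {
                    hits[total].page_number = p + 1;
                    hits[total].offset = i;
                }
                total++;
            }
        }
    }
    return total;
}
```

The stored offset can be used to cut a short snippet around the match for display, and the return value doubles as an occurrence count for the whole document.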

Scanning Issue

Hello,

When I scan the PDF, it breaks up some of the text, and I am unable to search for that text because it is not scanned. I can find some text in a paragraph, but other sentences in the same paragraph are not found. The scanner does not scan the text properly. Any ideas, please?

Regression in scanning a document

Search for "Council" in this document:

https://dl.dropbox.com/s/l4ebaq33eayvacm/Minutes20091214.pdf?dl=1

This used to work OK, but current master doesn't find any of the text in the doc.

The version of the code that I have that works is from around Nov or Dec 2011. I note that the tania2000 branch also doesn't work on this doc.

As an experiment I went back to this revision, and it worked fine:

Commit: ea3661e [ea3661e]
Parents: c8b3be3
Author: Marcus Hedenström
Date: 25 October 2011 8:55:29 AM AEDT
Page drawing tweaks, better info

CGPDFStringRef to NSString *

Hi, I'm working on a project and I need to be able to highlight parts of the text by location and not by match, so I took your project and slightly modified it so that [scanner selections] returns every single character frame instead of only the places where the keyword matches.

The change was fairly simple and it works like a charm. However, I did find a "bug": some CGPDFStringRefs are wrongly converted (this happens on the pdf downloaded from here).

When the scanner starts, it reads the first "A" (from "A cat in his...") and gets an error when converting it:

- (NSString *)stringWithCode:(int)code
{
    static NSString *singleUnicodeCharFormat = @"%C";
    NSString *characterName = [names objectForKey:[NSNumber numberWithInt:code]];
    unichar unicodeValue = [FontFile characterByName:characterName];
    return [NSString stringWithFormat:singleUnicodeCharFormat, unicodeValue];
}

unicodeValue is 0, so when it creates the return value, it's an incorrect value

this happens with about 40% of the characters found in that PDF
i tried using CGPDFStringCopyTextString like this:

CFStringRef cfStr = CGPDFStringCopyTextString(string);
NSString *cidString = [NSString stringWithString:(NSString *)cfStr];
NSString *unicodeString = [[NSString stringWithString:(NSString *)cfStr] lowercaseString];
CFRelease(cfStr);

and all the characters are converted correctly

is there a reason I should be using your method? or should I (and possibly you too) use the CGPDFStringCopyTextString function?

if i can get in contact with you i could provide you with further detail / screenshots

anyways, thanks for the great work you've done :)

Clarification on license...

Being an indie developer I'm hoping to avoid any problems down the line by including the code. I can understand 'as is' in regards to warranty, but is the licence MIT or BSD?

Also ... thanks... I've probably learned enough about the code to start contributing.

Newline not working

When searching for a string that spans multiple lines, the frame comes out wrong. This is because the search doesn't seem to account for multiple lines (although I think this is implemented). I did notice that the T*, ' and " operators are set up, but they aren't being called.

get occurrences

Hi everybody, I'm working with the pdfkitten library. I want to get the total number of occurrences of a string in a PDF file. Can anybody help me, please? Thanks in advance.

memory warning

When I scan a PDF with 100+ pages (inside a for loop), memory increases with every page and is not released even after calling [scanner release].

Issue with cm operator

Thanks for your code. I have learnt a lot studying it.

I think you have an issue with cm operator in scanner.m

the old ctm should be pre-multiplied with the new ctm

//state.ctm = CGAffineTransformConcat(state.ctm, t);//ORIGINAL
state.ctm = CGAffineTransformConcat(t,state.ctm);

See the PDF reference, chapter 4.2:

When a sequence of transformations is carried out, the matrix representing the combined transformation (M′) is calculated by premultiplying the matrix representing the additional transformation (M_T) with the one representing all previously existing transformations (M):
M′ = M_T × M
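The effect of the argument order can be checked with a small stand-alone C sketch that uses the same row-vector convention as the PDF reference. `Affine`, `affine_concat` and `apply_cm` are hypothetical stand-ins, not Core Graphics types or calls.

```c
/* 3x3 affine matrix [a b 0; c d 0; tx ty 1], as in the PDF reference. */
typedef struct { double a, b, c, d, tx, ty; } Affine;

/* Returns m1 x m2. With row vectors, a point is transformed by m1 first
   and then by m2. */
static Affine affine_concat(Affine m1, Affine m2) {
    Affine r;
    r.a  = m1.a * m2.a + m1.b * m2.c;
    r.b  = m1.a * m2.b + m1.b * m2.d;
    r.c  = m1.c * m2.a + m1.d * m2.c;
    r.d  = m1.c * m2.b + m1.d * m2.d;
    r.tx = m1.tx * m2.a + m1.ty * m2.c + m2.tx;
    r.ty = m1.tx * m2.b + m1.ty * m2.d + m2.ty;
    return r;
}

/* cm operator: the new matrix premultiplies the current one, M' = M_T x M. */
static Affine apply_cm(Affine ctm, Affine t) {
    return affine_concat(t, ctm);
}
```

For a CTM that scales by 2 and a cm operand that translates by (10, 5), `apply_cm` yields a translation of (20, 10): the offset is scaled by the existing CTM. Swapping the arguments would leave it at (10, 5), which is the behaviour this issue reports for the original line.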

Search is not working for some fonts.

Search is not working for the following fonts. Can you please help me work out how to fix this?

fonts = {
R9 = "ZBPUFD+TT15Ct00 {\n\ttype = TrueTypeFont\n\tcharacter widths = 25\n\ttoUnicode = 1\n}\n";
}
names = (
R9
)

fonts = {
F1 = "UIFRYV+ArialMT {\n\ttype = Type0Font\n\tcharacter widths = 0\n\ttoUnicode = 1\n\tdescendant fonts = 1\n}\n";
F2 = "CBLYWM+Arial-BoldMT {\n\ttype = Type0Font\n\tcharacter widths = 0\n\ttoUnicode = 1\n\tdescendant fonts = 1\n}\n";
}
names = (
F1,
F2
)

fonts = {
"TT1.1" = "QEAUZH+Cambria-Bold {\n\ttype = TrueTypeFont\n\tcharacter widths = 26\n\ttoUnicode = 1\n}\n";
"TT2.1" = "JELXHX+Cambria-BoldItalic {\n\ttype = TrueTypeFont\n\tcharacter widths = 1\n\ttoUnicode = 1\n}\n";
"TT3.1" = "GMBIZN+Cambria {\n\ttype = TrueTypeFont\n\tcharacter widths = 80\n\ttoUnicode = 1\n}\n";
"TT4.1" = "KGXGSE+TimesNewRomanPSMT {\n\ttype = TrueTypeFont\n\tcharacter widths = 1\n\ttoUnicode = 1\n}\n";
}
names = (
"TT1.1",
"TT2.1",
"TT3.1",
"TT4.1"
)

Stop searching

How I can stop searching while it works?
For example, user pressed search button, searching has started and user press search button again.
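A common pattern for this is a cancellation flag that the search loop checks between pages. The sketch below is plain C11 and does not use any project API; `SearchJob`, `PageScanFn` and the sample callback are hypothetical. In an app, the flag would be set from the UI while the loop runs on a background thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical search job with a flag that another thread may set. */
typedef struct {
    atomic_bool cancelled;
} SearchJob;

typedef void (*PageScanFn)(size_t page_number, void *context);

static void search_job_init(SearchJob *job) {
    atomic_init(&job->cancelled, false);
}

static void search_job_cancel(SearchJob *job) {
    atomic_store(&job->cancelled, true);
}

/* Scans pages 1..page_count, checking the flag before each page.
   Returns the number of pages that were actually scanned. */
static size_t search_job_run(SearchJob *job, size_t page_count,
                             PageScanFn scan_page, void *context) {
    size_t scanned = 0;
    for (size_t page = 1; page <= page_count; page++) {
        if (atomic_load(&job->cancelled)) {
            break; /* stop before starting another page */
        }
        scan_page(page, context);
        scanned++;
    }
    return scanned;
}

/* Sample callback: simulates the user pressing the search button again
   while page 3 is being scanned. The context is the job itself. */
static void cancel_on_page_three(size_t page_number, void *context) {
    if (page_number == 3) {
        search_job_cancel((SearchJob *)context);
    }
}
```

Cancellation here takes effect at page boundaries only. Stopping in the middle of a page would require the scanning code itself to check the flag.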


Position Issue

I'm facing an issue with some sentences: the position and width are not calculated properly for some keywords in the PDF.

These are the fonts it detects in the PDF:

C2_0: FCUFBJ+HelveticaNeue-Light {
type = Type0Font
character widths = 0
toUnicode = 1
descendant fonts = 1
}
2012-09-19 15:56:36.907 PDFKitten[4502:11103] TT0: OPYSFF+HelveticaNeue-Light {
type = TrueTypeFont
character widths = 117
toUnicode = 1
}

Any help with what the issue might be?

Selection frames not considering CTM

Using this file: http://dl.dropbox.com/u/8069980/neuAnnotate%20Guide.pdf

Try searching for "select". You'll notice that on the second page the "Select" in bullet 3 has only the "Sel" highlighted.

The problem is that Selection -finalizeWithState does not take the CTM into consideration. In the search for "Sel" we see a cm and TD right after finding the "sel". The TD resets the text matrix, so -finalizeWithState calculates an incorrect width. The fix is to take the CTM into consideration, which has been adjusted accordingly to compensate for the TD's reset of the text matrix.

CGFloat width = [state textMatrix].tx - [initialState textMatrix].tx + ([state ctm].tx - [initialState ctm].tx) / [state ctm].a;    

Support Type 1 Font

There is no support for Type 1 fonts. For example, a font object may have this structure:
F1
BaseFont - Times-Roman
Subtype - Type 1
Type - Font

and that is all. So we have no information about widths and so on.

I think it would be good to override methods in Type 0 Font to get information from the descendant fonts. It could be something like this:


#pragma mark - Overridden Font methods

- (FontDescriptor*)fontDescriptor {
    for (Font *font in self.descendantFonts) {
        FontDescriptor *descriptor = [font fontDescriptor];
        if (descriptor) {
            return descriptor;
        }
    }
    
    return nil;
}


- (CGFloat)minY {
    for (Font *font in self.descendantFonts) {
        CGFloat minY = [font minY];
        if (minY > 0) {
            return minY;
        }
    }
    
    return 0;
}


- (CGFloat)maxY {
    for (Font *font in self.descendantFonts) {
        CGFloat maxY = [font maxY];
        if (maxY > 0) {
            return maxY;
        }
    }
    
    return 0;
}


- (CMap*)toUnicode {
    for (Font *font in self.descendantFonts) {
        CMap *cmap = [font toUnicode];
        if (cmap) {
            return cmap;
        }
    }
    
    return nil;
}

Could I help you write the library?

Print The Document

I want to print the whole document (all pages) from the page view.
How can this be done?

Problem with determining special characters width

I've got a problem with some special characters, e.g. ™ or even ". The Type 1 font Widths array doesn't have a value for them. Their width is considered to be 0 in Scanner's didScanCharacter. As a result the text matrix is not properly transformed and selections found in the line after the character have wrong transformation and are not properly highlighted.

I noticed that the characters are not properly processed by CGPDFStringCopyTextString or CGPDFStringGetBytePtr either.
I tried to read the font program stored in the font descriptor under FontFile3, but I don't know how to process the stream.

Do you have any idea where in the PDF I can retrieve the information about the corresponding glyphs?
I'd really appreciate any help!

How to highlight in right positions when pdf page is resized and centered?

Hi, I do not draw the pdf page as-is, but I resize and center it so that it fits the screen. How can I re-calculate the position to display the highlights (the yellow areas)?

Basically, if I have an outer frame and an inner frame that lies on it, then when I change the outer frame I can calculate the inner frame to fit in the new outer frame. But your selection.frame does not seem to be something I can use for that calculation (selection.frame depends on the graphics context), judging from this code:
for (Selection *s in self.selections)
{
	CGContextSaveGState(ctx);
	CGContextConcatCTM(ctx, s.transform);
	CGContextFillRect(ctx, s.frame);
	CGContextRestoreGState(ctx);
}
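The resize-and-center step can be expressed as one uniform scale plus an offset, applied to any rectangle given in page coordinates. The C sketch below shows only that arithmetic; `Rect`, `FitTransform`, `aspect_fit` and `fit_rect` are hypothetical names. It assumes the selection rectangle has already been mapped into page coordinates (for example by applying the selection's transform to its frame) and ignores any flipping of the y axis.

```c
typedef struct { double x, y, width, height; } Rect;

/* Uniform scale and offset that map page coordinates into the view. */
typedef struct { double scale, offset_x, offset_y; } FitTransform;

/* Computes the transform that scales `page` to fit inside `view`,
   preserving the aspect ratio and centering the result. */
static FitTransform aspect_fit(Rect page, Rect view) {
    double sx = view.width / page.width;
    double sy = view.height / page.height;
    FitTransform t;
    t.scale = sx < sy ? sx : sy;
    t.offset_x = view.x + (view.width - page.width * t.scale) / 2.0
                 - page.x * t.scale;
    t.offset_y = view.y + (view.height - page.height * t.scale) / 2.0
                 - page.y * t.scale;
    return t;
}

/* Maps a rectangle from page coordinates into view coordinates. */
static Rect fit_rect(FitTransform t, Rect r) {
    Rect out;
    out.x = r.x * t.scale + t.offset_x;
    out.y = r.y * t.scale + t.offset_y;
    out.width = r.width * t.scale;
    out.height = r.height * t.scale;
    return out;
}
```

For a 600 x 800 page in a 300 x 300 view, the scale is 0.375 and the page is centered horizontally with a 37.5 point margin on each side. In a Core Graphics drawing routine, the same scale and offset could instead be concatenated onto the context before drawing both the page and the selections, so that no per-rectangle arithmetic is needed.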

ARC support?

Hiyas

I don't think this can be considered an issue, but is it planned to support ARC in the future? The problem is, that even in a project where ARC is enabled and PdfKitten is excluded, a project still doesn't compile because of things like NSMutableString** rawTextContent (Scanner.h) for example. Every ARC enabled project gets freaky when it should compile this class (even if PdfKitten is excluded).

T* callback implementation

I think the T* operator callback TStar implementation might be incorrect.

Selection items have totally wrong translation matrix for occurrences found after the T* operator. I logged text matrix in every operator callback. I noticed that the ty value was decreasing all the time until T* operator was met. After that it started increasing and could go out of the page frame in the end.

The TStar callback calls newLine, which has the following line in it: [self newLineWithLineHeight: self.leadning save:NO];

As soon as I passed a negative value ( -self.leadning ) into newLineWithLineHeight it started to work just fine.
What do you think about it? Is my fix correct, or did I miss anything?
Thanks
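For reference, the PDF specification defines T* as having the same effect as `0 -Tl Td`, where Tl is the current leading stored as a positive number, so moving to the next line decreases the y coordinate. A stand-alone C sketch of that definition follows; the types and function names are hypothetical and this is not the project's Scanner code.

```c
/* 3x3 affine matrix [a b 0; c d 0; tx ty 1]. */
typedef struct { double a, b, c, d, tx, ty; } Matrix;

typedef struct {
    Matrix line_matrix;  /* text line matrix */
    double leading;      /* Tl, stored as a positive number */
} TextState;

/* Td: premultiplies the line matrix by a translation of (tx, ty). */
static void text_move(TextState *s, double tx, double ty) {
    Matrix *m = &s->line_matrix;
    m->tx += tx * m->a + ty * m->c;
    m->ty += tx * m->b + ty * m->d;
}

/* T*: same effect as "0 -Tl Td", so the leading is negated here. */
static void text_next_line(TextState *s) {
    text_move(s, 0.0, -s->leading);
}
```

Starting at ty = 700 with a leading of 14, two calls to `text_next_line` give 686 and then 672, which agrees with the observation that ty should keep decreasing down the page.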

Font Issue

After implementing the PDFKitten scanner in my project I am able to search 90% of my pdf's fine, however documents containing text with TrueType fonts Cambria or Calibri are not correctly decoded.

What these fonts have in common is that they are TrueType (CID) fonts with the encoding Identity-H.
Cambria Bold, however, is ANSI encoded and works just fine...

Can someone please give me a hint on how best to troubleshoot this issue?
