
This project forked from pymupdf/pymupdf


Python bindings for MuPDF's rendering library.

License: GNU Affero General Public License v3.0


PyMuPDF 1.18.19


Release date: September 16, 2021

On PyPI since August 2016.

Author

Jorj X. McKie, based on original code by Ruikai Liu.

Introduction

PyMuPDF (current version 1.18.19) is a Python binding for MuPDF (current version 1.18.*), a lightweight PDF, XPS, and e-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.

With PyMuPDF you can access files with extensions like ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.

In partnership with Artifex, PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases that comply with the open-source AGPL license. Please see the "License and Copyright" section below for additional information.

Usage

For all supported document types (i.e. including images) you can

  • decrypt the document
  • access meta information, links and bookmarks
  • render pages in raster formats (PNG and some others), or the vector format SVG
  • search for text
  • extract text and images
  • convert to other formats: PDF, (X)HTML, XML, JSON, text
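
As a rough sketch of several items in the list above, the following Python function opens a document, authenticates if a password is needed, collects metadata, search hits and plain text for the first page, and renders that page to a PNG file. The function name `summarize_document` and its parameters are hypothetical and chosen only for illustration. The snake_case method names assume a PyMuPDF release that provides them; older releases used camelCase names such as getText.

```python
def summarize_document(path, needle, png_path, password=None):
    """Collect basic facts about a document and render its first page to PNG.

    Hypothetical helper for illustration; not part of PyMuPDF itself.
    """
    # PyMuPDF is imported inside the function so that this sketch can be
    # loaded even on a machine where the package is not installed.
    import fitz

    doc = fitz.open(path)  # works for PDF, XPS, EPUB, images, ...
    try:
        if doc.needs_pass:
            # authenticate() returns 0 when the password is rejected
            if not password or not doc.authenticate(password):
                raise ValueError("document is encrypted and the password was not accepted")

        page = doc[0]                      # first page
        hits = page.search_for(needle)     # list of rectangles around each match
        text = page.get_text("text")       # plain text of the page
        page.get_pixmap().save(png_path)   # raster image of the page

        return {
            "metadata": doc.metadata,      # dict with title, author, format, ...
            "page_count": doc.page_count,
            "hits": len(hits),
            "text": text,
        }
    finally:
        doc.close()
```

A call such as `summarize_document("input.pdf", "invoice", "page0.png")` would return a dictionary and leave the rendered page in `page0.png`; both file names are placeholders.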

To some degree, PyMuPDF can therefore be used as an image converter: it can read a range of input formats and can produce Portable Network Graphics (PNG), Portable Anymaps (PNM, etc.), Portable Arbitrary Maps (PAM), Adobe PostScript and Adobe Photoshop documents, making the use of other graphics packages obsolete in these cases. But interfacing with e.g. PIL/Pillow for image input and output is easy as well.
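
The image-converter use case can be sketched in a few lines. The helper name `convert_image` is hypothetical; the sketch assumes that the output format is chosen from the file extension of the target path, and that the installed PyMuPDF release offers `Pixmap.save` (earlier releases used a differently named method).

```python
def convert_image(src_path, dst_path):
    """Read an image file and write it again in the format implied by dst_path.

    Hypothetical helper for illustration; returns (width, height) in pixels.
    """
    # Imported lazily so the sketch loads even without PyMuPDF installed.
    import fitz

    pix = fitz.Pixmap(src_path)  # load the image file into a pixmap
    pix.save(dst_path)           # e.g. "out.png", "out.pnm", "out.psd"
    return pix.width, pix.height
```

For example, `convert_image("photo.jpg", "photo.png")` would write a PNG copy of a JPEG file; the file names are placeholders.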

For PDF documents, there exists a plethora of additional features: they can be created, joined or split up. Pages can be inserted, deleted, re-arranged or modified in many ways (including annotations and form fields).

  • Images and fonts can be extracted or inserted.

    You may want to have a look at this cool GUI example script, which lets you insert, delete, replace or re-position images under your visual control.

    Since v1.18.8 there is a Document method subset_fonts(), which automatically builds subsets based on the usage of all eligible fonts in the document. Especially for new documents, this can lead to significant file size reductions. The method was developed in cooperation with our user @cuteufo - again thanks a lot for the contribution.

  • Embedded files are fully supported.

  • PDFs can be reformatted to support double-sided printing, posterizing, applying logos or watermarks.

  • Password protection is fully supported: decryption, encryption, encryption method selection, permission level and user / owner password setting.

  • Support of the PDF Optional Content concept for images, text and drawings.

  • Low-level PDF structures can be accessed and modified.

  • Command line module "python -m fitz ...". A versatile utility with the following features:

    • encryption / decryption / optimization
    • creation of sub-documents
    • document joining
    • image / font extraction
    • full support of embedded files
    • layout-preserving text extraction (all documents)

Have a look at the basic demos, the examples (which contain complete, working programs), and the recipes section of our Wiki sidebar, which contains more than a dozen how-to guides.

New: Layout preserving text extraction!

Via its subcommand "gettext", the script fitzcli.py offers text extraction in different formats. Of special interest is layout preservation, which produces text as close to the original physical layout as possible: it flows around areas that contain images and reproduces the text of tables and multi-column pages.

See here for more information on layout preserving text extraction.
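
The subcommand can also be driven from Python by launching the command line module in a subprocess. The sketch below only assembles and runs the command; the helper names are hypothetical, and the option spellings `-mode` and `-output` are assumptions that should be checked against the output of `python -m fitz gettext -h` for the installed version.

```python
import subprocess
import sys


def gettext_command(input_path, mode="layout", output_path=None):
    """Build the argument list for the "gettext" subcommand.

    Hypothetical helper; the option names are assumptions, see the note above.
    """
    cmd = [sys.executable, "-m", "fitz", "gettext", "-mode", mode]
    if output_path is not None:
        cmd += ["-output", output_path]
    cmd.append(input_path)  # the input file comes last
    return cmd


def run_gettext(input_path, mode="layout", output_path=None):
    """Run the subcommand and raise CalledProcessError if it fails."""
    return subprocess.run(gettext_command(input_path, mode, output_path), check=True)
```

Using `sys.executable` keeps the call inside the same Python environment in which PyMuPDF is installed.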

Documentation

Our documentation, written using Sphinx, is available in various formats from the following sources. It currently is a combination of reference guide and user manual. For a quick start look at the tutorial and the recipes chapters.

  • You can view it online at Read the Docs. This site also provides download options for PDF.
  • The search function on Read the Docs does not work for me currently. If you want a working, searchable local version, please download a zipped HTML version from here.
  • Find a Windows help file here.

The latest changelog can be viewed here.

Installation

PyMuPDF requires Python 3.6 or later.

Python wheels exist for Windows (32-bit and 64-bit), Linux (64-bit, Intel and ARM) and macOS (64-bit, Intel only), so it can be installed from PyPI in the usual way:

python -m pip install --upgrade pip
python -m pip install --upgrade pymupdf

There are no mandatory external dependencies. However, a few optional methods become available if additional packages are installed:

  • Pillow for using pillow image output directly from PyMuPDF.
  • fontTools for creating font subsets on PDF output.
  • pymupdf-fonts to extend your text output options with some nice fonts.

Older wheels - also with support for older Python versions - can be found here and on PyPI.

Starting with v1.18.15, to minimize network traffic we no longer redundantly store wheels in this repository's releases folder. You can find older versions back to v1.9.2 on PyPI. Sources for every release continue to be stored here.

Other platforms require installation from sources; follow the instructions in the documentation.

Note: If you try installing from PyPI for a platform with no available wheel, pip will automatically start a source installation process - which will fail if it finds no MuPDF installation.

Folder installation contains platform-specific source installation scripts contributed by users. You may also find the following Wiki pages useful:

License and Copyright

In order to comply with MuPDF’s dual licensing model, PyMuPDF has entered into an agreement with Artifex, which has the right to sublicense PyMuPDF to third parties.

PyMuPDF and MuPDF are now available under both open-source AGPL and commercial license agreements.

Please read the full text of the AGPL license agreement (which is also included here in file COPYING) to ensure that your use case complies with the guidelines of this license. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

Artifex is the exclusive commercial licensing agent for MuPDF.

Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software Inc. © 2021 Artifex Software, Inc. All rights reserved.

Contact

Please use the Discussions menu for questions, comments, or asking for help, and submit issues here. If you wish, you can also contact me directly via [email protected].

