This project forked from weblyzard/inscriptis

A Python-based HTML to text conversion library, command line client and Web service.

License: Apache License 2.0

inscriptis -- HTML to text conversion library, command line client and Web service

A Python-based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS. Please take a look at the Rendering document for a demonstration of inscriptis' conversion quality.

A Java port of inscriptis is available here.

Documentation

The full documentation is built automatically and published on Read the Docs.

Table of Contents

  1. Installation
  2. Python library
  3. Standalone command line client
  4. Web service
  5. Fine tuning
  6. Changelog

Installation

At the command line:

$ pip install inscriptis

Or, if you don't have pip installed:

$ easy_install inscriptis

If you want to install from the latest sources, you can do:

$ git clone https://github.com/weblyzard/inscriptis.git
$ cd inscriptis
$ python setup.py install
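
To confirm that the installation succeeded, the installed version can be queried from Python. The following is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name `installed_version` is an assumption made for illustration and is not part of inscriptis.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing.

    Hypothetical helper for illustration; not part of inscriptis.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("inscriptis")
print(found if found else "inscriptis is not installed")
```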

Python library

Embedding inscriptis into your code is easy, as outlined below:

import urllib.request
from inscriptis import get_text

url = "https://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)
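
The example above assumes that the page is encoded in UTF-8. For pages that announce a different charset, a slightly more defensive download step could look like the sketch below. It uses only the standard library; `fetch_html` and its `fallback_encoding` parameter are hypothetical names introduced here, not part of inscriptis.

```python
import urllib.request


def fetch_html(url, fallback_encoding="utf-8"):
    """Download `url` and decode it with the charset announced in Content-Type.

    Hypothetical helper for illustration; not part of inscriptis.
    """
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None if the header names no charset.
        charset = response.headers.get_content_charset() or fallback_encoding
        # Replace undecodable bytes instead of raising UnicodeDecodeError.
        return response.read().decode(charset, errors="replace")
```

The returned string can be passed to get_text() in the same way as the html variable above.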

Standalone command line client

The command line client converts HTML files or text retrieved from Web pages to the corresponding text representation.

Command line parameters

The inscript.py command line client supports the following parameters:

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a]
                   [--indentation INDENTATION] [-v]
                   [input]

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url
                        (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for reading and writing files
                        (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  --indentation INDENTATION
                        How to handle indentation (extended or strict;
                        default: extended).
  -v, --version         display version information
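
To make the option semantics concrete, the sketch below declares an equivalent option set with Python's argparse module. It is an illustration reconstructed from the usage text above, not the actual inscript.py source; the function name `build_parser` and the shortened help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="inscript.py",
        description="Converts HTML from file or url to a clean text version",
    )
    # Optional positional argument: when omitted, input is read from stdin.
    parser.add_argument("input", nargs="?", default=None,
                        help="HTML input from a file or a URL (default: stdin)")
    parser.add_argument("-o", "--output", default=None,
                        help="output file (default: stdout)")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="encoding for reading and writing files")
    # Boolean switches are off unless the flag is given.
    parser.add_argument("-i", "--display-image-captions", action="store_true")
    parser.add_argument("-d", "--deduplicate-image-captions", action="store_true")
    parser.add_argument("-l", "--display-link-targets", action="store_true")
    parser.add_argument("-a", "--display-anchor-urls", action="store_true")
    parser.add_argument("--indentation", default="extended",
                        choices=["extended", "strict"])
    parser.add_argument("-v", "--version", action="store_true",
                        help="display version information")
    return parser


args = build_parser().parse_args(["fhgr.html", "-o", "fhgr.txt", "-l"])
print(args.input, args.output, args.display_link_targets, args.indentation)
```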

Examples

Convert the given page to text and output the result to the screen:

$ inscript.py https://www.fhgr.ch

Convert the file to text and save the output to fhgr.txt:

$ inscript.py fhgr.html -o fhgr.txt

Convert HTML provided via stdin and save the output to output.txt:

$ echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt

Web service

The Flask Web Service translates HTML pages to the corresponding plain text.

Additional Requirements

  • python3-flask

Startup

Start the inscriptis Web service with the following command:

$ export FLASK_APP="web-service.py"
$ python3 -m flask run

Usage

The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified in the Content-Type header (UTF-8 in the example below):

$ curl -X POST -H "Content-Type: text/html; encoding=UTF8" --data-binary @test.html http://localhost:5000/get_text

The service also supports a version call:

$ curl http://localhost:5000/version
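
The same calls can be made from Python. The sketch below mirrors the curl commands using only the standard library. The function names `build_get_text_request` and `convert` are hypothetical, and decoding the response as UTF-8 is an assumption about the service's output.

```python
import urllib.request


def build_get_text_request(html_bytes, base_url="http://localhost:5000",
                           encoding="UTF8"):
    """Build a POST request for the /get_text endpoint (illustrative helper)."""
    return urllib.request.Request(
        url=base_url + "/get_text",
        data=html_bytes,
        # Same header as in the curl example above.
        headers={"Content-Type": "text/html; encoding=" + encoding},
        method="POST",
    )


def convert(html_bytes, base_url="http://localhost:5000"):
    """Send HTML to a running service and return the converted text."""
    request = build_get_text_request(html_bytes, base_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

With the service running locally, convert(open("test.html", "rb").read()) would return the converted text as a string.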

Fine tuning

The following options are available for fine tuning inscriptis' HTML rendering:

  1. More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also indent tags such as <div> and <span>, which do not provide indentation in their standard definition. This strategy is the default in inscript.py and in many other tools such as lynx. If you do not want extended indentation, use the parameter indentation='strict' instead.

  2. Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:

    from lxml.html import fromstring
    from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
    from inscriptis.html_engine import Inscriptis
    from inscriptis.html_properties import Display
    from inscriptis.model.config import ParserConfig

    # `html` is the HTML document as a string, e.g. as retrieved in the Python library example
    # create a custom CSS based on the strict profile and change the rendering of `div` and `span` elements
    css = CSS_PROFILES['strict'].copy()
    css['div'] = HtmlElement('div', display=Display.block, padding=2)
    css['span'] = HtmlElement('span', prefix=' ', suffix=' ')

    html_tree = fromstring(html)
    # create a parser using the custom CSS
    config = ParserConfig(css=css)
    parser = Inscriptis(html_tree, config)
    text = parser.get_text()

Changelog

A full list of changes can be found in the release notes.
