ludbek / webpreview

Extracts OpenGraph, TwitterCard and Schema properties from a webpage.

License: Other

Python 87.65% HTML 11.85% Dockerfile 0.50%

web-preview open-graph twitter-cards schema

webpreview's Introduction

webpreview

For a given URL, webpreview extracts its title, description, and image url using Open Graph, Twitter Card, or Schema meta tags, or, as an alternative, parses it as a generic webpage.

Installation

pip install webpreview

Usage

Use the generic webpreview method (added in v1.7.0) to parse a page regardless of its type. This method fetches a page and tries to extract a title, a description, and a preview image from it.

It first attempts to parse the values from Open Graph properties, then it falls back to Twitter Card format, and then to Schema. If none of these methods succeed in extracting all three properties, then the web page's content is parsed using a generic HTML parser.

>>> from webpreview import webpreview

>>> p = webpreview("https://en.wikipedia.org/wiki/Enrico_Fermi")
>>> p.title
'Enrico Fermi - Wikipedia'
>>> p.description
'Italian-American physicist (1901–1954)'
>>> p.image
'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg'

# Access the parsed fields both as attributes and items
>>> p["url"] == p.url
True

# Check if all three of the title, description, and image are in the parsing result
>>> p.is_complete()
True

# Provide page content from somewhere else
>>> content = """
<html>
    <head>
        <title>The Dormouse's story</title>
        <meta property="og:description" content="A Mad Tea-Party story" />
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    </body>
</html>
"""

# Unlike the example above, this invocation makes no external calls
# and relies only on the supplied content
>>> webpreview("aa.com", content=content)
WebPreview(url="http://aa.com", title="The Dormouse's story", description="A Mad Tea-Party story")
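
The fallback order described earlier (Open Graph, then Twitter Card, then Schema, then generic HTML) can be illustrated with a small standard-library sketch. This is not webpreview's implementation: the names MetaCollector and sketch_preview are hypothetical, the fallback is applied per field for simplicity, and the generic step only looks at <title> and <meta name="description">.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> values keyed by (attribute, key) and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("content"):
            for kind in ("property", "name", "itemprop"):
                if attrs.get(kind):
                    # Keep the first value seen for each key.
                    self.meta.setdefault((kind, attrs[kind]), attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def sketch_preview(html):
    """Return title/description/image, trying each source in fallback order."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()

    # Generic fallback: a fuller parser would also inspect <img> tags.
    generic = {
        "title": "".join(collector.title_parts).strip() or None,
        "description": collector.meta.get(("name", "description")),
        "image": None,
    }
    # Schema markup names the title "name".
    schema_names = {"title": "name", "description": "description", "image": "image"}

    result = {}
    for field in ("title", "description", "image"):
        candidates = (
            collector.meta.get(("property", f"og:{field}")),        # Open Graph
            collector.meta.get(("name", f"twitter:{field}")),       # Twitter Card
            collector.meta.get(("itemprop", schema_names[field])),  # Schema
            generic[field],                                         # generic HTML
        )
        result[field] = next((value for value in candidates if value), None)
    return result
```

Fed the Dormouse content above, this sketch takes the title from the <title> tag and the description from og:description, and leaves the image empty.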

Using the command line

Installing webpreview via pip also installs the accompanying command-line tool.

$ webpreview https://en.wikipedia.org/wiki/Enrico_Fermi
title: Enrico Fermi - Wikipedia
description: Italian-American physicist (1901–1954)
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

$ webpreview https://github.com/ --absolute-url
title: GitHub: Where the world builds software
description: GitHub is where over 83 million developers shape the future of software, together.
image: https://github.githubassets.com/images/modules/site/social-cards/github-social.png

Using compatibility API

Before v1.7.0, the package exposed a different set of API methods. All of them are still supported and may continue to be used.

# WARNING:
# The API below is left for BACKWARD COMPATIBILITY ONLY.

from webpreview import web_preview
title, description, image = web_preview("aurl.com")

# specifying timeout which gets passed to requests.get()
title, description, image = web_preview("a_slow_url.com", timeout=1000)

# passing headers
headers = {'User-Agent': 'Mozilla/5.0'}
title, description, image = web_preview("a_slow_url.com", headers=headers)

# pass html content thus avoiding making http call again to fetch content.
content = """<html><head><title>Dummy HTML</title></head></html>"""
title, description, image = web_preview("aurl.com", content=content)

# specifying the parser
# by default webpreview uses 'html.parser'
title, description, image = web_preview("aurl.com", content=content, parser='lxml')

Run with Docker

The Docker image can be built and run similarly to the command-line tool. The default entry point is the webpreview command-line function.

$ docker build -t webpreview .
$ docker run -it --rm webpreview "https://en.m.wikipedia.org/wiki/Enrico_Fermi"
title: Enrico Fermi - Wikipedia
description: Enrico Fermi (Italian: [enˈriːko ˈfermi]; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the "architect of the nuclear age"[1] and the "architect of the atomic bomb".
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

Note: the built Docker image is around 210 MB.

Testing

# Execute the tests
poetry run pytest webpreview

# OR execute until the first failed test
poetry run pytest webpreview -x

Setting up development environment

# Install the minimum supported version of Python
pyenv install 3.7.13

# Create a virtual environment
# By default, the project already contains a .python-version file that points
# to 3.7.13.
python -m venv .venv

# Install dependencies
# Poetry will automatically install them into the local .venv
poetry install

# If you get an error like this:
ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.

# Then do this:
.venv/bin/pip install --upgrade setuptools

webpreview's People

Contributors

alvarohurtado84 escaped geekbeard junhonomad juanfont marten-cz lucabezerra launchlabau illing2005 hadalin peter-bartoszuk lmegviar ameetmk taghash wisnercelucus algoo vduseev

webpreview's Issues

Tweet title and image are returning None

Hi, the tweet's preview image and title are not working.
I am using the following piece of code:
title, description, image = web_preview("https://twitter.com/realDonaldTrump/status/1290011569657610240?s=20", parser="html.parser")
print(title, description, image)

Looking forward to hearing from you about this issue.
Osama

web_preview doesn't accept headers

Thanks for building this awesome pkg. The README claims that headers can be passed to the web_preview function, but that doesn't work.

Update: I just found out that the headers feature is supported in 1.3.1, but only for Python 3. Is there any plan to support this in Python 2.7?

Document headers option.

Fix warnings about socket not being closed.

Replace strict versions in requirements.txt with acceptable version ranges

Pinning exact versions is bad practice: a project that uses webpreview will always get, for example, requests==2.10.0, even when that package has received improvements or security updates.
Also, if the project has another dependency that requires a newer version of the same library, the two cannot be used together.
Instead, requirements.txt should give a range from the minimum required version to a maximum (if a newer version of the dependency is incompatible), or just a minimum version (if no incompatibilities with newer versions are known).
Related discussion:
pypa/setuptools#894

python packaging issue

The latest version of webpreview has been named webpreview-1.0.3dev-r0.tar.gz. The dev-r0 part makes it impossible to upload to PyPI, as it is considered an immature package.

Does anyone know how to prevent sdist from appending dev-r0 to the package name?

duplicate requests sometimes not necessary

Hi,

Thanks for your work as it is very useful. Why do you make a second request if the first one works?

try:
    res = requests.get(url, timeout=timeout, headers=headers)
except (ConnectionError, HTTPError, Timeout, TooManyRedirects):
    raise URLUnreachable("The URL does not exist.")
except MissingSchema: # if no schema add http as default
    url = "http://" + url

# throw URLUnreachable exception for just incase
try:
    res = requests.get(url, timeout=timeout, headers=headers)
except (ConnectionError, HTTPError, Timeout, TooManyRedirects):
    raise URLUnreachable("The URL is unreachable.")

Also, you can reduce the failure rate of the first block by checking for the schema (with a regex) before any request is made, which would let you merge the two blocks into one.
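
A minimal sketch of that suggestion, under stated assumptions: ensure_scheme and fetch_once are hypothetical names, not part of webpreview, and `get` stands in for any callable with the shape of requests.get so the sketch runs without network access.

```python
import re

# Matches a leading URL scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(url, default="http"):
    """Prepend a default scheme when the URL has none."""
    if _SCHEME_RE.match(url):
        return url
    return f"{default}://{url}"


def fetch_once(url, get, **kwargs):
    """Normalize the URL first, then issue exactly one request.

    `get` is any callable like requests.get; error handling would wrap
    this single call instead of two separate try blocks.
    """
    return get(ensure_scheme(url), **kwargs)
```

With the scheme fixed up front, the MissingSchema branch is never needed, so a successful first request is not followed by a second one.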

How about Structuring the single file into different contextual files

Relative Image Path

Greetings!

First of all, thanks for the good work! :)

I'm having some trouble with extracting the data from a few URLs, namely the ones that have relative paths in the returned image value. For instance, when using https://understand.ai as the URL, I get 'images/banner.jpg' in the image return.

It would be nice if this lib could return the image URL along with the domain, or if the web_preview() method had an optional parameter letting the user choose between always getting the absolute URL and getting the URL exactly as it's set in the original website, like:

title, description, image = web_preview("aurl.com", absolute_url=True)
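
Until such an option is available, a caller can resolve the relative path against the page URL. The following is a sketch using only the standard library; to_absolute is a hypothetical helper, not part of webpreview.

```python
from urllib.parse import urljoin


def to_absolute(page_url, image):
    """Resolve a possibly relative image path against the page URL."""
    if not image:
        # Nothing was extracted, so there is nothing to resolve.
        return image
    # urljoin leaves already absolute URLs untouched.
    return urljoin(page_url, image)


print(to_absolute("https://understand.ai", "images/banner.jpg"))
# prints https://understand.ai/images/banner.jpg
```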

Hope I made myself clear :)

Cheers!

Update doc to include headers parameter.

Setting timeout to 1000, or to 5, still loads up as URL Unreachable.

Not sure what the link was, I am fairly sure it had a few test URLs, like Amazon and Google, but one might have died.

I use this in a Django application and would just like it to return None for title, description, and image if nothing comes back within a certain time frame.
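
One way to get that behaviour on the caller's side is a thin wrapper that turns any failure into three None values. This is a sketch under assumptions: preview_or_none is a hypothetical name, `preview_func` stands in for web_preview, and catching every Exception is a simplification (a narrower exception class would be preferable where known). Note that the timeout passed through to requests.get() is measured in seconds, so 1000 is a very long wait.

```python
def preview_or_none(preview_func, url, **kwargs):
    """Call a preview function; return (None, None, None) if it fails.

    `preview_func` is any callable with the shape of web_preview,
    returning a (title, description, image) tuple.
    """
    try:
        return preview_func(url, **kwargs)
    except Exception:
        # Unreachable URL, timeout, parse error: report "no preview".
        return None, None, None
```

A call such as preview_or_none(web_preview, "a_slow_url.com", timeout=5) would then unpack into title, description, image without raising.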

BeautifulSoup prints a GuessedAtParserWarning

Running webpreview in its default configuration yields this warning:

webpreview/previews.py:51: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 51 of the file /home/nelson/src/linkblog/pinboard-to-static/venv/lib/python3.9/site-packages/webpreview/previews.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

Presumably a change in BeautifulSoup since the last webpreview release. It works anyway, just annoying. Adding the suggested features argument does make the warning go away but raises the question of whether other parsers should be configurable.

Asyncio Version?

Thank you for all of this great code. It works great.
Now I just need to figure out how to make this work in an asyncio environment.
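
A common workaround for blocking libraries is to run the call in a worker thread so the event loop stays free. The sketch below assumes nothing about webpreview beyond it being a regular blocking function; preview_async is a hypothetical name and `preview_func` stands in for web_preview.

```python
import asyncio
import functools


async def preview_async(preview_func, url, **kwargs):
    """Run a blocking preview function in the default thread pool."""
    loop = asyncio.get_running_loop()
    call = functools.partial(preview_func, url, **kwargs)
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, call)
```

Several URLs can then be fetched concurrently with asyncio.gather(preview_async(web_preview, url_a), preview_async(web_preview, url_b)). The requests themselves still block their worker threads; only the event loop is kept responsive.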

Command-line tool not present while installing through pip

Hi, this is a cool project! But I have issues with the command-line tool.

System: MacOS arm64, python version 3.8.11

I did pip install webpreview but there is no command webpreview.

I even tried python -m webpreview https://example.com and it says: 'webpreview' is a package and cannot be directly executed.
Did I miss anything?