Notice: Our use of Shlorp to generate multi-author, data-driven reports has been superseded by Ckan-o-Sweave, a template Sweave project that integrates content (figures, captions, text) from a CKAN data catalogue as well as from the project itself.
Shlorp is a web scraper that pipelines wget, BeautifulSoup, Pandoc, and LaTeX to generate PDFs from Confluence wiki pages. It was designed against the very specific structure of reports we author collaboratively on an Atlassian Confluence wiki, and will require some customisation to scrape and lay out other targets. Metadata that ends up in the PDF (title and so on) is located under very specific headings, and page content comes from a variety of sources; even MS Word text copy/pasted in Windows encodings may be found.
If you find yourself stuck in an annual nightmare of collaborative editing, emailed MS Word versionitis, and manual layouting, Shlorp might be for you - but we offer no warranties, use at your own risk! Modifying Shlorp to scrape different content requires some knowledge of Python, HTML/DOM, and LaTeX, but the code should be readable and documented well enough to be adaptable without reinventing the wheel yet again.
Shlorp uses Fabric as its make tool:
sudo apt-get install fabric
fab setup
# modify shlorp.settings to your needs
fab run
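The settings to modify look roughly like this; the variable names below are assumptions based on the description elsewhere in this README, and the authoritative reference is shlorp/settings.template:

```python
# Hypothetical shlorp/settings.py fragment -- variable names are
# illustrative, not necessarily those used in shlorp/settings.template.
CONFLUENCE_URLS = [
    "https://confluence.example.com/display/REPORTS/Chapter+One",
    "https://confluence.example.com/display/REPORTS/Chapter+Two",
]
CONFLUENCE_USERNAME = "report-bot"
CONFLUENCE_PASSWORD = "change-me"  # never commit real credentials
```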
Run fab -l to see the other available make targets.
Run fab makedoc to generate the developer docs at docs/build/html/index.html.
- An Ubuntu (or similar) environment is available.
- Optionally, a virtualenv is created with virtualenvwrapper and activated.
- Fabric is installed.
- A set of URLs, a username and a password are given in shlorp/settings.py (copied from shlorp/settings.template).
- HTML pages are downloaded from the given URLs (login uses the configured username and password)
- Attached files are sanitized and re-linked in the HTML files
- HTML is parsed and transformed according to specific rules
The Confluence web pages are assumed to follow a loosely defined structure. The script removes some elements, uses others to set variables that feed the Pandoc LaTeX template, and transforms the rest into clean HTML, which in turn arrives more or less correctly in the Pandoc-generated LaTeX.
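The remove/extract/clean step can be sketched with BeautifulSoup as below; the CSS class, heading, and variable names are made-up examples of the kind of rule involved, not Shlorp's actual rules:

```python
# Hypothetical sketch of the parse-and-transform step. The selector
# ".page-metadata", the h1-as-title rule, and the variable names are
# assumptions for illustration only.
from bs4 import BeautifulSoup

HTML = """
<div id="main-content">
  <div class="page-metadata">edited by someone</div>
  <h1>Report Title</h1>
  <h2>Summary</h2>
  <p>Body text.</p>
</div>
"""

def transform(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that should not reach the PDF (assumed selector).
    for junk in soup.select(".page-metadata"):
        junk.decompose()
    # Use specific headings to set template variables, then drop them.
    variables = {}
    h1 = soup.find("h1")
    if h1 is not None:
        variables["title"] = h1.get_text(strip=True)
        h1.decompose()
    # What remains is the cleaned HTML handed on to Pandoc.
    return variables, str(soup)

variables, clean_html = transform(HTML)
```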
- The modified HTML is translated into LaTeX via a custom LaTeX template, fed with the variables extracted from each HTML file.
- The LaTeX file is further modified and re-saved, pending
  - compilation by XeLaTeX and
  - compression through Ghostscript.
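The final compile-and-compress steps amount to two external commands; this sketch only builds the command lines, and the file names and Ghostscript quality setting are illustrative assumptions, not Shlorp's exact invocation:

```python
import subprocess  # needed only if you uncomment the run() calls below

def build_commands(tex_path, pdf_path, compressed_path):
    """Return the XeLaTeX and Ghostscript command lines (illustrative)."""
    xelatex = ["xelatex", "-interaction=nonstopmode", tex_path]
    # Ghostscript re-writes the PDF at a lower quality setting to shrink it.
    ghostscript = [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook", "-dNOPAUSE", "-dBATCH",
        "-sOutputFile=" + compressed_path, pdf_path,
    ]
    return xelatex, ghostscript

xelatex_cmd, gs_cmd = build_commands("report.tex", "report.pdf", "report-small.pdf")
# To actually run them (requires xelatex and gs on the PATH):
# subprocess.run(xelatex_cmd, check=True)
# subprocess.run(gs_cmd, check=True)
```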
- Confluence auth
- HTML download
- HTML parsing using HTMLtidy, BeautifulSoup
- HTML to LaTeX conversion using Pandoc and wrapper pypandoc
- regex replacements
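The regex-replacement component can be pictured as a small rule table applied to the generated LaTeX; the two rules below are made-up examples of the kind of clean-up involved, not Shlorp's actual patterns:

```python
import re

# Assumed examples of post-Pandoc LaTeX clean-up rules.
RULES = [
    (re.compile(r"\s+\\ref\{"), r"~\\ref{"),       # tie \ref to the preceding word
    (re.compile(r"(?m)^\s*\\par\s*$\n?"), ""),     # drop stray empty paragraphs
]

def postprocess(latex):
    for pattern, replacement in RULES:
        latex = pattern.sub(replacement, latex)
    return latex

fixed = postprocess("See Figure \\ref{fig:one}.\n\\par\nDone.")
```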
Copyright (C) 2012-2013 Department of Parks and Wildlife
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.