Notice: Our use of Shlorp to generate multi-author, data-driven reports has been superseded by Ckan-o-Sweave, a template Sweave project that integrates content (figures, captions, text) from a CKAN data catalogue as well as from the project itself.
Shlorp is a web scraper that pipelines wget, BeautifulSoup, Pandoc, and LaTeX to generate PDFs from Confluence wiki pages. It was designed against the very specific structure of reports we author collaboratively on an Atlassian Confluence wiki, and will require some customisation to scrape and lay out other targets. Metadata that ends up in the PDF (title and so on) is located under very specific headings, and page content comes from a variety of sources; even MS Word text copy/pasted in Windows encodings may be found.
If you find yourself stuck in an annual nightmare of collaborative editing, emailed MS Word versionitis, and manual layouting, Shlorp might be for you - but we offer no warranties, use at your own risk! Modifying Shlorp to scrape different content requires some knowledge of Python, HTML/DOM, and LaTeX, but the code should be readable and documented well enough to be adaptable without reinventing the wheel yet again.
Shlorp uses Fabric as its make tool:
sudo apt-get install fabric
fab setup
# modify shlorp.settings to your needs
fab run
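The settings to modify look roughly like this; the variable names below are assumptions based on the description elsewhere in this README, and the authoritative reference is shlorp/settings.template:

```python
# Hypothetical shlorp/settings.py fragment -- variable names are
# illustrative, not necessarily those used in shlorp/settings.template.
CONFLUENCE_URLS = [
    "https://confluence.example.com/display/REPORTS/Chapter+One",
    "https://confluence.example.com/display/REPORTS/Chapter+Two",
]
CONFLUENCE_USERNAME = "report-bot"
CONFLUENCE_PASSWORD = "change-me"  # never commit real credentials
```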
Run fab -l to see the other available make targets.
Run fab makedoc to generate the developer docs at docs/build/html/index.html.
- An Ubuntu (or similar) environment is available.
- Optionally, a virtualenv is created with virtualenvwrapper and activated.
- Fabric is installed.
- A set of URLs, a username and a password are given in shlorp/settings.py (copied from shlorp/settings.template).
- HTML pages are downloaded from the given URLs (login uses the configured username and password)
- Attached files are sanitized and re-linked in the HTML files
- HTML is parsed and transformed according to specific rules
The Confluence web pages are assumed to follow a loosely defined structure. The script removes some elements, uses others to set variables that feed the Pandoc LaTeX template, and transforms the rest into clean HTML, which in turn arrives more or less correctly in the Pandoc-generated LaTeX.
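The remove/extract/clean step can be sketched with BeautifulSoup as below; the CSS class, heading, and variable names are made-up examples of the kind of rule involved, not Shlorp's actual rules:

```python
# Hypothetical sketch of the parse-and-transform step. The selector
# ".page-metadata", the h1-as-title rule, and the variable names are
# assumptions for illustration only.
from bs4 import BeautifulSoup

HTML = """
<div id="main-content">
  <div class="page-metadata">edited by someone</div>
  <h1>Report Title</h1>
  <h2>Summary</h2>
  <p>Body text.</p>
</div>
"""

def transform(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that should not reach the PDF (assumed selector).
    for junk in soup.select(".page-metadata"):
        junk.decompose()
    # Use specific headings to set template variables, then drop them.
    variables = {}
    h1 = soup.find("h1")
    if h1 is not None:
        variables["title"] = h1.get_text(strip=True)
        h1.decompose()
    # What remains is the cleaned HTML handed on to Pandoc.
    return variables, str(soup)

variables, clean_html = transform(HTML)
```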
- The modified HTML is translated into LaTeX via a custom LaTeX template, fed with the variables extracted from each HTML file.
- The LaTeX file is further modified and re-saved, pending
  - compilation by XeLaTeX and
  - compression through Ghostscript.
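The final compile-and-compress steps amount to two external commands; this sketch only builds the command lines, and the file names and Ghostscript quality setting are illustrative assumptions, not Shlorp's exact invocation:

```python
import subprocess  # needed only if you uncomment the run() calls below

def build_commands(tex_path, pdf_path, compressed_path):
    """Return the XeLaTeX and Ghostscript command lines (illustrative)."""
    xelatex = ["xelatex", "-interaction=nonstopmode", tex_path]
    # Ghostscript re-writes the PDF at a lower quality setting to shrink it.
    ghostscript = [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook", "-dNOPAUSE", "-dBATCH",
        "-sOutputFile=" + compressed_path, pdf_path,
    ]
    return xelatex, ghostscript

xelatex_cmd, gs_cmd = build_commands("report.tex", "report.pdf", "report-small.pdf")
# To actually run them (requires xelatex and gs on the PATH):
# subprocess.run(xelatex_cmd, check=True)
# subprocess.run(gs_cmd, check=True)
```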
- Confluence auth
- HTML download
- HTML parsing using HTMLtidy, BeautifulSoup
- HTML to LaTeX conversion using Pandoc and wrapper pypandoc
- regex replacements
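The regex-replacement component can be pictured as a small rule table applied to the generated LaTeX; the two rules below are made-up examples of the kind of clean-up involved, not Shlorp's actual patterns:

```python
import re

# Assumed examples of post-Pandoc LaTeX clean-up rules.
RULES = [
    (re.compile(r"\s+\\ref\{"), r"~\\ref{"),       # tie \ref to the preceding word
    (re.compile(r"(?m)^\s*\\par\s*$\n?"), ""),     # drop stray empty paragraphs
]

def postprocess(latex):
    for pattern, replacement in RULES:
        latex = pattern.sub(replacement, latex)
    return latex

fixed = postprocess("See Figure \\ref{fig:one}.\n\\par\nDone.")
```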
Copyright (C) 2012-2013 Department of Parks and Wildlife
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.