site-ar

A Python-based time-series site scraper with a text UI. Scraped data is stored in a SQLite database, and data already stored is skipped in subsequent scans. Data can be browsed in a tree view, searched, and exported with images (if available) to an XLSX file. The application core provides a simple db interface that handles schema creation, migrations, and an ORM. It also contains a key/value preference system and several urwid-based widgets. site-ar can be customized into different site types that define the schema, scrapers, views, and how data is mapped on export.
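
The skip-on-rescan behavior described above can be approximated with a uniqueness constraint in SQLite. The following is a minimal sketch under stated assumptions, not site-ar's actual schema or ORM: the `items` table, its columns, and the `open_db` and `store_scraped` helpers are hypothetical names chosen for illustration.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create the (hypothetical) items table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " url TEXT PRIMARY KEY,"    # one row per scraped page
        " title TEXT,"
        " price_cents INTEGER)"
    )
    return conn


def count_items(conn):
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def store_scraped(conn, rows):
    """Insert scraped rows, ignoring URLs stored by an earlier scan.

    Returns the number of rows that were actually new.
    """
    before = count_items(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO items (url, title, price_cents)"
        " VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return count_items(conn) - before


if __name__ == "__main__":
    conn = open_db()
    # The first scan stores both rows; the second only adds the unseen one.
    print(store_scraped(conn, [
        ("http://example.invalid/item/1", "Anvil", 12000),
        ("http://example.invalid/item/2", "Lathe", 45000),
    ]))
    print(store_scraped(conn, [
        ("http://example.invalid/item/2", "Lathe", 45000),
        ("http://example.invalid/item/3", "Vise", 3000),
    ]))
```

Making the URL the primary key leaves the duplicate check to the database, so a scraper can re-submit everything it sees without tracking state itself. A real scraper would more likely check the key before fetching a page, to avoid the network request as well.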

Motivation

Many smaller online auction sites publish past bid results but do not provide a way to search them. A combination of robots.txt and the way pages link back to each other makes searching with site:exampleauction.com on Google unusable. This program allows the market price and sale frequency of selected items at these auctions to be evaluated. These auction sites are typically local pickup only, and the size and weight of the items make shipping impractical, which skews results when compared to a nationwide auction site like eBay. The auction site type is provided, but it only includes test scrapers because this is not a TOS-friendly program. Writing custom scrapers with lxml is straightforward, though.

Installation

  • This was tested on a clean and updated Ubuntu 16.04.2 install.
sudo apt-get install python-pip python-pil python-lxml python-urwid
sudo pip install openpyxl
git clone https://github.com/Cutty/site-ar.git

Only openpyxl was installed from pip because the version in Ubuntu's apt repositories is too old. Exported files were only tested using LibreOffice Calc. For people who prefer pip over apt for their Python packages, the tested versions are listed below:

Python: 2.7.12
LibreOffice: 5.1.6.2
lxml: 3.5.0
PIL (Pillow): 3.1.2
openpyxl: 2.3.5, 2.4.0, 2.4.5, 2.4.6
urwid: 1.3.1

Usage

  • In terminal 1 start the local test-site HTTP server:
cd site-ar/test-site/
./http_server.py
  • In terminal 2 (you may want to resize the terminal to ~60/140 rows/cols):
cd site-ar/
./site-ar.py

Once in the application, press Shift-U to update all auctions at once. Navigate using the Left, Down, Up, Right keys or H, J, K, L, and press Space for a detailed view. Common keys are listed in the footer, and the help dialog can be brought up with ?. Once the db schema has been created, the raw ORM objects can be browsed by using the --site-type generic command line switch.

Troubleshooting

  • If the program crashes during export or the XLSX file contains broken images, try disabling JPEG conversion: press Ctrl-P to open preferences and set export.xlsx.img.jpeg.enable to False. Most versions of openpyxl always convert images to PNG. To save space, site-ar monkey patches openpyxl and PIL to force images to be converted and referenced as JPEG. This may not work on all versions of openpyxl.

  • Debugging must be done using pudb, which is available from both apt and pip. pudb uses the same UI library (urwid) as site-ar and switches seamlessly between the inferior (the program being debugged) and the debugger.
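
For readers wondering how a dotted preference key such as export.xlsx.img.jpeg.enable could be stored, the sketch below shows one possible key/value layout. It is an illustration only: the `Preferences` class, the `prefs` table, the use of SQLite for preferences, and the JSON encoding of values are all assumptions, not site-ar's actual implementation.

```python
import json
import sqlite3


class Preferences(object):
    """Tiny key/value store (hypothetical); values are JSON-encoded text."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS prefs ("
            " key TEXT PRIMARY KEY,"
            " value TEXT)"
        )

    def get(self, key, default=None):
        """Return the stored value, or default if the key was never set."""
        row = self.conn.execute(
            "SELECT value FROM prefs WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

    def set(self, key, value):
        """Store a value, replacing any previous value for the key."""
        self.conn.execute(
            "INSERT OR REPLACE INTO prefs (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()


if __name__ == "__main__":
    prefs = Preferences(sqlite3.connect(":memory:"))
    # Disable JPEG conversion, as suggested in the first bullet above.
    prefs.set("export.xlsx.img.jpeg.enable", False)
    print(prefs.get("export.xlsx.img.jpeg.enable", True))
```

Treating the dotted name as a single flat key keeps lookups simple. The dots then serve only as a naming convention for grouping related settings.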

License

See LICENSE.

All images in test-site/img were retrieved from http://www.publicdomainpictures.net and are believed to be in the public domain.
