pySpidy

A simple, yet powerful, Python web crawler for Google with browser capabilities

pySpidy is a Python (2.7) webcrawler for Google with browser capabilities. It does Google queries and mine the data from the resulting webpages, including title, link, date and description. It saves everything to a CSV file.

Intro

pySpidy was born out of a mid-2013 personal project to study how to build a web scraper out of Python extracting information from Google, exporting it to a CSV file and downloading the HTML content from the result links. I'm a journalist who happens to code a little in Python. At that time, I couldn't find any Python crawlers that worked with Google. They were either broken or Google had banned them. It may be the case that Google has already banned mine. They are very good at figuring out your robot is not a person using an actual browser.

Bear in mind that Google doesn't approve scraping their search results. For that, they have a custom search API. For free, you get 100 results per day. More than that you'll have to show them your monies. Use this tool at your own discretion.

How does it work?

Internally, pySpidy works by defining a class which holds all the information of the query, such as link, date, description and title. There is a browser object (powered by mechanize) that handles the HTTP requests. Those are parsed to a Beautiful Soup object that are manipulated by data-mining helper funcions. The crawler itself is a simple script that calls those functions and cycle through the result pages at Google. It stores everything it finds in a CSV file. It tells you mostly everything it does in the console and it handles some errors with more than just a callback.

pySpidy uses two external Python libraries:

mechanize - Stateful programmatic web browsing in Python
Beautiful Soup - allows you to scrape the HTML documents easily

...and some built-in stuff:

csv - a CSV handling library, to create and modify CSV data
re - Regular expressions in Python
urllib - a library to, among other things, encode a string to a URL-friend format
urlparse - something I used to revert back and encoded URL to a human-readable format
os - used to create, modify and save files
time - used to time some crawler tasks
random - for chaos

Disclaimer

I did this project for a very specific purpose, which may or may not be aligned with your goals. It goes without saying that the code is not free of bugs and that it may not behave 100% correctly all the time. Google is very smart in figuring out whether you're using bots to mine data through their web interface. It also goes without saying that you're free to fork the code and edit it at your heart's content.

Also, I don't claim to be a full fledge coder. As much as I try to comment the code (sometimes too much), there are some approaches that may look far fetched or simply clumsy.

I appreaciate comments and constructive criticism.

Contact

Please use github or drop me a message at mtrpires at outlook dot com. I'm also on twitter: @mtrpires

hungnguyenvan / pyspidy Goto Github PK

pyspidy's Introduction

pySpidy

Intro

How does it work?

Disclaimer

Contact

pyspidy's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent