hungnguyenvan / pyspidy

This project forked from mtrpires/pyspidy

A simple, yet powerful, python web crawler for Google with browser capabilities

Home Page: https://github.com/mtrpires/pySpidy

License: GNU General Public License v3.0

Python 100.00%

pyspidy's Introduction

pySpidy

A simple, yet powerful, Python web crawler for Google with browser capabilities

pySpidy is a Python (2.7) web crawler for Google with browser capabilities. It runs Google queries and mines the data from the resulting webpages, including title, link, date and description. It saves everything to a CSV file.

Intro

pySpidy was born out of a mid-2013 personal project to study how to build a web scraper in Python that extracts information from Google, exports it to a CSV file and downloads the HTML content from the result links. I'm a journalist who happens to code a little in Python. At that time, I couldn't find any Python crawlers that worked with Google: they were either broken or Google had banned them. Google may well have banned mine by now. They are very good at figuring out that your robot is not a person using an actual browser.

Bear in mind that Google doesn't approve of scraping their search results. For that, they have a custom search API. For free, you get 100 results per day. For more than that, you'll have to show them your monies. Use this tool at your own discretion.

How does it work?

Internally, pySpidy works by defining a class that holds all the information of the query, such as link, date, description and title. A browser object (powered by mechanize) handles the HTTP requests. The responses are parsed into a Beautiful Soup object, which is manipulated by data-mining helper functions. The crawler itself is a simple script that calls those functions and cycles through the result pages at Google. It stores everything it finds in a CSV file. It tells you almost everything it does in the console, and it handles some errors with more than just a callback.
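To make the "class plus CSV" idea concrete, here is a minimal sketch. It is not pySpidy's actual code: the names `SearchResult`, `write_results` and `results_to_csv_text` are hypothetical, and it is written for Python 3 rather than the 2.7 the project targets. It only shows one way a result holding title, link, date and description could be written out as CSV rows.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SearchResult:
    """One search hit. Hypothetical class; fields follow the ones named above."""
    title: str
    link: str
    date: str
    description: str


def write_results(results, handle):
    """Write a header line followed by one CSV row per result."""
    names = [f.name for f in fields(SearchResult)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for result in results:
        writer.writerow(asdict(result))


def results_to_csv_text(results):
    """Render results to a CSV string (handy for checking without touching disk)."""
    buffer = io.StringIO(newline="")
    write_results(results, buffer)
    return buffer.getvalue()
```

To write to a real file instead, something like `with open("results.csv", "w", newline="") as f: write_results(results, f)` should work; the file name is just an example.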

pySpidy uses two external Python libraries:

  • mechanize - Stateful programmatic web browsing in Python
  • Beautiful Soup - allows you to scrape the HTML documents easily

...and some built-in stuff:

  • csv - a CSV handling library, to create and modify CSV data
  • re - Regular expressions in Python
  • urllib - a library to, among other things, encode a string to a URL-friendly format
  • urlparse - something I used to revert an encoded URL back to a human-readable format
  • os - used to create, modify and save files
  • time - used to time some crawler tasks
  • random - for chaos
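As a rough illustration of the URL encoding and decoding that urllib and urlparse are used for, here is a small sketch. It is an assumption-laden example, not the project's code: it uses Python 3's `urllib.parse` (which merges the two Python 2.7 modules), the function names are hypothetical, and the `q` and `start` parameters are assumed here as the query and paging parameters of a search URL.

```python
from urllib.parse import parse_qs, unquote_plus, urlencode, urlsplit

# Assumed base URL for illustration only.
SEARCH_BASE = "https://www.google.com/search?"


def build_search_url(query, start=0):
    """Encode the query string and a result offset into a URL."""
    return SEARCH_BASE + urlencode({"q": query, "start": start})


def query_from_url(url):
    """Recover the human-readable query from an encoded search URL."""
    return parse_qs(urlsplit(url).query)["q"][0]


def readable(encoded):
    """Turn a percent-encoded fragment back into plain text."""
    return unquote_plus(encoded)
```

Cycling through result pages would then amount to calling `build_search_url` with increasing `start` values, assuming the offset parameter behaves as described.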

Disclaimer

I did this project for a very specific purpose, which may or may not be aligned with your goals. It goes without saying that the code is not free of bugs and that it may not behave 100% correctly all the time. Google is very smart at figuring out whether you're using bots to mine data through their web interface. It also goes without saying that you're free to fork the code and edit it to your heart's content.

Also, I don't claim to be a full-fledged coder. As much as I try to comment the code (sometimes too much), there are some approaches that may look far-fetched or simply clumsy.

I appreciate comments and constructive criticism.

Contact

Please use GitHub or drop me a message at mtrpires at outlook dot com. I'm also on Twitter: @mtrpires

pyspidy's People

Contributors

mtrpires, aleborba
