HttpProxyMiddleware

A downloader middleware for Scrapy that changes the HTTP proxy from time to time.

Initial proxies are stored in a file. At runtime, the middleware fetches new proxies when it detects that too few valid proxies remain.

Related blog: http://www.kohn.com.cn/wordpress/?p=208
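To illustrate the behaviour described above (rotate proxies, discard bad ones, fetch more when too few remain), here is a minimal sketch. It is not the middleware's actual code: the class name `ProxyPool`, the `fetch_new` callback, and the `min_valid` threshold are all assumptions made for this example.

```python
class ProxyPool:
    """Toy proxy pool: rotate, discard bad proxies, refill when running low.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(self, proxies, fetch_new, min_valid=2):
        self.proxies = list(proxies)   # e.g. loaded from the initial proxy file
        self.fetch_new = fetch_new     # callable returning a list of proxy URLs
        self.min_valid = min_valid     # refill when fewer proxies than this remain
        self.discarded = set()         # proxies already found to be invalid
        self.index = 0

    def current(self):
        return self.proxies[self.index]

    def rotate(self):
        # Move on to the next proxy, wrapping around at the end of the list.
        self.index = (self.index + 1) % len(self.proxies)
        return self.current()

    def discard_current(self):
        # Drop the proxy in use and remember it so a refill cannot re-add it.
        self.discarded.add(self.proxies.pop(self.index))
        if len(self.proxies) < self.min_valid:
            for proxy in self.fetch_new():
                if proxy not in self.proxies and proxy not in self.discarded:
                    self.proxies.append(proxy)
        if not self.proxies:
            raise RuntimeError("no valid proxies left")
        self.index %= len(self.proxies)
        return self.current()
```

In this sketch, `fetch_new` stands in for whatever fetch_free_proxyes.py provides; passing it in as a callable keeps the pool independent of where the proxies come from.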

fetch_free_proxyes.py

Fetches free proxies from the Internet. You can modify it to suit your needs.

Usage

settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 351,
    # put this middleware after RetryMiddleware
    'crawler.middleware.HttpProxyMiddleware': 999,
}

DOWNLOAD_TIMEOUT = 10           # 10-15 seconds is a reasonable timeout in practice

change proxy

Often you will want to switch to a new proxy when your spider gets banned. Detect that your IP has been banned, then yield a new Request from your Spider.parse method with:

request.meta["change_proxy"] = True

Some proxies may return invalid HTML. If any exception is raised while parsing a response, also yield a new request with:

request.meta["change_proxy"] = True
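Both cases above (a detected ban and a parsing error) can be handled in the same way. The following is a minimal sketch, assuming a Scrapy version whose selectors support `.get()`. The helper names `with_new_proxy` and `looks_banned`, and the "access denied" marker, are hypothetical; only the `change_proxy` meta key comes from this README.

```python
def with_new_proxy(request):
    """Return a copy of `request` flagged so the middleware switches proxy."""
    # dont_filter=True keeps Scrapy's duplicate filter from dropping the retry.
    retry = request.replace(dont_filter=True)
    retry.meta["change_proxy"] = True
    return retry


def looks_banned(response):
    # Hypothetical ban check; replace it with whatever your target site shows.
    return b"access denied" in response.body.lower()


def parse(self, response):
    # Written as a plain function here; use it as the parse method of your spider.
    if looks_banned(response):
        yield with_new_proxy(response.request)
        return
    try:
        title = response.xpath("//title/text()").get()
    except Exception:
        # The proxy may have returned broken HTML; retry through another proxy.
        yield with_new_proxy(response.request)
        return
    yield {"url": response.url, "title": title}
```

Copying the request with `replace` rather than mutating it leaves the original request untouched, and `dont_filter=True` is needed because the retry targets a URL that Scrapy has already seen.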

spider.py

Your spider should specify a list of the status codes it may encounter while crawling. Any status code that is neither 200 nor in the list is treated as the result of an invalid proxy, and that proxy is discarded. For example:

website_possible_httpstatus_list = [404]

This line tells the middleware that the website you’re crawling may return responses with status code 404, and that the proxy used for such a request should not be discarded.
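The rule can be written as a small predicate. This is only an illustration of the documented behaviour, not the middleware's actual code; the function name `proxy_looks_invalid` is hypothetical.

```python
def proxy_looks_invalid(status, possible_statuses=()):
    """Return True when a response status suggests the proxy is bad.

    A status counts against the proxy only if it is neither 200 nor one of
    the codes the spider declared in website_possible_httpstatus_list.
    """
    return status != 200 and status not in possible_statuses


website_possible_httpstatus_list = [404]

# 404 is expected from the site itself, so the proxy is kept.
keep_on_404 = not proxy_looks_invalid(404, website_possible_httpstatus_list)
# 503 was not declared, so it is blamed on the proxy.
discard_on_503 = proxy_looks_invalid(503, website_possible_httpstatus_list)
```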

Test

Update the path to HttpProxyMiddleware.py in HttpProxyMiddlewareTest/settings.py, then run:

cd HttpProxyMiddlewareTest
scrapy crawl test

The testing server is hosted on my VPS, so take it easy. DO NOT waste too much of my data plan.

You can also start your own testing server using IPBanTest, which is powered by Django.

Contributors

kohn, kongtianyi
