jamesturk / scrapelib

⛏ a library for scraping unreliable pages

Home Page: https://jamesturk.github.io/scrapelib/

License: BSD 2-Clause "Simplified" License

Languages: Python 99.64%, Dockerfile 0.36%
Topics: python, scraper, http

scrapelib's Introduction

scrapelib is a library for making requests to less-than-reliable websites.

Source: https://github.com/jamesturk/scrapelib

Documentation: https://jamesturk.github.io/scrapelib/

Issues: https://github.com/jamesturk/scrapelib/issues


Features

scrapelib originated as part of the Open States project, which scrapes the websites of all 50 state legislatures, and was therefore designed with features that are desirable when dealing with sites that have intermittent errors or require rate-limiting.

Advantages of using scrapelib over using requests as-is:

  • HTTP(S) and FTP requests via an identical API
  • support for simple caching with pluggable cache backends
  • highly-configurable request throttling
  • configurable retries for non-permanent site failures
  • all of the power of the superb requests library

Installation

scrapelib is on PyPI, and can be installed via any standard package management tool:

poetry add scrapelib

or:

pip install scrapelib

Example Usage

  import scrapelib
  s = scrapelib.Scraper(requests_per_minute=10)

  # Grab Google front page
  s.get('http://google.com')

  # Will be throttled to 10 HTTP requests per minute
  while True:
      s.get('http://example.com')
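The caching, throttling, and retry features can be combined on one Scraper. A minimal sketch, assuming scrapelib's FileCache backend and the retry_attempts / retry_wait_seconds constructor arguments (check the documentation for the exact names in your version):

  import scrapelib

  s = scrapelib.Scraper(
      requests_per_minute=60,      # throttle to 60 requests per minute
      retry_attempts=3,            # retry transient failures up to 3 times
      retry_wait_seconds=10,       # back off between retries
  )

  # cache responses on disk so repeated runs don't re-fetch unchanged pages
  s.cache_storage = scrapelib.FileCache('demo-cache')
  s.cache_write_only = False       # also read from the cache, not just write to it

  resp = s.get('https://example.com')
  print(resp.status_code)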

scrapelib's People

Contributors

dependabot-preview[bot], dependabot[bot], fgregg, hancush, jamesturk, jessemortenson, jmcarp, joegermuska, konklone, mikejs, newageairbender, poliquin, twneale


scrapelib's Issues

Throttling should only apply when not reloading from cache

I think the inheritance order for Scraper between CachingSession and ThrottledSession should be reversed, so that when a document is reloaded from the cache, there's no wait before the next request.

Thus, only requests really loaded from the network would be throttled.

Does this make sense?
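A minimal sketch of the idea (hypothetical class names, not scrapelib's actual code): with cooperative request() overrides, whichever mixin comes first in the MRO runs first, so putting the caching mixin ahead of the throttling mixin lets cache hits skip the throttle entirely:

  import requests

  class ThrottledSession(requests.Session):
      def request(self, method, url, **kwargs):
          self._wait_for_throttle()          # sleep until the next request slot is free
          return super().request(method, url, **kwargs)

      def _wait_for_throttle(self):
          pass  # placeholder for the real rate-limit bookkeeping

  class CachingSession(requests.Session):
      def request(self, method, url, **kwargs):
          cached = self._check_cache(method, url)
          if cached is not None:
              return cached                  # cache hit: no network, no throttle
          return super().request(method, url, **kwargs)

      def _check_cache(self, method, url):
          return None  # placeholder for the real cache lookup

  # CachingSession first in the MRO => the cache is consulted before throttling,
  # so only requests that actually hit the network are throttled.
  class Scraper(CachingSession, ThrottledSession):
      pass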

May want to consider requiring requests security extras

Hello, and thank you for contributing and maintaining such a helpful project.

I found out about scrapelib via the unitedstates/congress project, which I'm just beginning to explore. That project requires scrapelib>=0.1.0,<1.0.0 and my virtual environment picked up scrapelib-0.10.1. The first command I attempted to run was to pull down the fdsys sitemaps, which resulted in the following SSL error emerging out of scrapelib:

File "unitedstates-congress/.env/lib/python2.7/site-packages/scrapelib/__init__.py", line 201, in request
raise exception_raised
SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)

I found this StackOverflow post which recommends installing the requests package's extra security packages:

pip install requests[security]

That worked for my immediate use case. Since this wasn't an issue with unitedstates/congress, I thought to recommend that scrapelib update its requests module requirement to include the optional security packages. If my understanding of the issue is correct, this issue could crop up in other downstream projects for similar user configurations (OSX 10.12.3 Sierra w/ stock OpenSSL 0.9.8zh running Python 2.7.10 w/ requests 2.13.0).

(Separately, there seems to be a mismatch between the project README's requirement for requests > 2.0 and the requirements.txt requirement for requests >= 1.2.2)

Thank you again, and I hope this is helpful!

robotparser is wrong?

The two most recent times I've tried to use scrapelib defaulting to honoring the robots.txt files, I've gotten blocked, even though my read of the Robots files says I shouldn't.

I found this on SO: http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly and checked out reppy, and I do get conflicting answers for my specific case:

>>> url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
>>> user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'
>>> from reppy.cache import RobotsCache
>>> robots = RobotsCache()
>>> robots.allowed(url,user_agent)
True

>>> import robotparser
>>> parser = robotparser.RobotFileParser()
>>> parser.set_url('http://en.wikipedia.org/robots.txt')
>>> parser.read()
>>> parser.can_fetch(user_agent,url)
False

Think it's worth switching to use reppy instead? It sounds like the robots.txt spec is just not very well articulated, but the Wikipedia robots.txt file says specifically "Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please" ...

set attribute to indicate if response was pulled from cache or not

It is sometimes useful to know whether a response was pulled from the cache, particularly if the last-modified code path is used in the sqlite cache; for example, you may only want to process new data.

It would be nice to set an attribute on responses hydrated from the cache that indicates whether they were pulled from the cache or not.
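A minimal sketch of what the caller side might look like, assuming a hypothetical fromcache attribute set by the caching layer (the attribute name is illustrative, not necessarily what scrapelib would use):

  import scrapelib

  s = scrapelib.Scraper()
  s.cache_storage = scrapelib.FileCache('demo-cache')
  s.cache_write_only = False

  resp = s.get('https://example.com')

  # 'fromcache' is a hypothetical attribute for illustration; getattr() keeps
  # this working even if the attribute isn't set on network responses.
  if getattr(resp, 'fromcache', False):
      print('served from cache, skipping reprocessing')
  else:
      print('fresh response, processing new data')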

requests incompatibility

Seems like there's an issue with the version of requests being used. Just did a pip install scrapelib on a clean virtualenv. From the console:

(scrapelibtest)Toms-MacBook-Air:scrapelibtest tomlee$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import scrapelib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tomlee/Projects/scrapelibtest/lib/python2.7/site-packages/scrapelib/__init__.py", line 193, in <module>
    class FTPSession(requests.Session):
  File "/Users/tomlee/Projects/scrapelibtest/lib/python2.7/site-packages/scrapelib/__init__.py", line 195, in FTPSession
    requests.defaults.SCHEMAS.append('ftp')
AttributeError: 'module' object has no attribute 'defaults'

requests.sessions api change causes breakage

requests.sessions api changed between 1.2.1 and 1.2.2. requests.sessions.merge_kwargs is now called requests.sessions.merge_settings. Change is from this commit:

kennethreitz/requests@9811424

Someone using a fresh virtualenv install will eventually get this traceback:

File ".../site-packages/scrapelib/__init__.py", line 348, in request
    headers = requests.sessions.merge_kwargs(headers, self.headers)
AttributeError: 'module' object has no attribute 'merge_kwargs'

Suggested quick fix is to change requirements.txt to:

requests>=1.0,<=1.2.1

or, of course, change the requests API calls scrapelib uses and require requests>=1.2.2 instead.

Time for a release, maybe?

Hello,

I saw someone's request to package it for Debian repositories. I am not sure if that's still needed? Is it?

If so, could you please make a new release? The last release dates back to November 2018.

Handling of requests exceptions

If a request raises

requests.HTTPError, 
requests.ConnectionError,
requests.Timeout

and a subsequent retry cannot recover, then one of these exceptions is raised even if self.raise_errors == False.

Instead, could we build a kind of response object when we get these errors, and handle error raising the same way we handle valid http error status codes?
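A minimal sketch of one way this could look (a hypothetical helper, not scrapelib's actual code): wrap the caught exception in a bare requests.Response so the existing status-code handling path can deal with it uniformly:

  import requests

  def response_from_exception(url, exc):
      """Build a synthetic Response for a request that never completed."""
      resp = requests.Response()
      resp.status_code = 503                      # treat as a retryable server-side failure
      resp.url = url
      resp.reason = type(exc).__name__            # e.g. 'ConnectionError', 'Timeout'
      resp._content = str(exc).encode('utf-8')    # keep the original message in the body
      return resp

  # usage sketch:
  # try:
  #     resp = session.request(method, url, **kwargs)
  # except (requests.ConnectionError, requests.Timeout) as exc:
  #     resp = response_from_exception(url, exc)
  # # ...then fall through to the normal raise_errors / accept_response handling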

Recent changes break urlopen?

Using HEAD on Python 2.7.5:

  File "utils.py", line 297, in lxmlize
    entry = scraper.urlopen(url)
  File "/Users/james/.virtualenvs/scrapers_ca_app/src/scrapelib/scrapelib/__init__.py", line 307, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/Users/james/.virtualenvs/scrapers_ca_app/src/scrapelib/scrapelib/__init__.py", line 282, in request
    headers = requests.sessions.merge_setting(headers, self.headers)
AttributeError: 'module' object has no attribute 'merge_setting'

scrapelib incorrectly caches kwargs to requests

I initialized a scraper, and fetched a URL whose server has not correctly configured their HTTPS certificates. (They're missing the intermediates, so desktop browsers will fetch them, but most client libs will not.)

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf")

This hangs, as it goes through a retry cycle, and will eventually fail. If I pass the verify=False option, as documented in requests, it will work and spit out a warning:

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf", verify=False)

/home/eric/.virtualenvs/inspectors/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)

However, if I then remove the verify flag, or even if I add it back with verify=True, I can't coax scrapelib into going back into verify mode:

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf", verify=True)

/home/eric/.virtualenvs/inspectors/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)

Raw requests.get() calls do not display this behavior. So, something in scrapelib's caching behavior is caching keyword options.

To be clear, I think the correct behavior here is not to cache kwargs at all, as I wouldn't want verify to "stick" whether or not I was explicit about it later.

This is with the latest version of scrapelib:

requests==2.5.1
scrapelib==0.10.1
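For comparison, a minimal sketch of the behavior the reporter describes with a plain requests.Session, where verify is a per-call keyword and does not persist between requests (same URL as above; this needs network access to reproduce):

  import requests

  url = ('https://www.ignet.gov/sites/default/files/files/'
         'Cloud%20Computing%20Initiative%20Report.pdf')

  s = requests.Session()

  # Explicitly disabling verification affects only this call
  # (and emits an InsecureRequestWarning).
  s.get(url, verify=False)

  # A later call without verify=False goes back to full certificate
  # verification and fails again if the chain is incomplete.
  try:
      s.get(url)
  except requests.exceptions.SSLError as exc:
      print('verification re-enabled as expected:', exc)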

Allow for more control of status codes to not retry on

Hi James,

Would you be amenable to allowing the user to configure status codes not to retry on?

I think I would approach this by adjusting this code section:

  if self.accept_response(resp) or (resp.status_code == 404 and not retry_on_404):
      break

to be something like

  no_retry = (self.accept_response(resp) or
              (resp.status_code == 404 and not retry_on_404) or
              resp.status_code in self.quick_fail_status_codes)
  if no_retry:
      break
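On the caller side, the proposed quick_fail_status_codes attribute (hypothetical, not part of scrapelib's current API) might be used like this:

  import scrapelib

  s = scrapelib.Scraper(retry_attempts=3)
  # hypothetical attribute: status codes that should fail immediately
  # instead of going through the retry cycle
  s.quick_fail_status_codes = {401, 403, 410}

  resp = s.get('https://example.com/protected')  # a 403 would return at once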

Follow <meta> redirects

scrapelib doesn't seem to follow <meta> redirects. While this is a somewhat old and non-standard way to do redirection in 2013, it's still out there on a few government sites.

Here's one example:

$ curl http://www.risch.senate.gov
<html>
<head>
<meta http-equiv="Refresh"
content="0;url=http://www.risch.senate.gov/public/">
</head>

There's an example on StackOverflow for how this could be implemented.

Currently also being discussed in unitedstates/congress-legislators#85
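A minimal sketch of how a <meta> refresh could be detected and followed (a hypothetical helper using lxml, not scrapelib code):

  import lxml.html
  import requests

  def follow_meta_refresh(session, url):
      """Fetch url and follow a <meta http-equiv="refresh"> redirect, if present."""
      resp = session.get(url)
      doc = lxml.html.fromstring(resp.content)
      for meta in doc.xpath('//meta[@http-equiv and @content]'):
          if meta.get('http-equiv', '').lower() != 'refresh':
              continue
          # content looks like "0;url=http://www.risch.senate.gov/public/"
          _, _, target = meta.get('content').partition('url=')
          if target:
              return session.get(target.strip())
      return resp

  # usage sketch:
  # resp = follow_meta_refresh(requests.Session(), 'http://www.risch.senate.gov')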

adding to debian/ubuntu

Hi,

I've really become excited about the Sunlight Foundation in the past few months, and think it's a great project. I'd love to see these packages in both Debian and Ubuntu, since the goals of those projects are pretty complementary to Sunlight's, and it seems like a natural fit for Debian/Ubuntu users to want to contribute to Sunlight.

In my mind, the best way forward is to get the packages into the distro to reduce any friction developers might encounter that would prevent them from installing and playing with Sunlight.

I've already done most of the work in packaging up scrapelib, which included writing a manpage for scrapeshell, and turning on the tests during package building. Additionally, I've set up the package to be maintained collectively by the Debian Python Modules Team, so that anyone in the team can step up and help, and so that things aren't bottlenecked on me.

My plan was to find someone to help with the last few steps of actually uploading the package to the archive, but I figured I'd check in with you first. I don't think this should be a problem, but it's always nice to reach out to the upstream to coordinate a bit. :)

cheers!
/ac

Comparison with Scrapy

I couldn't find any comparison with the more popular Scrapy library.
I see that Scrapy is much older than scrapelib.
Could you please provide key advantages of scrapelib over Scrapy?

release to pypi?

Hi @jamesturk ,

Would you mind making a minor release to PyPI that includes the June changes? Thanks for your consideration!

AttributeError: 'module' object has no attribute 'Scraper'

Hello,

I've just used pip install requests and pip install scrapelib in a new virtualenv, with Python 3.4.

I got this error:

Traceback (most recent call last):
  File "C:/Users/natalie/Documents/Encore-in-Google/scrapelib.py", line 1, in <module>
    import scrapelib
  File "C:/Users/natalie/Documents/Encore-in-Google\scrapelib.py", line 2, in <module>
    s = scrapelib.Scraper(requests_per_minute=10)
AttributeError: 'module' object has no attribute 'Scraper'

Not sure what I'm doing wrong? I'm using requests 2.3.0 and Python 3.4, so I think both are compatible with scrapelib? I just copied the example usage at https://github.com/sunlightlabs/scrapelib

I've not used virtualenv before, so I'm wondering if that's the problem, though both requests and scrapelib said they'd installed.

Thanks!

remove urlopen/RequestStr

these are holdovers from a pre-requests era and should go away now.

attributes of RequestStr resp:

  • treating resp like a string (use resp.text)
  • _scraper - can go away, was 'private'
  • resp.bytes -> new.content
  • resp.response -> new
  • resp.encoding -> new.encoding
  • resp.response.requested_url -> new.history[0].url if new.history else new.url
  • resp.response.code -> new.status_code
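A minimal before/after sketch of the migration implied by the list above (variable names are illustrative):

  import scrapelib

  s = scrapelib.Scraper()
  url = 'https://example.com'

  # old-style, with urlopen() returning a string-like RequestStr:
  # resp = s.urlopen(url)
  # html = resp                      # treated like a string
  # raw = resp.bytes
  # status = resp.response.code
  # original_url = resp.response.requested_url

  # new-style, with a plain requests Response:
  resp = s.get(url)
  html = resp.text
  raw = resp.content
  status = resp.status_code
  original_url = resp.history[0].url if resp.history else resp.url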

next version of scrapelib

My current thinking is that the next version of scrapelib should support urllib3 2.0 as well as 1.0 (as recommended here: https://urllib3.readthedocs.io/en/stable/v2-migration-guide.html); that'd probably be 2.3.

I've also had the idea to decouple from requests entirely (allowing direct usage of urllib3 and/or httpx), but I would save that for a 3.0 release (or a new library).

If you're seeing this issue, I'd love to know what is/isn't useful about the current version. Why do you use it & when do you hit pain points?

Example usage uses deprecated urlopen

The changelog for 0.10.0 says deprecation of urlopen in favor of Requests’s request(), get(), post(), etc., but the README still shows urlopen as the main usage.
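A sketch of what the updated README example might look like, using get() instead of the deprecated urlopen():

  import scrapelib
  s = scrapelib.Scraper(requests_per_minute=10)

  # old: html = s.urlopen('http://google.com')
  # new: get() returns a regular requests Response
  resp = s.get('http://google.com')
  html = resp.text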
