
License: GNU General Public License v3.0


Pubproxpy

An easy-to-use Python wrapper for pubproxy's public proxy API.

Installation

NOTE: The minimum Python version for this library is 3.7. Check with python -V or python3 -V if you're unsure about your current version.

Install the pubproxpy package using your standard Python package manager, e.g.

$ pip install pubproxpy

As always, installing into a virtual environment is recommended.

Keyless API Limitations

API Daily Limits

At the time of writing, without an API key the pubproxy API limits users to 5 proxies per request and 50 requests per day. The maximum number of proxies per request is always used to minimize rate limiting and to get the most proxies possible within the request limit, meaning you should get up to 250 proxies per day without needing an API key.

API Rate Limiting

Without an API key, pubproxy limits users to one request per second, so a ProxyFetcher will try to ensure that at most one request per second is made. This delay is synchronized between ProxyFetchers, but it is not thread-safe, so if you have no API key make sure all ProxyFetchers run on one thread in one program. The rate limiting is quite severe: once it is hit, the API seems to deny requests for several minutes or hours.

Quickstart Example

from pubproxpy import Level, Protocol, ProxyFetcher

# ProxyFetcher for proxies that use the socks5 protocol, are located in
# the US or Canada and support POST requests
socks_pf = ProxyFetcher(
    protocol=Protocol.SOCKS5, countries=["US", "CA"], post=True
)

# ProxyFetcher for proxies that support https, are elite anonymity level,
# and connected in 15 seconds or less
https_pf = ProxyFetcher(
    protocol=Protocol.HTTP, https=True, level=Level.ELITE, time_to_connect=15
)

# Get one socks proxy, followed by 10 https proxies
# NOTE: even though there are multiple `ProxyFetcher`s the delays are
#       coordinated between them to prevent rate limiting
socks_proxy = socks_pf.get()[0]  # Get a single socks proxy
https_proxies = https_pf.get(10)  # Get 10 https proxies

# And then if you want to get any remaining proxies left over in the local list
# before stopping then you can!
unused_proxies = socks_pf.drain()

# Do something with the proxies, like spawn worker threads that use them

Documentation

Getting proxies is handled by the ProxyFetcher class. There are several parameters you can pass on initialization to narrow down the proxies to a suitable type. From there you can just call .get(amount=1) to receive a list of amount proxies, where each proxy is a str in the form "{ip}:{port}". There is an internal blacklist to ensure that the same proxy IP and port combination will not be used more than once by any ProxyFetcher, unless exclude_used is False.

ProxyFetcher Parameters

Since the API does pretty much no correctness checking itself, we do our best in the ProxyFetcher to ensure nothing is wrong. As far as I know, the only thing that isn't checked is whether countries or not_countries actually use valid country codes.

| Parameter | Type | Description |
| --- | --- | --- |
| exclude_used | bool | [Default: True] Whether the ProxyFetcher should prevent re-returning proxies |
| api_key | str | API key for a paid account. You can also set $PUBPROXY_API_KEY to pass your key; passing the api_key parameter will override the env var if both are present |
| level | pubproxpy.Level | [Options: ANONYMOUS, ELITE] Proxy anonymity level |
| protocol | pubproxpy.Protocol | [Options: HTTP, SOCKS4, SOCKS5] Desired communication protocol |
| countries | str or list<str> | Locations of the proxy using ISO 3166-1 alpha-2 country codes. Incompatible with not_countries |
| not_countries | str or list<str> | Blacklisted locations of the proxy using ISO 3166-1 alpha-2 country codes. Incompatible with countries |
| last_checked | int | [Bounds: 1-1000] Minutes since the proxy was checked |
| port | int | Proxies using a specific port |
| time_to_connect | int | [Bounds: 1-60] How many seconds it took for the proxy to connect |
| cookies | bool | Supports requests with cookies |
| google | bool | Can connect to Google |
| https | bool | Supports HTTPS requests |
| post | bool | Supports POST requests |
| referer | bool | Supports referer requests |
| user_agent | bool | Supports forwarding user-agents |

ProxyFetcher Methods

Keeping it simple (stupid), so just .get(amount=1) and .drain().

| Method | Returns |
| --- | --- |
| .get(amount=1) | List of amount proxies, where each proxy is a str in the form "{ip}:{port}" |
| .drain() | Any proxies remaining in the current internal list; useful if you are done getting proxies and want to save any that are left over |

Exceptions

All the exceptions are defined in pubproxpy.errors.

| Exception | Description |
| --- | --- |
| ProxyError | Base exception that all other pubproxpy errors inherit from; also raised when the API returns an unknown response |
| APIKeyError | Raised when the API responds that the API key is incorrect |
| RateLimitError | Raised when the API gives a rate-limiting response (more than 1 request per second) |
| DailyLimitError | Raised when the API gives the daily request limit response |
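
For example, a minimal sketch of handling these errors around a fetch (the exception names come from the table above; everything else is illustrative):

from pubproxpy import ProxyFetcher
from pubproxpy.errors import DailyLimitError, ProxyError, RateLimitError

pf = ProxyFetcher()

try:
    proxies = pf.get(5)
except DailyLimitError:
    proxies = []  # Out of requests for the day, nothing to do but wait
except RateLimitError:
    proxies = []  # Should be rare since the delay is handled internally
except ProxyError as err:
    # Catch-all for any other pubproxpy error, e.g. an unknown API response
    raise SystemExit(f"Unexpected pubproxy failure: {err}")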


pubproxpy's Issues

Avoid stringly typed API

I'm not a fan of stringly typed APIs, so it would be nice to convert the parameters that only encompass a few options from strs to enums.
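
A rough sketch of the direction this implies (the Level and Protocol names match the enums now shown in the quickstart; the string values here are assumptions):

from enum import Enum

class Level(Enum):
    ANONYMOUS = "anonymous"
    ELITE = "elite"

class Protocol(Enum):
    HTTP = "http"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"

# Callers then pass Level.ELITE instead of an easily-mistyped "elite" string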

Flip precedence for api key from param and env var

Currently, the API key is only read from the env var if it wasn't passed into the ProxyFetcher; however, it makes more sense for users to be able to override the param with the env var, since changing the env var is much easier to do. For this reason I think it should try reading from the env var first and fall back to the param if the env var isn't set (a sketch follows the list below).

  • This will be a breaking change so add a warning for when both are set.
  • Make sure that this behavior is properly documented.
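
A minimal sketch of the proposed lookup order, using a hypothetical _resolve_api_key helper:

import os
import warnings

def _resolve_api_key(param_key):
    env_key = os.environ.get("PUBPROXY_API_KEY")
    if env_key is not None and param_key is not None:
        # Breaking change: the env var now wins, so warn when both are set
        warnings.warn("Both $PUBPROXY_API_KEY and api_key are set; using the env var")
    return env_key if env_key is not None else param_key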

Raise appropriate error from incorrect params

This was my first library, so some of the design reflects that, but invalid params should likely raise a ValueError instead of tripping an assert.

This is considered a breaking change since it would result in different error handling if people are trying to specifically catch it.
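
For illustration, the kind of change this implies (the function name and bounds check are hypothetical, not the library's actual validation code):

def _check_bounds(name, value, low, high):
    # Old style: assert low <= value <= high, f"{name} out of bounds"
    # Proposed: raise an exception that callers can reliably catch
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")

_check_bounds("time_to_connect", 15, 1, 60)  # bounds taken from the parameter table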

Add warnings for breaking changes

To make sure library users are aware, I'm planning on adding warnings for #9 and #10 first, and then actually releasing the breaking changes in 3.0.0 six months after the warnings are added. This should give some time to prepare for the changes, and library users can pin to the previous minor release to silence the warnings once they've seen them.
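
A minimal sketch of what such a warning could look like (the exact message is an assumption):

import warnings

warnings.warn(
    "pubproxpy 3.0.0 will flip the api_key param/env var precedence and raise "
    "ValueError for invalid params; see issues #9 and #10",
    DeprecationWarning,
    stacklevel=2,  # attribute the warning to the caller's code
)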

Hello Users

You may be seeing this because the deprecation warning linked you here. As you've seen, there will be breaking changes to the public API which may end up breaking your code as well. For that reason I've added deprecation warnings for the breaking changes and am planning to ship them in the next major release (v3.0.0), which I plan to publish on 2021-06-01.

This matters to you only if you use the premium API (with an API key), since one of the changes affects the precedence of passing in the key as a param vs. an env variable (#10), or if you handle specific errors raised by the API (#9). If you don't handle errors then you likely should, since you could very well exhaust the daily proxy allowance.

Your options currently are

  • Switch to the breaking changes early by using the master branch (directions below).
  • Never update to the breaking changes. If this fits you, then simply sticking to v2.0.0 (e.g. by pinning the version as shown below) will get rid of the deprecation warnings without switching to the breaking changes. This of course implicitly means that you likely won't receive any further updates, including other possible bug fixes.
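
For example, pinning the version with standard pip syntax (not a command taken from the project docs):

$ pip install pubproxpy==2.0.0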

Updating to the master branch

pip

Installing with this command should do the trick (as usual, you should probably use a virtual environment).

pip install https://github.com/LovecraftianHorror/pubproxpy/archive/master.zip

poetry

Simply using this for your dependency should work

pubproxpy = { git = "https://github.com/LovecraftianHorror/pubproxpy" }

Bug where proxies can get duplicated even when `exclude_used=True`

Adding the test in 97e22b9 revealed that duplicate proxies can still get through, because the duplicate check is done during ._fetch() while the blacklist isn't updated until .get_proxy() or .get_proxies(...). As a result, duplicates can slip through and be added to the internal proxy list before the blacklist is updated.
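
An illustrative sketch of one fix direction; the method and attribute names follow the issue text, but _request_batch is hypothetical and the actual internals may differ:

def _fetch(self):
    new = self._request_batch()  # hypothetical helper wrapping the raw API call
    # Filter against the blacklist AND proxies already queued locally, since
    # the blacklist is only updated later in .get_proxy()/.get_proxies()
    new = [p for p in new if p not in self._blacklist and p not in self._proxies]
    self._proxies.extend(new)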

Revamp error system

Right now the errors are handled in a somewhat hacky way, so I think it would be good to redo the entire system.

Actually add some tests

pytest is listed in the dev dependencies and is automatically run in CI, but nothing is really tested. It would be nice to add some tests to make sure that the library is actually strict about the things it promises and that the delay to avoid rate limiting is correctly applied.
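
For instance, a rough pytest sketch for the rate-limit delay (note that it hits the live API and spends two keyless requests):

import time

from pubproxpy import ProxyFetcher

def test_keyless_requests_are_delayed():
    # Two keyless fetchers should coordinate to stay under 1 request/second
    pf_a, pf_b = ProxyFetcher(), ProxyFetcher()
    start = time.monotonic()
    pf_a.get()
    pf_b.get()
    assert time.monotonic() - start >= 1.0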

Documentation

It would be nice for the library to include:

  • A poetry.lock file
  • CONTRIBUTING.md
  • A pull request template
  • Issue templates
  • Descriptive releases on github

Test bad response behavior in test suite

It would be good to check bad responses in the test suite too. Try to find out if there's something similar to wiremock for mocking out the web server. It's also possible to detect whether pytest is currently running, so, worst case, we could spin up a server for the pytest run with some fixtures.
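
Worst case, a throwaway server fixture built on the standard library could look roughly like this (whether pubproxpy can be pointed at a custom base URL is a separate question, and the response body is just a placeholder):

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import pytest

class _FakePubProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer with a canned "bad" body to exercise error handling
        body = b"placeholder rate limit message"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

@pytest.fixture
def fake_api():
    server = HTTPServer(("127.0.0.1", 0), _FakePubProxy)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    yield f"http://127.0.0.1:{server.server_port}"
    server.shutdown()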

Handle potential request errors

Currently, the library assumes that the request goes through correctly. It would be good to check to make sure the request succeeded and return an appropriate error if not.
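
A sketch of the kind of guard this implies, assuming the requests library is used for the HTTP call (the helper name is hypothetical; ProxyError is the library's base exception):

import requests

from pubproxpy.errors import ProxyError

def _get_response(url, params):
    try:
        return requests.get(url, params=params, timeout=30)
    except requests.RequestException as err:
        # Connection failures, DNS errors, timeouts, etc.
        raise ProxyError(f"Request to pubproxy failed: {err}") from err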

Server Errors are not raised correctly

I thought I remembered the server giving error messages with a 200 status code, but it looks like at the very least the daily limit is reported with a 503 (test out the rest). This means the errors don't get exposed correctly, since requests raising for the HTTP error happens first.

The correct way to handle this would be to try to match the error message first, and only raise for the HTTP error with requests if no match is found.
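
A sketch of that ordering, with a hypothetical _match_api_error helper that maps known error bodies to the exceptions listed above:

def _raise_on_error(resp):
    # resp is a requests.Response; match known API error messages first...
    err = _match_api_error(resp.text)  # e.g. DailyLimitError for the 503 body
    if err is not None:
        raise err
    # ...and only fall back to the generic HTTP error if nothing matched
    resp.raise_for_status()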
