
bartdag / pylinkvalidator

142 stars, 10 watchers, 36 forks, 126 KB

pylinkvalidator is a standalone, pure-Python link validator and crawler that traverses a web site and reports any errors (e.g., 500 and 404) it encounters.

License: Other

Python 99.13% CSS 0.01% JavaScript 0.01% HTML 0.84%
Topics: python, networking, link-checker, crawler

pylinkvalidator's People

Contributors

arunelias, bartdag


pylinkvalidator's Issues

Invalid IPv6 URL

When checking some URLs I get the following error:

error (<type 'exceptions.ValueError'>): Invalid IPv6 URL: 

Even though the URL is not formatted unusually.

Scan http://verticalindustriesblog.redhat.com/ with depth=1 for some examples.

I may modify my fork to just ignore this error, but I'm not sure there is a correct way to 'fix' it. From what I can find online, it seems to be an issue with Python 2.7.x.

I see it both on 2.7.5 and 2.7.10.
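
For reference, this ValueError comes from the standard library: Python 2.7's urlparse raises "Invalid IPv6 URL" whenever a square bracket in the host part is unbalanced, which a single malformed href can trigger. A minimal reproduction and a hedged sketch of the ignore-and-skip approach (parse_or_skip is a hypothetical name, not pylinkvalidator's API):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

def parse_or_skip(url):
    # An unbalanced bracket in the netloc makes urlparse raise
    # ValueError("Invalid IPv6 URL") on Python 2.7.
    try:
        return urlparse(url)
    except ValueError:
        return None  # record the bad URL and keep crawling

parse_or_skip("http://[example.com/")  # None instead of a crash
parse_or_skip("http://example.com/")   # normal parse result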

filter by file

There is IGNORED_PREFIXES but I need to filter a particular file out of the results...

All our WordPress blogs include http://my.domain/xmlrpc.php which always returns a 405.

Would it be possible to add another flag, or maybe make the current "ignore" more flexible (regex, maybe)?

I'm going to dig into the code but figured you may have a quick solution.
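
A regex-based ignore could sit as a thin filter in front of the reporter. A minimal sketch, assuming each result's URL is available as a string (the names below are hypothetical, not pylinkvalidator's actual API):

import re

# Hypothetical patterns an "--ignore-regex" style flag might collect.
IGNORED_PATTERNS = [re.compile(r"/xmlrpc\.php$")]

def is_ignored(url):
    # True if the URL matches any of the ignore patterns.
    return any(pattern.search(url) for pattern in IGNORED_PATTERNS)

is_ignored("http://my.domain/xmlrpc.php")  # True: filtered out
is_ignored("http://my.domain/about/")      # False: kept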

Ignore Telephone Links

Is there a way to enable the linkchecker to ignore telephone links? For a site with the following link:

<a href="tel:18002524793"><span>Assisted Living<br>Sales Office</span>1-800-252-4793</a>

The linkchecker attempts to crawl http://www.theosborn.org/tel:18006732926, which returns 404. The sites my company runs have multiple telephone links. This site in particular has 6 telephone links in a sidebar that renders on every single page, which results in quite a few false positives:

ERROR Crawled 1049 urls with 504 error(s) in 126.18 seconds

URL query parameters are not escaped

Thank you for making this tool. It's very helpful.
It seems that it doesn't support links with whitespace in query parameters:
e.g.
https://twitter.com/intent/tweet?text=This is a text with spaces and a linkt to https://www.example.com
returns 400.
Maybe we could have a flag to toggle URL query parameter escaping?
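
Percent-encoding the query before the request would avoid the 400. A minimal sketch of what such a flag might do, using only the standard library (the safe= set below is an assumption about which delimiters to preserve):

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def escape_url(url):
    # Encode spaces and other unsafe characters; safe= lists the URL
    # delimiters that must NOT be escaped.
    return quote(url, safe=":/?&=#%+@")

escape_url("https://twitter.com/intent/tweet?text=This is a text")
# -> 'https://twitter.com/intent/tweet?text=This%20is%20a%20text'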

Installation does not work with Python 3.8

I had to manually delete Python 2 code containing print statements; afterwards, python3 setup.py install worked for me.

Would you accept a PR to remove Python 2 compatibility (Python 2 is EOL)?

follows tel: links but shouldn't

Links beginning with "tel:" should be skipped.

source file:

<a href="tel:0033111111111">+ 33 <u>(0)1 11 11 11 11</u></a>

output

not found (404): http://localhost:8000/tel:0033111111111
    from http://localhost:8000/

I didn't try "mailto:"; it might be affected as well.
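
Both this issue and "Ignore Telephone Links" above would be covered by skipping hrefs with non-HTTP schemes before they are joined onto the base URL. A minimal sketch (should_crawl is a hypothetical name):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hrefs like "tel:..." and "mailto:..." carry their own scheme and must
# never be treated as relative paths.
SKIPPED_SCHEMES = ("tel", "mailto", "javascript")

def should_crawl(href):
    return urlparse(href).scheme not in SKIPPED_SCHEMES

should_crawl("tel:0033111111111")  # False: skipped
should_crawl("/contact/")          # True: relative path, no scheme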

UnicodeEncodeError

Making progress on the CSV export, but while testing plain text output I ran into this issue scanning our site:

error (<type 'exceptions.UnicodeEncodeError'>): 'ascii' codec can't encode 
character u'\u2019' in position 58: ordinal not in range(128): 
http://my.private.url/about/press-releases/joe-smith-named-sunshine’s-officer

I can run print u'\u2019' and it correctly prints the single quote in my terminal.

Digging through the code, I found the place in crawler.py where it appears to handle the exception, but I'm not sure how to fix it (I'm new to Python):

# Something bad happened when opening the url
exception = ExceptionStr(
    unicode(type(response.exception)),
    unicode(response.exception))
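
The crash most likely happens later, when this unicode message is written to a byte stream whose default codec is ASCII (printing u'\u2019' alone works because the terminal's encoding is used there). A hedged sketch of one possible Python 2 fix, encoding explicitly before printing:

import sys

def safe_print(text):
    # Encode unicode to the terminal's encoding (falling back to UTF-8)
    # instead of letting Python 2 implicitly encode with ASCII.
    if isinstance(text, unicode):
        text = text.encode(sys.stdout.encoding or "utf-8", "replace")
    print(text)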

csv format

It looks like the original author had ideas for output formats other than plain text. I see HTML as one format in the code.

I was curious how hard it would be to add CSV. It appears I could copy _write_plain_text_report in reporter.py and tweak it?

I'm tinkering with the code now, and if I come up with anything I will send it back.
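
The stdlib csv module wrapped in the same reporter shape could be a starting point. A rough sketch only; result.url and result.status are hypothetical stand-ins for whatever _write_plain_text_report actually iterates over:

import csv

def write_csv_report(results, path):
    # Python 2 csv wants binary mode; Python 3 would use
    # open(path, "w", newline="").
    with open(path, "wb") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        for result in results:
            writer.writerow([result.url, result.status])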

allow arbitrary header

E.g., change the user agent, add a custom header to bypass custom authentication, or provide an OAuth2 token.
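
Internally this could map onto the standard library's request object, which already accepts arbitrary headers. A minimal sketch of what such a flag might pass through (the header values are examples only):

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

request = Request("http://example.com/", headers={
    "User-Agent": "my-crawler/1.0",     # custom user agent
    "Authorization": "Bearer <token>",  # e.g. an OAuth2 token
})
response = urlopen(request)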

How to see 302 redirected page

I am crawling a website to find all the pages that 404, but the site redirects its 404s to a pretty 'sorry for 404' page (302). Is there a way to detect links that get redirected like this, i.e., to log the links that get redirected to the pretty 404 page?

I was running a small Python script like this:

import requests

link = 'https://example/1234sdsd'
r = requests.get(link, allow_redirects=False)
print(link, r.status_code, r.headers['Location'])

The printed log looks like this: "https://example/1234sdsd 302 /404.aspx?item=%2f1234sdsd&user=extranet%5cAnonymous&site=website"

I was looking for something like this from the crawler:
"302 - original link (1 of 1669 - 0%)"

Add raw content check

Add the following cumulative options:

--raw-content-include '/path,content to check'
--raw-content-exclude '/path,content to check'

Fetch partial links

Is there any way to check the resources specified as relative links in the page?

Thanks!

limit scanning reqs/second

We run Linkchecker daily and 99% of the time it behaves, but on occasion it seems to run amok and scans a lot of links in a short amount of time. Not sure why this occurs - none of my settings change (it runs via Jenkins).

I was thinking of adding something like wget's '--wait' flag to limit the requests made. Any thoughts on where the best place to do this would be?

I will take a stab at it and submit a pull-request when complete.

Thanks!
jim
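
A thread-safe sleep between fetches is probably the smallest change that would work. A sketch of a wget-style --wait, assuming each worker calls throttle() right before its request (MIN_INTERVAL and throttle are hypothetical names, not pylinkvalidator's API):

import threading
import time

MIN_INTERVAL = 1.0  # seconds between requests; would come from a --wait flag
_lock = threading.Lock()
_last_request = [0.0]

def throttle():
    # Serialize workers so requests are spaced at least MIN_INTERVAL apart.
    with _lock:
        wait = MIN_INTERVAL - (time.time() - _last_request[0])
        if wait > 0:
            time.sleep(wait)
        _last_request[0] = time.time()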

Validate website locally

Is it possible to validate a static website that I have on my local computer?

I naively tried pointing at an index.html file, but that doesn't work:

$ pylinkvalidate.py docs/_build/html/index.html 
ERROR Crawled 1 urls with 1 error(s) in 0.01 seconds

  Start URL(s): http://docs/_build/html/index.html

  error (<class 'urllib.error.URLError'>): <urlopen error [Errno 8] nodename nor servname provided, or not known>: http://docs/_build/html/index.html

If not, would this be a common enough case to add an option for, or to mention in the README how it can be done?
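
One workaround that should already work is to serve the build directory over HTTP and point the crawler at localhost, e.g. with the standard library's built-in server (Python 3 shown):

$ cd docs/_build/html
$ python -m http.server 8000 &
$ pylinkvalidate.py http://localhost:8000/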

The URL must not be empty

Trying to run:

pylinkvalidate.py -P "https://eai.company/"
or
pylinkvalidate.py -P "https://enlightenment.ai/"

results in:

The URL must not be empty: https://eai.company/
or
The URL must not be empty: https://enlightenment.ai/

I'm not sure what is happening. It says the URL is empty, but it only happens with some domains for some reason.

validate content

Add two cumulative options:

--check-html-presence '<tag attr1="value1" attr2="value2">string</tag>'
--check-html-absence '<tag attr1="value1" attr2="value2">string</tag>'

Report an error if the specified tag is missing from a page (for --check-html-presence) or present on a page (for --check-html-absence).
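
A hedged sketch of what such a check might look like with BeautifulSoup (which the project already uses for parsing); the function name and exact matching behavior are illustrative assumptions:

from bs4 import BeautifulSoup

def html_contains(html, tag, attrs, text):
    # True if a tag with the given attributes and string is present.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(tag, attrs=attrs, string=text) is not None

html_contains('<p class="intro">hello</p>', "p", {"class": "intro"}, "hello")  # True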

speed

Any suggestions for how I could speed this up? Anything I could optimize code-wise?

Currently I'm scanning about 250,000 urls and with 10 workers it takes about 12-13 hrs :)

I know I can increase workers, but then I increase traffic, which I can't do.

urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]

Just installed and my first test says this. Tried this as http & https with the exact same response.

$ pylinkvalidate.py --progress http://www.cnn.com/
Starting crawl...
error - http://www.cnn.com/ (1 of 1 - 100%)
Crawling Done...

ERROR Crawled 1 urls with 1 error(s) in 0.11 seconds

  Start URL(s): http://www.cnn.com/

  error (<class 'urllib.error.URLError'>): <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)>: http://www.cnn.com/


# A little ENV info on my OSX machine.
$ which python
/usr/local/bin/python

$ python --version
Python 3.6.4
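
This usually means the Python installation can't locate root certificates, rather than anything specific to the target site. If this Python came from the python.org installer on macOS, running its bundled certificate script often fixes it (the path varies by Python version); upgrading the certifi package is another common remedy:

$ /Applications/Python\ 3.6/Install\ Certificates.command
$ # or
$ pip install --upgrade certifi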

allow-insecure-content not an option

Code:
import bs4
import pylinkvalidator.api
from pylinkvalidator.api import crawl_with_options as crawl_opts

crawled_site = crawl_opts(
    ["https://mysite.net/"],
    {"run-once": True, "progress": True, "console": True,
     "show-source": True, "allow-insecure-content": True,
     "parser": "lxml"})

returns (with IPython on Python 3.7):

Usage: ipykernel_launcher.py [options] URL ...
ipykernel_launcher.py: error: no such option: --allow-insecure-content
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2

Link Depth

Do you think it'd be possible to have an additional parameter to define the link depth the crawler goes to?
For example, --depth=2 would only follow links 2 pages deep.
It could be useful if there was a page ready with all the URLs one wants to crawl, or to do a quicker pass on a big website.
