
bartdag / pylinkvalidator

142 stars, 10 watchers, 36 forks, 126 KB

pylinkvalidator is a standalone, pure-Python link validator and crawler that traverses a web site and reports any errors (e.g., 500 and 404) it encounters.

License: Other

Python 99.13% CSS 0.01% JavaScript 0.01% HTML 0.84%
Topics: python, networking, link-checker, crawler

pylinkvalidator's People

Contributors

arunelias, bartdag


pylinkvalidator's Issues

Invalid IPv6 URL

When checking some URLs I get the following error:

error (<type 'exceptions.ValueError'>): Invalid IPv6 URL: 

Even though the URL is not formatted unusually.

Scan http://verticalindustriesblog.redhat.com/ with depth=1 for some examples.

I may modify my fork to just ignore this error, but I'm not sure there is a correct way to 'fix' it. From what I can find online, it seems to be an issue with Python 2.7.x.

I see it both on 2.7.5 and 2.7.10.
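
For reference, this ValueError comes from the standard library: Python 2.7's urlparse raises "Invalid IPv6 URL" whenever a square bracket in the host part is unbalanced, which a single malformed href can trigger. A minimal reproduction and a hedged sketch of the ignore-and-skip approach (parse_or_skip is a hypothetical name, not pylinkvalidator's API):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

def parse_or_skip(url):
    # An unbalanced bracket in the netloc makes urlparse raise
    # ValueError("Invalid IPv6 URL") on Python 2.7.
    try:
        return urlparse(url)
    except ValueError:
        return None  # record the bad URL and keep crawling

parse_or_skip("http://[example.com/")  # None instead of a crash
parse_or_skip("http://example.com/")   # normal parse result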

filter by file

There is IGNORED_PREFIXES but I need to filter a particular file out of the results...

All our WordPress blogs include http://my.domain/xmlrpc.php which always returns a 405.

Would it be possible to add another flag, or maybe make the current "ignore" more flexible (regex, maybe)?

I'm going to dig into the code but figured you may have a quick solution.
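
A regex-based ignore could sit as a thin filter in front of the reporter. A minimal sketch, assuming each result's URL is available as a string (the names below are hypothetical, not pylinkvalidator's actual API):

import re

# Hypothetical patterns an "--ignore-regex" style flag might collect.
IGNORED_PATTERNS = [re.compile(r"/xmlrpc\.php$")]

def is_ignored(url):
    # True if the URL matches any of the ignore patterns.
    return any(pattern.search(url) for pattern in IGNORED_PATTERNS)

is_ignored("http://my.domain/xmlrpc.php")  # True: filtered out
is_ignored("http://my.domain/about/")      # False: kept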

Ignore Telephone Links

Is there a way to enable the linkchecker to ignore telephone links? For a site with the following link:

<a href="tel:18002524793"><span>Assisted Living<br>Sales Office</span>1-800-252-4793</a>

The linkchecker attempts to crawl http://www.theosborn.org/tel:18006732926, which returns 404. The sites my company runs have multiple telephone links. This site in particular has 6 telephone links in a sidebar that renders on every single page, which results in quite a few false positives:

ERROR Crawled 1049 urls with 504 error(s) in 126.18 seconds

URL query parameters are not escaped

Thank you for making this tool. It's very helpful.
It seems that it doesn't support links with whitespace in query parameters:
e.g.
https://twitter.com/intent/tweet?text=This is a text with spaces and a linkt to https://www.example.com
returns 400.
Maybe we could have a flag to toggle URL query parameter escaping?
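
Percent-encoding the query before the request would avoid the 400. A minimal sketch of what such a flag might do, using only the standard library (the safe= set below is an assumption about which delimiters to preserve):

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def escape_url(url):
    # Encode spaces and other unsafe characters; safe= lists the URL
    # delimiters that must NOT be escaped.
    return quote(url, safe=":/?&=#%+@")

escape_url("https://twitter.com/intent/tweet?text=This is a text")
# -> 'https://twitter.com/intent/tweet?text=This%20is%20a%20text'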

Installation does not work with Python 3.8

I had to manually delete Python 2 code containing print statements; afterwards, python3 setup.py install worked for me.

Would you accept a PR to remove Python 2 compatibility (Python 2 is EOL)?

follows tel: links but shouldn't

Links beginning with "tel:" should be skipped.

source file:

<a href="tel:0033111111111">+ 33 <u>(0)1 11 11 11 11</u></a>

output

not found (404): http://localhost:8000/tel:0033111111111
    from http://localhost:8000/

I didn't try "mailto:"; it might be affected as well.
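
Both this issue and "Ignore Telephone Links" above would be covered by skipping hrefs with non-HTTP schemes before they are joined onto the base URL. A minimal sketch (should_crawl is a hypothetical name):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hrefs like "tel:..." and "mailto:..." carry their own scheme and must
# never be treated as relative paths.
SKIPPED_SCHEMES = ("tel", "mailto", "javascript")

def should_crawl(href):
    return urlparse(href).scheme not in SKIPPED_SCHEMES

should_crawl("tel:0033111111111")  # False: skipped
should_crawl("/contact/")          # True: relative path, no scheme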

UnicodeEncodeError

Making progress on the CSV export, but while testing plain text output I ran into this issue scanning our site:

error (<type 'exceptions.UnicodeEncodeError'>): 'ascii' codec can't encode 
character u'\u2019' in position 58: ordinal not in range(128): 
http://my.private.url/about/press-releases/joe-smith-named-sunshine’s-officer

I can run print u'\u2019' and it correctly prints the single quote in my terminal.

Digging through the code, I found the place in crawler.py where it appears to handle the exception, but I'm not sure how to fix it (I'm new to Python):

# Something bad happened when opening the url
exception = ExceptionStr(
    unicode(type(response.exception)),
    unicode(response.exception))
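
The crash most likely happens later, when this unicode message is written to a byte stream whose default codec is ASCII (printing u'\u2019' alone works because the terminal's encoding is used there). A hedged sketch of one possible Python 2 fix, encoding explicitly before printing:

import sys

def safe_print(text):
    # Encode unicode to the terminal's encoding (falling back to UTF-8)
    # instead of letting Python 2 implicitly encode with ASCII.
    if isinstance(text, unicode):
        text = text.encode(sys.stdout.encoding or "utf-8", "replace")
    print(text)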

csv format

It looks like the original author had ideas for output formats other than plain text. I see HTML as one format in the code.

I was curious how hard it would be to add CSV. It appears I could copy _write_plain_text_report in reporter.py and tweak it?

I'm tinkering with the code now, and if I come up with anything I will send it back.
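
The stdlib csv module wrapped in the same reporter shape could be a starting point. A rough sketch only; result.url and result.status are hypothetical stand-ins for whatever _write_plain_text_report actually iterates over:

import csv

def write_csv_report(results, path):
    # Python 2 csv wants binary mode; Python 3 would use
    # open(path, "w", newline="").
    with open(path, "wb") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        for result in results:
            writer.writerow([result.url, result.status])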

allow arbitrary header

E.g., change the user agent, add a custom header to bypass custom authentication, or provide an OAuth2 token.
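
Internally this could map onto the standard library's request object, which already accepts arbitrary headers. A minimal sketch of what such a flag might pass through (the header values are examples only):

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

request = Request("http://example.com/", headers={
    "User-Agent": "my-crawler/1.0",     # custom user agent
    "Authorization": "Bearer <token>",  # e.g. an OAuth2 token
})
response = urlopen(request)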

How to see 302 redirected page

I am crawling a website to find all the pages that 404, but the site redirects its 404s to a pretty 'sorry for 404' page (302). Is there a way to detect links that get redirected like this, i.e., to log the links that get redirected to the pretty 404 page?

I was running a small Python script like this:

import requests

link = 'https://example/1234sdsd'
r = requests.get(link, allow_redirects=False)
print(link, r.status_code, r.headers['Location'])

The printed log looks like this: "https://example/1234sdsd 302 /404.aspx?item=%2f1234sdsd&user=extranet%5cAnonymous&site=website"

I was looking for something like this from the crawler:
"302 - original link (1 of 1669 - 0%)"

Add raw content check

Add the following cumulative options:

--raw-content-include '/path,content to check'
--raw-content-exclude '/path,content to check'

Fetch partial links

Is there any way to check the resources specified as relative links in the page?

Thanks!

limit scanning reqs/second

We run Linkchecker daily and 99% of the time it behaves, but on occasion it seems to run amok and scans a lot of links in a short amount of time. Not sure why this occurs - none of my settings change (it runs via Jenkins).

I was thinking of adding something like wget's '--wait' flag to limit the requests made. Any thoughts on where the best place to do this would be?

I will take a stab at it and submit a pull-request when complete.

Thanks!
jim
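
A thread-safe sleep between fetches is probably the smallest change that would work. A sketch of a wget-style --wait, assuming each worker calls throttle() right before its request (MIN_INTERVAL and throttle are hypothetical names, not pylinkvalidator's API):

import threading
import time

MIN_INTERVAL = 1.0  # seconds between requests; would come from a --wait flag
_lock = threading.Lock()
_last_request = [0.0]

def throttle():
    # Serialize workers so requests are spaced at least MIN_INTERVAL apart.
    with _lock:
        wait = MIN_INTERVAL - (time.time() - _last_request[0])
        if wait > 0:
            time.sleep(wait)
        _last_request[0] = time.time()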

Validate website locally

Is it possible to validate a static website that I have on my local computer?

I naively tried pointing at an index.html file, but that doesn't work:

$ pylinkvalidate.py docs/_build/html/index.html 
ERROR Crawled 1 urls with 1 error(s) in 0.01 seconds

  Start URL(s): http://docs/_build/html/index.html

  error (<class 'urllib.error.URLError'>): <urlopen error [Errno 8] nodename nor servname provided, or not known>: http://docs/_build/html/index.html

If not, would this be a common enough case to add an option for, or to mention in the README how it can be done?
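
One workaround that should already work is to serve the build directory over HTTP and point the crawler at localhost, e.g. with the standard library's built-in server (Python 3 shown):

$ cd docs/_build/html
$ python -m http.server 8000 &
$ pylinkvalidate.py http://localhost:8000/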

The URL must not be empty

Trying to run:

pylinkvalidate.py -P "https://eai.company/"
or
pylinkvalidate.py -P "https://enlightenment.ai/"

results in:

The URL must not be empty: https://eai.company/
or
The URL must not be empty: https://enlightenment.ai/

I'm not sure what is happening. It says the URL is empty, but it only happens with some domains for some reason.

validate content

Add two cumulative options:

--check-html-presence '<tag attr1="value1" attr2="value2">string</tag>'
--check-html-absence '<tag attr1="value1" attr2="value2">string</tag>'

Report an error if the specified tag is missing from a page (for --check-html-presence) or present on a page (for --check-html-absence).
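
A hedged sketch of what such a check might look like with BeautifulSoup (which the project already uses for parsing); the function name and exact matching behavior are illustrative assumptions:

from bs4 import BeautifulSoup

def html_contains(html, tag, attrs, text):
    # True if a tag with the given attributes and string is present.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(tag, attrs=attrs, string=text) is not None

html_contains('<p class="intro">hello</p>', "p", {"class": "intro"}, "hello")  # True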

speed

Any suggestions for how I could speed this up? Anything I could optimize code-wise?

Currently I'm scanning about 250,000 urls and with 10 workers it takes about 12-13 hrs :)

I know I can increase workers, but then I increase traffic, which I can't do.

urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]

Just installed and my first test says this. Tried this as http & https with the exact same response.

$ pylinkvalidate.py --progress http://www.cnn.com/
Starting crawl...
error - http://www.cnn.com/ (1 of 1 - 100%)
Crawling Done...

ERROR Crawled 1 urls with 1 error(s) in 0.11 seconds

  Start URL(s): http://www.cnn.com/

  error (<class 'urllib.error.URLError'>): <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)>: http://www.cnn.com/


# A little ENV info on my OSX machine.
$ which python
/usr/local/bin/python

$ python --version
Python 3.6.4
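
This usually means the Python installation can't locate root certificates, rather than anything specific to the target site. If this Python came from the python.org installer on macOS, running its bundled certificate script often fixes it (the path varies by Python version); upgrading the certifi package is another common remedy:

$ /Applications/Python\ 3.6/Install\ Certificates.command
$ # or
$ pip install --upgrade certifi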

allow-insecure-content not an option

Code:
import bs4
import pylinkvalidator.api
from pylinkvalidator.api import crawl_with_options as crawl_opts

crawled_site = crawl_opts(
    ["https://mysite.net/"],
    {"run-once": True, "progress": True, "console": True,
     "show-source": True, "allow-insecure-content": True,
     "parser": "lxml"})

returns (with IPython on Python 3.7):

Usage: ipykernel_launcher.py [options] URL ...
ipykernel_launcher.py: error: no such option: --allow-insecure-content
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2

Link Depth

Do you think it'd be possible to have an additional parameter to define the link depth the crawler goes to?
For example, --depth=2 would only follow links 2 pages deep.
It could be useful if there was a page ready with all the URLs one wants to crawl, or to do a quicker pass on a big website.
