cartman720 / pysitemap

Simple sitemap generator with Python 3

License: MIT License

Python 100.00%

python3 python script sitemap-generator sitemap-xml crawler python-script sitemap generator

pysitemap's Introduction

PySitemap

Simple sitemap generator with Python 3

Description

This is a simple, easy-to-use sitemap generator written in Python that helps you create a sitemap of your website for SEO and other purposes.

Options

Run the following command and the program will create sitemap.xml with the links crawled from the URL passed in the --url option.

python main.py --url="https://www.finstead.com"

If you want a custom path for the sitemap file, add the --output option as shown below.

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml"

By default the program prints the URLs it parses to the console. To run silently, add the --no-verbose option.

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml" --no-verbose

To keep the crawler from visiting certain URLs, exclude them with a regex pattern using the --exclude option. The command below excludes PNG and JPG files.

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml" --exclude="\.jpg|\.png"
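How the crawler applies the pattern internally is not shown here, so the following is only a sketch of what such a filter could look like, assuming the pattern is searched anywhere in the URL. The helper name is_excluded and the sample URLs are hypothetical and not part of PySitemap.

```python
import re

# Same pattern as in the command above: matches ".jpg" or ".png" anywhere in the URL.
EXCLUDE = re.compile(r"\.jpg|\.png")


def is_excluded(url, pattern=EXCLUDE):
    """Return True if the URL matches the exclude pattern (hypothetical helper)."""
    return pattern.search(url) is not None


# Sample URLs made up for illustration.
urls = [
    "https://www.finstead.com/about",
    "https://www.finstead.com/static/logo.png",
    "https://www.finstead.com/photos/team.jpg?size=large",
]

# Keep only the URLs the pattern does not match.
kept = [u for u in urls if not is_excluded(u)]
print(kept)
```

Because the pattern is unanchored, it also matches URLs that carry a query string after the extension. Anchoring it with $ (for example \.jpg$|\.png$) would restrict the match to URLs that end in those extensions.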

pysitemap's Issues

Not found

I get this:

Parsing http://canterano.somenxavier.xyz/trobar-mesures-figures-vnps/figures-1.ggb
Traceback (most recent call last):
  File "main.py", line 21, in <module>
    links = crawler.start()
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 17, in start
    self.crawl(self.url)
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  [Previous line repeated 20 more times]
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 26, in crawl
    response = urllib.request.urlopen(url)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

after running python main.py --url="http://canterano.somenxavier.xyz"

urllib.error.HTTPError: HTTP Error 400: Bad Request

PySitemap/crawler.py

Line 26 in 61c728f

response = urllib.request.urlopen(url)

Putting this under a try block should stop the script from crashing
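A minimal sketch of that suggestion, assuming the crawler only needs to skip URLs that fail. The function name fetch and its opener parameter are hypothetical and exist only to keep the example self-contained; they are not taken from crawler.py.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the response for url, or None if the request fails (hypothetical helper)."""
    try:
        return opener(url)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (400, 403, 404, ...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Skipping {url}: HTTP {e.code} {e.reason}")
    except urllib.error.URLError as e:
        # No usable answer at all (DNS failure, refused connection, SSL problem).
        print(f"Skipping {url}: {e.reason}")
    return None
```

The caller can then check for None and move on to the next link instead of letting one bad URL end the whole crawl.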

TypeError: __init__() got an unexpected keyword argument 'exclude'

Hi, tell me what is the problem? Thx
line 17, in <module>
crawler = Crawler(url, exclude=args.exclude, no_verbose=args.no_verbose);
TypeError: __init__() got an unexpected keyword argument 'exclude'

Hello. There seems to be a limit on URL length somewhere: URLs longer than 300 are, for some reason, not included in the sitemap. I should add that they are in Cyrillic, that is, the characters are replaced with their HTML equivalents.
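The report does not say which encoding is involved. If the URLs are percent-encoded (an assumption), the sketch below shows how much longer a Cyrillic path becomes once encoded, which may help when trying to reproduce the problem. The sample path is made up for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical Cyrillic path, used only for illustration.
path = "/каталог/товары"

# quote() percent-encodes each non-ASCII character as its UTF-8 bytes;
# "/" is left untouched because it is in the default safe set.
encoded = quote(path)

print(len(path), len(encoded))
print(encoded)

# Decoding restores the original path.
print(unquote(encoded) == path)
```

Each Cyrillic letter occupies two bytes in UTF-8 and every byte becomes three characters (%XX), so an encoded URL can be several times longer than it looks in the browser.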

Escape Characters in URL

Character | Symbol | Escape code
Ampersand | & | &amp;
Single Quote | ' | &apos;
Double Quote | " | &quot;
Greater Than | > | &gt;
Less Than | < | &lt;
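One way to apply the escapes listed above is the standard library's xml.sax.saxutils.escape. This is a sketch rather than the project's own code; escape_url and the sample URL are hypothetical.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are supplied as extra entities.
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_url(url):
    """Escape the five characters from the list above (hypothetical helper)."""
    return escape(url, EXTRA_ENTITIES)


# Sample URL made up for illustration.
print(escape_url("https://www.finstead.com/search?q=a&b='c'"))
```

escape() replaces the ampersand first and the extra entities afterwards, so the & inside &apos; and &quot; is not escaped a second time.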

HTTP Error 403: Forbidden

Traceback (most recent call last):
  File "main.py", line 21, in <module>
    links = crawler.start()
  File "\crawler.py", line 17, in start
    self.crawl(self.url)
  File "\crawler.py", line 26, in crawl
    response = urllib.request.urlopen(url)
  File "\Python36_64\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "\Python36_64\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "\Python36_64\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "\Python36_64\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "\Python36_64\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "\Python36_64\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
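The traceback does not say why the server refused the request. One common cause, which is only an assumption here, is that the server rejects the default User-Agent that urllib sends (Python-urllib/3.x). The sketch below builds a request with an explicit User-Agent; build_request and the header value are hypothetical.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; sitemap-crawler)"):
    """Build a request with an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.finstead.com")

# The request object can be passed to urlopen in place of the bare URL string:
# response = urllib.request.urlopen(req)

# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

If the 403 persists with a browser-like User-Agent, the server is probably blocking the crawl for another reason, and the URL is best skipped.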

Try Except clause prints that all exceptions are 404s not their correct exception type.

In crawler.py, there's this code:

try:
	response = urllib.request.urlopen(url)

except:
	print('404 error')
	return

However, 404 is not the only possible exception that can occur with urllib.request.urlopen().

Solution:

try:
	response = urllib.request.urlopen(url)

except Exception as e:
	print(repr(e))
	return

This prints the correct exception message and avoids user confusion, as in the case of the closed issue about 404 errors for sites that exist, which can be due to other errors, such as an SSL certificate issue.

404 Error

Valid websites keep showing a 404 error, so I decided to print out the exception and got this