
pysitemap's Introduction

PySitemap

Simple sitemap generator with Python 3

Description

This is a simple, easy-to-use sitemap generator written in Python. It helps you create a sitemap of your website for SEO and other purposes.

Options

Run the command below and the program will create sitemap.xml with the links found from the URL given in the --url option:

python main.py --url="https://www.finstead.com"

If you want a custom path for the sitemap file, add the --output option as shown below:

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml"

By default the program prints the URLs it is parsing to the console. To run silently, add the --no-verbose option.

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml" --no-verbose

To keep the crawler from visiting certain URLs, exclude them with a regex pattern using the --exclude option. The command below excludes png and jpg files.

python main.py --url="https://www.finstead.com" --output="/custom/path/sitemap.xml" --exclude="\.jpg|\.png"
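As a rough illustration of how such a pattern behaves, here is a minimal sketch. It assumes the crawler tests the pattern with re.search against each URL, which the README does not confirm; is_excluded is a hypothetical helper, not part of PySitemap.

```python
import re

# Pattern taken from the --exclude example above.
EXCLUDE_PATTERN = r"\.jpg|\.png"

def is_excluded(url, pattern=EXCLUDE_PATTERN):
    """Return True if the pattern matches anywhere in the URL (hypothetical helper)."""
    return re.search(pattern, url) is not None

urls = [
    "https://www.finstead.com/about",
    "https://www.finstead.com/logo.png",
    "https://www.finstead.com/photos/team.jpg",
]

# Keep only the URLs that the pattern does not match.
kept = [u for u in urls if not is_excluded(u)]
print(kept)
```

Because the pattern is unanchored, it also matches ".png" or ".jpg" in the middle of a URL. If only file extensions should be excluded, a pattern such as "\.jpg$|\.png$" may be closer to the intent.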

pysitemap's People

Contributors

cartman720, ghaseminya

pysitemap's Issues

Not found

I get this:

Parsing http://canterano.somenxavier.xyz/trobar-mesures-figures-vnps/figures-1.ggb
Traceback (most recent call last):
  File "main.py", line 21, in <module>
    links = crawler.start()
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 17, in start
    self.crawl(self.url)
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 49, in crawl
    self.crawl(urljoin(self.url, link))
  [Previous line repeated 20 more times]
  File "/home/xan/Baixades/sitemap/PySitemap-master/crawler.py", line 26, in crawl
    response = urllib.request.urlopen(url)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This happens after running python main.py --url="http://canterano.somenxavier.xyz"

Very long URLs

Hello. There seems to be a limit on URL length somewhere: URLs longer than 300 characters are, for some reason, not included in the sitemap. Note that these URLs are in Cyrillic, so their characters are replaced with HTML equivalents.
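The report does not say where the limit comes from. One thing that is easy to check is how much a Cyrillic URL grows once encoded. The sketch below assumes the "equivalents" are percent-encoded UTF-8 bytes, which is only a guess, and uses a made-up path.

```python
from urllib.parse import quote

path = "/новости/привет"  # hypothetical Cyrillic path, 15 characters
encoded = quote(path)      # "/" is left unescaped by default

# Each Cyrillic letter is 2 bytes in UTF-8 and each byte becomes "%XX".
print(encoded)
print(len(path), len(encoded))
```

Under that assumption every Cyrillic letter takes six characters after encoding, so a URL that looks short in the browser can exceed 300 characters in its encoded form.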

Escape Characters in URL

Ampersand | & | &amp;
Single Quote | ' | &apos;
Double Quote | " | &quot;
Greater Than | > | &gt;
Less Than | < | &lt;
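A minimal sketch of applying these five escapes with the Python standard library before writing a URL into the sitemap. escape_loc is a hypothetical helper, and whether PySitemap already escapes URLs somewhere is not stated in this issue.

```python
from xml.sax.saxutils import escape

def escape_loc(url):
    """Escape the five characters listed above (hypothetical helper)."""
    # escape() handles &, < and > itself; the two quote entities are passed as extras.
    return escape(url, {"'": "&apos;", '"': "&quot;"})

print(escape_loc("https://www.finstead.com/search?q=a&lang=en"))
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.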

HTTP Error 403: Forbidden

Traceback (most recent call last):
  File "main.py", line 21, in <module>
    links = crawler.start()
  File "\crawler.py", line 17, in start
    self.crawl(self.url)
  File "\crawler.py", line 26, in crawl
    response = urllib.request.urlopen(url)
  File "\Python36_64\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "\Python36_64\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "\Python36_64\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "\Python36_64\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "\Python36_64\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "\Python36_64\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
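The issue gives no cause for the 403. One possibility, stated here only as an assumption, is that the server rejects the default Python-urllib user agent. A minimal sketch of sending an explicit User-Agent header follows; build_request and the user agent string are hypothetical.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0 (compatible; PySitemap)"):
    """Build a Request carrying an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("https://www.finstead.com")

# Request stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))

# response = urllib.request.urlopen(req)  # network call, left commented out
```

If the server still answers 403 with a browser-like user agent, the block is probably based on something else, and the crawler may simply need to skip that URL.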

Try/except clause reports every exception as a 404 instead of its actual type

In crawler.py, there's this code:

try:
	response = urllib.request.urlopen(url)

except:
	print('404 error')
	return

However, 404 is not the only possible exception that can occur with urllib.request.urlopen().

Solution:

try:
	response = urllib.request.urlopen(url)
except Exception as exc:
	# Print the exception instance that was raised, not the Exception class itself
	print(repr(exc))
	return

This prints the correct exception message and avoids user confusion, as in the case of the closed issue about 404 errors for sites that exist, which can be due to other errors, such as an SSL certificate issue.
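Going one step further, the two common failure types can be reported separately. This is a minimal sketch, not the project's code: fetch and its opener parameter are hypothetical, with opener present only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen):
    """Return the response, or None after reporting the failure (hypothetical helper)."""
    try:
        return opener(url)
    except urllib.error.HTTPError as exc:
        # The server answered with an error status such as 403 or 404.
        # HTTPError subclasses URLError, so it has to be caught first.
        print(f"HTTP {exc.code} for {url}: {exc.reason}")
    except urllib.error.URLError as exc:
        # No HTTP response at all: DNS failure, refused connection, SSL error.
        print(f"Request failed for {url}: {exc.reason}")
    return None
```

With this split, a certificate problem is reported as a failed request rather than being mislabeled as a 404.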

404 Error

Valid websites keep showing a 404 error, so I decided to print out the exception and got this:

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.finstead.com'. (_ssl.c:1129)>
