jirkapinkas / jsitemapgenerator

Java sitemap generator. This library generates web sitemaps and can also ping Google, generate RSS feeds and robots.txt files, and more, through a friendly, easy-to-use Java 8 functional-style API.

License: MIT License

Java 100.00%
java web-sitemap java-sitemap-generator sitemap rss-generator robots-txt rss robots-generator java-8 java-8-lambda

jsitemapgenerator's Introduction

Java sitemap generator

This library generates a web sitemap and can ping Google to notify it that the sitemap has changed. It can also generate an RSS feed and robots.txt. It has a friendly, easy-to-use Java 8 functional API and is AWS Lambda friendly.

Typical usage:

Add this library to classpath:

<dependency>
  <groupId>cz.jiripinkas</groupId>
  <artifactId>jsitemapgenerator</artifactId>
  <version>4.5</version>
</dependency>

If you want to use "ping google / bing" functionality, also add this library to classpath:

<dependency>
    <groupId>com.squareup.okhttp3</groupId>
    <artifactId>okhttp</artifactId>
    <version>4.2.2</version> <!-- latest version should be fine, get latest version from https://javalibs.com/artifact/com.squareup.okhttp3/okhttp -->
</dependency>

Typical usage (web sitemap):

String sitemap = SitemapGenerator.of("https://example.com")
    .addPage("foo2.html") // simplest way to add a page - shorthand for addPage(WebPage.of("foo2.html"))
    .addPage(WebPage.of("foo1.html")) // same as addPage("foo1.html")
    .addPage(WebPage.builder().name("bar.html").build()) // builder is more complex
    .addPage(WebPage.builder().maxPriorityRoot().build()) // builder has lots of useful methods
    .toString();

or sitemap in gzip format:

byte[] sitemap = SitemapGenerator.of("https://example.com")
    .addPage(WebPage.builder().maxPriorityRoot().build())
    .addPage("foo.html")
    .addPage("bar.html")
    .toGzipByteArray();
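
If you want to inspect gzip output like the byte array above, the JDK's java.util.zip classes can decompress it. The following is only a sketch using the standard library: the GzipRoundTrip class and its method names are hypothetical and are not part of jsitemapgenerator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not part of the library): gzip and gunzip a UTF-8 string.
final class GzipRoundTrip {

    // Compresses the text with gzip, similar in spirit to a gzip terminal operation.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decompresses gzip bytes back into a UTF-8 string.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gz.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In this sketch, gunzip could be applied to the sitemap byte array from the example above to get the XML back as text.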

You can set defaults that apply to the subsequently added WebPages:

String sitemap = SitemapGenerator.of("https://example.com")
    .addPage(WebPage.builder().maxPriorityRoot().build()) // URL will be: "/"
    .defaultExtension("html")
    .defaultDir("dir1")
    .addPage("foo") // URL will be: "dir1/foo.html"
    .addPage("bar") // URL will be: "dir1/bar.html"
    .defaultDir("dir2")
    .addPage("hello") // URL will be: "dir2/hello.html"
    .addPage("yello") // URL will be: "dir2/yello.html"
    // btw. specifying dir and / or extension on WebPage overrides default settings
    .addPage(WebPage.builder().dir("dir3").extension(null).name("test").build()) // "dir3/test"
    .resetDefaultDir() // resets default dir
    .resetDefaultExtension() // resets default extension
    .addPage(WebPage.of("mypage")) // URL will be: "mypage"
    .toString();

or with list of pages:

List<String> pages = Arrays.asList("firstPage", "secondPage", "otherPage");
String sitemap = SitemapGenerator.of("https://example.com")
        .addPage(WebPage.builder().nameRoot().priorityMax().build())
        .defaultDir("dirName")
        .addPages(pages, page -> WebPage.of(page))
        .toString();

or list of pages in complex data type:

class News {
    private String name;
    public News(String name) { this.name = name; }
    public String getName() { return name; }
}
List<News> newsList = Arrays.asList(new News("a"), new News("b"), new News("c"));
String sitemap = SitemapGenerator.of("https://example.com")
        .addPage(WebPage.builder().nameRoot().priorityMax().build())
        .defaultDir("news")
        .addPages(newsList, news -> WebPage.of(news::getName))
        .toString();

or to store it to file & ping Google:

Ping ping = Ping.builder()
        .engines(Ping.SearchEngine.GOOGLE)
        .build();
SitemapGenerator.of("https://example.com")
    .addPage(WebPage.builder().maxPriorityRoot().changeFreqNever().lastModNow().build())
    .addPage("foo.html")
    .addPage("bar.html")
    // generate sitemap and save it to file ./sitemap.xml
    .toFile(Paths.get("sitemap.xml"))
    // inform Google that this sitemap has changed
    .ping(ping) // this requires OkHttp on the classpath
    .callOnSuccess(() -> System.out.println("Pinged Google")) // what will happen on success
    .catchOnFailure(e -> System.out.println("Could not ping Google!")); // what will happen on error

Note: To ping Google / Bing, you can either use the built-in support (requires OkHttp on the classpath) or supply your own HTTP client. Supported HTTP clients: a custom OkHttpClient, CloseableHttpClient (Apache HttpClient), and RestTemplate (from Spring). To use your own HTTP client, call the matching httpClient*() method on PingBuilder and pass in your client instance.

How to create sitemap index:

String sitemapIndex = SitemapIndexGenerator.of("https://javalibs.com")
    .addPage("sitemap-plugins.xml")
    .addPage("sitemap-archetypes.xml")
    .toString();

How to create RSS channel:

... RSS ISN'T sitemap :-), but it's basically just a list of links (like sitemap) and if you need sitemap, then probably you also need RSS. Note: RssGenerator has lots of common methods with SitemapGenerator.

String rss = RssGenerator.of("https://topjavablogs.com", "Top Java Blogs", "Best Java Blogs")
    .addPage(WebPage.rssBuilder()
        .pubDate(LocalDateTime.now())
        .title("News Title")
        .description("News Description")
        .link("page-name")
        .build())
    .toString();

How to create robots.txt:

... robots.txt ISN'T sitemap :-), but inside it you reference your sitemap and if you need sitemap, then you probably need robots.txt as well :-)

String robotsTxt = RobotsTxtGenerator.of("https://example.com")
        .addSitemap("sitemap.xml")
        .addRule(RobotsRule.builder().userAgentAll().allowAll().build())
        .toString();

How to check sitemap:

Best practices & performance

  • SitemapGenerator (and the other Generator classes) are builders, so they are mutable.
  • Sharing a SitemapGenerator as a singleton while calling addPage() and toString() from multiple threads is not advised. SitemapGenerator operations aren't thread-safe (with one exception: SitemapGenerator.of(), which creates a new instance of SitemapGenerator).
  • When you call addPage(), the page is stored in a Map whose key is the page's URL (so a sitemap cannot contain two items with the same URL).
  • The terminal operations toString(), toFile() and toGzipByteArray() generate the final sitemap from that Map, so most of the time spent creating a sitemap goes into the terminal operation.
  • If you need raw speed when serving the sitemap, I suggest that you:
    • either save the sitemap to an external file and then read the data from that file
    • or cache the result of the terminal operation
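
The caching suggestion above could look roughly like the sketch below. It uses only the JDK. CachedSitemap is a hypothetical class, and the Supplier is assumed to wrap a terminal operation such as toString() on a generator that is no longer being modified.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of the library): memoizes the result of an
// expensive terminal operation until invalidate() is called.
final class CachedSitemap {
    // Assumption: wraps something like () -> generator.toString()
    private final Supplier<String> terminalOperation;
    private String cached;

    CachedSitemap(Supplier<String> terminalOperation) {
        this.terminalOperation = terminalOperation;
    }

    // Returns the cached sitemap, generating it on first use.
    synchronized String get() {
        if (cached == null) {
            cached = terminalOperation.get();
        }
        return cached;
    }

    // Drops the cached value, e.g. after pages were added or removed.
    synchronized void invalidate() {
        cached = null;
    }
}
```

Synchronizing both methods keeps the sketch simple. Since the generator itself is not thread-safe, the supplier should not run concurrently with addPage() calls.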

My other projects:

What I used to upload jsitemapgenerator to Maven Central:

jsitemapgenerator's People

Contributors

jfendler, jirkapinkas, luxeon

jsitemapgenerator's Issues

Images are added to existing sitemap incorrectly

Hi,

It looks like image tags are being added to an existing sitemap incorrectly. Specifically, images pertaining to a specific page should be listed inside the url element, as described here:
https://support.google.com/webmasters/answer/178636?hl=en

While your code produces the following XML:

<url>
<loc>http://www.somedomain.com/url/path</loc>
<lastmod>2015-11-25T09:19-05:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>
<image:image>
<image:loc>imageUrl1</image:loc>
</image:image>
<image:image>
<image:loc>imageUrl2</image:loc>
</image:image>
<url>
...
The problematic methods are constructSitemap and constructUrl inside cz.jiripinkas.jsitemapgenerator.generator.SitemapGenerator class.
The simplest way to fix would be to move out.append("</url>\n"); statement from constructUrl method and into constructSitemap, line 52 (i.e. after images elements have been constructed).

Alternatively, constructImage should be called from inside constructUrl(WebPage webPage), before writing out the closing </url> tag.

Very useful project otherwise!

Thanks.
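
The ordering proposed in this issue can be illustrated with a small standalone sketch. UrlWithImages and its urlElement method are hypothetical and are not the library's actual constructUrl or constructImage code. The sketch only shows the image elements being written before the closing url tag, and it does no XML escaping.

```java
import java.util.List;

// Hypothetical illustration (not the library's code): image elements are
// appended before the closing </url> tag so that they are nested inside it.
final class UrlWithImages {

    static String urlElement(String loc, List<String> imageLocs) {
        StringBuilder out = new StringBuilder();
        out.append("<url>\n");
        out.append("<loc>").append(loc).append("</loc>\n");
        for (String imageLoc : imageLocs) {
            out.append("<image:image>\n");
            out.append("<image:loc>").append(imageLoc).append("</image:loc>\n");
            out.append("</image:image>\n");
        }
        // The closing tag is written only after all images have been appended.
        out.append("</url>\n");
        return out.toString();
    }
}
```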

Alternate links support

Support of https://support.google.com/webmasters/answer/189077?hl=en

<url>
    <loc>http://www.example.com/english/page.html</loc>
    <xhtml:link 
               rel="alternate"
               hreflang="de"
               href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="de-ch"
               href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="en"
               href="http://www.example.com/english/page.html"/>
</url>

PR https://github.com/jirkapinkas/jsitemapgenerator/pull/11/files

Automatically include relevant XML namespaces

At the moment, sitemaps with images require a different constructor overload to pass in AdditionalNamespace.IMAGE. It would be better to include the relevant XML namespaces either always or when images are detected in the sitemap.

Base URL for images

Please apply base URL to images as well. While at it, use proper relative URL resolution methods from either built-in Java API or some URL manipulation library. Otherwise absolute URLs (pointing to e.g. a CDN) would be incorrectly prepended with base URL.
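
As one possible reading of this request, the built-in java.net.URI class already implements relative reference resolution. The sketch below is an illustration only. ImageUrlResolver is a hypothetical helper, and it assumes the inputs are syntactically valid URIs (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;

// Hypothetical helper (not part of the library): resolves an image URL
// against the sitemap base URL using the JDK's URI resolution rules.
final class ImageUrlResolver {

    static String resolve(String baseUrl, String imageUrl) {
        // A trailing slash makes the base behave like a directory.
        URI base = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
        // Absolute image URLs (e.g. on a CDN) are returned unchanged by resolve().
        return base.resolve(imageUrl).toString();
    }
}
```

With this approach a relative value such as "images/logo.png" is placed under the base URL, while an absolute CDN URL passes through untouched.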

More fluent methods

Please make the following methods fluent (returning this):

  • WebPage.addImage and WebPage.setImages
  • all setters on Image
  • addPage methods in AbstractGenerator and all derived classes
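
For illustration, a fluent method simply returns this so that calls can be chained. FluentPage below is a hypothetical stand-in and not the library's WebPage class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page object with fluent mutators.
final class FluentPage {
    private final List<String> images = new ArrayList<>();
    private String name;

    // Returns this instead of void, so calls can be chained.
    FluentPage name(String name) {
        this.name = name;
        return this;
    }

    FluentPage addImage(String imageUrl) {
        images.add(imageUrl);
        return this;
    }

    String describe() {
        return name + " " + images;
    }
}
```

A chain such as new FluentPage().name("a.html").addImage("x.png") then replaces several separate statements.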

XML escaping for image tags

This is a followup to #6. Page URLs are now escaped, but it appears that image properties are not. Special characters can appear in image URL, caption, title, license, and possibly geo location. Please add special character escaping there.

Use proper XML library

A parameter separator (&) in a URL is written literally into the output, producing malformed XML. There are likely many other issues with such hand-coded XML serialization. You should use an existing XML library such as jdom2 or even the built-in Java XML API.
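
As a sketch of the built-in option, the JDK's StAX XMLStreamWriter escapes special characters in text content. UrlElementWriter is a hypothetical helper that writes a single url element and is not how the library currently serializes its output.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper (not part of the library): writes <url><loc>...</loc></url>
// with the JDK's StAX writer, which escapes &, < and > in character data.
final class UrlElementWriter {

    static String urlElement(String loc) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("url");
            xml.writeStartElement("loc");
            xml.writeCharacters(loc); // "&" becomes "&amp;" here
            xml.writeEndElement();    // </loc>
            xml.writeEndElement();    // </url>
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }
}
```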

AbstractSitemapGenerator.toFile requires directory tree

Currently, calling AbstractSitemapGenerator.toFile(file) with a file path whose parent directory does not exist results in an exception, so the user must manually create the parent directory with file.getParentFile().mkdirs() before calling this method.

This check should be embedded inside the method to be user-friendly. For example, Apache Commons IO uses:

    public static FileOutputStream openOutputStream(final File file, final boolean append) throws IOException {
        if (file.exists()) {
            if (file.isDirectory()) {
                throw new IOException("File '" + file + "' exists but is a directory");
            }
            if (file.canWrite() == false) {
                throw new IOException("File '" + file + "' cannot be written to");
            }
        } else {
            final File parent = file.getParentFile();
            if (parent != null) {
                if (!parent.mkdirs() && !parent.isDirectory()) {
                    throw new IOException("Directory '" + parent + "' could not be created");
                }
            }
        }
        return new FileOutputStream(file, append);
    }
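
With java.nio the same idea fits in a few lines, because Files.createDirectories creates all missing parents and does nothing if they already exist. This is only a sketch: SitemapFileWriter is a hypothetical helper and not the library's toFile implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of the library): writes content to a file,
// creating any missing parent directories first.
final class SitemapFileWriter {

    static void write(Path file, String content) {
        try {
            Path parent = file.toAbsolutePath().getParent();
            if (parent != null) {
                // No-op when the directory tree already exists.
                Files.createDirectories(parent);
            }
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```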

Sitemap url double slash

RobotsTxtGenerator automatically adds "/" to the baseUrl, but doesn't check if added sitemaps already start with "/" resulting in double slashes: "http://host.com//sitemap.xml".

Thus if using RobotsTxtGenerator with a base url without ending slash ("http://host.com") and adding sitemaps with starting slashes ("/sitemap.xml"), the result will be invalid.
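
One way to avoid the double slash is to normalize both sides before joining. The sketch below illustrates that idea. UrlJoiner is a hypothetical helper and not part of RobotsTxtGenerator.

```java
// Hypothetical helper (not part of the library): joins a base URL and a path
// with exactly one slash between them.
final class UrlJoiner {

    static String join(String baseUrl, String path) {
        // Drop one trailing slash from the base, if present.
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        // Drop one leading slash from the path, if present.
        String relative = path.startsWith("/") ? path.substring(1) : path;
        return base + "/" + relative;
    }
}
```

All four combinations of "http://host.com" or "http://host.com/" with "sitemap.xml" or "/sitemap.xml" then produce "http://host.com/sitemap.xml".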

Reproducible ordering of sitemap entries

Please use URL as the second sort criterion after priority to ensure the sitemap always looks the same regardless of the order in which the pages/images are added to the sitemap.
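
A two-level ordering like this can be expressed with a chained Comparator. The sketch below assumes that higher priority should come first. Entry is a hypothetical stand-in for a sitemap entry and not the library's WebPage class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration (not the library's code): sort by priority
// (highest first), then by URL, so insertion order no longer matters.
final class SitemapOrdering {

    // Hypothetical stand-in for a sitemap entry.
    static final class Entry {
        final String url;
        final double priority;

        Entry(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    static final Comparator<Entry> BY_PRIORITY_THEN_URL =
            Comparator.comparingDouble((Entry e) -> e.priority).reversed()
                    .thenComparing(e -> e.url);

    // Returns the URLs in their deterministic output order.
    static List<String> sortedUrls(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(BY_PRIORITY_THEN_URL);
        List<String> urls = new ArrayList<>();
        for (Entry entry : copy) {
            urls.add(entry.url);
        }
        return urls;
    }
}
```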

IllegalArgumentException: No visible constructors in class cz.jiripinkas.jsitemapgenerator.generator.SitemapGenerator

I've just updated to 4.4 and now I get this error from Spring:

Caused by: org.springframework.aop.framework.AopConfigException: Could not generate CGLIB subclass of class cz.jiripinkas.jsitemapgenerator.generator.SitemapGenerator: Common causes of this problem include using a final class or a non-visible class; nested exception is java.lang.IllegalArgumentException: No visible constructors in class cz.jiripinkas.jsitemapgenerator.generator.SitemapGenerator
	at org.springframework.aop.framework.CglibAopProxy.getProxy(CglibAopProxy.java:208)
	at org.springframework.aop.framework.ProxyFactory.getProxy(ProxyFactory.java:110)
	at org.springframework.aop.scope.ScopedProxyFactoryBean.setBeanFactory(ScopedProxyFactoryBean.java:117)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeAwareMethods(AbstractAutowireCapableBeanFactory.java:1818)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1783)
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:595)
	... 13 more
Caused by: java.lang.IllegalArgumentException: No visible constructors in class cz.jiripinkas.jsitemapgenerator.generator.SitemapGenerator
	at org.springframework.cglib.proxy.Enhancer.filterConstructors(Enhancer.java:760)

I'm creating a SitemapGenerator bean like this:

    @Bean(name = "sitemapGenerator")
    @ConditionalOnMissingBean(name = "sitemapGenerator")
    public SitemapGenerator webSitemapGenerator(
                    @Value("#{jobParameters['" + PlatformCoreSessionConfig.PARAMETER_KEY_SESSION + SiteServiceImpl.SITE_SESSION_ATTR_KEY
                                    + "']}") String siteCode, @Qualifier(SiteRepository.NAME) SiteRepository siteRepository) {
        return SitemapGenerator.of(((SitemapSiteEntityDefinition) siteRepository.findByCode(siteCode)).getSitemapConfig().getBaseUrl());
    }

and apparently spring tries to create a proxy but you don't provide any constructors and spring doesn't know how to make the proxy.

Is it possible to provide a constructor?
