automattic / msm-sitemap

Comprehensive sitemaps for your WordPress VIP site. Joint collaboration between Metro.co.uk, WordPress VIP, Alley Interactive, Maker Media, 10up, and others.

msm-sitemap's Introduction

Comprehensive Sitemaps

Comprehensive sitemaps for your WordPress VIP site. Site-wide sitemaps on WordPress.com include 1,000 entries by default. This plugin allows you to include all the entries on your site in your sitemap.

Joint collaboration between Metro.co.uk, WordPress.com VIP, Alley Interactive, Maker Media, 10up, and others.

How It Works

Sitemap Data Storage

  • One post type entry for each date.
  • Sitemap XML is generated and stored in meta. This has several benefits:
      • Avoid memory and timeout problems when rendering heavy sitemap pages with lots of posts.
      • Older archives that are unlikely to change can be served up faster since we're not building them on-demand.
  • Archive pages are rendered on-demand.

Sitemap Generation

We want to generate the entire sitemap catalogue async to avoid running into timeout and memory issues.

Here's how the default WP-Cron approach works:

  • Get year range for content.
  • Store these years in options table.
  • Kick off a cron event for the first year.
  • Calculate the months to process for that year and store in an option.
  • Kick off a cron event for the first month in the year we're processing.
  • Calculate the days to process for that month and store in an option.
  • Kick off a cron event for the first day in the month we're processing.
  • Generate the sitemap for that day.
  • Find the next day to process and repeat until we run out of days.
  • Move on to the next month and repeat.
  • Move on to next year when we run out of months.
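
The order in which the steps above visit dates can be sketched in plain PHP. This is only an illustration of the year, month, day traversal, not the plugin's cron code: the function name example_walk_sitemap_dates is hypothetical, and the sketch walks every calendar day in one process instead of chaining cron events and checking which days have posts.

```php
<?php
/**
 * Yield every calendar day between two years as a Y-m-d string, in the
 * order the steps above describe: year by year, month by month, day by day.
 *
 * Hypothetical helper for illustration; it is not part of the plugin.
 */
function example_walk_sitemap_dates( int $first_year, int $last_year ): Generator {
    for ( $year = $first_year; $year <= $last_year; $year++ ) {
        for ( $month = 1; $month <= 12; $month++ ) {
            // date( 't' ) returns the number of days in the given month.
            $days_in_month = (int) date( 't', mktime( 0, 0, 0, $month, 1, $year ) );

            for ( $day = 1; $day <= $days_in_month; $day++ ) {
                yield sprintf( '%04d-%02d-%02d', $year, $month, $day );
            }
        }
    }
}
```

A caller would loop over the generator and generate (or skip) the sitemap for each date. In the cron approach each of these iterations is a separate scheduled event, so no single request has to process the whole range.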

The Comprehensive Sitemap plugin will only update the standard sitemap. The news sitemap will only contain posts from the last two days, based on Google’s guidelines.

CLI Commands

The plugin ships with a bunch of wp-cli commands to simplify sitemap creation:

$ wp msm-sitemap
usage: wp msm-sitemap generate-sitemap
   or: wp msm-sitemap generate-sitemap-for-year
   or: wp msm-sitemap generate-sitemap-for-year-month
   or: wp msm-sitemap generate-sitemap-for-year-month-day
   or: wp msm-sitemap recount-indexed-posts

See 'wp help msm-sitemap <command>' for more information on a specific command.

Custom post types

Include custom post types in the generated sitemap with the msm_sitemap_entry_post_type filter.
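
A minimal sketch of such a filter follows. It assumes the filter passes an array of post type names and expects an array back, which should be checked against the plugin source. The post types 'recipe' and 'review' and the function name are hypothetical.

```php
<?php
/**
 * Add custom post types to the list used for sitemap entries.
 *
 * Assumption: $post_types is an array of post type names.
 * 'recipe' and 'review' are hypothetical custom post types.
 */
function example_filter_msm_sitemap_entry_post_type( $post_types ) {
    $extra = array( 'recipe', 'review' );

    // Merge, then drop duplicates in case a type is already listed.
    return array_values( array_unique( array_merge( (array) $post_types, $extra ) ) );
}

// Register only when WordPress is loaded, so the snippet also runs standalone.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'msm_sitemap_entry_post_type', 'example_filter_msm_sitemap_entry_post_type', 10, 1 );
}
```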

Generate Sitemap with posts of a custom status other than 'publish'

By default, the sitemap will only fetch posts with the status of 'publish'. To change this, use the msm_sitemap_post_status filter.

function example_filter_msm_sitemap_post_status( $post_status ) {
    return 'my_custom_status';
}
add_filter( 'msm_sitemap_post_status', 'example_filter_msm_sitemap_post_status', 10, 1 );

Filtering Sitemap URLs

If you need to filter the URLs displayed in a sitemap created via the Comprehensive Sitemap plugin, there are two considerations. First, if you are filtering the individual sitemaps, which display the URLs to the articles published on a specific date, you can use the msm_sitemap_entry hook to filter the URLs. An example for a reverse-proxy situation is below:

function example_filter_msm_sitemap_entry( $url ) {
    $location = str_replace( 'example.wordpress.com', 'example.com/blog', $url->loc );
    $url->loc = $location;
    return $url;
}
add_filter( 'msm_sitemap_entry', 'example_filter_msm_sitemap_entry', 10, 1 );

Second, if you are filtering the root sitemap, which displays the URLs to the individual sitemaps by date, you will need to filter the home_url directly. There is no plugin-specific hook to filter the URLs on the root sitemap.
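
One way to do this for the same reverse-proxy situation is sketched below. It assumes the root sitemap builds its links through home_url() with a path containing 'sitemap.xml'; that assumption and the function name are illustrative, so verify them against the plugin before relying on this.

```php
<?php
/**
 * Rewrite home_url() results, but only for sitemap paths.
 *
 * Assumption: the root sitemap requests its URLs via home_url() with a
 * path that contains 'sitemap.xml'. All other URLs are left untouched.
 */
function example_filter_sitemap_home_url( $url, $path = '' ) {
    if ( false === strpos( (string) $path, 'sitemap.xml' ) ) {
        return $url;
    }

    return str_replace( 'example.wordpress.com', 'example.com/blog', $url );
}

// Register only when WordPress is loaded, so the snippet also runs standalone.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'home_url', 'example_filter_sitemap_home_url', 10, 2 );
}
```

Restricting the rewrite to sitemap paths keeps the filter from changing every other link that WordPress builds with home_url().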

Filter Sitemap Index

Use the msm_sitemap_index filter to exclude daily sitemaps from the index based on date.

add_filter( 'msm_sitemap_index', function( $sitemaps ) {
    $reference_date = strtotime( '2017-09-09' );

    return array_filter( $sitemaps, function ( $date ) use ( $reference_date ) {
        return ( $reference_date < strtotime( $date ) );
    } );
} );

Customize the last modified posts query

Use the msm_pre_get_last_modified_posts filter to customize the query that gets the last modified posts.

On large sites, this filter could be leveraged to enhance query efficiency by avoiding scanning older posts that don't get updated frequently and making better use of the type_status_date index.

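add_filter( 'msm_pre_get_last_modified_posts', function ( $query, $post_types_in, $date ) {
    global $wpdb;

    // Limit the scan to posts published in the last three months.
    $query = $wpdb->prepare( "SELECT ID, post_date FROM $wpdb->posts WHERE post_type IN ( {$post_types_in} ) AND post_status = 'publish' AND post_date >= DATE_SUB(NOW(), INTERVAL 3 MONTH) AND post_modified_gmt >= %s LIMIT 1000", $date );

    return $query;
}, 10, 3 ); // 3 = number of arguments passed to the callback.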

msm-sitemap's Issues

Simplify year sitemap

We can get away with doing just one WP_Query (by year) instead of doing up to 365 queries per year archive.

Plugin Review

Where do we stand with the plugin review/moving to WordPress.com?

Duplicate code for generating sitemaps from CLI and cron

CLI and cron duplicate code used to generate the sitemap for a given day:

$date_stamp = Metro_Sitemap::get_date_stamp( $year, $month, $day );
if ( Metro_Sitemap::date_range_has_posts( $date_stamp, $date_stamp ) ) {
    Metro_Sitemap::generate_sitemap_for_date( $date_stamp ); // TODO: simplify; this function should accept the year, month, day and translate accordingly
}

This should be refactored into a method that includes the date_range_has_posts check, as we need to attempt to delete the sitemap if it doesn't have any posts, and there is no need to maintain that code in multiple places.

Needed for #63

Chained Cron Jobs

Instead of spawning lots of jobs at once, set up the full sitemap generation to chain the jobs, so one completed job spawns the next. This will help curb load issues that happen when the sitemap generation is started.

Duplicate sitemaps

There is a bug in the update code which generates multiple new days, only one of which contains valid data (usually the first one generated).

Could be a race condition on the cron events but have never been able to replicate it locally. Might only affect large data sets.

It happens regularly on metro.co.uk, and the template parts are designed to get the "good" one.

Clean up template

Too much mixing in of logic and presentation; we should try and cleanly separate the two.

Add Monthly sitemap

To make loading of the sitemap year archives faster, should we add a monthly archive as well?

Add method to delete a sitemap by date

Need to encapsulate the logic to delete the sitemap for a given date. Currently exists like this:

$total_url_count -= intval( get_post_meta( $sitemap_id, 'msm_indexed_url_count', true ) );
update_option( 'msm_sitemap_indexed_url_count' , $total_url_count );
wp_delete_post( $sitemap_id, true );
do_action( 'msm_delete_sitemap_post', $sitemap_id, $year, $month, $day );

https://github.com/Automattic/msm-sitemap/blob/master/msm-sitemap.php#L396-401

This will come in handy to simplify the Metro_Sitemap::generate_sitemap_for_date function, as well as help fix #63

Switch from cron-generation to wp-cli script

The cron-based generation approach is too slow and resource-intensive for the site (especially for a site with many years of data).

Ideally, this process should only ever need to be run once anyway, since post updates already trigger individual sitemap changes.

Once we've added a wp-cli-based script, we can hook it up to the WP.com jobs system to run async there.

Moar Filters

Add some filters to allow modifying the sitemap a wee bit. They should mirror the ones the WP.com sitemaps have.

kses for sitemap content

Currently, the generated sitemap content for each day is stored in meta. We just output the meta as-is, presuming it to be safe. This isn't ideal and we should see if we can use kses to clean the data before outputting it.

Large Sitemap List Issue

If you're using Google Webmaster Tools, you're limited to viewing only 400 sitemaps in the UI. With sitemaps being generated daily with this plugin, that means we're limited to just over a year's worth of sitemaps in the Google UI.

Monthly sitemaps (instead of daily sitemaps) should be an option.

Sitemap Pings

WP.com sitemaps have a ping feature that automatically notifies Google and others when a sitemap changes. We should implement that here.

Remove and/or improve the Sitemaps CPT admin page

Currently the sitemaps custom post type admin page is pretty useless. I think there are a number of things that we could do by adding meta boxes that could show information on:

  • Stats for a specific sitemap. Eg: # of indexed posts in the sitemap
  • List of pages in the sitemap
  • Allow user to regenerate a specific sitemap
  • Inherently this would list all of the sitemaps that have been generated

The sitemaps themselves would not be editable from this admin page as that doesn't really make sense.

For now the page is being removed as it has no real purpose.

Could these changes make it useful enough to bring back?

Prompted by: Mobius5150@de3de21#diff-67d63994628dd99d09072eee489c7487R78

template_redirect handler

The plugin is missing the main bits that actually serve up content on sitemap.xml. We should add that to make it usable :)

Use current_time for fetching time

Currently, we're using time() for fetching time for job queueing, etc. We should use current_time() to ensure that the returned time is in a consistent timezone regardless of any extraneous circumstances.

Rethink how "msm_sitemap_update_last_run" works

I noticed something that is perhaps a non-issue, but want to draw attention in case it could become an issue. Along with my work in #76, I've been working on a way of initially deploying the plugin where I have more control over the sitemap generation. I got myself into a state where the cron events would no longer flag a site creation event. To reproduce:

  1. Run the plugin on a site that is not currently using the plugin (i.e., sitemaps should not already be generated)

  2. Disable cron generation using the methods in #76

    wp msm-sitemap cron --disable
    
  3. Start generating sitemaps manually, using something like (h/t @mjangda):

    for i in $(seq 1994 2015); do for j in $(seq 1 12); do wp msm-sitemap generate-sitemap-for-year-month --year=$i --month=$j; done; done;
    
  4. Kill the generation before it is done (ctrl + C)

  5. Reenable cron generation

    wp msm-sitemap cron --enable
    
  6. Trigger the main cron event

    wp cron event run msm_cron_update_sitemap
    
  7. Observe that no individual sitemap creation events have been registered:

    wp cron event list | grep msm
    

I find it interesting that when I try to generate the site maps manually, but don't finish, that the cron updater does not notice this. This is because msm_sitemap_update_last_run is set to a recent time and no new generation will happen.

I am only logging this issue because response has been pretty positive to #76. We need to make sure that if we implement a manual generation mode that we maybe rethink msm_sitemap_update_last_run to ensure that we don't get caught in this state where many sitemaps are missing, but there is no indication that this is the case. Yes, the developer is kind of taking things into her own hands, but we can do better to help the developer understand that things are left in an incomplete state.

Sitemap post not deleted when all posts deleted

When the last post on a given day is deleted the underlying post is not removed and may still contain the post data -- meaning it will be served if requested.

We should probably hook into the deleted_post action and check if this is the last post for that day. If it is, we can go ahead and delete the sitemap post (takes very little time). If it isn't, we should flag the sitemap post to be updated.

No Sitemaps for non-public sites

No point in rendering them; we should show a nice error message though.

We should also kill the cron job if the site is not public since it's mostly wasted effort.

Kill The Initial Cron

Per discussions in #76 and #77, we should just remove all the cron stuff.

It's overly complex, has lots of problems (breaks on large sites, especially on initial activation) and doesn't match our use cases. Instead, we should document the CLI pieces better and provide examples on how best to configure ongoing updates.

Indexed URL count option too large

The option that contains the count of indexed URLs in each sitemap, msm_sitemap_indexed_url_count, risks becoming too large for the WordPress options cache on large blogs.

To fix this, the indexed URL count for each sitemap should be placed in a post meta, and the msm_sitemap_indexed_url_count should contain only the total number of sitemaps.

Good catch @mjangda

Restore admin page for the main plugin

The cron builder (msm-sitemap-builder-cron.php) has a UI page that we should extract and make a part of the main plugin.

Main focus for the page, for now, should be providing an overview of the state of sitemaps for the site, e.g. how many sitemaps are in the site? how many URLs are indexed? when was the last build time? when's the next build time? etc. Any sort of stats or info that would be useful for site owners.

We can look at adding functional features to the plugin as well like kicking off sitemap generation, rebuild, etc. once we have a basic version in place.

If any design help/inspiration is needed, @keoshi can lend a hand :)

Don't store XML

Instead of storing the full sitemap XML, consider generating it on-the-fly to make mods easier.
