automattic / newspack-custom-content-migrator Goto Github PK

Custom migration tasks for launching and migrating Newspack sites on Atomic

PHP 99.50% Shell 0.43% HTML 0.07%

newspack-custom-content-migrator's Introduction

Newspack Custom Content Migrator

This plugin is a set of WP CLI commands and scripts used during Newspack sites Live Launches and/or Content Updates.

This Plugin consists of various Migrators (which perform reusable or publisher-specific content migration), and the "Content Diff" logic.

Installation

Run composer install.

Usage

The Plugin is installed on a the Staging Site, and executed there to import the most recent content from the current live site.

Migrators

Migrators are classes which perform content migration and data reformatting functionality. They are organized like this:

located in src/Command are WP CLI Command classes
located in src/Logic are business logic classes

There are two kinds of Commands:

General purpose -- located in src/Command/General, these contain the reusable WP data migration logic, used by multiple Live Launches;
Publisher-specific -- located in src/Command/PublisherSpecific, these are custom functionalities used one-time only for individual Publisher's specific needs.

Content Diff

The Content Diff is a functionality which updates Staging site's content by syncing the newest/freshest content from the Live site on top of the Staging site.

It fetches the newest content from the JP Rewind backup archive, by importing "live site's DB tables" side-by-side to the existing local WP tables, and then searches and imports the live site's newest content, and imports the missing "content diff" on top of the Staging site.

Creating a Migrator

New Command Class

Take any existing migrator from the src/Command and copy it either into the src/Command/General or the src/Command/PublisherSpecific with a new name.

Command classes implement the InterfaceCommand which simply makes sure they register WP CLI commands.

Register the New Command

The new Command should be registered in the newspack-custom-content-migrator.php.

After creating a new Command, run composer dump-autoload to update the autoloading files.

Running the Content Diff

The Knife uses the content_diff_update.sh script to run the whole CD update automatically.

Alternatively, the Content Diff CLI command class exposes commands which we can run manually to first detect the newest content (newspack-content-migrator content-diff-search-new-content-on-live) and then import it (newspack-content-migrator content-diff-migrate-live-content).

newspack-custom-content-migrator's People

Contributors

Stargazers

Watchers

Forkers

isabella232 philipjohn urbanmedialabs patilswapnilv aycholpon

newspack-custom-content-migrator's Issues

TablepressMigrator throws errors when the file is included

This happened on a site where tablepress is installed when running any cli command and having the latest version of the NCCM plugin activated:

❯ wp plugin list

Fatal error: Uncaught Error: Call to undefined function is_user_logged_in() in /srv/htdocs/wp-content/plugins/tablepress/classes/class-wp_user_option.php:40
Stack trace:
#0 /srv/htdocs/wp-content/plugins/tablepress/classes/class-wp_option.php(57): TablePress_WP_User_Option->_get_option('tablepress_user...', NULL)
#1 /srv/htdocs/wp-content/plugins/tablepress/classes/class-tablepress.php(239): TablePress_WP_Option->__construct(Array)
#2 /srv/htdocs/wp-content/plugins/tablepress/models/model-options.php(95): TablePress::load_class('TablePress_WP_U...', 'class-wp_user_o...', 'classes', Array)
#3 /srv/htdocs/wp-content/plugins/tablepress/classes/class-tablepress.php(239): TablePress_Options_Model->__construct(NULL)
#4 /srv/htdocs/wp-content/plugins/tablepress/classes/class-tablepress.php(256): TablePress::load_class('TablePress_Opti...', 'model-options.p...', 'models')
#5 /srv/htdocs/wp-content/plugins/tablepress/classes/class-tablepress.php(149): TablePress::load_model('options')
#6 /srv/htdocs/wp-content/plugins/newspack-custom-content-migrator/src/Logic/TablePress.php(43): TablePress::run()
#7 /srv/htdocs/wp-content/plugins/newspack-custom-content-migrator/src/Command/General/TablePressMigrator.php(27): NewspackCustomContentMigrator\Logic\TablePress->__construct()
#8 /srv/htdocs/wp-content/plugins/newspack-custom-content-migrator/src/Command/General/TablePressMigrator.php(38): NewspackCustomContentMigrator\Command\General\TablePressMigrator->__construct()
#9 /srv/htdocs/wp-content/plugins/newspack-custom-content-migrator/src/PluginSetup.php(63): NewspackCustomContentMigrator\Command\General\TablePressMigrator::get_instance()
#10 /srv/htdocs/wp-content/plugins/newspack-custom-content-migrator/newspack-custom-content-migrator.php(29): NewspackCustomContentMigrator\PluginSetup::register_migrators(Array)
..... EDITED OUT IRRELEVANT STUFF 
#19 {main}
  thrown in /srv/htdocs/wp-content/plugins/tablepress/classes/class-wp_user_option.php on line 40

TMD: Imported content is in HTML blocks

It looks like we need to do some more processing on the content when we import it as we end up with HTML blocks like this (from this post):

Perhaps we could strip out anything that isn't an legitimate editorial element? E.g. p, img, blockquote, iframe etc

Dependency on newspack-cms-importers can't be found

When using this plugin in a composer-based project, composer is unable to resolve the depependency on newspack-cms-importers:

automattic/newspack-custom-content-migrator 1.0.1 requires automattic/newspack-cms-importers dev-master -> could not be found in any version, there may be a typo in the package name.

My guess is that the repo is private. Could the repo be made public, or the dependency removed maybe?

Convert CPTs to Posts with categories

The Real News has a bunch of custom post types that have been used to categorise different types of content. As part of the move to Newspack we need to remove this unnecessary complexity and convert them all to posts, using post categories to denote the different types of content.

This table shows the post type name and the category that should be assigned to posts of this type when converted:

CPT	Category
trnn_column	Columns
trnn_story	N/a
third_party_content	Third Party

On each conversion, we need to check if there are any attached terms from the custom taxonomies and re-assign the converted categories (see #21 #22 #23).

Merge Synopses into Posts

TRN have a "Synopses" post type that essentially contains the post content attached to videos. The videos are in the "Stories" CPT.

The Stories have been converted to Posts. We now need to add the Synopses to the post_content of those converted posts. The existing post_content in the post should be moved to the excerpt.

A new publisher-specific CLI command is needed to:

Register the Synopses CPT if it isn't already
Loop through each Synopsis
Find the corresponding Post
Move the content for the Post into the excerpt field instead
Insert the Synopsis post content into the Post
Set a meta to denote that the Synopsis has been successfully merged

TMD: Date/author mismatch

In many articles there is a date and author given at the top of the article but this seems to often not match the actual assigned author and the actual published date (example).

We should first establish which is the correct author and published date by checking the current live site and then ensuring the migration script does two things:

Assigns the correct user account and published date to the imported post meta
Removes the manually inserted byline and date from the top of post_content

Topics need to be migrated to Categories

The Real News has a custom taxonomy called "Topics" that needs to be converted to Categories.

We previously did this for Asia Times, so should be able to re-use that code, but it would perhaps be worth making a general purpose command that can take the 'source' and 'target' taxonomies as parameters.

TMD: DL/DT elements breaking output

Some posts have some content within DL and DT elements that should be simple paragraphs. These seem to have broken A tags that are causing incorrect styling. This post for example looks very unusual:

Image de-duplicator looses captions

Before the de-duplication CLI is run, the code might look something like this:

<!-- wp:image -->
<figure class="wp-block-image"><img src="//plantbasednews-newspack.newspackstaging.com/image-placeholder-title-MTY1NTk1NzkyNjk4MTg5NjA1/" alt=""/><figcaption>Demand for dairy has plunged as restaurants and cafes have shut (Photo: Animal Equality)</figcaption></figure>
<!-- /wp:image -->

After de-duplication that entire code is gone, the featured image is applied as expected but the caption is no longer present, as the de-dupe command doesn't take it into account.

The command needs to pick up on captions (they might also be present in a [caption] shortcode and make sure they are added to the image's caption field.

OTW: Convert Documentcloud HTML embeds to Shortcode

There are a bunch of posts on OTW where Documentcloud embeds have been inserted via HTML. Here's an example (pre-block conversion):

<!-- wp:html -->
<div id="DV-viewer-6563583-50129896-1" class="DC-embed DC-embed-document DV-container"></div>
<script src="//assets.documentcloud.org/viewer/loader.js"></script>
<script>
  DV.load("https://www.documentcloud.org/documents/6563583-50129896-1.js", {
  width: 400,
    height: 600,
    sidebar: false,
    text: false,
    container: "#DV-viewer-6563583-50129896-1"
  });
</script>
<noscript>
  <a href="https://assets.documentcloud.org/documents/6563583/50129896-1.pdf">HTP Apprenticeship College Ofsted Report Nov 2019 (PDF)</a>
  <br />
  <a href="https://assets.documentcloud.org/documents/6563583/50129896-1.txt">HTP Apprenticeship College Ofsted Report Nov 2019 (Text)</a>
</noscript>
<!-- /wp:html -->

The Documentcloud plugin provides a simple shortcode that works well for embedding these instead, and works with AMP on. The above HTML blocks become:

<!-- wp:shortcode -->
[documentcloud url="https://www.documentcloud.org/documents/6563583-50129896-1.html"]
<!-- /wp:shortcode -->

Note how the document ID (6563583) and file name (50129896-1) are combined in the two different methods. The migration tool will need to take this into account.

New Naratif: References and Bibliography styling

Articles in the "Research" category include a Bibliography and/or References section at the end of the article (example). These sections should use a smaller font. To accomplish that, we want to wrap these sections in a Group block with the class ref-biblio. CSS is then used to change the font size.

As this section is always the last part of the article, it should be feasible to automatically detect the beginning of the section and wrap it into a group block with the required class.

TMD: Convert shortcodes

There are several shortcodes used for embeds on the old website. See this example for YouTube as well as Magnify, which is where we host many custom HTML embeds. Magnify shortcodes should be interpreted as follows: [magnify:<path>,<height>] returns an iframe of height <height> pointing to https://magnify.michigandaily.us/<path>

OTW: Migrate tag descriptions to new pages

OTW have used a plugin to allow HTML in tag descriptions, facilitating better looking tag archive pages. Example: https://onthewight.com/about/2017-general-election/

Note the permalink structure, too.

We'll be replacing this with pages, taking advantage of the block editor. With 892 tags, and a good few including HTML, we need to bulk migrate these to pages. Here's what we've agreed the migrator should do:

Loop through each tag
If the description is not empty:
If there is a heading tag at the start of the description, extract that
If there is HTML in the rest of the description (other than paragraphs), do nothing
Create a new page (if there isn't an existing one) with the heading from step 3 as the page title. The page should be a child of "About".
Use the entire HTML from the tag description as the page content
Add a redirect from the default tag archive page (/tag/[tag-slug]) to the newly created page
Store a meta in the page to indicate which tag it was migrated from
If we haven't created a new page, ensure the old about slug is redirected to the default tag slug.
Update the tag base permalink to the default of tag, instead of about

I've done some pseudo code to try and better explain what the migrator needs to do:

<?php

$tags = get_all_the_tags();

// Loop through the tags
foreach ( $tags as $tag ) {

	// Don't create pages for tags with no description.
	if ( ! empty( $tag->description ) ) {

		// Store the description for later manipulation.
		$new_description = $tag->description;

		// Find the heading, if there is one.
		$heading = extract_any_heading( $tag->description );

		// Get the description without the heading.
		if ( $heading ) {
			$new_description = description_without_heading( $tag->description );
		}

		// Check if the description contains any HTML other than paragraphs.
		if ( description_has_html( $new_description ) ) {

			// Create a new page to replace the tag archive.
			$new_page = wp_create_post( [
				'post_type'    => 'page',
				'title'        => $heading,
				'post_content' => $new_description,
				'post_parent'  => 'about', // This is the parent slug.
			] );

			// Add meta to the new page to indicate which tag it came from.
			add_post_meta( $new_page->ID, '_migrated_from_tag', $tag->slug );

			// Add a redirect for the tag archive to new page.
			add_redirect( [
				'from' => '/tag/' . $tag->slug,
				'to'   => get_the_permalink( $new_page->ID )
			] );

		}

	}

	// When we don't create a new page, we need to ensure the old tag URL is
	// redirected to the new tag URL (`about` vs `tag`).
	if ( ! $new_page ) {

		// Redirect the old tag permalink structure back to the default.
		add_redirect( [
			'from' => '/about/' . $tag->slug,
			'to'   => '/tag/' . $tag->slug,
		] );

	}

}

// Update the permalink base so that tags revert to the default structure
// rather than using 'about'.
update_permalink_base_for_tags( 'tag' );

Migrate Largo subtitles

Two features that are unique to Largo are their bylines and their subtitles. The ability to migrate the former is currently being managed by #234.

For the latter, Largo uses a subtitle post meta field that needs to be migrated to the newspack_post_subtitle field, which is how Newspack manages subtitles.

Can we get a command to migrate that data, too?

Michigan Daily: Fill in the gaps left by the Drupal importer

The MD staging site was created using a plugin that converts content from Drupal but there's a lot missing. Even basic stuff like content hasn't come across. This issue documents where the data resides to guide development of a script that can pull the info from the old database.

For convenience, the tables from the Drupal backup have been imported into the MD staging site database, alongside the WP tables.

Old node ID

Each imported post in the WP install has a meta key of _fgd2wp_old_node_id the value of which points to the node ID used in Drupal. For example, this post as a value of 225700 so we can get the node from the Drupal tables like so: SELECT * FROM node WHERE nid = 225700

Title

The post title is held in the title column of the node table: SELECT title FROM node WHERE nid = 225700

Author

The author ID is also in the node table: SELECT uid FROM node WHERE nid = 225700
Author info can be grabbed from the users table: SELECT * FROM node WHERE nid = 1
Other info then stems from data tables: SELECT * FROM field_data_field_twitter WHERE entity_type = 'user' AND entity_id = '1'. We'll need to build the dataset by checking multiple data tables.

Date

The post created date is held in the created column of the node table: SELECT created FROM node WHERE nid = 225700

Post content

Can be grabbed with SELECT * FROM field_data_body WHERE entity_type = 'node' AND entity_id = 225700.

Other data

We can adapt the content SQL to the other tables to grab other data. E.g. SELECT * FROM field_data_field_byline WHERE entity_type = 'node' AND entity_id = 225700.

Method

It should be possible, with a CLI command, to run through imported posts, grab the node ID, and then pull the relevant data from the various sources in the Drupal tables.

Shows taxonomy needs to be converted to Categories

The Real News has a custom taxonomy called "Shows" that needs to be converted to categories. It makes sense for these to be sub-categories of a "Shows" category.

As with #21 there is previous code to do this which can be adapted but the ability to specify that the conversion include a parent category would be a useful addition to a generic command that can even create the parent.

New Naratif: Podcast & video embed width is not 100%

Articles within the "Podcasts" category include a podcast embed and sometimes a YouTube embed. These are not spanning 100% width because of the original HTML in the content, which should be fixed.

Here's an example: https://newnaratif-newspack.newspackstaging.com/2020/10/09/articulating-an-alternative-to-the-paps-singapore/
This article has been fixed manually, showing how it should look: https://newnaratif-newspack.newspackstaging.com/2020/10/23/critical-theory-in-singapore-and-the-philosophy-of-social-justice/

YouTube code example
This is what the original code looks like for a YouTube Embed:

<!-- wp:html -->
<figure><iframe src="https://www.youtube.com/embed/zDo92lVQdR4" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe></figure>
<!-- /wp:html -->

Instead, it should be a YouTube block.

Podcast code example
This is what the original code for the embedded podcast looks like:

<!-- wp:html -->
<figure><iframe src="https://anchor.fm/politicalagenda/embed/episodes/Articulating-an-Alternative-to-the-PAPs-Singapore-ekqchc" width="400px" height="102px" frameborder="0" scrolling="no"></iframe></figure>
<!-- /wp:html -->

Being in a HTML block is necessary, but the figure element and fixed with seem to throw it off. Here's how it should look:

<!-- wp:html -->
<iframe src="https://anchor.fm/politicalagenda/embed/episodes/Articulating-an-Alternative-to-the-PAPs-Singapore-ekqchc" width="100%" height="102px" frameborder="0" scrolling="no"></iframe>
<!-- /wp:html -->

Image de-duplication failing on PBN

The CLI command isn't working as expected for PBN (sample post) for some reason. It finishes in seconds when running:

$ wp newspack-content-migrator de-dupe-featured-images
Checking 7173 posts.
Finished processing 7173 records in 5 seconds

Looking at the post content there is definitely an image block first, and the same image is set as featured image so it's not clear why this is going wrong.

Muslim Journal: Restricted Content Migration

Muslim Journal (live, staging) currently use the Restricted Content Pro (RCP) plugin to restrict access to certain pieces of content. As part of their migration to Newspack we will be migrating them to use the WooCommerce based restriction system we already have in use for other sites.

RCP uses custom tables to store it's data, which makes it fairly simple to interrogate and extract the relevant data. Members are tied to actual WP users, too.

There are 7 different subscription options, detailed in the kpb_restrict_content_pro table, and new subscription products have been created in WooCommerce for these:

RCP Subscription	Woo Subscription
Digital Subscription 1 Month (ID: 2)	Digital Subscription – Monthly (ID: 1860)
Digital Subscription 6 months (ID: 4)	Digital Subscription – 6 Months (ID: 1861)
Digital Subscription 1 Year (ID: 5)	Digital Subscription – Yearly (ID: 1862)
Digital Subscription 2 years (ID: 6)	Digital Subscription – 2 Years (ID: 1863)
Print Subscription 6 months (ID: 8)	Print Subscription – 6 Months (ID: 1864)
Print Subscription 1 Year (ID: 9)	Print Subscription – Yearly (ID: 1865)
Print Subscription 2 Years (ID: 10)	Print Subscription – 2 Years (ID: 1867)

Memberships are stored in the kpb_rcp_memberships table and contain references to the customer record (customer_id) and WP user (user_id) via which other data can be gathered. The migrator should run through each membership and do the following:

Check that the user (user_id column) exists. Skip if there is no user and report back.
Create a membership in WooCommerce
Match the RCP membership (object_id column) with the new Woo membership plan
Create a new Woo customer & member record (an order may need to be created, matching dates in the RCP tables)
Retrieve all the data from the membership record and linked customer record to add to the new Woo member/customer
Ensure that the "auto renew" flag is honoured so that those not on auto renewal are not charged at the end of their subscription
Ensure no emails are sent out at all during the migration process that could confuse users
If possible, also migrate the PayPal subscription data (gateway_subscription_id) for uninterrupted billing

New Naratif: Add template for handling article collaborators

Many articles have multiple authors, each with a different role. E.g., one might be an illustrator, another is translator. While we don't have a formal way to add these roles, they will be added manually to the top of the content.

For imported posts we need to add these to the content automatically using the template provided in this post. The data is stored in post meta, through ACF fields.

Important: Posts in the "Announcements" category should NOT be included in this process.

OTW: Migrate custom shortcodes to re-usable blocks

OTW use a plugin that allows them to create their own custom shortcodes. This is a use case that can easily be replaced with usable blocks, though. To facilitate this we need to migrate their existing custom shortcodes to re-usable blocks and replace them throughout the content.

The plugin is called Shortcodes Ultimate (and associated creator UI plugin).

The migrator will need to:

Grab a list of all custom shortcodes stored by the SU plugin
Each shortcode has a Title and Content. These should be used to create an Atomic Accordion block.
The new Atomic Accordion block should be converted to a re-usable block
A search & replace should then be performed to replace the shortcode with the re-usable block code

TRNN: Author migration needs to check additional meta

There is a CLI command that migrates TRNN's authors. It checks posts for author IDs stored in meta called bios. However, it appears that some posts store the IDs in different meta keys. There is a host-author and collaborators post meta also that work in the same way. These also need to be checked.

It might be possible to extend the existing query, or it might need a second query

Sahan Journal Authors Migration

On Sahan Journal's live site there's an "author" CPT. Each other is then referenced in posts by ID in a post meta field (ACF).

The authors need to be converted to CAP guest authors, and posts updated to attach those authors. The existing ACF meta fields can be left alone as a backup.

Skip S3 check on S3 migrator command

The attachments-switch-local-images-urls-to-s3-urls scans post_content for references to locally hosted media. If it finds media that is also hosted in Amazon S3, it updates the post content to point to S3.

The command should probably skip that check and just update the domain regardless. A site that is using S3 is not hosting media locally, so leaving the domain unchanged still results in a broken link, whether or not the file is in S3. We want to be able to add the file to S3 later without going back and running the command again.

Check for and import revised content

Publishers may be editing content, no matter how old, for any number of professional, ethical, or legal reasons. And we can't count on one primary point of contact knowing the breadth of these changes: in the time between Berkeleyside's last import (Feb 2022) and their launch (July 2022), previously imported content from over sixty (60!) different authors had been modified in their production environment.

There are two steps we can use during the content refresh process to ensure that revisions to legacy content are migrated:

Currently, if a post has had its title, post_name, status, or date changed (most likely the title), then the content migrator sees it as a new post, even if it was previously migrated. In those scenarios, if post_name hasn't changed, then the result may be posts with duplicate slugs. We'll want the one with the most recent post_modified timestamp to overwrite the older one.
Currently, if only the post_content has changed, then it will not be seen as a new post and will not be migrated. We should still run a comparison of all previously migrated posts to see if their counterparts still have matching post_modified values; if not, then the staging site's post should be updated with the new post_content and post_modified values.

More details in P2: pamTN9-4Tp-p2

CPP: Content stored in custom fields

Carolina Public Press has some posts where content is stored in post meta. It looks like this is ACF-based.

Example:
https://carolinapublicpress.org/28695/analysis-nc-convicts-fewer-than-1-in-4-sexual-assault-defendants/
https://carolinapublicpress-newspack-1.newspackstaging.com/28695/analysis-nc-convicts-fewer-than-1-in-4-sexual-assault-defendants/

We need to:

Assess which posts are constructed this way to see how we can automatically detect it
Produce a CLI command/script that will run through and convert posts

Strip HTML & heading from author bios

OnTheWight have HTML in their author bios, including a header that includes the author name. This results in undesirable output including a repetition of the author name on author bios, e.g. https://onthewight-newspack.newspackstaging.com/author/sal/

The unnecessary HTML should be stripped, and any Heading at the top should be removed entirely. Below is an example of what the before and after should look like.

Before:

<h1>Sally Perry</h1><br>
Contact: <a href="mailto:[email protected]">[email protected]</a>
<br><br>

Sally Perry is co-owner, reporter and editor at Isle of Wight News from OnTheWight.
<br><br>
The publication has built up a large and trusting audience, not only on the Island, but also around the world, helping IW ex-pats stay in touch with Island news.
<br><br>
Sally has been recognised nationally for her in-depth coverage of the three-week occupation of a factory by workers. As Journalism.co.uk commented at the time, “[they} gave the national and local media a run for its money with its comprehensive coverage of the industrial dispute".
<br><br>
She’s also been awarded for her dogged persistence in highlighting the plight of landlocked Ventnor Undercliff residents following a major landslide (over 200 articles in four+ years).
<br><br>
As well as covering local democracy and holding those in power to account for their decisions, OnTheWight is a champion of The Arts, regularly celebrating the creative talents of Islanders.

After:

Contact: <a href="mailto:[email protected]">[email protected]</a>

Sally Perry is co-owner, reporter and editor at Isle of Wight News from OnTheWight.

The publication has built up a large and trusting audience, not only on the Island, but also around the world, helping IW ex-pats stay in touch with Island news.

Sally has been recognised nationally for her in-depth coverage of the three-week occupation of a factory by workers. As Journalism.co.uk commented at the time, “[they} gave the national and local media a run for its money with its comprehensive coverage of the industrial dispute".

She’s also been awarded for her dogged persistence in highlighting the plight of landlocked Ventnor Undercliff residents following a major landslide (over 200 articles in four+ years).

As well as covering local democracy and holding those in power to account for their decisions, OnTheWight is a champion of The Arts, regularly celebrating the creative talents of Islanders.

OTW: Import AWS-hosted content images

There are some images within content on OTW that are currently hosted through an AWS bucket called otwstatgraf. These should be moved back into WordPress and the references updated.

An example is here: https://onthewight-newspack.newspackstaging.com/john-hurt-discussing-capitalism-and-the-loss-of-an-age-of-naivety-podcast/

We need to:

Loop through posts detecting images hosted in the AWS bucket
Download those images to the media library
Replace the image URL in content with the new WordPress URL

Several improvements to the Content Diff

Continuing work from PR #141 here are a few improvements we should add to the Content Diff structure:

let's take an example of this current error message - Error, could not insert term_relationship for live post/object_id=14077 (new post_id=310) because term_taxonomy_id=990 is not found in live DB -- it exists in live term_relationships, but not in live term_taxonomy table. This should be improved so that in case where a Term is missing, we shouldn't be bringing over term_taxonomy_ids at all.
change this error message from and "Error" to a "warning" wording and warn that a specific taxonomy/term hasn't been brought over

Additions:

update_featured_images needs to be refactored better -- the loop and the logging should go to the command class, and just pure "business logic" should be left in this command
same goes for update_blocks_ids
refactor get_existing_post_by_live_id method from the command class into the CD logic class

Regions need to be converted to Categories

The Real News has a custom taxonomy called "Regions" that needs to be converted to categories. It makes sense for these to be sub-categories of a "Regions" category.

TMD: Missing images

It looks like images are missing from the imported posts. For example:

Live: https://www.michigandaily.com/section/news/here-stay
Staging: https://michigandaily-newspack2.newspackstaging.com/2020/10/undocumented-dreamer-students-say-they-are-here-to-stay/

Removing duplicate images

Some themes don't have good featured image support and so publishers have often put an image in the very top of the content instead. This presents a problem when they move to Newspack, as the image can be duplicated. Here's an example from Technode:

This is not uncommon so we should probably develop a script we can run during migration to take care of this. Here's how I think that should work:

Run through each post
Check if the first item in the content is an image
Check if there is also a feature image
If the two images are the same, remove the image from the post content

TMD: Authors missing from some articles

Some articles, like this example, are missing an author byline. In this case, the author is 0 but I was able to track down the original node (253357) which has an associated user ID of 594 but just hasn't been imported for some reason.

TRNN non-YouTube videos missed in migrator

In #35 we have the untested assertion that

YouTube is the only provider we need to worry about.

That has turned out not to be the case, and we are missing, for example, Vimeo videos like this one:

https://therealnews.com/stories/x-malcolms-final-years
https://therealnews-newspack.newspackstaging.com/x-malcolms-final-years

The post meta shows that a vimeo video is attached:

$ wp post meta get 178107 trnn_videosource
vimeo
$ wp post meta get 178107 trnn_othervideoid
147851604

We need to adapt the video migrator from #35 to include any non-YouTube videos and ensure they are embedded properly.

Migrate authors stored in CPTs to CAP Guest Authors

We could do with a more generic method for migrating authors stored in CPTs to CAP's Guest Authors.

Ideally the CLI command will be capable of taking a collection of arguments that describe where to grab relevant data from so that it's all dynamic.

The information each Guest Author can have is:

Display Name
First Name
Last Name
Email
Website
Bio

Each of these could be an argument to the command, like this:

wp newspack-content-migrator co-authors-cpt-to-guest-authors [--display_name=<display name>] [--first_name=<first name>] [--last_name=<last name>] [--email=<email>] [--website=<website>] [--bio=<bio>]

The actual values supplied these could describe where, in the posts, the info can be grabbed from. For example, the name may be the post's title, the bio is post content and the email address is in a meta field called "author_email" which could be provided with the following values:

wp newspack-content-migrator co-authors-cpt-to-guest-authors --display_name=post_title --bio=post_content --email=meta:author_email

TRN: Migrate videos into post content

Many of TRN's posts have a video attached. This is stored in post meta and needs to be migrated to post_content.

We need a CLI command to:

Loop through all Posts that have a value for the trnn_videosource meta data
Find the right video to use based on the source of trnn_videosource and the associated meta (e.g. for "youtube" there is a trnn_youtubeurl meta containing the ID of a YouTube video).
Embed the video at the start of the post_content

TMD: Facebook comments are embedded in post_content

For some reason a Facebook comments block has ended up embedded in the post content and needs to be removed. It looks something like this once block conversion has been run:

<!-- wp:html -->
<div id="fb-root"></div>
<!-- /wp:html -->

<!-- wp:paragraph -->
<p><script>(function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=614402668605778";
  fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script></p>
<!-- /wp:paragraph -->

<!-- wp:html -->
<div class="fb-comments" data-href="http://www.michigandaily.com/content/2008-10-27/blue-scores-upset-over-no-2-northwestern-3-1" data-numposts="5"></div>
<!-- /wp:html -->

<!-- wp:paragraph -->
<p><script type="text/javascript">
  try { _402_Show(); } catch(e) {}
    </script></p>
<!-- /wp:paragraph -->

Example post

Sahan Journal content migration

As may be the case with many page builders, the one in use on Sahan Journal stores content in meta. We need a script therefore to move that content into post_content, ideally with the name of the meta_key specified in a parameter.

I already sort of did this once. I ran a script within wp shell to on the staging site which migrated all the content in posts. It missed out pages, though, and we'll need this in a more robust format to migrate the content during the launch.

Here's the script I used:

<?php

$q = [
	'post_type'      => 'post',
	'posts_per_page' => -1,
];

$wq = new WP_Query( $q );

if ( $wq->have_posts() ) {

	echo sprintf( 'There are %d posts.', $wq->post_count );

	while ( $wq->have_posts() ) {
		$wq->the_post();

		echo PHP_EOL . sprintf( 'Processing %d... ', get_the_ID() );

		/* There's already some content, so skip */
		if ( ! empty( get_the_content() ) ) {
			continue;
		}

		echo 'No content... ';

		/* There's nothing in post meta, so skip */
		if ( empty( get_post_meta( get_the_ID(), 'page_content_0_copy_content', true ) ) ) {
			continue;
		}

		echo 'Content in meta... ';

		$update = wp_update_post( [
			'ID'           => get_the_ID(),
			'post_content' => get_post_meta( get_the_ID(), 'page_content_0_copy_content', true ),
		] );

		if ( is_wp_error( $update ) ) {
			printf( "Failed updating %d", $update );
		} else {
			printf( "Updated %d", $update );
		}
	}
} else {
	echo 'There are no posts.' . PHP_EOL;
}

EastMojo: Retrieve missing content from the API

It has become apparent that our EastMojo import was missing lots of data because the data wasn't present in the RSS feeds their existing CMS provider supplied.

There is an API, however, where we can fetch a JSON representation of each article with all the data included. This is based on GUID, which we imported from the RSS so we can now run through each imported post and fetch the extra data from the API.

Below is a description of the data and how it should be imported. Anything in the JSON not mentioned below shouldn't be imported.

JSON Path	Description	Import to
`story->seo->meta-description`	A string with an article description	Yoast's description meta tag feature
`story->tags`	An array of objects describing tags	Grab the `name` element from each tag object and ensure the post is assigned to that tag
`story->subheadline`	A string with a sub-headline	Newspack's subtitle field
`story->summary`	A string with a story summary	The excerpt
`story->hero-image-attribution`	Credit for the featured image	Newspack's "credit" field for the attached featured image

Install WordPress Importer only when needed

If the Custom Content Migrator is active and WordPress Importer is not installed and active, CCM will do so anytime any WP-CLI command is run. For example:

$ time wp search-replace //charlestoncitypaper-staging.newspackstaging.com //charlestoncitypaper.com --all-tables-with-prefix --report-changed-only --dry-run
Installing and activating the wordpress-importer plugin now...

newspack-custom-content-migrator/src/PluginSetup.php

Lines 70 to 72 in f6d99fb

    
           	/** 
        
           	 * Checks whether wordpress-importer is active and valid, and if not, installs and activates it. 
        
           	 */

It would be preferable if WordPress Importer were installed only when it is needed for the WP-CLI command being run.

TRN: Migrate Transcript content into post_content

TRN has posts that are attached to a specific video. Often, a transcript of this video is provided. This is stored in post meta and their old theme printed that out in a particularly way. We need to migrate that into post_content.

We need a CLI command to do the following:

Loop through all Posts that have the trnn_transcript meta key with a value
Append onto he post_content a HR tag, followed by a heading (H2) of "Story Transcript"
Append the value of the trnn_transcript meta data to the end of post_content

OTW: Replace audio HTML embeds with audio blocks

OTW (and possibly others) have audio embedded using HTML, and served through AWS. We need to:

Run through each post detecting these audio embeds
Download the audio file and insert into the media library
Replace the HTML with an audio block

Here's an example (from this post) HTML embed that we'll see in the content:

<!-- wp:html -->
<audio class="wp-audio-shortcode" id="audio-0-1" preload="none" style="width: 100%;" controls="controls"><source type="audio/mpeg" src="http://otw-audio.s3.amazonaws.com/john-hurt-minghella-film-festival-2010.mp3?_=1"><a href="http://otw-audio.s3.amazonaws.com/john-hurt-minghella-film-festival-2010.mp3">http://otw-audio.s3.amazonaws.com/john-hurt-minghella-film-festival-2010.mp3</a></audio>
<!-- /wp:html -->

This is an example of what that should get converted into:

<!-- wp:audio {"id":372572} -->
<figure class="wp-block-audio"><audio controls src="https://onthewight-newspack.newspackstaging.com/wp-content/uploads/2020/09/john-hurt-minghella-film-festival-2010.mp3"></audio></figure>
<!-- /wp:audio -->

It might be worth creating a general migrator for this, that conditionally sideloads the audio only if it's hosted externally.

	/**
	* Checks whether wordpress-importer is active and valid, and if not, installs and activates it.
	*/