palaniraja / blog2md

Convert Blogger & WordPress backup blog posts to Hugo-compatible Markdown documents

JavaScript 100.00%

blog2md's Introduction

Blogger to Markdown

Convert Blogger & WordPress backup blog posts to Hugo-compatible Markdown documents

Usage: node index.js b|w <BLOGGER BACKUP XML> <OUTPUT DIR>

For Blogger imports, blog posts and comments (as a separate file, <postname>-comments.md) will be created in the "out" directory

node index.js b your-blogger-backup-export.xml out

For WordPress imports, blog posts and comments (as a separate file, <postname>-comments.md) will be created in the "out" directory

node index.js w your-wordpress-backup-export.xml out

If you want the comments merged into the post file itself, use the flag m at the end. The default is s, for a separate comments file

node index.js w your-wordpress-backup-export.xml out m

If converting from WordPress, and you have posts that do not contain HTML, you can use a paragraph-fix flag at the end.

node index.js w your-wordpress-backup-export.xml out m paragraph-fix
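The README does not say how paragraph-fix works internally. As a rough illustration only, one plausible approach is to wrap blank-line-separated text in <p> tags before the HTML-to-Markdown step, so that paragraph breaks survive the conversion. The helper name wrapParagraphs is hypothetical and is not taken from blog2md.

```javascript
// Hypothetical sketch, not blog2md's code: wraps plain-text paragraphs in <p>
// so that a later HTML-to-Markdown step keeps the paragraph breaks.
function wrapParagraphs(text) {
  return text
    .split(/\r?\n\s*\r?\n/)          // one or more blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0)     // ignore empty chunks
    .map((p) => `<p>${p}</p>`)
    .join('\n');
}

console.log(wrapParagraphs('First paragraph.\n\nSecond paragraph.'));
```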

Installation (usual node project)

Download or Clone this project
cd to directory
Run npm install to install dependencies
Run node index.js <arg...>

Notes to self

Script to convert posts from Blogger to Markdown.

Read XML
Parse Entries (Posts and comments) (with xpath?)
Parse Title, Link, Created, Updated, Content
List Post & Respective comment counts
Content to MD - pandoc?
Parse Images, Files, Videos linked to the posts
Create output dir
List items that are not (or can't be) downloaded, along with their .md file, for the user to handle

Reasons

Wrote this to consolidate and convert my blogs under one roof.
Plain, simple workflow with Hugo.
The idea was to download the assets (images/files) linked to each post. I gave up because it was time consuming, and I would still need to validate the converted Markdown against its assets anyway, so I don't see the benefit.
My initial assumption was to parse with xpath, but I found xml2json.js was easier.
I also thought pandoc was overkill, and turndown.js was successful, though I had to wrap empty text to md instead of html.
I want to retain comments. Believe it or not, there were some good comments.
Was sick and spent ~12 hrs over 5 days coding and testing with my blog content of ~150 posts. Also, I find parsing oddly satisfying when it results in success. ¯\_(ツ)_/¯


blog2md's Issues

thank you!

that worked really well and was totally painless.

Support for non-English languages

I like this tool a lot.
Just wondering whether some non-English characters can still be recognized in future versions? Currently the export only keeps English letters in the titles. Thanks!
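The issue does not say where the non-English characters are lost. If the cause is an ASCII-only filter applied to titles or file names (an assumption), a Unicode-aware slug function along these lines would keep letters and digits from any script. The name unicodeSlug is hypothetical and is not part of blog2md.

```javascript
// Hypothetical slug helper; blog2md's own file naming may work differently.
function unicodeSlug(title) {
  return title
    .normalize('NFKC')
    .toLowerCase()
    // Replace each run of characters that are not letters, combining marks,
    // or digits (in any script) with a single hyphen.
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, '-')
    // Trim hyphens left over at either end.
    .replace(/^-+|-+$/g, '');
}

console.log(unicodeSlug('Hello, World!')); // hello-world
console.log(unicodeSlug('你好 世界'));      // 你好-世界
```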

[Question] Will all dates and times be retained after migration?

I have posts in Blogger going back to 2012. If I back up my XML file from Themes -> Backup -> Download XML and use blog2md to convert it to Hugo-compatible Markdown, will all dates and times from 2012 until now be shown in Hugo as they are?
Thank you!

Error when Blogger draft has no title: `throw er; // Unhandled 'error' event`

When I ran blog2md on my Blogger export, it crashed with the following error:

PS D:\temp\blog2md> ..\node-v18.14.2-win-x64\node.exe index.js b .\blog-03-04-2023.xml ./myblog
INFO: Comments requested to be a separate .md file(m - default)
Total no. of entries found : 295
Content-posts 63
Content-Comments 180
title: "undefined"
date: 2014-08-08T01:35:00.001+03:00
draft: true
node:events:491
      throw er; // Unhandled 'error' event
      ^

Error: Input must be string
    at sanitize (D:\temp\blog2md\node_modules\sanitize-filename\index.js:41:11)
    at module.exports (D:\temp\blog2md\node_modules\sanitize-filename\index.js:54:16)
    at getFileName (D:\temp\blog2md\index.js:245:23)
    at D:\temp\blog2md\index.js:306:42
    at Array.forEach (<anonymous>)
    at D:\temp\blog2md\index.js:286:23
    at Parser.<anonymous> (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:303:18)
    at Parser.emit (node:events:513:28)
    at SAXParser.onclosetag (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:261:26)
    at emit (D:\temp\blog2md\node_modules\sax\lib\sax.js:624:35)
Emitted 'error' event on Parser instance at:
    at exports.Parser.Parser.parseString (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:326:16)
    at Parser.parseString (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:5:59)
    at D:\temp\blog2md\index.js:257:16
    at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68:3)

I realised this was caused by an unpublished draft from 2014 that didn't have a title.
After adding a title on Blogger and downloading a new .xml export, blog2md worked successfully.
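The log shows title: "undefined" just before sanitize-filename rejects its input, which suggests the title reaches the file-naming code as a non-string. A guard like the following could supply a fallback before the value is sanitized. This is a minimal sketch: safeTitle is a hypothetical helper, not a function in blog2md, and the fallback text 'untitled' is an arbitrary choice.

```javascript
// Hypothetical helper, not part of blog2md: returns a usable string title
// even when the export entry has none.
function safeTitle(rawTitle, fallback = 'untitled') {
  // Only strings are usable as titles; anything else gets the fallback.
  if (typeof rawTitle !== 'string') {
    return fallback;
  }
  const trimmed = rawTitle.trim();
  return trimmed.length > 0 ? trimmed : fallback;
}

console.log(safeTitle(undefined));     // untitled
console.log(safeTitle('  My post  ')); // My post
```

If several posts lack a title, a fixed fallback would make them share one file name, so combining it with the post date or ID would be safer.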

html table support

Nice program !!

For anyone looking for HTML table support:

yarn add turndown-plugin-gfm, or
npm i --save turndown-plugin-gfm
Then add the code below (the 2 lines marked with >>) to index.js, at about line 15:

>> const tables = require('turndown-plugin-gfm').tables;
const TurndownService = require('turndown');

var tds = new TurndownService({ codeBlockStyle: 'fenced', fence: '```' })
>> tds.use(tables);

=> reference here, https://github.com/domchristie/turndown-plugin-gfm#usage

In addition, I like the output files (from Blogger) to be organized in a folder structure such as
2019/12/filename.md
Below is the patch (it might not suit everyone's needs).

// line 304
/* delete */ var fname = outputDir + '/' + path.basename(url);
/* add */ var fname = outputDir + '/' + url.substr(url.indexOf(".com")+5);

// line 438
function writeToFile(filename, content, append=false){

>>  let f = path.parse(filename);
>>  if (!fs.existsSync(f.dir))
>>      fs.mkdirSync(f.dir, {recursive: true});
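The patch above finds the path by searching for ".com", which only works for blogs hosted on a .com domain. A variant that parses the URL instead is sketched below. datedOutputPath is a hypothetical helper, the example URL is made up, and the sketch assumes Blogger post URLs of the form /<year>/<month>/<name>.

```javascript
const path = require('path');

// Hypothetical helper: maps a post URL to <outputDir>/<year>/<month>/<name>.
function datedOutputPath(outputDir, postUrl) {
  // URL.pathname is everything after the host, e.g. "/2019/12/my-post.html".
  const { pathname } = new URL(postUrl);
  // Drop the leading slash so the path is joined as a relative segment.
  return path.posix.join(outputDir, pathname.replace(/^\/+/, ''));
}

console.log(datedOutputPath('out', 'https://example.blogspot.co.uk/2019/12/my-post.html'));
// out/2019/12/my-post.html
```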

NPM Package is a lot better, because we can use NPX

I suggest publishing this Node.js project as an npm package, so that anyone can run it with npx without cloning the repository.

Post without title

Blogger allows posts without a title, but if I run node index.js b ... against a backup of a blog with some untitled posts, I get an error from the sanitize function: "Input must be a string".

I resolved it by manually adding a title to the only two untitled posts in the .xml file, but I think it would be better to handle this case in the script.

However, thank you very much for this script, it saved me lot of time!

Post front matter

This is more of a question than an issue. I ran this variation of the script: node index.js w your-wordpress-backup-export.xml out m

However, the front matter in the output looks different than what I am used to seeing for Hugo. Is there a way to generate TOML front matter rather than YAML?

Expected:

+++
date = "2011-02-07T14:30:43+00:00"
title = "First Post"
draft = false
tags = ['Fun', 'News']
+++

Output:

---
title: 'First Post'
date: Mon, 07 Feb 2011 14:30:43 +0000
draft: false
tags: ['Fun', 'News']
---
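The README does not list a TOML option, so converting the fields would need a small post-processing step or a change to the script. The sketch below renders the four fields from this issue as TOML with an ISO 8601 date. toTomlFrontMatter is a hypothetical function, not a blog2md feature, and it only handles these four fields.

```javascript
// Hypothetical formatter, not a blog2md option: renders front matter as TOML.
function toTomlFrontMatter({ title, date, draft, tags }) {
  // JSON.stringify yields double-quoted, escaped strings, which are also
  // valid TOML basic strings for ordinary titles and tags.
  const quote = (value) => JSON.stringify(String(value));
  // Convert whatever Date can parse (e.g. an RFC 2822 date) to ISO 8601 in UTC.
  const isoDate = new Date(date).toISOString();
  return [
    '+++',
    `title = ${quote(title)}`,
    `date = ${quote(isoDate)}`,
    `draft = ${Boolean(draft)}`,
    `tags = [${tags.map(quote).join(', ')}]`,
    '+++',
  ].join('\n');
}

console.log(toTomlFrontMatter({
  title: 'First Post',
  date: 'Mon, 07 Feb 2011 14:30:43 +0000',
  draft: false,
  tags: ['Fun', 'News'],
}));
```

Note that toISOString always reports the time in UTC, so a post written with a non-zero offset keeps its instant but loses the original offset.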

Doesn't interpret category taxonomy?

What the title says. I know Hugo emphasizes tags over categories, but as far as I know it allows both taxonomies. I've always thought that categories "outrank" tags, perhaps more so in WordPress than in Blogger?

Either way, it would be nice to optionally specify one, both, or the other?

Markdown text from WP XML stripped off line breaks

I have been testing Hugo and decided to dump out my WP posts which number close to 5000. I ran blog2md on the XML file and everything went very well except when I checked the md files.

All the post data is correct except for the body of the post, which is stripped of all linebreaks for the paragraphs.

I tried poking around in the index.js to see why but couldn't see anything obvious.

My WP posts are all in Markdown so I am wondering if your code is expecting HTML instead.

Here's a screenshot of the WP XML file followed by what it looks like after blog2md:

Thanks for your help.

URLs are not correctly migrated into Hugo

I migrated my WordPress blog to Hugo a few days ago, and while reviewing the changes I noticed that the URL information was not migrated. Unfortunately this is a serious issue for me, as I rely on the slug for navigation on my site.

Draft content not exported?

It seems to me that draft content is not exported. It would be nice to have an option to also export the draft content of the blog.

err parsing xml

Installation worked fine. I copied my XML file into the blog2md folder and ran the script, but I get an error.

$ node index.js w MyFile.xml out s
WARNING: Given output directory "out" already exists. Files will be overwritten.
INFO: Comments requested to be a separate .md file(m - default)
Error parsing xml file (MyFile.xml)
{}

The XML file had just been downloaded without any problems.
Any advice on what problem occurred?
Thanks
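The bare {} in the log above is what a JavaScript Error object looks like when it is serialized as JSON, because its message and stack properties are not enumerable. Whether blog2md prints the error this way is an assumption; if it does, logging the message directly would show the real cause. In the sketch below, describeError is a hypothetical helper and the error text is made up for illustration.

```javascript
// An Error's message and stack are non-enumerable, so JSON output hides them.
const err = new Error('Unexpected close tag'); // made-up message for the demo
console.log(JSON.stringify(err)); // {}

// Hypothetical helper: produces a readable description for any thrown value.
function describeError(e) {
  return e instanceof Error ? `${e.name}: ${e.message}` : String(e);
}

console.log(describeError(err)); // Error: Unexpected close tag
```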