blog2md's Introduction

Blogger to Markdown

Convert Blogger & WordPress backup blog posts to Hugo-compatible Markdown documents

Usage: node index.js b|w <BLOGGER BACKUP XML> <OUTPUT DIR>

For Blogger imports, blog posts and comments (as a separate file, <postname>-comments.md) will be created in the "out" directory

node index.js b your-blogger-backup-export.xml out

For WordPress imports, blog posts and comments (as a separate file, <postname>-comments.md) will be created in the "out" directory

node index.js w your-wordpress-backup-export.xml out

If you want the comments merged into the post file itself, add the m flag at the end. The default is s, for a separate comments file

node index.js w your-wordpress-backup-export.xml out m

If converting from WordPress, and you have posts that do not contain HTML, you can use a paragraph-fix flag at the end.

node index.js w your-wordpress-backup-export.xml out m paragraph-fix

Installation (usual node project)

  • Download or Clone this project
  • cd to directory
  • Run npm install to install dependencies
  • Run node index.js <arg...>

Notes to self

Script to convert posts from Blogger to Markdown.

  • Read XML
  • Parse Entries (Posts and comments) (with xpath?)
  • Parse Title, Link, Created, Updated, Content, Link
  • List Post & Respective comment counts
  • Content to MD - pandoc?
  • Parse Images, Files, Videos linked to the posts
  • Create output dir
  • List items that are not downloaded( or can't) along with their .md file for user to proceed

Reasons

  • Wrote this to consolidate and convert my blogs under one roof.
  • Plain simple workflow with hugo
  • The idea was to download the assets (images/files) linked to each post. I gave up because it was time consuming, I would have to validate the converted markdown against the assets anyway, and I didn't see the benefit.
  • Initial assumption was to parse with xpath but I found xml2json.js was easier
  • I also thought pandoc was overkill, and turndown.js was successful, though I had to wrap empty text to md instead of html.
  • I want to retain comments. Believe it or not, There were some good comments.
  • I was sick and spent about 12 hours over 5 days coding and testing against my own blog content (about 150 posts). Also, I find parsing oddly satisfying when it results in success. ¯\_(ツ)_/¯

blog2md's Issues

thank you!

that worked really well and was totally painless.

Support for non-English languages

I like this tool a lot.
Just wondering whether some non-English characters can still be recognized in future versions? Currently the export only keeps English letters in the titles. Thanks!
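One possible direction, as a minimal sketch only: if the non-English characters are lost while a title is turned into a file name, a Unicode-aware slug function would keep them. The name `unicodeSlug` is hypothetical and not part of blog2md, and which step in blog2md actually drops the characters is an assumption here.

```javascript
// Hypothetical helper, not part of blog2md: builds a file-name slug while
// keeping letters, combining marks, and digits from any script.
function unicodeSlug(title) {
  return String(title)
    .normalize('NFC')                       // compose accents into single code points
    .toLowerCase()
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, '-')  // every other run of characters becomes "-"
    .replace(/^-+|-+$/g, '');               // trim leading and trailing dashes
}

console.log(unicodeSlug('Привет, мир!'));          // привет-мир
console.log(unicodeSlug('日本語のタイトル 2019'));  // 日本語のタイトル-2019
```

The `u` flag is what enables the `\p{...}` Unicode property escapes; without it the pattern would not match as intended.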

Error when Blogger draft has no title: `throw er; // Unhandled 'error' event`

When I ran blog2md on my Blogger export, it crashed with the following error:

PS D:\temp\blog2md> ..\node-v18.14.2-win-x64\node.exe index.js b .\blog-03-04-2023.xml ./myblog
INFO: Comments requested to be a separate .md file(m - default)
Total no. of entries found : 295
Content-posts 63
Content-Comments 180
title: "undefined"
date: 2014-08-08T01:35:00.001+03:00
draft: true
node:events:491
      throw er; // Unhandled 'error' event
      ^

Error: Input must be string
    at sanitize (D:\temp\blog2md\node_modules\sanitize-filename\index.js:41:11)
    at module.exports (D:\temp\blog2md\node_modules\sanitize-filename\index.js:54:16)
    at getFileName (D:\temp\blog2md\index.js:245:23)
    at D:\temp\blog2md\index.js:306:42
    at Array.forEach (<anonymous>)
    at D:\temp\blog2md\index.js:286:23
    at Parser.<anonymous> (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:303:18)
    at Parser.emit (node:events:513:28)
    at SAXParser.onclosetag (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:261:26)
    at emit (D:\temp\blog2md\node_modules\sax\lib\sax.js:624:35)
Emitted 'error' event on Parser instance at:
    at exports.Parser.Parser.parseString (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:326:16)
    at Parser.parseString (D:\temp\blog2md\node_modules\xml2js\lib\parser.js:5:59)
    at D:\temp\blog2md\index.js:257:16
    at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68:3)

I realised this was caused by an unpublished draft from 2014 that didn't have a title configured.
After adding a title on Blogger and downloading a new .xml export, blog2md worked successfully.
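The trace shows sanitize-filename rejecting a non-string input passed from getFileName. A minimal sketch of a guard that could sit in front of that call follows; `titleOrFallback` and the fallback naming scheme are hypothetical and not blog2md's actual code.

```javascript
// Hypothetical guard, not blog2md's actual code: make sure a non-empty string
// reaches the file-name sanitizer even when the entry has no title.
function titleOrFallback(title, fallback) {
  if (typeof title === 'string' && title.trim() !== '') {
    return title;
  }
  return fallback;
}

// An untitled draft falls back to a name built from its date (an assumed scheme).
console.log(titleOrFallback(undefined, 'untitled-2014-08-08')); // untitled-2014-08-08
console.log(titleOrFallback('My post', 'untitled-2014-08-08')); // My post
```

The result would then be passed on to the sanitizer in place of the raw title.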

html table support

Nice program !!

For those who are looking for HTML table support:

  • yarn add turndown-plugin-gfm , or
  • npm i --save turndown-plugin-gfm
  • add below code (2 lines, noted by >>) in index.js, about line 15
>> const tables = require('turndown-plugin-gfm').tables;
const TurndownService = require('turndown');

var tds = new TurndownService({ codeBlockStyle: 'fenced', fence: '```' })
>> tds.use(tables);

=> reference here, https://github.com/domchristie/turndown-plugin-gfm#usage


In addition to that, I like the output files (from Blogger) to be organized in a folder structure such as
2019/12/filename.md
Below is the patch (it might not suit everyone's needs).

// line 304
/* delete */ var fname = outputDir + '/' + path.basename(url);
/* add */ var fname = outputDir + '/' + url.substr(url.indexOf(".com")+5);

// line 438
function writeToFile(filename, content, append=false){

>>  let f = path.parse(filename);
>>  if (!fs.existsSync(f.dir))
>>      fs.mkdirSync(f.dir, {recursive: true});
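The `indexOf(".com")` lookup in the patch above only works for hosts ending in .com. A hedged alternative sketch uses Node's built-in `URL` class to take the path portion of the post URL; `relativePostPath` is a hypothetical name and the example URLs are made up.

```javascript
// Hypothetical helper, not part of blog2md: returns the path of a post URL
// without the leading slash, e.g. "2019/12/filename.html".
function relativePostPath(postUrl) {
  const { pathname } = new URL(postUrl); // URL is a global in current Node versions
  return pathname.replace(/^\/+/, '');
}

console.log(relativePostPath('https://example.blogspot.com/2019/12/filename.html'));
// 2019/12/filename.html
console.log(relativePostPath('https://blog.example.co.uk/2019/12/filename.html'));
// 2019/12/filename.html
```

The returned value could be appended to the output directory in the same way as the `substr` result in the patch.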

Post without title

Blogger allows posts without a title, but if I run node index.js b ... against a backup of a blog that has some untitled posts, I get an error from the sanitize function: "Input must be a string".

I worked around it by manually adding a title in the .xml file to the only two posts that lacked one, but I think it would be better to handle this case in the script.

However, thank you very much for this script, it saved me lot of time!

Post front matter

This is more of a question than an issue. I ran this variation of the script: node index.js w your-wordpress-backup-export.xml out m

However, the front matter in the output looks different than what I am used to seeing for Hugo. Is there a way to generate TOML front matter rather than YAML?

Expected:

+++
date = "2011-02-07T14:30:43+00:00"
title = "First Post"
draft = false
tags = ['Fun', 'News']
+++

Output:

---
title: 'First Post'
date: Mon, 07 Feb 2011 14:30:43 +0000
draft: false
tags: ['Fun', 'News']
---
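As a rough sketch of what a TOML variant could look like, the function below renders the four fields from the example between `+++` delimiters. `toTomlFrontMatter` and the shape of its `meta` argument are hypothetical; blog2md's real front matter code may be organized differently.

```javascript
// Hypothetical helper, not part of blog2md: renders a few front matter
// fields as TOML between +++ delimiters.
function toTomlFrontMatter(meta) {
  // JSON string escaping is valid in TOML basic strings for common text.
  const quote = (value) => JSON.stringify(String(value));
  return [
    '+++',
    `date = ${quote(new Date(meta.date).toISOString())}`, // normalize to ISO 8601
    `title = ${quote(meta.title)}`,
    `draft = ${meta.draft ? 'true' : 'false'}`,
    `tags = [${(meta.tags || []).map(quote).join(', ')}]`,
    '+++',
  ].join('\n');
}

console.log(toTomlFrontMatter({
  title: 'First Post',
  date: 'Mon, 07 Feb 2011 14:30:43 +0000',
  draft: false,
  tags: ['Fun', 'News'],
}));
```

This sketch emits double-quoted strings and a UTC timestamp with milliseconds, so the output differs cosmetically from the expected block above while carrying the same values.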

Doesn't interpret category taxonomy?

What the title says. I know Hugo emphasizes tags over categories, but afaik allows both taxonomies. I've always thought that categories "outrank" tags, perhaps more so in WP than blogger?

Either way, it would be nice to optionally specify one, both, or the other?
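A minimal sketch of how the two taxonomies could be told apart for WordPress exports. It assumes that each `<category>` element in the export carries a `domain` attribute of either `category` or `post_tag`, and that the parser exposes attributes under `$` and text under `_` (the default shape produced by xml2js). `splitTaxonomies` is a hypothetical name, not a function in blog2md.

```javascript
// Hypothetical helper, not part of blog2md: sorts parsed <category> nodes
// into Hugo categories and tags based on their "domain" attribute.
function splitTaxonomies(categoryNodes) {
  const result = { categories: [], tags: [] };
  for (const node of categoryNodes || []) {
    const domain = node.$ && node.$.domain; // attribute bag (assumed xml2js shape)
    const name = node._;                    // element text (assumed xml2js shape)
    if (!name) continue;
    if (domain === 'category') result.categories.push(name);
    else if (domain === 'post_tag') result.tags.push(name);
  }
  return result;
}

console.log(splitTaxonomies([
  { _: 'News', $: { domain: 'category', nicename: 'news' } },
  { _: 'Fun', $: { domain: 'post_tag', nicename: 'fun' } },
]));
```

Either list could then be written to the front matter, or skipped, depending on a command-line option.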

Markdown text from WP XML stripped off line breaks

I have been testing Hugo and decided to dump out my WP posts which number close to 5000. I ran blog2md on the XML file and everything went very well except when I checked the md files.

All the post data is correct except for the body of the post, which is stripped of all linebreaks for the paragraphs.

I tried poking around in the index.js to see why but couldn't see anything obvious.

My WP posts are all in Markdown so I am wondering if your code is expecting HTML instead.

Here's a screenshot of the WP XML file followed by what it looks like after blog2md:

[Screenshot: Screen Shot 2019-11-21 at 2 32 27 pm]

[Screenshot: Screen Shot 2019-11-21 at 2 34 02 pm]

Thanks for your help.
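If the converter does treat the post body as HTML, single and double newlines in plain text would collapse like any other HTML whitespace. A hedged sketch of a pre-processing step that wraps blank-line-separated blocks in `<p>` tags before conversion is shown below. `wrapParagraphs` is a hypothetical name, and this is not necessarily what blog2md's paragraph-fix flag does.

```javascript
// Hypothetical pre-processing step, not blog2md's actual code: wraps each
// blank-line-separated block of text in <p>...</p> so that an HTML-to-Markdown
// converter sees explicit paragraph boundaries.
function wrapParagraphs(text) {
  return String(text)
    .split(/\n\s*\n/)              // one or more blank lines end a paragraph
    .map((block) => block.trim())
    .filter((block) => block !== '')
    .map((block) => `<p>${block}</p>`)
    .join('\n');
}

console.log(wrapParagraphs('First paragraph.\n\nSecond paragraph.\n'));
// <p>First paragraph.</p>
// <p>Second paragraph.</p>
```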

URLs are not correctly migrated into Hugo

I migrated my WordPress blog to Hugo a few days ago, and while reviewing the changes I noticed that the URL information did not get migrated. Unfortunately this is a very important issue for me, as I rely on the slug for navigation on my site.

Draft content not exported?

It seems to me that draft content is not exported. It would be nice to have an option to also export the draft content of the blog.

err parsing xml

Installation worked fine. I copied my xml file into the blog2md folder and ran the script, but I am getting an error.

$ node index.js w MyFile.xml out s
WARNING: Given output directory "out" already exists. Files will be overwritten.
INFO: Comments requested to be a separate .md file(m - default)
Error parsing xml file (MyFile.xml)
{}

The xml file was just downloaded without any problems.
Any advice on what problem occurred?
Thanks
