some-programs / exitwp

Exitwp is a tool primarily aimed at making migration from one or more WordPress blogs to the Jekyll blog engine as easy as possible.

exitwp's Issues

Export custom post type?

I'm trying to export a WP site with the Portfolio Press theme: https://wptheming.com/portfolio-press/. Most of my content is saved as the custom post type "portfolio". They appear in the .xml file after exporting from WP.

But exitwp.py does not output the portfolio posts in /build. At the end of the command-line report there are several lines like: .Unknown item type :: portfolio

How do I include this custom post type in the output?

<script> and <iframe> are lost in markdown output

The issue appears to be that html2text_file is too eager to clean up.

Substitute image links with downloaded paths

It's good that exitwp has a download_images option. It would be great if all image links were replaced with the downloaded paths in the process.

ValueError: invalid literal for int() with base 10: 'No Content Found'

Thank you very much for this awesome tool. Unfortunately, I am having issues with the newest WordPress export (I did not find out whether they changed something). For one, I get "Wrong date in {my title}" errors from line 218, as all the dates are formatted like Mon, 17 Nov 2014 18:58:32 +0000. This is not as bad as the crash I get after this warning, with the following traceback:

Traceback (most recent call last):
  File "exitwp.py", line 382, in <module>
    write_jekyll(data, target_format)
  File "exitwp.py", line 306, in write_jekyll
    'wordpress_id': int(i['wp_id']),
ValueError: invalid literal for int() with base 10: 'No Content Found'

Unfortunately, I did not find an issue in my XML, there is the corresponding <wp:post_id> tag, no xmllint issues found. Can you give me a hint what to look out for? Thanks.
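
For the "Wrong date" warnings, a date such as Mon, 17 Nov 2014 18:58:32 +0000 is in the RFC 2822 layout, which the standard library can parse directly. Here is a minimal sketch; the helper name parse_pub_date is made up for illustration and is not part of exitwp:

```python
from email.utils import parsedate_to_datetime


def parse_pub_date(value):
    """Parse an RFC 2822 date string; return None if it cannot be parsed.

    Hypothetical helper for illustration, not an exitwp function.
    """
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


print(parse_pub_date("Mon, 17 Nov 2014 18:58:32 +0000"))
print(parse_pub_date("No Content Found"))
```

Returning None for unparseable input lets the caller decide whether to skip the item or fall back to another field, instead of crashing later on a placeholder string.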

No Content Found (gi function fail?)

Hi, first time writing an issue but I'll try to get to the point asap.

The gi (get information, I suppose?) function at line 117 does not seem to properly parse or traverse my XML export file; it does not find a wp_id element with which to write the new build.

I do not know if this is an error with the namespace, the find method, or both. Or something else entirely; I'm still a beginner at Python [and at programming in general, so apologies in advance if this issue is redundant (I did search for similar ones, though)].

Below is the last error message it printed:

Tearful Hexagon https://xampuparaossos.com.br/2019/08/23/tearful-hexagon/ writing.Wrong date in Placeholder Image
Traceback (most recent call last):
  File "exitwp.py", line 382, in <module>
    write_jekyll(data, target_format)
  File "exitwp.py", line 306, in write_jekyll
    'wordpress_id': int(i['wp_id']),
ValueError: invalid literal for int() with base 10: 'No Content Found'

It seems I cannot attach a xml file format, but I checked the wordpress import and it appears to have all required tags (wp:post_id etc). I've tried running python2 and python3, to the same result.

I know you're a busy guy, but if you see this I just want to say thanks for reading my issue and for keeping this awesome project updated until so recently! Looking forward to a fix so I can continue working. c;

unsuccessful convert

Thanks for providing this excellent tool, I'm giving it a try in an attempt to migrate to a static jekyll-powered site.
I'm on Ubuntu. I've installed the dependencies, made sure my exported XML passes through xmllint without errors, put it in the wordpress_xml directory and ran your Python script, but I'm hitting the following error and have no idea what to try next. Any suggestions?

Traceback (most recent call last):
  File "exitwp.py", line 323, in <module>
    data = parse_wp_xml(wpe)
  File "exitwp.py", line 137, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 105, in parse_items
    body = gi('content:encoded')
  File "exitwp.py", line 100, in gi
    result = i.find(ns[namespace] + tag).text
AttributeError: 'NoneType' object has no attribute 'text'
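
The traceback shows find() returning None because the item has no content:encoded child, after which .text fails. One way to tolerate missing elements is a lookup that returns a default. This is only a sketch: the helper name child_text and the namespace table are assumptions for illustration, not exitwp's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed for illustration; a real export declares its own.
NS = {"content": "{http://purl.org/rss/1.0/modules/content/}"}


def child_text(item, tag, namespace=None, default=""):
    """Return the text of a child element, or `default` if it is absent or empty."""
    prefix = NS[namespace] if namespace else ""
    node = item.find(prefix + tag)
    if node is None or node.text is None:
        return default
    return node.text


item = ET.fromstring(
    '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<title>Hello</title></item>"
)
print(repr(child_text(item, "title")))
print(repr(child_text(item, "encoded", "content")))
```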

License?

@thomasf , I read

The problem is, I created this project to get out of wordpress.. Now when I'm out the incentive to work actively on exitwp is kind of low.

You have no obligation, but if you were to choose an open source license, then this project would be free ;-)

"No module named yaml" error

Traceback (most recent call last):
  File "exitwp.py", line 13, in <module>
    import yaml
ImportError: No module named yaml

This is on Mac OS X El Capitan. I removed and reinstalled Python using Brew, then followed the directions again. I am not a Python programmer, but I wonder why line 13 is "import yaml" and not "import pyYAML". However, I tried that and had the same issue. It must be something in my configuration.

Here is what I have installed via pip:

awscli (1.9.6)
beautifulsoup4 (4.2.0)
botocore (1.3.6)
colorama (0.3.3)
docutils (0.12)
html2text (3.200.3)
html5lib (1.0b1)
jmespath (0.9.0)
pip (8.1.2)
pyaml (15.8.2)
pyasn1 (0.1.9)
python-dateutil (2.4.2)
PyYAML (3.10)
rsa (3.2.3)
setuptools (19.4)
six (1.10.0)
vboxapi (1.0)
wheel (0.26.0)

Here are details about PyYAML:

Metadata-Version: 2.0
Name: PyYAML
Version: 3.10
Summary: YAML parser and emitter for Python
Home-page: http://pyyaml.org/wiki/PyYAML
Author: Kirill Simonov
Author-email: [email protected]
Installer: pip
License: MIT
Location: /usr/local/lib/python2.7/site-packages
Requires:
Classifiers:
Development Status :: 5 - Production/Stable
Intended Audience :: Developers
License :: OSI Approved :: MIT License
Operating System :: OS Independent
Programming Language :: Python
Programming Language :: Python :: 2
Programming Language :: Python :: 2.3
Programming Language :: Python :: 2.4
Programming Language :: Python :: 2.5
Programming Language :: Python :: 2.6
Programming Language :: Python :: 2.7
Programming Language :: Python :: 3
Programming Language :: Python :: 3.0
Programming Language :: Python :: 3.1
Programming Language :: Python :: 3.2
Topic :: Software Development :: Libraries :: Python Modules
Topic :: Text Processing :: Markup
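
On the naming question: the PyYAML distribution installs a module named yaml, so "import yaml" is the expected spelling. When pip lists the package but the import still fails, a frequent cause is that pip installed into a different interpreter than the one running the script. A small diagnostic sketch (the helper name module_available is made up for illustration):

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


# Shows which interpreter is actually running and whether it can see PyYAML.
print("interpreter:", sys.executable)
print("yaml importable:", module_available("yaml"))
```

If the interpreter shown is not the one pip reported under Location, installing with "python -m pip install pyyaml", using the same python that runs exitwp.py, keeps the two in sync.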

Show stack trace

I was running into errors when importing my file, and it really bothered me that all I got was a single "Parse error on: post title" line.

I suggest adding this:

import sys
import traceback

# ...

try:
    out.write(html2fmt(i['body'], target_format))
except Exception:
    # Report which post failed, then print the full stack trace.
    print("\n Parse error on: " + i['title'])
    traceback.print_exc(file=sys.stdout)

So we could have more details about the exception.

Thanks,

exitwp creates hidden blog folder

This is probably a simple answer.

I've gone through exitwp. I'm using a Mac running 10.9.5. Before I started, I set up an empty Jekyll directory; I was planning to move all my markdown files into the new '_posts' directory when I was done. However, while I can see all of my posts in Terminal, the Jekyll directory created by exitwp is hidden in my Finder window. I am able to copy or move files around via Terminal, but I cannot actually see anything with Finder. How can I reveal this folder, without revealing all hidden folders on my computer?

Support different Metadata Layouts with Templates

I have a WordPress blog and want to export it to a static site generator (I don't know which one yet).

Some of them have an importer built in, like Nikola (which currently doesn't work), but why reinvent the wheel if this tool works?

We only have to support all the different metadata formats.

I would suggest using Jinja2 templates, but I think supporting Mako (later) would not be hard either.

Once we have the templates, someone could also write a converter, so switching generators would be possible too.

Not parsing content:encoded

I'm getting the following after running your tool:

reading: wordpress-xml/wordpress.2011-07-11.xml
Traceback (most recent call last):
  File "exitwp.py", line 287, in <module>
    data=parse_wp_xml(wpe)
  File "exitwp.py", line 134, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 102, in parse_items
    body=gi('content:encoded')
  File "exitwp.py", line 98, in gi
    result=i.find(ns[namespace]+tag).text
AttributeError: 'NoneType' object has no attribute 'text'

Any idea what may go wrong? Also, what is the version of WP required for the exported xml to work?

exitwp.py not compatible with BeautifulSoup 4

Had to change this line to make it work:

from BeautifulSoup import BeautifulSoup

to:

from bs4 import BeautifulSoup

Create some wordpress.xml-files for testing.

I'm having trouble maintaining this application due to the lack of tests.
A bunch of wordpress.xml's from different sources to verify that it's all working should do fine.

Error with time data

When I run python exitwp.py I get the following error message:

Traceback (most recent call last):
  File "exitwp.py", line 374, in <module>
    write_jekyll(data, target_format)
  File "exitwp.py", line 296, in write_jekyll
    i['date'], '%Y-%m-%d %H:%M:%S').replace(tzinfo=UTC()),
  File "/Users/xxxxx/.pyenv/versions/2.7.10/lib/python2.7/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data '0000-00-00 00:00:00' does not match format '%Y-%m-%d %H:%M:%S'

I am not a Python programmer but I looked up what the time data format means and it seems to me that '0000-00-00 00:00:00' DOES MATCH the format '%Y-%m-%d %H:%M:%S'.

Any idea what is wrong? Help would be very appreciated!
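
The string has the right shape, but strptime also validates the values: month and day must be at least 1, and Python's datetime has no year 0 (datetime.MINYEAR is 1), so an all-zero date is rejected. A sketch of one way to cope is to try several candidate fields and take the first that parses. The function name first_valid_date and the choice of candidates are assumptions for illustration, not exitwp behaviour.

```python
from datetime import datetime

WP_FORMAT = "%Y-%m-%d %H:%M:%S"


def first_valid_date(candidates):
    """Return the first candidate string that parses as a real date, else None."""
    for value in candidates:
        try:
            return datetime.strptime(value, WP_FORMAT)
        except (TypeError, ValueError):
            # An all-zero date or a missing value: try the next candidate.
            continue
    return None


print(first_valid_date(["0000-00-00 00:00:00", "2015-03-01 10:00:00"]))
```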

where is the build directory ?

You should now have all the blogs converted into separate directories under the build directory

ValueError: multi-byte encodings are not supported

I'm using exitwp to migrate from Wordpress to Octopress

I export my WordPress (4.3.1) posts to wordpress.xml
I add xmlns:atom="http://www.w3.org/2005/Atom" to my rss element

but then I get ValueError: multi-byte encodings are not supported.

My wordpress.xml is <?xml version="1.0" encoding="UTF-7" ?>, so I must convert it to UTF-8 for it to work. Hope it helps someone
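
For anyone wanting to script that conversion, here is a rough sketch that decodes the file from its source encoding, rewrites the encoding named in the XML declaration, and re-encodes as UTF-8. The function name reencode_xml is made up for illustration. If the bytes are already UTF-8 and only the declaration is wrong, pass "utf-8" as the source encoding.

```python
import re


def reencode_xml(raw, source_encoding, target_encoding="utf-8"):
    """Decode `raw` bytes, update the XML declaration, and re-encode."""
    text = raw.decode(source_encoding)
    # Rewrite only the first encoding="..." found inside the XML declaration.
    text = re.sub(
        r'(<\?xml[^>]*encoding=")[^"]*(")',
        r"\g<1>" + target_encoding + r"\g<2>",
        text,
        count=1,
    )
    return text.encode(target_encoding)


raw = '<?xml version="1.0" encoding="UTF-7" ?><a>\u00e9</a>'.encode("utf-7")
print(reencode_xml(raw, "utf-7").decode("utf-8"))
```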

Thank you

I know this doesn't belong here but this saved me a lot of work.
Works great,

Thank you

Redirecting stdout causes "UnicodeEncodeError: 'ascii' codec can't encode characters..."

Running 'python exitwp.py' converts fine and successfully creates the build/ output, but 'python exitwp.py > /dev/null' fails with:

Traceback (most recent call last):
  File "exitwp.py", line 353, in <module>
    data = parse_wp_xml(wpe)
  File "exitwp.py", line 164, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 127, in parse_items
    body = gi('content:encoded')
  File "exitwp.py", line 120, in gi
    print result
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-12: ordinal not in range(128)

Commenting out the print result statement on line 120 of exitwp.py serves as a workaround here.
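
A likely explanation is that Python 2 falls back to ASCII when stdout is not a terminal. Setting the PYTHONIOENCODING=utf-8 environment variable for the run is one way around that without editing the script. In Python 3 the same idea can be made explicit by wrapping the binary stream in a writer with a fixed encoding. This is a sketch, and the helper name utf8_writer is made up for illustration:

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so text is always written as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors="replace", newline="\n"
    )


# Demonstrated on an in-memory buffer; for real output one could wrap
# sys.stdout.buffer instead (Python 3 only).
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write("Gr\u00fc\u00dfe\n")
out.flush()
print(buffer.getvalue())
```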

Handle WordPress [caption] tags

WordPress pages use a square-bracketed format for describing image captions in the editor markup, which is then converted to a div when WordPress renders the page.

Because exitwp/html2text sees the markup and doesn't understand it, this makes it through to the markdown output as text that wraps around the image markdown and must be manually removed.

I realise there is no standard markdown for an image caption, but could the script do something a bit nicer than just leaving the [caption] blocks where they are? (e.g. remove them entirely, put the caption in as a new paragraph, etc.) Not an ideal solution, but I don't think there is an ideal solution.
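
As a sketch of the "put the caption in as a new paragraph" option, the shortcode could be unwrapped before the HTML reaches html2text. The code below assumes the common form [caption ...]<img ...> text[/caption]. The function name unwrap_captions is made up for illustration, and exports that store the text in a caption="..." attribute are not handled.

```python
import re

CAPTION_RE = re.compile(r"\[caption[^\]]*\](.*?)\[/caption\]", re.DOTALL | re.IGNORECASE)
# An <img>, optionally wrapped in a link.
IMG_RE = re.compile(r"(<a\b[^>]*>\s*)?<img\b[^>]*>(\s*</a>)?", re.IGNORECASE)


def unwrap_captions(html):
    """Replace [caption] shortcodes with the image followed by an italic paragraph."""

    def replace(match):
        inner = match.group(1)
        img = IMG_RE.search(inner)
        if img is None:
            return inner  # No image found: just drop the shortcode wrapper.
        caption = (inner[: img.start()] + inner[img.end():]).strip()
        if not caption:
            return img.group(0)
        return "{}\n<p><em>{}</em></p>".format(img.group(0), caption)

    return CAPTION_RE.sub(replace, html)


sample = 'Before [caption id="attachment_6" align="alignright" width="300"]<img src="a.png" alt="A" /> A small boat[/caption] after'
print(unwrap_captions(sample))
```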

No module name yaml??

I get the following error:

Traceback (most recent call last):
  File "exitwp.py", line 10, in <module>
    import yaml
ImportError: No module named yaml

What am I missing?

-- Smittie

escape title of blog post

The title of a blog post needs to be escaped, e. g. "This & That" -> "title: This & That"

Error during read

I get the following trying to convert my blog exported from Wordpress 3.0.1:

reading: wordpress-xml/wordpress.2011-07-24.xml
Traceback (most recent call last):
  File "exitwp.py", line 299, in <module>
    data=parse_wp_xml(wpe)
  File "exitwp.py", line 135, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 87, in parse_items
    t_domain=unicode(tax.attrib['domain'])
KeyError: 'domain'

AttributeError: 'NoneType' object has no attribute 'text'

I just tried to get rid of one of my wordpress blogs but pretty early had to stop because of this:

reading: wordpress-xml/photos.xml
Traceback (most recent call last):
  File "exitwp.py", line 292, in <module>
    data=parse_wp_xml(wpe)
  File "exitwp.py", line 127, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 111, in parse_items
    'date' : gi('wp:post_date'),
  File "exitwp.py", line 91, in gi
    result=i.find(ns[namespace]+tag).text
AttributeError: 'NoneType' object has no attribute 'text'

Am I doing it wrong?

Parsing and encoding error

I get an encoding error (see below) "suddenly", after everything has been running fine every day for some weeks. Maybe it has something to do with #58, but I don't think so.

My input XML is UTF-8 (also stated in the xml header line).

I was not able to find out which input (line) causes the problem. It seems there is some problem in the input that causes the parser to fail. And then even the print of the problematic line (variable body) fails.

Any hint on how to continue debugging? (I am not a python nerd...)

Traceback (most recent call last):
  File "/home/stk/exitwp/exitwp.py", line 373, in <module>
    data = parse_wp_xml(wpe)
  File "/home/stk/exitwp/exitwp.py", line 176, in parse_wp_xml
    'items': parse_items(),
  File "/home/stk/exitwp/exitwp.py", line 148, in parse_items
    print 'could not parse html: ' + body
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 138: ordinal not in range(128)

Download all links which point to images

Some links have an image inside:

<a href="A.png">
   <img src="B.png"/>
</a>

If we use the download_images option, image B.png will be downloaded, while image A.png won't be.

I suggest downloading the link href if either:

the href attribute ends with an image extension
the HTTP MIME type is image/*
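
The first condition can be sketched with the standard library alone. The collector below gathers every img src plus any link href whose path ends in a known image extension. The class name, function name, and extension list are assumptions for illustration, and the MIME-type check is left out because it needs a network request.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Extension list chosen for illustration; adjust as needed.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


class ImageLinkCollector(HTMLParser):
    """Collect image URLs from <img src> and from <a href> pointing at image files."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            path = urlparse(attrs["href"]).path.lower()
            if path.endswith(IMAGE_EXTENSIONS):
                self._add(attrs["href"])

    def _add(self, url):
        if url not in self.urls:  # Keep first-seen order, skip duplicates.
            self.urls.append(url)


def image_urls(html):
    collector = ImageLinkCollector()
    collector.feed(html)
    return collector.urls


print(image_urls('<a href="A.png"><img src="B.png"/></a><a href="page.html">x</a>'))
```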

Downloaded images corrupt

I tried exitwp on 2 different machines - the downloaded images are corrupt, all of them are exactly 162 bytes as well. It's as if it's only downloading the first 'chunk' or something. I'm using python 2.7.6 on OS X Mavericks.

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 31818,

I'm trying to import from a WordPress 3.6 XML export and I'm getting this error.

Any idea on how to fix it?

reading: wordpress-xml/wordpress.xml
Traceback (most recent call last):
  File "exitwp.py", line 361, in <module>
    data = parse_wp_xml(wpe)
  File "exitwp.py", line 82, in parse_wp_xml
    root = tree.parse(file, parser)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 653, in parse
    parser.feed(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 31818, column 1
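
The error already names the line, so a first debugging step is to look at what sits there. ParseError carries a position attribute (line, column) that can be used to print the offending line. This is a sketch on an in-memory string, and the function name describe_parse_error is made up for illustration:

```python
import xml.etree.ElementTree as ET


def describe_parse_error(xml_text):
    """Return (line, column, offending line text) for malformed XML, or None if it parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line_no, column = err.position
        lines = xml_text.split("\n")
        offending = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        return line_no, column, offending
    return None


# An unescaped ampersand is one common cause of "invalid token".
broken = "<rss>\n<title>ok</title>\n<body>Fish & chips</body>\n</rss>"
print(describe_parse_error(broken))
```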

No html Site with tablepress Plugin

No HTML _site is generated when I run python exitwp.py. The output is:

publish tablepress_table 2045 0 closed
writing.........................Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
..Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
.Unknown item type :: tablepress_table
...
.Unknown item type :: tablepress_table

I have a WP site with the TablePress plugin (TablePress version 1.9.2).

WP sites without this plugin work fine.

What can I do?

Error while importing

I have exported from wordpress.com but am seeing this when I run the script :-

reading: wordpress-xml/markwaters.wordpress.20111227.xml
Traceback (most recent call last):
  File "exitwp.py", line 292, in <module>
    data=parse_wp_xml(wpe)
  File "exitwp.py", line 61, in parse_wp_xml
    root=tree.parse(file)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
    parser.feed(data)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: unbound prefix: line 455, column 1

Running on Ubuntu 10.04.3

Errors On File Read

Keep getting these three errors when trying to run the conversion, any idea what they mean?

Traceback (most recent call last):
File "exitwp.py", line 292, in <module>
data=parse_wp_xml(wpe)
File "exitwp.py", line 61, in parse_wp_xml
root=tree.parse(file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 653, in parse
parser.feed(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: unbound prefix: line 9, column 1
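
"Unbound prefix" means an element uses a prefix such as atom: whose xmlns declaration is missing from the root element. The sketch below reproduces the error on a tiny sample and shows that adding the declaration makes it parse. The sample documents are made up, and the Atom URI is the one mentioned in the multi-byte encodings issue.

```python
import xml.etree.ElementTree as ET

# Uses the atom: prefix without declaring it on the root element.
BROKEN = '<rss version="2.0"><channel><atom:link href="http://example.com/feed"/></channel></rss>'

# Same document with the missing namespace declaration added.
FIXED = BROKEN.replace('<rss ', '<rss xmlns:atom="http://www.w3.org/2005/Atom" ', 1)


def parses(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("broken parses:", parses(BROKEN))
print("fixed parses:", parses(FIXED))
```

Which prefix is unbound in a given export can be found by looking at the line number in the error message.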

Support extracting "Feature Image"

Thanks a lot for this tool.

Today, Wordpress allows writers to attach a "Feature Image" to each post. This image is for example used as an image preview when listing posts, or as an image header when viewing the post.

I looked at my wordpress.xml file, and it seems that the image is captured as the following:

The post <item> contains this metadata:

  <wp:postmeta>
    <wp:meta_key>_thumbnail_id</wp:meta_key>
    <wp:meta_value><![CDATA[1089]]></wp:meta_value>
  </wp:postmeta>

Which references another <item>, which has the same ID: <wp:post_id>1089</wp:post_id>

The URL of the image is stored in the <wp:attachment_url> element and/or in the <guid isPermaLink="false"> element.

Is this something that exitwp could support?
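
The lookup described above can be sketched in two passes over the items: first index attachments by post ID, then resolve each post's _thumbnail_id against that index. The namespace URI, the sample document, and the function name featured_images are assumptions for illustration, so check the xmlns:wp value in your own export.

```python
import xml.etree.ElementTree as ET

# WXR namespace assumed for illustration; the version suffix varies by export.
WP = "{http://wordpress.org/export/1.2/}"

SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel>
<item><title>Post</title><wp:post_id>7</wp:post_id>
  <wp:postmeta>
    <wp:meta_key>_thumbnail_id</wp:meta_key>
    <wp:meta_value><![CDATA[1089]]></wp:meta_value>
  </wp:postmeta>
</item>
<item><title>Photo</title><wp:post_id>1089</wp:post_id>
  <wp:attachment_url>http://example.com/photo.jpg</wp:attachment_url>
</item>
</channel></rss>"""


def featured_images(root):
    """Map each post ID to the URL of its featured image, where one is set."""
    items = root.findall("./channel/item")
    # Pass 1: index attachment URLs by their own post ID.
    attachments = {}
    for item in items:
        url = item.findtext(WP + "attachment_url")
        if url:
            attachments[item.findtext(WP + "post_id")] = url
    # Pass 2: resolve each post's _thumbnail_id against that index.
    result = {}
    for item in items:
        for meta in item.findall(WP + "postmeta"):
            if meta.findtext(WP + "meta_key") == "_thumbnail_id":
                thumb_id = (meta.findtext(WP + "meta_value") or "").strip()
                if thumb_id in attachments:
                    result[item.findtext(WP + "post_id")] = attachments[thumb_id]
    return result


print(featured_images(ET.fromstring(SAMPLE)))
```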

Is it possible to style posts in octopress according to its tags?

I need to style all the posts in Octopress which have the tag 'old' differently. Like, show only the title and no image in archives and keep them separated! How can I do this? (Note : There are nearly 1,500 posts with the old tag)

Error reading

Hey, thanks for making this available. I am getting the following error:

$ python exitwp.py
reading: wordpress-xml/wordpress.2011-08-25.xml
Traceback (most recent call last):
  File "exitwp.py", line 299, in <module>
    data=parse_wp_xml(wpe)
  File "exitwp.py", line 135, in parse_wp_xml
    'items': parse_items(),
  File "exitwp.py", line 119, in parse_items
    'date' : gi('wp:post_date'),
  File "exitwp.py", line 99, in gi
    result=i.find(ns[namespace]+tag).text
AttributeError: 'NoneType' object has no attribute 'text'

Thanks,

[Solved] Issue when facing errors in tags

I have some malformed tags in my posts. For example ">Text", Html2text is then unable to parse...

I wrote a little patch for this, but I'm not very familiar with GitHub, so here it is:

try:
    out.write(html2fmt(i['body'], target_format))
except Exception:
    print("\n Parse error on " + i['title'])

(Around line 282 of exitwp.py)

It works for me, but it must be very dirty :)

Migrating to python 3

Hi,

I'm trying to make a Python 3 version of this module, but I ran into a small problem replacing XMLTreeBuilder with XMLParser in this class:

class ns_tracker_tree_builder(XMLParser):

    def __init__(self):
        XMLParser.__init__(self)
        self._parser.StartNamespaceDeclHandler = self._start_ns
        self.namespaces = {}

    def _start_ns(self, prefix, ns):
        self.namespaces[prefix] = '{' + ns + '}'

I get this error:

Traceback (most recent call last):
  File "exitwp.py", line 374, in <module>
    data = parse_wp_xml(wpe)
  File "exitwp.py", line 83, in parse_wp_xml
    parser = ns_tracker_tree_builder()
  File "exitwp.py", line 65, in __init__
    self._parser.StartNamespaceDeclHandler = self._start_ns
AttributeError: 'ns_tracker_tree_builder' object has no attribute '_parser'

regards,
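
The _parser attribute is private and not part of the documented XMLParser API, so relying on it is fragile across Python versions. One alternative that avoids parser internals is ElementTree.iterparse with the "start-ns" event, which reports each namespace declaration as a (prefix, uri) pair. This is a sketch of that approach, not the fix the project adopted, and the function name parse_with_namespaces is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_namespaces(source):
    """Parse XML from a file or path; return (root, {prefix: '{uri}'})."""
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload
            namespaces[prefix] = "{" + uri + "}"
        elif root is None:
            root = payload  # The first "start" event is the root element.
    return root, namespaces


source = io.BytesIO(
    b'<rss xmlns:wp="http://wordpress.org/export/1.2/">'
    b"<wp:post_id>1</wp:post_id></rss>"
)
root, ns = parse_with_namespaces(source)
print(ns)
print(root.find(ns["wp"] + "post_id").text)
```

The resulting dictionary has the same '{uri}' value shape that the class above builds in _start_ns.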

img wrapped in ahref

Most images/attachments in WordPress are formatted as an a href wrapping an img. In these cases, the img tag has info about the thumbnail, while the a href points to the actual file.

Can both of these be downloaded? I'm not an expert, but I looked at Stack Overflow to get an idea of how to do this: http://stackoverflow.com/questions/24993292/python-extract-the-href-surrounding-image

Markdown output wraps at 80 characters

The markdown output files that the script creates from my wordpress.xml all wrap normal text paragraphs at 80 characters (or fewer if a word would cross the boundary). Unfortunately, when I import this into my Jekyll blog (or any other markdown interpreter), these new lines are preserved. So, rather than getting paragraphs that naturally follow the screen width when converted from markdown to HTML, I get formatting that is wrapped at the same point.

The only exception seems to be image URLs, which are not wrapped.

Can the script be updated to not wrap lines at 80 characters?

Error while running python exitwp.py command

I'm no Python developer, just trying to convert my wordpress.com site to Jekyll.

While running the exitwp.py command I get the following error:

Traceback (most recent call last):
  File "exitwp.py", line 13, in <module>
    import yaml
ImportError: No module named yaml

removing embedded object tags

Would there be any reason exitwp removes object tags? i.e. embedded flickr videos don't make it through the conversion process. They're removed. e.g.

...

isn't in the converted post

Is it possible to export custom taxonomies like meta description?

In my export file, I have Yoast meta descriptions like this:

  <wp:postmeta>
        <wp:meta_key>_yoast_wpseo_metadesc</wp:meta_key>
        <wp:meta_value><![CDATA[Bitdefender 2014 's main point is speed. According to Bitdefender, their new photon technology will make things a lot lighter. That means...]]></wp:meta_value>
    </wp:postmeta>

Is it possible to turn them into description: ?
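
As a sketch of what such a mapping could look like, the function below reads an item's postmeta entries and copies selected keys into front-matter fields. The namespace URI, the mapping table, the sample text, and the function name front_matter_from_meta are all assumptions for illustration, not an existing exitwp option.

```python
import xml.etree.ElementTree as ET

# WXR namespace assumed for illustration; check xmlns:wp in your export.
WP = "{http://wordpress.org/export/1.2/}"

# Which postmeta keys to copy, and the front-matter name for each.
META_TO_FRONT_MATTER = {"_yoast_wpseo_metadesc": "description"}


def front_matter_from_meta(item):
    """Return front-matter fields derived from an item's <wp:postmeta> entries."""
    fields = {}
    for meta in item.findall(WP + "postmeta"):
        key = meta.findtext(WP + "meta_key")
        value = (meta.findtext(WP + "meta_value") or "").strip()
        if key in META_TO_FRONT_MATTER and value:
            fields[META_TO_FRONT_MATTER[key]] = value
    return fields


ITEM = ET.fromstring(
    '<item xmlns:wp="http://wordpress.org/export/1.2/">'
    "<wp:postmeta>"
    "<wp:meta_key>_yoast_wpseo_metadesc</wp:meta_key>"
    "<wp:meta_value><![CDATA[A short summary of the post.]]></wp:meta_value>"
    "</wp:postmeta>"
    "</item>"
)
print(front_matter_from_meta(ITEM))
```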