frenzie / myopera-backup

A Python script to grab posts and as much relevant metadata as possible from MyOpera.

License: GNU General Public License v2.0

Python 100.00%

myopera-backup's Issues

Add leading zeroes

We can rename the existing directories like so:

rename 's/(\d+)/sprintf("%08d",$&)/ge' *
rename 's/(\d+)/sprintf("%08d",$&)/ge' */*

(This matches every run of digits and zero-pads it to 8 digits.)

In Python, str.zfill(8) does the same padding.
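A minimal sketch of the same padding in Python, assuming names contain plain runs of digits. The helper name pad_numbers is made up for illustration; it only transforms a string and renames nothing on disk. Like the rename one-liner, it pads every run of digits, including any in a file extension.

```python
import re

def pad_numbers(name, width=8):
    """Zero-pad every run of digits in name to at least `width` digits."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), name)

print(pad_numbers("101610"))  # 00101610
print(pad_numbers("42/7"))    # 00000042/00000007
```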

Some topics don't have titles

You truly come across the darnedest things. The script was coming along nicely until it broke on this topic.

http://my.opera.com/community/forums/findpost.pl?id=101610

The fix is of course trivial. When writing the regex I assumed a topic necessarily had a title.

<h1>(.+?)</h1>

Simply allow an empty title:

<h1>(.*?)</h1>
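The difference between the two patterns can be checked in isolation. The sample strings below are made up; only the two patterns come from the issue.

```python
import re

strict = re.compile(r"<h1>(.+?)</h1>")   # requires at least one character
lenient = re.compile(r"<h1>(.*?)</h1>")  # also accepts an empty title

untitled = "<h1></h1>"
titled = "<h1>Some title</h1>"

print(strict.search(untitled))                  # None: no match at all
print(repr(lenient.search(untitled).group(1)))  # ''
print(lenient.search(titled).group(1))          # Some title
```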

Not all comments quote

http://my.opera.com/community/forums/findpost.pl?id=4905

None of the comments by this deleted(?) user work when you try to quote them. Are there more like it?

Timestamp wrong

A closer look reveals that the timestamp in the URL is actually the time of the request. Not sure what the use of that is, but it means we're missing the real timestamp… :(

Data integrity check

Occasionally I've used keyboard interrupts, but I didn't build anything into the code to handle it properly.

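Something very simple like this should do the trick; it prints every file whose last three lines are not the expected closing tags:

find backup-data -type f -exec sh -c '[ "$(tail -n 3 "$1")" = "$(printf "</div>\n</div>\n</div>")" ] || echo "$1"' sh {} \;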

Or maybe better in a Bash or Python script. Even straight-up string comparison might do.
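A Python version of the straight-up string comparison might look like the sketch below. It assumes, as above, that a complete file ends with three closing div lines; the function name find_incomplete is hypothetical. It reads each file whole, which is simple but not the fastest option for large backups.

```python
from pathlib import Path

# Assumption: a completely written file ends with exactly these three lines.
EXPECTED_TAIL = ["</div>", "</div>", "</div>"]

def find_incomplete(root):
    """Return files under root whose last three lines differ from EXPECTED_TAIL."""
    incomplete = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        if lines[-3:] != EXPECTED_TAIL:
            incomplete.append(path)
    return incomplete
```

Calling find_incomplete("backup-data") returns the list of suspect files, which can then be fetched again.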

Missing forum_category

Apparently not all groups work well through the general MyOpera forum interface.

http://my.opera.com/community/forums/findpost.pl?id=1065287
http://my.opera.com/vets/forums/findpost.pl?id=1065287

If we want to obtain the forum_category name we'll have to do it through post-processing. For now let's just go on grabbing all the data.

Share data

Here's my proposed compression scheme. Run it in backup-data.

$ for i in *; do tar --lzip -cf "$i.tar.lz" "$i"; done

Extracting data

When all the data has been collected, it needs to be processed more thoroughly into useful bits and pieces. I already wrote some potentially interesting or helpful things earlier.

From v0.1

# Decode HTML entities
# Thanks to http://stackoverflow.com/a/2087433
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it
import html
post_text = html.unescape(post_text)

From v0.2.1

import re

comments_regex = r'''
<div class="fpost.*?" id=".+?">
<a name="comment[0-9]+"></a><p class="posted">(?:<span class="unread">unread</span>)?<a href="findpost\.pl\?id=([0-9]+)" title="permanent link to post"> (.+?)</a>(?: <b>\((edited)\)</b>)?</p>
<div class="pad">
<div class="poster">
(?:<img src=".+?" width="72" height="29" alt="(.+?)" title=".+?" class="right">)?<a href=".+?"><img src=".+?" alt="" class="forumavatar"></a><p><b><a href=".+?"(?: title=".+?")?>(.+?)</a></b></p>
<p>.*?</p>
<p class="userposts">Posts: <a href=".+?">[0-9]+</a></p>
</div>
<div class="thepost">((?:\n)?.+?(?:<div class="forumpoll">.+?</div>)?)(?:<div class="sig">(.+?)(?:\n)?</div>)?(?:\n)?</div>'''

# re.DOTALL makes dot also match newlines
comments = re.findall(comments_regex, page, re.DOTALL)

# Loop over the individual comments; each is a tuple of the seven regex groups.
# Optional groups that did not match come back as empty strings.
for comment in comments:
    comment_id = comment[0]
    timestamp = comment[1]
    edited = comment[2]
    user_status = comment[3]
    user = comment[4]
    post_text = comment[5]
    signature = comment[6]
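One way to avoid juggling indexes is to wrap each tuple in a named record. This is only a sketch: the Comment type, the to_comment helper and the sample tuple are made up for illustration, and the field order simply mirrors the seven capture groups of the regex.

```python
from collections import namedtuple

# Field order mirrors the seven capture groups of the regex.
Comment = namedtuple(
    "Comment",
    "comment_id timestamp edited user_status user post_text signature",
)

def to_comment(groups):
    """Wrap one re.findall() result tuple; unmatched groups ('') become None."""
    return Comment(*(g if g != "" else None for g in groups))

# Made-up sample in the shape re.findall() returns for this pattern
sample = ("4905", "made-up timestamp", "", "", "example_user", "Hello", "")
comment = to_comment(sample)
print(comment.user, comment.edited, comment.signature)
```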

Some topics don't exist, but don't realize it

Yet another beauty: http://my.opera.com/community/forums/findpost.pl?id=262299

Because of the many oddities, I should save more of the raw HTML and keep the data extraction for later. BeautifulSoup seems like the most obvious way to do that extraction.

from bs4 import BeautifulSoup
import requests

r = requests.get('http://my.opera.com/community/forums/findpost.pl?id=262299')

html_doc = r.text

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html_doc, 'html.parser')

# forum metadata
# There are two, one on top and one at the bottom; we only need the first.
nav = soup.find('div', id='forumnav')

if nav:
    print('we have a real topic')
# or rather, inside the fetch loop, something like
# if not nav:
#     log('not a real topic')
#     continue

# posts (find_all is the current name for the older findAll)
posts = soup.find_all('div', 'fpost')

Data to fetch on a second pass

This data would be fetched based on a local analysis of the data. This would include peripherals like:

avatars
user profiles?
any MyOpera-hosted files linked on the forum that are smaller than some size limit (so many kB/MB), with a log for manual inclusion of larger files? Or just everything, so we can weed it out later? (I.e., is it in any way imaginable that we'd surpass a few hundred GB?)
durrrrr…
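The size-limit idea for linked files could be sketched roughly as follows. Everything here is hypothetical: the function name, the example URLs and the assumption that each file's size is already known (for instance from a Content-Length header) before deciding.

```python
def plan_second_pass(files, max_bytes):
    """Split (url, size_in_bytes) pairs into URLs to fetch and URLs to log.

    Files at or below max_bytes are fetched; larger ones are only logged
    so they can be reviewed and included manually.
    """
    fetch, log_for_manual = [], []
    for url, size in files:
        if size <= max_bytes:
            fetch.append(url)
        else:
            log_for_manual.append(url)
    return fetch, log_for_manual

# Made-up inputs for illustration
files = [
    ("http://example.invalid/avatar.png", 20_000),
    ("http://example.invalid/video.avi", 700_000_000),
]
fetch, log_for_manual = plan_second_pass(files, max_bytes=5_000_000)
print(fetch)
print(log_for_manual)
```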
