myopera-backup's Introduction

myopera-backup

A Python script to grab posts and as much relevant metadata as possible from MyOpera.

myopera-backup's People

Contributors

frenzie

myopera-backup's Issues

Add leading zeroes

We can change the directories like so

rename 's/(\d+)/sprintf("%08d",$&)/ge' *
rename 's/(\d+)/sprintf("%08d",$&)/ge' */*

(match every run of digits and zero-pad it to 8 digits)

We can use str.zfill(8) in Python.
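
A minimal Python sketch of the same idea, assuming we want to pad every run of digits in each file and directory name. The names pad_numbers and rename_tree are made up for illustration. Unlike the two rename commands above, which only touch the first two levels, this walks the whole tree.

```python
import os
import re


def pad_numbers(name: str, width: int = 8) -> str:
    """Zero-pad every run of digits in `name` to at least `width` digits."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), name)


def rename_tree(root: str, width: int = 8) -> None:
    """Rename everything below `root` (not `root` itself) with padded numbers.

    Walking bottom-up means children are renamed before their parent
    directory, so the paths os.walk hands us are still valid.
    """
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            padded = pad_numbers(name, width)
            if padded != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, padded))


print(pad_numbers("12"))  # 00000012
```

Numbers that already have 8 or more digits are left alone, which matches what sprintf("%08d") does.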

Timestamp wrong

A closer look reveals that the timestamp in the URL is actually the time of the request. Not sure what the use of that is, but it means we're missing the timestamp… :(

Data integrity check

Occasionally I've used keyboard interrupts, but I didn't build anything into the code to handle it properly.

Something very simple like this should do the trick

find backup-data -type f -exec sh -c 'tail -n 3 "$1" | tr -d "\n" | grep -qx "</div></div></div>" || echo "$1"' _ {} \;

Or maybe better in a Bash or Python script. Even straight-up string comparison might do.
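
Here is a rough Python version using straight-up string comparison. It assumes, as in the one-liner above, that a complete file ends with three closing </div> lines. The names is_complete and find_incomplete are invented here, and trailing whitespace is ignored, which is a guess about what the saved files look like.

```python
from pathlib import Path

# Assumption: a fully written file ends with these three lines.
EXPECTED_TAIL = ["</div>", "</div>", "</div>"]


def is_complete(text: str) -> bool:
    """Return True if the last three non-blank-trailing lines are </div>."""
    lines = [line.strip() for line in text.rstrip().splitlines()]
    return lines[-3:] == EXPECTED_TAIL


def find_incomplete(root: str) -> list:
    """List files under `root` whose tail does not look complete."""
    suspicious = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            if not is_complete(text):
                suspicious.append(path)
    return sorted(suspicious)


if __name__ == "__main__":
    for path in find_incomplete("backup-data"):
        print(path)
```

Anything it prints was probably cut off by an interrupt and should be fetched again.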

Share data

Here's my proposed compression scheme. Run it in backup-data.

$ for i in *; do tar --lzip -cf "$i.tar.lz" "$i"; done

Extracting data

When all the data has been collected, it needs to be processed more thoroughly into useful bits and pieces. I already wrote some potentially interesting or helpful things earlier.

From v0.1

# Decode HTML entities
# Thanks to http://stackoverflow.com/a/2087433
# html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9
import html
post_text = html.unescape(post_text)

From v0.2.1

import re

comments_regex = r'''
<div class="fpost.*?" id=".+?">
<a name="comment[0-9]+"></a><p class="posted">(?:<span class="unread">unread</span>)?<a href="findpost\.pl\?id=([0-9]+)" title="permanent link to post"> (.+?)</a>(?: <b>\((edited)\)</b>)?</p>
<div class="pad">
<div class="poster">
(?:<img src=".+?" width="72" height="29" alt="(.+?)" title=".+?" class="right">)?<a href=".+?"><img src=".+?" alt="" class="forumavatar"></a><p><b><a href=".+?"(?: title=".+?")?>(.+?)</a></b></p>
<p>.*?</p>
<p class="userposts">Posts: <a href=".+?">[0-9]+</a></p>
</div>
<div class="thepost">((?:\n)?.+?(?:<div class="forumpoll">.+?</div>)?)(?:<div class="sig">(.+?)(?:\n)?</div>)?(?:\n)?</div>'''

# re.DOTALL makes dot also match newlines
comments = re.findall(comments_regex, page, re.DOTALL)

###############
# enter individual comments for loop
for comment in comments:
    comment_id = comment[0]
    timestamp = comment[1]
    edited = comment[2]
    user_status = comment[3]
    user = comment[4]
    signature = comment[6]
    post_text = comment[5]
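
The positional indexing is easy to get wrong, since the post text is group 5 and the signature is group 6. A small sketch that names the groups instead, assuming the regex keeps its seven capture groups in this order. Comment and to_comment are names made up here, and the sample tuple is invented, not real forum data.

```python
from typing import NamedTuple


class Comment(NamedTuple):
    """One tuple from re.findall(comments_regex, ...), in capture-group order."""
    comment_id: str
    timestamp: str
    edited: str        # "edited" or "" when the optional group did not match
    user_status: str   # alt text of the optional status image, or ""
    user: str
    post_text: str
    signature: str     # "" when the post has no signature


def to_comment(groups) -> Comment:
    """Wrap a 7-tuple of regex groups; raises TypeError on a wrong length."""
    return Comment(*groups)


# Invented sample in the shape re.findall would return.
sample = ("123", "some timestamp", "", "", "someuser", "Hello", "")
comment = to_comment(sample)
print(comment.user, comment.post_text)
```

If the regex ever gains or loses a group, to_comment fails loudly instead of silently shifting the fields.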

Some topics don't exist, but don't realize it

Yet another beauty: http://my.opera.com/community/forums/findpost.pl?id=262299

Because of the many oddities I should grab more HTML. Keep the data extraction for later. BeautifulSoup seems like the most obvious way.

from bs4 import BeautifulSoup
import requests

r = requests.get('http://my.opera.com/community/forums/findpost.pl?id=262299')

html_doc = r.text

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html_doc, 'html.parser')

# forum metadata
# There are two, one on top and one at the bottom; we only need the first.
nav = soup.find('div', id='forumnav')

if nav:
    print('we have a real topic')
#or rather something like
#if not nav:
#  log('not a real topic')
#  continue

# posts
posts = soup.find_all('div', 'fpost')

Data to fetch on a second pass

This data would be fetched after analyzing the already-downloaded data locally. It would include peripherals like:

  • avatars
  • user profiles?
  • any MyOpera-hosted files linked on the forum smaller than so many kB/MB with a log for manual inclusion of larger files? or just everything so we can weed it out later? (i.e. is it in any way imaginable that we'd surpass a few 100GB?)
  • durrrrr…
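
A possible starting point for the local analysis: collect every href and src from a saved page and keep the ones on a MyOpera host. This is only a sketch using the standard library. The function name myopera_urls, the default base URL, and the host_suffix parameter are assumptions for illustration. Checking file sizes would need a request per URL and is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links like findpost.pl?id=... against the base.
                self.urls.add(urljoin(self.base_url, value))


def myopera_urls(html_doc: str,
                 base_url: str = "http://my.opera.com/community/forums/",
                 host_suffix: str = "my.opera.com") -> list:
    """Return sorted URLs whose host is `host_suffix` or a subdomain of it."""
    collector = LinkCollector(base_url)
    collector.feed(html_doc)
    result = []
    for url in collector.urls:
        host = urlparse(url).hostname or ""
        if host == host_suffix or host.endswith("." + host_suffix):
            result.append(url)
    return sorted(result)
```

Running this over all saved pages and deduplicating the output would give the fetch list for the second pass.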
