myopera-backup
A Python script to grab posts and as much relevant metadata as possible from MyOpera.
License: GNU General Public License v2.0
We can rename the directories so their numbers are zero-padded, like so:
rename 's/(\d+)/sprintf("%08d",$&)/ge' *
rename 's/(\d+)/sprintf("%08d",$&)/ge' */*
(match every run of digits and zero-pad it to 8 digits)
We can use str.zfill(8) in Python.
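A minimal Python sketch of the same renaming, assuming the same substitution as the `rename` calls above: every run of digits in a name is zero-padded to 8 characters. The names `pad_numbers` and `pad_directory` are made up for illustration and are not part of the script.

```python
import os
import re


def pad_numbers(name, width=8):
    """Zero-pad every run of digits in name to at least `width` characters."""
    return re.sub(r'\d+', lambda m: m.group(0).zfill(width), name)


def pad_directory(path, width=8):
    """Rename the entries directly inside path (hypothetical helper, not recursive)."""
    for entry in os.listdir(path):
        padded = pad_numbers(entry, width)
        if padded != entry:
            os.rename(os.path.join(path, entry), os.path.join(path, padded))
```

Calling `pad_directory('.')` first and then `pad_directory(d)` for each subdirectory `d` would roughly mirror the two commands above.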
You truly come across the darnedest things. The script was coming along nicely until it broke on this topic.
http://my.opera.com/community/forums/findpost.pl?id=101610
The fix is of course trivial. When writing the regex I assumed a topic necessarily had a title.
<h1>(.+?)</h1>
Simply allow an empty title:
<h1>(.*?)</h1>
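The difference is easy to check in isolation. A small sketch with two made-up page fragments (the HTML strings are assumptions, not real MyOpera output):

```python
import re

with_title = '<h1>Some topic</h1>'
without_title = '<h1></h1>'

strict = re.compile(r'<h1>(.+?)</h1>')   # requires at least one character
lenient = re.compile(r'<h1>(.*?)</h1>')  # also accepts an empty title

strict_match = strict.search(without_title)    # no match: the old regex fails here
lenient_match = lenient.search(without_title)  # matches, with an empty group
```

Both patterns behave the same on a topic that does have a title, so the change should not affect pages that already worked.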
http://my.opera.com/community/forums/findpost.pl?id=4905
None of the comments by this deleted(?) user work when you try to quote them. Are there more like it?
A closer look reveals that the timestamp in the URL is actually the time of the request. Not sure what the use of that is, but it means we're missing the timestamp… :(
Occasionally I've used keyboard interrupts, but I didn't build anything into the code to handle it properly.
Something very simple like this should do the trick
find backup-data -type f -exec sh -c "tail -3 {} " \; | sed NOT </div>\n</div>\n</div>
Or maybe better in a Bash or Python script. Even straight-up string comparison might do.
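A sketch of that check in Python, assuming (as in the shell idea above) that a completely saved file ends with three closing `</div>` lines. `EXPECTED_TAIL`, `is_complete` and `find_incomplete` are hypothetical names. Stripping whitespace before comparing is an extra leniency that the shell idea does not have.

```python
import os

# Assumption taken from the shell idea above: a complete file ends with these three lines.
EXPECTED_TAIL = ['</div>', '</div>', '</div>']


def is_complete(path):
    """Return True if the last three lines of the file are closing divs."""
    with open(path, encoding='utf-8', errors='replace') as f:
        lines = f.read().splitlines()
    # Ignore surrounding whitespace, then do a straight comparison.
    return [line.strip() for line in lines[-3:]] == EXPECTED_TAIL


def find_incomplete(root):
    """Yield the paths of all files under root that look truncated."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not is_complete(path):
                yield path
```

Something like `for path in find_incomplete('backup-data'): print(path)` would then list the files to fetch again after an interrupted run.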
Apparently not all groups work well through the general MyOpera forum interface.
http://my.opera.com/community/forums/findpost.pl?id=1065287
http://my.opera.com/vets/forums/findpost.pl?id=1065287
If we want to obtain the forum_category name we'll have to do it through post-processing. For now let's just go on grabbing all the data.
Here's my proposed compression scheme. Run it in backup-data.
$ for i in *; do tar --lzip -cf "$i.tar.lz" "$i"; done
When all the data has been collected, it needs to be processed more thoroughly into useful bits and pieces. I already wrote some potentially interesting or helpful things earlier.
From v0.1
# Decode HTML entities
# Thanks to http://stackoverflow.com/a/2087433
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() (Python 3.4+) replaces it.
import html
post_text = html.unescape(post_text)
From v0.2.1
import re
comments_regex = r'''
<div class="fpost.*?" id=".+?">
<a name="comment[0-9]+"></a><p class="posted">(?:<span class="unread">unread</span>)?<a href="findpost\.pl\?id=([0-9]+)" title="permanent link to post"> (.+?)</a>(?: <b>\((edited)\)</b>)?</p>
<div class="pad">
<div class="poster">
(?:<img src=".+?" width="72" height="29" alt="(.+?)" title=".+?" class="right">)?<a href=".+?"><img src=".+?" alt="" class="forumavatar"></a><p><b><a href=".+?"(?: title=".+?")?>(.+?)</a></b></p>
<p>.*?</p>
<p class="userposts">Posts: <a href=".+?">[0-9]+</a></p>
</div>
<div class="thepost">((?:\n)?.+?(?:<div class="forumpoll">.+?</div>)?)(?:<div class="sig">(.+?)(?:\n)?</div>)?(?:\n)?</div>'''
# re.DOTALL makes dot also match newlines
comments = re.findall(comments_regex, page, re.DOTALL)
###############
# enter individual comments for loop
for comment in comments:
    comment_id = comment[0]
    timestamp = comment[1]
    edited = comment[2]
    user_status = comment[3]
    user = comment[4]
    signature = comment[6]
    post_text = comment[5]
Yet another beauty: http://my.opera.com/community/forums/findpost.pl?id=262299
Because of the many oddities I should grab more HTML. Keep the data extraction for later. BeautifulSoup seems like the most obvious way.
from bs4 import BeautifulSoup
import requests

r = requests.get('http://my.opera.com/community/forums/findpost.pl?id=262299')
html_doc = r.text
# Name the parser explicitly so the result doesn't depend on which parsers are installed.
soup = BeautifulSoup(html_doc, 'html.parser')

# forum metadata
# There are two, one on top and one at the bottom; we only need the first.
nav = soup.find('div', id='forumnav')
if nav:
    print('we have a real topic')
# or rather something like
#if not nav:
#    log('not a real topic')
#    continue

# posts
posts = soup.find_all('div', 'fpost')
This data would be fetched based on a local analysis of the data. This would include peripherals like: