basc-archiver's Introduction

The Bibliotheca Anonoma

The Bibliotheca Anonoma is a wiki designed to collect, document, and safeguard the products and history of internet culture, which constitutes the shared experience of humanity on a network that defines our lives.

The Wiki

This is the source code viewer for the Bibliotheca Anonoma Wiki.
To actually view and edit the Wiki follow one of the links below:

basc-archiver's People

Contributors

alabard, antonizoon, danieloaks, huggablesquare, joshbarrass

basc-archiver's Issues

--follow-children needs to check for dupes

It seems that threads added by --follow-children get duplicated, so if the same thread is linked three times by different threads it'll be added to our list three times in a row and cause issues.

Alternatively, the child thread that kept getting added was a closed thread, so I wonder whether it just kept seeing that the thread was closed, removing it from our list, and then next time it checked for child threads it just readded it because it didn't seem to exist at all in our list.

Will investigate.

Issue while downloading thumbnails

Got this while downloading threads and thumbnails and such, on the threaded branch.

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "<etc>/basc_archiver/sites/base.py", line 60, in run
    self.site.download_item(next_item)
  File "<etc>/basc_archiver/sites/fourchan.py", line 217, in download_item
    self.threads[thread_id]['total_files'] = len(list(running_thread.filenames()))
UnboundLocalError: local variable 'running_thread' referenced before assignment

PyPi install failure

Collecting basc-archiver
  Using cached BASC-Archiver-0.9.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "C:\Users\Dudu\AppData\Local\Temp\pip-build-oqsbf0f3\basc-archiver\setup.py", line 17, in <module>
        long_description = file.read()
      File "c:\users\dudu\appdata\local\programs\python\python35\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1142: character maps to <undefined>

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\Dudu\AppData\Local\Temp\pip-build-oqsbf0f3\basc-archiver
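
The traceback shows setup.py reading its long description with the Windows default codec (cp1252), which cannot decode byte 0x9d. A minimal sketch of a possible fix, assuming the description file is UTF-8 encoded, is to open it with an explicit encoding. The helper name and the README.rst file name below are hypothetical and not taken from the project:

```python
import io
import os
import tempfile


def read_long_description(path):
    """Read a text file as UTF-8 regardless of the platform's default codec."""
    # io.open accepts `encoding` on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a temporary file containing a right double quotation
# mark (U+201D), whose UTF-8 encoding ends in the byte 0x9d.
demo_path = os.path.join(tempfile.mkdtemp(), "README.rst")  # hypothetical name
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"A \u201cquoted\u201d word")

long_description = read_long_description(demo_path)
```

Until something like this is released, setting the environment variable PYTHONUTF8=1 on Python 3.7 or newer may work around the error, though that is untested here.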

It doesn't work anymore, just idles

I'm not sure what happened but BASC-archiver doesn't work anymore. I'm guessing some Python lib broke it? When I try to download a thread, it just idles and it does report the file count but doesn't download anything. There are no errors.

$ thread-archiver --path=/mnt/archives/ --thread-check-delay=60 --ssl --nothumbs https://boards.4chan.org/wg/thread/<threadidhere>
Starting download
Thread 4chan / wg / id  -  85 new replies

and then nothing...

Version: BASC-Archiver v0.9.3

Updating...

$ pip3 install basc-archiver
Requirement already satisfied: basc-archiver in /usr/lib/python3.6/site-packages
Requirement already satisfied: requests in /usr/lib/python3.6/site-packages (from basc-archiver)
Requirement already satisfied: docopt>=0.5.0 in /usr/lib/python3.6/site-packages (from basc-archiver)
Requirement already satisfied: BASC-py4chan>=0.5.5 in /usr/lib/python3.6/site-packages (from basc-archiver)

Unfortunately, verbose option does not work at all:

$ thread-archiver https://boards.4chan.org/wg/thread/<id here> --verbose
Usage:
  thread-archiver <url>... [options]
  thread-archiver -h | --help
  thread-archiver -v | --version

OS is ArchLinux. Everything is updated. Python version: Python 3.6.1

Any ideas or suggestions how to fix this?

Request: Thread title or some words from the OP post in the created folder

It is a nightmare going through the dozens of folders of archived threads with no way to differentiate between them.

A --title option would be enormously welcomed.
Also, if you could make the archiver write to "/site/board/thread*/", then a folder such as "/4chan/a/745894 - dumb waaabus" would still work, even if renamed manually.
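
One way this request could look in code is sketched below. It is only an illustration: the function names, the "ID - title" layout, and the 40-character limit are assumptions, not existing archiver behaviour.

```python
import re


def thread_folder_name(thread_id, title, max_length=40):
    """Build a folder name such as '745894 - some title' (hypothetical layout)."""
    # Replace characters that are invalid in Windows or POSIX file names.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Collapse runs of whitespace, then trim spaces and dots from the ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")[:max_length].rstrip(" .")
    if not cleaned:
        return str(thread_id)
    return "{} - {}".format(thread_id, cleaned)


def thread_id_from_folder(folder_name):
    """Recover the thread ID from the leading digits, even after a manual rename."""
    match = re.match(r"(\d+)", folder_name)
    return int(match.group(1)) if match else None
```

Keeping the numeric ID at the front means the archiver could still find an existing folder by its leading digits after the user renames the rest.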

Generate a browsable index.html thread listing

One of the great ideas suggested by @antonizoon is generating an index.html file in the root or somewhere we can use to browse the various threads in our archive. That or a JSON file or something similar, but I think a decent little HTML file generated with some templates and showing a thread listing similar to the pages in a 4chan board shouldn't be too much trouble.

Would make it really nice to browse archives on our personal machines, and I can see this being a great feature.
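
A rough sketch of such a listing, using only the Python 3 standard library rather than a template engine. The dictionary keys ('board', 'thread_id', 'title', 'path') are hypothetical and would need to be mapped onto whatever the archiver actually stores:

```python
import html


def render_index(threads):
    """Return an HTML page listing archived threads.

    `threads` is a list of dicts with hypothetical keys:
    'board', 'thread_id', 'title', and 'path' (relative link to the saved thread).
    """
    rows = []
    for thread in sorted(threads, key=lambda t: (t["board"], t["thread_id"])):
        rows.append(
            '<li><a href="{path}">/{board}/ {thread_id}</a> {title}</li>'.format(
                path=html.escape(thread["path"], quote=True),
                board=html.escape(thread["board"]),
                thread_id=int(thread["thread_id"]),
                title=html.escape(thread["title"]),
            )
        )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Archived threads</title></head>\n<body><ul>\n"
        + "\n".join(rows)
        + "\n</ul></body></html>\n"
    )
```

Escaping every field matters here because thread titles come from untrusted posts.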

8chan support

image
Whenever I try to access the settings on my 8chan board, I get this error.
What should I do?

Make an Android/iOS/Windows app GUI with Kivy

Might seem daunting, but thanks to new technology (Kivy Python NUI) it is extremely easy to make a cross platform touchscreen GUI with Python. Kivy supports iOS/Android/Windows/Mac OS X/Linux. And most of all, it looks great.

I'd like to fulfill the request of FriendlyAnon, because nowadays, more and more anons use smartphones to browse 4chan.

I'll probably try this out myself when I have time.

Implementation notes:

  • On Android, we should try to use intents so that a user can "Share" the 4chan URL from another app (Chanu, Chant, ChanReader, the browser, etc) into the BASC-Archiver.

8chan Support

Apparently 8chan has a 4chan API compatible API. Should be pretty easy to implement in the BASC-Archiver, in that case.

Though there may be some small divergences in the future (such as the fact that pages still start at 0 and not 1), so we should make a specific py8chan for it.

Windows version continues the tries after 404

"Keep downloading until 404 (with a user-set delay)"

As the title says, it keeps going even if the thread 404s.
It gets REALLY annoying when you have multiple ones running.
And another issue is that there is no 0.8.7 for windows and easy-install doesn't seem to do the job.
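
For reference, the intended "keep downloading until 404" behaviour can be sketched as a small polling loop. This is not the archiver's actual code: `check_thread` is a hypothetical callable that returns the HTTP status of the thread, and `run_once` and `sleep` are illustrative parameters.

```python
import time


def watch_until_dead(check_thread, delay_seconds=60, run_once=False, sleep=time.sleep):
    """Poll a thread until it 404s; return the number of checks made.

    `check_thread` is a hypothetical callable returning an HTTP status code.
    """
    checks = 0
    while True:
        status = check_thread()
        checks += 1
        # Stop as soon as the thread is gone, or after one pass if requested.
        if status == 404 or run_once:
            return checks
        sleep(delay_seconds)
```

The key point is that the 404 check sits inside the loop and returns, so the process can exit instead of retrying forever.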

Add `file_count()` method in py4chan

In fourchan.py, there happen to be these notes in line 246:

# TODO: extend BASC-py4chan to give us this number directly
self.threads[thread_id]['total_files'] = len(list(thread['thread'].filenames()))

And line 255:

# TODO: extend BASC-py4chan to give us this number directly
self.threads[thread_id]['total_files'] = len(list(running_thread.filenames()))

Apparently, it might be a good idea to have a file_count() method in basc_py4chan.Thread that counts the amount of replies where has_file returns True.
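
A minimal sketch of the counting logic, written as a free function over a stand-in post type. It assumes `has_file` can be read as an attribute; the `FakePost` type is purely illustrative and is not part of basc_py4chan:

```python
from collections import namedtuple

# Stand-in for a post object; only the `has_file` attribute matters here.
FakePost = namedtuple("FakePost", ["post_id", "has_file"])


def file_count(posts):
    """Count posts that carry a file, without building a list of filenames."""
    return sum(1 for post in posts if post.has_file)


posts = [FakePost(1, True), FakePost(2, False), FakePost(3, True)]
total_files = file_count(posts)
```

Inside the library this would presumably become a method on Thread that iterates over the thread's own posts.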

[Suggestion] --images and --links options

Something to just grab those without creating any other files.
Use of the html is very rare (if you're just saving things for yourself) so thumbs, css, js, json are just bloat.

Desustorage support

Any plans to do this, or someone working on it?

It would be really nice to download threads from Desustorage in some cases.

Thumbs Regex broken on Windows, use JSON HTML Templating instead?

Apparently the thumbs regex doesn't work on the windows version.

So when you open the HTML file, it fails to link to the internal thumbnails. However, the image link conversion seems to work.

But I think the real solution is to procedurally generate a JSON HTML templater, so we don't ever have to scrape CSS or HTML again

http://beebole.com/pure/

http://twigkit.github.io/tempo/

The question is how to load a local JSON file, since usually it's not allowed. There are some hacks here:

http://stackoverflow.com/a/18637657

Generating WARC files

Honestly, I feel like I should implement a command-line switch to generate WARC files while downloading threads, so I can upload them to the Wayback Machine or do whatever else, and have a whole lot more flexibility with the archives we create.

Nobody else should mind this feature too much, I'll work on it myself.

Get a GUI!

Another thing suggested by @antonizoon is getting a GUI up and running.

I've never done any GUI work on Python myself but I'll have a look and a play with my notepad, see if I can come up with anything nice there. The ChanThreadWatch interface is well-done, as an example.

Automatically download threads

It would be nice to set certain options such as thread name, username, subject, etc so that BASC automatically adds them to the list and downloads them.

I'd like to archive some general threads and currently this requires me to watch the board for them and insert their links manually into the download .txt. Being able to set BASC to automatically download would be very convenient.
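
A sketch of how such matching might work, assuming each catalog entry is a dict shaped like the 4chan catalog JSON ('no' for the thread number, optional 'sub', 'com', and 'name' fields). The function name and the keyword semantics are assumptions:

```python
def matching_threads(catalog_threads, keywords):
    """Return the thread numbers whose subject, comment, or name mentions a keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    for thread in catalog_threads:
        # Missing fields are treated as empty strings.
        text = " ".join(
            [thread.get("sub", ""), thread.get("com", ""), thread.get("name", "")]
        ).lower()
        if any(keyword in text for keyword in lowered):
            matches.append(thread["no"])
    return matches
```

The resulting thread numbers could then be turned into URLs and handed to the existing download list.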

JSON HTML Templating with Jinja2

It's about time for us to create a JSON templating system, which will be integrated as part of the .chan.arc standard. This way, we don't have to manually retrieve the CSS and HTML on every thread grab, nor do we need to deal with regex'ing current files (which has many problems on systems with non-UTF encoding)...

Jinja2 is fairly simple: a template is basically an HTML file that uses {% %} and {{ }} tags to work with the input data and insert it into the page.
For example, you would have a file like: <html><head><title>{{ title }}</title></head><body>{% for thread in threads %}{{ thread.title }} etc etc{% endfor %}</body></html>
and then you just call the templater with something like (title='Page title', threads=[thread1, thread2, thread3])

You'll still have an option that allows you to grab the original HTML and CSS the way we currently do it (for events such as CSS madness on /pol/ and /b/). But you only need to grab them when they are an important aspect of the thread.

In any case, we will make a WARC option someday that grabs a thread snapshot the way the Internet Archive demands it.

Support 4chan's 4channel.org domain

Context from Wikipedia:

On November 17, 2018, it was announced that the site would be split into two, with the work-safe boards moved to a new domain, 4channel.org, while the NSFW boards would remain on the 4chan.org domain.

Error output:

$ thread-archiver --version
BASC-Archiver v0.9.8
$ thread-archiver "https://boards.4channel.org/g/thread/51971506"
Starting download
We could not find a valid archiver for: https://boards.4channel.org/g/thread/51971506

We could not find any of the supplied threads, exiting.
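
One possible way to recognise both domains is a single URL pattern, sketched below. The pattern and function name are illustrative and are not the archiver's existing matching code:

```python
import re

# Accepts both domains named in the issue, over http or https.
THREAD_URL = re.compile(
    r"^https?://boards\.(?:4chan|4channel)\.org/"
    r"(?P<board>[a-z0-9]+)/thread/(?P<thread_id>\d+)"
)


def parse_thread_url(url):
    """Return (board, thread_id) for a recognised thread URL, else None."""
    match = THREAD_URL.match(url)
    if match is None:
        return None
    return match.group("board"), int(match.group("thread_id"))
```

With the URL from the error output, `parse_thread_url` yields the board `g` and the thread number, which is all the archiver should need to select the 4chan handler.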

Some image HTML not converted

Just noticed this happening with somewhat newer threads, but 4chan seems to have a new place to store images: is.4chan. Images with this domain are not converted to the downloaded image, but are downloaded.

--original-filenames option

Option to use the original filenames for files when writing threads out. Need to make sure we modify the images/thumbnails, and the names we write out into the html file.

Make sure the original and the proper file numbers are in the thread manifest json (as well as explicitly the name we write it out as), so we can check it out using that later on.

Threads excepting

Just some crashes/errors I've been running into.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/base.py", line 62, in run
    self.site.download_item(next_item)
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/fourchan.py", line 252, in download_item
    new_replies = len(running_thread.all_posts)
AttributeError: 'NoneType' object has no attribute 'all_posts'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/base.py", line 62, in run
    self.site.download_item(next_item)
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/fourchan.py", line 307, in download_item
    utils.download_json(local_filename, url, clobber=True)
  File "<basc-archiver-root>/basc-archiver/basc_archiver/utils.py", line 50, in download_json
    original_data = json.loads(open(local_filename).read())
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/base.py", line 62, in run
    self.site.download_item(next_item)
  File "<basc-archiver-root>/basc-archiver/basc_archiver/sites/fourchan.py", line 219, in download_item
    new_replies = thread['thread'].update()
  File "build/bdist.macosx-10.10-x86_64/egg/basc_py4chan/thread.py", line 186, in update
    res.raise_for_status()
  File "/usr/local/lib/python2.7/site-packages/requests/models.py", line 851, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 502 Server Error: Bad Gateway
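
The second traceback comes from parsing a local JSON file that is not valid JSON, for example an empty or half-written file. A defensive loader along these lines might avoid that crash. This is a sketch with a hypothetical function name, not a patch against utils.py:

```python
import json
import os
import tempfile


def load_local_json(local_filename):
    """Return parsed JSON from disk, or None if the file is missing or corrupt."""
    try:
        with open(local_filename) as handle:
            return json.load(handle)
    except (IOError, OSError, ValueError):
        # On Python 3, json.JSONDecodeError is a subclass of ValueError.
        return None


# Demonstration: an empty file stands in for a truncated download.
corrupt_path = os.path.join(tempfile.mkdtemp(), "thread.json")
with open(corrupt_path, "w") as handle:
    handle.write("")

previous_data = load_local_json(corrupt_path)
```

The caller can treat `None` as "no previous data" and overwrite the file with the fresh download. The 502 error in the third traceback is a separate case that would need a try/except around the update call.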

420chan Support

I've finished adapting py4chan to create py420chan, since 420chan's API is so similar to 4chan's.

It would be great if 420chan was now supported on the BASC-Archiver.

One thing to note is that because 420chan doesn't have the Last-Modified HTTP header on the thread JSON, we cannot use the expand() function on the Thread Class anymore. Instead, just use the update() function with a delay.

once thread dies, the program just kinda sits there.

Basically, I wrote a batch script to download all thread URLs in a text file, and it works. I have it name each thread based on a name I put after the URL in the file. The thing is, once the thread dies, it just freezes. It won't rename files or close itself. Code:
@echo off
set var1=%1
set link=%var1:~0,54%
set post=%var1:~34,9%
set name=%var1:~55%
set board=%var1:~25,1%
echo Downloading: thread %board%\�[91m%post%�[0m/�[93m%name%�[0m
cd D:\4chan-archive\BASC-Archiver-0.9.9
python3 thread-archiver %link% --path=D:\4chan-archive --delay=5
echo D:\4chan-archive\4chan\%board%\%post%
move D:\4chan-archive\4chan\%board%\%post% D:\4chan-archive\4chan\%board%\%name%
echo �[46mdownload of thread %board%\�[95m%post%\%name%�[0m�[46m succeded.�[0m
echo.
echo.
echo.
exit

Keep track of JSON better

We should probably create a JSON manager that loads and saves the thread json file, so we can handle it like *Fuuka does their thread grabbing. For example, being able to put "deleted": true as an attribute of specific posts when they go missing, stuff like that. So we can load and save it as necessary, and keep a more accurate representation of the thread.
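
A sketch of what such a manager could look like, assuming the thread JSON keeps the 4chan API layout of a top-level "posts" list whose entries carry a "no" field. The class and method names are hypothetical, and `os.replace` requires Python 3.3 or newer:

```python
import json
import os
import tempfile


class ThreadJsonManager(object):
    """Load, merge, and save one thread's JSON file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.data = {"posts": []}

    def load(self):
        with open(self.path) as handle:
            self.data = json.load(handle)

    def save(self):
        # Write to a temporary file first so a crash cannot leave a half-written file.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(self.data, handle)
        os.replace(tmp_path, self.path)

    def merge(self, fresh_posts):
        """Merge a freshly fetched post list, flagging posts that vanished."""
        fresh_ids = {post["no"] for post in fresh_posts}
        known_ids = {post["no"] for post in self.data["posts"]}
        for post in self.data["posts"]:
            if post["no"] not in fresh_ids:
                post["deleted"] = True
        for post in fresh_posts:
            if post["no"] not in known_ids:
                self.data["posts"].append(post)
        self.data["posts"].sort(key=lambda post: post["no"])


manager = ThreadJsonManager(os.path.join(tempfile.mkdtemp(), "thread.json"))
manager.merge([{"no": 1}, {"no": 2}])
manager.merge([{"no": 1}, {"no": 3}])  # post 2 has gone missing
manager.save()
```

After the second merge, post 2 stays in the file with "deleted": true instead of being silently dropped.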

Make sure duplicate threads don't get added

Make sure that if a duplicate thread gets added, just return True and act like we were able to add it just fine. For the 'follow child threads' feature, I think simply finding all the URLs in every new post, seeing if we can add them, and then adding them proper would be fine, and I'd like the add_thread function to handle this automatically.

In addition, we should split out the url information extraction (board/post_id) to its own function, so we can use it in both the adding function and in the child threads feature without duping code. (We need to know the board/id of threads we're adding to enforce the same/different board constraint)
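
The behaviour described above could be sketched as follows. The class is a hypothetical stand-in for the archiver's thread list. It also remembers removed threads, which is one possible way to stop a closed child thread from being re-added on every check:

```python
class ThreadList(object):
    """Hypothetical container for the threads being watched."""

    def __init__(self):
        self._threads = {}  # (board, thread_id) -> thread info
        self._dead = set()  # closed or 404ed threads we should not re-add

    def add_thread(self, board, thread_id):
        """Add a thread; duplicates and known-dead threads are accepted silently."""
        key = (board, int(thread_id))
        if key in self._threads or key in self._dead:
            return True
        self._threads[key] = {"board": board, "thread_id": int(thread_id)}
        return True

    def remove_thread(self, board, thread_id):
        """Drop a thread and remember it so it is not re-added later."""
        key = (board, int(thread_id))
        self._threads.pop(key, None)
        self._dead.add(key)

    def __len__(self):
        return len(self._threads)
```

Keying on (board, thread_id) rather than on the raw URL means the same thread linked through different URL forms still counts as one entry.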

need help with the windows version

when I use "thread-archiver.exe https:4chan..." it comes up with an error. "requests.packages.urllib3.exceptions.SSLError: [Errno 2] No such file or directory" if anyone could help me with what I need to do that'd be great. I use python 3.9.2 if that's needed. Very new to github and don't really code or anything

Make a Windows Executable

Using Pip on Windows is an exercise in frustration, unlike the quick and easy method on Mac OS X and Linux. An EXE build is critical. Have a .exe for both command line and GUI versions.

Daniel Oaks is currently on the case, though I'm going to try and make a quick and dirty executable for myself for the moment.

Android CLI app with QPython

Amazingly, using the QPython app, you can get a full Python interpreter for Android. This means that the BASC-Archiver can be run as a command line app without modifications on Android phones.

It works perfectly for the most part. Here's how you do it:

Android (CLI)

Note: This is a temporary solution until we put together some kind of Android GUI app.

Thanks to the QPython interpreter, you can effortlessly run the BASC-Archiver on your Android phone.

  1. Install the QPython app.
  2. Open the QPython app, and swipe left to reach the menu.
  3. Tap Package Index. Then scroll down and tap Pip Console.
  4. Run the following commands (after starting the pip_install.py script):
pip install requests
pip install basc-archiver

Now you can just open QPython, tap My QPython, tap pip_console, and run the following command with your own thread URL:

thread-archiver --path=/sdcard/ http://boards.4chan.org/qa/thread/23839

To run the script in the background, press the back button, and tap OK at the Run in Background prompt. You can stop the script anytime using Vol Down + C.

  • Note: On Android (CLI), it is important to set the path to /sdcard/, so the thread dump can be accessed from the /sdcard/archives/4chan/ folder.
  • Note: To update the BASC-Archiver on Android (CLI), you must open QPython, press the 3-dot menu button, scroll down and tap Reset Private Space. Then just reinstall the BASC-Archiver.

Multi-threading

Basically, I think we should switch to a multi-threaded internal structure, both so the GUI can work, as well as just speeding up the CLI and general downloading.

I'm looking into and experimenting with multi-threaded architectures and layouts, so I'll throw my plans in here when I get something decent together.
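
As a starting point for discussion, here is a minimal worker-pool sketch built on the standard `queue` and `threading` modules (Python 3 names). `run_workers` and `download_item` are hypothetical names, and this is not the layout of the threaded branch:

```python
import queue
import threading


def run_workers(items, download_item, worker_count=4):
    """Process items on several threads; return a list of (item, outcome) pairs."""
    work = queue.Queue()
    for item in items:
        work.put(item)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                outcome = download_item(item)
            except Exception as error:
                # Record the failure instead of letting it kill the worker thread.
                outcome = error
            with results_lock:
                results.append((item, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Catching exceptions inside the worker keeps one bad item from silently ending a thread, at the cost of having to inspect the outcomes afterwards.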

Doesn't exit, even with --runonce

I'm using the CLI version for Windows from here:
https://github.com/bibanon/BASC-Archiver/releases

I'm on Windows 10 Pro x64 v15.11

Downloading a thread works but the program never finishes. Is this intentional? Continue to watch a thread?

However, even with --runonce I had to stop the program with Ctrl+C

By the way, is it only 4chan that is supported right now? Honestly, I have no idea about these different image boards and what APIs they provide (if any at all), but I always thought that the majority of these boards run on the same software, more or less..

Freezes

Hello, I have an issue where it freezes and doesn't fetch any new replies, until I ctrl-c and restart it. How do I fix this?

Upgrading the Windows release

Please forgive the noob question, but how does one go about installing the upgrades like 0.9.6? I've tried numerous things to do it, but have been unable to make any progress on it. I am on Windows.
