kiwix-zim-updater's Introduction

Hi, I'm jojo2357 and I code open source for fun and make some cool things.

kiwix-zim-updater's People

Contributors

docdrydenn, jojo2357

kiwix-zim-updater's Issues

Add option to skip downloads of archives larger than xxx MiB

As I prefer to download large ZIMs over BitTorrent (so I can stop and start, and to spread server load), I'd like to be able to set a size threshold over which a ZIM archive won't be downloaded (and the existing archive won't be purged).
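
A minimal sketch of such a threshold check, assuming the archive size in bytes is already known (for example from the server's Content-Length header). The function name `exceeds_size_limit` and the `MAX_MIB` variable are hypothetical and not part of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when an archive exceeds the limit.
#   $1 = archive size in bytes
#   $2 = threshold in MiB; empty or 0 means "no limit"
exceeds_size_limit() {
    local size_bytes=$1 max_mib=${2:-0}
    (( max_mib > 0 )) || return 1
    (( size_bytes > max_mib * 1024 * 1024 ))
}

# Usage: skip both the download and the purge of the existing archive.
# if exceeds_size_limit "$remote_size" "$MAX_MIB"; then
#     echo "Skipping: larger than ${MAX_MIB} MiB"
# fi
```

Treating an unset or zero threshold as "no limit" keeps the current behaviour for anyone who does not set the option.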

script not working with wget2

On some Linux distributions the newer wget2 is used. It does not have a "--show-progress" option, so the script fails with an error.

I would suggest that if wget --version reports version 2 or newer, the script should not use the "--show-progress" option.

I am not sure, but it seems that --progress=bar:force should work for all wget versions.
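
The version check suggested above could look roughly like this. It is a sketch that assumes the first line of `wget --version` has the shape "GNU Wget 1.21.3 built on ..." or "GNU Wget2 2.0.1 - ..."; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Extract the major version from the first line of `wget --version`.
# The banner shapes matched here are an assumption (see the text above).
wget_major_version() {
    local banner=$1
    local re='^GNU Wget2? ([0-9]+)\.'
    if [[ $banner =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Print the progress option to pass: --show-progress only for wget 1.x,
# nothing for version 2 or newer.
wget_progress_flag() {
    local major
    major=$(wget_major_version "$1") || return 1
    if (( major >= 2 )); then
        echo ""
    else
        echo "--show-progress"
    fi
}

# Usage:
# flag=$(wget_progress_flag "$(wget --version | head -n 1)")
```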

Script doesn't verify download prior to purging

I tried the new script with the curl download method, and for some reason the download completed but failed to save in the directory. This didn't stop the script from purging the old file, so now that item is lost!

For each file, it should verify that the new file is in the directory (and maybe even the right file size / hash, but at least existence) prior to deleting the old one.
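
The existence check requested above can be sketched as follows. `purge_if_replaced` is a hypothetical name, and the check covers only "exists and is non-empty", not size or hash.

```bash
#!/usr/bin/env bash
# Delete the old ZIM only if the new one exists as a non-empty regular file.
purge_if_replaced() {
    local new_zim=$1 old_zim=$2
    if [[ -f $new_zim && -s $new_zim ]]; then
        rm -f -- "$old_zim"
    else
        echo "Keeping $old_zim: $new_zim is missing or empty" >&2
        return 1
    fi
}

# Usage:
# purge_if_replaced "$zim_dir/wikivoyage_en_all_maxi_2022-08.zim" \
#                   "$zim_dir/wikivoyage_en_all_maxi_2021-12.zim"
```

In the log below, curl exits with error 23 and writes no file, so a check like this would have kept the 2021-12 archive.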

5. Downloading Updates...

      ✓ Download: https://download.kiwix.org/zim/wikivoyage/wikivoyage_en_all_maxi_2022-08.zim

/volume1/docker/kiwix/kiwix-zim/kiwix-zim.sh: line 202: rev: command not found
/volume1/docker/kiwix/kiwix-zim/kiwix-zim.sh: line 202: rev: command not found
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   270  100   270    0     0    501      0 --:--:-- --:--:-- --:--:--   500
Warning: Failed to create the file /volume1/docker/kiwix/: Is a directory

  0  682M    0 16375    0     0  20050      0  9:54:49 --:--:--  9:54:49 20050
curl: (23) Failure writing output to destination

6. Purging Replaced ZIM(s)...

      ✓ Purge: /volume1/docker/kiwix/wikivoyage_en_all_maxi_2021-12.zim

Packaging for Debian

Hi, thanks for writing this! I currently maintain the Kiwix/openZIM stack in Debian and intend to package this as well (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051782).

The main question I have is regarding the name: typically it would get installed in /usr/bin/, and things there don't have a .sh extension. Would installing it as /usr/bin/kiwix-zim-updater be a better name and acceptable to you? Or maybe you have a different preference?

Speed up update check

I noticed that the update check repeatedly wgets the same index URL. This can be sped up by keeping a cache of the index results to prevent repeated wgets.

You may assign this to me, as I already have code for it and am just waiting to develop more features into a PR.

Broken Cache Requires Being Manually Repaired

I ran kiwix-zim-updater with no internet connection, which created a log file. Although this log file was garbage, the script kept using it, presuming that it was still valid even after a day. The workaround of using the -g flag worked as expected.

SHA256 (256-bit) checksums

Researching the ability to validate SHA256 (256-bit) checksums.

  • SHA256 checksums are available from the servers. Just add .sha256 to the end of the file's link.
  • sha256sum seems to be a default app/program on most Linux distros.

Please provide input on what the script should be expected to do.

Should these checksum files be saved alongside the ZIMs and kept? Should they be purged after validation? Should the script always validate checksums?

I'm just stuck on what the "flow" should be. Please help me visualize this.
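
One possible flow, offered as a sketch rather than a decision: download the ZIM, fetch its .sha256 file into a temporary location, compare, then delete the checksum file. The helper below assumes the checksum file starts with the hex digest as its first whitespace-separated field (the `sha256sum` output format); `verify_sha256` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Compare a file's SHA-256 digest with the first field of a checksum file.
# Returns 0 on match, 1 on mismatch, 2 if either input cannot be read.
verify_sha256() {
    local zim=$1 sum_file=$2 expected actual
    read -r expected _ < "$sum_file" || [[ -n $expected ]] || return 2
    actual=$(sha256sum -- "$zim") || return 2
    actual=${actual%% *}          # keep only the digest, drop the file name
    [[ $expected == "$actual" ]]
}

# Possible flow (comments only, names are placeholders):
#   1. download NEW.zim
#   2. download NEW.zim.sha256 to a temp file
#   3. verify_sha256 NEW.zim TEMP.sha256
#   4. remove TEMP.sha256; on mismatch keep the old ZIM and discard NEW.zim
```

Comparing digests directly, instead of running `sha256sum -c`, avoids depending on the file name recorded inside the checksum file.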

wget not found on Synology NAS

Thanks so much for making this! I've been looking for something like that since first setting up Kiwix ages ago.

One small issue is that it's not discovering that wget is installed on Synology NAS, although it is there and working fine. Besides that, there are no errors in the script and it seems to update the files.

2. Checking Required Packages...

dpkg-query: no packages found matching wget
   ✗ wget: Not Found

2a. Installing Missing Packages:

./kiwix-zim.sh: line 82: apt: command not found

wget is there and works fine:

$ which wget
/bin/wget
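
The log above shows the check going through dpkg-query, which only works on Debian-style systems. A package-manager-independent check could be sketched with the shell builtin `command -v`; the function names here are hypothetical.

```bash
#!/usr/bin/env bash
# Succeed when a command can be found on PATH, whatever installed it.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Report each required command; return non-zero if any is missing.
check_required() {
    local missing=0 cmd
    for cmd in "$@"; do
        if have_command "$cmd"; then
            echo "   ✓ $cmd: Found"
        else
            echo "   ✗ $cmd: Not Found"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_required wget || echo "Please install the missing tools manually."
```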

Checksum entire library

In continuation of the conversation on #13, I am opening this issue since I currently have an implementation of checksumming the user's library without an entire rewrite.

@Jaifroid @DocDrydenn when you get the chance, please clone and test my fork with the new -f|--verify-library flags. In dry run, no files will be deleted except for temp sha256 files. Going out of dry run should delete corrupt files, and we should consider what to do: redownload or just purge?

Purge Zims Right After Update

I have a fairly large library which I updated (so awesome btw).

I did some by hand before discovering the script, but I still had:

69.3G
44.5G
30.5G
21.9G
15.5G
31.2G
~5G

~= 220G

And the script skipped Wikipedia (~93G) which I had done by hand.

The issue I barely missed? Running out of disk space. The script downloaded all the new zims and then removed the old ones. I think it would be better to download one, delete one, download one, delete, etc. That way I would (in the worst case) have needed 93G instead of 310G of free disk.
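
The download-one, delete-one order could be sketched like this. `fetch_zim` is a placeholder for the script's real download step (here a plain curl call), and `update_one` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Placeholder download step: fetch URL ($1) into DEST ($2).
fetch_zim() {
    curl -fL -o "$2" "$1"
}

# Replace one archive at a time: the old file is removed as soon as its
# replacement has been saved, before the next download starts.
update_one() {
    local url=$1 new_path=$2 old_path=$3
    if fetch_zim "$url" "$new_path" && [[ -s $new_path ]]; then
        rm -f -- "$old_path"
    else
        echo "Download failed, keeping $old_path" >&2
        return 1
    fi
}

# Usage: call update_one once per outdated ZIM inside the update loop.
```

With this order, the extra free space needed at any moment is roughly the size of the largest single new archive rather than the sum of all of them.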

Unable to update certain ZIM files

It appears that for any of the Stack Overflow ZIM files the script is unable to find newer versions. The script works for updating other files (like iFixit and Wikipedia), but not for the Stack Overflow ZIMs.

Here is what a test run of the script looks like for me:

==========================================
 kiwix-zim
       download.kiwix.org ZIM Updater

   v1.12 by DocDrydenn
==========================================

            DRY-RUN/SIMULATION
               - DISABLED -

             !!! Caution !!!

==========================================

1. Preprocessing...

  -Validating ZIM directory...
    ✓ Valid.

  -Parsing ZIM(s)...
    ✓ vegetarianism.stackexchange.com_en_all_2022-05.zim
    ✓ vegetarianism.stackexchange.com_en_all_2022-11.zim

    2 ZIM(s) found.

2. Checking Required Packages...

  ✓ curl: Found

3. Checking for Script Updates...

   ✓ Git Clone Detected: Checking Script Version...
   ✓ Version: Current

4. Processing ZIM(s)...

  -Checking: vegetarianism.stackexchange.com_en_all_2022-05.zim:
    ✗ No new update

  -Checking: vegetarianism.stackexchange.com_en_all_2022-11.zim:
    ✗ No new update

5. Downloading New ZIM(s)...

6. Purging Old ZIM(s)...

==========================================
 Process Complete.
==========================================

            DRY-RUN/SIMULATION
               - DISABLED -

==========================================

Could not find any remote files

Hi, I'm getting the following message when I try to run the script with kiwix-zim.sh /storage/wiki/zim:

2. Preprocessing...

  -Validating ZIM directory...
  ✓ Valid ZIM Directory

  -Building online ZIM list...
    ✗  Could not find any remote files, exiting

The /storage/wiki/zim folder looks like this:

-rwxrwxrwx 1    16572307 17 mars   2021 ekopedia_fr_all_maxi_2021-03.zim
-rw-r--r-- 1 75029282468  3 mai   04:57 gutenberg_en_all_2023-04.zim
-rw-r--r-- 1  3797739884  3 mai   05:03 gutenberg_fr_all_2023-04.zim
-rw-r--r-- 1  2765484808  3 mai   05:08 ifixit_en_all_2023-04.zim
-rwxrwxrwx 1    44098910 28 févr.  2021 wikem_en_all_maxi.zim
-rw-r--r-- 1 19176227890  3 mai   05:41 wikihow_fr_maxi_2023-02.zim
-rw-r--r-- 1 99899984433  3 mai   08:36 wikipedia_en_all_maxi_2023-04.zim
-rw-r--r-- 1 40419091525  3 mai   09:48 wikipedia_fr_all_maxi_2023-04.zim
-rwxrwxrwx 1 17402604038 16 sept.  2022 wikisource_en_all_maxi_2022-09.zim
-rwxrwxrwx 1 15729545803 20 avril  2022 wikisource_fr_all_maxi_2022-04.zim
-rw-r--r-- 1  1076490272  3 mai   09:50 wikivoyage_en_all_maxi_2023-04.zim
-rw-r--r-- 1  7695753078  3 mai   10:04 wiktionary_en_all_maxi_2023-02.zim
-rwxrwxrwx 1  4045685773 18 oct.   2022 wiktionary_fr_all_maxi_2022-10.zim
-rwxrwxrwx 1   213115960 15 mars   2022 zimgit-food-preparation_en_2022-03.zim
-rwxrwxrwx 1    29090556 15 mars   2022 zimgit-knots_en_2022-03.zim
-rwxrwxrwx 1    71296044 15 mars   2022 zimgit-medicine_en_2022-03.zim
-rw-r--r-- 1   800384896  3 mai   05:09 zimgit-post-disaster_en_2023-05.zim
-rwxrwxrwx 1    22078528 15 mars   2022 zimgit-water_en_2022-03.zim

What am I doing wrong?

Invalid args does not immediately halt script

Running kiwix-zim.sh with no extra arguments does not error out immediately, despite the obvious misconfiguration. May I suggest checking that the ZIMPath is present before calling master_scrape? I would also suggest a much more scalable args parser, rather than hardcoding support for only 3 args.
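
A sketch of a loop-based parser that fails before any network work is done. The option names and variable names below are hypothetical and only illustrate the pattern; they are not taken from the script.

```bash
#!/usr/bin/env bash
# Parse options in a loop so new flags only need one extra case branch.
# All option and variable names here are illustrative assumptions.
parse_args() {
    ZIM_PATH="" DRY_RUN=1 SKIP_PURGE=0
    while (( $# > 0 )); do
        case $1 in
            -d|--disable-dry-run) DRY_RUN=0 ;;
            -n|--skip-purge)      SKIP_PURGE=1 ;;
            -*) echo "Unknown option: $1" >&2; return 2 ;;
            *)  ZIM_PATH=$1 ;;
        esac
        shift
    done
    # Fail early, before any scraping starts.
    if [[ -z $ZIM_PATH ]]; then
        echo "Error: no ZIM directory given" >&2
        return 2
    fi
    if [[ ! -d $ZIM_PATH ]]; then
        echo "Error: not a directory: $ZIM_PATH" >&2
        return 2
    fi
}

# Usage:
# parse_args "$@" || exit $?
```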

Add option to skip purge

As I collect ZIMs of different types and ages for testing (as the dev of one of the Kiwix apps), I'd like to be able to suppress the purge via a command-line option.

Write log output to script directory

It'd be nice if the script wrote a log of its current / last-run progress to the script directory for cases where the script is run via a triggered task and the live output cannot easily be viewed.

For the slow parts, it could fill up a progress bar with a visible end. Something like this would work, continuously adding characters to the end during a download:

Completed –
[0% - - - - - - - - - - - - - - - - - 100%]
[#########################################]

In Progress –
[0% - - - - - - - - - - - - - - - - - 100%]
[######################
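
A rough sketch of both ideas. `draw_bar` is a hypothetical helper that prints a complete bar line for a given percentage (a variation on the append-only bar drawn above), and the log file name in the usage comment is an assumption.

```bash
#!/usr/bin/env bash
# Print a bar such as [#####     ] for a percentage and a width in characters.
# Percentages outside 0..100 are clamped.
draw_bar() {
    local percent=$1 width=${2:-41} filled bar
    (( percent < 0 )) && percent=0
    (( percent > 100 )) && percent=100
    filled=$(( percent * width / 100 ))
    printf -v bar '%*s' "$filled" ''   # a run of $filled spaces
    bar=${bar// /#}                    # turn the spaces into '#'
    printf '[%-*s]\n' "$width" "$bar"
}

# Usage: mirror all output into a log file next to the script
# (the file name kiwix-zim.log is an assumption):
# LOG_FILE="$(dirname "${BASH_SOURCE[0]}")/kiwix-zim.log"
# exec > >(tee -a "$LOG_FILE") 2>&1
# draw_bar 50
```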

Allow Setting country for mirror use

The mirrors do not do a good job for me, so I will add an option to specify a country code. If the country code is invalid (i.e. it breaks the query), the script will fall back to using Kiwix's geolocation.

Recover From Internet Issues Better

Wikipedia is big. Very big. Sometimes I have to move my laptop before it can complete the download. It is especially frustrating when this happens late in the download, more so because pressing Ctrl+C in the program means I will have extra archives and no checksums run.

I have changes on the way, so this is mostly to remind me to open that PR when I'm ready.
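
Not the pending changes mentioned above, just one way the idea could be sketched: wrap the download in a retry loop and let the downloader resume the partial file. `retry` is a hypothetical helper; `curl -C -` and `wget -c` are the tools' own resume options.

```bash
#!/usr/bin/env bash
# Run a command until it succeeds, at most MAX_TRIES times,
# sleeping DELAY seconds between attempts.
#   retry MAX_TRIES DELAY command [args...]
retry() {
    local max_tries=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        (( attempt >= max_tries )) && return 1
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
    return 0
}

# Usage: each attempt continues from the bytes already on disk.
# retry 5 30 curl -fL -C - -o "$dest" "$url"
# retry 5 30 wget -c -O "$dest" "$url"
```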

Torrent feature

A torrent feature would be dope, so as not to hammer the single server and to be more responsible.
I think placing .torrent files in the target location would be enough to keep this simple.

Cache zim index

Cache for, say, 1 day? Also with an option to force getting a new index? Sounds good to me.
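
A sketch of the freshness test, assuming a `find` that supports `-mmin` (GNU and BSD find do). `cache_is_usable` and its parameters are hypothetical names. An empty cache file is treated as unusable, so an index saved while offline does not get reused.

```bash
#!/usr/bin/env bash
# Succeed when the cached index can be reused.
#   $1 = cache file
#   $2 = maximum age in minutes (default 1440, i.e. one day)
#   $3 = 1 to force a fresh download regardless of age
cache_is_usable() {
    local cache_file=$1 max_age_min=${2:-1440} force=${3:-0}
    (( force == 0 )) || return 1
    [[ -s $cache_file ]] || return 1     # missing or empty file
    # find prints the path only if it was modified within the time window
    [[ -n $(find "$cache_file" -mmin "-$max_age_min" 2>/dev/null) ]]
}

# Usage:
# if ! cache_is_usable "$index_cache" 1440 "$FORCE_REFRESH"; then
#     : # download the index again and overwrite the cache
# fi
```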
