iadownload's Introduction

iadownload

NAME

iadownload - Download files from Internet Archive.

SYNOPSIS

usage: iadownload.py [-h] [--collection COLLECTION_ID] [--item ITEM_ID]
                     --outdir OUTDIR [--verbose]

Download files from an Internet Archive collection.

optional arguments:
  -h, --help            show this help message and exit
  --collection COLLECTION_ID
                        Internet Archive collection identifier
  --item ITEM_ID        Internet Archive item identifier
  --outdir OUTDIR       Directory where the files will be saved
  --verbose             Have verbose output

DESCRIPTION

iadownload is a Python script I wrote to download files from all of the items contained in a single collection at Internet Archive.

The script can also download files from a single item.

The script creates a collection directory in the location you specify. Within that directory it will create a subdirectory for each item in the collection. These subdirectories will contain the files for their respective items.
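A minimal sketch of that layout, assuming Python 3 and the `search_items` and `download` functions of the internetarchive library. The helper names `item_dir` and `download_collection` are made up for illustration and are not taken from the script itself.

```python
from pathlib import Path


def item_dir(outdir, collection_id, item_id):
    """Return the directory that would hold one item's files."""
    return Path(outdir).expanduser() / collection_id / item_id


def download_collection(outdir, collection_id):
    """Download every item of a public collection (needs network access)."""
    # Imported lazily so the path helper above works without the library.
    import internetarchive

    collection_dir = Path(outdir).expanduser() / collection_id
    collection_dir.mkdir(parents=True, exist_ok=True)
    for result in internetarchive.search_items("collection:" + collection_id):
        # download() creates a subdirectory named after the item inside destdir.
        internetarchive.download(result["identifier"], destdir=str(collection_dir))


# "some-item" is a placeholder identifier.
print(item_dir("/tmp/iadownloads", "sfperlmongers", "some-item"))
```

Only `item_dir` runs offline; `download_collection` contacts the Internet Archive and is shown to illustrate where the per-item subdirectories come from.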

LIMITATIONS!

Because this script was written so I could download hundreds of books from a single collection, it currently downloads only a fixed set of book-oriented file formats from the items. The specific formats which it will download:

  • Comic Book RAR
  • EPUB
  • Animated GIF
  • Text PDF
  • Image Container PDF

In the future the script will allow for specification of the desired formats at the command line (or perhaps via a YAML config file).

The script also has no accommodations for downloading files from protected collections (those where access is limited to certain Internet Archive patron accounts, so downloads require authorization). It only downloads files from public collections. As there are almost no protected collections on IA, this probably isn't a feature that I'll be adding any time soon.

PREREQUISITES

Internet Archive Python Library

This script uses the internetarchive Python Library.

To install (assumes global install):

sudo pip install "internetarchive[speedups]"

This will install not only the library but also some optional dependencies which will allow the downloads to happen more quickly.

Cython

The Internet Archive Python Library uses gevent to perform concurrent downloads, and Cython is needed to build gevent from source.

To install (assumes global install):

sudo pip install cython git+git://github.com/surfly/[email protected]#egg=gevent

OPTIONS

--collection

Defines the identifier of the Internet Archive collection from which you'd like to download files.

If this option is specified, the script will download files from every item in the collection. The files for each item will be placed in a subdirectory named using the identifier of the item.

This option may not be used in conjunction with the --item option.

Either this option or --item is required for the script to function.

--item

Defines the identifier of the Internet Archive item from which you'd like to download files.

If this option is specified, the script will download files for that item only. The files for the item will be placed in a subdirectory named using the identifier of the item.

This option may not be used in conjunction with the --collection option.

Either this option or --collection is required for the script to function.

--outdir

Defines the directory to which you'd like the items to be downloaded. You must have write access to this directory.

This option is required.
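A minimal sketch of checking the --outdir value up front, assuming the directory must already exist (the script may instead create it). The helper name `usable_outdir` is hypothetical.

```python
import os


def usable_outdir(path):
    """Return the expanded path if it is an existing, writable directory."""
    # expanduser turns a leading "~" into the home directory.
    expanded = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(expanded):
        raise ValueError(expanded + " is not an existing directory")
    # Creating entries in a directory needs both write and search permission.
    if not os.access(expanded, os.W_OK | os.X_OK):
        raise ValueError(expanded + " is not writable")
    return expanded
```

Failing early with a clear message is friendlier than an error halfway through a large collection download.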

--verbose

If defined, this option will enable output to STDOUT to allow you to track the progress of the downloads.

This option is not required. If it is not defined, the script will be entirely silent unless it exits due to an error.

EXAMPLE

./iadownload.py --collection sfperlmongers --outdir ~/Desktop/iadownloads --verbose

AUTHOR

This script is written and maintained by VM Brasseur.

KNOWN BUGS

All known bugs and enhancement requests are tracked in the issues on this repo.

REPORTING BUGS OR ENHANCEMENTS

If you use this script and would like to report bugs or suggest enhancements, please use the issues on this repo.

CONTRIBUTING

If you'd like to contribute to this project (docs, code, tests, etc.), please send a pull request.

COPYRIGHT AND LICENSE

All work on this project is copyright the authors of said work.

The source code for this project is licensed under the Apache License v2.0.

All documentation, web, or other content are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please see the LICENSE file for copies of these licenses.

iadownload's Issues

Confirm that either --collection or --item is defined

One of --collection or --item is required (but they cannot both be defined).

Right now I'm doing a check to confirm both aren't defined at the same time but I'm not doing any checks at all to confirm that one of them is.
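One possible way to get both checks from argparse itself is a required mutually exclusive group. This is a sketch that assumes the script builds its options with argparse (as the usage output suggests); `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="iadownload.py")
    # Exactly one of the two options must be given: argparse rejects
    # both "neither" and "both" with a usage error.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--collection", metavar="COLLECTION_ID")
    source.add_argument("--item", metavar="ITEM_ID")
    parser.add_argument("--outdir", required=True)
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args(["--item", "some-item", "--outdir", "out"])
print(args.item, args.collection)
```

With this in place the hand-written "both defined" check would no longer be needed, since argparse exits with an error message in either invalid case.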

Do a bit more defensive programming

For example:

def create_dir(basedir, identifier):
  newdir = basedir + "/" + identifier
  if os.path.exists(newdir):
    verboseout(newdir + " already exists. Not creating")
  else:
    verboseout(" Going to create " + newdir)
    os.makedirs(newdir) #XXX confirm this worked
    verboseout(newdir + " created")
  return newdir

There are several places in the code where some try/except action ought to happen to make things a bit more robust.
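A sketch of what a more defensive create_dir could look like, assuming Python 3.2 or later (for `exist_ok`) and leaving out the verboseout calls for brevity. Exiting via SystemExit is an illustrative choice, not necessarily how the script should report the failure.

```python
import os


def create_dir(basedir, identifier):
    """Create basedir/identifier if needed and return its path."""
    newdir = os.path.join(basedir, identifier)
    try:
        # exist_ok avoids a race between checking for the directory and creating it.
        os.makedirs(newdir, exist_ok=True)
    except OSError as err:
        # Covers permission errors and the case where newdir exists as a plain file.
        raise SystemExit("Could not create " + newdir + ": " + str(err))
    return newdir
```

Using os.path.join instead of concatenating with "/" also keeps the path valid on Windows.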

Add count/countdown for items

For --verbose.

To help keep track of download progress:

  • Before starting the download, print out the total number of items
  • Print a countdown during download, "num of total" (or some such) to show the user how many items remain
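The two bullets above could be sketched roughly like this. The function name `progress_lines` and the message wording are placeholders, and the list of identifiers is assumed to be fully known before downloading starts.

```python
def progress_lines(identifiers):
    """Yield (identifier, message) pairs such as 'Downloading x (2 of 5)'."""
    total = len(identifiers)
    for position, identifier in enumerate(identifiers, start=1):
        yield identifier, "Downloading %s (%d of %d)" % (identifier, position, total)


items = ["item-a", "item-b", "item-c"]  # placeholder identifiers
print("%d items to download" % len(items))
for identifier, message in progress_lines(items):
    print(message)
```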

Improve format handling/defining

The download function in the internetarchive library allows you to specify which file formats to include in the download. [*]

The formats are currently hardcoded:

  f = ["Comic Book RAR", "EPUB", "Animated GIF", "Text PDF", "Image Container PDF"]

This should be configurable in some way (YAML? --format?).

If not defined, the default is to download everything.
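One way the command-line variant might look: a repeatable `--format` option that stays None when it is not given, so that "not defined" can be passed through as "download everything". The option name and the helper names are hypothetical; nothing like this exists in the script yet.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="iadownload.py")
    # Hypothetical option; repeat it to request several formats.
    parser.add_argument("--format", dest="formats", action="append",
                        metavar="FORMAT", default=None,
                        help='Format to download, e.g. "Text PDF" (repeatable)')
    return parser


def requested_formats(argv):
    """Return the list of requested formats, or None to mean 'everything'."""
    return build_parser().parse_args(argv).formats


print(requested_formats(["--format", "EPUB", "--format", "Text PDF"]))
```

Taking one format per occurrence avoids having to pick a separator, since format names such as "Comic Book RAR" contain spaces.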

SNAG: The Archive has not documented all of the formats which are possible/valid. As you can see from the example above, it's not as simple as "*.pdf, *.epub". So I'll need to do some digging to figure out the possible/valid values then document them here (and send a pull request to Jake for his internetarchive library docs).

[*] Unfortunately, there's no way to exclude certain files, only include. So if you, say, want everything EXCEPT the .xml files? You need to specify each and every format you want rather than just saying "all except .xml" or "all except metadata".
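Until the library supports exclusion, an "all except" request can be turned into an include list by hand. This is a sketch assuming the format names present on an item are already known; the names in `available` are illustrative input only.

```python
def formats_except(available, excluded):
    """Return the formats in `available` that are not in `excluded`, order kept."""
    excluded = set(excluded)
    return [fmt for fmt in available if fmt not in excluded]


# Illustrative input only; real names would come from the item's file metadata.
available = ["EPUB", "Text PDF", "Metadata", "Animated GIF"]
print(formats_except(available, ["Metadata"]))
# prints ['EPUB', 'Text PDF', 'Animated GIF']
```

The resulting list is what would then be passed as the formats to include.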

Docs are OS X/Linux focused

I sent the repo link to a Windows user. He was perplexed. Fair 'nuff.

Update the docs to show how to install/use on Windows.
