
internetarchive's People

Contributors

anarcat, benbou8231, brycedrennan, cclauss, codersquid, dobatymo, duncandhall, eggplants, fibn144, hornc, ibnesayeed, jessetg, jesseweinstein, jjjake, justanotherarchivist, kngenie, lunixbochs, maxz, mlissner, mshemuni, nemobis, nlevitt, rajbot, saper, smokris, smrohrer, thetaiter, ursafoot, varun-magesh, willf


internetarchive's Issues

ConnectionError on upload

Creating a new ticket, since I can't seem to reopen #47.

I just tried to write another script using the internetarchive module, and again I am unable to upload anything. When I attempt to upload something, I get the following output:

sven@linux-rfa7:~/projects/pdfhost> ./mirror.py 
2014-05-22 14:23:25,958 Starting new HTTP connection (1): archive.org
2014-05-22 14:23:28,382 "GET /metadata/pdfy-rAU3zLsvxo-Ll4VC HTTP/1.1" 200 None
2014-05-22 14:23:28,405 Starting new HTTP connection (1): s3.us.archive.org
2014-05-22 14:23:28,852 Retrying (1 attempts remain) after connection broken by 'error(32, 'Broken pipe')': /pdfy-rAU3zLsvxo-Ll4VC/regionale-jaarcijfers-meld-misdaad-anoniem-2013.pdf
2014-05-22 14:23:28,852 Starting new HTTP connection (2): s3.us.archive.org
Traceback (most recent call last):
  File "./mirror.py", line 68, in <module>
    if item.upload([(real_filename, source_file)], metadata=metadata, access_key=conf["internetarchive"]["accesskey"], secret_key=conf["internetarchive"]["secretkey"]):
  File "/home/sven/projects/pdfhost/internetarchive/item.py", line 545, in upload
    resp = self.upload_file(body, key=key, **kwargs)
  File "/home/sven/projects/pdfhost/internetarchive/item.py", line 473, in upload_file
    response = self.http_session.send(prepared_request, stream=True)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 382, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s3.us.archive.org', port=80): Max retries exceeded with url: /pdfy-rAU3zLsvxo-Ll4VC/regionale-jaarcijfers-meld-misdaad-anoniem-2013.pdf (Caused by <class 'socket.error'>: [Errno 32] Broken pipe)

This is my code:

#!/usr/bin/env python

import json, internetarchive, oursql, os
from requests.exceptions import ConnectionError


###

import logging
import sys
log = logging.getLogger()
out_hdlr = logging.StreamHandler(sys.stdout)
out_hdlr.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
out_hdlr.setLevel(logging.DEBUG)
log.addHandler(out_hdlr)
log.setLevel(logging.DEBUG)

###



template = """
<p>
    <strong>This public document was automatically mirrored from <a href="http://pdf.yt/">PDFy</a>.</strong>
</p>

<ul>
    <li><strong>Original filename:</strong> %(real_filename)s</li>
    <li><strong>URL:</strong> <a href="http://pdf.yt/d/%(slug)s">http://pdf.yt/d/%(slug)s</a></li>
    <li><strong>Upload date:</strong> %(upload_date)s</li>
</ul>
"""

with open("config.json", "r") as f:
    conf = json.loads(f.read())

dbconn = oursql.Connection(host=conf["database"]["hostname"], user=conf["database"]["username"], passwd=conf["database"]["password"], db=conf["database"]["database"], autoreconnect=True)
cur = dbconn.cursor()

cur.execute("SELECT `Id`, `SlugId`, `Filename`, `Uploaded`, `OriginalFilename` FROM documents WHERE `Mirrored` = 0 AND `Public` = 1")
items = cur.fetchall()

for doc in items:
    id_, slug, storage_filename, upload_date, real_filename = doc

    if upload_date is None:
        upload_date = "Before April 27, 2014"
    else:
        upload_date = upload_date.strftime("%B %d, %Y %H:%M:%S")

    source_file = "storage/%s" % storage_filename

    item = internetarchive.get_item("pdfy-%s" % slug)

    metadata = {
        "mediatype": "texts",
        "subject": ["mirror"],
        "collection": "test_collection",
        "title": "%s (PDFy mirror)" % real_filename,
        "description": template % {
            "real_filename": real_filename,
            "slug": slug,
            "upload_date": upload_date
        },
        "date": "2014-01-01"
    }

    if item.upload([(real_filename, source_file)], metadata=metadata, access_key=conf["internetarchive"]["accesskey"], secret_key=conf["internetarchive"]["secretkey"]):
        cur = dbconn.cursor()
        cur.execute("UPDATE documents SET `Mirrored` = 1 WHERE `Id` = ?", (id_,))

        print "Uploaded %s (%s)" % (slug, metadata["title"])
    else:
        print "FAILED upload of %s (%s)!" % (slug, metadata["title"])

I have verified through some duct-tape debugging that my credentials are set correctly, and that the Authorization header is also set correctly. The below are the headers for a request (with access/secret key obscured):

CaseInsensitiveDict({'x-archive-meta00-title': u'regionale-jaarcijfers-meld-misdaad-anoniem-2013.pdf (PDFy mirror)', 'x-archive-meta00-collection': 'test_collection', 'x-archive-meta00-scanner': 'Internet Archive Python library 0.6.2', 'Content-Length': '1892395', 'x-archive-meta00-subject': 'mirror', 'Content-MD5': '89b85d0a0cc91caedd7f7d4da4392e7b', 'x-archive-meta00-description': u'\n<p>\n\t<strong>This public document was automatically mirrored from <a href="http://pdf.yt/">PDFy</a>.</strong>\n</p>\n\n<ul>\n\t<li><strong>Original filename:</strong> regionale-jaarcijfers-meld-misdaad-anoniem-2013.pdf</li>\n\t<li><strong>URL:</strong> <a href="http://pdf.yt/d/rAU3zLsvxo-Ll4VC">http://pdf.yt/d/rAU3zLsvxo-Ll4VC</a></li>\n\t<li><strong>Upload date:</strong> Before April 27, 2014</li>\n</ul>\n', 'x-archive-auto-make-bucket': 1, 'x-archive-meta00-mediatype': 'texts', 'x-archive-size-hint': 1892395, 'Authorization': 'LOW accesskey:secretkey'})

I am certain that I can reach the host (s3.us.archive.org), having tried it in a regular browser. The "Broken pipe" error seems to suggest that the IA S3 server is deliberately closing the connection (because of some invalid header?), but it's unclear to me why.

I am currently using my own fork of ia-wrapper that is based on the master branch of this repository, with two patches applied (see #61).

class File(object) issues in internetarchive/item.py

i don't understand lines #418-#420.
i think they make sense if they're trying to define "length" rather than "size", because original and derived (video) files have a "length" property ("length" is the number of seconds in the video) and jpgs and metadata files don't.

is that what the code is trying to do with the None if test?

if not, #418 is a dup of line 417 which handles "size".

and shouldn't #424 be defining "self.crc32" rather than "self.sha1"?

i don't know anything about python, but i need to be able to retrieve "length" for "original" and "derived" video files which led me to look at the code. if i don't get "length" out of the "files" section, i have to lookup the latest "derive" task in an item's history, and scrape it out of the ffmpeg output.

it would be great if "length" could be a column in the --files output of "ia metadata", but i understand if it's not broadly useful enough to be included.

thanks again for all your work on this, jake!
john

Better error handling needed for bonehead args

Command line used:

ia upload adamkennedyafewofmyfavouritethings \
~/Desktop/Adam_Kennedy-A_Few_of_My_Favourite_Things.flv  \
~/Desktop/Adam_Kennedy-A_Few_of_My_Favourite_Things.pdf \
--metadata="title:Adam Kennedy: A Few of My Favourite Things"   \
--metadata="subject:perl,computer languages, computer programming"   \
--metadata="collection:sfperlmongers"   \
--metadata="description:TBD"   \
--metadata="movies"   \
--metadata="date:2013-09-10"   \
--metadata="licenseurl:http://creativecommons.org/licenses/by-nc-sa/3.0/"

Result is an error:

Traceback (most recent call last):
  File "/usr/local/bin/ia", line 9, in <module>
    load_entry_point('internetarchive==0.3.4', 'console_scripts', 'ia')()
  File "/Library/Python/2.7/site-packages/iacli/ia.py", line 61, in main
    ia_module.main(argv)
  File "/Library/Python/2.7/site-packages/iacli/ia_upload.py", line 35, in main
    for k,v in changes:
ValueError: need more than 1 value to unpack

Both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set and exported (naturally I'm not sharing those with you; neener neener neener :-P ).

Judging from the code at line 35 of ia_upload.py, it sees the --metadata args, but one of them isn't splitting correctly. As you can see from the example above, a certain bonehead (me) forgot to include the "mediatype:" key in one of the metadata args (--metadata="movies") so, yeah, of course it can't split on a ":" which isn't there.

Rather than dying with this opaque and un-user-friendly system-generated message, there should be some defensive arg/error checking and exception handling, allowing for a more graceful bail should a bonehead like me screw up the args.
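A minimal sketch of the kind of defensive check being asked for, not the actual ia_upload.py code. The function name `parse_metadata_args` is hypothetical; it splits each argument on the first ":" only, so values that contain colons (titles, URLs) survive, and it raises a readable error when the key is missing.

```python
def parse_metadata_args(args):
    """Turn ['key:value', ...] into a dict, failing with a readable message.

    Hypothetical helper for illustration; not part of the ia CLI.
    """
    metadata = {}
    for arg in args:
        # partition() splits on the first ":" only, so "title:A: B" keeps "A: B".
        key, sep, value = arg.partition(":")
        if not sep or not key.strip():
            raise ValueError(
                "--metadata must be given as key:value, got %r" % arg
            )
        metadata[key.strip()] = value.strip()
    return metadata
```

With this check, the argument --metadata="movies" from the command above would be rejected with a message naming the offending value instead of an unpacking traceback.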

This is on a MacBook Pro running OS X 10.8.4 and Python 2.7.2 (i.e., the system Python).

Single-file downloads via CLI tool are broken

$ ia download TripDown1905 TripDown1905_512kb.mp4
downloading: TripDown1905_512kb.mp4
Traceback (most recent call last):
  File "/home/rkumar/pyenvs/kitchensink/bin/ia", line 8, in <module>
    load_entry_point('internetarchive==0.4.6', 'console_scripts', 'ia')()
  File "/home/rkumar/dev/git/ia-wrapper/iacli/ia.py", line 89, in main
    ia_module.main(argv)
  File "/home/rkumar/dev/git/ia-wrapper/iacli/ia_download.py", line 67, in main
    verbose=verbose, ignore_existing=args['--ignore-existing'])
TypeError: download() got an unexpected keyword argument 'dry_run'

503s when downloading are unexpectedly persistent

This client doesn't throttle when it gets a 503 from the server. I see problems when I'm doing

cat big-list-of-ids | parallel ia download {1} {1}_abbyy.gz

Once it starts getting 503s, it gets a lot. Notice that each client execution is a separate process.

I added a time.sleep(2.0) call to item.py right before "raise HTTPError". Now I only see a couple of 503s before it returns to normal downloading.
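The fixed sleep described above could be generalized into a retry loop with a growing delay. The following is only a sketch under assumptions: `ServiceUnavailable` and `fetch_with_backoff` are hypothetical names, and `fetch` stands in for whatever call in item.py performs the download and currently raises on a 503.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the error the client raises on a 503."""


def fetch_with_backoff(fetch, retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on ServiceUnavailable, wait and retry.

    The delay starts at base_delay seconds and doubles after each failure.
    The last failure is re-raised so the caller still sees the error.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == retries - 1:
                raise
            sleep(delay)
            delay *= 2
```

Because each `parallel` job is a separate process, the processes cannot coordinate; adding a small random offset to each delay might help keep them from retrying in lockstep, though that is untested here.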

ujson requirement causes install failure on linux boxes without python-dev installed

A user trying to sudo pip install internetarchive received this error on a debian-based distro: http://pastebin.com/zRfgEQmm

Fixed by doing a sudo apt-get install python-dev first.

To make it easier for our users, we should remove ujson as an install requirement.

If ujson is present in the environment, we should use it, but fall back to json if needed.
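The fallback proposed above is usually written as a guarded import. This is a sketch of the pattern rather than the project's actual code; it relies only on `loads` and `dumps`, which both modules provide.

```python
try:
    import ujson as json  # optional C extension; used when it is installed
except ImportError:
    import json  # standard-library fallback, always available

# Callers use the shared name and do not care which module is behind it.
data = json.loads('{"identifier": "example"}')
```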

Here is the traceback in case the pastebin disappears:

------------------------------------------------------------
/usr/bin/pip run on Sun Sep 15 11:00:21 2013
Requirement already satisfied (use --upgrade to upgrade): internetarchive in /usr/local/lib/python2.7/dist-packages

Requirement already satisfied (use --upgrade to upgrade): boto==2.9.9 in /usr/local/lib/python2.7/dist-packages (from internetarchive)

Requirement already satisfied (use --upgrade to upgrade): jsonpatch==1.1 in /usr/local/lib/python2.7/dist-packages (from internetarchive)

Downloading/unpacking ujson==1.33 (from internetarchive)

  Running setup.py egg_info for package ujson

    running egg_info
    writing pip-egg-info/ujson.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/ujson.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/ujson.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/ujson.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/ujson.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/ujson has version 1.33, which satisfies requirement ujson==1.33 (from internetarchive)
Downloading/unpacking pytest==2.3.4 (from internetarchive)

  Running setup.py egg_info for package pytest

    running egg_info
    writing requirements to pip-egg-info/pytest.egg-info/requires.txt
    writing pip-egg-info/pytest.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/pytest.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/pytest.egg-info/dependency_links.txt
    writing entry points to pip-egg-info/pytest.egg-info/entry_points.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/pytest.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/pytest.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/pytest has version 2.3.4, which satisfies requirement pytest==2.3.4 (from internetarchive)
Downloading/unpacking docopt==0.6.1 (from internetarchive)

  Running setup.py egg_info for package docopt

    running egg_info
    writing pip-egg-info/docopt.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/docopt.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/docopt.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/docopt.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/docopt.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/docopt has version 0.6.1, which satisfies requirement docopt==0.6.1 (from internetarchive)
Downloading/unpacking PyYAML==3.10 (from internetarchive)

  Running setup.py egg_info for package PyYAML

    running egg_info
    writing pip-egg-info/PyYAML.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/PyYAML.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/PyYAML.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/PyYAML.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/PyYAML.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/PyYAML has version 3.10, which satisfies requirement PyYAML==3.10 (from internetarchive)
Downloading/unpacking jsonpointer>=1.0 (from jsonpatch==1.1->internetarchive)

  Running setup.py egg_info for package jsonpointer

    running egg_info
    writing pip-egg-info/jsonpointer.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/jsonpointer.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/jsonpointer.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/jsonpointer.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/jsonpointer.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/jsonpointer has version 1.0, which satisfies requirement jsonpointer>=1.0 (from jsonpatch==1.1->internetarchive)
Downloading/unpacking py>=1.4.12 (from pytest==2.3.4->internetarchive)

  Running setup.py egg_info for package py

    running egg_info
    writing pip-egg-info/py.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/py.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/py.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/py.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/py.egg-info/SOURCES.txt'
  Source in /tmp/pip-build-root/py has version 1.4.15, which satisfies requirement py>=1.4.12 (from pytest==2.3.4->internetarchive)
Installing collected packages: ujson, pytest, docopt, PyYAML, jsonpointer, py

  Running setup.py install for ujson

    Running command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-root/ujson/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-wUL0HF-record/install-record.txt --single-version-externally-managed
    running install
    running build
    running build_ext
    building 'ujson' extension

    creating build
    creating build/temp.linux-x86_64-2.7
    creating build/temp.linux-x86_64-2.7/python
    creating build/temp.linux-x86_64-2.7/lib
    x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I./python -I./lib -I/usr/include/python2.7 -c ./python/ujson.c -o build/temp.linux-x86_64-2.7/./python/ujson.o -D_GNU_SOURCE

    In file included from ./python/ujson.c:38:0:

    ./python/py_defines.h:38:20: fatal error: Python.h: No such file or directory

    compilation terminated.

    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    Complete output from command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-root/ujson/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-wUL0HF-record/install-record.txt --single-version-externally-managed:

    running install

running build

running build_ext

building 'ujson' extension

creating build

creating build/temp.linux-x86_64-2.7

creating build/temp.linux-x86_64-2.7/python

creating build/temp.linux-x86_64-2.7/lib

x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I./python -I./lib -I/usr/include/python2.7 -c ./python/ujson.c -o build/temp.linux-x86_64-2.7/./python/ujson.o -D_GNU_SOURCE

In file included from ./python/ujson.c:38:0:

./python/py_defines.h:38:20: fatal error: Python.h: No such file or directory

compilation terminated.

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------

Command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-root/ujson/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-wUL0HF-record/install-record.txt --single-version-externally-managed failed with error code 1 in /tmp/pip-build-root/ujson

Exception information:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 139, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 271, in run
    requirement_set.install(install_options, global_options, root=options.root_path)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1185, in install
    requirement.install(install_options, global_options, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 592, in install
    cwd=self.source_dir, filter_stdout=self._filter_install, show_stdout=False)
  File "/usr/lib/python2.7/dist-packages/pip/util.py", line 662, in call_subprocess
    % (command_desc, proc.returncode, cwd))
InstallationError: Command /usr/bin/python -c "import setuptools;__file__='/tmp/pip-build-root/ujson/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-wUL0HF-record/install-record.txt --single-version-externally-managed failed with error code 1 in /tmp/pip-build-root/ujson

add delete

Please add a command to delete a file from an item.

Use trollius (asyncio) instead of gevent.

Gevent is a pain to install, and many users do not get the benefits of ia-wrapper's concurrent functionality. Trollius is easy to install, so all of ia-wrapper's concurrent functionality can be provided by default -- no need to install gevent to use Mine(). Asyncio has also been included in the Python standard library (on a provisional basis) as of 3.4.

I have created a new branch to start working on this: https://github.com/jjjake/ia-wrapper/tree/mine

One of the biggest changes is how Mine is called and used. Rather than the Mine object being an iterable that yields Item objects, Item objects can be manipulated via callbacks:

>>> from internetarchive.mine import Mine
>>> identifiers = [x.strip() for x in open('itemlist.txt')]
>>> download = lambda item: item.download()
>>> miner = Mine(identifiers, callback=download)
>>> miner.run()

This branch is currently a work in progress, and ideas and suggestions are welcome!

Error in docs on using search.results()

I think your documentation may contain an error. When you explain how to iterate over search results, your documentation is missing some parentheses for the results() function:

It currently reads:

for result in search.results:
...     print result['identifier']

I believe it should read:

for result in search.results():
     print result['identifier']

Minor bug .internetarchive.yml

Hi,

ia-wrapper seems to expect the file .internetarchive.yml to exist in the user's home directory, but it doesn't create the file during installation. If the file isn't there, it throws the following error:

Traceback (most recent call last):
  File "modify_metadata.py", line 7, in <module>
    item.modify_metadata(md)
  File "/usr/local/lib/python2.7/dist-packages/internetarchive/internetarchive.py", line 189, in modify_metadata
    self._configure()
  File "/usr/local/lib/python2.7/dist-packages/internetarchive/internetarchive.py", line 78, in _configure
    self.config = yaml.load(open(self.config_file))
IOError: [Errno 2] No such file or directory: '/home/jdurno/.internetarchive.yml'

Creating an empty file called .internetarchive.yml in my home directory fixed the problem.
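One way the library could tolerate a missing file is to treat it the same as an empty one. The sketch below is an assumption about how that might look, not the code in internetarchive.py; `read_config_text` is a hypothetical helper, and the returned text would still be handed to the YAML parser by the caller.

```python
import os


def read_config_text(path=None):
    """Return the config file's text, or '' if the file does not exist.

    Hypothetical helper; the default path mirrors the one in the traceback.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".internetarchive.yml")
    if not os.path.isfile(path):
        return ""  # treat a missing file like an empty one
    with open(path) as f:
        return f.read()
```

Note that PyYAML's `safe_load` returns None for an empty document, so the caller would likely still want something like `yaml.safe_load(text) or {}`.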

BTW, thanks very much for creating ia-wrapper. Much nicer than using boto directly.

best,
John

Download of large file causes memory error

>>> import internetarchive as ia
>>> item = ia.Item('NO404-WKP-20131104215558-crawl345')
>>> f = item.file('NO404-WKP-20131104222227-08103.warc.gz')
>>> f.download()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "internetarchive/item.py", line 467, in download
    fp.write(data.read())
  File "/usr/lib/python2.7/socket.py", line 359, in read
    return buf.getvalue()
MemoryError

>>> ia.__version__
'0.4.6'
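The traceback points at `fp.write(data.read())`, which reads the whole response into memory before writing it. A common way around that is to copy in fixed-size chunks. The sketch below uses the standard library's `shutil.copyfileobj`; `save_stream` is a hypothetical name, and `source` is assumed to be any file-like object with a `read(size)` method, such as the response body in item.py.

```python
import io
import shutil


def save_stream(source, path, chunk_size=1024 * 1024):
    """Copy a file-like object to path, holding at most chunk_size bytes at a time."""
    with open(path, "wb") as fp:
        shutil.copyfileobj(source, fp, chunk_size)


# Usage with an in-memory stream standing in for the HTTP response body.
save_stream(io.BytesIO(b"example payload"), "example.bin")
```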

subject metadata arguments are ignored

ia upload is ignoring --metadata="subject:foo".

I tried:

ia upload testupload_2013_09_25_5 foo.txt bar.txt --metadata="title:Adam Kennedy A Few of My Favourite Things" --metadata="subject:perl" --metadata="subject:computer languages" --metadata="subject:computer programming" --metadata="collection:sfperlmongers" --metadata="description:TBD"

Although the --debug shows the subject fields are added to the S3 headers, the meta.xml does not contain any <subject> elements: https://archive.org/download/testupload_2013_09_25_5/testupload_2013_09_25_5_meta.xml

Cannot upload with `date` in metadata CLI argument

See #20

ia upload fails if --metadata="date:2013" is passed as a cli argument.

The files are not uploaded and usage information is dumped:

$ ia upload testupload_2013_09_25_5 foo.txt --metadata="date:2013"
Upload items to archive.org.

usage: 
    ia upload <identifier> <file>... [options...]

options:

 -h, --help
 -n, --no-derive             Do not derive the item after files have been 
                             uploaded.
 -d, --debug                 Return the headers to be sent to IA-S3. default: True
 -M, --multipart             Upload files to archive.org in parts, using 
                             IA-S3 multipart.
 -i, --ignore-bucket         Destroy and respecify the metadata for a 
                             given item.
 -m, --metadata=<key:value>  Metadata to add to the item. default: None
 -H, --header=<key:value>    default: None

ConnectionError on upload

I've been working on a script to automatically upload recorded livesets to the Internet Archive, but I've been running into a ConnectionError exception while doing so, and I've been unable to figure out what is causing it.

I've been attempting to run this several times during the past few weeks, thinking it might just be a network issue, but the issue always occurs for this script. Another script I run on the same server, for automatically uploading scraped pastes from Pastebin, works just fine. Is there something I'm doing wrong, or is this a bug?

This is what happens:

stream@croissant:~$ python upload.py sorted/
Traceback (most recent call last): THE BLEND TECHNOTERRA SOUNDSYSTEM GROSSETO.mp3: [                                ] 19/219327 - 00:00:00
  File "upload.py", line 72, in <module>
    item.upload([mp3_filename, json_filename], metadata=headers, verbose=True)
  File "/usr/local/lib/python2.7/dist-packages/internetarchive/item.py", line 433, in upload
    resp = self.upload_file(f, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/internetarchive/item.py", line 386, in upload_file
    return self.session.send(prepared_request, stream=True)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 382, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s3.us.archive.org', port=80): Max retries exceeded with url: /TEST-Technoterra-Live_THE_BLEND_TECHNOTERRA_SOUNDSYSTEM_GROSSETO-28Oct2011/20111028%20-%20082053-%20Live%20THE%20BLEND%20TECHNOTERRA%20SOUNDSYSTEM%20GROSSETO.mp3 (Caused by <class 'socket.error'>: [Errno 32] Broken pipe)

This is my code:

import internetarchive, os, json, sys
import string, random, datetime, re, yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

os.environ["AWS_ACCESS_KEY_ID"] = config["s3"]["access"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["s3"]["secret"]

def parse_iso_datetime(str) :
    # parse a datetime generated by datetime.isoformat()
    # http://dwiel.net/blog/python-parsing-the-output-of-datetime-isoformat/
    try :
        return datetime.datetime.strptime(str, "%Y-%m-%dT%H:%M:%S")
    except ValueError :
        return datetime.datetime.strptime(str, "%Y-%m-%dT%H:%M:%S.%f")

def random_string(length = 12):
    return "".join(random.choice(string.lowercase) for x in xrange(0, length))

current_file = ""
sys.stdout.write("")

def urlize(text):
    text = re.sub("[^a-zA-Z0-9_-]", "_", text)
    text = re.sub("_{2,}", "_", text)
    return text

description_template = """
Liveset title: %(title)s<br>
DJ/Show: %(artist)s<br>
Broadcast started: %(date)s<br>
Station: <a href="http://afterhoursdjs.org/">AfterHoursDJs.org</a><br>
<br>
Please note that this liveset was recorded, classified, and uploaded automatically. Information may be inaccurate, and the recording may be incomplete. Please report any inaccuracies to the uploader.<br>
<br>
The classification metadata can be found in the accompanying JSON file.
"""

source_dir = sys.argv[1]
files = [os.path.join(source_dir, f) for f in os.listdir(source_dir) if os.path.isfile(os.path.join(source_dir, f)) and f.endswith(".mp3")]

for mp3_filename in files:
    json_filename =  os.path.splitext(mp3_filename)[0] + ".json"

    json_file = open(json_filename, "r")
    metadata = json.load(json_file)
    json_file.close()

    broadcast_date = parse_iso_datetime(metadata["broadcast_date"])
    bucket_name = "TEST-%s-%s-%s" % (metadata["artist"].replace(" ", ""), urlize(metadata["title"]), broadcast_date.strftime("%d%b%Y"))

    headers = {
        "mediatype": "audio",
        "collection": "afterhoursdjs_livesets",
        "subject": ["AfterHoursDJs.org", "liveset", metadata["artist"], broadcast_date.year, "test_collection"],
        "title": "%s (%s, %s)" % (metadata["title"], metadata["artist"], broadcast_date.strftime("%d %b %Y")),
        "description": description_template % {
            "title": metadata["title"],
            "artist": metadata["artist"],
            "date": broadcast_date.strftime("%d %B %Y, %H:%M:%S")
        }
    }

    item = internetarchive.Item(bucket_name)

    if item.exists:
        print "Skipping %s, already exists..." % bucket_name
        continue
    else:
        # Item does not exist yet, upload...
        item.upload(mp3_filename, metadata=headers, verbose=True)
        item.upload(json_filename, verbose=True)

This is what is in said directory:

stream@croissant:~$ ls sorted
20090801 - Misjah In The Mix.json                                     20101226 - 054733- Live LRCN ChillTemp01.mp3
20090801 - Misjah In The Mix.mp3                                      20101226 - 080133- Live LRCN Tech Therapy 010.json
20090901 - Misjah In The Mix.json                                     20101226 - 080133- Live LRCN Tech Therapy 010.mp3
20090901 - Misjah In The Mix.mp3                                      20101228 - 090420- Live Mash D Random Music.json
20091204 - 080400- Live The Live Blend technoterra LIVE 04 12 2009.json                   20101228 - 090420- Live Mash D Random Music.mp3
20091204 - 080400- Live The Live Blend technoterra LIVE 04 12 2009.mp3                    20110103 - 105213- Live RaveOlution 50 with DJ Teknikal Crysis.json
20091230 - 161109- Live Bass Attack V with DJs G LuX and CB SoundZ.json                   20110103 - 105213- Live RaveOlution 50 with DJ Teknikal Crysis.mp3
20091230 - 161109- Live Bass Attack V with DJs G LuX and CB SoundZ.mp3                    20110106 - 191317- Live Jake Encinas Live from Minneapolis.json
20091230 - 193710- Live SinCitySyndicate presents DJ Neuro.json                       20110106 - 191317- Live Jake Encinas Live from Minneapolis.mp3
20091230 - 193710- Live SinCitySyndicate presents DJ Neuro.mp3                        20110110 - 093123- Live TonY Presents Club Beats.json
20100104 - 193934- Live Teknofunk w DJ Jakub.json                             20110110 - 093123- Live TonY Presents Club Beats.mp3
20100104 - 193934- Live Teknofunk w DJ Jakub.mp3                              20110113 - 130058- Live Moments of joy with Delune x11 Session 2.json
20100108 - 192245- Live DJ TAD Satech 3.json                                  20110113 - 130058- Live Moments of joy with Delune x11 Session 2.mp3
20100108 - 192245- Live DJ TAD Satech 3.mp3                               20110115 - 130431- Live The Bidness 069 January 2011 with D Trax and LRCN.json
20100109 - 095900- Live RaveOlution Episode 27 Teknikal Crysis.json                   20110115 - 130431- Live The Bidness 069 January 2011 with D Trax and LRCN.mp3
20100109 - 095900- Live RaveOlution Episode 27 Teknikal Crysis.mp3                    20110122 - 120742- Live The Truth w Kegan EP 82.json
20100114 - 230226- Live Jake Encinas Live.json                                20110122 - 120742- Live The Truth w Kegan EP 82.mp3
20100114 - 230226- Live Jake Encinas Live.mp3                                 20110302 - 171341- Live DJ TAD PROTRONIC EP 58.json
20100115 - 110623- Live Sick Note SmackHI chapter 26 LiVe IT.json                     20110302 - 171341- Live DJ TAD PROTRONIC EP 58.mp3
20100115 - 110623- Live Sick Note SmackHI chapter 26 LiVe IT.mp3                      20110323 - 170016- Live DJ TAD Protronic 61.json
20100115 - 165243- Live The Truth w Kegan Episode 68 Yeah its for real.json               20110323 - 170016- Live DJ TAD Protronic 61.mp3
20100115 - 165243- Live The Truth w Kegan Episode 68 Yeah its for real.mp3                20110415 - 153002- Live Moments of joy with Delune x11 Session 8.json
20100115 - 204501- Live Justin Styles Live from the W Hotel Minneapolis.json                  20110415 - 153002- Live Moments of joy with Delune x11 Session 8.mp3
20100115 - 204501- Live Justin Styles Live from the W Hotel Minneapolis.mp3               20110420 - 183111- Live DJ TAD Protronic 64 PT2.json
20100116 - 104059- Live Gabriel Setis Tribute to Melody 47 LIVE w Stickam Join the IRC for Info.json  20110420 - 183111- Live DJ TAD Protronic 64 PT2.mp3
20100116 - 104059- Live Gabriel Setis Tribute to Melody 47 LIVE w Stickam Join the IRC for Info.mp3   20110422 - 081242- Live THE BLEND the sound of technoterra.json
20100120 - 212947- Live Jake Encinas Live.json                                20110422 - 081242- Live THE BLEND the sound of technoterra.mp3
20100120 - 212947- Live Jake Encinas Live.mp3                                 20110424 - 132410- Live Moments of Joy easter edit with Delune.json
20100123 - 162957- Live Patrick Kroft LIVE.json                               20110424 - 132410- Live Moments of Joy easter edit with Delune.mp3
20100123 - 162957- Live Patrick Kroft LIVE.mp3                                20110427 - 172336- Live DJ TAD Protronic 65.json
20100123 - 184255- Live Sub Assassins Studios.json                            20110427 - 172336- Live DJ TAD Protronic 65.mp3
20100123 - 184255- Live Sub Assassins Studios.mp3                             20110429 - 081626- Live TECHNOTERRA THE BLEND 29april2011.json
20100212 - 011436- Live Moments of Joy Wake up tunes with Delune.json                     20110429 - 081626- Live TECHNOTERRA THE BLEND 29april2011.mp3
20100212 - 011436- Live Moments of Joy Wake up tunes with Delune.mp3                      20110513 - 091155- Live TECHNOTERRA THE BLEND 13 05 2011.json
20100215 - 100456- Live Patrick Kroft.json                                20110513 - 091155- Live TECHNOTERRA THE BLEND 13 05 2011.mp3
20100215 - 100456- Live Patrick Kroft.mp3                                 20110514 - 111133- Live The Truth w Kegan EP 93.json
20100219 - 171419- Live The Truth w kegan Ep 73.json                              20110514 - 111133- Live The Truth w Kegan EP 93.mp3
20100219 - 171419- Live The Truth w kegan Ep 73.mp3                           20110525 - 192439- Live DJ TAD EP 69 PT 2.json
20100220 - 122103- Live SinCitySyndicate presents DJ Neuro LIVE.json                      20110525 - 192439- Live DJ TAD EP 69 PT 2.mp3
20100220 - 122103- Live SinCitySyndicate presents DJ Neuro LIVE.mp3                   20110603 - 083138- Live TECHNOTERRA THE BLEND 02 JUNE 2011.json
20100223 - 093337- Live Stephen Wiley.json                                20110603 - 083138- Live TECHNOTERRA THE BLEND 02 JUNE 2011.mp3
20100223 - 093337- Live Stephen Wiley.mp3                                 20110603 - 093916- Live TECHNOTERRA THE BLEND 02 JUNE 2011.json
20100312 - 212521- Live Jake Encinas Live.json                                20110603 - 093916- Live TECHNOTERRA THE BLEND 02 JUNE 2011.mp3
20100312 - 212521- Live Jake Encinas Live.mp3                                 20110624 - 092712- Live TECHNOTERRA THE BLEND 24 JUNE 2011.json
20100313 - 180519- Live DJ Teknikal Crysis The RaveOlution Episode 32.json                20110624 - 092712- Live TECHNOTERRA THE BLEND 24 JUNE 2011.mp3
20100313 - 180519- Live DJ Teknikal Crysis The RaveOlution Episode 32.mp3                 20110701 - 080221- Live TECHNOTERRA THE BLEND 01 JULY 2011.json
20100315 - 154833- Live TumTum Random Selctions 2010.json                         20110701 - 080221- Live TECHNOTERRA THE BLEND 01 JULY 2011.mp3
20100315 - 154833- Live TumTum Random Selctions 2010.mp3                          20110706 - 170543- Live DJ TAD Protronic 73.json
20100321 - 171750- Live Prognosis Live with Jake Encinas and Sean Merrell.json                20110706 - 170543- Live DJ TAD Protronic 73.mp3
20100321 - 171750- Live Prognosis Live with Jake Encinas and Sean Merrell.mp3                 20110711 - 061122- Live Moments of joy with Delune x11 Session 9.json
20100325 - 224843- Live Jake Encinas.json                                 20110711 - 061122- Live Moments of joy with Delune x11 Session 9.mp3
20100325 - 224843- Live Jake Encinas.mp3                                  20110730 - 125931- Live Mash D ElectroNight 48.json
20100331 - 111019- Live Moments of Joy with Delune Episode 63.json                    20110730 - 125931- Live Mash D ElectroNight 48.mp3
20100331 - 111019- Live Moments of Joy with Delune Episode 63.mp3                     20110802 - 145215- Live Mash D ElectroNight 48.json
20100405 - 155343- Live Hubris Sessions with Justin Sytles.json                       20110802 - 145215- Live Mash D ElectroNight 48.mp3
20100405 - 155343- Live Hubris Sessions with Justin Sytles.mp3                        20110908 - 085508- Live Moments of Joy with Delune 08 09 11.json
20100408 - 165549- Live The RaveOlution Episode 33 with DJ Teknikal Crysis.json               20110908 - 085508- Live Moments of Joy with Delune 08 09 11.mp3
20100408 - 165549- Live The RaveOlution Episode 33 with DJ Teknikal Crysis.mp3                20110909 - 133401- Live Mike Hunter and Ciacomix from Germany.json
20100408 - 190124- Live DJ TAD Rock The Vote Ep 4 www djtad com vote.json                 20110909 - 133401- Live Mike Hunter and Ciacomix from Germany.mp3
20100408 - 190124- Live DJ TAD Rock The Vote Ep 4 www djtad com vote.mp3                  20110911 - 114806- Live Setis September 2011 Promo.json
20100409 - 172025- Live DJ TAD Rock The Vote Ep 5 www djtad com vote.json                 20110911 - 114806- Live Setis September 2011 Promo.mp3
20100409 - 172025- Live DJ TAD Rock The Vote Ep 5 www djtad com vote.mp3                  20110912 - 085706- Live Darthiis Electro House Session 19.json
20100416 - 133929- Live Moments of Joy with Delune the weekend edition.json               20110912 - 085706- Live Darthiis Electro House Session 19.mp3
20100416 - 133929- Live Moments of Joy with Delune the weekend edition.mp3                20110918 - 085712- Live w a r m d i g i t s technoterra soundsystem EXTRA 18 08 2011.json
20100416 - 225413- Live Jake Encinas ALL VINYL ALL OLD SCHOOL.json                    20110918 - 085712- Live w a r m d i g i t s technoterra soundsystem EXTRA 18 08 2011.mp3
20100416 - 225413- Live Jake Encinas ALL VINYL ALL OLD SCHOOL.mp3                     20110921 - 191036- Live DJ TAD Protronic 77.json
20100420 - 160422- Live 24 hour live dirty sexy set by betamaxdj.json                     20110921 - 191036- Live DJ TAD Protronic 77.mp3
20100420 - 160422- Live 24 hour live dirty sexy set by betamaxdj.mp3                      20110923 - 084348- Live The blend technoterra 23 09 2011.json
20100428 - 193139- Live The Root Cause with Sean Merrell.json                         20110923 - 084348- Live The blend technoterra 23 09 2011.mp3
20100428 - 193139- Live The Root Cause with Sean Merrell.mp3                          20110926 - 085923- Live Darthii s Electro House Session 21.json
20100709 - 084615- Live technoterra.json                                  20110926 - 085923- Live Darthii s Electro House Session 21.mp3
20100709 - 084615- Live technoterra.mp3                                   20111001 - 095301- Live Mash D ElectroNight 56 p2.json
20100709 - 110433- Live DJ Mash D s Sunshine and Tribal mix from Germany.json                 20111001 - 095301- Live Mash D ElectroNight 56 p2.mp3
20100709 - 110433- Live DJ Mash D s Sunshine and Tribal mix from Germany.mp3                  20111005 - 190001- Live DJ TAD Protronic 78.json
20100716 - 110252- Live HandzUp for the weekend with DJ Mash D.json                   20111005 - 190001- Live DJ TAD Protronic 78.mp3
20100716 - 110252- Live HandzUp for the weekend with DJ Mash D.mp3                    20111007 - 083614- Live theBlend TECHNOTERRA SoundSystem 07 10 2011.json
20100724 - 000244- Live Sub Assassins We ate a bunch of mushrooms.json                    20111007 - 083614- Live theBlend TECHNOTERRA SoundSystem 07 10 2011.mp3
20100724 - 000244- Live Sub Assassins We ate a bunch of mushrooms.mp3                     20111008 - 090307- Live Mash D ElectroNight 57.json
20100725 - 134805- Live SinCitySyndicate weekly LIVE radio show w DJ Neuro.json               20111008 - 090307- Live Mash D ElectroNight 57.mp3
20100725 - 134805- Live SinCitySyndicate weekly LIVE radio show w DJ Neuro.mp3                20111010 - 114200- Live Moments of Joy with Delune 10 10 11.json
20100726 - 101446- Live Moments of Joy with Delune August.json                        20111010 - 114200- Live Moments of Joy with Delune 10 10 11.mp3
20100726 - 101446- Live Moments of Joy with Delune August.mp3                         20111015 - 115050- Live Moments of Joy weekendmix 15 10 11.json
20100802 - 185153- Live Teknofunk with DJ Jakub.json                              20111015 - 115050- Live Moments of Joy weekendmix 15 10 11.mp3
20100802 - 185153- Live Teknofunk with DJ Jakub.mp3                           20111021 - 071706- Live Moments of Joy Weekend edition 211011.json
20100803 - 123616- Live Little B Day Mix by Mash D part 2.json                        20111021 - 071706- Live Moments of Joy Weekend edition 211011.mp3
20100803 - 123616- Live Little B Day Mix by Mash D part 2.mp3                         20111021 - 083509- Live technoTerra SoundSystem theBlend 21 10 2011.json
20100803 - 180401- Live Happy Birthday to Mash D w DJ Teknikal Crysis.json                20111021 - 083509- Live technoTerra SoundSystem theBlend 21 10 2011.mp3
20100803 - 180401- Live Happy Birthday to Mash D w DJ Teknikal Crysis.mp3                 20111022 - 125716- Live Mash D ElectroNight 58.json
20100804 - 194038- Live The Root Cause Episode 6 w Sean Merrell.json                      20111022 - 125716- Live Mash D ElectroNight 58.mp3
20100804 - 194038- Live The Root Cause Episode 6 w Sean Merrell.mp3                   20111023 - 083023- Live the WARM DIGITS session 23 10 2011.json
20100806 - 143853- Live Vale In The Mix.json                                  20111023 - 083023- Live the WARM DIGITS session 23 10 2011.mp3
20100806 - 143853- Live Vale In The Mix.mp3                               20111023 - 140207- Live SinCitySyndicate Weekly with DJ Neuro and more.json
20100820 - 082539- Live the blend technoterra 20 08 2010.json                         20111023 - 140207- Live SinCitySyndicate Weekly with DJ Neuro and more.mp3
20100820 - 082539- Live the blend technoterra 20 08 2010.mp3                          20111028 - 082053- Live THE BLEND TECHNOTERRA SOUNDSYSTEM GROSSETO.json
20100820 - 130339- Live Teknofunk with DJ Jakub.json                              20111028 - 082053- Live THE BLEND TECHNOTERRA SOUNDSYSTEM GROSSETO.mp3
20100820 - 130339- Live Teknofunk with DJ Jakub.mp3                           20111109 - 191241- Live DJ TAD Protronic 82.json
20100822 - 130709- Live Sin City Syndicate Weekly Radio Show with DJ Neuro.json               20111109 - 191241- Live DJ TAD Protronic 82.mp3
20100822 - 130709- Live Sin City Syndicate Weekly Radio Show with DJ Neuro.mp3                20111111 - 082931- Live T H E B L E N D TECHNOTERRA 11 11 2011.json
20100825 - 171323- Live DJ TAD Protronic 35.json                              20111111 - 082931- Live T H E B L E N D TECHNOTERRA 11 11 2011.mp3
20100825 - 171323- Live DJ TAD Protronic 35.mp3                               20111118 - 081539- Live the blend technoterra soundSystem 18 11 2011.json
20100825 - 182613- Live DJ TAD Protronic 35 part2.json                            20111118 - 081539- Live the blend technoterra soundSystem 18 11 2011.mp3
20100825 - 182613- Live DJ TAD Protronic 35 part2.mp3                             20111119 - 110217- Live Mash D ElectroNight 61.json
20100903 - 120346- Live teknofunk w dj jakub.json                             20111119 - 110217- Live Mash D ElectroNight 61.mp3
20100903 - 120346- Live teknofunk w dj jakub.mp3                              20111130 - 190203- Live DJ TAD Protronic 85.json
20100903 - 180646- Live Moments of Joy midnightmix september with Delune.json                 20111130 - 190203- Live DJ TAD Protronic 85.mp3
20100903 - 180646- Live Moments of Joy midnightmix september with Delune.mp3                  20111203 - 110427- Live Mash D ElectroNight 63.json
20100906 - 211022- Live The Red Bull Effect with Justin Styles.json                   20111203 - 110427- Live Mash D ElectroNight 63.mp3
20100906 - 211022- Live The Red Bull Effect with Justin Styles.mp3                    20111207 - 190005- Live DJ TAD Protronic 86.json
20100907 - 094240- Live brunch with betamaxDj dirty sexy house.json                   20111207 - 190005- Live DJ TAD Protronic 86.mp3
20100907 - 094240- Live brunch with betamaxDj dirty sexy house.mp3                    20111209 - 120025- Live Ciacomix from Germany Past Present Future.json
20100909 - 213427- Live Jake Encinas.json                                 20111209 - 120025- Live Ciacomix from Germany Past Present Future.mp3
20100909 - 213427- Live Jake Encinas.mp3                                  20111211 - 140012- Live Sin City Syndicate Weekly with DJ Neuro and Pyro Tech.json
20100910 - 083215- Live The blend technoterra 10 09 2010.json                         20111211 - 140012- Live Sin City Syndicate Weekly with DJ Neuro and Pyro Tech.mp3
20100910 - 083215- Live The blend technoterra 10 09 2010.mp3                          20111212 - 170405- Live The House of Styles Episode 001 with Sean Merrell.json
20100912 - 130057- Live SinCitySynidcate Weekly Radio Show and House Party with DJ Neuro.json         20111212 - 170405- Live The House of Styles Episode 001 with Sean Merrell.mp3
20100912 - 130057- Live SinCitySynidcate Weekly Radio Show and House Party with DJ Neuro.mp3          20111217 - 133152- Live Ciacomix Birthday Pary from Germany.json
20100917 - 103958- Live Sick Note aka Ganzak SmackHI chapter 35 LiVe IT.json                  20111217 - 133152- Live Ciacomix Birthday Pary from Germany.mp3
20100917 - 103958- Live Sick Note aka Ganzak SmackHI chapter 35 LiVe IT.mp3               20111225 - 140113- Live SinCitySyndicate Weekly Xmas 2012 Special with DJ Neuro.json
20100918 - 085344- Live ElectroNight 13 with Mash D from Germany.json                     20111225 - 140113- Live SinCitySyndicate Weekly Xmas 2012 Special with DJ Neuro.mp3
20100918 - 085344- Live ElectroNight 13 with Mash D from Germany.mp3                      20111231 - 211325- Live Welcome to 2012 w Gabriel Setis.json
20100924 - 091430- Live The blend technoterra 24 09 2010.json                         20111231 - 211325- Live Welcome to 2012 w Gabriel Setis.mp3
20100924 - 091430- Live The blend technoterra 24 09 2010.mp3                          20111231 - 211522- Live Welcome to 2012 w Gabriel Setis.json
20100929 - 155627- Live The RaveOlution Episode 44 w DJ Teknikal Crysis www teknikalcrysis com.json   20111231 - 211522- Live Welcome to 2012 w Gabriel Setis.mp3
20100929 - 155627- Live The RaveOlution Episode 44 w DJ Teknikal Crysis www teknikalcrysis com.mp3    20120106 - 080034- Live TECHNOTERRA SOUNDSYSTEM 06 JAN 2012.json
20100930 - 210654- Live Jake Encinas.json                                 20120106 - 080034- Live TECHNOTERRA SOUNDSYSTEM 06 JAN 2012.mp3
20100930 - 210654- Live Jake Encinas.mp3                                  20120109 - 091024- Live TonY Presents Club Beats http tonil arkku net.json
20101002 - 110139- Live Moments of Joy with Delune October 2010.json                      20120109 - 091024- Live TonY Presents Club Beats http tonil arkku net.mp3
20101002 - 110139- Live Moments of Joy with Delune October 2010.mp3                   20120115 - 140037- Live SinCitySyndicate Weekly with DJ Neuro and JusJoshua.json
20101008 - 082719- Live technoterra the blend 08 10 2010.json                         20120115 - 140037- Live SinCitySyndicate Weekly with DJ Neuro and JusJoshua.mp3
20101008 - 082719- Live technoterra the blend 08 10 2010.mp3                          20120118 - 110845- Live Mash D ElectroNight 65.json
20101008 - 104224- Live Sick Note aka Ganzak SmackHI chapter 37 LiVe IT.json                  20120118 - 110845- Live Mash D ElectroNight 65.mp3
20101008 - 104224- Live Sick Note aka Ganzak SmackHI chapter 37 LiVe IT.mp3               20120122 - 111942- Live Mash D in the mix recorded live at saltandpepper.json
20101008 - 151235- Live Moments of Joy with Delune Midnight Sessions 01.json                  20120122 - 111942- Live Mash D in the mix recorded live at saltandpepper.mp3
20101008 - 151235- Live Moments of Joy with Delune Midnight Sessions 01.mp3               20120122 - 140014- Live Neurology Weekly with DJ Neuro and Stephen Wiley.json
20101016 - 164845- Live The Bidness 064 August 2010 P2 Geoff Ledak.json                   20120122 - 140014- Live Neurology Weekly with DJ Neuro and Stephen Wiley.mp3
20101016 - 164845- Live The Bidness 064 August 2010 P2 Geoff Ledak.mp3                    20120128 - 070245- Live Moments of Joy Februari 2012.json
20101017 - 140551- Live SinCitySyndicate show w DJ Neuro Chris Smotherman.json                20120128 - 070245- Live Moments of Joy Februari 2012.mp3
20101017 - 140551- Live SinCitySyndicate show w DJ Neuro Chris Smotherman.mp3                 20120206 - 171205- Live The House of Styles with guest K Drive.json
20101022 - 144629- Live Moments of joy with Delune 221010.json                        20120206 - 171205- Live The House of Styles with guest K Drive.mp3
20101022 - 144629- Live Moments of joy with Delune 221010.mp3                         20120208 - 190541- Live DJ TAD Protronic Japan.json
20101025 - 203834- Live The Red Bull Effect with Justin Styles.json                   20120208 - 190541- Live DJ TAD Protronic Japan.mp3
20101025 - 203834- Live The Red Bull Effect with Justin Styles.mp3                    20120212 - 140140- Live Neurology Weekly with DJ Neuro.json
20101029 - 120853- Live Teknofunk w Killer Meph.json                              20120212 - 140140- Live Neurology Weekly with DJ Neuro.mp3
20101029 - 120853- Live Teknofunk w Killer Meph.mp3                           20120228 - 111803- Live Mash D ElectroNight 66.json
20101030 - 111456- Live Tribute to Melody 57 w Gabriel Setis.json                     20120228 - 111803- Live Mash D ElectroNight 66.mp3
20101030 - 111456- Live Tribute to Melody 57 w Gabriel Setis.mp3                      20120301 - 055715- Live Moments of Joy March edition 2012 with Delune.json
20101031 - 120000- Live Happy Halloween 2k10 from Mash D.json                         20120301 - 055715- Live Moments of Joy March edition 2012 with Delune.mp3
20101031 - 120000- Live Happy Halloween 2k10 from Mash D.mp3                          20120307 - 190222- Live DJ TAD Protronic 94.json
20101101 - 010756- Live MrHat Halloween.json                                  20120307 - 190222- Live DJ TAD Protronic 94.mp3
20101101 - 010756- Live MrHat Halloween.mp3                               20120312 - 101207- Live TonY Presents Club Beats http tonil arkku net.json
20101102 - 110703- Live Mash D playing House Music for you.json                       20120312 - 101207- Live TonY Presents Club Beats http tonil arkku net.mp3
20101102 - 110703- Live Mash D playing House Music for you.mp3                        20120316 - 085117- Live TECHNOTERRA SOUNDSYSTEM 16 02 2012.json
20101102 - 130633- Live betamaxdj extended tuesday disco tunez.json                   20120316 - 085117- Live TECHNOTERRA SOUNDSYSTEM 16 02 2012.mp3
20101102 - 130633- Live betamaxdj extended tuesday disco tunez.mp3                    20120325 - 140016- Live Neurology Weekly with DJ Neuro and Slinky.json
20101103 - 170811- Live DJ TAD Protronic 43.json                              20120325 - 140016- Live Neurology Weekly with DJ Neuro and Slinky.mp3
20101103 - 170811- Live DJ TAD Protronic 43.mp3                               20120330 - 073142- Live TECHNOTERRA SOUNDSYSTEM 30 03 2012.json
20101107 - 100841- Live HouseMouse EP3 w Delune and Mash D.json                       20120330 - 073142- Live TECHNOTERRA SOUNDSYSTEM 30 03 2012.mp3
20101107 - 100841- Live HouseMouse EP3 w Delune and Mash D.mp3                        20120408 - 121626- Live Ciacomix from Germany Pres Easter Egg Trance.json
20101110 - 171411- Live DJ TAD Protronic 44.json                              20120408 - 121626- Live Ciacomix from Germany Pres Easter Egg Trance.mp3
20101110 - 171411- Live DJ TAD Protronic 44.mp3                               20120408 - 140356- Live Neurology Weekly with DJ Neuro.json
20101113 - 200540- Live Jake Encinas from Minneapolis.json                        20120408 - 140356- Live Neurology Weekly with DJ Neuro.mp3
20101113 - 200540- Live Jake Encinas from Minneapolis.mp3                         20120409 - 170734- Live The House of Styles with Chico Brown.json
20101114 - 141941- Live SinCitySyndicate weekly w Ashley Power.json                   20120409 - 170734- Live The House of Styles with Chico Brown.mp3
20101114 - 141941- Live SinCitySyndicate weekly w Ashley Power.mp3                    20120413 - 075745- Live the Worldwide Session technoterra 13 04 2012.json
20101120 - 092726- Live DJ Mash D Trance Time EN 22.json                          20120413 - 075745- Live the Worldwide Session technoterra 13 04 2012.mp3
20101120 - 092726- Live DJ Mash D Trance Time EN 22.mp3                           20120422 - 140018- Live Neurology Weekly with DJ Neuro.json
20101120 - 114604- Live Tribute to Melody 59 w Gabriel Setis.json                     20120422 - 140018- Live Neurology Weekly with DJ Neuro.mp3
20101120 - 114604- Live Tribute to Melody 59 w Gabriel Setis.mp3                      20120515 - 073454- Live the warm digits session technoterra 15 may 2012.json
20101128 - 141605- Live DJ Neuro Weekly SCS Show.json                             20120515 - 073454- Live the warm digits session technoterra 15 may 2012.mp3
20101128 - 141605- Live DJ Neuro Weekly SCS Show.mp3                              20120523 - 110007- Live Mash D Kinda Random.json
20101201 - 224939- Live The Red Bull Effect with Justin Styles twitter com trahma.json            20120523 - 110007- Live Mash D Kinda Random.mp3
20101201 - 224939- Live The Red Bull Effect with Justin Styles twitter com trahma.mp3             20120525 - 105000- Live SickNote themes fron the black forest 25 05 2012.json
20101204 - 151025- Live Back on track Ciacomix from Germany.json                      20120525 - 105000- Live SickNote themes fron the black forest 25 05 2012.mp3
20101204 - 151025- Live Back on track Ciacomix from Germany.mp3                       20120531 - 235222- Live technoterra the blend first of june 12.json
20101207 - 110240- Live Redlight Music with Delune.json                           20120531 - 235222- Live technoterra the blend first of june 12.mp3
20101207 - 110240- Live Redlight Music with Delune.mp3                            20120601 - 140840- Live Moments of Joy June edition 2012 with Delune.json
20101207 - 220531- Live betamax from miami beach flashback afterhours.json                20120601 - 140840- Live Moments of Joy June edition 2012 with Delune.mp3
20101207 - 220531- Live betamax from miami beach flashback afterhours.mp3                 20120628 - 190425- Live DJ TAD Protronic 109.json
20101212 - 111854- Live Doktor Domi DJ presents Back To 2010.json                     20120628 - 190425- Live DJ TAD Protronic 109.mp3
20101212 - 111854- Live Doktor Domi DJ presents Back To 2010.mp3                      20120629 - 110733- Live sicknote themes from the black forest 29 june 2012.json
20101212 - 122803- Live Doktor Domi DJ presents Back To 2010.json                     20120629 - 110733- Live sicknote themes from the black forest 29 june 2012.mp3
20101212 - 122803- Live Doktor Domi DJ presents Back To 2010.mp3                      [Live!] Mash-D - Pumping Electro For Your Ears 002.json
20101212 - 140658- Live SCS Weekly Show w DJ Neuro and guest Shawn Frazier.json               [Live!] Mash-D - Pumping Electro For Your Ears 002.mp3
20101212 - 140658- Live SCS Weekly Show w DJ Neuro and guest Shawn Frazier.mp3                 - [Live!] Neurologly Weekly with NeuroUnderground.json
20101214 - 205623- Live betamaxDj live from miami happy fetivus to alll.json                   - [Live!] Neurologly Weekly with NeuroUnderground.mp3
20101214 - 205623- Live betamaxDj live from miami happy fetivus to alll.mp3                - [Live!] sick nota porngrooves Turin IT.json
20101217 - 105830- Live SickNote aka Ganzak SmackHI chapter 40 LiVe IT.json                - [Live!] sick nota porngrooves Turin IT.mp3
20101217 - 105830- Live SickNote aka Ganzak SmackHI chapter 40 LiVe IT.mp3                 - [Live!] The House of Styles with DJ Matty Matt.json
20101218 - 090421- Live ElectroNight 26 with Mash D from Germany.json                      - [Live!] The House of Styles with DJ Matty Matt.mp3
20101218 - 090421- Live ElectroNight 26 with Mash D from Germany.mp3                       - [Live!] The House of Styles with special guest Drewbinski.json
20101218 - 134853- Live LRCN Dark Planet Recorded Live at Coshland 2010 Halloween Part 1.json          - [Live!] The House of Styles with special guest Drewbinski.mp3
20101218 - 134853- Live LRCN Dark Planet Recorded Live at Coshland 2010 Halloween Part 1.mp3           - [Live!] The House of Styles with special guest Kyle_G.json
20101218 - 172008- Live Moments of joy with Delune 20101219.json                       - [Live!] The House of Styles with special guest Kyle_G.mp3
20101218 - 172008- Live Moments of joy with Delune 20101219.mp3                        - Misjah In The Mix 1110192.json
20101226 - 054733- Live LRCN ChillTemp01.json                                  - Misjah In The Mix 1110192.mp3

Modify tags of files

Is it possible to use the bash or Python interface to change the tags of a file, as opposed to those of an item?

For instance, I can

ia metadata [--modify=key:value...] identifier

to modify a tag of the item identifier, but can I do the same for, say, the title of a song within that item?

Search by URL in Wayback?

While trying out the wrapper I couldn't figure out how to search for specific URLs in Wayback. Is this possible at all?

I already tried figuring this out from the archive.org web interface. There this works (using simple search):

http://web.archive.org/web/*/http://www.projectmoonbase.com/

However if I try this using an advanced search:

https://archive.org/search.php?query=%28http%3A%2F%2Fwww.projectmoonbase.com%2F%29

this gives me 0 hits. Also, it's not clear to me what field name I should be using here.

So maybe this is just a limitation of the API of archive.org? Otherwise, if this is somehow possible, it would be helpful to add an example to the documentation, as this looks like a pretty obvious use case. Or maybe I'm just overlooking something obvious myself?

Update documentation

Docs need some updating. Based on some informal polling, external users are using the ia CLI tool a lot more than they are using the internetarchive Python module.

I would suggest putting the command-line usage at the top of the documentation, and documenting the Python interface at the bottom.

"Too many open files" when uploading 1020th file.

On file number 1020 uploaded in a single session, I'm getting an error. Reproduced twice now, both times on the 1020th file. I can start up again and upload additional files to the same item.

Here's the traceback; parts in double square brackets are generic placeholder values:

Traceback (most recent call last):
File "/usr/bin/ia", line 9, in
File "/usr/lib/python2.6/site-packages/internetarchive/iacli/ia.py", line 95, in main
File "/usr/lib/python2.6/site-packages/internetarchive/iacli/ia_upload.py", line 185, in main
File "/usr/lib/python2.6/site-packages/internetarchive/iacli/ia_upload.py", line 78, in _upload_files
File "/usr/lib/python2.6/site-packages/internetarchive/item.py", line 633, in upload
File "/usr/lib/python2.6/site-packages/internetarchive/item.py", line 528, in upload_file
File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 559, in send
File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 375, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s3.us.archive.org', port=80): Max retries exceeded with url: /[[item identifier]]/[[filename]](Caused by <class 'socket.error'>: [Errno 24] Too many open files)

TypeError: 'float' object cannot be interpreted as an integer in Search.py

In "def iter"...

"total_pages = ((self.num_found / self.params['rows']) + 2)" returns a Float. This causes a "'float' object cannot be interpreted as an integer' exception in the next line ("for page in range(1, total_pages)").

I believe total_pages should be the ceil() of the number found divided by the rows. If truncated to an integer, the last remaining items would not be returned.
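The ceiling calculation suggested above could be sketched as follows. This is only an illustration, not the library's code: the function names `total_pages` and `page_numbers` are hypothetical, and integer arithmetic is used instead of `math.ceil()` so that no float is ever produced.

```python
def total_pages(num_found, rows):
    """Return how many pages are needed to cover num_found hits at `rows` per page.

    Uses integer ceiling division, so the result is always an int
    (safe to pass to range() on Python 3).
    """
    if rows <= 0:
        raise ValueError("rows must be a positive integer")
    return (num_found + rows - 1) // rows


def page_numbers(num_found, rows):
    """Return the 1-based page numbers to request, including the last partial page."""
    return range(1, total_pages(num_found, rows) + 1)
```

With 101 hits and 50 rows per page this yields pages 1, 2 and 3, so the final, partially filled page is not dropped.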

Broken pipe uploading

Using "ia upload" I get a Broken Pipe error at the 2nd megabyte transferred. With tcpdump I see that the remote S3 host actually sends a TCP reset; ia retries with a 2nd connection, but it's also reset, and then the Python exception is dumped to stderr. The "-R" option (max number of retries) seems to have little effect.

I've tried regenerating the AWS keys, I ensured that the process has the env vars set with the key/secret, and --status-check reports all fine. Nevertheless, my uploads always get Broken Pipe.

This may be a transient error from S3, but it has happened to me several times in the past as well. Does anyone have an idea how to solve this? My solution until now has always been to use the web upload.

I'm surprised because the connection is dropped after almost 2 MB transferred.

Unicode in x-archive-meta header keys fails on upload

When I try to upload something through a UTF-8 encoded CSV with Chinese characters, I get the following error. I tried calling urllib.parse.quote on the value, with no success.

When I remove the unicode characters, it works fine.

Traceback (most recent call last):########## ] 1/2 - 00:00:00
File "/home/eitan/tmp/.amir/bin/ia", line 9, in
load_entry_point('internetarchive==0.7.1', 'console_scripts', 'ia')()
File "/home/eitan/tmp/ia-wrapper/internetarchive/iacli/ia.py", line 81, in main
ia_module.main(argv)
File "/home/eitan/tmp/ia-wrapper/internetarchive/iacli/ia_upload.py", line 173, in main
_upload_files(args, identifier, local_file, upload_kwargs, prev_identifier)
File "/home/eitan/tmp/ia-wrapper/internetarchive/iacli/ia_upload.py", line 76, in _upload_files
response = item.upload(files, **upload_kwargs)
File "/home/eitan/tmp/ia-wrapper/internetarchive/item.py", line 618, in upload
resp = self.upload_file(body, key=key, **kwargs)
File "/home/eitan/tmp/ia-wrapper/internetarchive/item.py", line 506, in upload_file
response = self.http_session.send(prepared_request, stream=True)
File "/home/eitan/tmp/.amir/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/home/eitan/tmp/.amir/lib/python2.7/site-packages/requests/adapters.py", line 375, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s3.us.archive.org', port=80): Max retries exceeded with url: /EB_00024s2011.jpg/00024s2011.jpg (Caused by <class 'socket.error'>: [Errno 32] Broken pipe)

upload_file() calculates checksums even if checksum isn't used.

I'm trying to upload a large file, and the extreme slowness before the uploader begins sending data suggests to me that it's either reading the whole file into memory, or at least reading over it once before beginning.

It seems like it might be a problem here, when it looks to calculate a size hint for the API:

body.seek(0, os.SEEK_END)
size = body.tell()
body.seek(0, os.SEEK_SET)

Python's os module has ways of fetching the size of a given file path that can read it directly from filesystem metadata, rather than seeking through the file.
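
As a rough sketch of that suggestion, the size can be read from filesystem metadata either by path or from an already-open file object. The helper name `file_size` is hypothetical and not part of the library:

```python
import os
import tempfile


def file_size(path_or_file):
    """Return the size in bytes from filesystem metadata, without reading the data."""
    if isinstance(path_or_file, (str, bytes, os.PathLike)):
        return os.path.getsize(path_or_file)
    # An open file object backed by a real file descriptor.
    return os.fstat(path_or_file.fileno()).st_size


# Small demonstration with a temporary 1 KiB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024)
    demo_path = tmp.name

size_from_path = file_size(demo_path)
with open(demo_path, "rb") as fh:
    size_from_handle = file_size(fh)
os.remove(demo_path)
```

Note that seek/tell does not itself read the file's contents, so the checksum pass described next is the more likely source of the delay.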

It also looks like a problem here, when an MD5 is calculated for the entire file:

md5_sum = utils.get_md5(body)

Providing a way to disable the MD5 check, or multipart uploads where an MD5 only has to be taken per-chunk, would help mitigate this.
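
A minimal sketch of both ideas, assuming the body is a seekable binary file object. The names `md5_in_chunks`, `maybe_md5`, and the `verify` flag are hypothetical, not existing library options:

```python
import hashlib
import io


def md5_in_chunks(fileobj, chunk_size=1024 * 1024):
    """Hash a file object incrementally, holding at most chunk_size bytes at a time."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def maybe_md5(fileobj, verify=True):
    """Return the MD5 hex digest, or None when the caller opts out of the check."""
    if not verify:
        return None
    checksum = md5_in_chunks(fileobj)
    fileobj.seek(0)  # rewind so the body can still be uploaded afterwards
    return checksum


body = io.BytesIO(b"hello world")
checksum = maybe_md5(body)
skipped = maybe_md5(body, verify=False)
```

This still reads the file once when the check is enabled, but memory use stays bounded by the chunk size.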

It looks like multipart uploads got ditched in the switch from boto to requests. Is restoring support for multipart uploads a planned feature?

[Enhancement] Check hash of uploaded files

It would be nice if there was a way for the script to check that the hash of the file actually received/stored by the item matches the local one. I know that this is not trivial because it would need to be async or something (waiting for archive.org to update metadata): maybe it should instead be a command to compare uploaded files to local copies and retry uploads/downloads when there are mistakes.

If I read https://github.com/jjjake/ia-wrapper/blob/a206321d3b0214b0789435da03ac52cde25e35ec/iacli/ia_upload.py#L84 correctly, you "only" check for a HTTP 200 code but all sorts of errors can happen.

Context: currently we use a very simplistic custom script (https://code.google.com/p/wikiteam/source/browse/trunk/uploader.py) which only checks for curl exit code. It sometimes happened to me today that instead of uploading two files it uploaded two copies of the same one, with two different filenames.
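
One way the comparison command could work is sketched below. It assumes the remote checksums have already been fetched into a plain dict mapping file name to MD5 hex digest; how that dict is obtained from archive.org is left out, and `files_to_retry` is a hypothetical helper:

```python
import hashlib
import os
import tempfile


def local_md5(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_retry(local_paths, remote_md5s):
    """Return local paths whose remote copy is missing or has a different MD5."""
    retry = []
    for path in local_paths:
        name = os.path.basename(path)
        if remote_md5s.get(name) != local_md5(path):
            retry.append(path)
    return retry


# Reproduce the failure described above: b.txt was stored with a.txt's content.
with tempfile.TemporaryDirectory() as tmpdir:
    path_a = os.path.join(tmpdir, "a.txt")
    path_b = os.path.join(tmpdir, "b.txt")
    with open(path_a, "wb") as fh:
        fh.write(b"first file")
    with open(path_b, "wb") as fh:
        fh.write(b"second file")
    remote = {"a.txt": local_md5(path_a), "b.txt": local_md5(path_a)}
    retry = [os.path.basename(p) for p in files_to_retry([path_a, path_b], remote)]
```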

Get list of fields/columns

There doesn't seem to be any way to get the names of fields for search?
Or any way to get columns for list?

REMOVE_TAG not working in modify_metadata()

REMOVE_TAG doesn't seem to be working:

In [9]: md = dict(new_key='REMOVE_TAG')

In [10]: item.modify_metadata(md)
Out[10]: 
{'content': {u'error': u'no changes to _meta.xml', u'success': False},
 'status_code': 400}

How to deal with Unicode filenames?

Upon encountering a filename that contains Unicode(?) characters, the following occurs:

Traceback (most recent call last):
  File "./mirror.py", line 67, in <module>
    if item.upload([(real_filename, source_file)], metadata=metadata, access_key=conf["internetarchive"]["accesskey"], secret_key=conf["internetarchive"]["secretkey"]):
  File "/var/sites/pdfhost/internetarchive/item.py", line 544, in upload
    resp = self.upload_file(body, key=key, **kwargs)
  File "/var/sites/pdfhost/internetarchive/item.py", line 419, in upload_file
    url = '{base_url}/{key}'.format(base_url=base_url, key=key)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-14: ordinal not in range(128)

Evidently, it doesn't like these characters. What is the correct way of dealing with this, that doesn't upset either the library or the IA HTTPd? Is it possible at all to have non-ASCII characters in filenames on IA?

For reference, this is the offending document (see the filename in the sidebar).
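
A possible workaround, sketched for Python 3, is to percent-encode the key's UTF-8 bytes before building the URL. The URL layout mirrors the '{base_url}/{key}' format call in the traceback; `build_upload_url` and the item name are hypothetical, and whether the server accepts such keys is not confirmed here:

```python
from urllib.parse import quote, unquote


def build_upload_url(base_url, key):
    """Join base_url and key, percent-encoding the key's UTF-8 bytes."""
    return "{}/{}".format(base_url.rstrip("/"), quote(key, safe="/"))


# "example-item" and the file name are placeholders for illustration.
key = "résumé_日本語.pdf"
url = build_upload_url("http://s3.us.archive.org/example-item", key)
ascii_bytes = url.encode("ascii")  # succeeds because the URL is now pure ASCII
```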

ia mine --output creates output file with only the first letter of specified filename

jake:
i love the effort you've put into this.

i was playing with "ia mine" and per the documentation was attempting to use the "--output filename" feature. the result is a file output whose filename is "f".

not a big deal in that i can use i/o redirection to create the proper filename:
"ia mine itemlist.txt > eastoncat_metadata.json" instead of
"ia mine itemlist.txt --output eastoncat_metadata.json"

just thought you should know about this wrinkle...

macbook air with python 2.7.1
ia -version
0.4.4

keep up the great work!
john hauser

Configuration fails on Debian stable

$ cat /etc/debian_version 
8.1
$ ia configure
Please enter your Archive.org credentials below to have your
Archive.org cookies and IA-S3 keys added to your config file.

Traceback (most recent call last):
  File "/usr/local/bin/ia", line 9, in <module>
    load_entry_point('internetarchive==0.8.5', 'console_scripts', 'ia')()
  File "/usr/local/lib/python3.4/dist-packages/internetarchive/iacli/ia.py", line 95, in main
    ia_module.main(argv)
  File "/usr/local/lib/python3.4/dist-packages/internetarchive/iacli/ia_configure.py", line 28, in main
    username=raw_input('Email address: '),
NameError: name 'raw_input' is not defined

If I change raw_input to input:

$ ia configure
Please enter your Archive.org credentials below to have your                    
Archive.org cookies and IA-S3 keys added to your config file.

Email address: my@email
Password: 
Traceback (most recent call last):
  File "/usr/local/bin/ia", line 9, in <module>
    load_entry_point('internetarchive==0.8.5', 'console_scripts', 'ia')()
  File "/usr/local/lib/python3.4/dist-packages/internetarchive/iacli/ia.py", line 95, in main
    ia_module.main(argv)
  File "/usr/local/lib/python3.4/dist-packages/internetarchive/iacli/ia_configure.py", line 54, in main
    fp.write(configfile)
TypeError: 'str' does not support the buffer interface
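
Both errors are Python 2/3 differences: `raw_input` was renamed to `input` in Python 3, and the TypeError is what Python 3 raises when a str is written to a file opened in binary mode. A minimal sketch of a version-neutral approach follows; `read_line`, `write_config`, and the config text are hypothetical and do not reflect the actual ia_configure.py code:

```python
import io
import os
import shutil
import tempfile

try:
    read_line = raw_input  # Python 2
except NameError:
    read_line = input  # Python 3


def write_config(path, text):
    """Write text in text mode (unicode on Python 2, str on Python 3)."""
    with io.open(path, "w", encoding="utf-8") as fp:
        fp.write(text)


# Round-trip a placeholder config through a temporary directory.
tmp_dir = tempfile.mkdtemp()
config_path = os.path.join(tmp_dir, "example.ini")
config_text = u"[example]\nkey = value\n"
write_config(config_path, config_text)
with io.open(config_path, "r", encoding="utf-8") as fp:
    read_back = fp.read()
shutil.rmtree(tmp_dir)
```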

Ship a 1.0 release

I've just created a 1.0 branch.

I'd like to improve the test coverage, docs, and freeze the Python API and CLI before shipping. I would also like to improve Python 3 compatibility.

Suggestions, contributions, and questions are very welcome!

Out of memory when uploading files

Using version 0.4.4, I am trying to upload some files but it gets killed for running out of memory:

ia upload webtv_fire_grab webtv.megawarc.json.gz webtv.megawarc.tar webtv.megawarc.warc.gz --metadata="title:MSN TV / WebTV Community Web Pages" --metadata="collection:texts" --metadata="creator:webtv.net" --metadata="description:This is a partial grab of webtv.net community web pages before the service was shutdown on 2013-09-30." --verbose
getting item: webtv_fire_grab
 uploading file: webtv.megawarc.json.gz
 uploading file: webtv.megawarc.tar
Killed

syslog:

Oct 23 20:02:58 localhost kernel: [3453014.035781] Out of memory: Kill process 6695 (ia) score 424 or sacrifice child
Oct 23 20:02:58 localhost kernel: [3453014.036244] Killed process 6695 (ia) total-vm:2159244kB, anon-rss:1091448kB, file-rss:4kB

ls -hal:

total 2.1G
drwxr-xr-x 2 archiveteam archiveteam 4.0K Oct 23 19:18 .
drwxr-xr-x 6 archiveteam archiveteam 4.0K Oct 23 19:16 ..
-rw-r--r-- 1 archiveteam archiveteam 1.7M Oct 23 19:24 webtv.megawarc.json.gz
-rw-r--r-- 1 archiveteam archiveteam    0 Oct 23 19:18 webtv.megawarc.tar
-rw-r--r-- 1 archiveteam archiveteam 2.0G Oct 23 19:24 webtv.megawarc.warc.gz
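
If the uploader reads each file fully into memory, a 2.0G file would explain the kill. One way to keep memory bounded is to hand the HTTP client a generator of fixed-size chunks instead of the whole body. The sketch below shows only the generator; `iter_chunks` is a hypothetical helper, and whether the upload endpoint accepts a chunked request body is an assumption:

```python
import io


def iter_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield successive chunks so only chunk_size bytes are held at a time."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```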

`ia download item` tries to download files twice

Happens with both git master and version 0.4.3 from PyPI. Did not happen with version 0.4.1.

$ ia --version
0.4.3

$ ia download foot090
downloading: foot090_rules.conf
downloading: foot090_large.jpg
downloading: foot090.jpg
downloading: foot090_07-nienvox-transparent_moods_boogie.mp3
downloading: foot090_01-nienvox-sign.mp3
downloading: foot090_03-nienvox-inside_part_2.mp3
downloading: foot090_02-nienvox-structure_of_game.mp3
downloading: foot090_meta.xml
downloading: foot090_05-nienvox-water_formation.mp3
downloading: foot090_06-nienvox-crosslights.mp3
downloading: foot090_04-nienvox-imprint.mp3
downloading: foot090_reviews.xml
downloading: foot090_files.xml
downloading: foot090_rules.conf
Traceback (most recent call last):
  File "/Users/rkumar/dev/pyenvs/kitchensink/bin/ia", line 8, in <module>
    load_entry_point('internetarchive==0.4.3', 'console_scripts', 'ia')()
  File "/Users/rkumar/dev/pyenvs/kitchensink/lib/python2.7/site-packages/iacli/ia.py", line 91, in main
    ia_module.main(argv)
  File "/Users/rkumar/dev/pyenvs/kitchensink/lib/python2.7/site-packages/iacli/ia_download.py", line 74, in main
    ignore_existing=args['--ignore-existing'])
  File "/Users/rkumar/dev/pyenvs/kitchensink/lib/python2.7/site-packages/internetarchive/item.py", line 194, in download
    f.download(path, ignore_existing=ignore_existing)
  File "/Users/rkumar/dev/pyenvs/kitchensink/lib/python2.7/site-packages/internetarchive/item.py", line 442, in download
    raise IOError('File already exists: {0}'.format(file_path))
IOError: File already exists: foot090/foot090_rules.conf

Also, we need a test case that catches this.

Request: how to search by multiple parameters

Thanks for this code!

One request: is it possible, and if so how can one search using multiple parameters? For example, if I want to search by the word python in the collection gutenberg and sort by downloads.
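
As an illustration of the query being asked for, the terms can be joined with AND in the query string and the sort passed as a separate parameter. The sketch below only builds a URL and does not call the library. The endpoint and the parameter names (`q`, `fl[]`, `sort[]`, `rows`, `output`) are assumptions about archive.org's advanced search interface, and `build_search_url` is a hypothetical helper:

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(terms, sort=None, fields=("identifier",), rows=50):
    """Combine several search terms with AND and add an optional sort order."""
    query = " AND ".join(terms)
    params = [("q", query), ("rows", str(rows)), ("output", "json")]
    params += [("fl[]", field) for field in fields]
    if sort:
        params.append(("sort[]", sort))
    # Assumed endpoint; not confirmed by this thread.
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url(["python", "collection:gutenberg"], sort="downloads desc")
parsed = parse_qs(urlparse(url).query)
```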

ValueError

I get the following error:

$ ia download AssemblyField-VaCompilation2 --glob='*ogg'

Traceback (most recent call last):
File "/usr/local/bin/ia", line 9, in <module>
load_entry_point('internetarchive==0.4.7', 'console_scripts', 'ia')()
File "/usr/local/lib/python2.7/dist-packages/iacli/ia.py", line 89, in main
ia_module.main(argv)
File "/usr/local/lib/python2.7/dist-packages/iacli/ia_download.py", line 75, in main
dry_run=args['--dry-run'], ignore_existing=args['--ignore-existing'])
File "/usr/local/lib/python2.7/dist-packages/internetarchive/item.py", line 209, in download
files = [f for f in files if fnmatch(f.name, glob_pattern)]
File "/usr/local/lib/python2.7/dist-packages/internetarchive/item.py", line 137, in files
file = File(self, file_dict)
File "/usr/local/lib/python2.7/dist-packages/internetarchive/item.py", line 447, in __init__
self.length = float(file_dict.get('length')) if file_dict.get('length') else None
ValueError: invalid literal for float(): 07:12
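
The failure comes from a length stored as a clock time ("07:12") rather than as seconds. A tolerant parser might look like the sketch below; `parse_length` is a hypothetical helper, and treating the colon form as [HH:]MM:SS is an assumption about the metadata:

```python
def parse_length(value):
    """Convert '432.5', '07:12' or '1:02:03' to float seconds; None if empty."""
    if not value:
        return None
    try:
        return float(value)
    except ValueError:
        pass
    # Fall back to colon-separated clock time, most significant part first.
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds
```

A value that is neither a number nor a clock time still raises ValueError, so callers can decide whether to skip the file or fail.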

When uploading to someone else's Item, doesn't handle the 403

Let's take the item archive.org/details/asd as an example. This is not mine.
If I run ia upload asd myfile.ogg, it responds like this:

asd:
Traceback (most recent call last):
[SNIP]
  File "/home/gordo/sw/internetarchive_ve/lib/python2.7/site-packages/requests/adapters.py", line 382, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s3.us.archive.org', port=80): Max retries exceeded with url: /asd/myfile.ogg (Caused by <class 'socket.error'>: [Errno 32] Broken pipe)
ia upload asd myfile.ogg  0,26s user 0,02s system 1% cpu 25,469 total

That's a long time for such a simple error; plus, the error does not make clear to the user what happened. I'm going to submit a patch for this.

Add test cases

We need a few more test cases:

  • Test the ia command line. Right now we only test the python interface.
  • Add a script to test uploading, outside of the py.test unit testing framework.

Uploader is so close to working under Python 3

I've been trying to get ~40GB of government reports into the Archive, as detailed in unitedstates/inspectors-general#63. It's a Python 3 project, so I've been running up against some hurdles.

  1. Annoying, but not a dealbreaker, is that I can't use the [speedups] tag in my requirements.txt, because gevent is not compatible with Python 3. I see you're working on a refactor to use asyncio, which I 👍'd in #75 (comment) and would be very helpful.
  2. A bigger problem is that the uploader and request object do some unicode encoding/decoding stuff that doesn't exist in Python 3 (because all strings are unicode strings) and depend on __iter__ not being defined on str (in Python 3, it is).

I resolved item 2 with a surgical but ugly hack, which you can see in this branch compare, and so now I'm working off of my fork for the time being.

To make the uploader work under Python 3 and still work under Python 2 will require a bit more work than my hack, but it should be possible.

Perplexing login error output

Using the following command:

ia upload adamkennedyafewofmyfavouritethings \
~/Desktop/Adam_Kennedy-A_Few_of_My_Favourite_Things.flv \
~/Desktop/Adam_Kennedy-A_Few_of_My_Favourite_Things.pdf \
--metadata="title:Adam Kennedy: A Few of My Favourite Things" \
--metadata="subject:perl,computer languages, computer programming" \
--metadata="collection:sfperlmongers" \
--metadata="description:TBD" \
--metadata="mediatype:movies" \
--metadata="date:2013-09-10" \
--metadata="licenseurl:http://creativecommons.org/licenses/by-nc-sa/3.0/"

(see? per #19 it works so much better if you actually don't fubar your metadata args)

AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are both set and confirmed against the keys on my IA account.

The error message received:

Traceback (most recent call last):
  File "/usr/local/bin/ia", line 9, in <module>
    load_entry_point('internetarchive==0.3.4', 'console_scripts', 'ia')()
  File "/Library/Python/2.7/site-packages/iacli/ia.py", line 61, in main
    ia_module.main(argv)
  File "/Library/Python/2.7/site-packages/iacli/ia_upload.py", line 47, in main
    ignore_bucket=args['--ignore-bucket'])
  File "/Library/Python/2.7/site-packages/internetarchive/internetarchive.py", line 371, in upload
    response = self.upload_file(local_file, **kwargs)
  File "/Library/Python/2.7/site-packages/internetarchive/internetarchive.py", line 307, in upload_file
    self.s3_connection = ias3.connect()
  File "/Library/Python/2.7/site-packages/internetarchive/ias3.py", line 16, in connect
    calling_format=OrdinaryCallingFormat())
  File "/Library/Python/2.7/site-packages/boto/s3/connection.py", line 174, in __init__
    validate_certs=validate_certs)
  File "/Library/Python/2.7/site-packages/boto/connection.py", line 554, in __init__
    host, config, self.provider, self._required_auth_capability())
  File "/Library/Python/2.7/site-packages/boto/auth.py", line 687, in get_auth_handler
    'Check your credentials' % (len(names), str(names)))
boto.exception.NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV1Handler'] Check your credentials

So is the problem that there's nothing listening at the other end ("No handler was ready to authenticate") or that my credentials are incorrect ("Check your credentials")? As a user, it's difficult to tell.

Ideally this sort of exception would be trapped and handled more gracefully, giving the user guidance on how to proceed from there ("No servers available. Please try again later." or "Authorization denied: incorrect credentials").
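
A sketch of that kind of up-front check, assuming the credentials are expected in the two environment variables named above. `CredentialsError` and `check_credentials` are hypothetical names, not part of the library or of boto:

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


class CredentialsError(Exception):
    """Raised with a user-facing message when credentials are not available."""


def check_credentials(env=None):
    """Raise CredentialsError naming any missing or empty credential variable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise CredentialsError(
            "Missing credentials: {}. Set them and try again.".format(", ".join(missing))
        )


# Demonstration with explicit dicts instead of the real environment.
try:
    check_credentials({"AWS_ACCESS_KEY_ID": "example"})
    message = None
except CredentialsError as exc:
    message = str(exc)

check_credentials({"AWS_ACCESS_KEY_ID": "example", "AWS_SECRET_ACCESS_KEY": "example"})
```

Running a check like this before connecting separates "no credentials were found" from server-side failures, which the boto message conflates.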

pip install fails on OS X 10.9.1

Running setup.py install for PyYAML
    checking if libyaml is compilable
    cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c build/temp.macosx-10.9-intel-2.7/check_libyaml.c -o build/temp.macosx-10.9-intel-2.7/check_libyaml.o
    clang: warning: argument unused during compilation: '-mno-fused-madd'
    build/temp.macosx-10.9-intel-2.7/check_libyaml.c:2:10: fatal error: 'yaml.h' file not found
    #include <yaml.h>
             ^
    1 error generated.

    libyaml is not found or a compiler error: forcing --without-libyaml
    (if libyaml is installed correctly, you may need to
     specify the option --include-dirs or uncomment and
     modify the parameter include_dirs in setup.cfg)

metadata variables not being defaulted null.

import json

import internetarchive as ia  # assumed import; the original snippet only shows `ia`

a = "This is a test file"
filename = "test_file_abc"

def upload_to_IA():
    """Upload book to IA with appropriate meta-data."""
    global filename, a
    # The global holds the item identifier here; it is reused for the local path below.
    item = ia.get_item(filename)
    metadata = dict(
        description = a)
    filename = "./John Lyde Wilson - Code of Honor.page-001.jpeg"
    # Read the S3 keys from a local settings file.
    with open('../settings.json') as json_data:
        settings = json.load(json_data)
    S3_access_key = settings['ia']['S3_access_key']
    S3_secret_key = settings['ia']['S3_secret_key']
    status = item.upload(filename, access_key = S3_access_key, secret_key = S3_secret_key, metadata=metadata)
    return status

#1st upload
upload_to_IA()

#2nd upload, this time with an empty description
a=""
filename="test_file_bca"
upload_to_IA()

Suppose we upload a file with a non-empty description ("this is a description") and then upload another file with an empty description ("" or None). The internetarchive module does not overwrite the previous description value ("this is a description.") with the empty one.

OverflowError when uploading large file

$ ~/.local/bin/ia upload "2014-03-06 grab of investigator.org.ua" investigator.org.ua.warc.gz investigator.org.ua.cdx    
2014-03-06 grab of investigator.org.ua:
Traceback (most recent call last):
  File "/home/ateam/.local/bin/ia", line 9, in <module>
    load_entry_point('internetarchive==0.5.7', 'console_scripts', 'ia')()
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/iacli/ia.py", line 81, in main
    ia_module.main(argv)
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/iacli/ia_upload.py", line 125, in main
    _upload_files(args, args['<identifier>'], local_file, upload_kwargs)
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/iacli/ia_upload.py", line 54, in _upload_files
    response = item.upload(local_file, **upload_kwargs)
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/item.py", line 510, in upload
    resp = self.upload_file(f, **kwargs)
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/item.py", line 437, in upload_file
    prepared_request = request.prepare()
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/iarequest.py", line 42, in prepare
    queue_derive=self.queue_derive,
  File "/home/ateam/.local/lib/python2.7/site-packages/internetarchive/iarequest.py", line 62, in prepare
    self.prepare_body(data, files)
  File "/home/ateam/.local/lib/python2.7/site-packages/requests/models.py", line 410, in prepare_body
    length = super_len(data)
  File "/home/ateam/.local/lib/python2.7/site-packages/requests/utils.py", line 50, in super_len
    return len(o)
OverflowError: long int too large to convert to int

$ ls -l ~/.local/lib/python2.7/site-packages/ | grep -i internetar
drwxrwxr-x 3 ateam ateam  4096 Apr 24 01:41 internetarchive
drwxrwxr-x 2 ateam ateam  4096 Apr 24 01:41 internetarchive-0.5.7.egg-info

$ python --version
Python 2.7.3

$ ls -l investigator.org.ua.warc.gz investigator.org.ua.cdx
-rw-rw-r-- 1 ateam ateam   84165932 Mar  6 02:45 investigator.org.ua.cdx
-rw-rw-r-- 1 ateam ateam 8974369559 Mar  6 02:45 investigator.org.ua.warc.gz
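
The traceback ends in `len(o)`, and `len()` can only return values that fit in a platform-sized index. If this Python is a 32-bit build (an assumption; the report does not say), the 8974369559-byte file is over that limit. The sketch below reproduces the same class of error on any build by exceeding `sys.maxsize`; `HugeBody` is a hypothetical stand-in for the request body:

```python
import sys


class HugeBody(object):
    """Stand-in for a request body whose __len__ reports a very large size."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size


FILE_SIZE = 8974369559  # size of investigator.org.ua.warc.gz from the listing above
MAX_32BIT = 2 ** 31 - 1  # largest value len() can return on a 32-bit build

exceeds_32bit = FILE_SIZE > MAX_32BIT

try:
    len(HugeBody(sys.maxsize + 1))
    overflowed = False
except OverflowError:
    overflowed = True
```

Reading the size from `os.fstat()` instead of `len()` avoids the limit, since it returns an ordinary integer.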

Handle requests.exceptions.ConnectionError: HTTPConnectionPool

Maybe it's my fault or my machine/whatever, but this exception seems weird:

Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python2.7/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python2.7/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python2.7/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python2.7/dist-packages/apport/packaging_impl.py", line 20, in <module>
    import apt
  File "/usr/lib/python2.7/dist-packages/apt/__init__.py", line 34, in <module>
    apt_pkg.init_config()
SystemError: E:Opening configuration file /etc/apt/apt.conf.d/15pax-mark - ifstream::ifstream (13: Permission denied)

Original exception was:
Traceback (most recent call last):
  File "uploader.py", line 282, in <module>
    main()
  File "uploader.py", line 279, in main
    upload(wikis, config)
  File "uploader.py", line 114, in upload
    item = get_item('wiki-' + wikiname)
  File "/home/users/federico/.local/lib/python2.7/site-packages/internetarchive/api.py", line 26, in get_item
    return item.Item(identifier, metadata_timeout, config, max_retries, archive_session)
  File "/home/users/federico/.local/lib/python2.7/site-packages/internetarchive/item.py", line 99, in __init__
    self._json = self.get_metadata(metadata_timeout)
  File "/home/users/federico/.local/lib/python2.7/site-packages/internetarchive/item.py", line 123, in get_metadata
    resp = self.http_session.get(url, timeout=metadata_timeout)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 468, in get
    return self.request('GET', url, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/adapters.py", line 375, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='archive.org', port=80): Max retries exceeded with url: /metadata/wiki-wikizaaksysteemnl (Caused by <class 'socket.error'>: [Errno 111] Connection refused)
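
One way to handle this is to retry the call a few times with a growing delay and only then surface the error. The sketch below is generic and does not use requests; `call_with_retries` is a hypothetical helper. With requests, the caller would pass `retry_on=(requests.exceptions.ConnectionError,)`, because that class is not a subclass of the built-in ConnectionError used as the default here:

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, sleep=time.sleep,
                      retry_on=(ConnectionError,)):
    """Call func(); on a retry_on exception wait delay, 2*delay, ... and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the original error
            sleep(delay * 2 ** (attempt - 1))


# Demonstration: fail twice, then succeed; record waits instead of sleeping.
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection refused")
    return "ok"


result = call_with_retries(flaky, sleep=waits.append)
```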

CLI tool not found, but module installed.

I installed the wrapper with:
$ sudo pip install internetarchive

It says the module installed and I can access it from within a python program, but upon running
$ ia --help
I get "-bash: ia: command not found"

Python version - Python 2.6.6
pip version - pip 1.5.5
OS - Debian 6.0.9 (64bit) - kernel 2.6.38.2
