iacoll's Introduction

iacoll

iacoll allows you to regularly harvest the item metadata in an Internet Archive collection and store it in a LevelDB database. The database is a key/value store where the key is the unique Internet Archive item identifier, and the value is the JSON for the item metadata. Having the data stored this way allows it to be easily kept up to date.
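
The key/value layout described above can be sketched in a few lines of Python. This is only an illustration, not iacoll's actual code: a plain dict stands in for the LevelDB database, and `put_item` and `get_item` are hypothetical helper names. It shows why the layout is easy to keep current: writing an item under its identifier simply replaces the older copy.

```python
import json

def put_item(db, item):
    """Store one item's metadata under its identifier, replacing any older copy."""
    key = item["identifier"].encode("utf-8")
    db[key] = json.dumps(item, sort_keys=True).encode("utf-8")

def get_item(db, identifier):
    """Return the decoded metadata for an identifier, or None if it is not stored."""
    raw = db.get(identifier.encode("utf-8"))
    return None if raw is None else json.loads(raw)

# A dict is an assumption standing in for LevelDB (bytes keys, bytes values).
db = {}
put_item(db, {"identifier": "example_item", "title": "First title"})
put_item(db, {"identifier": "example_item", "title": "Corrected title"})

print(len(db))
print(get_item(db, "example_item")["title"])
```

Because the identifier is the key, harvesting the same item twice leaves a single, updated record rather than a duplicate.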

Maybe it should use a JSON aware database instead, so the metadata itself can be queried. If you think so and have opinions about what database to use please let me know.

Install

To install iacoll you'll first need to install Python and LevelDB.

% brew install python3 leveldb
% pip install iacoll

Usage

Here's an example of using iacoll to harvest the metadata for items in the University of Maryland's collection, university_maryland_cp:

% iacoll university_maryland_cp

By default iacoll creates the LevelDB database in a directory named after the collection's identifier. To use a different location, pass it explicitly with --db:

% iacoll university_maryland_cp --db /path/to/my/leveldb/database

When you run iacoll repeatedly it will look at the database and only fetch newer records. If an update ever fails you may want to force a full scan:

% iacoll university_maryland_cp --fullscan
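
This README doesn't say how iacoll decides which records are newer. One plausible approach, sketched here purely as an assumption, is to find the latest timestamp already stored and fetch only records updated after it, with a full scan ignoring that cutoff. The `updated` field, the dict standing in for LevelDB, and both function names are hypothetical.

```python
import json

def latest_update(db):
    """Return the newest 'updated' timestamp stored in db, or None if it is empty."""
    stamps = [json.loads(value)["updated"] for value in db.values()]
    return max(stamps) if stamps else None

def records_to_fetch(remote_records, db, fullscan=False):
    """Pick the remote records to (re)fetch.

    ISO 8601 timestamps in one consistent format sort correctly as strings,
    so a plain string comparison is enough here.
    """
    cutoff = None if fullscan else latest_update(db)
    if cutoff is None:
        return list(remote_records)
    return [r for r in remote_records if r["updated"] > cutoff]

# Hypothetical data: one item already harvested, two listed remotely.
db = {
    b"item_a": json.dumps(
        {"identifier": "item_a", "updated": "2020-01-01T00:00:00Z"}
    ).encode("utf-8")
}
remote = [
    {"identifier": "item_a", "updated": "2020-01-01T00:00:00Z"},
    {"identifier": "item_b", "updated": "2021-06-15T12:00:00Z"},
]

incremental = records_to_fetch(remote, db)
full = records_to_fetch(remote, db, fullscan=True)
```

In this sketch a failed update could leave a gap that the cutoff hides, which is the kind of situation where forcing a full scan helps.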

If you would like to dump the metadata as line-oriented JSON, use --dump:

% iacoll university_maryland_cp --dump > university_maryland_cp.jsonl
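
A line-oriented JSON file like this can be read back one record at a time. The sketch below assumes each non-empty line is a single JSON object. `read_jsonl` is a hypothetical helper, and the sample lines and their fields are made up for illustration.

```python
import json

def read_jsonl(lines):
    """Yield one decoded JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Made-up sample lines standing in for the contents of a dump file.
sample = [
    '{"identifier": "item_a", "title": "A"}\n',
    "\n",
    '{"identifier": "item_b", "title": "B"}\n',
]
items = list(read_jsonl(sample))
print(len(items))
```

With a real dump, pass an open file instead of the sample list, for example `with open("university_maryland_cp.jsonl") as fh: items = list(read_jsonl(fh))`.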

iacoll's Issues

iacoll README doesn't describe the requirements for API credentials

Readers would benefit from seeing the requirement for IA API credentials before running the app. Describing the requirement, and how iacoll does or does not manage the credentials, in README.md seems like a good solution.

suggestion

iacoll is neat, thanks, i think i will use it. i maintain myself a small collection on IA and i have the need to keep a copy of metadata (and derived data too).

at the beginning i used https://github.com/atomotic/iafc to ingest everything into a fedora repository running at my home server. after a while i was struggling at keeping the server running, so i gave up.

now i'm using https://caltechlibrary.github.io/dataset/, not really a json database but just as powerful: raw json are saved into a pairtree. notable features: filtering on json path, csv export, full-text search with blevesearch (not yet tried).

this is how i'm using it:

# install and init the repo
go get -u -v github.com/caltechlibrary/dataset/...
dataset init grafton9.ds

# ingest all items of a collection
parallel 'dataset create grafton9.ds {} "$(ia metadata {})"' ::: $(ia search -i collection:grafton9)

# get the items, filtering
dataset keys grafton9.ds
dataset keys grafton9.ds '(eq .metadata.date "1995")'

the update logic is very inefficient, i rely on the fact that my metadata never changes :)

edsu / iacoll