iacoll

iacoll allows you to regularly harvest the item metadata in an Internet Archive collection and store it in a LevelDB database. The database is a key/value store where the key is the unique Internet Archive item identifier, and the value is the JSON for the item metadata. Having the data stored this way allows it to be easily kept up to date.

Maybe it should use a JSON aware database instead, so the metadata itself can be queried. If you think so and have opinions about what database to use please let me know.
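To make the trade-off concrete, here is a minimal sketch of the storage model described above. It uses a plain Python dict as a stand-in for LevelDB, which is an assumption for illustration and not iacoll's actual code. The identifiers, field names, and values are made up.

```python
import json

# Stand-in for the LevelDB database: bytes keys and bytes values,
# as a key/value store would hold them. All records are hypothetical.
db = {
    b"example_item_001": json.dumps({"title": "First item", "date": "1995"}).encode("utf-8"),
    b"example_item_002": json.dumps({"title": "Second item", "date": "2001"}).encode("utf-8"),
}

def get_item(db, identifier):
    """Direct lookup by item identifier: a single key read."""
    raw = db.get(identifier.encode("utf-8"))
    return json.loads(raw) if raw is not None else None

def find_by_field(db, field, value):
    """Querying on metadata means decoding and checking every value."""
    matches = []
    for key, raw in db.items():
        record = json.loads(raw)
        if record.get(field) == value:
            matches.append(key.decode("utf-8"))
    return sorted(matches)

print(get_item(db, "example_item_001")["title"])
print(find_by_field(db, "date", "1995"))
```

Lookup by identifier is cheap, while any question about the metadata itself requires a full scan. That scan is the cost a JSON-aware database would be expected to reduce.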

Install

To install iacoll you'll first need to install Python and LevelDB.

% brew install python3 leveldb
% pip install iacoll

Usage

Here's an example of using iacoll to harvest the metadata for items in the University of Maryland's university_maryland_cp collection:

% iacoll university_maryland_cp 

By default iacoll will create the LevelDB database in a directory named after the collection identifier. If you would like to control this you can pass a path explicitly with --db:

% iacoll university_maryland_cp --db /path/to/my/leveldb/database

When you run iacoll repeatedly it will look at the database and only fetch newer records. If an update ever fails you may want to force a full scan:

% iacoll university_maryland_cp --fullscan
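
One way to picture the difference between an incremental run and a full scan is the sketch below. It is not iacoll's implementation: the `updated` field, the dict standing in for the database, and the `harvest` helper are all assumptions made for illustration.

```python
def latest_timestamp(db):
    """Return the newest 'updated' value stored, or None if db is empty."""
    return max((rec["updated"] for rec in db.values()), default=None)

def harvest(db, remote_records, fullscan=False):
    """Store remote records in db and return the identifiers written.

    Records not newer than the latest stored timestamp are skipped,
    unless fullscan is requested.
    """
    cutoff = None if fullscan else latest_timestamp(db)
    written = []
    for rec in remote_records:
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if cutoff is not None and rec["updated"] <= cutoff:
            continue
        db[rec["identifier"]] = rec
        written.append(rec["identifier"])
    return written

# Hypothetical records standing in for what the remote collection returns.
db = {}
remote = [
    {"identifier": "a", "updated": "2020-01-01T00:00:00Z"},
    {"identifier": "b", "updated": "2020-02-01T00:00:00Z"},
]
first_run = harvest(db, remote)
remote.append({"identifier": "c", "updated": "2020-03-01T00:00:00Z"})
second_run = harvest(db, remote)
full_run = harvest(db, remote, fullscan=True)
print(first_run, second_run, full_run)
```

Under these assumptions, a run that failed part-way could leave a newer record stored while older ones are missing. The cutoff would then hide the missing records from later incremental runs, which is the kind of situation a full scan repairs.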

If you would like to dump the metadata as line-oriented JSON you can use --dump:

% iacoll university_maryland_cp --dump > university_maryland_cp.jsonl
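
Line-oriented JSON is convenient for downstream processing because each line parses independently. A minimal sketch of reading such a dump follows. The `identifier` and `date` field names and the sample records are assumptions, since the exact shape of each record depends on the item metadata.

```python
import io
import json

def read_jsonl(lines):
    """Yield one decoded record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample standing in for university_maryland_cp.jsonl
sample = io.StringIO(
    '{"identifier": "example_item_001", "date": "1995"}\n'
    '{"identifier": "example_item_002", "date": "2001"}\n'
)

records = list(read_jsonl(sample))
ids_1995 = [r["identifier"] for r in records if r.get("date") == "1995"]
print(ids_1995)
```

With a real dump, passing an open file object to `read_jsonl` in place of `sample` should work the same way, since iterating over a text file yields one line at a time.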

Contributors

edsu
iacoll's Issues

suggestion

iacoll is neat, thanks, i think i will use it. i maintain a small collection on IA myself and i need to keep a copy of its metadata (and derived data too).

at the beginning i used https://github.com/atomotic/iafc to ingest everything into a fedora repository running on my home server. after a while i was struggling to keep the server running, so i gave up.

now i'm using https://caltechlibrary.github.io/dataset/, not really a json database but just as powerful: the raw json is saved into a pairtree. notable features: filtering on json path, csv export, full-text search with blevesearch (not yet tried).

this is how i'm using it:

# install and init the repo
go get -u -v github.com/caltechlibrary/dataset/...
dataset init grafton9.ds

# ingest all items of a collection
parallel 'dataset create grafton9.ds {} "$(ia metadata {})"' ::: $(ia search -i collection:grafton9)

# get the items, filtering
dataset keys grafton9.ds
dataset keys grafton9.ds '(eq .metadata.date "1995")'

the update logic is very inefficient, i rely on the fact that my metadata never changes :)
