web-download's Introduction

PDF Web Download

Overview

Downloads individual PDF files from saved Archive-It crawls using the Archive-It File Type report. Files are saved in folders named for the website (seed) URL. The script can download PDFs from many websites in one run, but from only one Archive-It collection at a time.

At UGA, PDFs are downloaded to provide access to Georgia government publications via the Digital Library of Georgia.

Getting Started

Dependencies

  • wget - for downloading content using a URL.

Installation

To install wget in Windows:

  1. Download a Windows Binary (http://wget.addictivecode.org/FrequentlyAskedQuestions.html#download).
  2. Save the wget.exe file to a folder on your machine, such as Documents/wget.
  3. Add the folder with wget.exe to your Path variable (under Settings, Environment Variables).
  4. Test by opening a terminal window and typing wget -h. The wget options should appear.

You need administrative privileges on your machine for the script to be able to use wget. Additionally, Windows install instructions recommend saving to the System32 folder, which is already in the Path, but the Python script cannot access wget in this location.
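
As a quick check from Python that wget can be found on the Path, something like the following can be used. This is a minimal sketch and not part of the script; find_wget() is a hypothetical helper name, and finding wget this way does not guarantee that the script has permission to run it.

```python
import shutil


def find_wget():
    """Return the full path to wget if it is on the Path, otherwise None."""
    # shutil.which searches the folders listed in the Path variable,
    # the same way the terminal does when you type "wget".
    return shutil.which("wget")


wget_path = find_wget()
if wget_path is None:
    print("wget was not found on the Path.")
else:
    print(f"wget was found at {wget_path}")
```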

To install wget in Linux or Mac: https://www.gnu.org/software/wget/

Verify the information in ait_collections.py is correct. It contains the default Archive-It collection for the download, as well as a list of all Archive-It collections for UGA.

To use this script for other formats, update how missing file extensions are assigned in get_file_name().
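
The kind of change involved might look like the sketch below. It is not the code in get_file_name(); add_missing_extension() and its default_extension parameter are hypothetical names used only to illustrate assigning an extension when a file name has none.

```python
import os


def add_missing_extension(file_name, default_extension=".pdf"):
    """Return file_name, adding default_extension if it has no extension.

    Hypothetical helper: the real logic lives in get_file_name().
    """
    # os.path.splitext splits "report.pdf" into ("report", ".pdf")
    # and "report" into ("report", "").
    root, extension = os.path.splitext(file_name)
    if extension == "":
        return file_name + default_extension
    return file_name
```

For another format, the default would change, for example add_missing_extension("data", ".csv").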

Script Arguments

  • input_folder (required): Folder with CSVs from Archive-It with the files to be downloaded.
  • ait_collection (optional): Title of the Archive-It collection that the files are part of, if not Georgia Government Publications (the default value for ait_collection).

Put quotes around either script argument if it has spaces.

Example for GGP collection:
python C:/user/scripts/download_files.py "C:/user/GGP CSVs"

Example for Activists and Advocates collection:
python C:/user/scripts/download_files.py C:/user/csv "Activist and Advocates"

Terminal Tips:

  • Drag a file or folder onto the terminal window to make its path appear.
  • Put quotes around any paths that have spaces.
  • Make sure there is a space between each component of the command.
  • Use the up arrow key to see earlier commands you typed, e.g., if you need to fix a typo or run something again.

Testing

Use the Testing Instructions as a guide for designing tests or as a basis for creating unit tests for the functions.

Workflow

Instructions for the entire workflow involving this script, including preparing the list of PDFs from Archive-It and working with the script interface: PDF Download Workflow Instructions

Author

Adriane Hanson, Head of Digital Stewardship, UGA Libraries

History

This script was developed for Sarah Causey in MAGIL in 2022, after they transitioned from HTTrack to Archive-It for web crawling. HTTrack automatically downloaded individual files, including PDFs, so that they could be added to DLG when warranted. Archive-It downloads the entire website as a WARC, so we needed a different way to extract the PDF files.

web-download's Issues

Use python library instead of wget

Installing wget on MAGIL workstations was complicated due to them not having administrator access to their machines. Using the python requests or urllib libraries might be simpler.
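
A minimal sketch of what a standard-library replacement could look like, using urllib.request so that nothing needs to be installed. The function name download_file() and its parameters are assumptions for illustration, not code from the script, and it does not reproduce every wget behavior (for example, retries).

```python
import urllib.request


def download_file(url, save_path, timeout=60):
    """Download url to save_path. Return True on success, False on failure.

    Hypothetical replacement for the wget call.
    """
    try:
        # Read the whole response first so that no file is created
        # on disk unless the download succeeded.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content = response.read()
    except (OSError, ValueError):
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        # ValueError is raised for a malformed URL.
        return False

    with open(save_path, "wb") as pdf:
        pdf.write(content)
    return True
```

Because the file is only written after the response has been read, a failed request does not leave behind an empty PDF.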

Make function for error testing

The script tests whether each argument is present and has an expected value. If it is possible with threading, doing this in a function would make for simpler code and make it easier to test.
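
One possible shape for such a function is sketched below. The name check_arguments() and the collections and default_collection parameters are assumptions; in the script these values would presumably come from ait_collections.py, whose exact structure is not shown here.

```python
import os


def check_arguments(argument_list, collections, default_collection):
    """Validate the script arguments.

    Returns (input_folder, ait_collection, errors), where errors is a
    list of messages and is empty if every argument is valid.
    Hypothetical sketch: argument_list is expected to look like sys.argv.
    """
    errors = []
    input_folder = None
    ait_collection = default_collection

    # input_folder is required and must be an existing folder.
    if len(argument_list) > 1:
        input_folder = argument_list[1]
        if not os.path.isdir(input_folder):
            errors.append(f"input_folder '{input_folder}' is not a folder.")
    else:
        errors.append("Required argument input_folder is missing.")

    # ait_collection is optional but must be a known collection title.
    if len(argument_list) > 2:
        ait_collection = argument_list[2]
        if ait_collection not in collections:
            errors.append(f"ait_collection '{ait_collection}' is not recognized.")

    return input_folder, ait_collection, errors
```

Returning the errors as a list, rather than printing and exiting inside the function, keeps it straightforward to call from a unit test.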

Don't change current directory for saving files

Currently, make_seed_folder() changes the current directory to the seed folder, so PDFs can be saved to the current directory. This has caused problems with other scripts during testing where it is unclear what the directory should be for a valid test.

Include the seed folder as part of the path for saving the PDF so that the current directory is not important.
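
A sketch of that approach, assuming a hypothetical helper pdf_save_path() that takes the output folder, the seed folder name, and the PDF file name. It makes the seed folder if needed and returns the full path, without calling os.chdir().

```python
import os


def pdf_save_path(output_folder, seed_folder, file_name):
    """Return the full path for saving a PDF inside its seed folder.

    Hypothetical helper: creates the seed folder if it does not exist
    and never changes the current directory.
    """
    seed_path = os.path.join(output_folder, seed_folder)
    os.makedirs(seed_path, exist_ok=True)
    return os.path.join(seed_path, file_name)
```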

Delete file if wget error 8

When wget returns an error code of 8, it means that a PDF file has been created with the indicated name but that the file was not in fact downloaded. This makes it look like the download was a success, but if the file is clicked on, it cannot be opened. Therefore, the PDF should be deleted if wget has an error code of 8.
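
The cleanup could be sketched as follows. The function names are hypothetical, and the wget call assumes the script saves each PDF with the -O option, which may not match how the script actually invokes wget.

```python
import os
import subprocess


def remove_failed_download(return_code, file_path):
    """Delete file_path if wget returned 8. Return True if a file was deleted."""
    # wget exit status 8 means the server issued an error response,
    # so any file created at file_path is not a real download.
    if return_code == 8 and os.path.exists(file_path):
        os.remove(file_path)
        return True
    return False


def download_with_wget(url, file_path):
    """Run wget and clean up after an error 8. Returns the wget exit status."""
    # Assumption: the PDF is saved to a named file with -O.
    result = subprocess.run(["wget", "-O", file_path, url])
    remove_failed_download(result.returncode, file_path)
    return result.returncode
```

Keeping the deletion in its own function means it can be tested without running wget or contacting a server.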

Remove the GUI

PySimpleGUI has added an annual cost. MAGIL confirmed that they can operate the script from the command line instead.

Make unit tests for functions

For now, while we aren't changing the script, having manual testing instructions is sufficient. If we begin more active development, start by making unit tests for all of the functions.

Challenge: the script accesses the most recent crawl of each website, so for tests that use the Archive-It API, the test input needs to be updated each time. The websites would be the same, but the files would not be.
