nondairyneutrino / springer_book_scraping

During COVID-19, Springer released a ton of books to the public (for a limited time). This code programmatically downloads and organizes the books you want.

Mathematica 100.00%

Introduction

Many of you may know that the scientific book publisher Springer has released a large collection of its books to help teachers and learners. The announcement even included a complete Excel spreadsheet of the books' metadata, including a link to the page for each freely downloadable book. It turns out that, while the provided URLs don't point to the download pages or files themselves, there's an easy rule for transforming the DOI addresses into the file addresses. After identifying this rule, I wrote some Mathematica code to scrape the list and download each book.
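
As a minimal sketch of that rule, the snippet below rewrites a single DOI address into its PDF address. The DOI shown is a made-up placeholder, not an entry from the spreadsheet; the replacement rule itself is the one used in the code further down.

```mathematica
(* Hypothetical DOI address, for illustration only *)
doiUrl = "http://doi.org/10.1007/978-0-000-00000-0";

(* Swap the DOI resolver prefix for Springer's PDF content prefix *)
pdfUrl = StringReplace[doiUrl, "http://doi.org" -> "https://link.springer.com/content/pdf"]
(* "https://link.springer.com/content/pdf/10.1007/978-0-000-00000-0" *)
```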

I've divided the program into two modes: scraping all the books, or choosing only the subjects you want (e.g. Physics and Astronomy, and Mathematics). Either way, we first pick out the relevant columns from the book data.

Book Data

Once this code is run, you'll be prompted with two file explorer dialogs in sequence. The first identifies the book data CSV, and the second specifies the directory in which to place the folder that will contain all the books (e.g. [chosen directory]\Springer_Books). The code automatically creates a subdirectory for each subject and puts each book in the right one.

books = Import[SystemDialogInput["FileOpen", WindowTitle -> "FIND THE BOOK DATA SHEET"]]; (*Opens a file explorer so you can more easily pick the book data CSV*)

subjectsColumn = Position[books[[1]], "English Package Name"][[1, 1]];
subjects = books[[2 ;;, subjectsColumn]]; (*Picks out the subjects column of the book data*)

urls = StringReplace[#, "http://doi.org" -> "https://link.springer.com/content/pdf"] & /@ books[[2 ;;, Position[books[[1]], "DOI URL"][[1, 1]]]]; (*Picks out and transforms the DOI urls of the book webpages into the urls of the downloadable files*)

titles = StringReplace[#, {":" -> ",", "/" -> " and "}] & /@ books[[2 ;;, 1]]; (*Picks out the titles column of the book data. NOTE: Some titles contain ":" or "/", which Windows does not allow in file names, so they are first replaced with file-name-friendly text (":" -> "," and "/" -> " and ").*)

authors = With[
  {
    authorColumn = Position[books[[1]], "Author"][[1, 1]],
    getLastNamesList = (StringSplit[StringSplit[#, ", "]][[;; , -1]] &),
    lastNamesListToString = (StringJoin @@ If[Length@# > 1, Riffle[#, "; "], #] &),
    fixNames = (FromCharacterCode[ToCharacterCode@#, "UTF8"] &) (*You could also use ImportString[#,"Text"]&, but the implemented method is much faster.*)
  },
  fixNames@*lastNamesListToString@*getLastNamesList /@ books[[2 ;;, authorColumn]]
];

years = ToString /@ books[[2 ;;, 5]]; (*Picks out the years column (column 5) of the book data*)

(*Creating the subject directories*)
bookDir = FileNameJoin@{SystemDialogInput["Directory", WindowTitle -> "WHERE TO SAVE ALL THE BOOKS"], "Springer_Books"}; (*This opens a file explorer to choose a directory in which to put the folder that will contain all the books*)

CreateDirectory@FileNameJoin@{bookDir, #} & /@ DeleteDuplicates@subjects;

uniqueName[title_String, year_String, author_String] := title <> " (" <> year <> ")" <> " - " <> author

fetchBook[{url_, subject_, name_}] := URLSave[url, FileNameJoin[{bookDir, subject, name <> ".pdf"}]]

fetchBookList[data : {urls_, subjects_, names_}] := Monitor[
  Block[
    {n = 0},
    (n++; fetchBook@#) & /@ Transpose@data
  ],
  ProgressIndicator[n/Length@data[[1]]]
]
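
To make the file-naming step above concrete, here is a small self-contained sketch that builds one file name from a single record. The title, author field, and year are hypothetical placeholders, and StringRiffle is swapped in for the StringJoin/Riffle combination used above; it is meant as an illustration of the naming scheme, not a replacement for the definitions.

```mathematica
(* Hypothetical record, for illustration only *)
title = "Quantum Fields: A Primer/Workbook";
authorField = "Ada Lovelace, Alan Turing";
year = "2020";

(* Same substitutions as above: ":" and "/" are not allowed in Windows file names *)
safeTitle = StringReplace[title, {":" -> ",", "/" -> " and "}];

(* Split on ", " to separate the authors, then keep the last word of each name *)
lastNames = Last /@ StringSplit /@ StringSplit[authorField, ", "];

(* Join the last names with "; " and assemble the name in the uniqueName format *)
name = safeTitle <> " (" <> year <> ") - " <> StringRiffle[lastNames, "; "]
(* "Quantum Fields, A Primer and Workbook (2020) - Lovelace; Turing" *)
```

Note that keeping only the last word of each name is a heuristic: multi-word surnames would be shortened to their final word.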

Downloading

All the books

With all our definitions above, scraping all the books from the list is relatively simple, or at least concise.

fetchBookList@{urls, subjects, MapThread[uniqueName, {titles, years, authors}]}

Specific Subjects

Instead of downloading each subject as a black box, I've made it so you can see what books comprise each subject you may want to scrape. Basically, you click the subjects you're interested in and the program displays the books in that subject, and then you can simply click the "Download" button at the bottom of the prompt and Mathematica fetches the listed books.

Manipulate[
  If[
    subject === 0,
    "",
    Column@{
      TableForm[
        {#[[1]], #[[2, ;; , 2]]} & /@ data[[Sort@subject]],
        TableAlignments -> {Left, Top},
        TableSpacing -> {5, 5, 1.5}
      ],
      Button[
        "Download",
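        (*Rearrange only at level 1 of the flattened list; a bare /. would also match a subject holding exactly two books as a whole*)
        fetchBookList@Transpose@Replace[Catenate[Thread /@ data[[Sort@subject]]], {subj_, {url_, name_}} :> {url, subj, name}, {1}]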
      ]
    }
  ],
  {
    {subject, 0, "Subject"},
    Thread[Range@Length@# -> #] &@Sort@DeleteDuplicates@subjects,
    TogglerBar,
    Appearance -> "Horizontal" -> {3, 8}
  },
  Initialization :> (data = {#1[[1, 1]], #[[;; , 2 ;;]]} & /@ GatherBy[Transpose@{subjects, urls, MapThread[uniqueName, {titles, years, authors}]}, First] // SortBy[#, First] &),
  Deinitialization :> Clear@data
]
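
The Initialization step groups the books by subject before they are displayed. A small sketch of that grouping on made-up rows (the subjects, URLs, and names below are placeholders) shows the shape of the data that the TableForm and the Download button work with:

```mathematica
(* Hypothetical rows of {subject, url, name}, for illustration only *)
rows = {
  {"Physics", "url1", "Book A"},
  {"Mathematics", "url2", "Book B"},
  {"Physics", "url3", "Book C"}
};

(* Group rows by subject, keep {url, name} for each book, then sort the groups by subject *)
grouped = SortBy[
  {#[[1, 1]], #[[All, 2 ;;]]} & /@ GatherBy[rows, First],
  First
]
(* {{"Mathematics", {{"url2", "Book B"}}}, {"Physics", {{"url1", "Book A"}, {"url3", "Book C"}}}} *)
```

Each element pairs a subject with its list of {url, name} entries, which is why the table can show the names with #[[2, ;; , 2]].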

Conclusion

I elected to download all the books because I'm becoming a data hoarder. It took a little over an hour to download everything over Wi-Fi, and the finished folder is about 8 GB. (The download wasn't parallelized because I had some concerns about writing the data to my hard disk.)
