pubcrawl

Convert ‘epub’ Files to Text

Description

Convert ‘epub’ Files to Text

The ‘epub’ file format is really just a structured ‘ZIP’ archive with metadata, graphics and (usually) ‘HTML’ text. Tools are provided to turn an ‘epub’ file into a tidy data frame.
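
Because an epub is a ZIP archive, its contents can be listed with base R alone. The sketch below is not how pubcrawl is implemented; `html_entries()` and `list_epub_html()` are hypothetical helper names, and the regular expression used to spot (X)HTML files is an assumption. It relies only on `utils::unzip(list = TRUE)`, which returns a data frame with `Name`, `Length` and `Date` columns.

```r
# Hypothetical helper (not part of pubcrawl): keep only the (X)HTML entries
# of a ZIP listing and rename the columns to path, size and date.
html_entries <- function(listing) {
  # Assumption: text documents end in .html, .htm, .xhtml or .xhtm
  is_html <- grepl("\\.x?html?$", listing$Name, ignore.case = TRUE)
  data.frame(
    path = listing$Name[is_html],
    size = listing$Length[is_html],
    date = listing$Date[is_html],
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: list the HTML documents inside an epub file
# without extracting anything to disk.
list_epub_html <- function(epub_path) {
  html_entries(utils::unzip(path.expand(epub_path), list = TRUE))
}
```

Calling `list_epub_html()` on any epub path should give one row per HTML document in the archive; reading and cleaning the text of each entry is the part the package takes care of.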

What’s Inside The Tin

The following functions are implemented:

  • epub_to_text: Convert an epub file into a data frame of plaintext chapters

NOTE

There are edge cases I’ve totally not covered yet. Feel free to jump in and make this a real, useful package!

TODO

  • Refactor so there aren’t so many heavy dependencies
  • [ ] Try to get hgr on CRAN so it’s not a GH dep: moved the cleaner code into here instead
  • Better docs
  • Embed some epubs for examples and tests
  • Setup Travis, Appveyor, code coverage

Installation

devtools::install_github("hrbrmstr/pubcrawl")

Usage

library(pubcrawl)
library(tidyverse)

# current version
packageVersion("pubcrawl")
## [1] '0.1.0'

An O’Reilly epub

epub_to_text("~/Data/R Packages.epub")
## # A tibble: 26 x 4
##    path                         size date                content                                                       
##    <chr>                       <dbl> <dttm>              <chr>                                                         
##  1 OEBPS/cover.html              315 2015-03-24 21:49:16 Cover                                                         
##  2 OEBPS/titlepage01.html        466 2015-03-24 21:49:16 "R Packages\n\nHadley Wickham"                                
##  3 OEBPS/copyright-page01.html  3286 2015-03-24 21:49:16 "R Packages\n\nby Hadley  Wickham\n\n\n\nPrinted in the Unite…
##  4 OEBPS/toc01.html            17557 2015-03-24 21:49:16 "navPrefaceIn This Book\n\nConventions Used in This Book\n\nU…
##  5 OEBPS/preface01.html        17784 2015-03-24 21:49:16 "Preface\n\n\nIn This Book\n\nThis book will guide you from b…
##  6 OEBPS/part01.html             444 2015-03-24 21:49:16 Getting Started
##  7 OEBPS/ch01.html             12007 2015-03-24 21:49:16 "Introduction\n\nIn R, the fundamental unit of shareable code…
##  8 OEBPS/ch02.html             28633 2015-03-24 21:49:18 "Package Structure\n\nThis chapter will start you on the road…
##  9 OEBPS/part02.html             454 2015-03-24 21:49:18 Package Components
## 10 OEBPS/ch03.html             28629 2015-03-24 21:49:18 "R Code\n\nThe first principle of using a package is that all…
## # ... with 16 more rows

A Project Gutenberg epub that comes with the package

epub_to_text(system.file("extdat", "augustine.epub", package="pubcrawl")) %>% 
  mutate(path = abbreviate(path))
## # A tibble: 10 x 4
##    path                             size date                content                                                   
##    <chr>                           <dbl> <dttm>              <chr>                                                     
##  1 OEBPS/@@@@@@@3296@3296-@3296--0 63804 2017-10-02 07:00:00 "THE CONFESSIONS\nOF\nSAINT AUGUSTINE\n\nBy Saint Augusti…
##  2 OEBPS/@@@@@@@3296@3296-@3296--1 68504 2017-10-02 07:00:00 "BOOK III\nTo Carthage I came, where there sang all aroun…
##  3 OEBPS/@@@@@@@3296@3296-@3296--2 80192 2017-10-02 07:00:00 "BOOK V\nAccept the sacrifice of my confessions from the …
##  4 OEBPS/@@@@@@@3296@3296-@3296--3 51898 2017-10-02 07:00:00 "O crooked paths! Woe to the audacious soul, which hoped,…
##  5 OEBPS/@@@@@@@3296@3296-@3296--4 80194 2017-10-02 07:00:00 "Anubis, barking Deity, and all         The monster Gods …
##  6 OEBPS/@@@@@@@3296@3296-@3296--5 80718 2017-10-02 07:00:00 "The boy then being stilled from weeping, Euodius took up…
##  7 OEBPS/@@@@@@@3296@3296-@3296--6 65956 2017-10-02 07:00:00 "And Thou knowest how far Thou hast already changed me, w…
##  8 OEBPS/@@@@@@@3296@3296-@3296--7 57022 2017-10-02 07:00:00 "BOOK XII\nMy heart, O Lord, touched with the words of Th…
##  9 OEBPS/@@@@@@@3296@3296-@3296--8 69513 2017-10-02 07:00:00 "BOOK XIII\nI call upon Thee, O my God, my mercy, Who cre…
## 10 OEBPS/@@@@@@@3296@3296-@3296--9 21223 2017-10-02 07:00:00 "The Confessions of Saint Augustine, by Saint Augustine\n…
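
The content column holds each file's text as one string, and the `\n\n` sequences in the output suggest that blank lines separate paragraphs. A minimal base R sketch of splitting that text further follows; the two-row data frame is a made-up stand-in for a real result, and `split_paragraphs()` is a hypothetical helper, not a pubcrawl function.

```r
# Toy stand-in for the output of epub_to_text(); the rows are made up
# for illustration and only mimic the `path` and `content` columns.
chapters <- data.frame(
  path = c("OEBPS/ch01.html", "OEBPS/ch02.html"),
  content = c("Introduction\n\nFirst paragraph.\n\nSecond paragraph.",
              "Package Structure\n\nOnly paragraph."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one row per paragraph, splitting each
# content string on runs of two or more newlines.
split_paragraphs <- function(df) {
  parts <- strsplit(df$content, "\n{2,}")
  data.frame(
    path = rep(df$path, lengths(parts)),           # source file of each paragraph
    paragraph = unlist(lapply(parts, seq_along)),  # position within that file
    text = trimws(unlist(parts)),
    stringsAsFactors = FALSE
  )
}

paragraphs <- split_paragraphs(chapters)
```

The same helper should work on a real result as long as it has `path` and `content` columns, although headings and body text are not distinguished.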

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
