
issuu-pdf-dl's Introduction

Hi there 👋

  • 🙂 I’m a human being based in Ireland.
  • 👷 I work as a software engineer and I hold a PhD in computer science.
  • 📫 How to reach me: $name.$surname[at]gmail[dot]com or LinkedIn.

issuu-pdf-dl's People

Contributors

pviotti


issuu-pdf-dl's Issues

OCR on PDF

Run the downloaded PDF through an OCR engine (Tesseract?) to make its text searchable.
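
One possible approach, sketched under assumptions rather than taken from the project: the Tesseract command-line tool can write a searchable PDF for an image when it is given the `pdf` output config, so OCR could be applied to each downloaded JPG before the pages are merged instead of to the final PDF. The helper names `ocr_command` and `ocr_pages` are hypothetical, and the sketch assumes the `tesseract` executable and the chosen language data are installed.

```ruby
# Sketch only: builds the Tesseract command line for one page image.
# `ocr_command` and `ocr_pages` are hypothetical helpers, not part of the script.
def ocr_command(jpg_path, lang = 'eng')
  # Tesseract appends ".pdf" to the output base on its own.
  out_base = jpg_path.sub(/\.jpg\z/, '')
  ['tesseract', jpg_path, out_base, '-l', lang, 'pdf']
end

# Runs OCR on every JPG in dir, producing one searchable PDF per page.
def ocr_pages(dir)
  Dir["#{dir}/*.jpg"].sort.each do |jpg|
    # system returns false on a non-zero exit and nil if tesseract is missing.
    ok = system(*ocr_command(jpg))
    warn "OCR failed for #{jpg}" unless ok
  end
end
```

With this layout, `page_001.jpg` would yield `page_001.pdf` in the same directory, and those per-page PDFs could then be merged with pdftk in place of the RMagick conversion step.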

Update for current ruby and new issuu API (code provided)

The following code fixes:

warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
undefined method `mktmpdir' for Dir:Class (NoMethodError)

I also updated it to support the new API, since the old API no longer works.

#!/usr/bin/env ruby
# https://github.com/pviotti/issuu-pdf-dl
# sudo apt-get install ruby ruby-rmagick pdftk 

# Download documents from issuu.com as JPGs and convert them to PDF

require 'open-uri'
require 'rmagick'
require 'tmpdir'
require 'fileutils'
require 'json'

def fetch_pdf(url)
    docname = url.split("/")[5]
    # The document hash is part of the page-1 image URL (as in the twitter:image meta tag).
    html = URI.open(url).read
    match = html.match(%r{https://image\.isu\.pub/([^/"]+)/jpg/page_1\.jpg})
    raise "Could not find the document hash in #{url}" if match.nil?
    pub_hash = match[1]
    query_url = "https://search.issuu.com/api/2_0/documents?documentId=#{pub_hash}&responseParams=pagecount"

    # Ask the search API for the page count.
    json_data = JSON.parse(URI.open(query_url).read)
    num_pages = json_data['response']['docs'][0]['pagecount'].to_i

    dir = Dir.mktmpdir
    begin
        (1..num_pages).each do |x|
          # File.binwrite closes the file, so each page is fully flushed to disk.
          File.binwrite("#{dir}/page_#{"%03d" % x}.jpg",
                        URI.open("http://image.issuu.com/#{pub_hash}/jpg/page_#{x}.jpg").read)
          puts("#{Time.now.strftime('%Y-%m-%d %X')} - Downloaded: page_#{x}.jpg")
        end
        puts("#{Time.now.strftime('%Y-%m-%d %X')} - All pages have been downloaded.")

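        # Sort so that the pages are converted in order.
        Dir["#{dir}/*.jpg"].sort.each do |filename|
            begin
                im = Magick::Image.read(filename)
                im[0].write(filename + ".pdf")
            rescue StandardError => e
                # Print the underlying reason, not just the file name.
                puts("Error converting #{filename} to PDF: #{e.message}")
            end
        end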

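        # Pass the arguments as a list so that docname is not interpreted by the shell.
        merged = system('pdftk', *Dir["#{dir}/*.pdf"].sort, 'cat', 'output', "#{docname}.pdf")
        if merged
            puts("#{Time.now.strftime('%Y-%m-%d %X')} - #{docname}.pdf has been created successfully.")
        else
            puts("#{Time.now.strftime('%Y-%m-%d %X')} - pdftk failed; #{docname}.pdf was not created.")
        end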
    ensure
        # remove the tmp directory
        FileUtils.remove_entry_secure dir
    end
end

if __FILE__ == $0
    if ARGV.empty?
        puts "Usage: #{$0} <issuu.com URL>"
        exit 1
    end
    fetch_pdf(ARGV[0])
end

PDF conversion process breaks for some reason

The script successfully creates a temporary folder and manages to download the .jpg files from the host, but it breaks partway through converting each JPG to PDF.

A first extract of the error output:

`...

2022-08-30 15:49:57 - Downloaded: page_223.jpg
2022-08-30 15:49:58 - Downloaded: page_224.jpg
2022-08-30 15:49:58 - All pages have been downloaded.
Error converting /tmp/d20220830-6822-11bnqm7/page_105.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_106.jpg to PDF.

...`

The book is 224 pages long, and all files are created successfully in the temporary folder (checked in real time from Nautilus).
The page_105.jpg file is there and opens fine; it doesn't seem corrupted. The output at the end is as follows:

`...

Error converting /tmp/d20220830-6822-11bnqm7/page_221.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_222.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_223.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_224.jpg to PDF.
Error: Unable to find file.
Error: Failed to open PDF file:
/tmp/d20220830-6822-11bnqm7/page_105.jpg.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.

...`

It seems to me that the temporary folder gets deleted while the task is still running, so it breaks halfway. I know it sounds crazy.

The link that gives me this error is here

Thanks in advance for support.
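
The hypothesis in this report, that the temporary folder is deleted mid-run, can be checked in isolation. The sketch below is not the script's actual code; `with_kept_tmpdir` and its `keep` flag are hypothetical names. It relies on the fact that an `ensure` clause runs only after the `begin` body has finished or raised, so the directory should exist for the whole conversion. Keeping the directory on demand also makes it possible to inspect the JPGs after a failed run.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: yields a fresh temp dir and removes it afterwards,
# unless keep is true (useful for inspecting files after a failed run).
def with_kept_tmpdir(keep: false)
  dir = Dir.mktmpdir
  begin
    yield dir
  ensure
    if keep
      warn "Keeping #{dir} for inspection"
    else
      FileUtils.remove_entry_secure(dir)
    end
  end
end

# The directory exists for the whole block and is removed only after it returns.
seen = nil
with_kept_tmpdir do |dir|
  File.binwrite(File.join(dir, 'page_1.jpg'), 'fake image bytes')
  seen = dir
  puts Dir.exist?(dir)   # prints true while the block is running
end
puts Dir.exist?(seen)    # prints false once the block has returned
```

Calling `with_kept_tmpdir(keep: true)` around the download and conversion steps would leave the files in place, so a failing page could be opened or converted by hand.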

Not working!

I am getting the following error:

issuu-pdf-dl.rb:16:in `fetch_pdf': undefined method `split' for nil:NilClass (NoMethodError)

Hope you can help! Thx
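
An `undefined method 'split' for nil` error at that point suggests that no part of the fetched page matched the pattern the script looks for, so the lookup returned nil, possibly because the page markup changed. Below is a minimal sketch of a guarded extraction that fails with a descriptive message instead. `extract_doc_id` is a hypothetical helper, the pattern assumes the image URL layout used by the updated script above, and the sample hash is made up for illustration.

```ruby
# Hypothetical helper: pulls the document hash out of the page HTML, or
# raises a descriptive error instead of calling a method on nil.
# The pattern assumes URLs of the form
# https://image.isu.pub/<hash>/jpg/page_1.jpg, which may change.
DOC_ID_PATTERN = %r{https://image\.isu\.pub/([^/"]+)/jpg/page_1\.jpg}

def extract_doc_id(html)
  match = html.match(DOC_ID_PATTERN)
  raise ArgumentError, 'document id not found; the page layout may have changed' if match.nil?
  match[1]
end

# Made-up sample markup, for illustration only.
html = '<meta name="twitter:image" content="https://image.isu.pub/210101-abc123/jpg/page_1.jpg">'
puts extract_doc_id(html)   # prints 210101-abc123
```

If the error persists with a check like this in place, saving the fetched HTML to a file and searching it for `image.isu.pub` would show whether the page still exposes the hash at all.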
