pviotti / issuu-pdf-dl

Download documents from issuu.com as PDF

Ruby 100.00%

issuu-pdf-dl's Introduction

Hi there 👋

🙂 I’m a human being based in Ireland.
👷 I work as a software engineer and I hold a PhD in computer science.
📫 How to reach me: $name.$surname[at]gmail[dot]com or LinkedIn.

issuu-pdf-dl's People

Contributors

Stargazers

Watchers

Forkers

meugarfo braggjc osvaldoj librenauta asdbaihu

issuu-pdf-dl's Issues

Use pure Ruby solution to combine PDF

Instead of the external PDFToolkit, use a Ruby library to merge the PDFs (e.g. https://github.com/boazsegev/combine_pdf )
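A minimal sketch of what this could look like, assuming the combine_pdf gem is installed (gem install combine_pdf) and that the per-page PDFs keep the zero-padded names the script already writes. The helper names ordered_page_pdfs and merge_pages are hypothetical and not part of the script.

```ruby
# Return the per-page PDFs in page order. The zero-padded names written by
# the script (page_001.jpg.pdf, page_002.jpg.pdf, ...) sort correctly as strings.
def ordered_page_pdfs(dir)
  Dir.glob(File.join(dir, '*.pdf')).sort
end

# Merge the per-page PDFs into a single file with combine_pdf
# (assumption: the gem is available on the machine running the script).
def merge_pages(dir, output_path)
  require 'combine_pdf' # loaded here so only the merge step depends on the gem
  merged = CombinePDF.new
  ordered_page_pdfs(dir).each { |path| merged << CombinePDF.load(path) }
  merged.save(output_path)
end
```

In the script, a call such as merge_pages(dir, "#{docname}.pdf") would take the place of the pdftk command line, which would also remove the need to interpolate the document name into a shell string.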

OCR on PDF

Scan downloaded PDF with an OCR (Tesseract?) to make its text searchable
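One possible approach, sketched under the assumption that the tesseract command-line tool is installed: run OCR on each downloaded page image rather than on the finished PDF, since Tesseract's pdf output mode writes a PDF with a hidden text layer. The helper names and the default language 'eng' are hypothetical choices for illustration.

```ruby
# Build the Tesseract command that turns one page image into a searchable PDF.
# "tesseract IMAGE OUTPUTBASE -l LANG pdf" writes OUTPUTBASE.pdf.
def tesseract_command(image_path, lang: 'eng')
  output_base = image_path.sub(/\.jpe?g\z/i, '')
  ['tesseract', image_path, output_base, '-l', lang, 'pdf']
end

# Run OCR on one page and return the path of the PDF Tesseract should have written.
# Passing the command as an array avoids going through the shell.
def ocr_page(image_path, lang: 'eng')
  cmd = tesseract_command(image_path, lang: lang)
  system(*cmd) or raise "tesseract failed for #{image_path}"
  "#{cmd[2]}.pdf"
end
```

The resulting per-page PDFs could then be merged the same way as the ones produced by the image conversion step.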

-bash: syntax error near unexpected token `newline'

I ran the command
ruby ./issuu-pdf-dl.rb https://issuu.com/doushuroll/book2_2022
in the terminal and got the following error:
-bash: syntax error near unexpected token `newline'
Please help.

Update for current ruby and new issuu API (code provided)

The following code fixes:

warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
undefined method `mktmpdir' for Dir:Class (NoMethodError)

I also updated it to support the new API; the old API no longer works.

#!/usr/bin/env ruby
# https://github.com/pviotti/issuu-pdf-dl
# sudo apt-get install ruby ruby-rmagick pdftk 

# Download documents from issuu.com as JPGs and convert them to PDF

require 'open-uri'
require 'rmagick'
require 'tmpdir'
require 'fileutils'
require 'json'

def fetch_pdf(url)
    username = url.split("/")[3]
    docname = url.split("/")[5]
    pub_hash = URI.open(url).grep(/image.isu.pub/)[0].split('<meta name="twitter:image" content="https://image.isu.pub/')[1].split('/jpg/page_1.jpg">')[0]
    query_url = "https://search.issuu.com/api/2_0/documents?documentId=#{pub_hash}&responseParams=pagecount"

    json_data = JSON.parse(URI.open(query_url).read)
    num_pages = json_data['response']['docs'][0]['pagecount'].to_i

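    begin
        dir = Dir.mktmpdir
        
        for x in 1..num_pages do
          # File.binwrite opens, writes and closes the file in one call,
          # so the data is flushed to disk before the conversion step reads it
          File.binwrite("#{dir}/page_#{"%03d" % x}.jpg",
                        URI.open("http://image.issuu.com/#{pub_hash}/jpg/page_#{x}.jpg").read)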
          puts(Time.now.strftime('%Y-%m-%d %X') +" - Downloaded: page_#{x}.jpg")
        end
        puts("#{Time.now.strftime('%Y-%m-%d %X')} - All pages have been downloaded.")

        Dir["#{dir}/*.jpg"].each { |filename| 
            begin
                im = Magick::Image.read(filename)
                im[0].write(filename + ".pdf")
            rescue => e
                # include the underlying error message instead of hiding it
                puts("Error converting #{filename} to PDF: #{e.message}")
            end
            }

        `pdftk #{dir}/*.pdf cat output #{docname}.pdf`
        puts("#{Time.now.strftime('%Y-%m-%d %X')} - #{docname}.pdf has been created successfully.")
    ensure
        # remove the tmp directory
        FileUtils.remove_entry_secure dir
    end
end

if __FILE__ == $0
    if ARGV.length == 0 then
        puts "Usage: #{$0} <issuu.com URL>"
        exit 1
    end
    fetch_pdf(ARGV[0])
end
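The chained split calls used to find the publication hash raise a NoMethodError on nil as soon as the page no longer contains the expected meta tag. A hedged alternative is to extract the hash with a single regular expression and fail with a readable message. The helper names below are hypothetical, and the pattern assumes the image URL layout seen in the code above (image.isu.pub/<hash>/jpg/page_1.jpg).

```ruby
# Pull the publication hash out of the document page's HTML.
# Returns nil when the page has no matching image URL.
def extract_pub_hash(html)
  match = html.match(%r{image\.isu\.pub/([^/"]+)/jpg/page_1\.jpg})
  match && match[1]
end

# Stop with an explanatory message instead of a NoMethodError on nil.
def pub_hash_or_abort(html, url)
  extract_pub_hash(html) or
    abort("Could not find a publication hash in #{url}; the page layout may have changed.")
end
```

Inside fetch_pdf this would be called as pub_hash_or_abort(URI.open(url).read, url).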

Pdf conversion process breaks for some reason

The script successfully creates a temporary folder and manages to download the .jpg files from the host, but breaks halfway through converting each JPG to PDF.

A first extract of the error:

`...

2022-08-30 15:49:57 - Downloaded: page_223.jpg
2022-08-30 15:49:58 - Downloaded: page_224.jpg
2022-08-30 15:49:58 - All pages have been downloaded.
Error converting /tmp/d20220830-6822-11bnqm7/page_105.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_106.jpg to PDF.

...`

The book is 224 pages, and all files get created successfully in the temporary folder; I checked this in real time from Nautilus.
The page_105.jpg file is there and opens fine; it doesn't seem to be corrupted. The output at the end is as follows:

`...

Error converting /tmp/d20220830-6822-11bnqm7/page_221.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_222.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_223.jpg to PDF.
Error converting /tmp/d20220830-6822-11bnqm7/page_224.jpg to PDF.
Error: Unable to find file.
Error: Failed to open PDF file:
/tmp/d20220830-6822-11bnqm7/page_105.jpg.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.

...`

It seems to me that the temporary folder gets deleted while the task is running, so it breaks halfway. I know it sounds crazy.
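One way to check that hypothesis is to observe when Ruby actually removes a temporary directory. The sketch below uses the block form of Dir.mktmpdir, which deletes the directory only after the block returns; the helper name tmpdir_lifetime is hypothetical. If the directory is still present while the work runs, the "Error converting" lines, which carry no exception detail, probably have another cause.

```ruby
require 'tmpdir'

# Run the given work inside a temporary directory and report whether the
# directory existed once the work finished and after the block returned.
def tmpdir_lifetime
  path = nil
  existed_during = Dir.mktmpdir do |dir|
    path = dir
    yield dir if block_given?
    Dir.exist?(dir) # checked after the work, still inside the block
  end
  { existed_during: existed_during, exists_after: Dir.exist?(path) }
end
```

Calling tmpdir_lifetime { |dir| File.write(File.join(dir, 'page.jpg'), 'data') } reports the directory as present during the work and gone afterwards.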

The link that gives me this error is here

Thanks in advance for support.

Not working!

I am getting the following error:

issuu-pdf-dl.rb:16:in `fetch_pdf': undefined method `split' for nil:NilClass (NoMethodError)

Hope you can help! Thx
