infoqscraper's Introduction

A Web scraper for InfoQ.

InfoQ hosts a lot of great presentations. Unfortunately, it is not possible to watch them outside of the browser or without Flash installed. The video cannot simply be downloaded because the audio stream and the slide stream are not in the same media file. If you download the video, you get only the audio track and a video of the presenter, not the slides, which are the meat of a presentation.

infoqscraper allows you to:

  • list and search for presentations
  • download and create a movie including the slides, the audio track and optionally a thumbnail of the presenter

infoqscraper is compatible with Python 2 (>= 2.6) and Python 3. It has a few third-party dependencies, such as ffmpeg and swftools, and has been reported to work fine on various Linux distributions and Mac OS X.

See the Wiki to learn how to install and use Infoqscraper.

Help

You can contact me if you have any questions or feature requests.

If you find this project useful, any feedback or contribution, technical or non-technical, is welcome!

infoqscraper's People

Contributors

bytehead, cykl, kevingreene, palfrey, zerkms

infoqscraper's Issues

Resume download

If a downloaded file already exists, ffmpeg prompts:

'infoq-video.avi' already exists. Overwrite ? [y/N]

However, the user never sees this prompt, so infoqscraper simply hangs.

Enhancement

Have ffmpeg resume download of a video file.
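
One possible way to avoid the hidden prompt, sketched under assumptions: pass ffmpeg's `-y` (always overwrite) or `-n` (never overwrite) option and close stdin so ffmpeg cannot wait for an answer. The helper names `build_ffmpeg_cmd` and `run_without_prompt` are hypothetical and are not part of infoqscraper; very old ffmpeg or avconv builds may not accept `-n`.

```python
import subprocess


def build_ffmpeg_cmd(inputs, output, overwrite=False, ffmpeg="ffmpeg"):
    """Return an ffmpeg argument list that never prompts about `output`."""
    # -y overwrites an existing output file, -n refuses and exits instead.
    cmd = [ffmpeg, "-y" if overwrite else "-n"]
    for path in inputs:
        cmd += ["-i", path]
    cmd.append(output)
    return cmd


def run_without_prompt(cmd):
    """Run `cmd` with stdin closed so a prompt cannot block the process."""
    # subprocess.DEVNULL requires Python 3.3 or later.
    return subprocess.call(cmd, stdin=subprocess.DEVNULL)
```

With `-n`, a non-zero exit status can then be reported to the user as "file already exists" instead of hanging.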

Failed to download video

Running infoqscraper 0.1.1.dev0 on OS X

$ infoqscraper presentation download effective-api-design

results in

Failed to create presentation effective-api-design.avi: Failed to download video at rtmpe://video.infoq.com/cfx/st/: rtmpdump exited with 2.
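
The rtmpdump man page documents exit status 2 as an incomplete transfer that may get further when resumed. A minimal retry sketch follows, assuming that is what happened here; `download_with_resume` and its `run` parameter are hypothetical names, and only the `-r`, `-o` and `-e` (`--resume`) options of rtmpdump are used.

```python
import subprocess

RD_INCOMPLETE = 2  # rtmpdump exit status for an incomplete transfer


def download_with_resume(url, output, max_attempts=5, run=subprocess.call):
    """Call rtmpdump, resuming with -e while it reports an incomplete transfer."""
    cmd = ["rtmpdump", "-r", url, "-o", output]
    for attempt in range(max_attempts):
        # The first attempt starts fresh; later attempts add -e to resume
        # the partial file instead of restarting the download.
        status = run(cmd + ["-e"] if attempt else cmd)
        if status != RD_INCOMPLETE:
            return status
    return RD_INCOMPLETE
```

The `run` argument exists only so the retry logic can be exercised without a network connection.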

Proportional scale for overlay

How about switching to scale=w=320:h=-1 instead of scale=320x240 so that the speaker thumbnail is not distorted?

Then the filter config could be changed from

[tmp1][speaker] overlay=shortest=1:x=main_w-320:y=main_h-240

to

[tmp1][speaker] overlay=shortest=1:x=main_w-320:y=main_h-h

I patched the local installation and it works great.
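
The proposal can be sketched as a small helper that builds both filter fragments. This is an illustration only: `speaker_overlay_filters` is a hypothetical name, and the `[tmp1]` and `[speaker]` labels are taken from the filter line quoted above.

```python
def speaker_overlay_filters(width=320):
    """Return (scale, overlay) filter fragments that keep the aspect ratio."""
    # h=-1 lets ffmpeg derive the height from the input aspect ratio.
    scale = "scale=w={0}:h=-1".format(width)
    # In the overlay filter, `h` is the height of the overlaid (speaker)
    # input, so the thumbnail stays anchored to the bottom-right corner
    # whatever height the scale step produced.
    overlay = "[tmp1][speaker] overlay=shortest=1:x=main_w-{0}:y=main_h-h".format(width)
    return scale, overlay
```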

Thanks for this tool

Just wanted to report that I have been using it on a Mac, and it works fine for me. :)

OSError: [Errno 31] Too many links

Installed with pip:

$ ~/.local/bin/infoqscraper -c presentation download latency-pitfalls

Traceback (most recent call last):
  File "/home/user/.local/bin/infoqscraper", line 46, in <module>
    sys.exit(main.main())
  File "/home/user/.local/lib/python2.7/site-packages/infoqscraper/main.py", line 357, in main
    module.main(infoq_client, args.module_args)
  File "/home/user/.local/lib/python2.7/site-packages/infoqscraper/main.py", line 193, in main
    return command.main(infoq_client, args.command_args)
  File "/home/user/.local/lib/python2.7/site-packages/infoqscraper/main.py", line 297, in main
    builder.create_presentation(output_path=output)
  File "/home/user/.local/lib/python2.7/site-packages/infoqscraper/presentation.py", line 270, in create_presentation
    frame_pattern = self._prepare_frames(jpg_slides)
  File "/home/user/.local/lib/python2.7/site-packages/infoqscraper/presentation.py", line 409, in _prepare_frames
    os.link(slides[slide_index], os.path.join(self.tmp_dir, "frame-{0:04d}." + ext).format(frame))
OSError: [Errno 31] Too many links
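
Errno 31 is `EMLINK`: the filesystem refuses to add another hard link to the same slide file. One possible workaround, shown as a sketch and not as the project's actual fix, is to fall back to copying the file when that limit is reached. `link_or_copy` and its `link` parameter are hypothetical names.

```python
import errno
import os
import shutil


def link_or_copy(src, dst, link=os.link):
    """Hard-link src to dst, copying instead when the link limit is hit."""
    try:
        link(src, dst)
    except OSError as exc:
        # Only the "too many links" case is handled; anything else
        # (missing file, permissions, ...) is still raised.
        if exc.errno != errno.EMLINK:
            raise
        shutil.copyfile(src, dst)
```

Copies cost more disk space than links, so the fallback is used only for frames that cannot be linked.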

Download fails with "Failed to create final movie as"

The installation went without problems. I installed v0.0.7 and all the dependencies today.

I am able to list presentations by topic with no issues.

However, download fails. I've tried security-grails-apps and emberjs-use-case.

I am on Ubuntu 12.04.

Please advise.

Scraping fails due to metadata changes

Found in version 0.1.5

As of March 2019, scraping presentations no longer works due to format changes in the presentation HTML page.

Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 33, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 374, in main
    return module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 194, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 314, in main
    builder.create_presentation()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 82, in create_presentation
    video = self.download_video()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 103, in download_video
    rvideo_path = self.presentation.metadata['video_path']
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 171, in metadata
    'title': get_title(pres_div),
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 91, in get_title
    return pres_div.find('h1', class_="general").div.get_text().strip()
AttributeError: 'NoneType' object has no attribute 'find'

In fact, the fields that scrap.py fails on are descriptive metadata that the main application does not use. Removing them allows the presentation to be downloaded correctly.
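
Instead of removing the fields, a tolerant lookup could keep them when the markup matches and skip them when it does not. The sketch below assumes that approach is acceptable; `optional_field` is a hypothetical helper, and the stub getter only imitates the failing `get_title` call from the traceback.

```python
def optional_field(getter, *args):
    """Call a scraping helper, returning None if the expected markup is missing."""
    try:
        return getter(*args)
    except (AttributeError, KeyError, IndexError):
        return None


def get_title_stub(pres_div):
    """Stand-in for a scraper helper that dereferences a missing element."""
    return pres_div.find("h1").get_text().strip()


# pres_div is None when the page layout changed, as in the report above.
metadata = {"title": optional_field(get_title_stub, None)}
```

Fields the download actually needs, such as `video_path`, should probably still fail loudly so that real breakage is not hidden.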

Only first slide visible.

I am on Arch Linux (fully updated) and followed the wiki when installing infoqscraper.

I ran the command infoqscraper presentation download The-Design-of-Datomic

Only the first slide is visible, nothing else. The audio is fine.

How to compile infoqscraper on CentOS 6.2

On my CentOS 6.2 system, the Python version is 2.6. I downloaded and installed ffmpeg, swftools, and rtmpdump. BeautifulSoup4-4.1.3 and html5lib-0.9.5 are installed in /usr/lib/python2.6/site-packages, whereas PIL is installed in /usr/lib64/python2.6.

When I run the sample command "infoqscraper presentation download Distributed-Systems-with-ZeroMQ-and-gevent",
it breaks in presentation.py:

File "/usr/lib/python2.6/site-packages/infoqscraper/presentation.py", line 326, in download_video_no_cache
subprocess.check_output(cmd, stderr=subprocess.STDOUT)
AttributeError: 'module' object has no attribute 'check_output'

Please, help.

URL not parsed correctly

Parsing this URL fails:

$ infoqscraper presentation download http://www.infoq.com/presentations/data-types-issues
Presentation http://www.infoq.com/presentations/data-types-issues not found. Please check your id or url

However, using the id works

$ infoqscraper presentation download data-types-issues
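
Accepting both forms could be as simple as reducing a URL to its last path segment before the lookup. This is a sketch under that assumption; `presentation_id` is a hypothetical helper and is written for Python 3.

```python
from urllib.parse import urlparse  # Python 2 has this in the `urlparse` module


def presentation_id(id_or_url):
    """Return the presentation id for either a bare id or a presentation URL."""
    path = urlparse(id_or_url).path
    # The id is the last non-empty path segment; a bare id is its own path.
    return path.rstrip("/").rsplit("/", 1)[-1]
```

Both command lines above would then resolve to the id `data-types-issues`.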

Integrate InfoQ's authentication

One of the big reasons I'd like to use this tool is to download videos that are not yet publicly available (e.g. Strangeloop 2013). It would be awesome if you could include a way to use your credentials to access these videos.

I apologize if there is already a way to do this that I am missing.

Feature suggestion: demos

This suggestion is for h264_overlay mode specifically

How about showing demos full screen when the presentation switches into full-screen player mode?

InfoQ stores the begin and end positions as pairs of numbers in the demoTimings variable.

The solution would be to create multiple slices of the presentation (slides and video, demo, slides and video, demo, etc.) and then concatenate them.

What it would change:

  • Encoding would need extra disk space, roughly 1x the compressed video size, to hold the intermediate videos. I believe the overall encoding time should not change.
  • It's not obvious to me whether the videos would concatenate seamlessly or whether some sound distortion would appear, but as a user I am ready to pay that price in favour of a full-screen demo instead of a tiny thing in the corner.

It might be implemented as an additional switch to keep BC.

If you like the idea and would accept it, I could spend a few evenings implementing it.
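
The slicing step described in this suggestion can be sketched as follows. The sketch assumes the demoTimings values have already been grouped into sorted, non-overlapping (begin, end) pairs expressed in the same unit as the total duration; `split_segments` is a hypothetical name.

```python
def split_segments(demo_timings, duration):
    """Split [0, duration] into alternating ("slides"|"demo", start, end) slices."""
    segments = []
    position = 0
    for begin, end in demo_timings:
        # Anything between the previous slice and this demo is slides and video.
        if begin > position:
            segments.append(("slides", position, begin))
        segments.append(("demo", begin, end))
        position = end
    # Trailing slides after the last demo, if any.
    if position < duration:
        segments.append(("slides", position, duration))
    return segments
```

Each tuple would then map to one intermediate video, and the list order gives the concatenation order.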

Unable to download old presentations

I get this error when I run:
$ python infoqmedia.py ejb-3

Traceback (most recent call last):
  File "infoqmedia.py", line 290, in <module>
    sys.exit(main())
  File "infoqmedia.py", line 277, in main
    jpeg=args.jpeg
  File "infoqmedia.py", line 89, in __init__
    self.name = self._getName()
  File "infoqmedia.py", line 107, in _getName
    assert groups
AssertionError

ffmpeg 2.0 and up support?

On Arch Linux, only ffmpeg 2.0.2 is available as a prebuilt package.
ffmpeg 1.2 must be built from source.

Do you plan to update infoqscraper?

ffmpeg version 1.2 Copyright (c) 2000-2013 the FFmpeg developers
  built on Oct 17 2013 18:20:07 with gcc 4.8.1 (GCC) 20130725 (prerelease)
  configuration: --prefix=/usr --disable-debug --disable-static --enable-avresample --enable-dxva2 --enable-fontconfig --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libfreetype --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-libv4l2 --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libxvid --enable-pic --enable-postproc --enable-runtime-cpudetect --enable-shared --enable-swresample --enable-vdpau --enable-version3 --enable-x11grab

ffmpeg version 2.0.2 Copyright (c) 2000-2013 the FFmpeg developers
  built on Oct  9 2013 20:28:06 with gcc 4.8.1 (GCC) 20130725 (prerelease)
  configuration: --prefix=/usr --disable-debug --disable-static --enable-avresample --enable-dxva2 --enable-fontconfig --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libfreetype --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-libv4l2 --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libxvid --enable-pic --enable-postproc --enable-runtime-cpudetect --enable-shared --enable-swresample --enable-vdpau --enable-version3 --enable-x11grab

  • infoqscraper -c presentation download rxjava-clojure
Traceback (most recent call last):
  File "/media/DW2/repos/infoqscraper/bin/infoqscraper", line 46, in <module>
    sys.exit(main.main())
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 355, in main
    module.main(infoq_client, args.module_args)
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 193, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 296, in main
    builder.create_presentation(output_path=output)
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 226, in create_presentation
    frame_pattern = self._prepare_frames(jpg_slides)
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 368, in _prepare_frames
    for remaining  in xrange(timecodes[slide_index], timecodes[slide_index+1]):
TypeError: 'NoneType' object has no attribute '__getitem__'

Maybe this is not related to ffmpeg after all, because I get the same error with ffmpeg 1.2:
/repos/infoqscraper/bin/infoqscraper -c presentation download -f /repos/FFmpeg/ffmpeg rxjava-clojure

  • infoqscraper presentation list -p agile
Traceback (most recent call last):
  File "/usr/bin/infoqscraper", line 46, in <module>
    sys.exit(main.main())
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 355, in main
    module.main(infoq_client, args.module_args)
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 193, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 243, in main
    self.__standard_output(summaries)
  File "/usr/lib/python2.7/site-packages/infoqscraper/main.py", line 251, in __standard_output
    for result in results:
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 48, in get_summaries
    for summary in rb.summaries():
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 447, in summaries
    return [create_summary(div) for div in videos]
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 442, in create_summary
    'date':  get_date(div),
  File "/usr/lib/python2.7/site-packages/infoqscraper/presentation.py", line 432, in get_date
    return datetime.datetime.strptime(str, "%b %d, %Y")
  File "/usr/lib/python2.7/_strptime.py", line 328, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains:  

virtualenv:
(infoqtest)> pip list
beautifulsoup4 (4.3.2)
html5lib (0.99)
infoqscraper (0.0.5)
PIL (1.1.7)
pip (1.4.1)
setuptools (0.9.8)
six (1.4.1)
wsgiref (0.1.2)

Installation log: https://www.refheap.com/16c0084f77ef33bc41d42dd42

Same errors on Ubuntu and Archlinux. Still testing.
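
The `ValueError: unconverted data remains` in the second traceback suggests that the scraped date string carries trailing characters, apparently whitespace, after the year. Assuming that is the cause, stripping the text before parsing would be enough. `parse_listing_date` is a hypothetical helper written for Python 3.

```python
import datetime


def parse_listing_date(text):
    """Parse a date such as "Oct 17, 2013", ignoring surrounding whitespace."""
    # On Python 3 text strings, strip() also removes non-breaking spaces.
    return datetime.datetime.strptime(text.strip(), "%b %d, %Y")
```

The `%b` directive depends on the current locale, so month names are matched in English only under an English or C locale.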

ac3 invalid bit rate / Error while opening encoder for output stream #0:1

I see this (running develop branch, 391a6cb), regardless of the presentation I try to download:

$ PYTHONPATH=. bin/infoqscraper -c presentation download latency-pitfalls
[ac3 @ 0x1e60380] invalid bit rate
Stream mapping:
  Stream #0.0 -> #0.0
  Stream #1.0 -> #0.1
Error while opening encoder for output stream #0.1 - maybe incorrect parameters such as bit_rate, rate, width or height

$ ffmpeg --version
ffmpeg version 0.8.6-6:0.8.6-0ubuntu0.12.10.1, Copyright (c) 2000-2013 the Libav developers
  built on Apr  2 2013 17:02:16 with gcc 4.7.2
*** THIS PROGRAM IS DEPRECATED ***
This program is only provided for compatibility and will be removed in a future release. Please use avconv instead.
Missing argument for option '-version'

If I patch infoqscraper to use avconv and print the commands it's running, I see the same:

$ PYTHONPATH=. bin/infoqscraper -c presentation download latency-pitfalls
['avconv', '-v', 'error', '-i', '/home/at/.cache/infoqscraper/resources/mp4:presentations/13-mar-hownottomeasure.mp4', '-vn', '-acodec', 'libvorbis', '/tmp/infoq_H2K7V/audio.ogg']
['avconv', '-v', 'error', '-f', 'image2', '-r', '1', '-i', '/tmp/infoq_H2K7V/frame-%04d.jpg', '-i', '/tmp/infoq_H2K7V/audio.ogg', u'latency-pitfalls.avi']
File 'latency-pitfalls.avi' already exists. Overwrite ? [y/N] y
[ac3 @ 0x7febc0] invalid bit rate
Error while opening encoder for output stream #0:1 - maybe incorrect parameters such as bit_rate, rate, width or height

$ avconv --version
avconv version 0.8.6-6:0.8.6-0ubuntu0.12.10.1, Copyright (c) 2000-2013 the Libav developers
  built on Apr  2 2013 17:02:16 with gcc 4.7.2
Missing argument for option '-version'

These files are in my cache on exit:

~/.cache/infoqscraper]$ ls -lR
.:
total 4
drwxrwxr-x 4 at at 4096 Apr 15 01:38 resources/

./resources:
total 8
drwxrwxr-x 3 at at 4096 Apr 15 01:35 http:/
drwxrwxr-x 2 at at 4096 Apr 15 01:38 mp4:presentations/

./resources/http::
total 4
drwxrwxr-x 4 at at 4096 Apr 15 01:39 www.infoq.com/

./resources/http:/www.infoq.com:
total 8
drwxrwxr-x 2 at at 4096 Apr 15 01:35 presentations/
drwxrwxr-x 3 at at 4096 Apr 15 01:39 resource/

./resources/http:/www.infoq.com/presentations:
total 120
-rw-rw-r-- 1 at at 119659 Apr 15 01:35 latency-pitfalls

./resources/http:/www.infoq.com/resource:
total 4
drwxrwxr-x 3 at at 4096 Apr 15 01:39 presentations/

./resources/http:/www.infoq.com/resource/presentations:
total 4
drwxrwxr-x 3 at at 4096 Apr 15 01:39 latency-pitfalls/

./resources/http:/www.infoq.com/resource/presentations/latency-pitfalls:
total 4
drwxrwxr-x 3 at at 4096 Apr 15 01:39 en/

./resources/http:/www.infoq.com/resource/presentations/latency-pitfalls/en:
total 4
drwxrwxr-x 2 at at 4096 Apr 15 01:40 slides/

./resources/http:/www.infoq.com/resource/presentations/latency-pitfalls/en/slides:
total 5376
-rw-rw-r-- 1 at at 114799 Apr 15 01:39 sl107.jpg
-rw-rw-r-- 1 at at 143761 Apr 15 01:39 sl113.jpg
-rw-rw-r-- 1 at at  69349 Apr 15 01:39 sl114.jpg
-rw-rw-r-- 1 at at 100379 Apr 15 01:39 sl118.jpg
-rw-rw-r-- 1 at at 152491 Apr 15 01:39 sl122.jpg
-rw-rw-r-- 1 at at  84513 Apr 15 01:39 sl124.jpg
-rw-rw-r-- 1 at at  83714 Apr 15 01:39 sl127.jpg
-rw-rw-r-- 1 at at  74841 Apr 15 01:39 sl128.jpg
-rw-rw-r-- 1 at at 124762 Apr 15 01:39 sl132.jpg
-rw-rw-r-- 1 at at 104150 Apr 15 01:39 sl139.jpg
-rw-rw-r-- 1 at at  98363 Apr 15 01:39 sl144.jpg
-rw-rw-r-- 1 at at  90747 Apr 15 01:39 sl147.jpg
-rw-rw-r-- 1 at at 132999 Apr 15 01:39 sl14.jpg
-rw-rw-r-- 1 at at 130418 Apr 15 01:39 sl152.jpg
-rw-rw-r-- 1 at at 113504 Apr 15 01:40 sl156.jpg
-rw-rw-r-- 1 at at 105197 Apr 15 01:40 sl157.jpg
-rw-rw-r-- 1 at at 131151 Apr 15 01:40 sl164.jpg
-rw-rw-r-- 1 at at  65131 Apr 15 01:40 sl165.jpg
-rw-rw-r-- 1 at at  66399 Apr 15 01:40 sl166.jpg
-rw-rw-r-- 1 at at 101768 Apr 15 01:40 sl167.jpg
-rw-rw-r-- 1 at at 113009 Apr 15 01:40 sl172.jpg
-rw-rw-r-- 1 at at 132766 Apr 15 01:40 sl175.jpg
-rw-rw-r-- 1 at at 153855 Apr 15 01:40 sl180.jpg
-rw-rw-r-- 1 at at  82394 Apr 15 01:40 sl181.jpg
-rw-rw-r-- 1 at at  64969 Apr 15 01:40 sl182.jpg
-rw-rw-r-- 1 at at 109958 Apr 15 01:40 sl183.jpg
-rw-rw-r-- 1 at at 115389 Apr 15 01:40 sl187.jpg
-rw-rw-r-- 1 at at  81848 Apr 15 01:40 sl198.jpg
-rw-rw-r-- 1 at at  81263 Apr 15 01:39 sl1.jpg
-rw-rw-r-- 1 at at 129598 Apr 15 01:40 sl200.jpg
-rw-rw-r-- 1 at at 130699 Apr 15 01:40 sl217.jpg
-rw-rw-r-- 1 at at  92413 Apr 15 01:40 sl218.jpg
-rw-rw-r-- 1 at at 120505 Apr 15 01:39 sl23.jpg
-rw-rw-r-- 1 at at  74123 Apr 15 01:39 sl2.jpg
-rw-rw-r-- 1 at at 101692 Apr 15 01:39 sl30.jpg
-rw-rw-r-- 1 at at 100445 Apr 15 01:39 sl34.jpg
-rw-rw-r-- 1 at at 172973 Apr 15 01:39 sl42.jpg
-rw-rw-r-- 1 at at 129803 Apr 15 01:39 sl45.jpg
-rw-rw-r-- 1 at at 103083 Apr 15 01:39 sl50.jpg
-rw-rw-r-- 1 at at 107785 Apr 15 01:39 sl58.jpg
-rw-rw-r-- 1 at at  87702 Apr 15 01:39 sl59.jpg
-rw-rw-r-- 1 at at 101025 Apr 15 01:39 sl60.jpg
-rw-rw-r-- 1 at at 103298 Apr 15 01:39 sl63.jpg
-rw-rw-r-- 1 at at  76239 Apr 15 01:39 sl64.jpg
-rw-rw-r-- 1 at at 125948 Apr 15 01:39 sl71.jpg
-rw-rw-r-- 1 at at 127684 Apr 15 01:39 sl77.jpg
-rw-rw-r-- 1 at at 105978 Apr 15 01:39 sl83.jpg
-rw-rw-r-- 1 at at 127793 Apr 15 01:39 sl8.jpg
-rw-rw-r-- 1 at at 124370 Apr 15 01:39 sl90.jpg
-rw-rw-r-- 1 at at 134006 Apr 15 01:39 sl96.jpg

./resources/mp4:presentations:
total 153328
-rw-rw-r-- 1 at at 157006234 Apr 15 01:38 13-mar-hownottomeasure.mp4

I'm running 64-bit Ubuntu 12.10.
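
Both version checks above fail with "Missing argument for option '-version'" because the option is spelled with a single dash (`ffmpeg -version`, `avconv -version`). The banner line still identifies the tool, and here it shows that the `ffmpeg` binary comes from Libav. A small sketch for extracting the tool name and version from such a banner follows; `parse_tool_version` is a hypothetical helper.

```python
import re

_VERSION_RE = re.compile(r"^(ffmpeg|avconv) version ([^\s,]+)")


def parse_tool_version(banner):
    """Return (tool, version) from the first line of a version banner, or None."""
    match = _VERSION_RE.match(banner)
    return match.groups() if match else None
```

A check like this could let the tool warn early when it runs against an encoder version it has not been tested with.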

Download fails with TypeError

This is cut-and-pasted from my shell:

 ~/Downloads $ infoqscraper presentation download learning-developer
Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 46, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 355, in main
    module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 193, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 296, in main
    builder.create_presentation(output_path=output)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/presentation.py", line 226, in create_presentation
    frame_pattern = self._prepare_frames(jpg_slides)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/presentation.py", line 368, in _prepare_frames
    for remaining  in xrange(timecodes[slide_index], timecodes[slide_index+1]):
TypeError: 'NoneType' object has no attribute '__getitem__'

I have tried several times and the result is always the same. I'm running Linux Mint 14. Any other information I should include?

Installed successfully, error on running

Mac OS X 10.8, Python 2.7

install log

successfully installed infoqscraper BeautifulSoup4 html5lib PIL
Cleaning up...

[17:04:55] > infoqscraper presentation list -n 20
-bash: infoqscraper: command not found
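
A "command not found" right after a successful install often means the directory where the script was placed is not on `PATH`, although the report does not confirm that. The sketch below gathers the relevant facts; `diagnose_command` is a hypothetical helper and requires Python 3.3 or later for `shutil.which`.

```python
import os
import shutil
import sysconfig


def diagnose_command(name):
    """Report where `name` resolves on PATH and where scripts get installed."""
    # Default scripts directory of this interpreter; --user installs use
    # a different location.
    scripts_dir = sysconfig.get_path("scripts")
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "found_at": shutil.which(name),
        "scripts_dir": scripts_dir,
        "scripts_dir_on_path": scripts_dir in path_dirs,
    }
```

Calling `diagnose_command("infoqscraper")` with the interpreter used for the install shows whether its scripts directory needs to be added to `PATH`.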
