isaacbernat / netflix-to-srt

Rip, extract and convert subtitles to .srt closed captions from .xml/dfxp/ttml and .vtt/WebVTT (e.g. Netflix, YouTube)

License: MIT License

Python 100.00%

netflix srt dfxp subtitle subtitles python python3 regex closed-captions closed-captioning

netflix-to-srt's Issues

converted vtt file contains text format in srt output

source file
https://gist.github.com/darodi/c95ba5592933d11f8963626944ea4735#file-je-suis-la-2853734-fra-vtt

target file
https://gist.github.com/darodi/c95ba5592933d11f8963626944ea4735#file-je-suis-la-2853734-fra-vtt-srt

As you can see, the srt output contains

1
00:00:07,120 --> 00:00:09,480 
<c.magenta.bg_black>Musique douce</c>

instead of

1
00:00:07,120 --> 00:00:09,480 
<font color="#ff00ff">Musique douce</font>
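One way the requested conversion could look is sketched below. This is not the project's code: the function name `vtt_classes_to_font`, the `VTT_COLORS` table, and the choice to drop background classes such as `bg_black` are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from WebVTT colour class names to hex values.
VTT_COLORS = {
    "white": "#ffffff",
    "red": "#ff0000",
    "lime": "#00ff00",
    "yellow": "#ffff00",
    "blue": "#0000ff",
    "cyan": "#00ffff",
    "magenta": "#ff00ff",
    "black": "#000000",
}

# Matches <c>, <c.magenta>, <c.magenta.bg_black> and so on (not nested).
CLASS_SPAN = re.compile(r"<c((?:\.[\w-]+)*)>(.*?)</c>", re.DOTALL)


def vtt_classes_to_font(text):
    """Replace WebVTT class spans with <font color=...> tags."""
    def replace(match):
        classes = [name for name in match.group(1).split(".") if name]
        for name in classes:
            if name in VTT_COLORS:
                return '<font color="{}">{}</font>'.format(
                    VTT_COLORS[name], match.group(2))
        # No known foreground colour: keep only the text.
        return match.group(2)
    return CLASS_SPAN.sub(replace, text)


print(vtt_classes_to_font("<c.magenta.bg_black>Musique douce</c>"))
# <font color="#ff00ff">Musique douce</font>
```

Background classes are ignored here because the expected output in this report only carries a foreground colour.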

still working?

I tried to find a ?o= entry, but none of the entries I see includes subtitles.

There is no line starting with ?o= in the developer console.

Steps:

Open a Netflix title with subtitles
Follow these instructions: https://github.com/isaacbernat/netflix-to-srt#from-netflix-method-1
There is no component starting with ?o=

No ?o= file & instructions clarification needed

Hiya,

When following the instructions in the readme, I'm not finding any files starting with ?o= in the Network tab or chrome://net-internals. There are many with that text in the names, but none with it at the start. Do you know if Netflix has changed their naming method or why this might be the case? I know some shows use image-based subtitles so perhaps if that's the case, there's no relevant xml file that even exists (and thus nothing's coming up when looking for it…)

Also, regarding Method 2, I'm not sure what "Start AdblockPlus, open blockable items" means. I click on the ABP menu bar icon… and then what? I can't figure out what to do – there's nothing called "open blockable items" or anything even similar in that menu, or in the add-on options, etc.

Thanks!

Foreign languages

I'm trying to extract the Korean subtitles from Netflix Korea (English subs work fine), and there are multiple ?o= lines in the Network tab in Chrome. There are no search results for dfxp in Firefox either. Is there any workaround?

<span>s not converted to <i>

When converting download.xml.txt, the italic spans (here style2) aren't converted properly. The current input looks like

<p begin="7304400001t" end="7356800000t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle195"><span style="style1">Mijn zus vertelde me dat ze een</span><br/><span style="style1">andere </span><span style="style2">aspie</span><span style="style1"> kende van de kerk...</span></p>

which is converted to

196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <span style="style2">aspie kende van de kerk...

where I would expect the output to be

196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <i>aspie</i> kende van de kerk...
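The expected behaviour could be sketched as follows. This is an illustration only, not the script's implementation: the function name `spans_to_italics` is hypothetical, and the set of italic style ids is passed in by hand rather than read from the file.

```python
import re

# Matches one non-nested <span style="...">...</span> pair.
SPAN = re.compile(r'<span style="([^"]+)">(.*?)</span>', re.DOTALL)


def spans_to_italics(line, italic_styles):
    """Turn spans whose style id is italic into <i>...</i>; unwrap the rest."""
    def replace(match):
        style_id, content = match.group(1), match.group(2)
        if style_id in italic_styles:
            return "<i>{}</i>".format(content)
        return content
    return SPAN.sub(replace, line)


line = ('<span style="style1">andere </span><span style="style2">aspie</span>'
        '<span style="style1"> kende van de kerk...</span>')
print(spans_to_italics(line, {"style2"}))
# andere <i>aspie</i> kende van de kerk...
```

Handling the opening and closing tag as one pair avoids the half-converted output shown above, where an opening span survives but its closing tag is gone.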

Korean Subtitles are not dfxp

I'm not sure if it's Korean only, but there is no dfxp stream. I've read on Reddit that it's either encrypted or the subs are in some PNG file. Do you know anything about that?

ValueError: invalid literal for int() with base 10: ''

I receive errors when trying to convert a dfxp file. I'm not sure exactly how the XML should look, but it is a well formed XML document.

only .vtt files

All the script downloads now are vtt files, no .srt or .xml files.

Has something changed on Netflix's end that has not been incorporated into the script?
Do I have an old version?

One more bug with 's

This
So as of last week, Terrace House began its new season.
produced
8
00:00:27,569 --> 00:00:32,532
Terrace HouseSo as of last week,
began its new season.
but should be
8
00:00:27,569 --> 00:00:32,532
So as of last week,
Terrace House began its new season.

As a temporary solution I strip all 's using Notepad++ regex search:
([^<]+)
replacing it with $1.

Example: sample.xml.txt

Python error when the time value has less than 8 numbers

This is the error:

Traceback (most recent call last):
  File "to_srt.py", line 99, in <module>
    main()
  File "to_srt.py", line 96, in main
    f.write(to_srt(text))
  File "to_srt.py", line 74, in to_srt
    append_subs(start, end, prev_content, fmt_t)
  File "to_srt.py", line 27, in append_subs
    "start_time": convert_time(start) if format_time else start,
  File "to_srt.py", line 13, in convert_time
    if int(raw_time) == 0:
ValueError: invalid literal for int() with base 10: ''

dfxp example:

En capítulos anteriores de Doce Monos:

edit:

This code handles a raw_time string of any length:

import math

def convert_time(raw_time):
    # raw_time is a string of digits; the last 7 digits are the fraction of
    # a second. leading_zeros() is assumed to be the padding helper that
    # to_srt.py already defines.
    raw_time_len = len(raw_time)
    # only interested in milliseconds, let's drop the additional digits
    if raw_time_len > 4:
        ms = leading_zeros(int(raw_time[:-4]) % 1000, 3)
    else:
        ms = '000'
    # anything left of the last 7 digits is whole seconds
    if raw_time_len > 7:
        time_in_seconds = int(raw_time[:-7])
    else:
        time_in_seconds = 0

    second = leading_zeros(time_in_seconds % 60)
    minute = leading_zeros(int(math.floor(time_in_seconds / 60)) % 60)
    hour = leading_zeros(int(math.floor(time_in_seconds / 3600)))
    return "{}:{}:{},{}".format(hour, minute, second, ms)

The script cuts everything after

Hello, thanks for a great script!

I've noticed that if the script sees lines like this:
Terrace House is a show about six strangers, men and women,
or
Terrace House has now been revived by Netflix.

It converts them to:
3
00:00:16,183 --> 00:00:19,394
Terrace House
or
8
00:00:30,780 --> 00:00:36,328
Terrace House

So the part after  is stripped.

Example file: sample.xml.txt

Extra digits on ms

Could you add an option so that all the ms digits are included? If you have a subtitle file with, say, 350 "points", it can be skewed by a total of 0.0009 x 350 = 0.315 seconds at the end if you don't include all the digits.
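SRT timestamps only carry three fractional digits, so one option short of a new format is to round to the nearest millisecond instead of dropping the extra digits. The sketch below is not the script's `convert_time`. The name `ticks_to_srt_time` is hypothetical, and the rate of 10,000,000 ticks per second is an assumption inferred from the sample in an earlier report, where `7304400001t` became `00:12:10,440`. Each timestamp is converted on its own from an absolute value, so in this sketch the error stays within half a millisecond per timestamp.

```python
TICKS_PER_SECOND = 10_000_000  # assumption, inferred from the sample above


def ticks_to_srt_time(raw_time, ticks_per_second=TICKS_PER_SECOND):
    """Convert a tick count such as '7304400001t' to an SRT timestamp,
    rounding to the nearest millisecond."""
    ticks = int(raw_time.rstrip("t") or 0)  # an empty string counts as zero
    # Integer arithmetic only: add half a tick-per-ms before dividing.
    total_ms = (ticks * 1000 + ticks_per_second // 2) // ticks_per_second
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, ms = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)


print(ticks_to_srt_time("7304400001t"))  # 00:12:10,440
print(ticks_to_srt_time("9999t"))        # 00:00:00,001
```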

webvtt format support

Netflix now supplies a new "webvtt-lssdh-ios" format, for example for Chinese subtitles.

Convert html entities

Hello, maybe you could take HTML entities in vtt files into account. A quick and dirty solution is to change the write call to:
f.write(to_srt(html.unescape(text), fn[-4:]))

Sorry, I'm new to GitHub and I don't know how to propose a code change yet :)
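For reference, this is what `html.unescape` from the Python 3 standard library does to cue text. The wrapper name `unescape_cue_text` and the sample string are made up for this sketch.

```python
import html


def unescape_cue_text(text):
    """Replace HTML character references (&amp;, &lt;, &#39;, ...) with
    the characters they stand for."""
    return html.unescape(text)


print(unescape_cue_text("Tom &amp; Jerry &lt;3 &#39;quoted&#39;"))
# Tom & Jerry <3 'quoted'
```

One caveat with unescaping the whole file before conversion: an escaped `&lt;i&gt;` becomes a real `<i>` tag, so running the unescape step after tag handling may be the safer order.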

Subtitles images format

Is there any solution for converting subtitle images to srt?
For instance, detecting the text in the images with Tesseract OCR and converting it to srt,

for example with Arabic.

span-tags are not removed properly

Thanks for the nice script, I like it.

I have a subtitle like the attached sample where the span tags are not removed properly.
For the first line I can easily change the search pattern in the code, but for the second line only the outer tags get removed.

sample.txt

Support for grabbing subs with positioning (SSA format)

AttributeError: 'Namespace' object has no attribute 'ouput'

I'm getting the following error message. The file I'm using is attached, I had to save it as a text edit file since github cannot handle XML files. I put the file in the same directory as to_srt.py.
download.txt

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 167, in <module>
    main()
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 162, in main
    with codecs.open("{}/{}.srt".format(a.ouput, fn), 'wb', "utf-8") as f:
AttributeError: 'Namespace' object has no attribute 'ouput'
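The traceback points at a misspelled attribute, `a.ouput`. A minimal sketch of why that fails, assuming the option is declared as `--output` (the flag names, defaults, and `build_parser` below are hypothetical, not copied from to_srt.py):

```python
import argparse


def build_parser():
    # Hypothetical options; argparse names the attribute after the long flag.
    parser = argparse.ArgumentParser(description="Convert subtitles to .srt")
    parser.add_argument("-i", "--input", default=".")
    parser.add_argument("-o", "--output", default=".")
    return parser


args = build_parser().parse_args(["-o", "out"])
print(args.output)              # out
print(hasattr(args, "ouput"))   # False: the misspelled name does not exist
```

Under that assumption, changing `a.ouput` to `a.output` on the line shown in the traceback would be the fix.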

Raises exception for vtt file: ".vtt format must start with WEBVTT, wrong file?"

Tested on subtitle file attached, even though the file starts with "WEBVTT" it still throws the exception on line 70.

Vikings S04E01.zip
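One possible cause, offered as a guess since the attachment has not been inspected here, is a UTF-8 byte order mark (BOM) in front of the `WEBVTT` signature, which makes a plain "starts with" check fail. A sketch of a BOM-tolerant check, with the hypothetical name `starts_with_webvtt`:

```python
import codecs


def starts_with_webvtt(raw_bytes):
    """Check the WEBVTT signature, tolerating a UTF-8 BOM and leading
    whitespace."""
    # The "utf-8-sig" codec drops a leading BOM if one is present.
    text = raw_bytes.decode("utf-8-sig")
    return text.lstrip().startswith("WEBVTT")


print(starts_with_webvtt(codecs.BOM_UTF8 + b"WEBVTT\n\n"))  # True
print(starts_with_webvtt(b"1\n00:00:01,000 --> 00:00:02,000\n"))  # False
```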

New Netflix use of span

Hi!

As of a few days ago, Netflix is using span for all lines, including non-italic ones.

This is an example:

<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>

Or another example:

<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>
</styling>

With the first example, all lines are marked as italic. With the second example, the following lines are produced once converted to srt:

00:06:02,958 --> 00:06:04,333
<i>-No podemos por ahí.</i>
<span style="style5">¿Por?</i>

62
00:06:06,292 --> 00:06:08,667
<i>¡Es por aquí!</i>
<span style="style1">Tenemos que rodearla.</i>

63
00:06:09,708 --> 00:06:11,292
<i>¿Y cuánto se tarda?</i>
<span style="style1">Media hora.</i>

Thanks
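A possible way to tell italic spans from the rest is to read the style definitions first and only treat ids with `tts:fontStyle="italic"` as italic. The sketch below is one approach under that assumption, not the script's current logic, and `italic_style_ids` is a hypothetical name.

```python
import re

# An opening or self-closing <style ...> tag, and attribute="value" pairs.
STYLE_TAG = re.compile(r"<style\b[^>]*>")
ATTRIBUTE = re.compile(r'([\w:]+)="([^"]*)"')


def italic_style_ids(header):
    """Return the xml:id of every <style> that sets tts:fontStyle="italic"."""
    ids = set()
    for tag in STYLE_TAG.findall(header):
        attrs = dict(ATTRIBUTE.findall(tag))
        if attrs.get("tts:fontStyle") == "italic" and "xml:id" in attrs:
            ids.add(attrs["xml:id"])
    return ids


first_example = '''
<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>
'''
print(italic_style_ids(first_example))  # {'style2'}
```

For the second example, which defines no italic style, the function returns an empty set, so no span would be converted to `<i>`.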