
isaacbernat / netflix-to-srt


Rip, extract and convert subtitles to .srt closed captions from .xml/dfxp/ttml and .vtt/WebVTT (e.g. Netflix, YouTube)

License: MIT License

Python 100.00%
netflix srt dfxp subtitle subtitles python python3 regex closed-captions closed-captioning

netflix-to-srt's People

Contributors

isaacbernat, lbrayner, llbaker302, mlmlte, rakochi, snoymy, uukrull


netflix-to-srt's Issues

No ?o= file & instructions clarification needed

Hiya,

When following the instructions in the readme, I'm not finding any files starting with ?o= in the Network tab or chrome://net-internals. There are many with that text in the names, but none with it at the start. Do you know if Netflix has changed their naming method or why this might be the case? I know some shows use image-based subtitles so perhaps if that's the case, there's no relevant xml file that even exists (and thus nothing's coming up when looking for it…)

Also, regarding Method 2, I'm not sure what "Start AdblockPlus, open blockable items" means. I click on the ABP menu bar icon… and then what? I can't figure out what to do – there's nothing called "open blockable items" or anything even similar in that menu, or in the add-on options, etc.

Thanks!

Foreign languages

I'm trying to extract the Korean subtitles from Netflix Korea (English subs work fine), but there are multiple ?o= lines in the Network tab in Chrome. Searching for dfxp in Firefox gives no results either. Is there any workaround?

<span>s not converted to <i></i>

When trying download.xml.txt the italics spans (here style2) aren't converted properly. The current input looks like

<p begin="7304400001t" end="7356800000t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle195"><span style="style1">Mijn zus vertelde me dat ze een</span><br/><span style="style1">andere </span><span style="style2">aspie</span><span style="style1"> kende van de kerk...</span></p>

which is converted to

196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <span style="style2">aspie kende van de kerk...

where I would expect the output to be

196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <i>aspie</i> kende van de kerk...
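
A minimal sketch of the expected behaviour, assuming the ids of the italic styles are already known (style2 in this file). The names ITALIC_STYLES and convert_spans are hypothetical and are not taken from to_srt.py. Nested spans are not handled.

```python
import re

# Assumption: the italic style ids are known up front ("style2" in the
# example above). In a real file they are declared in the <style> header.
ITALIC_STYLES = {"style2"}

SPAN_RE = re.compile(r'<span style="([^"]+)">(.*?)</span>', re.DOTALL)


def convert_spans(line, italic_styles=ITALIC_STYLES):
    """Wrap italic spans in <i>...</i> and unwrap every other span."""
    def replace(match):
        style, text = match.group(1), match.group(2)
        return "<i>{}</i>".format(text) if style in italic_styles else text
    return SPAN_RE.sub(replace, line)


sample = ('<span style="style1">andere </span><span style="style2">aspie</span>'
          '<span style="style1"> kende van de kerk...</span>')
print(convert_spans(sample))  # andere <i>aspie</i> kende van de kerk...
```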

Korean Subtitles are not dfxp

I'm not sure if it's Korean only, but there is no dfxp stream. I've read on Reddit that it's either encrypted or the subs are in some PNG file. Do you know anything about that?

only .vtt files

All the script downloads now are .vtt files, no .srt or .xml files.

Has something changed on Netflix's end that has not been incorporated into the script?
Do I have an old version?

One more bug with <span>'s

This
<p begin="275695420t" end="325325000t" region="bottomCenter" style="s1" xml:id="subtitle7">So as of last week,<br/><span style="s1_1">Terrace House</span> began its new season.</p>
produced
8
00:00:27,569 --> 00:00:32,532
Terrace HouseSo as of last week,
began its new season.
but should be
8
00:00:27,569 --> 00:00:32,532
So as of last week,
Terrace House began its new season.

As a temporary solution I strip all <span>'s using Notepad++ regex search:
<span style="[^"]+">([^<]+)</span>
replacing it with $1.

Example: sample.xml.txt
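
The same workaround can be sketched in Python with the regex from the Notepad++ search. This is only an illustration of unwrapping spans in place so the word order is kept. The helper name p_to_text is hypothetical and this is not the script's actual parsing code.

```python
import re

# Same pattern as the Notepad++ search above; "\1" plays the role of "$1".
SPAN_RE = re.compile(r'<span style="[^"]+">([^<]+)</span>')


def p_to_text(p_element):
    """Flatten one <p> element to plain subtitle text, keeping word order."""
    text = SPAN_RE.sub(r"\1", p_element)      # unwrap spans where they stand
    text = re.sub(r"<br\s*/?>", "\n", text)   # line breaks become newlines
    return re.sub(r"</?p[^>]*>", "", text)    # drop the <p ...> wrapper


sample = ('<p begin="275695420t" end="325325000t" region="bottomCenter" '
          'style="s1" xml:id="subtitle7">So as of last week,<br/>'
          '<span style="s1_1">Terrace House</span> began its new season.</p>')
print(p_to_text(sample))
```

Because the span is replaced where it occurs instead of being extracted first, "Terrace House" stays on the second line.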

Python error when the time value has less than 8 numbers

This is the error:

Traceback (most recent call last):
  File "to_srt.py", line 99, in <module>
    main()
  File "to_srt.py", line 96, in main
    f.write(to_srt(text))
  File "to_srt.py", line 74, in to_srt
    append_subs(start, end, prev_content, fmt_t)
  File "to_srt.py", line 27, in append_subs
    "start_time": convert_time(start) if format_time else start,
  File "to_srt.py", line 13, in convert_time
    if int(raw_time) == 0:
ValueError: invalid literal for int() with base 10: ''

dfxp example:

<p begin="6673332t" end="27026999t" xml:id="subtitle0"><span style="normal_1">En capítulos anteriores</span><br/><span style="normal_1">de </span>Doce Monos<span style="normal_1">:</span></p>

edit:

This code handles a raw_time string of any length:

def convert_time(raw_time):
    raw_time_len = len(raw_time)
    # only interested in milliseconds, let's drop the additional digits
    if raw_time_len > 4:
        ms = leading_zeros(int(raw_time[:-4]) % 1000, 3)
    else:
        ms = '000'
    if raw_time_len > 7:
        time_in_seconds = int(raw_time[:-7])
    else:
        time_in_seconds = 0

    second = leading_zeros(time_in_seconds % 60)
    minute = leading_zeros(int(math.floor(time_in_seconds / 60)) % 60)
    hour = leading_zeros(int(math.floor(time_in_seconds / 3600)))
    return "{}:{}:{},{}".format(hour, minute, second, ms)
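
An alternative sketch that avoids string slicing altogether and uses integer arithmetic instead. It assumes 10,000,000 ticks per second, which matches the examples in these issues (7304400001t becomes 00:12:10,440), although in general the rate is declared in the file itself. The name ticks_to_srt_time and its parameter are hypothetical.

```python
def ticks_to_srt_time(raw_time, ticks_per_second=10_000_000):
    """Convert a tick count such as "6673332t" to an SRT timestamp."""
    # Tolerate a trailing "t" and an empty string (treated as zero).
    ticks = int(raw_time.rstrip("t") or 0)
    # Truncate to whole milliseconds; extra digits are dropped.
    total_ms = ticks * 1000 // ticks_per_second
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, ms = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)


print(ticks_to_srt_time("6673332t"))    # 00:00:00,667
print(ticks_to_srt_time("7304400001"))  # 00:12:10,440
```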

The script cuts everything after </span>

Hello, thanks for a great script!

I've noticed that if the script sees lines like this:
<p begin="161831670t" end="193943750t" region="topCenter" style="s1" xml:id="subtitle2"><span style="s1_1">Terrace House</span> is a show about<br/>six strangers, men and women,</p>
or
<p begin="307807500t" end="363282920t" region="bottomCenter" style="s1" xml:id="subtitle7"><span style="s1_1">Terrace House</span> has now<br/>been revived by Netflix.</p>

It converts them to:
3
00:00:16,183 --> 00:00:19,394
Terrace House
or
8
00:00:30,780 --> 00:00:36,328
Terrace House

So the part after </span> is stripped.

Example file: sample.xml.txt

Extra digits on ms

Could you add an option so that all the ms digits are included? If you have a subtitle file with, say, 350 "points", it can be skewed by a total of 0.0009 x 350 = 0.315 seconds at the end if you don't include all the digits.

Convert html entities

Hello, maybe you could take HTML entities in .vtt files into account. A quick and dirty solution is to change the write call to:
f.write(to_srt(html.unescape(text), fn[-4:]))

Sorry, I'm new to GitHub and I don't know how to propose a code change yet :)
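
For reference, a small illustration of what html.unescape does to cue text. The sample string is made up for the example.

```python
import html

# Hypothetical cue text containing named HTML entities.
cue_text = "Rock &amp; roll, caf&eacute; &quot;Luna&quot;"
unescaped = html.unescape(cue_text)
print(unescaped)  # Rock & roll, café "Luna"
```

Note that html.unescape also turns &lt; and &gt; into < and >, so applying it after the tags have been processed may be safer than applying it to the whole file.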

Subtitles images format

Is there any solution for converting image-based subtitles to srt?
For example, detecting the text in the images with Tesseract OCR and converting it to srt, e.g. for Arabic.

span-tags are not removed properly

Thanks for the nice script, I like it.

I have a subtitle file like the attached sample in which the span tags are not removed properly.
For the first line I can easily change the search pattern in the code, but for the second line only the outer tags get removed.

sample.txt

AttributeError: 'Namespace' object has no attribute 'ouput'

I'm getting the following error message. The file I'm using is attached; I had to save it as a text file since GitHub cannot handle XML files. I put the file in the same directory as to_srt.py.
download.txt

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 167, in <module>
    main()
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 162, in main
    with codecs.open("{}/{}.srt".format(a.ouput, fn), 'wb', "utf-8") as f:
AttributeError: 'Namespace' object has no attribute 'ouput'
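
The traceback points at a.ouput, which looks like a misspelling of a.output. A minimal argparse sketch of why that fails follows. The flag names here are assumptions for illustration and may not match the real script's options.

```python
import argparse


def build_parser():
    # Hypothetical flags for illustration; the real script may differ.
    parser = argparse.ArgumentParser(description="Convert subtitles to .srt")
    parser.add_argument("-i", "--input", default=".", help="input directory")
    parser.add_argument("-o", "--output", default=".", help="output directory")
    return parser


args = build_parser().parse_args(["-o", "subs"])
print(args.output)             # attribute name is derived from "--output"
print(hasattr(args, "ouput"))  # False: the misspelled name is never created
```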

New Netflix use of span

Hi!

As of a few days ago, Netflix is using span for all lines, including non-italic ones.

This is an example:

<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>

Or another example:

<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>
</styling>

With the first example, all lines are marked as italic. With the second example, the conversion to srt produces these lines:

00:06:02,958 --> 00:06:04,333
<i>-No podemos por ahí.</i>
<span style="style5">¿Por?</i>

62
00:06:06,292 --> 00:06:08,667
<i>¡Es por aquí!</i>
<span style="style1">Tenemos que rodearla.</i>

63
00:06:09,708 --> 00:06:11,292
<i>¿Y cuánto se tarda?</i>
<span style="style1">Media hora.</i>

Thanks
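
One possible approach, sketched under the assumption that a style is italic only when its <style> element declares tts:fontStyle="italic". The helper name italic_style_ids is hypothetical and this is not how the script currently works.

```python
import re

STYLE_RE = re.compile(r"<style\b([^>]*)/?>")
ID_RE = re.compile(r'xml:id="([^"]+)"')


def italic_style_ids(header):
    """Return ids of <style> elements that declare tts:fontStyle="italic"."""
    ids = set()
    for attrs in STYLE_RE.findall(header):
        match = ID_RE.search(attrs)
        if match and 'tts:fontStyle="italic"' in attrs:
            ids.add(match.group(1))
    return ids


first = '''<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>'''
second = '''<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="yellow" xml:id="style5"/>
</styling>'''
print(italic_style_ids(first))   # {'style2'}
print(italic_style_ids(second))  # set()
```

Spans whose style id is not in the returned set would then be unwrapped instead of being converted to <i>.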

Accent marks problem

Hi,
today I tried to convert a Netflix .xml sub to a regular .srt sub; however, I noticed that the script doesn't convert accent marks correctly (à, è, ì, ò, ù...)
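
The cause is not stated in this report. One common source of this kind of problem is reading or writing the file without an explicit encoding, so here is a sketch of a UTF-8 round trip. The file name and sample text are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical subtitle text containing accented characters.
text = "1\n00:00:01,000 --> 00:00:02,000\nPerché è così?\n"

path = os.path.join(tempfile.mkdtemp(), "sample.srt")
# Name the encoding explicitly on both write and read so that the
# platform default (which may not be UTF-8) is never used.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == text)  # True
```

If the accents survive this round trip but not the conversion, the player or editor used to open the .srt may be assuming a different encoding.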
