isaacbernat / netflix-to-srt
Rip, extract and convert subtitles to .srt closed captions from .xml/dfxp/ttml and .vtt/WebVTT (e.g. Netflix, YouTube)
License: MIT License
Converted vtt file keeps WebVTT text formatting tags in the srt output
source file
https://gist.github.com/darodi/c95ba5592933d11f8963626944ea4735#file-je-suis-la-2853734-fra-vtt
target file
https://gist.github.com/darodi/c95ba5592933d11f8963626944ea4735#file-je-suis-la-2853734-fra-vtt-srt
As you can see, the output contains
1
00:00:07,120 --> 00:00:09,480
<c.magenta.bg_black>Musique douce</c>
instead of
1
00:00:07,120 --> 00:00:09,480
<font color="#ff00ff">Musique douce</font>
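One possible way to get the expected output is to rewrite WebVTT class spans as font tags. This is a minimal sketch, not the project's code: the function name vtt_classes_to_font and the VTT_COLORS table are hypothetical, and the table only covers the default WebVTT colour class names. Background classes such as bg_black are dropped because SRT font tags have no background attribute.

```python
import re

# Assumed mapping from WebVTT default colour classes to hex values.
VTT_COLORS = {
    "white": "#ffffff",
    "lime": "#00ff00",
    "cyan": "#00ffff",
    "red": "#ff0000",
    "yellow": "#ffff00",
    "magenta": "#ff00ff",
    "blue": "#0000ff",
    "black": "#000000",
}

# Matches <c>, <c.magenta>, <c.magenta.bg_black>, ... and the enclosed text.
C_TAG = re.compile(r"<c((?:\.[\w-]+)*)>(.*?)</c>", re.DOTALL)


def vtt_classes_to_font(text):
    """Replace <c.colour...> spans with <font color="..."> tags (hypothetical helper)."""

    def repl(match):
        classes = match.group(1).split(".")[1:]  # drop the empty piece before the first dot
        for name in classes:
            if name in VTT_COLORS:
                return '<font color="{}">{}</font>'.format(VTT_COLORS[name], match.group(2))
        # No known colour class: keep only the text.
        return match.group(2)

    return C_TAG.sub(repl, text)
```

Applied to the cue above, vtt_classes_to_font("<c.magenta.bg_black>Musique douce</c>") yields the font tag shown in the expected output.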
I tried to find ?0= but none of the files includes subtitles
Steps:
?o=
Hiya,
When following the instructions in the readme, I'm not finding any files starting with ?o= in the Network tab or chrome://net-internals. There are many with that text in the names, but none with it at the start. Do you know if Netflix has changed their naming method or why this might be the case? I know some shows use image-based subtitles so perhaps if that's the case, there's no relevant xml file that even exists (and thus nothing's coming up when looking for it…)
Also, regarding Method 2, I'm not sure what "Start AdblockPlus, open blockable items" means. I click on the ABP menu bar icon… and then what? I can't figure out what to do – there's nothing called "open blockable items" or anything even similar in that menu, or in the add-on options, etc.
Thanks!
Trying to extract the Korean subtitles from Netflix Korea (English subs work fine) and there are multiple ?0= lines on the network tab in chrome. No search results for dfxp in Firefox either. Any workaround?
When trying download.xml.txt the italics spans (here style2) aren't converted properly. The current input looks like
<p begin="7304400001t" end="7356800000t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle195"><span style="style1">Mijn zus vertelde me dat ze een</span><br/><span style="style1">andere </span><span style="style2">aspie</span><span style="style1"> kende van de kerk...</span></p>
which is converted to
196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <span style="style2">aspie kende van de kerk...
where I would expect the output to be
196
00:12:10,440 --> 00:12:15,680
Mijn zus vertelde me dat ze een
andere <i>aspie</i> kende van de kerk...
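The expected conversion can be sketched with a single regular expression. This is only an illustration under stated assumptions: the helper name convert_spans is hypothetical, and the set of italic style ids (here just style2, as in this sample) has to be supplied by the caller.

```python
import re

# A span with a style attribute and no nested tags handled specially.
SPAN = re.compile(r'<span style="([^"]+)">(.*?)</span>', re.DOTALL)


def convert_spans(line, italic_styles=("style2",)):
    """Turn italic spans into <i>...</i> and unwrap all other spans (hypothetical helper)."""

    def repl(match):
        style, inner = match.group(1), match.group(2)
        if style in italic_styles:
            return "<i>{}</i>".format(inner)
        return inner

    return SPAN.sub(repl, line)
```

For the second line of the cue above, convert_spans turns the three spans into: andere <i>aspie</i> kende van de kerk...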
I'm not sure if it's Korean only, but there is no dfxp stream. I've read on reddit that it's either encrypted or the subs are in some png file. Do you know anything about that?
All the script downloads now are vtt files, no .srt or .xml files.
Has something changed on Netflix's end that has not been incorporated into the script?
Do I have an old version?
This
<p begin="275695420t" end="325325000t" region="bottomCenter" style="s1" xml:id="subtitle7">So as of last week,<br/><span style="s1_1">Terrace House</span> began its new season.</p>
produced
8
00:00:27,569 --> 00:00:32,532
Terrace HouseSo as of last week,
began its new season.
but should be
8
00:00:27,569 --> 00:00:32,532
So as of last week,
Terrace House began its new season.
As a temporary solution I strip all <span>'s using a Notepad++ regex search:
<span style="[^"]+">([^<]+)</span>
replacing it with $1.
Example: sample.xml.txt
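The same workaround can be applied from Python before running the converter. A minimal sketch, assuming the input is the raw XML text; the function name strip_spans is hypothetical. Python's re module writes the replacement group as \1 where Notepad++ uses $1.

```python
import re

# Same pattern as the Notepad++ search above.
SPAN_RE = re.compile(r'<span style="[^"]+">([^<]+)</span>')


def strip_spans(xml_text):
    """Remove styled <span> wrappers and keep only their text (hypothetical helper)."""
    return SPAN_RE.sub(r"\1", xml_text)
```

Like the Notepad++ search, this discards the style information, so italics are lost.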
This is the error:
Traceback (most recent call last):
  File "to_srt.py", line 99, in <module>
    main()
  File "to_srt.py", line 96, in main
    f.write(to_srt(text))
  File "to_srt.py", line 74, in to_srt
    append_subs(start, end, prev_content, fmt_t)
  File "to_srt.py", line 27, in append_subs
    "start_time": convert_time(start) if format_time else start,
  File "to_srt.py", line 13, in convert_time
    if int(raw_time) == 0:
ValueError: invalid literal for int() with base 10: ''
dfxp example:
<p begin="6673332t" end="27026999t" xml:id="subtitle0"><span style="normal_1">En capítulos anteriores</span><br/><span style="normal_1">de </span>Doce Monos<span style="normal_1">:</span></p>
edit:
This code takes care of whatever the length of the raw_time string is:
# Drop-in replacement; relies on `import math` and the leading_zeros() helper already in to_srt.py.
def convert_time(raw_time):
    raw_time_len = len(raw_time)
    # only interested in milliseconds, let's drop the additional digits
    if raw_time_len > 4:
        ms = leading_zeros(int(raw_time[:-4]) % 1000, 3)
    else:
        ms = '000'
    # everything before the last 7 digits is whole seconds
    if raw_time_len > 7:
        time_in_seconds = int(raw_time[:-7])
    else:
        time_in_seconds = 0
    second = leading_zeros(time_in_seconds % 60)
    minute = leading_zeros(int(math.floor(time_in_seconds / 60)) % 60)
    hour = leading_zeros(int(math.floor(time_in_seconds / 3600)))
    return "{}:{}:{},{}".format(hour, minute, second, ms)
Hello, thanks for a great script!
I've noticed that if the script sees lines like this:
<p begin="161831670t" end="193943750t" region="topCenter" style="s1" xml:id="subtitle2"><span style="s1_1">Terrace House</span> is a show about<br/>six strangers, men and women,</p>
or
<p begin="307807500t" end="363282920t" region="bottomCenter" style="s1" xml:id="subtitle7"><span style="s1_1">Terrace House</span> has now<br/>been revived by Netflix.</p>
It converts them to:
3
00:00:16,183 --> 00:00:19,394
Terrace House
or
8
00:00:30,780 --> 00:00:36,328
Terrace House
So the part after </span> is stripped.
Example file: sample.xml.txt
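One way to avoid losing the text after </span> is to walk the paragraph with an XML parser, which exposes that trailing text as the element's tail. This is a sketch rather than the script's actual approach: flatten_paragraph is a hypothetical name, and it assumes a single <p> fragment with no undeclared namespace prefixes other than the built-in xml: prefix.

```python
import xml.etree.ElementTree as ET


def flatten_paragraph(p_xml):
    """Return the visible text of one <p> element, with <br/> as a newline (hypothetical helper)."""
    parts = []

    def walk(node):
        if node.text:
            parts.append(node.text)
        for child in node:
            # Strip a namespace such as "{http://www.w3.org/ns/ttml}" if present.
            if child.tag.rsplit("}", 1)[-1] == "br":
                parts.append("\n")
            else:
                walk(child)
            # Text that follows the child's closing tag lives in .tail.
            if child.tail:
                parts.append(child.tail)

    walk(ET.fromstring(p_xml))
    return "".join(parts)
```

With the first sample line above, this returns "Terrace House is a show about" followed by a newline and "six strangers, men and women,".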
Could you add an option so that all the ms digits are included? If you have a subtitles file with say 350 "points" it can be skewed a total of 0.0009 x 350 = 0.315 seconds at the end if you don't include all the digits.
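SRT timestamps carry exactly three millisecond digits, so extra digits cannot be written out, but the conversion can round to the nearest millisecond instead of truncating. A minimal sketch, assuming 10,000,000 ticks per second (which is consistent with the samples on this page, e.g. 7304400001t and 00:12:10,440); the function name ticks_to_srt_time and its ticks_per_second parameter are hypothetical.

```python
def ticks_to_srt_time(ticks, ticks_per_second=10_000_000):
    """Convert an integer tick count to an SRT timestamp, rounding to the nearest ms."""
    # Integer arithmetic avoids floating-point error; adding half a unit rounds half up.
    total_ms = (ticks * 1000 + ticks_per_second // 2) // ticks_per_second
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, ms = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)
```

With rounding, each timestamp is off by at most half a millisecond from the source value.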
Netflix now supplies a new "webvtt-lssdh-ios" format, for example for Chinese subtitles.
Hello, maybe you can take HTML entities in vtt files into account. A quick and dirty solution is to change the write call to:
f.write(to_srt(html.unescape(text), fn[-4:]))
sorry, I'm new to github and I don't know how to propose a change in code, yet :)
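To illustrate what html.unescape from the standard library does to cue text, here is a small sketch. The wrapper name unescape_cue_text is hypothetical; only html.unescape itself is real.

```python
import html


def unescape_cue_text(text):
    """Decode HTML entities such as &amp; or &#39; in subtitle text (hypothetical wrapper)."""
    return html.unescape(text)
```

For example, unescape_cue_text("Tom &amp; Jerry &#39;live&#39;") gives "Tom & Jerry 'live'". Note that an escaped &lt; also becomes a literal <, which may matter if tags are stripped afterwards.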
Is there any solution to convert images to srt?
Detect text from images and convert it to srt using Tesseract OCR,
for example with Arabic.
Thanks for the nice script, I like it.
I have a subtitle like the attached sample where the span tags are not removed properly.
For the first line I can easily change the search pattern in the code, but for the second line only the outer tags get removed.
I'm getting the following error message. The file I'm using is attached, I had to save it as a text edit file since github cannot handle XML files. I put the file in the same directory as to_srt.py.
download.txt
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 167, in <module>
    main()
  File "/Users/kylefoley/codes/pcode/other/to_srt.py", line 162, in main
    with codecs.open("{}/{}.srt".format(a.ouput, fn), 'wb', "utf-8") as f:
AttributeError: 'Namespace' object has no attribute 'ouput'
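The error message suggests the attribute name is misspelled (ouput instead of output). With argparse, the attribute on the parsed namespace is derived from the long option name, so the two have to match. A minimal sketch; the -i/--input and -o/--output flags and their defaults are assumptions for illustration, not necessarily the script's real interface.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical input/output directory options."""
    parser = argparse.ArgumentParser(description="Convert subtitle files to .srt")
    parser.add_argument("-i", "--input", default=".", help="directory with source files")
    parser.add_argument("-o", "--output", default=".", help="directory for .srt files")
    return parser


# "--output" is exposed as args.output; args.ouput would raise AttributeError.
args = build_parser().parse_args(["-o", "out"])
target = "{}/{}.srt".format(args.output, "example")
```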
Tested on the attached subtitle file: even though the file starts with "WEBVTT", it still throws the exception on line 70.
Hi!
Since a few days ago, Netflix has been using span for all lines, including non-italic ones.
This is an example:
<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>
Or another example:
<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>
</styling>
With the first example, all lines are marked as italic; the second example produces these lines once converted to srt:
00:06:02,958 --> 00:06:04,333
<i>-No podemos por ahí.</i>
<span style="style5">¿Por?</i>
62
00:06:06,292 --> 00:06:08,667
<i>¡Es por aquí!</i>
<span style="style1">Tenemos que rodearla.</i>
63
00:06:09,708 --> 00:06:11,292
<i>¿Y cuánto se tarda?</i>
<span style="style1">Media hora.</i>
Thanks
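A possible approach is to read the styling block first and treat only the styles that declare tts:fontStyle="italic" as italic. This is a sketch under assumptions: the helper name italic_style_ids is hypothetical, and it expects self-closing <style .../> elements like the ones quoted above.

```python
import re

# A self-closing <style .../> element and its attribute string.
STYLE_TAG = re.compile(r"<style\b([^>]*)/>")
# name="value" pairs, allowing prefixed names such as tts:fontStyle or xml:id.
ATTR = re.compile(r'([\w:]+)="([^"]*)"')


def italic_style_ids(xml_text):
    """Return the set of xml:id values whose style is declared italic (hypothetical helper)."""
    ids = set()
    for match in STYLE_TAG.finditer(xml_text):
        attrs = dict(ATTR.findall(match.group(1)))
        if attrs.get("tts:fontStyle") == "italic" and "xml:id" in attrs:
            ids.add(attrs["xml:id"])
    return ids
```

For the first example this returns only style2; for the second it returns an empty set, so no span there would be wrapped in italics.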
Hi,
today I tried to convert a Netflix .xml sub to a regular .srt sub; however, I noticed that the script doesn't correctly convert accented characters (à, è, ì, ò, ù, ...).
Maybe you can add something that automatically converts region="region_2" to "{\an8}".
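The idea could be sketched as a small helper that prefixes the cue text with {\an8} (the top-centre alignment tag many players accept in .srt files). Which region ids actually mean "top" is an assumption here; region_2 is taken from the request above, and the name add_position_tag is hypothetical.

```python
def add_position_tag(text, region, top_regions=("region_2",)):
    """Prefix cue text with {\\an8} when its region is assumed to be at the top."""
    if region in top_regions:
        return "{\\an8}" + text
    return text
```

A more robust version would read the region definitions from the file instead of relying on a fixed list of ids.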