nebularnerd / subtotxt

Quickly convert an .srt or .vtt file to plain text, removing timestamps and .srt subtitle line numbers.

License: Creative Commons Zero v1.0 Universal

Python 100.00%
convertion python python3 srt srt-subtitles subtitle subtitles subtitles-parsing text webvtt

subtotxt's Introduction

subtotxt

Quickly convert a SubRip .srt or WEBVTT .vtt subtitle file to plain text. Removes timestamps and .srt subtitle line numbers. This was a quick project thrown together for my girlfriend, who is still learning English and wanted to read subtitles more like a transcript to work through some trickier language issues (and to understand the jokes in Friends by discussing them with me).

With a spot of feature creep, it evolved to detect the character encoding of the input and to understand both the .srt and .vtt formats, which saves some pre-processing work.

Usage:

Pop the Python file somewhere you can reach it and, from a command line, run:
python C:\Python\subtotxt.py -f subtitle.srt
or
python C:\Python\subtotxt.py -f subtitle.vtt
The script will check which format the subtitle file is (in case of incorrect file extensions), detect the character encoding used, then write out a .txt file with the same name as your input. If the output file already exists, it will ask for permission to delete it and create a new one.
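
As a rough illustration of checking the format by content rather than by extension, here is a minimal sketch. The function name sniff_format and its rules are assumptions made for this example and are not taken from subtotxt.py itself.

```python
from typing import Optional


def sniff_format(text: str) -> Optional[str]:
    """Guess the subtitle format from the content, ignoring the file extension."""
    # Drop a leading BOM and blank lines so they do not hide the header.
    lines = [ln.strip() for ln in text.lstrip("\ufeff").splitlines() if ln.strip()]
    if not lines:
        return None
    # A WEBVTT file starts with the literal header "WEBVTT".
    if lines[0].startswith("WEBVTT"):
        return "vtt"
    # An SRT file opens with a cue number followed by a timestamp line.
    if lines[0].isdigit() and len(lines) > 1 and "-->" in lines[1]:
        return "srt"
    return None
```

A file named subtitle.srt that actually begins with WEBVTT would be reported as "vtt" by this check.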

Advanced Usage:

The script has six more arguments you can pass:

  • --utf8 or -8
    Forces the output file to use UTF-8 encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.
  • --pause or -p
    Pauses the script at the sanity check stage to let you check some stats before continuing. Handy if the output is not working.
  • --screen or -s
    Prints the output to the console while writing to the file. May help with debugging failed outputs.
  • --copy or -c
    Copies input to output without change and appends -copy to the filename, e.g. subtitle-copy.srt. Handy to use with --utf8 to quickly change encoding, which might be useful if your video player app cannot understand your original subtitle file encoding.
  • --overwrite or -o
    Skips the "Output file already exists, delete and make a new one? [y/n]" prompt and simply deletes the existing output file to create a new one. Ideal for batch processing.
  • --help or -h
    Shows above information.
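
For readers wondering how flags like these are typically wired up, the following is a sketch using argparse from the standard library. It is not the script's own parser: the long name --file is an assumption (only -f appears above), and the help strings are paraphrased from the list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Convert an .srt or .vtt subtitle file to plain text."
    )
    # "--file" is an assumed long name; the README only shows the short -f form.
    parser.add_argument("-f", "--file", required=True, help="input subtitle file")
    parser.add_argument("-8", "--utf8", action="store_true", help="force UTF-8 output")
    parser.add_argument("-p", "--pause", action="store_true", help="pause at the sanity check stage")
    parser.add_argument("-s", "--screen", action="store_true", help="also print the output to the console")
    parser.add_argument("-c", "--copy", action="store_true", help="copy input to output unchanged")
    parser.add_argument("-o", "--overwrite", action="store_true", help="replace an existing output file without asking")
    # argparse adds --help / -h automatically.
    return parser
```

With this sketch, build_parser().parse_args(["-f", "subtitle.srt", "-8", "-o"]) yields a namespace with utf8 and overwrite set to True and the other switches set to False.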

Required External Modules:

  • Send2Trash Python module to safely delete the old output file on both Win and *nix based systems.
  • cchardet Python module to detect your subtitle file encoding (Removed for v2.0 release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).
  • charset_normalizer Python module to detect your subtitle file encoding (v2.0+ supports Python 3.9.x and 3.10.x).

If your system does not have these installed, the script will auto-install them on first use.
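
A common way to install a missing module on first use is to try the import and fall back to pip. The helper below is a hedged sketch of that pattern; ensure_module is a hypothetical name and the script may do this differently.

```python
import importlib
import subprocess
import sys


def ensure_module(import_name: str, pip_name: str = ""):
    """Import a module, installing it with pip first if it is missing."""
    try:
        return importlib.import_module(import_name)
    except ImportError:
        # Use the running interpreter's pip so the package lands in the right environment.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or import_name]
        )
        # Make sure the import system notices the newly installed package.
        importlib.invalidate_caches()
        return importlib.import_module(import_name)
```

The two names can differ: Send2Trash, for example, is installed as Send2Trash but imported as send2trash, so a call would look like ensure_module("send2trash", "Send2Trash").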

Features:

  • Fast (aside from initial missing modules install on slow net connections)
  • Input file character encodings are autodetected (if supported by cchardet [v1.0] or charset_normalizer [v2.0+])
  • Output files are written in the same encoding as the input or can be forced to UTF-8
  • Should be cross platform friendly thanks to pathlib and Send2Trash
  • Handles UNC style \\myserver\myshare\mysub.srt paths thanks to pathlib
  • Handles SRT to TXT or WEBVTT to TXT
  • Handles multi-line subtitles and subtitle lines with just numbers (does not confuse them with SRT line numbers)
  • WEBVTT: Removes 'WEBVTT', 'Kind: xxxx', 'Language: xxx' headers and timestamps from output
  • SRT: Removes subtitle line numbers and timestamps. Will not work if the first subtitle is not numbered 1 or if duplicated line numbers are present (rare cases but possible). For now, use SubtitleEdit to renumber lines if this happens.
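
One plausible way to tell a subtitle line that is just a number apart from an SRT line number is to track the next expected number and require a timestamp on the following line. The sketch below assumes that approach; it is not taken from the script, and srt_to_text is a hypothetical name. It also shows why numbering that does not start at 1, or that repeats, would break such a scheme.

```python
def srt_to_text(srt: str) -> str:
    """Strip SRT line numbers and timestamps, keeping only the subtitle text."""
    lines = srt.lstrip("\ufeff").splitlines()
    out = []
    expected = 1  # the next SRT line number we expect to see
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # Treat a line as an SRT line number only if it matches the running
        # counter AND is immediately followed by a timestamp line.
        if line == str(expected) and "-->" in following:
            expected += 1
            i += 2  # skip the number and the timestamp
            continue
        if line:
            out.append(line)
        i += 1
    return "\n".join(out)
```

In this sketch a subtitle whose text is only "42" is kept, because it neither matches the counter nor precedes a timestamp.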

Examples:

WEBVTT Input:

WEBVTT
Kind: captions
Language: en

00:00:18.590 --> 00:00:21.389
you'll hear a telephone conversation

00:00:21.389 --> 00:00:23.310
now you have some time to look at

00:00:23.310 --> 00:00:27.589
questions one to six

or SRT Input:

1
00:00:18,590 --> 00:00:21,389
you'll hear a telephone conversation

2
00:00:21,389 --> 00:00:23,310
now you have some time to look at

3
00:00:23,310 --> 00:00:27,589
questions one to six

Output:

you'll hear a telephone conversation
now you have some time to look at
questions one to six
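
The WEBVTT case above can be approximated in a few lines. This is a simplified sketch, not the script's implementation: it treats everything before the first timestamp as header, and it does not handle cue identifiers or NOTE blocks. The name vtt_to_text is hypothetical.

```python
def vtt_to_text(vtt: str) -> str:
    """Strip the WEBVTT header block and timestamps, keeping the subtitle text."""
    out = []
    in_header = True  # True until the first timestamp line is seen
    for raw in vtt.lstrip("\ufeff").splitlines():
        line = raw.strip()
        if "-->" in line:
            # Timestamp line: the header is over, and the line itself is dropped.
            in_header = False
            continue
        if in_header or not line:
            # Skips 'WEBVTT', 'Kind: ...', 'Language: ...' and blank lines.
            continue
        out.append(line)
    return "\n".join(out)
```

Limiting header removal to the lines before the first cue avoids dropping a later subtitle that happens to start with a word such as "Language:".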

Examples with non-Latin characters:

These are random examples taken from an SRT website. cchardet detects the encoding as UTF-8-SIG and Notepad++ detects it as UTF-8-BOM; these are technically the same thing.
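
To see why the two names refer to the same thing, the short sketch below uses only Python's standard codecs: "utf-8-sig" is Python's name for UTF-8 with a leading byte order mark (BOM). The sample string is borrowed from the examples in this README.

```python
import codecs

text = "瑪德琳?"

# UTF-8 with a BOM is the plain UTF-8 bytes with three marker bytes in front.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
same_as_sig = with_bom == text.encode("utf-8-sig")

# Decoding as plain "utf-8" keeps the BOM as a U+FEFF character...
decoded_plain = with_bom.decode("utf-8")
# ...while "utf-8-sig" strips it.
decoded_sig = with_bom.decode("utf-8-sig")
```

This is also why a file read as plain UTF-8 can show a stray invisible character at the start of its first line.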

Arabic SRT in UTF-8-BOM / UTF-8-SIG encoding:

1
00:00:02,425 --> 00:00:20,776
تـرجـمـة وتـعـديـل
الدكتور علي طلال 

2
00:00:58,425 --> 00:00:59,776
مادلين)؟)

3
00:01:01,705 --> 00:01:03,462
هل تريدين أنّ تأكلين مجددًا؟

Output with forced UTF-8 encoding:

تـرجـمـة وتـعـديـل
الدكتور علي طلال 
مادلين)؟)
هل تريدين أنّ تأكلين مجددًا؟

Chinese (Traditional) SRT in UTF-8-BOM / UTF-8-SIG encoding:

1
00:00:58,016 --> 00:00:59,476
瑪德琳?

2
00:01:01,270 --> 00:01:03,272
(妳又想吃飯了?)

3
00:01:04,313 --> 00:01:07,276
(妳心情不好才會吃這麼多)

4
00:01:09,528 --> 00:01:10,612
瑪德琳!

Output file in original UTF-8-BOM / UTF-8-SIG encoding:

瑪德琳?
(妳又想吃飯了?)
(妳心情不好才會吃這麼多)
瑪德琳!

Future plans:

  • Possibly handle more formats (.ssa Sub Station Alpha would be the other major one I could think of). For now you can use something like SubtitleEdit to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
  • GUI option for simple drag and drop usage.
  • Figure out a checking method for misnumbered or duplicated SRT line numbers.
  • Handle stripping out SRT formatting tags for bold, italic etc.
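
As a starting point for the last item, formatting tags could be removed with a regular expression. This is only a sketch of one possible approach, not planned or existing code in the script; strip_srt_tags is a hypothetical name, and the tag list (b, i, u, font, plus the brace variants some players accept) is an assumption about which tags matter.

```python
import re

# Matches <b>, </i>, <u>, <font color="...">, and the {b}/{i}/{u} brace variants.
_TAG_RE = re.compile(r"</?(?:b|i|u|font)\b[^>]*>|\{/?[biu]\}", re.IGNORECASE)


def strip_srt_tags(line: str) -> str:
    """Remove common SRT formatting tags, leaving the subtitle text intact."""
    return _TAG_RE.sub("", line)
```

Restricting the pattern to known tag names means ordinary text containing angle brackets, such as "a < b > c", is left alone.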

License:

Released as CC0, use it how you wish. If you do use it elsewhere, please be awesome and tag me as the original author. 🙂

subtotxt's People

Contributors

nebularnerd

subtotxt's Issues

Python 3.10.x Issues

I've just come across a bug with Python 3.10.x installs. One of the required packages, cChardet, does not currently work under Python versions above 3.9.x.

If you try to run it you'll likely get an error mentioning 'Visual C++ 14 Required' (and most likely whatever the equivalent error would be on *nix). There are forks of cChardet that are meant to work on 3.10.x, but none have been merged into the main repository, as it looks like the maintainer has stopped working on it.

Once I have an update I'll add it here, and hopefully close the issue.
