
pyannote-db-plumcot's Introduction

PLUMCOT 0

(Illustration: Game of Thrones waveform, annotation, and image grid.)

The PLUMCOT corpus provides annotations for face recognition, transcription (available for all episodes), person named entities, speech activity (annotated duration; available for all episodes), and speaker identity (identified duration) for 16 TV (or movie) series:

| Serie | Short | Transcription episodes | Transcription duration | Transcription tokens | Entities episodes | Entities tokens | Speech episodes | Speech duration (annotated) | Speech duration (identified) |
|---|---|---|---|---|---|---|---|---|---|
| 24 | - | 195 | 136:24 | 868,782 | - | - | - | 36:17 | - |
| Battlestar Galactica | BG | 71 | 52:16 | 264,066 | - | - | 61 | 10:53 | 08:49 |
| Breaking Bad | BB | 61 | 46:29 | 205,952 | - | - | 61 | 17:06 | 17:06 |
| Buffy the Vampire Slayer | Buffy | 143 | 101:18 | 587,165 | 12 | 73,301 | 143 | 25:55 | 25:55 |
| ER | - | 283 | 201:02 | 1,747,563 | - | - | - | 63:06 | - |
| Friends | - | 233 | 84:56 | 618,237 | - | - | 233 | 28:04 | 28:04 |
| Game of Thrones | GoT | 60 | 53:09 | 278,917 | 10 | 53,035 | 60 | 19:13 | 19:13 |
| Harry Potter | HP | 8 | 18:51 | 63,677 | 1 | 12,250 | 4 | 02:44 | 01:28 |
| Homeland | - | 70 | 57:49 | 333,405 | - | - | - | 12:24 | - |
| Lost | - | 66 | 46:48 | 367,546 | 7 |  | 66 | 07:12 | 07:12 |
| Six Feet Under | SFU | 63 | 56:43 | 326,542 | - | - | - | 15:11 | - |
| Star Wars | SW | 7 | 15:05 | 75,903 | 1 | 18,123 | 7 | 02:13 | 02:13 |
| The Big Bang Theory | TBBT | 207 | 68:41 | 547,193 | 17 | 61,285 | 207 | 25:23 | 25:23 |
| The Lord of the Rings | TLOR | 3 | 08:56 | 29,162 | - | - | 3 | 00:47 | 00:47 |
| The Office | TO | 188 | 71:45 | 575,500 | 6 | 24,692 | 188 | 30:15 | 30:15 |
| The Walking Dead | TWD | 89 | 65:00 | 321,334 | 19 | 93,599 | 25 | 08:32 | 02:46 |
| TOTAL | - | 1,747 | 1085:20 | 7,210,944 | 73 | 336,285 | 1,058 | 305:25 | 169:19 |

Transcriptions were scraped from various fan websites, see LICENSE.

Person named entities were annotated semi-automatically from the transcripts, see CONTRIBUTING.

Speaker annotations come from forced alignment on the series transcripts, except for Breaking Bad and Game of Thrones, which were manually annotated by Bost et al.

Face recognition annotations consist of a dataset of images labeled with the featured characters, scraped from IMDb. No bounding-box or video identification annotations are provided (for now).

In addition, this repository provides a Python API to access the corpus programmatically.

Installation

Until the package has been published on PyPI, one has to run the following commands:

$ git clone https://github.com/PaulLerner/pyannote-db-plumcot.git
$ pip install ./pyannote-db-plumcot

Usage

Please refer to pyannote.database for a complete documentation.

export PYANNOTE_DATABASE_CONFIG=$PWD/pyannote-db-plumcot/Plumcot/data/database.yml
python

Speaker Diarization / Identification and Entity Linking

>>> from pyannote.database import get_protocol

# you can access the whole dataset using the meta-protocol 'X'
>>> plumcot = get_protocol('X.SpeakerDiarization.Plumcot')
# Note : this might take a while...
>>> plumcot.stats('train')
{'annotated': 710303.0550000002, 'annotation': 383730.8849999984, 'n_files': 681, 'labels': {...}}

# or access each series individually, e.g. 'HarryPotter'
>>> from pyannote.database import get_protocol
>>> harry = get_protocol('HarryPotter.SpeakerDiarization.0')
>>> harry.stats('train')
{'annotated': 5281.429999999969, 'annotation': 2836.8099999998867, 'n_files': 2, 'labels': {...}}
# get the first file of HarryPotter.SpeakerDiarization.0's test set
>>> first_file = next(harry.test()) 
>>> first_file['uri']                                                                                                                                                                                   
'HarryPotter.Episode01'
# top 5 speakers of HarryPotter.Episode01
>>> first_file['annotation'].chart()[:5]                                                                                                                                                                
[('harry_potter', 417.1699999999951),
 ('rubeus_hagrid', 321.49000000000785),
 ('ron_weasley', 259.1599999999926),
 ('hermione_granger', 217.5499999999979),
 ('albus_dumbledore', 186.04999999999941)]
# On some files we provide entity linking annotation in a SpaCy Doc
# Beware, this might lead to a KeyError
>>> entity = first_file['entity'] 
>>> from pyannote.core import Segment 
>>> for token in entity[:11]:
...     segment = Segment(token._.time_start, token._.time_end)
...     print(f'{segment} {token._.speaker}: {token.text} -> {token.ent_kb_id_}')
[ 00:01:18.740 -->  00:01:18.790] albus_dumbledore: I -> albus_dumbledore
[ 00:01:18.830 -->  00:01:18.980] albus_dumbledore: should -> 
[ 00:01:18.980 -->  00:01:19.100] albus_dumbledore: have -> 
[ 00:01:19.160 -->  00:01:19.430] albus_dumbledore: known -> 
[ 00:01:19.460 -->  00:01:19.580] albus_dumbledore: that -> 
[ 00:01:19.600 -->  00:01:19.700] albus_dumbledore: you -> professor_mcgonagall
[ 00:01:19.700 -->  00:01:19.820] albus_dumbledore: would -> 
[ 00:01:19.820 -->  00:01:19.940] albus_dumbledore: be -> 
[ 00:01:19.940 -->  00:01:20.380] albus_dumbledore: here -> 
[ 00:01:21.660 -->  00:01:22.130] albus_dumbledore: ...Professor -> 
[ 00:01:22.380 -->  00:01:22.600] albus_dumbledore: mcgonagall -> professor_mcgonagall

Speech Activity Detection and transcription

Note that the previous dataset is also suitable for Speech Activity Detection but is smaller.

>>> from pyannote.database import get_protocol

# you can access the whole dataset using the meta-protocol 'X'
>>> plumcot = get_protocol('X.SpeakerDiarization.SAD')
# Note : this might take a while...
>>> plumcot.stats('train')
{'annotated': 1286065.3450000014, 'annotation': 716507.5149999945, 'n_files': 1144, 'labels': {...}}

# or access each series individually, e.g. 'HarryPotter'
>>> harry = get_protocol('HarryPotter.SpeakerDiarization.SAD')
>>> harry.stats('train')
{'annotated': 12864.489999999932, 'annotation': 5853.799999999804, 'n_files': 5, 'labels': {...}}
# get the first file of HarryPotter.SpeakerDiarization.SAD's test set
>>> first_file = next(harry.test()) 
>>> first_file['uri']                                                                                                                                                                                   
'HarryPotter.Episode01'
# The 'transcription' key should *always* be available, even when speaker identity is not
>>> transcription = first_file['transcription']
>>> from pyannote.core import Segment 
>>> for token in transcription[:11]:
...     s = Segment(token._.time_start, token._.time_end)
...     print(f'{s}: {token.text}')
[ 00:01:18.740 -->  00:01:18.790]: I
[ 00:01:18.830 -->  00:01:18.980]: should
[ 00:01:18.980 -->  00:01:19.100]: have
[ 00:01:19.160 -->  00:01:19.430]: known
[ 00:01:19.460 -->  00:01:19.580]: that
[ 00:01:19.600 -->  00:01:19.700]: you
[ 00:01:19.700 -->  00:01:19.820]: would
[ 00:01:19.820 -->  00:01:19.940]: be
[ 00:01:19.940 -->  00:01:20.380]: here
[ 00:01:21.660 -->  00:01:22.130]: ...Professor
[ 00:01:22.380 -->  00:01:22.600]: mcgonagall

Note: we do not provide the series' audio or video files! You'll need to acquire them yourself, then place them in the relevant series directory (e.g. HarryPotter/wavs) with file names formatted as <file_uri>.en16kHz.wav. See also the DVDs section.
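
For example, with the HarryPotter.Episode01 URI shown above, the expected layout would look like this (the parent directory should match the paths declared in Plumcot/data/database.yml):

HarryPotter/
└── wavs/
    ├── HarryPotter.Episode01.en16kHz.wav
    └── ...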

Raw data

Transcripts, diarization and entities annotations can be found as text files in the Plumcot/data sub-directory. Formats etc. are described in CONTRIBUTING.md.

The face recognition dataset is provided via an external link: TODO. Alternatively, you can scrape the images yourself using scripts/images_scraping.py (see CONTRIBUTING.md).

Entities

Some issues were not anticipated before the entities were annotated, resulting in the following hacks (implemented in Plumcot.loader, so beware if you use the raw data); see also #13.

  • Since we only annotated person named entities, the entity type (which was originally set automatically) is always set to "PERSON" when the entity label is not empty, and to "" otherwise.
  • Entities annotations are tokenized, unlike the forced alignment, and aligning the two is not straightforward. If both are provided (i.e. in the 0 protocols), we follow the tokenization of the forced alignment (i.e. split on whitespace), so some linguistic attributes (e.g. POS) may be lost in the process. If you're not interested in the audio/timing content of the annotation, you can use the EL protocols (e.g. 'HarryPotter.SpeakerDiarization.EL'), which follow the tokenization of the entities (see the sketch after this list). These were used to obtain the NER results reported in the paper (see also the script ner.py).
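
For instance, here is a minimal sketch of loading an EL protocol, assuming it exposes the same subsets and the same 'entity' key as the 0 protocols shown above:

>>> from pyannote.database import get_protocol
>>> harry_el = get_protocol('HarryPotter.SpeakerDiarization.EL')
>>> first_file = next(harry_el.test())
>>> entity = first_file['entity']
# tokens here follow the entity tokenization, so linguistic attributes (e.g. POS) are preserved
>>> [(token.text, token.ent_kb_id_) for token in entity[:5]]  # (text, entity label) pairs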

DVDs

Episode numbering relies on IMDb.

We acquired zone 2 (i.e. Europe) DVDs. DVDs were converted to mkv and wav using dvd_extraction.

durations.csv provides the audio duration of the resulting wav files.
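
A quick way to inspect it (a minimal sketch, assuming the file sits at the repository root and that pandas is installed; the column names are whatever the file actually contains):

>>> import pandas as pd
>>> durations = pd.read_csv('pyannote-db-plumcot/durations.csv')
>>> durations.head()  # print the first few rows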

Some (double) episodes are numbered as two different episodes on the DVDs although they're numbered as one in IMDb. These are listed in the double_episodes/ folder of the relevant series, if needed.

TODO: automate the creation of double_episodes/ files so that the user doesn't have to replace /vol/work3/lefevre/dvd_extracted/ manually.

The episodes are then concatenated using ffmpeg:

cd pyannote-db-plumcot/
bash scripts/concat_double_episodes.sh <serie_uri> </path/to/wavs>

Note that this will only create a new wav file resulting from the concatenation of <episode.i> and <episode.j>, named <episode.i.j>; it will not fix the numbering of the other episodes (TODO: add code to do this?).
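
Under the hood, such a concatenation boils down to something like the following (a minimal sketch using ffmpeg's concat demuxer; the actual script may differ, and the file names below are placeholders):

# list the two parts of the double episode, then concatenate them losslessly
printf "file 'Serie.Season01.Episode01.en16kHz.wav'\nfile 'Serie.Season01.Episode02.en16kHz.wav'\n" > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy Serie.Season01.Episode01.02.en16kHz.wav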

Ambiguous labels

Some labels are ambiguous depending on whether we focus on the speaker or on the entity.

We decided to focus on the entity as much as possible, e.g. 'Obiwan Kenobi' has the same label in the old and the new Star Wars movies, although it is not the same actor (i.e. speaker).

However, we annotated following the IMDb credits which are not always consistent, e.g. the emperor in Star Wars doesn't have the same label in the old and the new episodes.

Disclaimer: we do not intend the whole X.SpeakerDiarization.Plumcot corpus to be used to train or evaluate speaker diarization systems! Indeed, the classes are heavily imbalanced, many actors (i.e. speakers) play in multiple series, and many characters share labels across series (see actor_counter and counter, respectively).

Moreover, some secondary characters (most of whom don't have proper names) are played by several different actors throughout the same series. These are listed in not_unique.json and should be removed from the evaluation (TODO).
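
A minimal sketch of how such filtering could look with pyannote.core (the exact path and structure of not_unique.json are assumptions; adapt to the actual file):

>>> import json
>>> from pyannote.database import get_protocol
>>> harry = get_protocol('HarryPotter.SpeakerDiarization.0')
>>> first_file = next(harry.test())
# hypothetical: assume not_unique.json holds a list of labels for the relevant series
>>> with open('pyannote-db-plumcot/Plumcot/data/HarryPotter/not_unique.json') as f:
...     not_unique = set(json.load(f))
# drop those labels from the reference before evaluation
>>> reference = first_file['annotation'].subset(not_unique, invert=True)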

IMDb credits were updated on 17/03/2020 in 123e37cb; therefore, some annotated labels are inconsistent with these new IMDb credits and should be updated (TODO).

LICENSE

Source code

The source code (or "Software") is freely available under the MIT License.

Speech annotations

All speech annotations, whether regarding speaker identity or speech regions, are licensed under CC BY 4.0.

Textual content

All textual content, dialogues and derived annotations are held by their respective owners and their use is allowed under the Fair Use Clause of the Copyright Law. We only share them for research purposes.

They were scraped from various fan websites; see LICENSE.

References

Bost, X., Labatut, V., Linares, G., 2020. Serial Speakers: a dataset of TV series. arXiv preprint arXiv:2002.06923.

pyannote-db-plumcot's People

Contributors

aman-berhe, amanatgit, hbredin, lgalmant, paullerner, sharleynelefevre


pyannote-db-plumcot's Issues

fix: BreakingBad and GameOfThrones annotations (Serial Speakers)

There might be a need to make BreakingBad and GameOfThrones annotations uniform:

  • "who speaks when" (.rttm) annotations come from the Serial Speakers dataset (Bost. et al) who did not use the same IMDb-speaker-identifier as us.
  • transcripts (.txt), forced-alignment (.rttm, .aligned) and entities (.csv) annotations come from the same pipeline as the rest of the corpus (fan transcripts + IMDb identifier for every speaker), although we don't have BreakingBad transcripts with "who said what"

Note: it might be easier to do this using the original Serial Speakers dataset rather than the .rttm version, see https://github.com/PaulLerner/Prune#convert

fix: transcripts

  • remove stage directions (didascalies)
  • remove "previously in ......." recaps
  • in some rare cases, fans transcribed a lowercase "l" instead of "I" (e.g. "l am a TV character."). This led to poor POS-tagging and is fixed in Plumcot.loader.CsvLoader.

API for NLP stuff

Before releasing the corpus, we should provide examples of how to load transcripts (and their forced-aligned versions when available).

This probably means that we need to implement a dedicated API for that in the pyannote.database plugin.

>>> from pyannote.database import get_protocol
>>> protocol = get_protocol('TheBigBangTheory.????.Transcription')
>>> for file in protocol.train():
...    transcription = file['transcription']
...    for line in transcription:
...       line['speaker']
...       line['text'] 
...       # other fields of interest? time?

Entities annotation: improvements for the future

These would prevent a lot of hacks currently present in https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/Plumcot/loader/loader.py and https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/scripts/ner.py :

  • don't mess with spaCy tokenization
  • save the annotation in a format that allows reconstructing the original text after tokenization (e.g. store, for each token, the whitespace that follows it, which might be ''); see the sketch after this list
  • directly add forced-alignment attributes (word timestamp + alignment confidence) when tokenizing/annotating (although it's not straightforward what the attributes of a word split into several tokens should be)
  • correct entity type: can be automatic if we annotate only person named-entities
  • differentiate entities which are called by their names (and thus should be detected by a typical NER system) and others (e.g. pronouns etc.): maybe correcting the POS tag might be enough
  • better yet: annotate co-reference, this would allow to evaluate co-reference systems easily but I guess it's a lot of additional work
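
Regarding lossless detokenization above, spaCy already stores each token's trailing whitespace, so a minimal sketch (using a sentence from the README example) would be:

>>> import spacy
>>> nlp = spacy.blank('en')
>>> doc = nlp("I should have known that you would be here")
# token.whitespace_ is the whitespace following the token (possibly ''), so the text round-trips
>>> "".join(token.text + token.whitespace_ for token in doc) == doc.text
True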

fix: TWD entities annotation

Example with TheWalkingDead.Season01.Episode02

Final format

Mix-up

Here "mom" is annotated as "theodore_douglas"

referenceTo vs labelDoccano

Here "Lori" is not annotated in "labelDoccano" but is (automatically) in "referenceTo"

Also there "he" is annotated as "UNKNOWN" in "referenceTo" (and absent in "labelDoccano") whereas in this case it's easy to see that "he" refers to "shane".

Doccano annotation (csv and json)

Mix-up

The first two lines come from nowhere; they're not in the original transcript.

That explains the mix-up described above.

Miss

I'm aware that the JSON format is not user-friendly, but there you can see that "mom" is annotated as "lori_grimes", while "lori" and "amy" are not annotated.

cc @sharleynelefevre

deepspeech-gpu installation error

Looks like deepspeech-gpu 0.7.3 can no longer be installed (at least on my macOS).
Where is this dependency used? Can we make it optional?

pip install ./pyannote-db-plumcot
ERROR: Could not find a version that satisfies the requirement deepspeech-gpu==0.7.3 (from pyannote.db.plumcot==0+untagged.455.g0b364bc) (from versions: none)
ERROR: No matching distribution found for deepspeech-gpu==0.7.3 (from pyannote.db.plumcot==0+untagged.455.g0b364bc)

fix: IMDb credits

Since 123e37c, a 1-1 mapping is enforced between actor and character, so that an actor can only play one character. This was done to circumvent the fact that actors could be credited several times in movie series (e.g. TheLordOfTheRings), which would break the episodes.py script that generates the credits.txt file from characters.txt and episodes.txt. Only now it's tricky to update the annotations prior to 123e37c (as some names have changed); see #19.
Another solution would be to refactor the episodes.py script and the credits.txt file (I'm not sure why we have this huge unreadable matrix in the first place) so that one actor can play several characters and a character can be played by several actors. However, this implies fixing every usage of the credits.txt file...

We could then:

  • close #19

  • revert:

  • fix TheLordOfTheRings, either:

    • revert 2e94797 and update the names in the annotations (.aligned and .rttm)
    • or keep the names from 123e37c
  • same for BuffyTheVampireSlayer, but it's a lot easier: the only name that has changed is janice → janice_penshaw (which is nonexistent as far as I know)

  • update README and CONTRIBUTING

Calling file['entity'] raises "AlignmentError"

How to reproduce?

from pyannote.database import get_protocol
protocol = get_protocol('TheBigBangTheory.SpeakerDiarization.0')
test_file = next(protocol.test())
test_file['entity']
# AlignmentError: [...] are different texts.

Is this a known issue?
Is this related to your suggested changes in #13?

refactor: entities.py script (semi-automatic annotation)

This is also intended as documentation for the current script (more or less translated from @sharleynelefevre).
See also #13 for thoughts about a better annotation.

General stuff

  • script usage should follow the others: entities.py [--uri=<uri>] -> defaults to processing all series. You currently have to hard-code the series URI and season.
  • all paths should be relative to Plumcot/data, as in the other scripts; this can be done from the path of the installed Plumcot package in a few lines of code:
import pyannote.database
import Plumcot as PC
from pathlib import Path

# resolve Plumcot/data relative to the installed Plumcot package
DATA_PATH = Path(PC.__file__).parent / "data"

You currently have to hard-code the input and output paths. Note that the output path is not consistent with the actual structure and should be Plumcot/data/<serie>/csv_semi-auto-annotation.

  • formats should have clearer names than .csv, or, better, follow standard formats (e.g. .conll)

semi_auto_loc_annotation

Input

transcript with speaker names (e.g. Plumcot/data/TheBigBangTheory/transcripts/TheBigBangTheory.Season01.Episode01.txt)

Output

semi-automatically annotated transcript, in both .conll and .csv formats.

Process

removeTabLines

Input

.conll file (previous output)

Output

.conll file without tabs at the end of each line, so it is suitable for Doccano

Annotation in Doccano

Note this is not in the entities script. You need to:

  • select "sequence labeling" task
  • Check the second box
  • Import the .conll
  • Annotate
  • Export the data in JSONL format (the option on the right), rename the file following the same format as the title, and move it to the folder where the .csv / .conll outputs are located.

See CONTRIBUTING.md for annotation instructions.

Input

.conll file (previous output)

Output

.jsonl files, manually annotated (i.e. corrected)

jsonToCSV

Input

.jsonl file (previous output)

Output

.csv file. Note that this has nothing to do with the previous .csv format: this one only has three fields: "idSent", "idChar" and "label".

mergeData

Input

Output

Yet another .csv file which merges the two inputs. This is the format described in CONTRIBUTING.md and which is used in Plumcot.loader.

Process

Note this is insanely slow: roughly 15 minutes to merge two files into one.

stats

Input

.csv file (previous output)

Output

Evaluation of the automatic annotation
