persephone-tools / persephone
A tool for automatic phoneme transcription
License: Apache License 2.0
I finished 3 word lists from the 2015 backlog and another 3 word lists from the 2016 backlog. They are available from these two links:
https://www.dropbox.com/s/o6yxamgbftu2i2m/2015_ThreeWordLists.zip?dl=0
https://www.dropbox.com/s/8owp3a2vu3zrl4m/2016_ThreeWordLists.zip?dl=0
(These sets include the EGG files, just in case.)
It's important to decide how to deal with non-Python dependencies: git and ffmpeg. Both are straightforward to resolve:
pip install git+git://github.com/oadams/persephone.git@somebranch
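Since pip cannot install git or ffmpeg themselves, a small startup check can fail early with a clear message instead of a cryptic downstream error. This is only a sketch (the function is not part of persephone):

```python
import shutil

# Sketch (not part of persephone): verify that the non-Python
# dependencies are available on PATH before doing any real work.
def check_dependencies(binaries=("git", "ffmpeg")):
    missing = [b for b in binaries if shutil.which(b) is None]
    if missing:
        raise RuntimeError("missing required binaries: " + ", ".join(missing))
```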
The audio recordings are WAV files, 24-bit, 44,100 Hz. About one third only have one audio channel, recorded with an AKG C535EB microphone (some with an AKG C900).
The others have two audio channels: one with the microphone described above, and one from a Sennheiser HSP 2 omnidirectional headworn microphone.
The 'big' microphone (placed on a table in front of the speaker) is always in the left channel.
For persephone, the stereo files could be used in various ways:
- training 2 separate models, one for the 'table microphone' and one for the headworn microphone, and studying differences in their performance for phonemic transcription;
- feeding the 2 channels into the training set as if they constituted separate resources (associated with the same transcription, of course) and examining how this affects performance (PER / TER). (That is what Diep tried when training an acoustic model (with CMU Sphinx) on Na data: stereo files were demultiplexed and fed into the training set as if they constituted separate resources. But no comparison of methods was attempted: it was a short-lived pilot study!)
Since many of the audio documents that currently lack a transcription have 2 audio channels, there could be a practical application: devising a way to do better transcription of a file when 2 channels are available. (Maybe even think big and team up with a signal processing colleague interested in this topic.)
Things like breathing (taking a breath) can be heard on the headworn microphone signal because it's close to the speaker's mouth. These could be automatically detected, too :-) As a student, I transcribed French conversations at the Kiel phonetics lab, and such information was part of the annotation.
If that is useful I can send you a list of which files (transcribed and untranscribed) have stereo signals.
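For illustration, demultiplexing such a stereo WAV into two mono files can be done with the standard-library wave module alone (function names are mine, not persephone's). Slicing the interleaved frames by byte width keeps the code sample-width-agnostic, so the 24-bit recordings work too:

```python
import wave

def split_channels(frames, sampwidth, nchannels):
    """Demultiplex interleaved PCM frames into per-channel byte strings."""
    step = sampwidth * nchannels
    return [
        b"".join(frames[i + ch * sampwidth : i + (ch + 1) * sampwidth]
                 for i in range(0, len(frames), step))
        for ch in range(nchannels)
    ]

def split_stereo_wav(path, left_path, right_path):
    """Write the left and right channels of a stereo WAV to two mono files."""
    with wave.open(path, "rb") as src:
        p = src.getparams()
        frames = src.readframes(p.nframes)
    left, right = split_channels(frames, p.sampwidth, p.nchannels)
    for out_path, data in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(p.sampwidth)
            dst.setframerate(p.framerate)
            dst.writeframes(data)
```

With the convention described above, the left output would always hold the table-microphone signal.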
Nasal filled pauses are transcribed as mmm
Oral filled pauses are transcribed as əəə
The symbol for '...' (as one symbol: …) is usually added after these pauses, but it can safely be deleted at preprocessing, like the rest of the punctuation.
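A preprocessing sketch of that deletion (the regex and punctuation set are my assumptions, not persephone's actual code). Note that the tone-group boundary | is deliberately not treated as punctuation, since it carries linguistic information:

```python
import re

# Hypothetical preprocessing sketch: filled pauses (mmm, əəə) stay as
# tokens, while the ellipsis that often follows them, and the rest of
# the punctuation, is stripped. The tone-group boundary | is kept.
PUNCTUATION = "…,.?!;:"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")

def strip_punctuation(transcript):
    return " ".join(_PUNCT_RE.sub("", transcript).split())
```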
I think transcription of the filled pauses is now OK for the 31 texts (narratives). Here is the whole set:
https://www.dropbox.com/s/8c733ugf68bj0rl/31texts_27Dec2017.zip?dl=0
If there are any worries at preprocessing let me know. (I'd love to look under the hood & take a look at the preprocessing code.)
In its first version, two widgets: a text field where you enter the path to a data directory (with feat/ and label/ subdirectories), OR a file explorer where you choose a data directory. Assumes the data is preprocessed; creates a ReadyCorpus object and fires up a default model.
tkinter.ttk could be used. Alternatively, make a simple web interface, which may be better if someone is using a server with GPUs.
EDIT: Requirements
Satisfying this issue involves making a web-based extension of section 2, 'Training a toy Na model', of the tutorial. That is, we assume for now that (a) the data is already on the server and (b) the data is the preprocessed Na data from the tutorial, so the ReadyCorpus constructor can be used. Both of these assumptions will be removed in future issues.
We need a simple web page with a text field and a run button. The text field specifies the path to the data directory; in the tutorial, this is data/na_example. Ultimately, in work beyond this issue, the UI shouldn't use a text field but rather a file explorer or functionality to upload files to the server. The text field just seems like the most trivial thing to do right now. If a file explorer is comparably trivial, that can be done first.
The training and validation label error rates should be output per epoch, along with a notification when training is complete that reports the test error rate.
And adjust the experiment logging to note which version of the corpus is being used.
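A minimal sketch of such a page using only the standard library. The start_training placeholder stands in for constructing a ReadyCorpus and running the default model; persephone's real entry points may differ:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

FORM = b"""<html><body>
<form action="/run" method="get">
  Data directory: <input type="text" name="dir" value="data/na_example">
  <input type="submit" value="Run">
</form>
</body></html>"""

def start_training(data_dir):
    # Placeholder: in the real tool this would build ReadyCorpus(data_dir),
    # train, report per-epoch error rates and the final test error rate.
    return "Training started on {}".format(data_dir)

class TrainingPage(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/run":
            data_dir = parse_qs(parsed.query).get("dir", [""])[0]
            body = start_training(data_dir).encode("utf-8")
        else:
            body = FORM
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

# To serve: HTTPServer(("", 8080), TrainingPage).serve_forever()
```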
Currently there are situations where a bare Exception is raised. This makes it hard for the caller to handle the exceptions.
For example in model.py https://github.com/oadams/persephone/blob/b0cf86c3a0f7cd4fc656d213eb272ce2bf8073a0/persephone/model.py#L53
This is an example of a situation where a different exception type would be easier for the callers to handle and would provide more data in the form of the exception type.
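A sketch of what a library-specific exception hierarchy could look like (class names are hypothetical, not persephone's actual API). Callers can then catch exactly the failures they know how to handle, and the exception object itself carries extra data:

```python
# Hypothetical exception hierarchy for the library.
class PersephoneException(Exception):
    """Base class for all errors raised by the library."""

class NoPrefixFileException(PersephoneException):
    """A prefix file expected in the corpus directory is missing."""
    def __init__(self, path):
        super().__init__("missing prefix file: {}".format(path))
        self.path = path  # the type carries data, not just a message
```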
Looking at the set of texts as a training set for ASR tools, some improvements are required. (Assignee: Alexis!)
The syllables enclosed between a lozenge ◊ and a following tone-group boundary | don't make up a regular tone group and don't obey phonological rules on tone groups (in this instance: if it were a complete tone group, it would get an added final H tone by Rule 7). Examples such as (1) make it problematic to get the phonological rules right.
Updating phonemic transcription to latest phonemic analysis.
The contrast between i and ɯ following alveolar fricatives & affricates came late, and it came at a time when the main research objective was tone. So the verification of all the texts to implement the distinction between dʑɯ and dʑi, tɕɯ and tɕi and so on has not been carried out yet. A word such as 'water', /dʑɯ/, is still transcribed as /dʑi/ in many places, causing trouble for data-based learning of the mapping between phonemes & audio signal.
Emphatic stress.
A look at tone 'errors' in the output (Jan. 18th, 2018) suggests that emphatic stress can make tone recognition difficult (Wedding3.49). Phonetically, emphatic stress affects the syllable's pronunciation a lot. The logical thing for the linguist to do would be to go through the texts again and try to provide consistent indication of the presence of emphatic stress (by ↑ before the syllable).
Proposed time line: do the verifications little by little in the course of this year (2018), using automatic transcriptions as a tool to point to aspects that need to be encoded with increasing accuracy (& added detail).
A key question raised by Trevor Cohn is to what extent the 'success story' can be replicated on data sets that don't have high-quality audio. Maybe the low amount of reverberation and background noise does not really matter that much to ASR, provided the training corpus (transcribed audio) and the untranscribed audio have similar properties (similar signal-to-noise ratio, similar amount of reverberation, similar background noises...). Applying persephone to other languages, with less ‘clean’ audio, will indirectly provide an answer: if error rates are on the same order of magnitude, it means that clean audio does not matter so much. But it won’t shed direct light on the issue, because so many parameters differ from one language to another.
To get a more direct answer: what about teaming up with a signal processing expert who would do tests by degrading the audio quality of Na WAV files by superposing noises of different colours or adding reverberation?
Related tasks could include ‘scanning’ entire language archives and ‘tagging’ each corpus (or even each file) for specific acoustic properties: overall amount of background noise, profile of background noise... (e.g. buzz at 50 Hz is common due to grid power issues)
A related Issue is #4 (doing tests with stereo audio & other signals, not just 16-bit, 16kHz mono audio).
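A sketch of what such degradation tests could look like (function and parameter names are mine): superpose white or pink noise on a signal at a chosen signal-to-noise ratio, then re-run recognition and compare error rates.

```python
import numpy as np

def degrade(signal, snr_db, colour="white", rng=None):
    """Superpose noise at a target signal-to-noise ratio (in dB).
    colour='white' is flat-spectrum; 'pink' rolls off as 1/f."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    if colour == "pink":
        spectrum = np.fft.rfft(noise)
        freqs = np.fft.rfftfreq(len(noise))
        freqs[0] = freqs[1]  # avoid division by zero at DC
        spectrum /= np.sqrt(freqs)
        noise = np.fft.irfft(spectrum, n=len(noise))
    # Scale the noise so that the resulting SNR is exactly snr_db.
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise
```

Reverberation would need a convolution with a room impulse response, which is a separate (and harder) step; this only covers the additive-noise part.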
If these files have been written as empty, subsequent ReadyCorpus objects pointing to that directory will have empty datasets. These files shouldn't be written if empty. If they already exist and are empty, they shouldn't be read.
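A minimal sketch of the two guards (helper names are mine, not the library's): skip writing when there is nothing to write, and treat a missing or empty file as "no data" rather than reading it:

```python
import os

def write_prefixes(path, prefixes):
    # Don't create the file at all if there is nothing to write.
    if not prefixes:
        return
    with open(path, "w") as f:
        f.write("\n".join(prefixes) + "\n")

def read_prefixes(path):
    # Treat a missing or empty file as "no prefixes" instead of reading it.
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return []
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```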
There is a neat divide between 2 types of input (binary typology in Pangloss: texts and word lists). Linguistically, word lists are hyperarticulated (clear pronunciation). Texts vary constantly: often hypo-articulated (rapid speech, undershooting of targets, strong coarticulation) but with some words (or longer stretches) that are hyper-articulated (clear pronunciation).
Your tests (Jan. 2018) reveal very clearly that a model trained on word lists performs very badly on texts. The transcripts in wordlist_only_hyps.txt (your e-mail of Jan. 9th, 2018) are wholly unintelligible (unusable, for humans at least ☺).
Your 'blending' tests suggest that simply mixing all texts and word lists as training data is not the best way to train the acoustic model. The common-sense ASR intuition is that the training data should be homogeneous, and resemble the 'target data' as closely as possible, right? (Same point made by Martine Adda-Decker in comments on 2017 ALTA paper)
A possibility would be to train two separate models on the 2 data sets: 1 model for texts and one for word lists. From a linguistic point of view, the hypothesis is that the former would be (overall) better for hypo-articulated speech, and the latter for hyper-articulated speech.
For recognition of new narratives (the priority task: transcribing reams of spontaneous speech), the text model would be applied. But the 2 models could also be combined in smart ways, identifying passages in texts for which the word list model performs better. Word lists are much more homogeneous than texts; word lists usually don't contain hypo-articulated speech, whereas texts sometimes have passages that sound like in word list (clear articulation).
As a rule of thumb, speech rate is a good predictor of the degree of hypo-articulation: the faster the speech rate, the stronger the coarticulation; the slower the speech rate, the closer the articulation gets to the 'canonical' targets. Since persephone is good at identifying the syllables, a syllable rate could be calculated for a given stretch of audio, and if it is below a certain threshold (empirically set), then it would be a safe guess that that stretch is articulated very clearly, more like tokens in a word list than the typical data on which the text model was trained, and the word list model would be applied to that stretch.
This is just an idea. A first test to do would be: training a text-only model, applying it to a new text, and sending me the result. I'd examine it to see to what extent errors correlate with speech rate: if passages that are slowly, clearly articulated have 'silly' mistakes (phoneme errors and tonal errors in 'easy' passages), then using a word list model on those could be a good way to lower PER and TER.
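The routing heuristic above can be sketched in a few lines (the threshold value is invented for illustration; it would have to be set empirically, as noted):

```python
def choose_model(n_syllables, duration_s, threshold=3.0):
    """Route a stretch of audio to the word-list or text model by syllable
    rate (syllables per second). Slow, clearly articulated stretches go to
    the word-list model; the rest to the text model. Threshold is a guess."""
    rate = n_syllables / duration_s
    return "wordlist" if rate < threshold else "text"
```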
Currently, modules such as dataset/chatino.py cannot be imported if the various files specified by config cannot be found.
This might be an issue if someone wants to run some files with one language but not Chatino, for example. Essentially, we don't want people to get an error on import for code that they will not actually run. This is caused by top-level code being executed when the module is imported. We could fix this by making sure that execution requiring paths to be specified only occurs when the functions in the module are called.
What approach do good projects use?
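The usual fix is to defer path-dependent work from import time to call time. A sketch (the constant name is hypothetical):

```python
# BEFORE (runs at import time and crashes if the file is absent):
#     LABELS = open(CHATINO_LABEL_PATH).read().split()
#
# AFTER: wrap the work in a function, so the file is only touched
# when a caller actually needs the labels.
def load_labels(path):
    with open(path) as f:
        return f.read().split()
```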
Your e-mail of Jan. 9th, 2018:
"This includes the trivial changes of (a) making /əəə/ and /mmm/ single tokens (#3 ) and (b) making certain initial+rhymes a single token."
More accurately: some rhymes have a glide. The things noted as /j/ and /w/ are never initials (unlike in English), only part of a rhyme: rhymes that have on-glides.
So here are the single tokens:
jɤ jæ jo
wɤ wæ wɑ w̃æ
Label set for tones may also need to be adjusted. In the refs.txt document that you sent, I see that the two rising tones, low-rising ˩˥ and mid-rising ˧˥, have intervening spaces added, so they are treated as sequences of low-plus-high and mid-plus-high, respectively. Phonologically, they are analyzed as such (low-rising ˩˥ is phonologically L+H, and mid-rising ˧˥ M+H, that's why they are typeset like that) but for ASR they should be treated as single tokens (units). (Unless I'm mistaken and computationally it makes no difference... but I'd be surprised!)
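A preprocessing sketch of the merging described above (the function and the exact unit inventory are my assumptions): greedily fuse adjacent labels that together form one of the single-token units, trying longer merges first so that three-character units like mmm and əəə win over shorter matches.

```python
# Units that should be single labels for the model: on-glide rhymes,
# filled pauses, and the two rising tone contours.
SINGLE_TOKENS = {
    "jɤ", "jæ", "jo", "wɤ", "wæ", "wɑ", "w̃æ",
    "əəə", "mmm",
    "˩˥", "˧˥",
}

def retokenize(tokens, units=SINGLE_TOKENS, max_len=3):
    """Greedily merge adjacent tokens that form a single-token unit."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):  # try the longest merge first
            if i + n <= len(tokens) and "".join(tokens[i:i + n]) in units:
                out.append("".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```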
This needs to either be integrated into the library or have the import specified in the install.
Currently we use nltk_contrib for use with TextGrids
from nltk_contrib.textgrid import TextGrid
Unfortunately nltk_contrib is not on PyPi and this makes installation a bit annoying. It appears that we already depend on https://pypi.python.org/pypi/pympi-ling/ which supports TextGrid. I propose using this library if it is easier to install.
The tone group is a really essential unit in Na, because the tone rules only apply within the tone group.
Let's first dream of what would be possible if the tone group boundaries could be identified automatically (but I'll return to the low-hanging fruit at the end of this Issue).
Since the tone group boundaries are indicated in the transcription, would there be a way to add them to the model, so that the acoustic model would try to identify them from the audio? Even if accuracy is low at first, this would be a qualitative leap.
Pauses almost always correspond to a tone-group boundary: the oral filled pauses and nasal filled pauses, now encoded in a cleaner way than before as əəə and mmm. Silent pauses are also good evidence for tone-group boundaries.
This would allow for automatic correction and decrease TER. For instance for this example (Benevolence.1):
my transcription:
ə˧ʝi˧-ʂɯ˥ʝi˩, | zo˩no˥, | nɑ˩ ʈʂʰɯ˥-dʑo˩, | zo˩no˥, | le˧-ʐwɤ˩
output of mam/persephone in May 2017 (lightly edited):
ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ z o ˩ n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ z o ˩ n o ˧ l e ˧ ʐ w ɤ ˩
Let's imagine we get the following output from persephone, with tone-group boundaries added:
ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ | z o ˩ n o ˧ | n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ | z o ˩ n o ˧ | l e ˧ ʐ w ɤ ˩
Transcription as /z o ˩ n o ˧/ is phonetically good: the tone of /n o ˧/ is non-low, and from a phonetic point of view, that's that. But knowing that there is a tone-group boundary coming after, it can be rewritten as High (˥), on the basis of Rule 6.
And that's a gain on the TER scale. Out of the 13 tones, 11 were identified correctly; with this correction, tonal identification would reach 100% (TER: 0%). Victory!
A back-and-forth process between tone-group boundary identification and tonal string identification could be imagined:
Reading the "narrative" paper, Martine Adda-Decker seemed optimistic about the possibility of identifying tone-group boundaries from the audio (somewhat similar to intonational phrasing in English/French...) with reasonably good accuracy. (Of course I have no idea how that could be added to persephone.)
Another possibility would be to conduct top-down detection of tone-group boundaries, followed by 'sanity check'. Top-down detection of tone-group boundaries could be done on the basis of the tonal string. Thus (theoretical example), suppose we get this string of syllables from persephone:
æ˧ æ˩ æ˩˥ æ˧ æ˥ æ˧ æ˧ æ˧˥
Since contours (all of which, in Na, are rising) only occur at the right edge of a tone group, boundaries can be added after contours:
æ˧ æ˩ æ˩˥ | æ˧ æ˥ æ˧ æ˧ æ˧˥ |
Next, /æ˧ æ˩ æ˩˥/ can be parsed into either /æ˧ | æ˩ æ˩˥/ or /æ˧ æ˩ | æ˩˥/, because M.L.LH is not well-formed ('trough-shaped' sequence). It would be beautiful if the choice between the 2 possibilities ( σ | σ σ or σ σ | σ ) could be done on the basis of the acoustics, "asking" the "detector" which of the 2 is more plausible statistically.
And finally, æ˧ æ˥ æ˧ æ˧ æ˧˥ | needs to be parsed into /æ˧ æ˥ | æ˧ æ˧ æ˧˥ |/ because M.H.M is not well-formed (inside a tone group, H can only be followed by L: Rule 4).
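Two of the top-down steps above are mechanical enough to sketch directly (a simplification, with function names of my choosing: contours treated as single units, and Rule 4 reduced to its "H can only be followed by L" core):

```python
# Tone units: L = ˩, M = ˧, H = ˥; the contours ˩˥ and ˧˥ are single tokens.
CONTOURS = {"˩˥", "˧˥"}

def add_boundaries_after_contours(tones):
    """Contours only occur at the right edge of a tone group, so a
    boundary '|' can safely be inserted after every contour."""
    out = []
    for t in tones:
        out.append(t)
        if t in CONTOURS:
            out.append("|")
    return out

def violates_rule4(group):
    """Inside a tone group, H (˥) can only be followed by L (˩)."""
    return any(a == "˥" and b != "˩" for a, b in zip(group, group[1:]))
```

On the theoretical example, the first function yields the two candidate groups, and the second flags /æ˧ æ˥ æ˧ .../ as ill-formed, forcing the boundary after the H.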
If this theoretical example makes sense to you, I can try to come up with real examples where "top-down" tone-group boundary detection would lead to hypotheses about corrections that need to be made to the tonal string (with the prospect of lowering TER a great deal).
Now back to the low-hanging fruit, supposing we don't have information on tone-group boundaries.
An automated search would probably confirm that H.H sequences never occur (an occasional loanword or such could be the exception). That is because H can only be followed by L inside a tone group, so H.H is not valid inside a tone group, and must be parsed as H | ...
and the next tone group can only begin with L or M. So a detected H.H is probably to be analyzed as H | M.
But I can't think of many other such generalizations. Provisional conclusion: there's not much low-hanging fruit if tone-group boundaries are not included in the model. (Relevant reading from the book: Chapters 7, pp. 321-328.)
Based on the initial PDF Severine and Alexis sent me (look in emails).
Currently, any time the Docker image changes the user has to pull the Na data again via wget. It may be better to simply include a script in the container to pull the Na data. Users may not want to evaluate on Na, and that zip file isn't going to change.
Release steps:
- Update README.md, setup.py and persephone/__init__.py
- tox (runs pylint, mypy and does quick tests)
- pytest -s --no-print-logs --experiment persephone/tests/experiments/test_na.py::test_tutorial (ensures the tutorial isn't broken)
- python setup.py sdist
- python setup.py bdist_wheel
- twine upload dist/*
A comment from the linguist's point of view, assuming that some interested linguists visit the persephone project on GitHub (which I think is not at all unlikely):
Currently the intro (README.md) says "I've made the settings less computationally demanding than it would be for optimal transcription quality". That could come across as "you will get suboptimal quality". But for a linguist, the aim is to see persephone's very best. Not 'state-of-the-art with a rebate' but 'full-SOTA': that's part of the magic! 🥇 The linguists' perspective (I think) is that, in order to see what the tool can do, we want to be able to choose the highest settings, even if it's 10 times as computationally intensive to gain a couple percent in accuracy.
On the other hand, the training will likely need to be fired dozens of times with different settings, and we don't want to wait 1 week (or more) for each test.
To solve this Issue, what about providing a simple scenario such as: testing & tuning with the computationally less demanding settings, then raising the settings & firing once more (being warned that this final training with high settings may take a week or more)? Would that make sense?
The suggestion amounts to: adding 1 sentence in README.md:
"I've made the settings less computationally demanding than it would be for optimal transcription quality. The settings can be raised by changing the values in <name of settings file>"
Just an idea!
Currently there is no license attached to the project, please decide on a license and include it in the repository.
Making this issue now before I forget:
Would be great to have docs on:
- where a trained model is saved (the /exp/n/model folder? can it be moved somewhere else?)
- model.untranscribe (how is it called with a loaded model?)

@oadams: at preprocessing, could a tone-group boundary | be added at the end of each <S> unit (for texts) or <W> unit (for word lists), in cases where one is lacking in the input transcription?
The end of a <S> is, a fortiori, the end of the tone group. A bar | is present in the XML in many cases, but not in all. The training data in its present state is inconsistent (sometimes there is a | at the end of the <S>, sometimes not). Accordingly, in the January 18th results ("Tone groups improve tone prediction"), the automatic transcription sometimes has a final boundary, sometimes not: it seems that the acoustic model is 'puzzled' and has no good criterion. This looks like a case of 'garbage in, garbage out', right?
Addition of the boundaries at preprocessing could lead to improvement in recognition rates for tone-group boundaries.
The texts used at training will also need to be tidied. I will do this gradually: I try to avoid automatic replacements throughout the corpus, because there is the occasional special case where I wouldn't want to add a boundary, for instance after a filled pause 'mmm!' that has no tone and does not constitute a tone group (so it would be misleading to type a | bar after it). But I think these cases are so few that automatic addition at preprocessing should not create trouble for the acoustic model.
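A minimal sketch of the proposed preprocessing step (the function name is mine): append a final | to a unit's transcription when it is missing. The rare exceptions, like a toneless 'mmm!' that forms no tone group, would still need the manual tidying described above:

```python
def ensure_final_boundary(transcription):
    """Append a tone-group boundary '|' to a <S>/<W> unit if it lacks one,
    since the end of a unit is, a fortiori, the end of a tone group."""
    tokens = transcription.split()
    if tokens and tokens[-1] != "|":
        tokens.append("|")
    return " ".join(tokens)
```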
If you try to create a Corpus object and there are no training files (self.{train,valid,test}_prefixes are empty), then an exception should be raised.
Some work on this has been done in transcription_preprocessing.py
Detection of tone-group boundaries (#5 ) is a big priority because it is so important to tone recognition. From a programming point of view, work on detection of tone-group boundaries could pave the way for automatic identification of two other types of suprasegmental events indicated in the training transcriptions: 'F' for intonational focus (contrastive focus) and '↑' for emphatic stress. They are described in §8.3.1 and 8.3.2 of the book.
Like the tone group boundaries, contrastive focus and emphatic stress are not tones or segments: they're "very suprasegmental"!
The training set is probably insufficient to include those in the model at present (about 200 examples of 'F' and 45 examples of '↑') but if the idea is to aim high, well, that's definitely something worth aiming at!
I realise now this is super important. If people email in when the tool crashes, there might not be much to go on. Need to be able to request logs.
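One way to make logs requestable is to write a debug log to a rotating file from startup, so every run leaves a trail users can attach to a bug report. A sketch (file name, size limits, and format are arbitrary choices, not persephone's actual configuration):

```python
import logging
from logging.handlers import RotatingFileHandler

# Keep up to ~4 MB of debug history across restarts.
handler = RotatingFileHandler("persephone.log", maxBytes=1_000_000, backupCount=3)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
logger = logging.getLogger("persephone")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
```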