persephone-tools / persephone Goto Github PK

View Code? Open in Web Editor NEW

154.0 154.0 26.0 1.27 MB

A tool for automatic phoneme transcription

License: Apache License 2.0

Python 99.77% Shell 0.03% Dockerfile 0.20%

acoustic-models artificial-intelligence machine-learning neural-networks speech-recognition

persephone's People

Contributors

Stargazers

Watchers

persephone's Issues

[Yongning Na] for tests on different types of training materials: six additional word lists

I finished 3 word lists from the 2015 back log and another 3 word lists from the 2016 back log. They are available from these 2 links
https://www.dropbox.com/s/o6yxamgbftu2i2m/2015_ThreeWordLists.zip?dl=0
https://www.dropbox.com/s/8owp3a2vu3zrl4m/2016_ThreeWordLists.zip?dl=0

(These sets include the EGG files, just in case.)

Put next version on pypi

It's important to to decide how to deal with non-python dependencies: git and ffmpeg. Both are straightforward to resolve:

git is only needed so that the git hash of the current commit can be stored in experimental logs for reproducability. However, if we're using a pip installable version of the software, we can just store that. So if git isn't present, just store say "v0.1.6" instead of the hash. This only breaks down if you pip install a non-master branch AND you don't have git, using something like:

pip install git+git://github.com/oadams/persephone.git@somebranch

ffmpeg is used to normalize the wav files to 16KHz mono files. However it appears python alternatives such as pydub can be used. This means we could make persephone entirely pip-able and would mean docker would not be necessary to make it run on windows (we could package it up in pyinstaller).

Look into supporting GPU usage with nvidia-docker

[Yongning Na] using both channels of stereo audio files, for training & for phonemic transcription?

The audio recordings are WAV files, 24-bit, 44,100 Hz. About one third only have one audio channel, recorded with an AKG C535EB microphone (some with an AKG C900).

The others have two audio channels: one with the microphone described above, and one from a Sennheiser HSP 2 omnidirectional headworn microphone.

The 'big' microphone (placed on a table in front of the speaker) is always in the left channel.

For persephone, the stereo files could be used in various ways:

training 2 separate models, one for 'table microphone' and one for headworn microphone, and studying differences in their performances for phonemic transcription:
- overall difference in PER and TER
- how well does the 'table mic' model perform with headworn mic data? How well does the headworn mic model perform with 'table mic' data?
feeding the 2 channels into the training set as if they constituted separate resources (associated with the same transcription, of course) and examining how this affects performance (PER / TER). (That is what Diep tried when training an acoustic model (with CMU Sphinx) on Na data: stereo files were demultiplexed and fed into the training set as if they constituted separate resources. But no comparison of methods was attempted: it was a short-lived pilot study!)

Since many of the audio documents that currently lack a transcription have 2 audio channels, there could be a practical application: devising a way to do better transcription of a file when 2 channels are available. (Maybe even see big and team up with a signal processing colleague interested in this topic.)

Things like breathing (taking a breath) can be heard on the headworn microphone signal because it's close to the speaker's mouth. These could be automatically detected, too :-) As a student, I transcribed French conversations at the Kiel phonetics lab, and such information was part of the annotation.

If that is useful I can send you a list of which files (transcribed and untranscribed) have stereo signals.

[Yongning Na] identification of filled pauses, for use in automatic phonemic transcription

Nasal filled pauses are transcribed as mmm
Oral filled pauses are transcribed as əəə

The symbol for '...' (as one symbol: …) is usually added after these pauses, but it can safely be deleted at preprocessing, like the rest of the punctuation.

I think transcription of the filled pauses is now OK for the 31 texts (narratives). Here is the whole set:
https://www.dropbox.com/s/8c733ugf68bj0rl/31texts_27Dec2017.zip?dl=0

If there are any worries at preprocessing let me know. (I'd love to look under the hood & take a look at the preprocessing code.)

Add a bare-bones GUI (UI task #1)

In its first version, two widgets:

A simple text field where the user inputs the path to the data directory (the one with feat/ and label/ subdirectories) OR a file explorer where you choose a data directory.
A run button.

Assumes the data is preprocessed and creates a ReadyCorpus object and fires up a default model.

~~tkinter.ttk could be used.~~ Make a simple web interface, which may be good if someone is using a server with GPUs.

EDIT: Requirements

Satisfying this issue involves making a web extension to section 2. Training a toy Na model in the tutorial. That is, we assume for now that (a) the data is already on the server and (b) the data is the preprocessed Na data from the tutorial, so the ReadyCorpus constructor can be used. Both of these assumptions will be removed in future issues.

We need a simple web page with a text field and a run button. The text field specifies the path to the data directory. In the tutorial, this is data/na_example. Ultimately, in work beyond this issue the UI shouldn't use a text field, but rather a file explorer or functionality to upload files to the server. The text field just seems like the most trivial thing to do right now. If a file explorer is comparably trivial, that can be done first.

The training and validation label error rates should be output per epoch, along with notification when training is complete, reporting the test error rate.

Pull from the HimalCo/corpus repository

And adjust the experiment logging to note which version of the corpus is being used.

Make installable by setuptools

Create custom exceptions

Currently there are situations where Exception is being raised. This makes it hard for the caller to handle the exceptions.

For example in model.py https://github.com/oadams/persephone/blob/b0cf86c3a0f7cd4fc656d213eb272ce2bf8073a0/persephone/model.py#L53

This is an example of a situation where a different exception type would be easier for the callers to handle and would provide more data in the form of the exception type.

[Yongning Na] improvements to the transcription of the training set

Looking at the set of texts as a training set for ASR tools, some improvements are required. (Assignee: Alexis!)

Special symbol for extrametrical spans.
So far the same bar | has been used as for a regular tone-group boundary. But that is confusing for a learner and probably also for the algorithm: one symbol with 2 meanings. A different symbol needs to be used for when a span of extrametrical syllables begins. Maybe one of these: ⟧ ◊ ⎬
Thus, instead of:
(1) | ʂv̩˧ɖv̩˧-ze˩-tsɯ˩ | -mv̩˩ |
it should be
(1') | ʂv̩˧ɖv̩˧-ze˩-tsɯ˩ ◊ -mv̩˩ |

The syllables enclosed between a lozenge ◊ and a following tone-group boundary | don't make up a regular tone group and don't obey phonological rules on tone groups (in this instance: if it were a complete tone group, it would get an added final H tone by Rule 7). Examples such as (1) make it problematic to get the phonological rules right.

Updating phonemic transcription to latest phonemic analysis.
The contrast between i and ɯ following alveolar fricatives & affricates came late, and it came at a time when the main research objective was tone. So the verification of all the texts to implement the distinction between dʑɯ and dʑi, tɕɯ and tɕi and so on has not been carried out yet. A word such as 'water', /dʑɯ/, is still transcribed as /dʑi/ in many places, causing trouble for data-based learning of the mapping between phonemes & audio signal.
Emphatic stress.
A look at tone 'errors' in the output (Jan. 18th, 2018) suggests that emphatic stress can make tone recognition difficult (Wedding3.49). Phonetically, emphatic stress affects the syllable's pronunciation a lot. The logical thing for the linguist to do would be to go through the texts again and try to provide consistent indication of the presence of emphatic stress (by ↑ before the syllable).

Proposed time line: do the verifications little by little in the course of this year (2018), using automatic transcriptions as a tool to point to aspects that need to be encoded with increasing accuracy (& added detail).

Team up with signal processing expert for tests on different audio qualities?

A key question raised by Trevor Cohn is to what extent the 'success story' can be replicated on data sets that don't have high-quality audio. Maybe the low amount of reverberation and background noise does not really matter that much to ASR, provided the training corpus (transcribed audio) and the untranscribed audio have similar properties (similar signal-to-noise ratio, similar amount of reverberation, similar background noises...). Applying persephone to other languages, with less ‘clean’ audio, will indirectly provide an answer: if error rates are on the same order of magnitude, it means that clean audio does not matter so much. But it won’t shed direct light on the issue, because so many parameters differ from one language to another.

To get a more direct answer: what about teaming up with a signal processing expert who would do tests by degrading the audio quality of Na WAV files by superposing noises of different colours or adding reverberation?
Related tasks could include ‘scanning’ entire language archives and ‘tagging’ each corpus (or even each file) for specific acoustic properties: overall amount of background noise, profile of background noise... (e.g. buzz at 50 Hz is common due to grid power issues)

A related Issue is #4 (doing tests with stereo audio & other signals, not just 16-bit, 16kHz mono audio).

Experiment with different batch sizes on CPU machines to arrive at a default for the docker image

{train,valid,test}_prefixes.txt shouldn't be empty

If these files have been written as empty, subsequent ReadyCorpus objects pointing to that directory will have empty datasets. These files shouldn't be written if empty. If they already exist and are empty, they shouldn't be read.

[Yongning Na] creating 2 distinct acoustic models for tests on different types of training materials?

There is a neat divide between 2 types of input (binary typology in Pangloss: texts and word lists). Linguistically, word lists are hyperarticulated (clear pronunciation). Texts vary constantly: often hypo-articulated (rapid speech, undershooting of targets, strong coarticulation) but with some words (or longer stretches) that are hyper-articulated (clear pronunciation).

Your tests (Jan. 2018) reveal very clearly that a model trained on word lists performs very badly on texts. The transcripts in wordlist_only_hyps.txt (your e-mail of Jan. 9th, 2018) are wholly unintelligible (unusable, for humans at least ☺)

Your 'blending' tests suggest that simply mixing all texts and word lists as training data is not the best way to train the acoustic model. The common-sense ASR intuition is that the training data should be homogeneous, and resemble the 'target data' as closely as possible, right? (Same point made by Martine Adda-Decker in comments on 2017 ALTA paper)

A possibility would be to train two separate models on the 2 data sets: 1 model for texts and one for word lists. From a linguistic point of view, the hypothesis is that the former would be (overall) better for hypo-articulated speech, and the latter for hyper-articulated speech.

For recognition of new narratives (the priority task: transcribing reams of spontaneous speech), the text model would be applied. But the 2 models could also be combined in smart ways, identifying passages in texts for which the word list model performs better. Word lists are much more homogeneous than texts; word lists usually don't contain hypo-articulated speech, whereas texts sometimes have passages that sound like in word list (clear articulation).

As a rule of thumb, speech rate is a good predictor of the degree of hypo-articulation: the faster the speech rate, the stronger the coarticulation; the slower the speech rate, the closer the articulation gets to the 'canonical' targets. Since persephone is good at identifying the syllables, a syllable rate could be calculated for a given stretch of audio, and if it is below a certain threshold (empirically set), then it would be a safe guess that that stretch is articulated very clearly, more like tokens in a word list than the typical data on which the text model was trained, and the word list model would be applied to that stretch.

This is just an idea. A first test to do would be: training a text-only model, applying it to a new text, and sending me the result. I'd examine it to see to what extent errors correlate with speech rate: if passages that are slowly, clearly articulated have 'silly' mistakes (phoneme errors and tonal errors in 'easy' passages), then using a word list model on those could be a good way to lower PER and TER.

Prepare a ReadyCorpus class for users who have done their preprocessing or want to do it in another language

Clean up run.py and dataset modules

Currently modules such as dataset/chatino.py will not be able to be imported if the various files that are specified by config cannot be found.

This might be an issue if someone wants to run some files with one language but not chatino for example. Essentially we don't want people to get an error on import for code that they will not actually run. This is caused by code that is in the top level being executed when the module is imported. We could fix this by making sure that execution requiring paths to be specified only occurs when the functions in the module are called.

Decide whether to do relative imports versus from persephone.blah import *

What approach do good project use?

Confirm that the docker image works on windows

[Yongning Na] Refining the label set

Your e-mail of Jan. 9th, 2018:
"This includes the trivial changes of (a) making /əəə/ and /mmm/ single tokens (#3 ) and (b) making certain initial+rhymes a single token."

More accurately: some rhymes have a glide. The things noted as /j/ and /w/ are never an initial (unlike in English), only a part of a rhyme: rhymes that have on-glides.
So here are the single tokens:
jɤ jæ jo
wɤ wæ wɑ w̃æ

Label set for tones may also need to be adjusted. In the refs.txt document that you sent, I see that the two rising tones, low-rising ˩˥ and mid-rising ˧˥, have intervening spaces added, so they are treated as sequences of low-plus-high and mid-plus-high, respectively. Phonologically, they are analyzed as such (low-rising ˩˥ is phonologically L+H, and mid-rising ˧˥ M+H, that's why they are typeset like that) but for ASR they should be treated as single tokens (units). (Unless I'm mistaken and computationally it makes no difference... but I'd be surprised!)

Distance module

This needs to either be integrated into the library or have the import specified in the install.

Figure out whether to have multiple docker containers (one with na example data, one without), or to require users to run a script to fetch the na data once they start their docker container

Add acknowledgements / mention of contributors

Remove nltk_contrib dependency

Currently we use nltk_contrib for use with TextGrids

from nltk_contrib.textgrid import TextGrid

Unfortunately nltk_contrib is not on PyPi and this makes installation a bit annoying. It appears that we already depend on https://pypi.python.org/pypi/pympi-ling/ which supports TextGrid. I propose using this library if it is easier to install.

Bundle the code up in pyinstaller

[Yongning Na] detection of tone-group boundaries: top-down from tone sequence, or by acoustic model

The tone group is a really essential unit in Na, because the tone rules only apply within the tone group.

Let's first dream of what would be possible if the tone group boundaries could be identified automatically (but I'll return to the low-hanging fruit at the end of this Issue).

Since the tone group boundaries are indicated in the transcription, would there be a way to add them to the model, so that the acoustic model would try to identify them from the audio? Even if accuracy is low at first, this would be a qualitative leap.
Pauses almost always correspond to a tone-group boundary: the oral filled pauses and nasal filled pauses, now encoded in a cleaner way than before, as mmm and əəə. Silent pauses are also good evidence for tone-group boundaries.

This would allow for automatic correction and decrease TER. For instance for this example (Benevolence.1):
my transcription:
ə˧ʝi˧-ʂɯ˥ʝi˩, | zo˩no˥, | nɑ˩ ʈʂʰɯ˥-dʑo˩, | zo˩no˥, | le˧-ʐwɤ˩

output of mam/persephone in May 2017 (lightly edited):
ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ z o ˩ n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ z o ˩ n o ˧ l e ˧ ʐ w ɤ ˩

Let's imagine we get the following output from persephone, with tone-group boundaries added:
ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ | z o ˩ n o ˧ | n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ | z o ˩ n o ˧ | l e ˧ ʐ w ɤ ˩

Transcription as /z o ˩ n o ˧/ is phonetically good: the tone of /n o ˧/ is non-low, and from a phonetic point of view, that's that. But knowing that there is a tone-group boundary coming after, it can be rewritten as High (˥), on the basis of Rule 6.
And that's a gain on the TER scale. Out of the 13 tones, 11 were identified correctly; with this correction, tonal identification would reach 100% (TER: 0%). Victory!

A back-and-forth process between tone-group boundary identification and tonal string identification could be imagined:

first-pass automatic detection of tone-group boundaries
validity check of the tone strings, on the basis of phonological rules. Suppose, for instance, that persephone yields this output:
ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ | z o ˩ n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ |
(missing the tone-group boundary after /n o ˧/)
The tone group z o ˩ n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ | is not well-formed, because it contains M.L.H (... n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ ...), a sequence which is not allowed inside a tone group. (dot = syllable boundary)
There are 2 hypotheses in cases of mismatch like this one: either the tonal string detected is wrong, or the tone-group boundary detection is wrong. So the audio excerpt could be 'sent back' to the tone-group boundary "detector". If that "detector" answers that the statistically most probable boundary is after /n o ˧/, then a tone-group boundary would be added there, and all would be for the best.
If the "detector" suggested a mistaken tone-group boundary, for instance after /z o ˩/ (z o ˩ | n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ |) then that would not work (output badly formed: tone sequence M.L.H is not allowed, and L on its own in the group made up of just the syllable /z o ˩/ is wrong too). A problem would be reported to the user.
(I can't imagine why the system would detect a tone-group boundary there; this is just to consider all hypotheses.)

Reading the "narrative" paper, Martine Adda-Decker seemed optimistic about the possibility of identifying tone-group boundaries from the audio (somewhat similar to intonational phrasing in English/French...) with reasonably good accuracy. (Of course I have no idea how that could be added to persephone.)

Another possibility would be to conduct top-down detection of tone-group boundaries, followed by 'sanity check'. Top-down detection of tone-group boundaries could be done on the basis of the tonal string. Thus (theoretical example), suppose we get this string of syllables from persephone:
æ˧ æ˩ æ˩˥ æ˧ æ˥ æ˧ æ˧ æ˧˥
Since contours (all of which, in Na, are rising) only occur at the right edge of a tone group, boundaries can be added after contours:
æ˧ æ˩ æ˩˥ | æ˧ æ˥ æ˧ æ˧ æ˧˥ |
Next, /æ˧ æ˩ æ˩˥/ can be parsed into either /æ˧ | æ˩ æ˩˥/ or /æ˧ æ˩ | æ˩˥/, because M.L.LH is not well-formed ('trough-shaped' sequence). It would be beautiful if the choice between the 2 possibilities ( σ | σ σ or σ σ | σ ) could be done on the basis of the acoustics, "asking" the "detector" which of the 2 is more plausible statistically.
And finally, æ˧ æ˥ æ˧ æ˧ æ˧˥ | needs to be parsed into /æ˧ æ˥ | æ˧ æ˧ æ˧˥ |/ because M.H.M is not well-formed (inside a tone group, H can only be followed by L: Rule 4).

If this theoretical example makes sense to you, I can try to come up with real examples where "top-down" tone-group boundary detection would lead to hypotheses about corrections that need to be made to the tonal string (with the prospect of lowering TER a great deal).

Now back to the low-hanging fruit, supposing we don't have information on tone-group boundaries.
An automated search would probably confirm that H.H sequences never occur (an occasional loanword or such could be the exception). That is because H can only be followed by L inside a tone group, so H.H is not valid inside a tone group, and must be parsed as H | ...
and the next tone group can only begin with L or M. So a detected H.H is probably to be analyzed as H | M.
But I can't think of many other such generalizations. Provisional conclusion: there's not much low-hanging fruit if tone-group boundaries are not included in the model. (Relevant reading from the book: Chapters 7, pp. 321-328.)

Refine Documentation

based on the initial pdf Severine and Alexis sent me (look in emails).

If WAVs are empty, present a warning on feature extraction, skipping the file instead of crashing

Remove tutorial branch and make na_example an optional download in docker image

Currently any time the docker image changes the user has to pull the na data again via wget. It may be better to simply include a script int the container to pull the na data. Users may not want to evaluate on na, and that zip file isn't going to change.

Add instructions for sharing data into the docker containers using docker volumes

Make compatible with Kaldi input files

Facilitate automatic preprocessing of ELAN

ELAN (currently standalone script)
TextGrid

Replace ffmpeg and sox dependencies by using pydub

Checklist for release tasks

Increment version number everywhere appropriate. Currently this is in README.md, setup.py and persephone/__init__.py
git tag and push to github.
tox (runs pylint, mypy and does quick tests)
pytest -s --no-print-logs --experiment persephone/tests/experiments/test_na.py::test_tutorial (ensures the tutorial isn't broken).
python setup.py sdist
python setup.py bdist_wheel
twine upload dist/*
Make sure dockerhub has a build triggered.

Allowing changing the settings to computationally demanding mode for optimal transcription quality

A comment from the linguist's point of view, assuming that some interested linguists visit the persephone project on GitHub (which I think is not at all unlikely):

Currently the intro (README.md) says "I've made the settings less computationally demanding than it would be for optimal transcription quality". That could come across as "you will get suboptimal quality". But for a linguist, the aim is to see persephone's very best. Not 'state-of-the-art with a rebate' but 'full-SOTA': that's part of the magic! 🥇 The linguists' perspective (I think) is that, in order to see what the tool can do, we want to be able to choose the highest settings, even if it's 10 times as computationally intensive to gain a couple percent in accuracy.

On the other hand, the training will likely need to be fired dozens of times with different settings, and we don't want to wait 1 week (or more) for each test.

To solve this Issue, what about providing a simple scenario such as: testing & tuning with the computationally less demanding settings, then raising the settings & firing once more (being warned that this final training with high settings may take a week or more)? Would that make sense?

The suggestion amounts to: adding 1 sentence in README.md:

"I've made the settings less computationally demanding than it would be for optimal transcription quality. The settings can be raised by changing the values in <name of settings file>"

Just an idea!

Add user guide to Readme.md and rename as Persephone

Extend tutorial to explain how to set up one's own data.

Reduce the amount of data in the na_example docker image

Make a docker container

Project needs a license

Currently there is no license attached to the project, please decide on a license and include it in the repository.

Model saving, loading, transcribing unseen data

Making this issue now before I forget:

Would be great to have docs on:

Saving model (just save /exp/n/model folder somewhere else?)
Loading model (load saved folder?)
Running loaded model on unseen data (model.untranscribe with loaded model?)

[Yongning Na] adding a boundary (|) at the end of every <S> unit

@oadams : at preprocessing, could a tone-group boundary | be added at the end of each <S> unit (for texts) or <W> unit (for word lists), in cases where one is lacking in the input transcription?

The end of a <S> is, a fortiori, the end of the tone group. A bar | is present in the XML in many cases, but not in all. The training data in its present state is inconsistent (sometimes there is a | at the end of the <S>, sometimes not). Accordingly, in the January 18th results ("Tone groups improve tone prediction"), the automatic transcription sometimes has a final boundary, sometimes not: it seems that the acoustic model is 'puzzled' and has no good criterion. This looks like a case of 'garbage in, garbage out', right?

Addition of the boundaries at preprocessing could lead to improvement in recognition rates for tone-group boundaries.

The texts used at training will also need to be tidied. I will do this gradually: I try to avoid automatic replacements throughout the corpus, because there is the occasional special case where I wouldn't want to add a boundary, for instance after a filled pause 'mmm!' that has no tone and does not constitute a tone group (so it would be misleading to type a | bar after it). But I think these cases are so few that automatic addition at preprocessing should not create trouble for the acoustic model.

Implement semi-supervised learning from untranscribed data

Incorporate Bayesian hyperparameter optimization

Corpora shouldn't be empty

If you try to create a Corpus object and there are no training files (self.{train,valid,test}_prefixes are empty) then an exception should be thrown.

Implement an attentional model for automatic phoneme transcription

Pylinting

Offer functionality for automatic phoneme segmentation in the transcription, given a phoneme inventory

Some work on this has been done in transcription_preprocessing.py

[Yongning Na] Automatic detection of contrastive focus and emphatic stress

Detection of tone-group boundaries (#5 ) is a big priority because it is so important to tone recognition. From a programming point of view, work on detection of tone-group boundaries could pave the way for automatic identification of two other types of suprasegmental events indicated in the training transcriptions: 'F' for intonational focus (contrastive focus) and '↑' for emphatic stress. They are described in §8.3.1 and 8.3.2 of the book.

Like the tone group boundaries, contrastive focus and emphatic stress are not tones or segments: they're "very suprasegmental"!

The training set is probably insufficient to include those in the model at present (about 200 examples of 'F' and 45 examples of '↑') but if the idea is to aim high, well, that's definitely something worth aiming at!

Write unit tests for all the pure functions.

Improve logging

I realise now this is super important. If people email in when the tool crashes, there might not be much to go on. Need to be able to request logs.

persephone-tools / persephone Goto Github PK

persephone's People

Contributors

Stargazers

Watchers

Forkers

persephone's Issues

Recommend Projects

Recommend Topics

Recommend Org