
as-ideas / deepphonemizer

Stars: 336 · Watchers: 20 · Forks: 37 · Size: 1.37 MB

Grapheme to phoneme conversion with deep learning.

License: MIT License

Languages: Python 91.61% · Jupyter Notebook 8.06% · Shell 0.33%
Topics: pytorch, deep-learning, transformer, grapheme-to-phoneme, g2p, ipa, phonemization, phonemes

deepphonemizer's People

Contributors

alexteua, cschaefer26


deepphonemizer's Issues

Modifying to generate multiple predictions

This is not the intended use case for this package, but I'm trying to find a way to generate multiple pronunciations for a given word. Is there any way to modify the predict script to generate multiple result sequences, along with the probability of each? Thanks!
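For what it's worth, a minimal sketch of what top-k decoding could look like, assuming the autoregressive model is exposed as a per-step scoring function (this is not the package's predict script; step_fn, start_ids and end_id are hypothetical names):

import torch

def beam_search(step_fn, start_ids, max_steps=32, beam_width=3, end_id=2):
    # Each beam is a (token-id list, cumulative log-probability) pair.
    beams = [(list(start_ids), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for ids, score in beams:
            if ids[-1] == end_id:
                candidates.append((ids, score))  # sequence already finished
                continue
            log_probs = step_fn(ids)  # log-probs over the vocab for the next token
            top = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((ids + [int(tok)], score + float(lp)))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
    return beams  # k sequences, each with its total log-probability

Each returned score is a cumulative log-probability, so exp(score) gives a (length-unnormalized) probability per candidate.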

latin_ipa_forward checkpoint dropping German letter 'ß'

Hi,

I am using the latin_ipa_forward.pt checkpoint to phonemize large amounts of German text. While it works fine in almost every case, for some reason it seems to drop the German letter 'ß'. E.g.:

>>>phonemizer('ich wohne in der sesamstraße.', lang='de')
ɪç voːnə ʔɪn deːr zezamstʁaː.

Any idea how to fix this?
Thanks!
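A possible workaround in the meantime (an assumption, not an official fix): map 'ß' to 'ss' before phonemizing, which is the standard orthographic fallback in German and should land on the same /s/ sound:

# 'phonemizer' is the loaded Phonemizer object from the snippet above.
text = 'ich wohne in der sesamstraße.'
result = phonemizer(text.replace('ß', 'ss'), lang='de')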

Heteronym problem

Hello, thanks for your great work! I found that dp.phonemizer cannot handle heteronyms well.

For example:

"We create the new record in the recording room"

turns into

"[W][IY] [K][R][IY][EY][T] [DH][AH] [N][UW] [R][AH][K][AO][R][D] [IH][N] [DH][AH] [R][AH][K][AO][R][D][IH][NG] [R][UW][M]"

while "record" (the noun) should be [R][EH][K][ER][D].

Do you have any suggestions? Thanks!

checkpoint: en_us_cmudict_forward

Hello, love your resource and would like to use it to convert phrases to Arpabet symbols. I noticed that the link to the checkpoint "en_us_cmudict_forward" is the same as "en_us_cmudict_ipa_forward". Could you please link the correct file? Thank you!

Optimizer is None when trying to fine-tune a pretrained model

Thanks for the repo!

When trying to fine-tune one of the provided pretrained models, I got an unintuitive error. This was because the models were saved without an optimizer; when loading the checkpoint, the check at line 76 in training/trainer.py wouldn't stop it from loading the optimizer, as checkpoint['optimizer'] existed in the dict with a None value:

optimizer = Adam(model.parameters())
if 'optimizer' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])
for g in optimizer.param_groups:
    g['lr'] = config['training']['learning_rate']

Changing the line to if 'optimizer' in checkpoint and checkpoint['optimizer']: should fix it.
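Applied to the snippet above, the patched check would look like this:

optimizer = Adam(model.parameters())
# Skip loading when the checkpoint stored optimizer=None,
# as the released pretrained models apparently do.
if 'optimizer' in checkpoint and checkpoint['optimizer']:
    optimizer.load_state_dict(checkpoint['optimizer'])
for g in optimizer.param_groups:
    g['lr'] = config['training']['learning_rate']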

Fine-tune existing model

Hi,
Is it possible to fine-tune an existing model, for example to add a new language?
Thanks!

Problem exporting with JIT

Hello,
I hit a problem when running the export code snippet in the README:

RuntimeError:
Unknown type name 'torch.tensor':

Any ideas?
Thanks

Include stress prediction

Hi, @cschaefer26
Cool lib!

I was just wondering: is there any particular reason you don't include stress prediction in the pipeline?
Both "cmudict-ipa" and "wikipron" include stress labelling.
The phoneme tokenizers from the pretrained checkpoints lack the ' and , symbols (this was probably done due to a collision with punctuation, but it's pretty easy to avoid).

Some character sets don't work

Hi.
I'm working on this shared task:

https://github.com/sigmorphon/2022G2PST

Some of the character sets work fine, but others do not, specifically: Persian, Bengali, and Thai.

Persian and Bengali fail when training begins. Thai fails at inference.

Any ideas why this might be so?

I'm appending the error below. The problem seems to be in training/trainer.py.

thank you,

mike h.

(mhenv) mhammond@SBS-7337:~/Dropbox/fromlapper/sigmorphon2022/deep$ python doit.py 
per
{'ن', 'و', 'ج', 'ل', 'ژ', 'س', 'ض', 'ذ', 'ت', 'ه', 'ر', '\u200c', 'ث', 'ظ', 'ش', 'ا', 'ع', 'ئ', 'م', 'غ', 'ە', 'ص', 'ح', 'آ', 'ء', 'پ', 'چ', 'گ', 'خ', 'ف', 'ی', 'ق', 'ز', 'د', 'ک', 'ب'}
2022-05-22 15:26:50,656.656 INFO preprocess:  Preprocessing, train data: with 100 files.
2022-05-22 15:26:50,656.656 INFO preprocess:  Processing train data...
100%|██████████████████████████████████████| 100/100 [00:00<00:00, 86178.43it/s]
2022-05-22 15:26:50,659.659 INFO preprocess:  
Saving datasets to: /home/mhammond/Desktop/datasets
2022-05-22 15:26:50,660.660 INFO preprocess:  Preprocessing. 
Train counts (deduplicated): [('per', 100)]
Val counts (including duplicates): [('per', 56)]
2022-05-22 15:26:50,662.662 INFO train:  Initializing new model from config...
2022-05-22 15:26:50,742.742 INFO train:  Checkpoints will be stored at /home/mhammond/Desktop/checkpoints
Traceback (most recent call last):
  File "doit.py", line 79, in <module>
    train(config_file=lang+'.yaml')
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/train.py", line 57, in train
    trainer.train(model=model,
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in train
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in <listcomp>
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 569, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
    for idx in self.sampler:
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/dataset.py", line 54, in __iter__
    binned_idx = np.stack(bins).reshape(-1)
  File "<__array_function__ internals>", line 180, in stack
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
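For context, the final error is easy to reproduce in isolation. A guess (not a confirmed diagnosis): the binned sampler in dp/training/dataset.py ends up with an empty list of bins for such a small validation set, and np.stack refuses an empty input:

import numpy as np

np.stack([])  # ValueError: need at least one array to stack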

New version does not save checkpoints properly

I've been using this model for some time and recently had to install it in a new Python venv. The train() syntax has changed since my original install, so I know some changes have been made since then. The problem is, it no longer saves any checkpoint files during training: the directory is created, but nothing ever shows up there; latest_model.pt, best_model.pt, etc. are never written. This makes the model unusable for me, as I need to test it after training, not during.

Issue while converting model using JIT script

I was trying to convert my trained Hindi model to a JIT-compatible model using the method provided in the README, and I'm facing an error at the dropout layer.

In [6]: phonemizer.predictor.model = torch.jit.script(model)

Unknown type name 'torch.tensor':
  File "/home/gnani/.virtualenvs/ttsapi/lib/python3.8/site-packages/dp/model/utils.py", line 24
    def forward(self, x: torch.tensor) -> torch.tensor:         # shape: [T, N]
                         ~~~~~~~~~~~~ <--- HERE
        x = x + self.scale * self.pe[:x.size(0), :]
        return self.dropout(x)

In eval mode:

In [8]: model.eval()
In [9]: phonemizer.predictor.model = torch.jit.script(model)

RuntimeError: Can't redefine method: forward on class: __torch__.dp.model.utils.PositionalEncoding

Any suggestions would be greatly appreciated!
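One plausible fix for the first error (an assumption based on the TorchScript message, not a confirmed patch): TorchScript accepts the class torch.Tensor as a type annotation, but not the factory function torch.tensor, so the annotations in dp/model/utils.py would need to change. The second error likely just means torch.jit.script was run twice against the same class in one session.

# dp/model/utils.py, line 24 - annotate with the class torch.Tensor,
# not the factory function torch.tensor:
def forward(self, x: torch.Tensor) -> torch.Tensor:  # shape: [T, N]
    x = x + self.scale * self.pe[:x.size(0), :]
    return self.dropout(x)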

Quick question: where do the pretrained model's phoneme dictionaries come from?

Hi!

Great work you're doing here.
I've been testing your tool; it's easy to use and gives fine results.
Since I'm looking for a tool to generate phonemized input for the VITS model (in ONNX format), I need to use the same tokenizer (phonemizer) that model expects. I've found that your pretrained models already have the dictionary embedded in them. Can I ask where those dictionaries came from? In your Colab training example you use CUNY-CL/wikipron's, but I was wondering if those are the ones you used originally or just in the example.

Thanks.

ZeroDivisionError with run_training.py

To get familiar with the DeepPhonemizer tool, I ran the two example Python files in the repository. run_prediction.py works as expected, but run_training.py generates an error.

Here is the log output:

mbarnig@mbarnig-MS-7B22:~/DeepPhonemizer$ python3 ./run_training.py
2021-05-25 22:40:49,343.343 INFO preprocess:  Preprocessing, train data: with 300 files.
2021-05-25 22:40:49,344.344 INFO preprocess:  Performing random split with num val: 100
2021-05-25 22:40:49,344.344 INFO preprocess:  Processing train data...
0it [00:00, ?it/s]
2021-05-25 22:40:49,347.347 INFO preprocess:  
Saving datasets to: /home/mbarnig/DeepPhonemizer/datasets
2021-05-25 22:40:49,348.348 INFO preprocess:  Preprocessing. 
Train counts (deduplicated): []
Val counts (including duplicates): [('de', 200), ('en_us', 100)]
2021-05-25 22:40:49,352.352 INFO train:  Initializing new model from config...
2021-05-25 22:40:49,380.380 INFO train:  Checkpoints will be stored at /home/mbarnig/DeepPhonemizer/checkpoints
2021-05-25 22:40:49.464084: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "./run_training.py", line 14, in <module>
    train(config_file=config_file)
  File "/home/mbarnig/DeepPhonemizer/dp/train.py", line 46, in train
    trainer.train(model=model,
  File "/home/mbarnig/DeepPhonemizer/dp/training/trainer.py", line 74, in train
    start_epoch = checkpoint['step'] // len(train_loader)
ZeroDivisionError: integer division or modulo by zero

Please advise me what is missing. Thank you.

Marco Barnig
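A plausible reading of the log (an assumption, not a confirmed diagnosis): 0 train items were processed and the train counts are empty, so len(train_loader) is 0 when trainer.py computes the start epoch:

step = 0           # stand-in for checkpoint['step']
train_loader = []  # empty training set, as '0it' and '[]' in the log suggest
start_epoch = step // len(train_loader)  # ZeroDivisionError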

Training for Persian

Hello. Thanks for your great work. I want to train the model on Persian data. In Persian we link some words based on context using 'Ezafe', which is not written but is pronounced. For example, here are two words and their phonemes:
کیف: kif
من: man
But we read the sentence 'کیف من' as 'kife man', not 'kif man' (Persian is written right to left). Also, a word's pronunciation can differ based on its meaning.
My question is: how can I change the model to handle these issues?
Thanks

Is There a Sample Showing How to Convert to ONNX?

This looks like a really cool project. Thanks for your hard work.

Could someone please provide a sample of how to convert to ONNX? I'm new to this and I'm having a hard time figuring out how to provide the sample input. I see some others had the same problem in this closed issue (#23). While some said they were able to export to ONNX, nobody provided a code sample of the export call.

I see the forward method of the ForwardTransformer class takes a dictionary with tensors for the following keys: text, start_index, and text_len.

Since the model takes variable-length text as input (not fixed size) and returns variable-length output, I assume I have to tell torch.onnx.export that the "text" input and the output have dynamic shapes. I tried setting dynamic_axes but didn't have any success. If anyone could provide a sample, it would be much appreciated.
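For what it's worth, a sketch of one standard workaround, under the assumptions stated above (dict input with keys text, start_index and text_len; shapes and dtypes below are guesses): wrap the model so the exporter sees plain tensors, then mark the sequence axis as dynamic.

import torch

class ExportWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, text, start_index, text_len):
        # Rebuild the dict the ForwardTransformer reportedly expects.
        return self.model({'text': text,
                           'start_index': start_index,
                           'text_len': text_len})

wrapper = ExportWrapper(model).eval()
dummy = (torch.randint(1, 30, (1, 12), dtype=torch.long),  # token ids
         torch.tensor([1], dtype=torch.long),
         torch.tensor([12], dtype=torch.long))
torch.onnx.export(wrapper, dummy, 'phonemizer.onnx',
                  opset_version=14,
                  input_names=['text', 'start_index', 'text_len'],
                  output_names=['logits'],
                  dynamic_axes={'text': {1: 'seq_len'},
                                'logits': {1: 'seq_len'}})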

[BUG] On Windows 10, "preprocessor.phoneme_tokenizer" does not output token IDs greater than 28

I found a bug on Windows 10: the "preprocessor.phoneme_tokenizer" function doesn't work properly, as it will not output any token ID greater than 28. The problem is caused by the encoding used when reading the configuration file.

To fix the bug, please add encoding='utf-8' at line 21 of "dp/utils/io.py":

    import yaml

    def read_config(path: str) -> dict:
        with open(path, 'r', encoding='utf-8') as stream:
            config = yaml.load(stream, Loader=yaml.FullLoader)
        return config

Train sentences

Hi. I was able to train an Italian model almost perfectly, with the exception of a few words that are intrinsically ambiguous without context. Since your model is similar to the BERT transformer, what do you think would be the best solution to let the model learn words with context? Would passing whole sentences be enough, or should MLM be implemented?

Numeric values in sentences not being Phonemized

If you run a string containing a numeric value through the phonemizer, the number comes out as an empty string. For example:

 resultOne = phonemizer('It\'s 1 o\'clock', lang='en_us')
 print(resultOne)

 resultTwo = phonemizer('It\'s one o\'clock', lang='en_us')
 print(resultTwo)

Produces:

Result One: ɪts ɑklɑk

Result Two: ɪts wʌn ɑklɑk

Perhaps when the input text is split, a raw numeric value could be converted to a spelled-out string before being fed to the phonemizer. I'm not sure if Python provides a built-in way to do this (newbie @ Python), but if not, a library like Inflect could perhaps be used.
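Something like the Inflect-based preprocessing suggested above could look like this (a sketch, not part of the library):

import re
import inflect

_engine = inflect.engine()

def spell_out_numbers(text: str) -> str:
    # Replace each run of digits with its spelled-out form, e.g. '1' -> 'one'.
    return re.sub(r'\d+', lambda m: _engine.number_to_words(m.group()), text)

# phonemizer(spell_out_numbers("It's 1 o'clock"), lang='en_us')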

Thanks a lot for this cool repository.

Need instructions for porting the trained model to tflite via onnx

I am trying to convert a custom-trained model from a PyTorch checkpoint to tflite via the ONNX route. I am stuck at the ONNX step, as I can't seem to provide a dummy input for the conversion. Can you help me with this?

torch.onnx.export(
    model,                  # PyTorch model
    dummy_input,            # Input tensor
    "output.onnx",          # Output file (e.g. 'output_model.onnx')
    opset_version=14,       # Operator support version
    input_names=['embedding'],   # Input tensor name (arbitrary)
    output_names=['fc_out']      # Output tensor name (arbitrary)
)
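If the forward signature described in the earlier ONNX issue is accurate (a dict of tensors with keys text, start_index and text_len), the missing dummy input might be built like this; the shapes and dtypes are guesses, and a thin wrapper module that unpacks plain tensors into that dict (as sketched in that issue) sidesteps ONNX's awkward handling of dict inputs:

import torch

dummy_input = {
    'text': torch.randint(1, 30, (1, 12), dtype=torch.long),
    'start_index': torch.tensor([1], dtype=torch.long),
    'text_len': torch.tensor([12], dtype=torch.long),
}
# The trailing empty dict marks the end of positional args, so the dict
# above is not mistaken for keyword arguments by torch.onnx.export.
torch.onnx.export(model, (dummy_input, {}), 'output.onnx', opset_version=14)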

Transformer unable to predict double phonemes

Hello.
I found a bug where the transformer model is unable to learn sequences of two or more consecutive identical phonemes. I first discovered it for Italian, which has double consonants, and then reproduced it in English as well. Take the words "holy" and "wholly" as an example. According to WordReference, their RP (probably outdated) pronunciations should be həʊli and həʊlli respectively. I don't know how common the latter is with a geminated l sound, but it doesn't really matter. What matters is that even with char_repeats set to 3 or 5, the transformer is unable to predict double phonemes.

It can be easily reproduced by running the run_training.py debug script with the default yaml file and this data:


from dp.preprocess import preprocess
from dp.train import train

train_data = [('en_us', 'holy', 'həʊli'),
              ('en_us', 'wholly', 'həʊlli')] * 50

val_data = [('en_us', 'holy', 'həʊli'),
            ('en_us', 'wholly', 'həʊlli')] * 60

config_file = 'forward_config.yaml'

preprocess(config_file=config_file,
           train_data=train_data,
           val_data=val_data,
           deduplicate_train_data=False)

train(config_file=config_file)

Even in a heavily overfitting setup, you will see that the prediction is always həʊli. Reproduction rate: 100%.

CMU Dictionary IPA Train/Test Dataset

Thank you for this great work!

I'm planning to train a phonemizer model of my own and want to compare the results against your en_us_cmudict_ipa_forward model. It would be great to know which dataset, and which train/test split, was used to train and evaluate that model.

Cheers.

I am a newbie, and I have a pretty simple question.

Hi,
I just want to know: is the idea to provide a tool for developers to train a G2P model for their own language, and then use the DeepPhonemizer API to convert text into a sequence of phonemes? Am I right or wrong?

latin_ipa_forward set to en_us pronounces "is" as "aɪz" rather than "ɪz"

I tried swapping out the default phonemizer in ForwardTacotron for DeepPhonemizer and noticed the model was very hesitant to learn attention. Taking a look at the actual output, it was using "aɪz" instead of "ɪz". I have since switched to the cmudict version, which does not have this issue, but it would be nice to have this fixed, especially for such a common word. Looking at the actual wikipron dataset, I see that the phonemes for "is" are correct, so I'm not sure what's causing this.

Question on overfitting

While training a forward transformer model, I observed that PER and WER kept falling even after the validation loss started to rise.

(screenshot: validation curves)

My training config is based on the example forward transformer config, with phoneme_symbols modified (phonemes are in ARPABET and vowels carry stress marks) and dropout set to 0.3.

Should I keep training or should I use the model with the lowest validation loss? Or any other suggestion?

Checkpoint not working for multiple languages

I am training a single model supporting two languages.

For one of the languages, a row of training data looks like this:

&**ω⊃⊃&⟴∅ w i k k a t t i n u

TensorBoard showed correct entries during the training phase, but during inference (using the Python pip package) I am getting a single-character output for any input. With the same checkpoint, the other language works fine.

Convert model to ONNX

Thank you for such a beautiful project. I'm wondering if we can convert the model we created to the ONNX format, and what I'd need to do for that. Thank you in advance.

No delimiter in predictions from multi-letter phoneme codes

Thank you for putting this out there. I'm trying to train the model myself on English CMU pronunciations, which use multi-letter phoneme codes. I structure my phoneme transcriptions as lists, for example:

('en_us', 'timbre', ['T','IH1','M','B','ER0'])

The model trains fine, but when I ask for transcriptions (via, say, phonemise_list()), the model output doesn't put delimiters between the phonemes; so its version of 'timbre' is:

'TAY1MBER0'

This is not helpful, and it's also not what the pre-trained CMU model does; that one produces output like:

'[T][AY1][M][B][ER0]'

How can I adjust the config file or the calls to train() so that I get back something with delimiters between the phonemes?
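One workaround, inferred from the bracketed output of the pre-trained CMU model above (an assumption, not a documented option): make the delimiters part of the phoneme symbols themselves by wrapping each code in brackets in the training data:

entry = ('en_us', 'timbre', ['T', 'IH1', 'M', 'B', 'ER0'])
lang, word, phonemes = entry
delimited = (lang, word, ['[' + p + ']' for p in phonemes])
# -> ('en_us', 'timbre', ['[T]', '[IH1]', '[M]', '[B]', '[ER0]'])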

Question about grapheme set

Hello. Thank you for this amazing repository!
I have a question, though: what's the easiest way to get a unique grapheme set for a specific language? How did you get that list when training a multilingual model?
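One simple approach (a sketch, not necessarily what the authors did): collect the set of characters that actually occur in the training words, per language:

from collections import defaultdict

train_data = [('en_us', 'holy', 'həʊli'),   # (lang, word, phonemes)
              ('de', 'straße', 'ʃtʁaːsə')]

graphemes = defaultdict(set)
for lang, word, _ in train_data:
    graphemes[lang].update(word)

for lang, chars in graphemes.items():
    print(lang, sorted(chars))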
