
Comments (5)

Tomiinek commented on May 28, 2024

Hello!

First, the original encoder is implemented here:

import torch
from torch.nn import LSTM, Sequential

# ConvBlock (convolution + batch norm + activation + dropout) is defined in the repository's layers module.
class Encoder(torch.nn.Module):
    """Vanilla Tacotron 2 encoder.

    Details:
        stack of 3 conv. layers 5 × 1 with BN and ReLU, dropout
        output is passed into a Bi-LSTM layer

    Arguments:
        input_dim -- size of the input (supposed character embedding)
        output_dim -- number of channels of the convolutional blocks and last Bi-LSTM
        num_blocks -- number of the convolutional blocks (at least one)
        kernel_size -- kernel size of the encoder's convolutional blocks
        dropout -- dropout rate to be applied after each convolutional block

    Keyword arguments:
        generated -- just for convenience
    """

    def __init__(self, input_dim, output_dim, num_blocks, kernel_size, dropout, generated=False):
        super(Encoder, self).__init__()
        assert num_blocks > 0, 'There must be at least one convolutional block in the encoder.'
        assert output_dim % 2 == 0, 'Bidirectional LSTM output dimension must be divisible by 2.'
        convs = [ConvBlock(input_dim, output_dim, kernel_size, dropout, 'relu')] + \
                [ConvBlock(output_dim, output_dim, kernel_size, dropout, 'relu') for _ in range(num_blocks - 1)]
        self._convs = Sequential(*convs)
        self._lstm = LSTM(output_dim, output_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x, x_lengths, x_langs=None):
        # x_langs argument is there just for convenience
        x = x.transpose(1, 2)   # [B, L, F] -> [B, F, L] for Conv1d
        x = self._convs(x)
        x = x.transpose(1, 2)   # back to [B, L, F] for the LSTM
        ml = x.size(1)          # remember the max length before packing
        x = torch.nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
        self._lstm.flatten_parameters()
        x, _ = self._lstm(x)
        x, _ = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True, total_length=ml)
        return x

and it does not contain highway convolutions.
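For reference, a minimal smoke test of this encoder might look as follows (a sketch only; it assumes ConvBlock from the repository is available and the dimensions are illustrative):

import torch

enc = Encoder(input_dim=512, output_dim=512, num_blocks=3, kernel_size=5, dropout=0.5)
x = torch.randn(2, 40, 512)       # [batch, max_length, embedding_dim]
lengths = torch.tensor([40, 33])  # sorted in decreasing order, as pack_padded_sequence expects
out = enc(x, lengths)
print(out.shape)                  # torch.Size([2, 40, 512])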

The generated encoder is based on a fully convolutional encoder without LSTMs; the implementation is here:

class ConvolutionalEncoder(torch.nn.Module):
    """Convolutional encoder (possibly multi-lingual).

    Expects input of shape [B * N, L, F], where B is divisible by N (number of languages) and
    samples of each language occupy every N-th position in the batch, starting at the i-th
    position (so that it can be reshaped to [B, N * F, L] easily).

    Arguments:
        input_dim -- size of the input (supposed character embedding)
        output_dim -- number of channels of the convolutional blocks and output
        dropout -- dropout rate to be applied after each convolutional block

    Keyword arguments:
        groups (default: 1) -- number of separate encoders (which are implemented using grouped convolutions)
    """

    def __init__(self, input_dim, output_dim, dropout, groups=1):
        super(ConvolutionalEncoder, self).__init__()
        self._groups = groups
        self._input_dim = input_dim
        self._output_dim = output_dim
        input_dim *= groups
        output_dim *= groups
        layers = [ConvBlock(input_dim, output_dim, 1, dropout, activation='relu', groups=groups),
                  ConvBlock(output_dim, output_dim, 1, dropout, groups=groups)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=3**i, groups=groups) for i in range(4)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=3**i, groups=groups) for i in range(4)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=1, groups=groups) for _ in range(2)] + \
                 [HighwayConvBlock(output_dim, output_dim, 1, dropout, dilation=1, groups=groups) for _ in range(2)]
        self._layers = Sequential(*layers)  # the forward pass is omitted in the original comment
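As an aside on the groups argument: grouped convolutions fuse N independent language-specific encoders into a single operation. A minimal demonstration of the mechanics (the shapes here are illustrative, not taken from the repository):

import torch

# With groups=2, the first half of the input channels is convolved only with
# the first half of the filters, and likewise for the second half -- i.e. two
# independent per-language convolutions computed in one call.
conv = torch.nn.Conv1d(in_channels=2 * 8, out_channels=2 * 16,
                       kernel_size=3, padding=1, groups=2)
x = torch.randn(4, 2 * 8, 50)  # [B, N * F, L] with N = 2 languages, F = 8
y = conv(x)
print(y.shape)                 # torch.Size([4, 32, 50])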

This generated encoder follows the architecture from the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" (DCTTS). It is much deeper than the original encoder with LSTMs, so the highway blocks support gradient propagation back to the very beginning of the encoder and enable the model to converge.
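For intuition, here is a minimal sketch of a gated highway convolution in the DCTTS style (an illustration only, not the repository's HighwayConvBlock; the class name and signature are assumptions):

import torch

class HighwayConv1d(torch.nn.Module):
    """Gated 1-D highway convolution: out = sigmoid(H1) * H2 + (1 - sigmoid(H1)) * x.
    The identity path lets gradients skip the convolution entirely, which is
    what makes very deep stacks of these blocks trainable."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # preserve sequence length
        # A single convolution produces both the gate (H1) and the candidate (H2).
        self._conv = torch.nn.Conv1d(channels, 2 * channels, kernel_size,
                                     padding=padding, dilation=dilation)

    def forward(self, x):  # x: [B, C, L]
        h1, h2 = self._conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(h1)
        return gate * h2 + (1.0 - gate) * x

With kernel size 3 and dilations 1, 3, 9, 27 applied twice, as in the layer list above, the receptive field grows quickly while every block keeps an identity path back to the input.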


Tomiinek commented on May 28, 2024

Hm, I do not think so.


YihWenWang commented on May 28, 2024

Thanks for your reply.
Could this change to the model cause noise or pronunciation problems?
I ask because my English-Mandarin TTS model sometimes has these issues.


stale commented on May 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


