
Comments (5)

Tomiinek commented on May 28, 2024

Hello!

First, the original encoder is implemented here:

import torch
from torch.nn import LSTM, Sequential

# ConvBlock (convolution + batch norm + activation + dropout) is defined in the repository's layers module.
class Encoder(torch.nn.Module):
    """Vanilla Tacotron 2 encoder.

    Details:
        stack of 3 conv. layers 5 × 1 with BN and ReLU, dropout
        output is passed into a Bi-LSTM layer

    Arguments:
        input_dim -- size of the input (supposed character embedding)
        output_dim -- number of channels of the convolutional blocks and last Bi-LSTM
        num_blocks -- number of the convolutional blocks (at least one)
        kernel_size -- kernel size of the encoder's convolutional blocks
        dropout -- dropout rate to be applied after each convolutional block

    Keyword arguments:
        generated -- just for convenience
    """

    def __init__(self, input_dim, output_dim, num_blocks, kernel_size, dropout, generated=False):
        super(Encoder, self).__init__()
        assert num_blocks > 0, 'There must be at least one convolutional block in the encoder.'
        assert output_dim % 2 == 0, 'Bidirectional LSTM output dimension must be divisible by 2.'
        convs = [ConvBlock(input_dim, output_dim, kernel_size, dropout, 'relu')] + \
                [ConvBlock(output_dim, output_dim, kernel_size, dropout, 'relu') for _ in range(num_blocks - 1)]
        self._convs = Sequential(*convs)
        self._lstm = LSTM(output_dim, output_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x, x_lengths, x_langs=None):
        # x_langs argument is there just for convenience
        x = x.transpose(1, 2)   # [B, L, F] -> [B, F, L] for Conv1d
        x = self._convs(x)
        x = x.transpose(1, 2)   # back to [B, L, F] for the LSTM
        ml = x.size(1)          # remember the max length before packing
        x = torch.nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
        self._lstm.flatten_parameters()
        x, _ = self._lstm(x)
        x, _ = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True, total_length=ml)
        return x

and it does not contain highway convolutions.
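For reference, a minimal smoke test of this encoder might look as follows (a sketch only; it assumes ConvBlock from the repository is available and the dimensions are illustrative):

import torch

enc = Encoder(input_dim=512, output_dim=512, num_blocks=3, kernel_size=5, dropout=0.5)
x = torch.randn(2, 40, 512)       # [batch, max_length, embedding_dim]
lengths = torch.tensor([40, 33])  # sorted in decreasing order, as pack_padded_sequence expects
out = enc(x, lengths)
print(out.shape)                  # torch.Size([2, 40, 512])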

The generated encoder is based on a fully convolutional encoder without LSTMs; the implementation is here:

class ConvolutionalEncoder(torch.nn.Module):
    """Convolutional encoder (possibly multi-lingual).

    Expects input of shape [B * N, L, F], where B is divisible by N (number of languages) and
    samples of each language occupy every N-th position in the batch, starting at the i-th
    position (so that it can be reshaped to [B, N * F, L] easily).

    Arguments:
        input_dim -- size of the input (supposed character embedding)
        output_dim -- number of channels of the convolutional blocks and output
        dropout -- dropout rate to be applied after each convolutional block

    Keyword arguments:
        groups (default: 1) -- number of separate encoders (which are implemented using grouped convolutions)
    """

    def __init__(self, input_dim, output_dim, dropout, groups=1):
        super(ConvolutionalEncoder, self).__init__()
        self._groups = groups
        self._input_dim = input_dim
        self._output_dim = output_dim
        input_dim *= groups
        output_dim *= groups
        layers = [ConvBlock(input_dim, output_dim, 1, dropout, activation='relu', groups=groups),
                  ConvBlock(output_dim, output_dim, 1, dropout, groups=groups)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=3**i, groups=groups) for i in range(4)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=3**i, groups=groups) for i in range(4)] + \
                 [HighwayConvBlock(output_dim, output_dim, 3, dropout, dilation=1, groups=groups) for _ in range(2)] + \
                 [HighwayConvBlock(output_dim, output_dim, 1, dropout, dilation=1, groups=groups) for _ in range(2)]
        self._layers = Sequential(*layers)  # the forward pass is omitted in the original comment
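As an aside on the groups argument: grouped convolutions fuse N independent language-specific encoders into a single operation. A minimal demonstration of the mechanics (the shapes here are illustrative, not taken from the repository):

import torch

# With groups=2, the first half of the input channels is convolved only with
# the first half of the filters, and likewise for the second half -- i.e. two
# independent per-language convolutions computed in one call.
conv = torch.nn.Conv1d(in_channels=2 * 8, out_channels=2 * 16,
                       kernel_size=3, padding=1, groups=2)
x = torch.randn(4, 2 * 8, 50)  # [B, N * F, L] with N = 2 languages, F = 8
y = conv(x)
print(y.shape)                 # torch.Size([4, 32, 50])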

This generated encoder follows the architecture from the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" (DCTTS). It is much deeper than the original encoder with LSTMs, so the highway blocks support gradient propagation back to the very beginning of the encoder and enable the model to converge.
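For intuition, here is a minimal sketch of a gated highway convolution in the DCTTS style (an illustration only, not the repository's HighwayConvBlock; the class name and signature are assumptions):

import torch

class HighwayConv1d(torch.nn.Module):
    """Gated 1-D highway convolution: out = sigmoid(H1) * H2 + (1 - sigmoid(H1)) * x.
    The identity path lets gradients skip the convolution entirely, which is
    what makes very deep stacks of these blocks trainable."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # preserve sequence length
        # A single convolution produces both the gate (H1) and the candidate (H2).
        self._conv = torch.nn.Conv1d(channels, 2 * channels, kernel_size,
                                     padding=padding, dilation=dilation)

    def forward(self, x):  # x: [B, C, L]
        h1, h2 = self._conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(h1)
        return gate * h2 + (1.0 - gate) * x

With kernel size 3 and dilations 1, 3, 9, 27 applied twice, as in the layer list above, the receptive field grows quickly while every block keeps an identity path back to the input.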


Tomiinek commented on May 28, 2024

Hm, I do not think so.


YihWenWang commented on May 28, 2024

Thanks for your reply.
Could this change to the model cause noise or pronunciation problems?
I ask because my English-Mandarin TTS model sometimes has these issues.


stale commented on May 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


