yangyangii / dl-papers

dl-papers's Introduction

DL-Papers

dl-papers's Issues

Style Transfer in Neural Speech Synthesis

"Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron." Skerry-Ryan, R. J., et al. ICML. 2018. PDF

Key Concept

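The most basic model that attempts style transfer in E2E TTS.
Defines prosody as the variation in the speech signal that remains after excluding speaker identity, phonetics, and channel effects.
Introduces a Reference Encoder that takes a spectrogram (the reference audio) as input, embeds it directly, and conditions the model on that embedding together with the text information.
During training, the input (reference) spectrogram is the same as the target spectrogram.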
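
The conditioning step described above can be sketched in a few lines. This is a toy illustration rather than the paper's implementation: it assumes the fixed-length prosody embedding is simply appended to every text encoder state, and the function name `condition_on_prosody` and all the numbers are made up for the example.

```python
def condition_on_prosody(text_states, prosody_embedding):
    """Append the same fixed-length prosody embedding to every text encoder state.

    Assumption: conditioning is done by concatenation along the feature axis.
    """
    return [list(state) + list(prosody_embedding) for state in text_states]


# Toy shapes: 3 text positions with 4-dim states, plus a 2-dim prosody embedding.
text_states = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
prosody_embedding = [0.25, -0.75]  # stands in for the reference encoder output

conditioned = condition_on_prosody(text_states, prosody_embedding)
print(len(conditioned), len(conditioned[0]))
```

Every text position receives the same embedding, so the embedding itself carries no explicit alignment between the reference audio and the text.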

Result

Generates speech that resembles the reference audio.

Limitation

Because the embedding is used directly, the model is very fragile when the prosody of the reference utterance does not match the text (or its timing).

"Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis." Wang, Yuxuan, et al. ICML. 2018. PDF

Key Concept

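Like the prosody transfer paper above, this one comes from Google. Judging from the content, though, the prosody transfer work appears to have been done first internally.
The embedding of the reference audio is not used directly; it is used as a query instead.
Proposes multi-head attention (Q: reference audio; K, V: token embeddings).
Characteristics learned from the reference audio are controlled through the tokens (speed, noise, animated).
Inference is possible without reference audio.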
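
A minimal sketch of the token attention idea, under simplifying assumptions: it uses a single attention head instead of multi-head attention, skips all learned projections, and uses made-up token values. The names `attend_to_tokens` and `combine_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_tokens(query, tokens):
    """Scaled dot-product attention weights, with the reference embedding as query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
    return softmax(scores)


def combine_tokens(weights, tokens):
    """Style embedding as the weighted sum of the token embeddings."""
    dim = len(tokens[0])
    return [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]


tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy style tokens
reference_query = [2.0, 0.0]                     # toy reference embedding

# With reference audio: the weights come from attention.
weights = attend_to_tokens(reference_query, tokens)
style_from_reference = combine_tokens(weights, tokens)

# Without reference audio: the weights are chosen by hand.
style_from_manual_weights = combine_tokens([0.0, 1.0, 0.0], tokens)
```

The second call shows why reference-free inference works: once the tokens are trained, any weight vector yields a style embedding.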

Result

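Shows robust results even on non-parallel transfer, where the prosody transfer paper was weak.
When the style token weights are adjusted, speed is the attribute that stands out (it looks like roughly one usable token out of ten; the other styles are hard even to define).
Beyond that, the experimental results and analyses are extensive and well done.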

Limitation

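The explanation of the structure of the style token layer (including multi-head attention), which is the main contribution, is thin.
Uses 147 hours of single-speaker data rather than LJ or VCTK.
As the number of tokens grows, it becomes harder to find controllable tokens.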

"Neural TTS Stylization with Adversarial and Collaborative Games.", Ma, Shuang, Daniel Mcduff, and Yale Song. ICLR. 2019. PDF

Key Concept

Introduces a GAN to strengthen the style.
Adopts a non-parallel (unpaired) structure.
Introduces the Gram matrix used for style transfer in computer vision.
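
For readers unfamiliar with the Gram matrix, here is a small sketch of how it is typically computed for style losses. This is the generic computer vision formulation applied to a toy feature map of channels over time; the normalization by the number of frames and the squared-error loss are assumptions, not details taken from the paper.

```python
def gram_matrix(features):
    """Channel-by-channel correlation of a feature map.

    features: list of C channels, each a list of T values.
    Returns a C x C matrix, normalized by T (an assumed convention).
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
        for ci in features
    ]


def style_loss(features_a, features_b):
    """Sum of squared differences between two Gram matrices."""
    gram_a, gram_b = gram_matrix(features_a), gram_matrix(features_b)
    return sum(
        (x - y) ** 2
        for row_a, row_b in zip(gram_a, gram_b)
        for x, y in zip(row_a, row_b)
    )


features = [[1.0, 2.0], [3.0, 4.0]]
time_shuffled = [[2.0, 1.0], [4.0, 3.0]]  # same frames, different order

print(gram_matrix(features))
print(style_loss(features, time_shuffled))
```

Reordering the frames leaves the Gram matrix unchanged, which is why it is used as a position-independent summary of style.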

Result

Performance improves on both speaker identity and emotion identity.

Limitation

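Training is unstable because of the complex architecture and losses with somewhat different objectives (it depends heavily on hyperparameters, e.g. the ratio of the losses).
So much is packed into the paper that many parts of the evaluation method and setup are omitted or unclear (seen-speaker accuracy and unseen-speaker accuracy come out at similar values...).
The architecture is limited to categorical (labeled) styles.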

"Transfer learning from speaker verification to multispeaker text-to-speech synthesis.", Jia, Ye, et al., 2018. PDF

Key Concept

Uses a pretrained speaker verification model as the speaker encoder.
Moving away from the earlier embedding-table approach based on speaker labels makes it possible to exploit the embedding space actively (e.g. zero-shot speaker adaptation, fictitious speaker synthesis).
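
The practical difference from an embedding table is that any vector in the embedding space can serve as a speaker. A rough sketch of two operations this enables, assuming unit-length speaker embeddings compared by cosine similarity and fictitious speakers drawn as random points on the unit hypersphere (the function names and the 8-dimensional size are illustrative only):

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Similarity between two speaker embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def sample_fictitious_speaker(dim, rng):
    """Random point on the unit hypersphere (a normalized Gaussian draw)."""
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


rng = random.Random(0)
fictitious = sample_fictitious_speaker(8, rng)
another = sample_fictitious_speaker(8, rng)
print(cosine_similarity(fictitious, another))
```

With an embedding table there is nothing to sample from; here the fictitious vector can be fed to the synthesizer like any real speaker's embedding.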

Result

For seen speakers, the proposed method reaches a MOS similar to the embedding-table approach.
For unseen speakers, it imitates the voice passably but still falls short.
On naturalness MOS, both are excellent.

Limitation

Training the speaker encoder used an internal dataset with 36M utterances and 18K speakers.
Speaker similarity in the zero-shot setting is limited.

"Predicting expressive speaking style from text in end-to-end speech synthesis." Stanton, Daisy, Yuxuan Wang, and R. J. Skerry-Ryan., IEEE SLT, 2018.

Key Concept

Text-Predicted GST (TP-GST).
A speech synthesis model that produces the style from text at the inference stage, without a reference signal.
Trained to predict either the style embedding or the combination weights from the text encoder output.
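
The combination-weight variant can be sketched as follows. This is a heavily simplified illustration: mean pooling stands in for whatever learned summarizer the model uses, the projection matrix is hand-picked rather than trained, and cross-entropy against the reference-derived weights is assumed as the training signal.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_combination_weights(text_states, projection):
    """Predict style token weights from text encoder states alone.

    Assumption: the states are mean-pooled over time, then linearly projected
    to one logit per token.
    """
    dim = len(text_states[0])
    summary = [sum(s[d] for s in text_states) / len(text_states) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, summary)) for row in projection]
    return softmax(logits)


def cross_entropy(target, predicted):
    """Loss between reference-derived weights (target) and text-predicted weights."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


text_states = [[1.0, 0.0], [0.0, 1.0]]                # toy text encoder outputs
projection = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]    # one row per style token
target_weights = [0.7, 0.2, 0.1]                      # toy weights from reference audio

predicted = predict_combination_weights(text_states, projection)
loss = cross_entropy(target_weights, predicted)
```

At inference time only `predict_combination_weights` is needed, which is what removes the dependency on reference audio.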

Result

Can synthesize speech whose style varies with the text, without reference audio.
Resolves the "declining pitch" problem that occurred in the baseline.
Clearly outperforms the baseline in preference tests.
Has an automatic denoising effect (no separate training or tuning for a clean token).

Limitation

Trained on the same large-scale data as GST.
Relies on text alone.

"Semi-supervised training for improving data efficiency in end-to-end speech synthesis." Chung, Yu-An, et al. ICASSP, 2019. PDF

Key Concept

Not a style paper, but a recent trend related to transfer learning.
Conditions the character embeddings on NNLM (or Word2Vec) word vectors.
Pretrains the audio decoder on a larger dataset without text conditioning, then reuses it.

Result

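To demonstrate data efficiency, they first tested at what point Tacotron's intelligibility starts to degrade (such is Google's surplus computing power...).
- 40 to 10 hours: roughly equally good.
- 10 to 3 hours: performance drops little by little.
- 24 minutes: severely degraded.
- 12 minutes: training fails.
The pretrained decoder clearly outperforms W2V, and once the pretrained decoder is used, adding W2V brings almost no gain. (It might still turn out to be meaningful if robustness across sentence sets were tested.)
According to the graphs, with 2 hours of data or more the pretrained decoder still performs well but brings no large gain.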

Limitation

Rather than listing limitations: the paper ran foundational experiments on pretraining and looks likely to remain a good reference for a long time.

"Hierarchical Generative Modeling for Controllable Speech Synthesis." Hsu, Wei-Ning, et al. ICLR, 2018.

Key Concept

Purpose: clean speech with controllable speaking style
GM-VAE Tacotron
Learns and controls the latent attributes of speech through a VAE.
Models the latent attributes under a MoG (mixture of Gaussians) assumption.
Mixes daily-life recordings from YouTube into the speech as the noise signal.
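
To make the MoG assumption concrete, the sketch below computes which mixture component a latent vector most likely belongs to. It is a textbook diagonal-Gaussian mixture posterior, not the paper's full hierarchical model, and the two components and their parameters are invented for the example.

```python
import math


def log_gaussian(z, mean, variance):
    """Log density of a diagonal Gaussian at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, variance)
    )


def component_posterior(z, priors, means, variances):
    """Posterior probability of each mixture component given latent z."""
    log_joint = [
        math.log(p) + log_gaussian(z, m, v)
        for p, m, v in zip(priors, means, variances)
    ]
    top = max(log_joint)  # subtract the max for numerical stability
    unnormalized = [math.exp(value - top) for value in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


priors = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]      # e.g. one "clean" and one "noisy" component
variances = [[1.0, 1.0], [1.0, 1.0]]

near_first = component_posterior([0.1, -0.2], priors, means, variances)
midpoint = component_posterior([1.5, 1.5], priors, means, variances)
```

Control then amounts to picking a component and decoding from its mean (or a sample around it) instead of encoding a reference utterance.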

Result

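Most components represent male/female characteristics, and the components where that is not distinct represent nationality-related characteristics (accent, pronunciation, etc.).
Conditioning on a noisy component versus a clean component produces clearly different results.
Performance is especially good when noisy data is used.
It also works to some degree on unseen speakers, but similarity is lower than with the d-vector method.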

Limitation

Modeling unseen speakers is difficult. (This is hard to call a limitation of this model in particular.)

"Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization." Hsu, Wei-Ning, et al. ICASSP, 2019. PDF

Key Concept

conditional generative model (VAE + Tacotron2 + auxiliary classifier)
data augmentation (independent of speaker identity)
Adversarial factorization (for disentanglement)
- Two kinds of adversarial factorization applied to the speaker embedding
- Speaker classifier
- Augmentation classifier, trained through gradient reversal
Trained on VCTK, mixed with noise from CHiME-4
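
Gradient reversal is the piece that makes the augmentation classifier adversarial. A framework-free sketch of the layer, with hand-written forward and backward passes (the class name and the `scale` parameter are illustrative; in practice this would be a custom autograd operation):

```python
class GradientReversal:
    """Identity in the forward pass, negated (and scaled) gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, values):
        # The classifier sees the embedding unchanged.
        return list(values)

    def backward(self, grad_output):
        # The encoder receives the opposite of the classifier's gradient,
        # so it learns to make the classifier's task harder.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=0.5)
speaker_embedding = [0.3, -1.2]

passed_forward = layer.forward(speaker_embedding)
grad_from_classifier = [0.5, -2.0]
grad_to_encoder = layer.backward(grad_from_classifier)
print(passed_forward, grad_to_encoder)
```

Placed between the speaker embedding and the augmentation classifier, this pushes the speaker embedding to drop the information that reveals whether noise was added.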

Result

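When classifiers are trained on the speaker embedding and on the residual embedding, accuracy is high only for the task that matches each embedding's role, which indicates that the disentanglement worked well.
Even with a noisy speaker embedding, using a clean residual embedding produces clean speech.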

Limitation

"Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning}}." Zhang, Yu, et al. Interspeech, 2019. PDF

Key Concept

Multi-speaker and multi-lingual TTS w/o parallel dataset
Builds a phonemic input representation that carries across languages.
Adds an adversarial loss term (gradient reversal) to disentangle the phonemic representation (from text) from speaker identity.
When the adversarial loss is applied, some texts are highly language-dependent, so gradient clipping is used to cope with the unstable gradients.
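
A small sketch of gradient clipping as it might be applied to the adversarial gradient. Clipping by L2 norm is an assumption here (clipping by value is another common choice), and `clip_by_norm` and the thresholds are illustrative.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


large_gradient = [3.0, 4.0]   # norm 5.0, e.g. from a highly language-dependent text
small_gradient = [0.3, 0.4]   # norm 0.5, left untouched

clipped_large = clip_by_norm(large_gradient, 1.0)
clipped_small = clip_by_norm(small_gradient, 1.0)
print(clipped_large, clipped_small)
```

The direction of the gradient is preserved; only outlier magnitudes are capped, which limits how much a single difficult example can destabilize training.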

Result

Performance was especially good when speaking English, the language with the most data, in voices of other nationalities.
For Chinese, speaker similarity is poor.
Overall, naturalness is good across the board.

Limitation

There are limits for languages with little data.
Improving performance seems to require more speakers or some additional mechanism.
