In this paper, we propose a new training framework, the INmfCA algorithm, for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which the source and target speakers utter the same linguistic content. Although these frameworks achieve high-quality VC, they cannot be applied when parallel corpora are unavailable. Nonparallel methods, which acquire conversion models without parallel corpora, have therefore been widely studied. These methods achieve VC under nonparallel conditions, but they tend to require extensive background knowledge or many training utterances, because disentangling linguistic and speaker information is difficult without a large amount of data. In this work, we tackle this problem by exploiting non-negative matrix factorization (NMF), which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The proposed method acquires an alignment between the acoustic features of a source speaker's utterances and a target dictionary, and uses the obtained alignment as the NMF activation to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains an alignment from nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, so the proposed method can make efficient use of small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.
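To make the NMF idea in the abstract concrete, the following is a minimal sketch, not the authors' INmfCA implementation. It assumes that an activation matrix `H` (the time-variant component) is already available, and it estimates a non-negative dictionary `W` (the time-invariant component) with the standard multiplicative update for the squared Euclidean objective. The function name `fit_dictionary`, the matrix sizes, and the random synthetic data are all illustrative assumptions. How the alignment that serves as the activation is actually acquired is not described in the abstract and is not reproduced here.

```python
import numpy as np


def fit_dictionary(V, H, n_iter=300, eps=1e-9, seed=0):
    """Estimate a non-negative dictionary W with V ~ W @ H, keeping H fixed.

    V: (n_features, n_frames) non-negative acoustic features.
    H: (n_bases, n_frames) non-negative activation (time-variant part).
    Returns W: (n_features, n_bases), the time-invariant part.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], H.shape[0])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the squared Euclidean objective.
        # Every factor is non-negative, so W stays non-negative.
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_features, n_bases, n_frames = 20, 8, 100

    # Synthetic stand-ins for real data (assumption: random non-negative values).
    W_src_true = rng.random((n_features, n_bases))  # unknown source dictionary
    W_tgt = rng.random((n_features, n_bases))       # given target dictionary
    H = rng.random((n_bases, n_frames))             # activation, assumed known
    V_src = W_src_true @ H                          # observed source features

    # Train the source dictionary with the activation held fixed.
    W_src = fit_dictionary(V_src, H)
    rel_err = np.linalg.norm(V_src - W_src @ H) / np.linalg.norm(V_src)
    print(f"relative reconstruction error: {rel_err:.4f}")

    # Dictionary-based conversion: pair the same activation with the target dictionary.
    V_converted = W_tgt @ H
    print(V_converted.shape)
```

In this sketch the activation carries the content shared between speakers, while each dictionary carries speaker-dependent spectral bases. Swapping the dictionary and keeping the activation is what performs the conversion.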
Hitoshi SUDA
the University of Tokyo
Gaku KOTANI
the University of Tokyo
Daisuke SAITO
the University of Tokyo
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hitoshi SUDA, Gaku KOTANI, Daisuke SAITO, "INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization" in IEICE TRANSACTIONS on Information, vol. E105-D, no. 6, pp. 1196-1210, June 2022, doi: 10.1587/transinf.2021EDP7234.
Abstract: In this paper, we propose a new training framework named the INmfCA algorithm for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic contents. Although the frameworks have achieved high-quality VC, they are not applicable in situations where parallel corpora are unavailable. To acquire conversion models without parallel corpora, nonparallel methods are widely studied. Although the frameworks achieve VC under nonparallel conditions, they tend to require huge background knowledge or many training utterances. This is because of difficulty in disentangling linguistic and speaker information without a large amount of data. In this work, we tackle this problem by exploiting NMF, which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires alignment between the acoustic features of a source speaker's utterances and a target dictionary and uses the obtained alignment as activation of NMF to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains the alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, and thus the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7234/_p
@ARTICLE{e105-d_6_1196,
author={Hitoshi SUDA and Gaku KOTANI and Daisuke SAITO},
journal={IEICE TRANSACTIONS on Information},
title={INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization},
year={2022},
volume={E105-D},
number={6},
pages={1196-1210},
keywords={},
doi={10.1587/transinf.2021EDP7234},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization
T2 - IEICE TRANSACTIONS on Information
SP - 1196
EP - 1210
AU - Hitoshi SUDA
AU - Gaku KOTANI
AU - Daisuke SAITO
PY - 2022
DO - 10.1587/transinf.2021EDP7234
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2022
ER -