The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The idea is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The latent space of the aspect model has specific characteristics that differ from the orthogonal basis vectors of eigenvoice. The aspect model is a "mixture-of-mixture" model: we first compute a small number of latent reference models as mixtures of the distributions of the reference speakers' models, and then mix the latent reference models to obtain the adapted distribution. The mixture weights are calculated with the expectation-maximization (EM) algorithm, and the resulting weights are used to interpolate the mean parameters of the distributions. Both training and adaptation are performed by maximizing likelihood with respect to the training and adaptation data, respectively. We conduct a continuous speech recognition experiment using a Korean database (KAIST-TRADE) and compare the results with those of conventional MAP, MLLR, RSW, eigenvoice, and SCW. The proposed method achieved an absolute word accuracy improvement of 2.06 points, even though only 0.3 s of adaptation data was used.
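The weight-estimation step mentioned in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes one-dimensional Gaussians with a shared, fixed variance, a single level of mixing (not the full two-level "mixture-of-mixture" aspect model), and made-up reference means and adaptation samples. It only shows how EM can estimate mixture weights over fixed reference components and how those weights can then interpolate the means. All function and variable names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture_weights(data, means, var=1.0, iterations=50):
    """Estimate mixture weights over fixed Gaussian components by EM.

    Only the weights are updated; the component means and the shared
    variance stay fixed (an assumption made for this sketch).
    """
    k = len(means)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        totals = [0.0] * k
        for x in data:
            # E-step: posterior responsibility of each component for x
            joint = [w * gaussian_pdf(x, m, var) for w, m in zip(weights, means)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm
        # M-step: each new weight is the average responsibility
        weights = [t / len(data) for t in totals]
    return weights


def interpolate_mean(weights, means):
    """Adapted mean as the weighted combination of reference means."""
    return sum(w * m for w, m in zip(weights, means))


# Hypothetical reference-model means and a handful of adaptation samples.
reference_means = [-2.0, 0.0, 3.0]
adaptation_data = [2.6, 3.1, 2.9, 3.4, 2.2]

weights = em_mixture_weights(adaptation_data, reference_means)
adapted_mean = interpolate_mean(weights, reference_means)
print("weights:", [round(w, 3) for w in weights])
print("adapted mean:", round(adapted_mean, 3))
```

Because only a few weights are estimated while the reference means stay fixed, very little adaptation data is needed in a sketch like this, which is in the spirit of the rapid-adaptation setting the abstract describes.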
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Seong-Jun HAHM, Yuichi OHKAWA, Masashi ITO, Motoyuki SUZUKI, Akinori ITO, Shozo MAKINO, "Improved Reference Speaker Weighting Using Aspect Model" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 7, pp. 1927-1935, July 2010, doi: 10.1587/transinf.E93.D.1927.
Abstract: We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The concept of the approach is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The aspect model has specific latent-space characteristics that differ from orthogonal basis vectors of eigenvoice. The aspect model is a "mixture-of-mixture" model. We first calculate a small number of latent reference models as mixtures of distributions of the reference speaker's models, and then the latent reference models are mixed to obtain the adapted distribution. The mixture weights are calculated based on the expectation maximization (EM) algorithm. We use the obtained mixture weights for interpolating mean parameters of the distributions. Both training and adaptation are performed based on likelihood maximization with respect to the training and adaptation data, respectively. We conduct a continuous speech recognition experiment using a Korean database (KAIST-TRADE). The results are compared to those of a conventional MAP, MLLR, RSW, eigenvoice and SCW. Absolute word accuracy improvement of 2.06 point was achieved using the proposed method, even though we use only 0.3 s of adaptation data.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.1927/_p
@ARTICLE{e93-d_7_1927,
author={Seong-Jun HAHM and Yuichi OHKAWA and Masashi ITO and Motoyuki SUZUKI and Akinori ITO and Shozo MAKINO},
journal={IEICE TRANSACTIONS on Information},
title={Improved Reference Speaker Weighting Using Aspect Model},
year={2010},
volume={E93-D},
number={7},
pages={1927--1935},
abstract={We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The concept of the approach is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The aspect model has specific latent-space characteristics that differ from orthogonal basis vectors of eigenvoice. The aspect model is a "mixture-of-mixture" model. We first calculate a small number of latent reference models as mixtures of distributions of the reference speaker's models, and then the latent reference models are mixed to obtain the adapted distribution. The mixture weights are calculated based on the expectation maximization (EM) algorithm. We use the obtained mixture weights for interpolating mean parameters of the distributions. Both training and adaptation are performed based on likelihood maximization with respect to the training and adaptation data, respectively. We conduct a continuous speech recognition experiment using a Korean database (KAIST-TRADE). The results are compared to those of a conventional MAP, MLLR, RSW, eigenvoice and SCW. Absolute word accuracy improvement of 2.06 point was achieved using the proposed method, even though we use only 0.3 s of adaptation data.},
keywords={},
doi={10.1587/transinf.E93.D.1927},
ISSN={1745-1361},
month={July},
}
TY - JOUR
TI - Improved Reference Speaker Weighting Using Aspect Model
T2 - IEICE TRANSACTIONS on Information
SP - 1927
EP - 1935
AU - Seong-Jun HAHM
AU - Yuichi OHKAWA
AU - Masashi ITO
AU - Motoyuki SUZUKI
AU - Akinori ITO
AU - Shozo MAKINO
PY - 2010
DO - 10.1587/transinf.E93.D.1927
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E93-D
IS - 7
JA - IEICE TRANSACTIONS on Information
Y1 - July 2010
AB - We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The concept of the approach is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The aspect model has specific latent-space characteristics that differ from orthogonal basis vectors of eigenvoice. The aspect model is a "mixture-of-mixture" model. We first calculate a small number of latent reference models as mixtures of distributions of the reference speaker's models, and then the latent reference models are mixed to obtain the adapted distribution. The mixture weights are calculated based on the expectation maximization (EM) algorithm. We use the obtained mixture weights for interpolating mean parameters of the distributions. Both training and adaptation are performed based on likelihood maximization with respect to the training and adaptation data, respectively. We conduct a continuous speech recognition experiment using a Korean database (KAIST-TRADE). The results are compared to those of a conventional MAP, MLLR, RSW, eigenvoice and SCW. Absolute word accuracy improvement of 2.06 point was achieved using the proposed method, even though we use only 0.3 s of adaptation data.
ER -