The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
This paper describes experimental results on whole word HMM-based speech recognition of connected digits in Japanese with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown length strings was 1.64% obtained using context dependent HMMs. The word error rate was measured for various subsets of the training data reduced both in the number of speakers ($s$) and the number of utterances per speaker ($u$). As a result, an empirical formula of $s\,[\{\min(0.62\,s^{0.75}, u)\}^{0.74} + \{\max(0, u - 0.62\,s^{0.75})\}^{0.27}] = D(E_w)$ was developed, where $E_w$ and $D(E_w)$ designate word error rate and effective data size, respectively. Analyses were conducted on several aspects of the low performance speakers accounting for the major part of recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low performance speakers are improved to the normal level by speaker clustering centered around each low performance speaker.
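As a rough illustration of the empirical formula in the abstract, the following Python sketch evaluates the effective data size D for a given number of speakers s and utterances per speaker u. It assumes the flattened numerals in the formula are exponents, that is, D = s[{min(0.62 s^0.75, u)}^0.74 + {max(0, u - 0.62 s^0.75)}^0.27]. The function name `effective_data_size` and the variable `knee` are hypothetical and do not come from the paper. The abstract does not give the mapping from D to the word error rate Ew, so the sketch computes D only.

```python
def effective_data_size(s: float, u: float) -> float:
    """Effective data size D for s speakers and u utterances per speaker.

    Assumed reading of the formula in the abstract:
        D = s * [min(0.62 * s**0.75, u)**0.74 + max(0, u - 0.62 * s**0.75)**0.27]
    """
    if s <= 0 or u < 0:
        raise ValueError("s must be positive and u must be non-negative")
    # Threshold in utterances per speaker; beyond it, additional utterances
    # contribute with the smaller exponent 0.27 instead of 0.74.
    knee = 0.62 * s ** 0.75
    return s * (min(knee, u) ** 0.74 + max(0.0, u - knee) ** 0.27)


if __name__ == "__main__":
    # Illustrative values only; these are not results reported in the paper.
    for s, u in [(500, 10), (500, 100), (2000, 10), (2000, 100)]:
        print(f"s={s:5d}  u={u:4d}  D={effective_data_size(s, u):.1f}")
```

Under this reading, utterances from a speaker beyond roughly 0.62 s^0.75 add to D with a much smaller exponent, so for a fixed total amount of data, adding speakers tends to raise D more than adding utterances from speakers already in the set.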
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hisashi KAWAI, Tohru SHIMIZU, Norio HIGUCHI, "Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network" in IEICE TRANSACTIONS on Information,
vol. E84-D, no. 3, pp. 374-383, March 2001.
Abstract: This paper describes experimental results on whole word HMM-based speech recognition of connected digits in Japanese with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown length strings was 1.64% obtained using context dependent HMMs. The word error rate was measured for various subsets of the training data reduced both in the number of speakers ($s$) and the number of utterances per speaker ($u$). As a result, an empirical formula of $s\,[\{\min(0.62\,s^{0.75}, u)\}^{0.74} + \{\max(0, u - 0.62\,s^{0.75})\}^{0.27}] = D(E_w)$ was developed, where $E_w$ and $D(E_w)$ designate word error rate and effective data size, respectively. Analyses were conducted on several aspects of the low performance speakers accounting for the major part of recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low performance speakers are improved to the normal level by speaker clustering centered around each low performance speaker.
URL: https://global.ieice.org/en_transactions/information/10.1587/e84-d_3_374/_p
@ARTICLE{e84-d_3_374,
author={Hisashi KAWAI and Tohru SHIMIZU and Norio HIGUCHI},
journal={IEICE TRANSACTIONS on Information},
title={Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network},
year={2001},
volume={E84-D},
number={3},
pages={374-383},
abstract={This paper describes experimental results on whole word HMM-based speech recognition of connected digits in Japanese with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown length strings was 1.64\% obtained using context dependent HMMs. The word error rate was measured for various subsets of the training data reduced both in the number of speakers ($s$) and the number of utterances per speaker ($u$). As a result, an empirical formula of $s\,[\{\min(0.62\,s^{0.75}, u)\}^{0.74} + \{\max(0, u - 0.62\,s^{0.75})\}^{0.27}] = D(E_w)$ was developed, where $E_w$ and $D(E_w)$ designate word error rate and effective data size, respectively. Analyses were conducted on several aspects of the low performance speakers accounting for the major part of recognition errors. Attempts were also made to improve their recognition performance. It was found that 33\% of the low performance speakers are improved to the normal level by speaker clustering centered around each low performance speaker.},
month={March}
}
TY - JOUR
TI - Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network
T2 - IEICE TRANSACTIONS on Information
SP - 374
EP - 383
AU - Hisashi KAWAI
AU - Tohru SHIMIZU
AU - Norio HIGUCHI
PY - 2001
JO - IEICE TRANSACTIONS on Information
VL - E84-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2001
AB - This paper describes experimental results on whole word HMM-based speech recognition of connected digits in Japanese with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown length strings was 1.64% obtained using context dependent HMMs. The word error rate was measured for various subsets of the training data reduced both in the number of speakers (s) and the number of utterances per speaker (u). As a result, an empirical formula of s[{min(0.62 s^0.75, u)}^0.74 + {max(0, u - 0.62 s^0.75)}^0.27] = D(Ew) was developed, where Ew and D(Ew) designate word error rate and effective data size, respectively. Analyses were conducted on several aspects of the low performance speakers accounting for the major part of recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low performance speakers are improved to the normal level by speaker clustering centered around each low performance speaker.
ER -