The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Deep neural networks (DNNs) have achieved significant success in automatic speech recognition. A main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation with limited available data remains a major challenge for DNN-based systems because of their enormous number of free parameters. In this paper, we propose a filterbank-incorporated DNN, which combines a filterbank layer that represents the filter shape and center frequency with a DNN-based acoustic model. The filterbank layer and the subsequent layers of the proposed model are trained jointly, exploiting the advantages of hierarchical feature extraction, whereas most systems use pre-defined mel-scale filterbank features as input acoustic features to DNNs. The filters in the filterbank layer are parameterized to represent speaker characteristics while keeping the number of parameters small. Optimizing one type of parameter corresponds to Vocal Tract Length Normalization (VTLN), and optimizing another type corresponds to feature-space Maximum Likelihood Linear Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Because the filterbank layer consists of only a few parameters, it is advantageous for adaptation with limited available data. In experiments, filterbank-incorporated DNNs proved effective for speaker and gender adaptation with limited adaptation data. Experimental results on the CSJ task show that adapting the proposed model with 10 utterances achieved a word error reduction ratio of 5.8% relative to the unadapted model.
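As a rough illustration of why a parameterized filterbank layer is compact, the sketch below assumes Gaussian-shaped filters, each described by a center frequency, a bandwidth, and a gain. The abstract does not specify the filter shape, so the Gaussian form, the mel-spaced initialization, the fixed 100 Hz bandwidth, and every function name here are assumptions made for illustration, not the authors' implementation. The hypothetical `warp_centers` helper shows how scaling all center frequencies by a single factor resembles a VTLN-style linear frequency warp.

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def init_centers(num_filters, f_min, f_max):
    """Mel-spaced initial center frequencies in Hz, excluding the band edges."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (num_filters + 1)
    return [mel_to_hz(m_lo + step * (k + 1)) for k in range(num_filters)]

def gaussian_filterbank(power_spectrum, bin_freqs, centers, bandwidths, gains):
    """Log filterbank outputs; each filter has one (center, bandwidth, gain) triple."""
    feats = []
    for c, b, g in zip(centers, bandwidths, gains):
        # Weight every spectral bin by a Gaussian centered at c with width b.
        energy = sum(g * math.exp(-0.5 * ((f - c) / b) ** 2) * p
                     for f, p in zip(bin_freqs, power_spectrum))
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats

def warp_centers(centers, alpha):
    """Scale all center frequencies by one factor (VTLN-like linear warp)."""
    return [alpha * c for c in centers]

# Toy setup (assumed values): 24 filters over 0-8 kHz, 257 spectral bins.
K = 24
bin_freqs = [i * 8000.0 / 256 for i in range(257)]
centers = init_centers(K, 0.0, 8000.0)
bandwidths = [100.0] * K   # assumed fixed width in Hz
gains = [1.0] * K

spectrum = [0.0] * 257
spectrum[32] = 1.0         # single spectral peak at 1000 Hz
feats = gaussian_filterbank(spectrum, bin_freqs, centers, bandwidths, gains)

# Three values per filter: 72 adaptable parameters for 24 filters.
print(len(centers) + len(bandwidths) + len(gains))
```

This sketch covers only the forward pass. Training the filter parameters jointly with the acoustic model, as the abstract describes, would additionally require gradients of the network loss with respect to `centers`, `bandwidths`, and `gains`, for example through an automatic differentiation framework.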
Hiroshi SEKI
Toyohashi University of Technology
Kazumasa YAMAMOTO
Chubu University
Tomoyosi AKIBA
Toyohashi University of Technology
Seiichi NAKAGAWA
Toyohashi University of Technology,Chubu University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hiroshi SEKI, Kazumasa YAMAMOTO, Tomoyosi AKIBA, Seiichi NAKAGAWA, "Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation" in IEICE TRANSACTIONS on Information, vol. E102-D, no. 2, pp. 364-374, February 2019, doi: 10.1587/transinf.2018EDP7252.
Abstract: Deep neural networks (DNNs) have achieved significant success in the field of automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation under limited available data remains a major challenge for DNN-based systems because of their enormous free parameters. In this paper, we propose a filterbank-incorporated DNN that incorporates a filterbank layer that presents the filter shape/center frequency and a DNN-based acoustic model. The filterbank layer and the following networks of the proposed model are trained jointly by exploiting the advantages of the hierarchical feature extraction, while most systems use pre-defined mel-scale filterbank features as input acoustic features to DNNs. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing a number of parameters. The optimization of one type of parameters corresponds to the Vocal Tract Length Normalization (VTLN), and another type corresponds to feature-space Maximum Linear Likelihood Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Since the filterbank layer consists of just a few parameters, it is advantageous in adaptation under limited available data. In the experiment, filterbank-incorporated DNNs showed effectiveness in speaker/gender adaptations under limited adaptation data. Experimental results on CSJ task demonstrate that the adaptation of proposed model showed 5.8% word error reduction ratio with 10 utterances against the un-adapted model.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2018EDP7252/_p
@ARTICLE{e102-d_2_364,
author={Hiroshi SEKI and Kazumasa YAMAMOTO and Tomoyosi AKIBA and Seiichi NAKAGAWA},
journal={IEICE TRANSACTIONS on Information},
title={Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation},
year={2019},
volume={E102-D},
number={2},
pages={364-374},
abstract={Deep neural networks (DNNs) have achieved significant success in the field of automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation under limited available data remains a major challenge for DNN-based systems because of their enormous free parameters. In this paper, we propose a filterbank-incorporated DNN that incorporates a filterbank layer that presents the filter shape/center frequency and a DNN-based acoustic model. The filterbank layer and the following networks of the proposed model are trained jointly by exploiting the advantages of the hierarchical feature extraction, while most systems use pre-defined mel-scale filterbank features as input acoustic features to DNNs. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing a number of parameters. The optimization of one type of parameters corresponds to the Vocal Tract Length Normalization (VTLN), and another type corresponds to feature-space Maximum Linear Likelihood Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Since the filterbank layer consists of just a few parameters, it is advantageous in adaptation under limited available data. In the experiment, filterbank-incorporated DNNs showed effectiveness in speaker/gender adaptations under limited adaptation data. Experimental results on CSJ task demonstrate that the adaptation of proposed model showed 5.8% word error reduction ratio with 10 utterances against the un-adapted model.},
keywords={},
doi={10.1587/transinf.2018EDP7252},
ISSN={1745-1361},
month={February},
}
TY - JOUR
TI - Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation
T2 - IEICE TRANSACTIONS on Information
SP - 364
EP - 374
AU - Hiroshi SEKI
AU - Kazumasa YAMAMOTO
AU - Tomoyosi AKIBA
AU - Seiichi NAKAGAWA
PY - 2019
DO - 10.1587/transinf.2018EDP7252
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E102-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2019
AB - Deep neural networks (DNNs) have achieved significant success in the field of automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation under limited available data remains a major challenge for DNN-based systems because of their enormous free parameters. In this paper, we propose a filterbank-incorporated DNN that incorporates a filterbank layer that presents the filter shape/center frequency and a DNN-based acoustic model. The filterbank layer and the following networks of the proposed model are trained jointly by exploiting the advantages of the hierarchical feature extraction, while most systems use pre-defined mel-scale filterbank features as input acoustic features to DNNs. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing a number of parameters. The optimization of one type of parameters corresponds to the Vocal Tract Length Normalization (VTLN), and another type corresponds to feature-space Maximum Linear Likelihood Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Since the filterbank layer consists of just a few parameters, it is advantageous in adaptation under limited available data. In the experiment, filterbank-incorporated DNNs showed effectiveness in speaker/gender adaptations under limited adaptation data. Experimental results on CSJ task demonstrate that the adaptation of proposed model showed 5.8% word error reduction ratio with 10 utterances against the un-adapted model.
ER -