The neural network has been one of the most useful techniques in the areas of speech recognition, language translation, and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while existing hardware implementations lack configurability. To bridge this gap, a highly configurable 7.62 GOP/s hardware implementation of LSTM is proposed in this paper. To achieve this goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of the structure; and the data type, logistic sigmoid (σ) function, and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s at 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU at 2.10 GHz, this work achieves about a 90× speed-up for small networks and a 25× speed-up for large ones. The resource consumption is also much lower than that of state-of-the-art works.
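For reference, these are the standard LSTM cell equations that any such accelerator must evaluate at every time step; the σ and tanh functions mentioned in the abstract appear in the gate and state computations (the notation follows the common formulation, not necessarily the paper's):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden output)}
\end{aligned}
$$

The four matrix-vector products dominate the operation count, which is why throughput is quoted in GOP/s. As a sanity check on the headline figure: 7.62 GOP/s at 238 MHz works out to 7.62×10⁹ / 238×10⁶ ≈ 32 operations per cycle, i.e., the equivalent of 16 multiply-accumulate units with each MAC counted as two operations.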
Yibo FAN
Fudan University
Leilei HUANG
Fudan University
Kewei CHEN
Fudan University
Xiaoyang ZENG
Fudan University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yibo FAN, Leilei HUANG, Kewei CHEN, Xiaoyang ZENG, "A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM" in IEICE TRANSACTIONS on Electronics,
vol. E103-C, no. 5, pp. 263-273, May 2020, doi: 10.1587/transele.2019ECP5008.
Abstract: The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.
URL: https://global.ieice.org/en_transactions/electronics/10.1587/transele.2019ECP5008/_p
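The abstract states that the data type and the σ/tanh functions were optimized to balance hardware cost and accuracy, but the approximation scheme itself is not given on this page. As a minimal sketch of the kind of trade-off involved, the C fragment below implements the well-known "hard sigmoid" piecewise-linear approximation in Q4.12 fixed point and derives tanh from it; the function names, the Q-format, and the three-segment scheme are illustrative assumptions, not the paper's design.

#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define Q    12          /* Q4.12: 12 fractional bits, range [-8, 8) */
#define ONE  (1 << Q)    /* fixed-point 1.0 */
#define HALF (ONE / 2)   /* fixed-point 0.5 */

/* "Hard sigmoid": clamp(0.25*x + 0.5, 0, 1). The 0.25 slope is an
 * arithmetic shift in hardware, so the whole unit is one shift, one
 * add and two comparisons -- no multiplier and no ROM table.
 * (Illustrative only; not the approximation used in the paper.) */
static int16_t hard_sigmoid_q12(int16_t x)
{
    int32_t y = (int32_t)x / 4 + HALF;  /* slope 0.25 */
    if (y < 0)   y = 0;                 /* saturate: sigma(-inf) = 0 */
    if (y > ONE) y = ONE;               /* saturate: sigma(+inf) = 1 */
    return (int16_t)y;
}

/* tanh(x) = 2*sigmoid(2*x) - 1, so one unit can serve both functions. */
static int16_t hard_tanh_q12(int16_t x)
{
    int32_t x2 = 2 * (int32_t)x;        /* may exceed int16 range */
    if (x2 > INT16_MAX) x2 = INT16_MAX; /* saturate before reuse */
    if (x2 < INT16_MIN) x2 = INT16_MIN;
    return (int16_t)(2 * hard_sigmoid_q12((int16_t)x2) - ONE);
}

int main(void)
{
    /* Sweep the whole input range and report the worst-case error
     * of the approximation against the exact sigmoid. */
    double worst = 0.0;
    for (int i = INT16_MIN; i <= INT16_MAX; i++) {
        double ref = 1.0 / (1.0 + exp(-i / (double)ONE));
        double got = hard_sigmoid_q12((int16_t)i) / (double)ONE;
        double err = fabs(ref - got);
        if (err > worst) worst = err;
    }
    printf("hard sigmoid worst-case error: %.4f\n", worst);    /* ~0.12 */
    printf("hard tanh(1.0) = %.4f (exact %.4f)\n",
           hard_tanh_q12(ONE) / (double)ONE, tanh(1.0));
    return 0;
}

The roughly 0.12 worst-case error of this crude three-segment version illustrates why a design aiming at accuracy, such as the one described above, would spend more hardware on the approximation (more segments, a small lookup table, or range-dependent slopes) rather than use the cheapest possible unit.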
@ARTICLE{e103-c_5_263,
author={Yibo FAN and Leilei HUANG and Kewei CHEN and Xiaoyang ZENG},
journal={IEICE TRANSACTIONS on Electronics},
title={A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM},
year={2020},
volume={E103-C},
number={5},
pages={263-273},
abstract={The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.},
keywords={},
doi={10.1587/transele.2019ECP5008},
ISSN={1745-1353},
month={May},}
TY - JOUR
TI - A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM
T2 - IEICE TRANSACTIONS on Electronics
SP - 263
EP - 273
AU - Yibo FAN
AU - Leilei HUANG
AU - Kewei CHEN
AU - Xiaoyang ZENG
PY - 2020
DO - 10.1587/transele.2019ECP5008
JO - IEICE TRANSACTIONS on Electronics
SN - 1745-1353
VL - E103-C
IS - 5
JA - IEICE TRANSACTIONS on Electronics
Y1 - May 2020
AB - The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.
ER -