The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations; for example, some numerals may appear as "XNUMX".
We propose "Temporal Ensemble SSDLite," a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving, and many other promising fields. Many of these applications require high accuracy and speed to be viable, yet they run in compute- and energy-restricted environments. New methods that improve the overall performance of video object detection, i.e., both accuracy and speed, are therefore needed. To increase accuracy we use ensembling, the machine learning technique of combining the predictions of multiple different models. The drawback of an ensemble is its computational cost, which grows in proportion to the number of models used. We overcome this drawback by deploying the ensemble temporally: we run inference with only a single model on each frame, cycling through the models of the ensemble from frame to frame. We then combine the predictions for the last N frames, where N is the number of models in the ensemble, through non-maximum suppression. This is possible because nearby frames in a video are extremely similar due to temporal correlation. As a result, we gain the accuracy of the ensemble while running only a single model per frame, and therefore keep the detection speed. To evaluate the proposal, we measure accuracy, detection speed, and energy consumption on the Google Edge TPU, a machine learning inference accelerator, using the ImageNet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181 mJ per image.
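The temporal ensemble described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract only states that one model runs per frame in a cycle and that the predictions of the last N frames are merged with non-maximum suppression. The detection tuple layout, the per-class greedy NMS, the IoU threshold of 0.5, and the names `TemporalEnsemble`, `nms`, and `iou` are assumptions made here for illustration, and the detectors are stand-in callables rather than SSDLite models.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

# Assumed layout for one detection: (x1, y1, x2, y2, score, class_id).
Detection = Tuple[float, float, float, float, float, int]
# A detector is any callable mapping a frame to a list of detections.
Detector = Callable[[object], List[Detection]]


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(dets: Sequence[Detection], iou_thresh: float = 0.5) -> List[Detection]:
    """Greedy per-class NMS (an assumed variant): visit boxes by descending
    score and keep a box unless a kept box of the same class overlaps it
    by more than iou_thresh."""
    kept: List[Detection] = []
    for d in sorted(dets, key=lambda det: det[4], reverse=True):
        if all(d[5] != k[5] or iou(d, k) <= iou_thresh for k in kept):
            kept.append(d)
    return kept


class TemporalEnsemble:
    """Runs one model per frame, cycling through the ensemble, and merges
    the detections of the last N frames (N = number of models) with NMS."""

    def __init__(self, models: Sequence[Detector], iou_thresh: float = 0.5):
        self.models = list(models)
        self.iou_thresh = iou_thresh
        # Keeps only the detections of the most recent N frames.
        self.history: Deque[List[Detection]] = deque(maxlen=len(self.models))
        self.frame_idx = 0

    def step(self, frame: object) -> List[Detection]:
        # Only a single model is evaluated on this frame.
        model = self.models[self.frame_idx % len(self.models)]
        self.frame_idx += 1
        self.history.append(model(frame))
        # Pool the detections of the last N frames and suppress duplicates.
        merged = [d for dets in self.history for d in dets]
        return nms(merged, self.iou_thresh)


# Usage with two hypothetical stub detectors standing in for real models.
stub_a: Detector = lambda frame: [(10.0, 10.0, 50.0, 50.0, 0.6, 0)]
stub_b: Detector = lambda frame: [(12.0, 11.0, 52.0, 51.0, 0.9, 0)]
ensemble = TemporalEnsemble([stub_a, stub_b])
for frame in ["frame0", "frame1", "frame2"]:
    detections = ensemble.step(frame)
```

Per-frame cost stays at one model evaluation plus a cheap NMS pass, whereas a conventional ensemble would evaluate all N models on every frame. The sketch pools boxes from earlier frames unchanged, which relies on the abstract's observation that nearby frames are nearly identical.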
Lukas NAKAMURA
Osaka University
Hiromitsu AWANO
Kyoto University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Lukas NAKAMURA, Hiromitsu AWANO, "Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection" in IEICE TRANSACTIONS on Fundamentals, vol. E105-A, no. 7, pp. 1082-1090, July 2022, doi: 10.1587/transfun.2021EAP1068.
Abstract: We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.2021EAP1068/_p
@ARTICLE{e105-a_7_1082,
author={Lukas NAKAMURA and Hiromitsu AWANO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection},
year={2022},
volume={E105-A},
number={7},
pages={1082-1090},
abstract={We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.},
keywords={},
doi={10.1587/transfun.2021EAP1068},
ISSN={1745-1337},
month={July},
}
TY - JOUR
TI - Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1082
EP - 1090
AU - Lukas NAKAMURA
AU - Hiromitsu AWANO
PY - 2022
DO - 10.1587/transfun.2021EAP1068
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E105-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 2022
AB - We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
ER -