Advances in multicore technology have made it possible to put hundreds or even thousands of processor cores on a single chip. At that scale, however, a hardware-based cache coherence mechanism becomes extremely complicated, hot, and expensive. We therefore propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems that have no hardware cache coherence mechanism. The proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then resolves those problems through simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The RP2's cache coherence hardware supports only up to 4 cores, and it can also be turned off to run in non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism at all, so running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while producing the same correct results as the hardware coherence mechanism. This method makes it easy to program both a massive array of shared-memory CPU cores in an HPC setting and a simple non-coherent embedded multicore CPU.
For example, on the RP2 processor, the proposed software-controlled non-coherent cache (NCC) method achieved a 2.6 times speedup over sequential execution for SPEC2000 "equake" with 4 cores, whereas 4-core MESI hardware coherence control achieved only 2.5 times. In addition, software coherence control achieved a 4.4 times speedup with 8 cores, where no hardware coherence mechanism is available.
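The stale-data problem mentioned in the abstract can be illustrated with a small toy model. The sketch below is not the OSCAR compiler's implementation or API: the `Core` class, the `writeback` and `self_invalidate` operation names, and the `run` helper are all hypothetical, chosen only to show why a core with a private non-coherent cache may read an outdated value, and how a write-back on the producer side followed by an invalidation on the consumer side restores the correct result.

```python
"""Toy model of software-managed coherence on non-coherent private caches.

All names here are illustrative assumptions, not part of any real toolchain.
"""


class Core:
    """A core with a private write-back cache in front of a shared memory."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory: address -> value
        self.cache = {}       # private cache: address -> value
        self.dirty = set()    # addresses modified locally, not yet written back

    def load(self, addr):
        # On a miss, fetch from memory; on a hit, return the cached copy,
        # which may be stale because no hardware keeps the caches coherent.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        # Write-back policy: only the private cache is updated.
        self.cache[addr] = value
        self.dirty.add(addr)

    def writeback(self, addr):
        # Producer side: push a locally modified value out to shared memory.
        if addr in self.dirty:
            self.memory[addr] = self.cache[addr]
            self.dirty.discard(addr)

    def self_invalidate(self, addr):
        # Consumer side: drop the private copy so the next load refetches.
        # Any unwritten local change to this address is discarded.
        self.cache.pop(addr, None)
        self.dirty.discard(addr)


def run(with_software_coherence):
    """Return the value the consumer core reads after the producer's update."""
    memory = {"x": 0}
    producer, consumer = Core(memory), Core(memory)

    consumer.load("x")        # consumer caches x == 0
    producer.store("x", 42)   # producer updates x in its own cache only

    if with_software_coherence:
        producer.writeback("x")        # placed after the producing task
        # A real system would also need a synchronization point here.
        consumer.self_invalidate("x")  # placed before the consuming task

    return consumer.load("x")


if __name__ == "__main__":
    print(run(False))  # prints 0: the consumer reads its stale copy
    print(run(True))   # prints 42: the consumer refetches the new value
```

In this model the programmer places the two operations by hand. The abstract's point is that a compiler can find the places where such operations are needed from its analysis of the parallelized program, so the programmer does not have to.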
Boma A. ADHI
Waseda University
Tomoya KASHIMATA
Waseda University
Ken TAKAHASHI
Waseda University
Keiji KIMURA
Waseda University
Hironori KASAHARA
Waseda University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Boma A. ADHI, Tomoya KASHIMATA, Ken TAKAHASHI, Keiji KIMURA, Hironori KASAHARA, "Compiler Software Coherent Control for Embedded High Performance Multicore" in IEICE TRANSACTIONS on Electronics,
vol. E103-C, no. 3, pp. 85-97, March 2020, doi: 10.1587/transele.2019LHP0008.
Abstract: The advancement of multicore technology has made hundreds or even thousands of cores processor on a single chip possible. However, on a larger scale multicore, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient. It is built into OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes the coarse grain task, analyzes stale data and line sharing in the program, then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB), and MediaBench II. The compiled binaries then are run on Renesas RP2, an 8 cores SH-4A processor, and a custom 8-core Altera Nios II system on Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores. The RP2's cache coherence hardware can also be turned off for non-coherence cache mode. The Nios II multicore system does not have any hardware cache coherence mechanism; therefore, running a parallel program is difficult without any compiler support. The proposed method performed as good as or better than the hardware cache coherence scheme while still provided the correct result as the hardware coherence mechanism. This method allows a massive array of shared memory CPU cores in an HPC setting or a simple non-coherent multicore embedded CPU to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave us 2.6 times speedup for SPEC 2000 “equake” with 4 cores against sequential execution while got only 2.5 times speedup for 4 cores MESI hardware coherent control. 
Also, the software coherence control gave us 4.4 times speedup for 8 cores with no hardware coherence mechanism available.
URL: https://global.ieice.org/en_transactions/electronics/10.1587/transele.2019LHP0008/_p
@ARTICLE{e103-c_3_85,
author={Boma A. ADHI and Tomoya KASHIMATA and Ken TAKAHASHI and Keiji KIMURA and Hironori KASAHARA},
journal={IEICE TRANSACTIONS on Electronics},
title={Compiler Software Coherent Control for Embedded High Performance Multicore},
year={2020},
volume={E103-C},
number={3},
pages={85-97},
keywords={},
doi={10.1587/transele.2019LHP0008},
ISSN={1745-1353},
month={March},}
TY - JOUR
TI - Compiler Software Coherent Control for Embedded High Performance Multicore
T2 - IEICE TRANSACTIONS on Electronics
SP - 85
EP - 97
AU - Boma A. ADHI
AU - Tomoya KASHIMATA
AU - Ken TAKAHASHI
AU - Keiji KIMURA
AU - Hironori KASAHARA
PY - 2020
DO - 10.1587/transele.2019LHP0008
JO - IEICE TRANSACTIONS on Electronics
SN - 1745-1353
VL - E103-C
IS - 3
JA - IEICE TRANSACTIONS on Electronics
Y1 - March 2020
ER -