High-performance SIMD implementation of the lattice-Boltzmann method on the Xeon Phi processor

Fredrik Robertsén; Keijo Mattila; Jan Westerholm

doi:10.1002/cpe.5072

Information Technology

Research output: Contribution to journal › Article › Scientific › Peer review

6 Citations (Scopus)

86 Downloads (Pure)

Abstract

We present a high‐performance implementation of the lattice‐Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi. The Knights Landing architecture includes 16 GB of high‐speed memory (MCDRAM) with a reported bandwidth of over 400 GB/s, and a subset of the AVX‐512 single instruction multiple data (SIMD) instruction set. We explain five critical implementation aspects for high performance on this architecture: (1) the choice of an appropriate LBM algorithm, (2) a suitable data layout, (3) vectorization of the computation, (4) data prefetching, and (5) running our LBM simulations exclusively from the MCDRAM. The effects of these implementation aspects on the computational performance are demonstrated with the lattice‐Boltzmann scheme involving the D3Q19 discrete velocity set and the TRT collision operator. In our benchmark simulations of fluid flow through porous media, using double‐precision floating‐point arithmetic, the observed performance exceeds 960 million fluid lattice site updates per second.
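The abstract names the D3Q19 velocity set, the TRT collision operator and a "suitable data layout" without spelling them out. As a rough illustration only, the Python sketch below applies a TRT collision to a structure-of-arrays (SoA) buffer in which population `i` of site `s` is stored at `f[i * n + s]`. The SoA indexing, the second-order equilibrium, the magic parameter Λ = 3/16 used to derive the antisymmetric relaxation rate, and every function name here are our assumptions, not the authors' code; the sketch ignores streaming, boundaries and performance entirely.

```python
"""Minimal D3Q19 / TRT collision sketch on a structure-of-arrays buffer.

Assumed (hypothetical) layout: population i of site s is stored at
f[i * n + s], so one population of consecutive sites is contiguous.
"""

# D3Q19 velocity set: rest, 6 face neighbors, 12 edge neighbors.
C = [(0, 0, 0),
     (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
     (1, 1, 0), (-1, -1, 0), (1, -1, 0), (-1, 1, 0),
     (1, 0, 1), (-1, 0, -1), (1, 0, -1), (-1, 0, 1),
     (0, 1, 1), (0, -1, -1), (0, 1, -1), (0, -1, 1)]
Q = len(C)
OPP = [C.index((-cx, -cy, -cz)) for cx, cy, cz in C]  # index of opposite direction
W = [1 / 3] + [1 / 18] * 6 + [1 / 36] * 12             # lattice weights


def equilibrium(rho, ux, uy, uz):
    """Second-order equilibrium in lattice units (assumed form, cs^2 = 1/3)."""
    usq = 1.5 * (ux * ux + uy * uy + uz * uz)
    feq = []
    for w, (cx, cy, cz) in zip(W, C):
        cu = cx * ux + cy * uy + cz * uz
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - usq))
    return feq


def moments(f, n, s):
    """Density and momentum (rho, jx, jy, jz) of site s."""
    rho = jx = jy = jz = 0.0
    for i, (cx, cy, cz) in enumerate(C):
        fi = f[i * n + s]
        rho += fi
        jx += cx * fi
        jy += cy * fi
        jz += cz * fi
    return rho, jx, jy, jz


def omega_minus_from_magic(omega_plus, magic=3 / 16):
    """Antisymmetric rate from Lambda = (tau+ - 1/2)(tau- - 1/2); 3/16 is an assumed choice."""
    tau_plus = 1.0 / omega_plus
    return 1.0 / (0.5 + magic / (tau_plus - 0.5))


def trt_collide(f, n, omega_plus, omega_minus):
    """In-place TRT collision for all n sites of the SoA buffer f."""
    for s in range(n):
        rho, jx, jy, jz = moments(f, n, s)
        ux, uy, uz = jx / rho, jy / rho, jz / rho
        usq = 1.5 * (ux * ux + uy * uy + uz * uz)
        # The rest population has no antisymmetric part.
        f[s] -= omega_plus * (f[s] - W[0] * rho * (1.0 - usq))
        # Moving populations are relaxed as opposite pairs (i, OPP[i]).
        for i in range(1, Q, 2):
            a, b = i * n + s, OPP[i] * n + s
            cx, cy, cz = C[i]
            cu = cx * ux + cy * uy + cz * uz
            eq_plus = W[i] * rho * (1.0 + 4.5 * cu * cu - usq)  # symmetric part
            eq_minus = W[i] * rho * 3.0 * cu                     # antisymmetric part
            d_plus = omega_plus * (0.5 * (f[a] + f[b]) - eq_plus)
            d_minus = omega_minus * (0.5 * (f[a] - f[b]) - eq_minus)
            f[a] -= d_plus + d_minus
            f[b] -= d_plus - d_minus


def implied_bandwidth_gbs(flups, q=Q, bytes_per_value=8):
    """Memory traffic (GB/s) if every update reads and writes each population once."""
    return flups * 2 * q * bytes_per_value / 1e9


if __name__ == "__main__":
    n = 4  # a tiny batch of lattice sites
    f = [0.0] * (Q * n)
    for s in range(n):
        feq = equilibrium(1.0, 0.01 * s, 0.0, -0.02)
        for i in range(Q):
            # Perturb the equilibrium slightly so the collision has work to do.
            f[i * n + s] = feq[i] * (1.0 + 0.01 * ((i + s) % 3 - 1))
    before = [moments(f, n, s) for s in range(n)]
    trt_collide(f, n, 1.25, omega_minus_from_magic(1.25))
    after = [moments(f, n, s) for s in range(n)]
    print("max moment drift:",
          max(abs(x - y) for b, a in zip(before, after) for x, y in zip(b, a)))
    print("implied traffic at 960 MFLUPS (GB/s):", implied_bandwidth_gbs(960e6))
```

In a compiled implementation, the SoA layout places population `i` of eight consecutive sites in 64 contiguous bytes, which is exactly one 512-bit AVX-512 register of doubles, so the per-site arithmetic above can be applied to eight sites per instruction. The `implied_bandwidth_gbs` helper is our back-of-envelope estimate, not a figure from the paper: if each update reads and writes all 19 double-precision populations exactly once, 960 million updates per second correspond to roughly 292 GB/s of memory traffic, which is consistent with the abstract's emphasis on running from MCDRAM.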

Original language	Undefined/Unknown
Pages (from-to)	–
Journal	Concurrency and Computation: Practice and Experience
Volume	31
Issue number	13
DOI	https://doi.org/10.1002/cpe.5072
Publication status	Published - 2019
MoE publication type	A1 Journal article-refereed

Keywords

Xeon Phi

Access to Document

10.1002/cpe.5072

phiArticle.pdf: Accepted author manuscript, 556 KB; License: Publisher rights policy

Cite this

@article{07a1f86d039d4bfc9b4d0079ea736925,

title = "High-performance SIMD implementation of the lattice-Boltzmann method on the Xeon Phi processor",

abstract = "We present a high‐performance implementation of the lattice‐Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi. The Knights Landing architecture includes 16GB of high‐speed memory (MCDRAM) with a reported bandwidth of over 400 GB/s, and a subset of the AVX‐512 single instruction multiple data (SIMD) instruction set. We explain five critical implementation aspects for high performance on this architecture: (1) the choice of appropriate LBM algorithm, (2) suitable data layout, (3) vectorization of the computation, (4) data prefetching, and (5) running our LBM simulations exclusively from the MCDRAM. The effects of these implementation aspects on the computational performance are demonstrated with the lattice‐Boltzmann scheme involving the D3Q19 discrete velocity set and the TRT collision operator. In our benchmark simulations of fluid flow through porous media, using double‐precision floating‐point arithmetic, the observed performance exceeds 960 million fluid lattice site updates per second.",

keywords = "Xeon Phi",

author = "Fredrik Roberts{\'e}n and Keijo Mattila and Jan Westerholm",

year = "2019",

doi = "10.1002/cpe.5072",

language = "Undefined/Unknown",

volume = "31",

journal = "Concurrency and Computation: Practice and Experience",

issn = "1532-0626",

publisher = "John Wiley and Sons",

number = "13",

}

TY  - JOUR
T1  - High-performance SIMD implementation of the lattice-Boltzmann method on the Xeon Phi processor
AU  - Robertsén, Fredrik
AU  - Mattila, Keijo
AU  - Westerholm, Jan
PY  - 2019
Y1  - 2019
AB  - We present a high‐performance implementation of the lattice‐Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi. The Knights Landing architecture includes 16GB of high‐speed memory (MCDRAM) with a reported bandwidth of over 400 GB/s, and a subset of the AVX‐512 single instruction multiple data (SIMD) instruction set. We explain five critical implementation aspects for high performance on this architecture: (1) the choice of appropriate LBM algorithm, (2) suitable data layout, (3) vectorization of the computation, (4) data prefetching, and (5) running our LBM simulations exclusively from the MCDRAM. The effects of these implementation aspects on the computational performance are demonstrated with the lattice‐Boltzmann scheme involving the D3Q19 discrete velocity set and the TRT collision operator. In our benchmark simulations of fluid flow through porous media, using double‐precision floating‐point arithmetic, the observed performance exceeds 960 million fluid lattice site updates per second.
KW  - Xeon Phi
DO  - 10.1002/cpe.5072
M3  - Article
SN  - 1532-0626
VL  - 31
JO  - Concurrency and Computation: Practice and Experience
JF  - Concurrency and Computation: Practice and Experience
IS  - 13
ER  - 