In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using the two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of communication between the CPU and GPU, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
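The idea of adaptively distributing load across CPUs and GPUs can be illustrated with a minimal sketch. The abstract does not spell out the two-level adaptive method, so everything below is an assumption made for illustration: the function names `update_gpu_fraction` and `split_among_cores` are hypothetical, the first level is assumed to rebalance the CPU/GPU split from the timings measured in the previous step, and the second level is assumed to divide the CPU share among cores in proportion to their measured rates.

```python
def update_gpu_fraction(work, gpu_fraction, gpu_time, cpu_time):
    """Level 1 (assumed): pick the GPU share that would have made the CPU
    and GPU finish the previous step at the same time.

    work         -- total amount of work in the previous step (e.g. FLOPs)
    gpu_fraction -- share of that work given to the GPU, strictly in (0, 1)
    gpu_time     -- measured GPU time for its share, > 0
    cpu_time     -- measured CPU time for its share, > 0
    """
    gpu_work = work * gpu_fraction
    cpu_work = work - gpu_work
    gpu_rate = gpu_work / gpu_time   # observed GPU throughput
    cpu_rate = cpu_work / cpu_time   # observed CPU throughput
    # Equal finishing times require shares proportional to throughput.
    return gpu_rate / (gpu_rate + cpu_rate)


def split_among_cores(cpu_work, core_rates):
    """Level 2 (assumed): divide the CPU share among cores in proportion
    to each core's measured rate."""
    total_rate = sum(core_rates)
    return [cpu_work * rate / total_rate for rate in core_rates]


if __name__ == "__main__":
    # Hypothetical measurements: an even split took 1 s on the GPU, 4 s on the CPU.
    new_fraction = update_gpu_fraction(1000.0, 0.5, gpu_time=1.0, cpu_time=4.0)
    cpu_share = 1000.0 * (1.0 - new_fraction)
    print(new_fraction, split_among_cores(cpu_share, [1.0, 1.0, 2.0]))
```

In this sketch the split is recomputed from the most recent measurement only; a real implementation would likely smooth the estimate over several steps, since the observed rates vary with the size of each update.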