

4.2.4 Many core acceleration of the numerical scheme - offload mode

In this section we concentrate on further acceleration of the CPU code by the Intel MIC technology, namely by the Intel Xeon Phi 7120P coprocessors of the Knights Corner (KNC) series. An introduction to the MIC architecture and possible programming models is presented in [72], with real-world applications discussed in [73, 74]. The results presented here are summarized in the paper [101].

The Xeon Phi cards follow a different trend than current processors. Instead of a handful of cores running at relatively high frequencies, the KNC coprocessor features 61 physical cores running at a much lower frequency. The cores are based on a dated x86 Pentium technology, however with 64-bit support and, most importantly, 512-bit SIMD registers. Every core is capable of running four hardware threads, which is important to overcome the bottleneck given by the in-order execution of instructions. As mentioned earlier, the processing speed of the KNC card is approximately the same as that of the pair of HSW processors available at the Salomon supercomputer.

Figure 4.8: Asymmetric offload programming model (CPU threads asynchronously offloading work to MIC 0 and MIC 1).

However, such performance can only be achieved by exploiting several levels of parallelism including OpenMP threading and in-core vectorization. The x86 compatibility ensures that a C++ code capable of running on standard processors can be recompiled to run on the coprocessor and vice versa. This brings along an interesting programming paradigm: an existing CPU code is easily ported to and run on the coprocessor, the code is then optimized to benefit from the Xeon Phi architecture, and afterwards the same optimizations are applied to the CPU portion of the code, leading to a significant speedup on the CPU as well. This strategy can be seen as a major advantage over GP-GPU (General Purpose Computing on Graphics Processing Units) acceleration, where the structure of the code has to be modified substantially using specialized APIs and thus cannot be ported back to the CPU.

The Xeon Phi offers several programming models. Due to the x86 compatibility, the whole program can run natively on the coprocessor. In this model the CPU is idle and one has to take into account the limited memory available on the Xeon Phi. In Section 4.2.5 we discuss the native mode on both the Knights Corner and Knights Landing platforms. Another possibility is the symmetric model based on the Message Passing Interface (MPI), which treats both CPUs and coprocessors as separate nodes running the same code. In the rest of this section we adopt the offload (asymmetric) model, where the code runs on the host CPU and the computationally intensive kernels are processed on the accelerator.

The asynchronous offload model also allows concurrent computation on the host CPU and the Xeon Phi card, see Figure 4.8. The offload can be achieved either by OpenMP 4.0 pragmas, or by Intel Language Extension for Offload (LEO) available in the Intel Parallel Studio XE 2016.
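For illustration only, a similar offload of a simple reduction can also be expressed with the OpenMP 4.0 device constructs instead of LEO. The sketch below is not taken from the implementation discussed here; it assumes quadrature data w, x of length S and an integrand f as in Listing 4.3, and f would have to be marked with declare target to be callable on the device.

// Minimal sketch of an OpenMP 4.0 target offload of the quadrature reduction
// (assumes w, x of length S and an integrand f marked with omp declare target).
double result = 0.0;
#pragma omp target map( to : w[ 0 : S ], x[ 0 : S ] ) map( tofrom : result )
#pragma omp parallel for simd reduction( + : result )
for ( int l = 0; l < S; ++l ) {
  result += w[ l ] * f( x[ l ] );
}

The remainder of this section uses the LEO pragmas.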

In Listing 4.9 we demonstrate the offload model on the simple 1D quadrature from Listing 4.3.

The offload region is enclosed in a block with the #pragma offload target( mic : id ) clause, with id denoting the identification number of the Xeon Phi card. To transfer the static array x we specify the in( x ) clause; the dynamically allocated array w is transferred using the in( w : length( S ) alloc_if( a_flag ) free_if( f_flag ) ) clause. The boolean expression a_flag determines whether the array should be allocated on MIC (true) or reused from a previous offload (false). The f_flag flag directs the runtime to free the memory after the execution of the offload region ends (true) or to retain the array for a future offload (false). If not specified otherwise, the alignment of the arrays is preserved during the offload. With the clause inout( result ), the result variable is transferred to the coprocessor with its value initialized on the CPU and transferred back after the offload region terminates.

 1  double * w = (double *) _mm_malloc( S * sizeof( double ), 64 );
 2  w[ 0 ] = ...
 3  double x[ ] __attribute__(( aligned( 64 ) )) = { ... };
 4
 5  #pragma offload target( mic : 0 ) \
 6    inout( result ) \
 7    in( x ) \
 8    in( w : length( S ) alloc_if( true ) free_if( true ) )
 9  {
10    __assume_aligned( w, 64 );
11    #pragma omp simd reduction( + : result )
12    for( int l = 0; l < S; ++l ) {
13      result += w[ l ] * f( x[ l ] );
14    } // end for l
15  } // end pragma offload
16  _mm_free( w );

Listing 4.9: Vectorized quadrature on MIC.

Figure 4.9: Distribution of the matrix assembly (slices assigned to MIC 0, MIC 1, and the CPU).

The current capabilities of the offload pragma are limited to scalar values, arrays, or pointers to these. However, inside the offload region one can use standard constructs including C++ objects or OpenMP parallel regions. This is demonstrated in lines 10–14, which are copied from Listing 4.3. This allows us to use the code from Listing 4.8 to assemble local contributions on both the CPU and the Xeon Phi cards. It thus remains to discuss the modified version of the global matrix assembly.

Since we aim to employ both the CPU and the Xeon Phi cards (in the following we assume there are two available), we partition the global matrix into three slices as depicted in Figure 4.9. The slices assigned to the coprocessors may not fit into their memory, thus we further separate them into smaller chunks. This also allows us to use the technique of double buffering, which hides the data transfer between the card and the host CPU. The assembly of a subsequent chunk is started at the same moment as the transfer of the previous chunk to the CPU is initiated.
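To make the chunking concrete, the number of matrix rows per chunk can be derived from the memory available for one buffer on the card. The following sketch is only illustrative and all names in it (budget, sliceRows, chunkRows) are hypothetical, not taken from the actual implementation:

// Illustrative computation of the chunk size from a per-buffer memory budget.
long nColumns  = 81920;                       // columns of the global matrix
long sliceRows = 30000;                       // rows assigned to this coprocessor
long budget    = 4L * 1024 * 1024 * 1024;     // assumed memory budget per buffer [B]
long rowBytes  = nColumns * sizeof( double ); // storage of one matrix row
long chunkRows = budget / rowBytes;           // rows fitting into one buffer
long nChunks   = ( sliceRows + chunkRows - 1 ) / chunkRows; // ceiling division
// chunk c then covers the element indices
// start = c * chunkRows, end = min( ( c + 1 ) * chunkRows, sliceRows ).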

To prepare the assembly it is first necessary to send the relevant data to the coprocessors. This is done in Listing 4.10, where we extract the raw data from the mesh object, including the node coordinates, the element indices, and the precomputed reference quadrature data. The arrays buffer1, buffer2 have the size of the matrix chunks and will serve for the double buffering technique.

 1  double * buffer[ 2 ][ 2 ]; // allocated by _mm_malloc
 2
 3  #pragma omp parallel num_threads( 2 )
 4  {
 5    rank = omp_get_thread_num( );
 6    double * buffer1 = buffer[ 0 ][ rank ];
 7    double * buffer2 = buffer[ 1 ][ rank ];
 8    // initial transfer of data
 9    #pragma offload_transfer target( mic : rank ) \
10      in( buffer1 : length( ... ) alloc_if( 1 ) free_if( 0 ) ) \
11      in( buffer2 : length( ... ) alloc_if( 1 ) free_if( 0 ) ) \
12      in( nodes : length( ... ) alloc_if( 1 ) free_if( 0 ) ) \
13      in( elements : length( ... ) alloc_if( 1 ) free_if( 0 ) ) \
14      ...
15      in( x1RefIdentical : length( ... ) alloc_if( 1 ) free_if( 0 ) ) \
16      ...
17      in( y2RefIdentical : length( ... ) alloc_if( 1 ) free_if( 0 ) )
18  }

Listing 4.10: Transfer of raw data to Intel Xeon Phi.
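Since all the arrays in Listing 4.10 are uploaded with free_if( 0 ), they stay resident in the coprocessor memory for the whole assembly. The memory could be released after the last chunk has been processed, e.g., by an offload_transfer that moves no data; the following is only a sketch of a possible cleanup, not a part of the listings above:

// Possible cleanup after the last offload: length( 0 ) transfers no data,
// free_if( 1 ) releases the arrays previously allocated on the card.
#pragma offload_transfer target( mic : rank ) \
  in( buffer1 : length( 0 ) alloc_if( 0 ) free_if( 1 ) ) \
  in( buffer2 : length( 0 ) alloc_if( 0 ) free_if( 1 ) ) \
  in( nodes : length( 0 ) alloc_if( 0 ) free_if( 1 ) ) \
  in( elements : length( 0 ) alloc_if( 0 ) free_if( 1 ) )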

# threads   24     24 + 60   24 + 120   24 + 240   24 + 2·60   24 + 2·120   24 + 2·240
double      1.00   1.39      1.59       1.77       1.72        2.27         2.66
single      1.00   1.35      1.48       1.66       1.60        2.07         2.43

Table 4.5: Speedup of CPU+MIC vs. CPU assembly for Vh.

The assembly of the global matrix is sketched in Listing 4.11. The OpenMP threads running on the host are divided into two groups: for every MIC we dedicate one thread to manage the double buffering and the mapping to the global matrix, see Figure 4.8. In line 41 the rest of the threads call the standard method getLocalMatrix discussed above. The only difference to the CPU code from Listing 4.1 is that we cannot use #pragma omp for, since such a loop would have to be reached by all threads of the team at once. Instead, we create a list of OpenMP tasks, which are dynamically assigned to the free threads.

Depending on s, the correct buffer is chosen for the MIC assembly, while the other one is asynchronously transferred back to the CPU. The offload in line 9 of Listing 4.11, executed for every matrix chunk, does not transfer any additional data; all the relevant pointers are reused from the initial upload by means of the alloc_if( 0 ) clause. The signal clause ensures the asynchronous offload, i.e., the responsible CPU thread is released right after the initiation of the offload region and is ready to transfer the previous chunk in line 28. The wait clause ensures that the previous computation has already finished. Inside the offload region we use the standard assembly code; the outer loop is executed by a team of threads created on the coprocessor, and only the subset of element indices defined by start and end is processed.
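The signal/wait mechanics can be isolated from the BEM data structures in the following stripped-down skeleton; data, n, and process_on_mic are placeholders used here only to show the handshake between the asynchronous offload and the subsequent transfer:

// Skeleton of the asynchronous handshake: the host thread starts the offload,
// returns immediately, and later blocks until the tagged offload has finished.
#pragma offload target( mic : 0 ) signal( data ) \
  in( data : length( n ) alloc_if( 0 ) free_if( 0 ) )
{
  process_on_mic( data, n ); // executed on the coprocessor
}
// ... work on the host overlapping with the offload ...
#pragma offload_transfer target( mic : 0 ) wait( data ) \
  out( data : length( n ) alloc_if( 0 ) free_if( 0 ) )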

Again, we provide the scalability tests performed on one Salomon node equipped with two Intel Xeon Phi 7120P coprocessors. The mesh Γh with 81,920 elements is given by a six times refined icosahedron with its nodes mapped to the unit sphere. The reference times are obtained by the assembly of Vh and Kh in double (single) precision on 24 CPU threads with vectorization enabled and read 125.56 s (86.55 s) and 139.83 s (113.43 s), respectively.

# threads   24     24 + 60   24 + 120   24 + 240   24 + 2·60   24 + 2·120   24 + 2·240
double      1.00   1.28      1.46       1.77       1.41        2.04         2.43
single      1.00   1.37      1.59       1.87       1.60        2.16         2.54

Table 4.6: Speedup of CPU+MIC vs. CPU assembly for Kh.

Figure 4.10: Accelerated assembly of the BEM matrices, double precision. (a) Vh (double), single-layer matrix; (b) Kh (double), double-layer matrix. Assembly time t [s] versus the number of OMP threads for 2xHSW and 2xHSW+(2x)KNC.

Figure 4.11: Accelerated assembly of the BEM matrices, single precision. (a) Vh (single), single-layer matrix; (b) Kh (single), double-layer matrix. Assembly time t [s] versus the number of OMP threads for 2xHSW and 2xHSW+(2x)KNC.

We test the accelerated code on one and two Xeon Phi cards using 60, 120, and 240 hardware threads. With one coprocessor and 240 threads employed, the computational times reduce to 70.77 s (52.04 s) and 78.93 s (60.78 s), reaching the speedup of 1.77 (1.66) and 1.77 (1.87), respectively. Running 240 threads on both coprocessors further reduces the assembly times to 47.30 s (35.69 s) and 57.50 s (44.64 s), achieving the total speedup of 2.66 (2.43) and 2.43 (2.54), respectively. In Tables 4.5, 4.6 we summarize the results for various numbers of threads. One core of the coprocessor is always left idle to run the card’s operating system. It is clearly seen that hardware threading plays a more important role here than hyperthreading does on modern processors.
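The speedups listed in Tables 4.5, 4.6 are thus simply the ratios of the reference CPU time to the accelerated time, e.g., 125.56 s / 47.30 s ≈ 2.66 for Vh in double precision with both coprocessors fully employed.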

Figures 4.10 and 4.11 graphically present the assembly times.

Although the obtained speedup is reasonable, the drawback of the described method is the necessity to guess the correct slicing of the global system matrix. The goal is for the coprocessor part to finish at the same time as the CPU portion. Thus, a more sophisticated load balancing should be designed, which can be based on a more granular decomposition of the matrix. After a portion of, say, matrix rows is processed in parallel on the Xeon Phi and the CPU, the next portion of rows can be distributed based on the previously measured computational times; see also [74, Chapter 11] for a possible solution.
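One possible realization of such a feedback-driven distribution is sketched below. It is not part of the implementation described in this section; the names timeCpu, timeMic, portionCpu, and portionMic are purely illustrative. After every round, each device receives a share of the next portion of rows proportional to the throughput it achieved in the previous round.

// Illustrative redistribution of matrix rows based on measured throughputs.
double throughputCpu = portionCpu / timeCpu;        // rows per second on the CPU
double throughputMic = portionMic / timeMic;        // rows per second on one card
double total = throughputCpu + 2.0 * throughputMic; // two coprocessors available
long nextPortion = 10000;                           // rows distributed in the next round
long nextCpu = (long)( nextPortion * throughputCpu / total );
long nextMic = ( nextPortion - nextCpu ) / 2;       // split the remainder between the cards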

 1  #pragma omp parallel
 2  {
 3    rank = omp_get_thread_num( );
 4    if( rank < 2 ) { // two threads devoted to MIC management
 5      for( s = 0; s <= nChunks; ++s ) {
 6
 7        if( s < nChunks ) { // compute new chunk
 8          double * inBuffer = buffer[ s % 2 ][ rank ];
 9          #pragma offload target( mic : rank ) \
10            in( inBuffer : length( 0 ) alloc_if( 0 ) free_if( 0 ) ) \
11            signal( buffer[ s % 2 ][ rank ] ) /* asynchronous offload */ \
12            in( nodes : length( 0 ) alloc_if( 0 ) free_if( 0 ) ) \
13            ... // reuse all data from initial upload with alloc_if( 0 )
14          {
15            // loop over a subset of tau_l
16            #pragma omp parallel for // omp parallel region on MIC
17            for( int tau_l = start; tau_l < end; ++tau_l ) {
18              for( int tau_k = 0; tau_k < E; ++tau_k ) {
19                getLocalMatrix( tau_l, tau_k, localMatrix );
20                inBuffer.add( tau_l, tau_k, localMatrix );
21              } // end for tau_k
22            } // end for tau_l
23          } // end pragma offload
24        } // end if s < nChunks
25
26        if( s > 0 ) { // transfer previous chunk
27          double * outBuffer = buffer[ ( s - 1 ) % 2 ][ rank ];
28          #pragma offload_transfer target( mic : rank ) \
29            wait( buffer[ ( s - 1 ) % 2 ][ rank ] ) \
30            out( outBuffer : length( ... ) )
31          ... // copy the buffer to global matrix
32        } // end if s > 0
33      } // end for s
34    } else { // computation on CPU
35      #pragma omp single nowait
36      {
37        for( int tau_l = start; tau_l < end; ++tau_l ) {
38          #pragma omp task
39          {
40            for( int tau_k = 0; tau_k < E; ++tau_k ) {
41              getLocalMatrix( tau_l, tau_k, localMatrix );
42              globalMatrix.add( tau_l, tau_k, localMatrix );
43            } // end for tau_k
44          } // end omp task
45        } // end for tau_l
46      } // end omp single
47    } // end if rank
48  } // end omp parallel

Listing 4.11: Matrix assembly accelerated by the Intel Xeon Phi coprocessors.