
Conference nicctr::kap-users

Title:Kuck Associates Preprocessor Users
Notice:KAP V2.1 (f90,f77,C) SSB-kits - see note 2
Moderator:HPCGRP::DEGREGORY
Created:Fri Nov 22 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:390
Total number of notes:1440

365.0. "Help speedup a C program (in next note)" by TAV02::GLASS (Yossi Glass, 882-3254) Thu Feb 06 1997 10:14

We are trying to use KAP C in order to ||ize (parallelize) a program. It is
a benchmark contest between us and SGI: the winner sells a 4-CPU machine,
an AlphaServer 4100 for us or an Origin 2000 for them.

At first we had problems getting any 'kmp' calls into the .cmp.c file.
Therefore we took the most time-consuming routine, put it in a separate
module, and got it ||ized (we now see 'kmp' calls in the .cmp.c file).
However, even now we don't get any speedup.

The program is attached in the following note. You can also ftp a file
called 'dennis.tar' from tavosf.iso.dec.com (16.190.64.193) as follows:
- ftp as anonymous to tavosf.iso.dec.com
- cd pub
- get dennis.tar (in binary mode)

After copying and untarring the file you will see a README file with some
info, and two directories, serial and parallel ('serial' holds the original
source, which also appears in the following note; 'parallel' is a modified
version in which one of the routines has been extracted into a separate
module).

Good luck,
Yossi.

365.1. "Program source code" by TAV02::GLASS (Yossi Glass, 882-3254) Wed Feb 06 1997 10:21

Here is the source of the program:
---------------------------------

/* md-smp.c */


/********************************************************************

 Basic 2-D & 3-D molecular dynamics code - adapted for SMP

 Copyright (c) 1997 D.C. Rapaport 

 Compile:
  cc CFLAGS -o md-smp md-smp.c [-DSMP] [-DD3] -lm
    where CFLAGS = {-n32 -r5000 -O3}, or {-migrate -O4}, etc.

 Define SMP to get SMP code for 4 procs.
 Define D3 to get 3-D version (run both 2-D and 3-D versions).
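
 Builds with KAP C later in this topic use the kcc driver; a plausible
 build line (exact KAP options may vary -- see .2 below) is:
  kcc -ckapargs='-conc' CFLAGS -o md-smp md-smp.c [-DSMP] [-DD3] -lm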

 Edited output from "small" runs (1-2 minutes) on a 6-SPECfp95 workstation:

  > cc ... 
  > md-smp 200
  2-D test: 46000 atoms, 200 steps
     20  0.641  1.000
     40  0.683  1.000
    ...
    180  0.669  1.000
    200  0.666  1.000
   time/atom-step 3.68 musec   <<- this is the important number!
  
  > cc ... -DD3 ...
  > md-smp 24
  3-D test: 55296 atoms, 200 steps
     20  0.934  1.500
     40  0.982  1.500
    ...
    180  0.967  1.500
    200  0.966  1.500
   time/atom-step 8.73 musec   <<- this is the important number!

 (Near-linear speedup is obtained on a message-passing architecture using
 another version of this program; we hope for at least 75% efficiency on 
 a 4-node SMP since the code is fully parallelizable in principle.)

 Note: the SMP version has not been tested on a multi-cpu machine, and
 may require further changes.

********************************************************************/


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>


#ifndef SMP
#define SMP 0
#endif

#ifdef D3
#define NDIM 3
#else
#define NDIM 2
#endif


/* vector definitions and macro ops, etc */

typedef struct {double x, y;} VecD2;
typedef struct {double x, y, z;} VecD3;
typedef struct {int x, y;} VecI2;
typedef struct {int x, y, z;} VecI3;

#if NDIM == 2

typedef VecD2 VecD;
typedef VecI2 VecI;

#define VSet(v, sx, sy) v.x = sx, v.y = sy
#define VZero(v) VSet (v, 0, 0)
#define VScale(v, s) v.x *= s, v.y *= s
#define VSCopy(v2, s1, v1) \
   v2.x = (s1) * v1.x, v2.y = (s1) * v1.y
#define VAdd(v1, v2, v3) \
   v1.x = v2.x + v3.x, v1.y = v2.y + v3.y
#define VSub(v1, v2, v3) \
   v1.x = v2.x - v3.x, v1.y = v2.y - v3.y
#define VMul(v1, v2, v3) \
   v1.x = v2.x * v3.x, v1.y = v2.y * v3.y
#define VDiv(v1, v2, v3) \
   v1.x = v2.x / v3.x, v1.y = v2.y / v3.y
#define VSAdd(v1, v2, s3, v3) \
   v1.x = v2.x + (s3) * v3.x, v1.y = v2.y + (s3) * v3.y
#define VProd(v) (v.x * v.y)
#define VLinear(p, s) p.y * s.x + p.x
#define VLenSq(v) (SQ (v.x) + SQ (v.y))
#define OFFSET_VALS {0,0, 1,0, 1,1, 0,1, -1,1}
#define DO_OFFSET for (offset = 0; offset < 5; offset ++)
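/* visiting the cell itself plus half its neighbors (5 of the 9 cells
   here, 14 of 27 in 3-D below) ensures each pair of cells is examined
   exactly once */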

#elif NDIM == 3

typedef VecD3 VecD;
typedef VecI3 VecI;

#define VSet(v, sx, sy, sz) v.x = sx, v.y = sy, v.z = sz
#define VZero(v) VSet (v, 0, 0, 0)
#define VScale(v, s) v.x *= s, v.y *= s, v.z *= s
#define VSCopy(v2, s1, v1) \
   v2.x = (s1) * v1.x, v2.y = (s1) * v1.y, v2.z = (s1) * v1.z
#define VAdd(v1, v2, v3) \
   v1.x = v2.x + v3.x, v1.y = v2.y + v3.y, v1.z = v2.z + v3.z
#define VSub(v1, v2, v3) \
   v1.x = v2.x - v3.x, v1.y = v2.y - v3.y, v1.z = v2.z - v3.z
#define VMul(v1, v2, v3) \
   v1.x = v2.x * v3.x, v1.y = v2.y * v3.y, v1.z = v2.z * v3.z
#define VDiv(v1, v2, v3) \
   v1.x = v2.x / v3.x, v1.y = v2.y / v3.y, v1.z = v2.z / v3.z
#define VSAdd(v1, v2, s3, v3) \
   v1.x = v2.x + (s3) * v3.x, v1.y = v2.y + (s3) * v3.y, \
   v1.z = v2.z + (s3) * v3.z
#define VProd(v) (v.x * v.y * v.z)
#define VLinear(p, s) (p.z * s.y + p.y) * s.x + p.x
#define VLenSq(v) (SQ (v.x) + SQ (v.y) + SQ (v.z))
#define OFFSET_VALS {0,0,0, 1,0,0, 1,1,0, 0,1,0, -1,1,0, 0,0,1, \
   1,0,1, 1,1,1, 0,1,1, -1,1,1, -1,0,1, -1,-1,1, 0,-1,1, 1,-1,1}
#define DO_OFFSET for (offset = 0; offset < 14; offset ++)

#endif

#define SQ(x) ((x) * (x))
#define AllocT(x, t) (t *) malloc ((x) * sizeof (t))
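/* TdiffD: elapsed time in microseconds between two struct timeval samples */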
#define TdiffD(tv2, tv1) \
   (1e6 * (tv2.tv_sec - tv1.tv_sec) + tv2.tv_usec - tv1.tv_usec)


/* prototypes */

void BuildNebrList ();
void ComputeForces ();
void LeapfrogIntegrate ();
void CheckPeriodicWrap ();
void CalcEnergy ();
void InitialState ();
void VRand ();


/* globals */

static VecD *r, *rv, *ra, region;
static VecI cells, initUcell;
static double deltaT, rCut, temperature, density, uSum, kinEnergy,
   potEnergy, rNebrShell, dispHi, vvMax;
static int *cellList, nAtom, stepCount, stepMax, stepPrint,
   nUcellEdge, nebrTabMax, nebrTabFac, nebrNow;
#if SMP
#define MAX_PROC 4
static VecD *raP[MAX_PROC];
static int *nebrTab[MAX_PROC], nebrTabLen[MAX_PROC], nProc;
#else
static int *nebrTab, nebrTabLen;
#endif



/* do not parallelize main program */
int main (int argc, char **argv)
{
  FILE *fp;
  struct timeval tv1, tv2;
#if SMP
  int ip;
#endif

#if NDIM == 2
  nUcellEdge = 100;
  nebrTabFac = 5;
#elif NDIM == 3
  nUcellEdge = 16;
  nebrTabFac = 10;
#endif
  if (argc > 1) nUcellEdge = atol (argv[1]);
  temperature = 1.;
  density = 0.8;
  rCut = pow (2., 1./6.);
  rNebrShell = 0.4;
  stepMax = 200;
  stepPrint = 20;
  deltaT = 0.005;
#if NDIM == 2
  VSet (initUcell, nUcellEdge, nUcellEdge / sqrt (3.));
  region.x = initUcell.x / sqrt (density * sqrt (3.) / 2.);
  region.y = region.x;
  nAtom = 2 * VProd (initUcell);
  density = nAtom / VProd (region);
#elif NDIM == 3
  VSet (initUcell, nUcellEdge, nUcellEdge, nUcellEdge);
  nAtom = 4 * VProd (initUcell);
  VSCopy (region, 1. / pow (density / 4., 1./3.), initUcell);
#endif
  nebrTabMax = nebrTabFac * nAtom;
  VSCopy (cells, 1. / (rCut + rNebrShell), region);
  r = AllocT (nAtom, VecD);
  rv = AllocT (nAtom, VecD);
  ra = AllocT (nAtom, VecD);
  cellList = AllocT (nAtom + VProd (cells), int);
#if SMP
  nProc = MAX_PROC;
  for (ip = 0; ip < nProc; ip ++) {
    raP[ip] = AllocT (nAtom, VecD);
    nebrTab[ip] = AllocT (2 * nebrTabMax / nProc, int);
  }
#else
  nebrTab = AllocT (2 * nebrTabMax, int);
#endif
  InitialState ();
  nebrNow = 1;
  fp = stdout;
#if SMP
  fprintf (fp, "SMP %d procs -- ", nProc);
#endif
  fprintf (fp, "%d-D test: %d atoms, %d steps\n", NDIM, nAtom, stepMax);
/* start timing */
  gettimeofday (&tv1, NULL);
/* main loop */
  for (stepCount = 1; stepCount <= stepMax; stepCount ++) {
    if (nebrNow) {
      nebrNow = 0;  
      dispHi = 0.;
      BuildNebrList ();
    }
    ComputeForces ();
    LeapfrogIntegrate ();
    CheckPeriodicWrap ();
    CalcEnergy ();
    dispHi += sqrt (vvMax) * deltaT;
    if (dispHi > 0.5 * rNebrShell) nebrNow = 1;
    if (stepCount % stepPrint == 0) {
      fprintf (fp, "%5d  %.3f  %.3f\n", stepCount, kinEnergy,
         kinEnergy + potEnergy);
      fflush (fp);
    }
  }
/* end */
  gettimeofday (&tv2, NULL);
  fprintf (fp, " time/atom-step %.2f musec\n",
     TdiffD (tv2, tv1) / (stepMax * nAtom));
  exit (0);
}



#define WrapC(t) \
   if (m2v.t >= cells.t) m2v.t = 0, shift.t = region.t; \
   else if (m2v.t < 0) m2v.t = cells.t - 1, shift.t = - region.t
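/* WrapC wraps one cell-index component periodically, recording the
   coordinate shift needed when looking into an image cell */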


/* function should be automatically parallelized -
   can be 2nd largest cpu user, need to work on local variables,
   perhaps better way to organize first part of function */
void BuildNebrList ()
{
  VecD dr, invWid, rs, shift;
  VecI cc, m1v, m2v, iof[] = OFFSET_VALS;
  double rrNebr;
  int c, j1, j2, m1, m1x, m1y, m1z, m2, n, offset;
#if SMP
  int ip;
#endif

  VDiv (invWid, cells, region);
  rrNebr = SQ (rCut + rNebrShell);
/* start parallel section */
  for (n = nAtom; n < nAtom + VProd (cells); n ++) cellList[n] = -1;
/* end parallel section */
/* barrier here */
/* start parallel section - this is safe to parallelize (better ways?) */
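/* (safe because each atom n belongs to exactly one cell c, and the test
   below gives each cell to exactly one thread, so no two threads ever
   update the same list head or the same cellList[n]) */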
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
#endif
    for (n = 0; n < nAtom; n ++) {
      VSAdd (rs, r[n], 0.5, region);
      VMul (cc, rs, invWid);
      c = VLinear (cc, cells) + nAtom;
#if SMP
      if (c % nProc == ip) {
#endif
        cellList[n] = cellList[c];
        cellList[c] = n;
#if SMP
      }
#endif
    }
#if SMP
  }
#endif
/* end parallel section */
/* barrier here */
/* start parallel section */
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
    nebrTabLen[ip] = 0;
#else
  nebrTabLen = 0;
#endif
#if NDIM == 3
    for (m1z = 0; m1z < cells.z; m1z ++) {
#endif
      for (m1y = 0; m1y < cells.y; m1y ++) {
#if SMP
        if (m1y % nProc == ip) {
#endif
          for (m1x = 0; m1x < cells.x; m1x ++) {
#if NDIM == 2
            VSet (m1v, m1x, m1y);
#elif NDIM == 3
            VSet (m1v, m1x, m1y, m1z);
#endif
            m1 = VLinear (m1v, cells) + nAtom;
            DO_OFFSET {
              VAdd (m2v, m1v, iof[offset]);
              VZero (shift);
              WrapC (x);
              WrapC (y);
#if NDIM == 3
              WrapC (z);
#endif
              m2 = VLinear (m2v, cells) + nAtom;
              for (j1 = cellList[m1]; j1 >= 0; j1 = cellList[j1]) {
                for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
                  if (m1 != m2 || j2 < j1) {
                    VSub (dr, r[j1], r[j2]);
                    VSub (dr, dr, shift);
                    if (VLenSq (dr) < rrNebr) {
#if SMP
                      nebrTab[ip][2 * nebrTabLen[ip]] = j1;
                      nebrTab[ip][2 * nebrTabLen[ip] + 1] = j2;
                      ++ nebrTabLen[ip];
#else
                      nebrTab[2 * nebrTabLen] = j1;
                      nebrTab[2 * nebrTabLen + 1] = j2;
                      ++ nebrTabLen;
#endif
                    }
                  }
                }
              }
            }
          }
#if SMP
        }
#endif
      }
#if NDIM == 3
    }
#endif
#if SMP
  }
#endif
/* end parallel section */
}



#define WrapR(v, t) if (v.t >= 0.5 * region.t) v.t -= region.t; \
   else if (v.t < - 0.5 * region.t) v.t += region.t
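/* WrapR folds one coordinate back into the periodic box
   (nearest-image convention) */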


/* function should be automatically parallelized - 
   largest (> 60%) cpu user, need to work on local variables */
void ComputeForces ()
{
  VecD dr;
  double fcVal, rr, rrCut, rri, rri3, uVal;
  int j1, j2, n;
#if SMP
  double uSumP[MAX_PROC];
  int ip;
#endif

  rrCut = SQ (rCut);
/* start parallel section */
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
    for (n = 0; n < nAtom; n ++) VZero (raP[ip][n]);
    uSumP[ip] = 0.;
#else
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
#endif
#if SMP
    for (n = 0; n < nebrTabLen[ip]; n ++) {
      j1 = nebrTab[ip][2 * n];
      j2 = nebrTab[ip][2 * n + 1];
#else
    for (n = 0; n < nebrTabLen; n ++) {
      j1 = nebrTab[2 * n];
      j2 = nebrTab[2 * n + 1];
#endif
      VSub (dr, r[j1], r[j2]);
      WrapR (dr, x);
      WrapR (dr, y);
#if NDIM == 3
      WrapR (dr, z);
#endif
      rr = VLenSq (dr);
      if (rr < rrCut) {
        rri = 1. / rr;
        rri3 = rri * rri * rri;
        fcVal = 48. * rri3 * (rri3 - 0.5) * rri;
        uVal = 4. * rri3 * (rri3 - 1.) + 1.;
#if SMP
        VSAdd (raP[ip][j1], raP[ip][j1], fcVal, dr);
        VSAdd (raP[ip][j2], raP[ip][j2], - fcVal, dr);
        uSumP[ip] += uVal;
#else
        VSAdd (ra[j1], ra[j1], fcVal, dr);
        VSAdd (ra[j2], ra[j2], - fcVal, dr);
        uSum += uVal;
#endif
      }
    }
#if SMP
  }
#endif
/* end parallel section */

#if SMP
/* two alternative ways of combining partial results (first is new) */
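/* each thread accumulated into its private raP[ip] to avoid write
   conflicts on the shared ra[]; the partial forces and the partial
   energies uSumP[] must now be combined */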
#if 1
/* barrier here */
/* start parallel section */
  for (n = 0; n < nAtom; n ++) {
    ra[n] = raP[0][n];
    for (ip = 1; ip < nProc; ip ++) VAdd (ra[n], ra[n], raP[ip][n]);
  }
  uSum = 0.;
  for (ip = 0; ip < nProc; ip ++) uSum += uSumP[ip];
/* end parallel section */
#else
/* start serial section */
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
  for (ip = 0; ip < nProc; ip ++) {
    for (n = 0; n < nAtom; n ++) VAdd (ra[n], ra[n], raP[ip][n]);
    uSum += uSumP[ip];
  }
/* end serial section */
#endif
#endif
}


/* function should be automatically parallelized */
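/* leapfrog update, h = deltaT:
   v(t+h/2) = v(t-h/2) + h a(t), then r(t+h) = r(t) + h v(t+h/2) */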
void LeapfrogIntegrate ()
{
  int n;

  for (n = 0; n < nAtom; n ++) {
    VSAdd (rv[n], rv[n], deltaT, ra[n]);
    VSAdd (r[n], r[n], deltaT, rv[n]);
  }
}



/* function should be automatically parallelized */
void CheckPeriodicWrap ()
{
  int n;

  for (n = 0; n < nAtom; n ++) {
    WrapR (r[n], x);
    WrapR (r[n], y);
#if NDIM == 3
    WrapR (r[n], z);
#endif
  }
}



/* function should be automatically parallelized -
   but note reduction ops */
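/* vvSum is a sum reduction and vvMax a max reduction; an
   auto-parallelizer must recognize both to split the loop */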
void CalcEnergy ()
{
  VecD v;
  double vv, vvSum;
  int n;

  vvSum = 0.;
  vvMax = 0.;
  for (n = 0; n < nAtom; n ++) {
    VSAdd (v, rv[n], -0.5 * deltaT, ra[n]);
    vv = VLenSq (v);
    vvSum += vv;
    if (vv > vvMax) vvMax = vv;
  }
  kinEnergy = 0.5 * vvSum / nAtom;
  potEnergy = uSum / nAtom;
}



/* function only called once (not part of timing) - do not parallelize */
void InitialState ()
{
  VecD p, gap, vSum; 
  double vMag;
  int j, n, nx, ny, nz;

  VDiv (gap, region, initUcell);
  n = 0;
#if NDIM == 3
  for (nz = 0; nz < initUcell.z; nz ++) {
#endif
    for (ny = 0; ny < initUcell.y; ny ++) {
      for (nx = 0; nx < initUcell.x; nx ++) {
#if NDIM == 2
        VSet (p, nx + 0.25, ny + 0.25);
#elif NDIM == 3
        VSet (p, nx + 0.25, ny + 0.25, nz + 0.25);
#endif
        VMul (p, p, gap);
        VSAdd (p, p, -0.5, region);
#if NDIM == 2
        for (j = 0; j < 2; j ++) {
          r[n] = p;
          if (j != 0) VSAdd (r[n], r[n], 0.5, gap);
          ++ n;
        }
#elif NDIM == 3
        for (j = 0; j < 4; j ++) {
          r[n] = p;
          if (j != 3) {
            if (j != 0) r[n].x += 0.5 * gap.x;
            if (j != 1) r[n].y += 0.5 * gap.y;
            if (j != 2) r[n].z += 0.5 * gap.z;
          }
          ++ n;
        }
#endif
      }
    }
#if NDIM == 3
  }
#endif
  vMag = sqrt (NDIM * temperature);
  VZero (vSum);
  for (n = 0; n < nAtom; n ++) {
    VRand (&rv[n]);
    VScale (rv[n], vMag);
    VAdd (vSum, vSum, rv[n]);
  }
  for (n = 0; n < nAtom; n ++) VSAdd (rv[n], rv[n], -1. / nAtom, vSum);
}



#if NDIM == 2
void VRand (VecD2 *p)
{
  double s;

  s = 2. * M_PI * drand48 ();
  p->x = cos (s);  p->y = sin (s);
}
#elif NDIM == 3
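/* rejection method: generates a uniformly distributed random
   direction on the unit sphere */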
void VRand (VecD3 *p)
{
  double s, x, y;

  s = 2.;
  while (s > 1.) {
    x = 2. * drand48 () - 1.;  y = 2. * drand48 () - 1.;
    s = SQ (x) + SQ (y);
  }
  p->z = 1. - 2. * s;
  s = 2. * sqrt (1. - s);
  p->x = s * x;  p->y = s * y;    
}
#endif



/* not used */
#if 0
void ComputeForces ()
{
  VecD dr, invWid, rs, shift;
  VecI cc, m1v, m2v, iof[] = OFFSET_VALS;
  double fcVal, rr, rrCut, rri, rri3, uVal;
  int c, j1, j2, m1, m1x, m1y, m1z, m2, n, offset;

  rrCut = SQ (rCut);
  for (n = nAtom; n < nAtom + VProd (cells); n ++) cellList[n] = -1;
  VDiv (invWid, cells, region);
  for (n = 0; n < nAtom; n ++) {
    VSAdd (rs, r[n], 0.5, region);
    VMul (cc, rs, invWid);
    c = VLinear (cc, cells) + nAtom;
    cellList[n] = cellList[c];
    cellList[c] = n;
  }
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
#if NDIM == 3
  for (m1z = 0; m1z < cells.z; m1z ++) {
#endif
    for (m1y = 0; m1y < cells.y; m1y ++) {
      for (m1x = 0; m1x < cells.x; m1x ++) {
#if NDIM == 2
        VSet (m1v, m1x, m1y);
#elif NDIM == 3
        VSet (m1v, m1x, m1y, m1z);
#endif
        m1 = VLinear (m1v, cells) + nAtom;
        DO_OFFSET {
          VAdd (m2v, m1v, iof[offset]);
          VZero (shift);
          WrapC (x);  
          WrapC (y);
#if NDIM == 3
          WrapC (z);
#endif
          m2 = VLinear (m2v, cells) + nAtom;
          for (j1 = cellList[m1]; j1 >= 0; j1 = cellList[j1]) {
            for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
              if (m1 != m2 || j2 < j1) {
                VSub (dr, r[j1], r[j2]);
                VSub (dr, dr, shift);
                rr = VLenSq (dr);
                if (rr < rrCut) {
                  rri = 1. / rr;
                  rri3 = rri * rri * rri;
                  fcVal = 48. * rri3 * (rri3 - 0.5) * rri;
                  uVal = 4. * rri3 * (rri3 - 1.) + 1.;
                  VSAdd (ra[j1], ra[j1], fcVal, dr);
                  VSAdd (ra[j2], ra[j2], - fcVal, dr);
                  uSum += uVal;
                }
              }
            }
          }
        }
      }
    }
#if NDIM == 3
  }
#endif
}
#endif

/* end of program */


365.2. "More info: Your help is still needed" by TAV02::GLASS (Yossi Glass, 882-3254) Wed Feb 12 1997 09:50

Following is a summary of my efforts to ||ize this program. In a few words:
- The customer created a program which has #ifdefs for running it on one
  CPU, and for SMP.
- Using "kcc -ckapargs='-conc'" with -DSMP, I didn't get any speedup when
  running on 4 CPUs (compared to the runtime on one CPU).
- Using "kcc -ckapargs='-conc'" but without -DSMP (which means that the
  "serial code" is ||ized by KAP itself), I got a 25% speedup on
  2 CPUs and a 50% speedup on 4 CPUs.

You can help me with the following:
- Finding out why we didn't get a speedup with -DSMP
- Finding out how to get a better speedup, with and without -DSMP

The modified code, on which I did the ||ization tests, is on the anonymous
ftp account of tavosf.iso.dec.com, under directory pub; the file is
dennis_parallel.tar.

I don't remember if I said it already: benchmark results will decide whether
the customer purchases an Origin 2000 (4 CPUs) or an AlphaServer 4100 (4 CPUs).
This customer used to be an SGI customer, until he purchased Alpha systems
2.5 years ago.

More details:
============
Profiling info
--------------
When running the program without optimization on a single CPU, the
following pieces of code consume most of the runtime:
1. In ComputeForces, the loop starting with:
        for (n = 0; n < nebrTabLen; n ++)
   takes ~55% of the total time
2. In BuildNebrList, the loop starting with:
        for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
   takes ~15% of the total time 
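
   (To reproduce the profile, a typical session with the standard Unix
   prof tool -- Digital UNIX also has pixie and uprofile -- would be
   something like:
        cc -p -o md-smp md-smp.c -lm    # compile with profiling support
        ./md-smp 24                     # the run writes mon.out
        prof md-smp                     # per-routine time breakdown
   Exact options may differ on your system.)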

What I did
----------
1. I tried to use KAP on the original code, but didn't get ||ization,
   mostly because KAP assumes that there are dependencies. Therefore
   I created two new modules: compute.c (ComputeForces) and build.c
   (BuildNebrList). Having several modules, I had to create a
   separate module for globals (globals.c), and files for #defines and
   typedefs (mdsmp.h) and for externals (externals.h). And yes, I called
   the main module 'mdsmp.c'. The layout is sketched below.
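
   A minimal sketch of the split (from memory -- the real files are in
   dennis_parallel.tar):

      /* mdsmp.h -- typedefs and macros (VecD, VecI, SQ, WrapR, ...) */

      /* globals.c -- the single definition of the shared state */
      #include "mdsmp.h"
      VecD *r, *rv, *ra, region;
      double rCut, rNebrShell, uSum;
      int *nebrTab, nebrTabLen, nAtom;
      /* ...plus the rest of the globals from .1 */

      /* externals.h -- matching extern declarations */
      extern VecD *r, *rv, *ra, region;
      extern double rCut, rNebrShell, uSum;
      extern int *nebrTab, nebrTabLen, nAtom;

      /* compute.c -- ComputeForces() only, compiled with
         kcc -ckapargs='-conc -arl=4 -mc=0' -c compute.c */
      #include "mdsmp.h"
      #include "externals.h"
      void ComputeForces () { /* body exactly as in .1 */ }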

2. I compiled the code and ran it several times for both 2-D and 3-D. I
   will talk about the 3-D results (the 2-D results behave more or
   less the same):
   - First I compiled the code without -DSMP, using plain 'cc' and
     then 'kcc' without the -conc option (to see the speedup from
     scalar optimizations alone). I got:
        3.95 musec time/atom-step for the compile using 'cc'
        3.72 musec time/atom-step for the compile using 'kcc'
   - Then I compiled the code with -DSMP, using
        kcc -ckapargs='-conc -arl=4 -mc=0' for build.c and compute.c.
     (-arl=4 tells KAP not to worry about dependencies;
      -mc=0 tells it to ||ize regardless of the operations in the loop.)
     I got 3.97 musec time/atom-step. No speedup, although KAP generates
     the calls to its thread library.
   - I suspected that the reason was some kind of cache problem
     (there are tools to test this, but I haven't used them yet).
     Therefore I tried kcc with '-conc' but without -DSMP, and got:
        for 2 CPUs: 2.98 musec time/atom-step
        for 4 CPUs: 2.41 musec time/atom-step

3. This is what I know so far. It seems that KAP does better when
   you let it ||ize the original code (it doesn't like the idea that
   somebody thinks he can ||ize the code better than it can).

		---------------------