
Conference nicctr::kap-users

Title:Kuck Associates Preprocessor Users
Notice:KAP V2.1 (f90,f77,C) SSB-kits - see note 2
Moderator:HPCGRP::DEGREGORY
Created:Fri Nov 22 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:390
Total number of notes:1440

365.0. "Help speedup a C program (in next note)" by TAV02::GLASS (Yossi Glass, 882-3254) Thu Feb 06 1997 10:14

We are trying to use KAP C in order to ||ize (parallelize) a program. It is
a benchmark contest between us and SGI: the winner sells a 4-CPU machine,
an AlphaServer 4100 for us or an Origin 2000 for them.

At first we had problems getting any 'kmp' calls into the .cmp.c file.
Therefore we took the most time-consuming routine, put it in a separate
module, and got it ||ized (we now see 'kmp' calls in the .cmp.c file).
However, even now we don't get any speedup.

The program is attached in the following note. You can also ftp a file
called 'dennis.tar' from tavosf.iso.dec.com (16.190.64.193) as follows:
- ftp as anonymous to tavosf.iso.dec.com
- cd pub
- get dennis.tar (in binary mode)

After copying and untarring the file you will see a README file with some
info, and two directories, serial and parallel ('serial' holds the original
source, which also appears in the following note; 'parallel' is a modified
version in which one of the routines has been extracted into a separate
module).

Good luck,
Yossi.

365.1. "Program source code" by TAV02::GLASS (Yossi Glass, 882-3254) Wed Feb 06 1997 10:21

Here is the source of the program:
---------------------------------

/* md-smp.c */


/********************************************************************

 Basic 2-D & 3-D molecular dynamics code - adapted for SMP

 Copyright (c) 1997 D.C. Rapaport 

 Compile:
  cc CFLAGS -o md-smp md-smp.c [-DSMP] [-DD3] -lm
    where CFLAGS = {-n32 -r5000 -O3}, or {-migrate -O4}, etc.

 Define SMP to get SMP code for 4 procs.
 Define D3 to get 3-D version (run both 2-D and 3-D versions).
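
 Builds with KAP C later in this topic use the kcc driver; a plausible
 build line (exact KAP options may vary -- see .2 below) is:
  kcc -ckapargs='-conc' CFLAGS -o md-smp md-smp.c [-DSMP] [-DD3] -lm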

 Edited output from "small" runs (1-2 minutes) on a 6-SPECfp95 workstation:

  > cc ... 
  > md-smp 200
  2-D test: 46000 atoms, 200 steps
     20  0.641  1.000
     40  0.683  1.000
    ...
    180  0.669  1.000
    200  0.666  1.000
   time/atom-step 3.68 musec   <<- this is the important number!
  
  > cc ... -DD3 ...
  > md-smp 24
  3-D test: 55296 atoms, 200 steps
     20  0.934  1.500
     40  0.982  1.500
    ...
    180  0.967  1.500
    200  0.966  1.500
   time/atom-step 8.73 musec   <<- this is the important number!

 (Near-linear speedup is obtained on a message-passing architecture using
 another version of this program; we hope for at least 75% efficiency on 
 a 4-node SMP since the code is fully parallelizable in principle.)

 Note: the SMP version has not been tested on a multi-cpu machine, and
 may require further changes.

********************************************************************/


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>


#ifndef SMP
#define SMP 0
#endif

#ifdef D3
#define NDIM 3
#else
#define NDIM 2
#endif


/* vector definitions and macro ops, etc */

typedef struct {double x, y;} VecD2;
typedef struct {double x, y, z;} VecD3;
typedef struct {int x, y;} VecI2;
typedef struct {int x, y, z;} VecI3;

#if NDIM == 2

typedef VecD2 VecD;
typedef VecI2 VecI;

#define VSet(v, sx, sy) v.x = sx, v.y = sy
#define VZero(v) VSet (v, 0, 0)
#define VScale(v, s) v.x *= s, v.y *= s
#define VSCopy(v2, s1, v1) \
   v2.x = (s1) * v1.x, v2.y = (s1) * v1.y
#define VAdd(v1, v2, v3) \
   v1.x = v2.x + v3.x, v1.y = v2.y + v3.y
#define VSub(v1, v2, v3) \
   v1.x = v2.x - v3.x, v1.y = v2.y - v3.y
#define VMul(v1, v2, v3) \
   v1.x = v2.x * v3.x, v1.y = v2.y * v3.y
#define VDiv(v1, v2, v3) \
   v1.x = v2.x / v3.x, v1.y = v2.y / v3.y
#define VSAdd(v1, v2, s3, v3) \
   v1.x = v2.x + (s3) * v3.x, v1.y = v2.y + (s3) * v3.y
#define VProd(v) (v.x * v.y)
#define VLinear(p, s) p.y * s.x + p.x
#define VLenSq(v) (SQ (v.x) + SQ (v.y))
#define OFFSET_VALS {0,0, 1,0, 1,1, 0,1, -1,1}
#define DO_OFFSET for (offset = 0; offset < 5; offset ++)
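/* visiting the cell itself plus half its neighbors (5 of the 9 cells
   here, 14 of 27 in 3-D below) ensures each pair of cells is examined
   exactly once */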

#elif NDIM == 3

typedef VecD3 VecD;
typedef VecI3 VecI;

#define VSet(v, sx, sy, sz) v.x = sx, v.y = sy, v.z = sz
#define VZero(v) VSet (v, 0, 0, 0)
#define VScale(v, s) v.x *= s, v.y *= s, v.z *= s
#define VSCopy(v2, s1, v1) \
   v2.x = (s1) * v1.x, v2.y = (s1) * v1.y, v2.z = (s1) * v1.z
#define VAdd(v1, v2, v3) \
   v1.x = v2.x + v3.x, v1.y = v2.y + v3.y, v1.z = v2.z + v3.z
#define VSub(v1, v2, v3) \
   v1.x = v2.x - v3.x, v1.y = v2.y - v3.y, v1.z = v2.z - v3.z
#define VMul(v1, v2, v3) \
   v1.x = v2.x * v3.x, v1.y = v2.y * v3.y, v1.z = v2.z * v3.z
#define VDiv(v1, v2, v3) \
   v1.x = v2.x / v3.x, v1.y = v2.y / v3.y, v1.z = v2.z / v3.z
#define VSAdd(v1, v2, s3, v3) \
   v1.x = v2.x + (s3) * v3.x, v1.y = v2.y + (s3) * v3.y, \
   v1.z = v2.z + (s3) * v3.z
#define VProd(v) (v.x * v.y * v.z)
#define VLinear(p, s) (p.z * s.y + p.y) * s.x + p.x
#define VLenSq(v) (SQ (v.x) + SQ (v.y) + SQ (v.z))
#define OFFSET_VALS {0,0,0, 1,0,0, 1,1,0, 0,1,0, -1,1,0, 0,0,1, \
   1,0,1, 1,1,1, 0,1,1, -1,1,1, -1,0,1, -1,-1,1, 0,-1,1, 1,-1,1}
#define DO_OFFSET for (offset = 0; offset < 14; offset ++)

#endif

#define SQ(x) ((x) * (x))
#define AllocT(x, t) (t *) malloc ((x) * sizeof (t))
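/* TdiffD: elapsed time in microseconds between two struct timeval samples */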
#define TdiffD(tv2, tv1) \
   (1e6 * (tv2.tv_sec - tv1.tv_sec) + tv2.tv_usec - tv1.tv_usec)


/* prototypes */

void BuildNebrList ();
void ComputeForces ();
void LeapfrogIntegrate ();
void CheckPeriodicWrap ();
void CalcEnergy ();
void InitialState ();
void VRand ();


/* globals */

static VecD *r, *rv, *ra, region;
static VecI cells, initUcell;
static double deltaT, rCut, temperature, density, uSum, kinEnergy,
   potEnergy, rNebrShell, dispHi, vvMax;
static int *cellList, nAtom, stepCount, stepMax, stepPrint,
   nUcellEdge, nebrTabMax, nebrTabFac, nebrNow;
#if SMP
#define MAX_PROC 4
static VecD *raP[MAX_PROC];
static int *nebrTab[MAX_PROC], nebrTabLen[MAX_PROC], nProc;
#else
static int *nebrTab, nebrTabLen;
#endif



/* do not parallelize main program */
int main (int argc, char **argv)
{
  FILE *fp;
  struct timeval tv1, tv2;
#if SMP
  int ip;
#endif

#if NDIM == 2
  nUcellEdge = 100;
  nebrTabFac = 5;
#elif NDIM == 3
  nUcellEdge = 16;
  nebrTabFac = 10;
#endif
  if (argc > 1) nUcellEdge = atol (argv[1]);
  temperature = 1.;
  density = 0.8;
  rCut = pow (2., 1./6.);
  rNebrShell = 0.4;
  stepMax = 200;
  stepPrint = 20;
  deltaT = 0.005;
#if NDIM == 2
  VSet (initUcell, nUcellEdge, nUcellEdge / sqrt (3.));
  region.x = initUcell.x / sqrt (density * sqrt (3.) / 2.);
  region.y = region.x;
  nAtom = 2 * VProd (initUcell);
  density = nAtom / VProd (region);
#elif NDIM == 3
  VSet (initUcell, nUcellEdge, nUcellEdge, nUcellEdge);
  nAtom = 4 * VProd (initUcell);
  VSCopy (region, 1. / pow (density / 4., 1./3.), initUcell);
#endif
  nebrTabMax = nebrTabFac * nAtom;
  VSCopy (cells, 1. / (rCut + rNebrShell), region);
  r = AllocT (nAtom, VecD);
  rv = AllocT (nAtom, VecD);
  ra = AllocT (nAtom, VecD);
  cellList = AllocT (nAtom + VProd (cells), int);
#if SMP
  nProc = MAX_PROC;
  for (ip = 0; ip < nProc; ip ++) {
    raP[ip] = AllocT (nAtom, VecD);
    nebrTab[ip] = AllocT (2 * nebrTabMax / nProc, int);
  }
#else
  nebrTab = AllocT (2 * nebrTabMax, int);
#endif
  InitialState ();
  nebrNow = 1;
  fp = stdout;
#if SMP
  fprintf (fp, "SMP %d procs -- ", nProc);
#endif
  fprintf (fp, "%d-D test: %d atoms, %d steps\n", NDIM, nAtom, stepMax);
/* start timing */
  gettimeofday (&tv1, NULL);
/* main loop */
  for (stepCount = 1; stepCount <= stepMax; stepCount ++) {
    if (nebrNow) {
      nebrNow = 0;  
      dispHi = 0.;
      BuildNebrList ();
    }
    ComputeForces ();
    LeapfrogIntegrate ();
    CheckPeriodicWrap ();
    CalcEnergy ();
    dispHi += sqrt (vvMax) * deltaT;
    if (dispHi > 0.5 * rNebrShell) nebrNow = 1;
    if (stepCount % stepPrint == 0) {
      fprintf (fp, "%5d  %.3f  %.3f\n", stepCount, kinEnergy,
         kinEnergy + potEnergy);
      fflush (fp);
    }
  }
/* end */
  gettimeofday (&tv2, NULL);
  fprintf (fp, " time/atom-step %.2f musec\n",
     TdiffD (tv2, tv1) / (stepMax * nAtom));
  exit (0);
}



#define WrapC(t) \
   if (m2v.t >= cells.t) m2v.t = 0, shift.t = region.t; \
   else if (m2v.t < 0) m2v.t = cells.t - 1, shift.t = - region.t
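/* WrapC wraps one cell-index component periodically, recording the
   coordinate shift needed when looking into an image cell */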


/* function should be automatically parallelized -
   can be 2nd largest cpu user, need to work on local variables,
   perhaps better way to organize first part of function */
void BuildNebrList ()
{
  VecD dr, invWid, rs, shift;
  VecI cc, m1v, m2v, iof[] = OFFSET_VALS;
  double rrNebr;
  int c, j1, j2, m1, m1x, m1y, m1z, m2, n, offset;
#if SMP
  int ip;
#endif

  VDiv (invWid, cells, region);
  rrNebr = SQ (rCut + rNebrShell);
/* start parallel section */
  for (n = nAtom; n < nAtom + VProd (cells); n ++) cellList[n] = -1;
/* end parallel section */
/* barrier here */
/* start parallel section - this is safe to parallelize (better ways?) */
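/* (safe because each atom n belongs to exactly one cell c, and the test
   below gives each cell to exactly one thread, so no two threads ever
   update the same list head or the same cellList[n]) */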
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
#endif
    for (n = 0; n < nAtom; n ++) {
      VSAdd (rs, r[n], 0.5, region);
      VMul (cc, rs, invWid);
      c = VLinear (cc, cells) + nAtom;
#if SMP
      if (c % nProc == ip) {
#endif
        cellList[n] = cellList[c];
        cellList[c] = n;
#if SMP
      }
#endif
    }
#if SMP
  }
#endif
/* end parallel section */
/* barrier here */
/* start parallel section */
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
    nebrTabLen[ip] = 0;
#else
  nebrTabLen = 0;
#endif
#if NDIM == 3
    for (m1z = 0; m1z < cells.z; m1z ++) {
#endif
      for (m1y = 0; m1y < cells.y; m1y ++) {
#if SMP
        if (m1y % nProc == ip) {
#endif
          for (m1x = 0; m1x < cells.x; m1x ++) {
#if NDIM == 2
            VSet (m1v, m1x, m1y);
#elif NDIM == 3
            VSet (m1v, m1x, m1y, m1z);
#endif
            m1 = VLinear (m1v, cells) + nAtom;
            DO_OFFSET {
              VAdd (m2v, m1v, iof[offset]);
              VZero (shift);
              WrapC (x);
              WrapC (y);
#if NDIM == 3
              WrapC (z);
#endif
              m2 = VLinear (m2v, cells) + nAtom;
              for (j1 = cellList[m1]; j1 >= 0; j1 = cellList[j1]) {
                for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
                  if (m1 != m2 || j2 < j1) {
                    VSub (dr, r[j1], r[j2]);
                    VSub (dr, dr, shift);
                    if (VLenSq (dr) < rrNebr) {
#if SMP
                      nebrTab[ip][2 * nebrTabLen[ip]] = j1;
                      nebrTab[ip][2 * nebrTabLen[ip] + 1] = j2;
                      ++ nebrTabLen[ip];
#else
                      nebrTab[2 * nebrTabLen] = j1;
                      nebrTab[2 * nebrTabLen + 1] = j2;
                      ++ nebrTabLen;
#endif
                    }
                  }
                }
              }
            }
          }
#if SMP
        }
#endif
      }
#if NDIM == 3
    }
#endif
#if SMP
  }
#endif
/* end parallel section */
}



#define WrapR(v, t) if (v.t >= 0.5 * region.t) v.t -= region.t; \
   else if (v.t < - 0.5 * region.t) v.t += region.t
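/* WrapR folds one coordinate back into the periodic box
   (nearest-image convention) */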


/* function should be automatically parallelized - 
   largest (> 60%) cpu user, need to work on local variables */
void ComputeForces ()
{
  VecD dr;
  double fcVal, rr, rrCut, rri, rri3, uVal;
  int j1, j2, n;
#if SMP
  double uSumP[MAX_PROC];
  int ip;
#endif

  rrCut = SQ (rCut);
/* start parallel section */
#if SMP
  for (ip = 0; ip < nProc; ip ++) {
    for (n = 0; n < nAtom; n ++) VZero (raP[ip][n]);
    uSumP[ip] = 0.;
#else
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
#endif
#if SMP
    for (n = 0; n < nebrTabLen[ip]; n ++) {
      j1 = nebrTab[ip][2 * n];
      j2 = nebrTab[ip][2 * n + 1];
#else
    for (n = 0; n < nebrTabLen; n ++) {
      j1 = nebrTab[2 * n];
      j2 = nebrTab[2 * n + 1];
#endif
      VSub (dr, r[j1], r[j2]);
      WrapR (dr, x);
      WrapR (dr, y);
#if NDIM == 3
      WrapR (dr, z);
#endif
      rr = VLenSq (dr);
      if (rr < rrCut) {
        rri = 1. / rr;
        rri3 = rri * rri * rri;
        fcVal = 48. * rri3 * (rri3 - 0.5) * rri;
        uVal = 4. * rri3 * (rri3 - 1.) + 1.;
#if SMP
        VSAdd (raP[ip][j1], raP[ip][j1], fcVal, dr);
        VSAdd (raP[ip][j2], raP[ip][j2], - fcVal, dr);
        uSumP[ip] += uVal;
#else
        VSAdd (ra[j1], ra[j1], fcVal, dr);
        VSAdd (ra[j2], ra[j2], - fcVal, dr);
        uSum += uVal;
#endif
      }
    }
#if SMP
  }
#endif
/* end parallel section */

#if SMP
/* two alternative ways of combining partial results (first is new) */
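/* each thread accumulated into its private raP[ip] to avoid write
   conflicts on the shared ra[]; the partial forces and the partial
   energies uSumP[] must now be combined */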
#if 1
/* barrier here */
/* start parallel section */
  for (n = 0; n < nAtom; n ++) {
    ra[n] = raP[0][n];
    for (ip = 1; ip < nProc; ip ++) VAdd (ra[n], ra[n], raP[ip][n]);
  }
  uSum = 0.;
  for (ip = 0; ip < nProc; ip ++) uSum += uSumP[ip];
/* end parallel section */
#else
/* start serial section */
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
  for (ip = 0; ip < nProc; ip ++) {
    for (n = 0; n < nAtom; n ++) VAdd (ra[n], ra[n], raP[ip][n]);
    uSum += uSumP[ip];
  }
/* end serial section */
#endif
#endif
}


/* function should be automatically parallelized */
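/* leapfrog update, h = deltaT:
   v(t+h/2) = v(t-h/2) + h a(t), then r(t+h) = r(t) + h v(t+h/2) */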
void LeapfrogIntegrate ()
{
  int n;

  for (n = 0; n < nAtom; n ++) {
    VSAdd (rv[n], rv[n], deltaT, ra[n]);
    VSAdd (r[n], r[n], deltaT, rv[n]);
  }
}



/* function should be automatically parallelized */
void CheckPeriodicWrap ()
{
  int n;

  for (n = 0; n < nAtom; n ++) {
    WrapR (r[n], x);
    WrapR (r[n], y);
#if NDIM == 3
    WrapR (r[n], z);
#endif
  }
}



/* function should be automatically parallelized -
   but note reduction ops */
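/* vvSum is a sum reduction and vvMax a max reduction; an
   auto-parallelizer must recognize both to split the loop */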
void CalcEnergy ()
{
  VecD v;
  double vv, vvSum;
  int n;

  vvSum = 0.;
  vvMax = 0.;
  for (n = 0; n < nAtom; n ++) {
    VSAdd (v, rv[n], -0.5 * deltaT, ra[n]);
    vv = VLenSq (v);
    vvSum += vv;
    if (vv > vvMax) vvMax = vv;
  }
  kinEnergy = 0.5 * vvSum / nAtom;
  potEnergy = uSum / nAtom;
}



/* function only called once (not part of timing) - do not parallelize */
void InitialState ()
{
  VecD p, gap, vSum; 
  double vMag;
  int j, n, nx, ny, nz;

  VDiv (gap, region, initUcell);
  n = 0;
#if NDIM == 3
  for (nz = 0; nz < initUcell.z; nz ++) {
#endif
    for (ny = 0; ny < initUcell.y; ny ++) {
      for (nx = 0; nx < initUcell.x; nx ++) {
#if NDIM == 2
        VSet (p, nx + 0.25, ny + 0.25);
#elif NDIM == 3
        VSet (p, nx + 0.25, ny + 0.25, nz + 0.25);
#endif
        VMul (p, p, gap);
        VSAdd (p, p, -0.5, region);
#if NDIM == 2
        for (j = 0; j < 2; j ++) {
          r[n] = p;
          if (j != 0) VSAdd (r[n], r[n], 0.5, gap);
          ++ n;
        }
#elif NDIM == 3
        for (j = 0; j < 4; j ++) {
          r[n] = p;
          if (j != 3) {
            if (j != 0) r[n].x += 0.5 * gap.x;
            if (j != 1) r[n].y += 0.5 * gap.y;
            if (j != 2) r[n].z += 0.5 * gap.z;
          }
          ++ n;
        }
#endif
      }
    }
#if NDIM == 3
  }
#endif
  vMag = sqrt (NDIM * temperature);
  VZero (vSum);
  for (n = 0; n < nAtom; n ++) {
    VRand (&rv[n]);
    VScale (rv[n], vMag);
    VAdd (vSum, vSum, rv[n]);
  }
  for (n = 0; n < nAtom; n ++) VSAdd (rv[n], rv[n], -1. / nAtom, vSum);
}



#if NDIM == 2
void VRand (VecD2 *p)
{
  double s;

  s = 2. * M_PI * drand48 ();
  p->x = cos (s);  p->y = sin (s);
}
#elif NDIM == 3
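/* rejection method: generates a uniformly distributed random
   direction on the unit sphere */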
void VRand (VecD3 *p)
{
  double s, x, y;

  s = 2.;
  while (s > 1.) {
    x = 2. * drand48 () - 1.;  y = 2. * drand48 () - 1.;
    s = SQ (x) + SQ (y);
  }
  p->z = 1. - 2. * s;
  s = 2. * sqrt (1. - s);
  p->x = s * x;  p->y = s * y;    
}
#endif



/* not used */
#if 0
void ComputeForces ()
{
  VecD dr, invWid, rs, shift;
  VecI cc, m1v, m2v, iof[] = OFFSET_VALS;
  double fcVal, rr, rrCut, rri, rri3, uVal;
  int c, j1, j2, m1, m1x, m1y, m1z, m2, n, offset;

  rrCut = SQ (rCut);
  for (n = nAtom; n < nAtom + VProd (cells); n ++) cellList[n] = -1;
  VDiv (invWid, cells, region);
  for (n = 0; n < nAtom; n ++) {
    VSAdd (rs, r[n], 0.5, region);
    VMul (cc, rs, invWid);
    c = VLinear (cc, cells) + nAtom;
    cellList[n] = cellList[c];
    cellList[c] = n;
  }
  for (n = 0; n < nAtom; n ++) VZero (ra[n]);
  uSum = 0.;
#if NDIM == 3
  for (m1z = 0; m1z < cells.z; m1z ++) {
#endif
    for (m1y = 0; m1y < cells.y; m1y ++) {
      for (m1x = 0; m1x < cells.x; m1x ++) {
#if NDIM == 2
        VSet (m1v, m1x, m1y);
#elif NDIM == 3
        VSet (m1v, m1x, m1y, m1z);
#endif
        m1 = VLinear (m1v, cells) + nAtom;
        DO_OFFSET {
          VAdd (m2v, m1v, iof[offset]);
          VZero (shift);
          WrapC (x);  
          WrapC (y);
#if NDIM == 3
          WrapC (z);
#endif
          m2 = VLinear (m2v, cells) + nAtom;
          for (j1 = cellList[m1]; j1 >= 0; j1 = cellList[j1]) {
            for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
              if (m1 != m2 || j2 < j1) {
                VSub (dr, r[j1], r[j2]);
                VSub (dr, dr, shift);
                rr = VLenSq (dr);
                if (rr < rrCut) {
                  rri = 1. / rr;
                  rri3 = rri * rri * rri;
                  fcVal = 48. * rri3 * (rri3 - 0.5) * rri;
                  uVal = 4. * rri3 * (rri3 - 1.) + 1.;
                  VSAdd (ra[j1], ra[j1], fcVal, dr);
                  VSAdd (ra[j2], ra[j2], - fcVal, dr);
                  uSum += uVal;
                }
              }
            }
          }
        }
      }
    }
#if NDIM == 3
  }
#endif
}
#endif

/* end of program */


365.2. "More info: Your help is still needed" by TAV02::GLASS (Yossi Glass, 882-3254) Wed Feb 12 1997 09:50

Following is a summary of my efforts to ||ize this program. In a few words:
- The customer created a program which has #ifdefs for running it on one
  CPU, and for SMP.
- Using "kcc -ckapargs='-conc'" with -DSMP, I didn't get any speedup when
  running on 4 CPUs (compared to the runtime on one CPU).
- Using "kcc -ckapargs='-conc'" but without -DSMP (which means that the
  "serial code" is ||ized by KAP itself), I got a 25% speedup on
  2 CPUs and a 50% speedup on 4 CPUs.

You can help me with the following:
- Finding out why we didn't get a speedup with -DSMP
- Finding out how to get a better speedup, with and without -DSMP

The modified code, on which I did the ||ization tests, is on the anonymous
ftp account of tavosf.iso.dec.com, under directory pub; the file is
dennis_parallel.tar.

I don't remember if I said it already: benchmark results will decide whether
the customer purchases an Origin 2000 (4 CPUs) or an AlphaServer 4100 (4 CPUs).
This customer used to be an SGI customer, until he purchased Alpha systems
2.5 years ago.

More details:
============
Profiling info
--------------
When running the program without optimization on a single CPU, the
following pieces of code consume most of the runtime:
1. In ComputeForces, the loop starting with:
        for (n = 0; n < nebrTabLen; n ++)
   takes ~55% of the total time
2. In BuildNebrList, the loop starting with:
        for (j2 = cellList[m2]; j2 >= 0; j2 = cellList[j2]) {
   takes ~15% of the total time 
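
   (To reproduce the profile, a typical session with the standard Unix
   prof tool -- Digital UNIX also has pixie and uprofile -- would be
   something like:
        cc -p -o md-smp md-smp.c -lm    # compile with profiling support
        ./md-smp 24                     # the run writes mon.out
        prof md-smp                     # per-routine time breakdown
   Exact options may differ on your system.)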

What I did
----------
1. I tried to use KAP on the original code, but didn't get ||ization,
   mostly because KAP assumes that there are dependencies. Therefore
   I created two new modules: compute.c (ComputeForces) and build.c
   (BuildNebrList). Having several modules, I had to create a
   separate module for globals (globals.c), and files for #defines and
   typedefs (mdsmp.h) and for externals (externals.h). And yes, I called
   the main module 'mdsmp.c'. The layout is sketched below.
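
   A minimal sketch of the split (from memory -- the real files are in
   dennis_parallel.tar):

      /* mdsmp.h -- typedefs and macros (VecD, VecI, SQ, WrapR, ...) */

      /* globals.c -- the single definition of the shared state */
      #include "mdsmp.h"
      VecD *r, *rv, *ra, region;
      double rCut, rNebrShell, uSum;
      int *nebrTab, nebrTabLen, nAtom;
      /* ...plus the rest of the globals from .1 */

      /* externals.h -- matching extern declarations */
      extern VecD *r, *rv, *ra, region;
      extern double rCut, rNebrShell, uSum;
      extern int *nebrTab, nebrTabLen, nAtom;

      /* compute.c -- ComputeForces() only, compiled with
         kcc -ckapargs='-conc -arl=4 -mc=0' -c compute.c */
      #include "mdsmp.h"
      #include "externals.h"
      void ComputeForces () { /* body exactly as in .1 */ }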

2. I compiled the code and ran it several times for both 2-D and 3-D. I
   will talk about the 3-D results (the 2-D results behave more or
   less the same):
   - First I compiled the code without -DSMP, using plain 'cc' and
     then 'kcc' without the -conc option (to see the speedup from
     scalar optimizations alone). I got:
        3.95 musec time/atom-step for the compile using 'cc'
        3.72 musec time/atom-step for the compile using 'kcc'
   - Then I compiled the code with -DSMP, using
        kcc -ckapargs='-conc -arl=4 -mc=0' for build.c and compute.c.
     (-arl=4 tells KAP not to worry about dependencies;
      -mc=0 tells it to ||ize regardless of the operations in the loop.)
     I got 3.97 musec time/atom-step. No speedup, although KAP generates
     the calls to its thread library.
   - I suspected that the reason was some kind of cache problem
     (there are tools to test this, but I haven't used them yet).
     Therefore I tried kcc with '-conc' but without -DSMP, and got:
        for 2 CPUs: 2.98 musec time/atom-step
        for 4 CPUs: 2.41 musec time/atom-step

3. This is what I know so far. It seems that KAP does better when
   you let it ||ize the original code (it doesn't like the idea that
   somebody thinks he can ||ize the code better than it can).

		---------------------