
  Mark,

   We have determined the reasons for your low performance numbers. There are two 
main causes.

1) You ran the code on the SP1 without the mpirun option -nopoll. The IBM
   switch has two modes, one where essentially an interupt is generated upon
   receiving a message and one where the OS polls the switch to check for 
   messages. The -nopoll option  always seems to generate smaller times and
   more consistency between runs on the SP1. (On the SP2, on the other hand,
   it is faster not to use the nopoll option :-()
2) There is an interesting effect on all these machines due to the paging-
   in of the executable. As you know in Unix, when you start a program running
   the entire executable is not loaded into physical memory, rather when a
   new routine is called, the pages that contain the routine are loaded into 
   memory the first time this routine is executed. When timing routines that 
   take seconds or fractions of seconds, this IO time is extremely significant.
   The problem gets dramatically worse when, say 32, compute nodes all request
   some pages of the executable from the same file server. One essentially gets
   meaningless timing results. We have a strategy to deal with this we call
   "preloading of the executable" essentially we first run the EXACT code that
   we will run on the large problem on a smaller set of data, then we run it on 
   the problem of interest. There is a section of the users manual where we
   discuss this. I also include below the program we used to generate the timings.

  A few notes:
* The PowerChallenge has a 2 megabyte external cache and the Origin2000
  has a 4 megabyte external cache. This explains the super-linear speed-up you see 
  below on those machines.
* Note the much better performance with the -nopoll option on the SP1
* Our SP2 machine is running, the only thing missing is a scheduler, which is 
  why external users (like you) don't have accounts yet. I've asked support to 
  give you an account ASAP, I hope you can get the account Tuesday to confirm these
  numbers.
* Once we get additional memory cards in our SP2 (in a few months) we'll see another
  jump of 20 to 30 percent (we hope :-)).
* The 16 processor performance with BAIJ on the SP2, T3E, and SGI machines are 
  remarkably similar!
* You can also find some efficiency numbers for a much larger problem, 1.1 million unknows,
  (matrix block size 5) in our paper at ftp://info.mcs.anl.gov/pub/petsc/scitools96.ps.gz
* Note also, due to PETSc's extensibility, if any particular numerical kernel is not
  efficient enough, you are free to drop in a new kernel routine, while retaining 
  other routines in the library that are not performance critical. Thus you do not
  have to write an entire library from scratch, only those routines that you can 
  do a better job on then we have.

  If you have any additional questions or concerns please let us know.

  Barry and Satish


       Flop rates with effiency given in () for MatMult() on your matrix
  using the "natural" partitioning.

-------------------------------------------------------------------------------------
Machine   Quad               Bonnie         Mcurie         Sif             Modi4
ARCH      (SP2)              (SP1)           (T3E)   (PowerChallenge)   (Origin2000)
-nopoll:                   YES     NO
-------------------------------------------------------------------------------------
***********Using BAIJ format with BlockSize 3 ********************
 1        87            35         35          81             26          56
 2       161(.93)     65(.93)    37(.52)   154(.95)      50(.96)        94(.84)
 4       300(.86)    120(.86)    68(.49)   295(.91)      96(.93)       205(.92)
 8       548(.79)    218(.78)   134(.48)   559(.86)     413(1.99)      665(1.49)
16       920(.66)    368(.66)   274(.49)  1014(.78)      NA           1028(1.15)
***********Using AIJ format with Inodes *************************
 1        78            30                     64            26           26
 2       146(.93)     55(.92)              121(.95)     48(.93)       57(1.10)
 4       278(.89)    102(.85)              244(.95)     93(.88)      197(1.90)
 8       518(.83)    186(.78)              419(.82)    248(1.20)     440(2.12)
16       835(.67)    317(.66)              724(.71)     NA          1075(2.58)





#ifndef lint
static char vcid[] = "$Id: ex10.c,v 1.15 1997/01/22 18:44:10 bsmith Exp $";
#endif

static char help[] = 
"Reads a PETSc matrix and vector from a file and solves a linear system.\n\
This version first preloads and solves a small system, then loads \n\
another (larger) system and solves it as well.  This example illustrates\n\
preloading of instructions with the smaller system so that more accurate\n\
performance monitoring can be done with the larger one (that actually\n\
is the system of interest).  See the 'Performance Hints' chapter of the\n\
users manual for a discussion of preloading.  Input parameters include\n\
  -f0 <input_file> : first file to load (small system)\n\
  -f1 <input_file> : second file to load (larger system)\n\n";


/* 
  Include "sles.h" so that we can use SLES solvers.  Note that this file
  automatically includes:
     petsc.h  - base PETSc routines   vec.h - vectors
     sys.h    - system routines       mat.h - matrices
     is.h     - index sets            ksp.h - Krylov subspace methods
     viewer.h - viewers               pc.h  - preconditioners
*/
#include "sles.h"
#include <stdio.h>

int main(int argc,char **args)
{
  MatType    mtype;            /* matrix format */
  Mat        A;                /* matrix */
  Vec        b, u;             /* approx solution, RHS, exact solution */
  Viewer     fd;               /* viewer */
  char       file[2][128];     /* input file name */
  char       stagename[6][16]; /* names of profiling stages */
  int        j,ierr, set, flg, i,loops  = 2;
  Scalar     none = -1.0;
  int        m,n;
  PetscInitialize(&argc,&args,(char *)0,help);


#if defined(PETSC_COMPLEX)
  SETERRA(1,0,"This example does not work with complex numbers");
#else

  /* 
     Determine files from which we read the two linear systems
     (matrix and right-hand-side vector).
  */
  ierr = OptionsGetString(NULL,"-f0",file[0],127,&flg); CHKERRA(ierr);
  if (!flg) SETERRA(1,0,"Must indicate binary file with the -f0 option");
  ierr = OptionsGetString(NULL,"-f1",file[1],127,&flg); CHKERRA(ierr);
  if (!flg) {loops = 1;} /* don't bother with second system */

  /* -----------------------------------------------------------
                  Beginning of linear solver loop
     ----------------------------------------------------------- */
  /* 
     Loop through the linear solve 2 times.  
      - The intention here is to preload and solve a small system;
        then load another (larger) system and solve it as well.
        This process preloads the instructions with the smaller
        system so that more accurate performance monitoring (via
        -log_summary) can be done with the larger one (that actually
        is the system of interest). 
  */
  for ( i=0; i<loops; i++ ) {

    /* - - - - - - - - - - - New Stage - - - - - - - - - - - - -
                           Load system i
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    /*
       Begin profiling next stage
    */
    PLogStagePush(2*i);
    sprintf(stagename[2*i],"Load System %d",i);
    PLogStageRegister(2*i,stagename[2*i]);

    /* 
       Open binary file.  Note that we use BINARY_RDONLY to indicate
       reading from this file.
    */
    ierr = ViewerFileOpenBinary(MPI_COMM_WORLD,file[i],BINARY_RDONLY,&fd);
           CHKERRA(ierr);

    /* 
       Determine matrix format to be used (specified at runtime).
       See the manpage for MatLoad() for available formats.
    */
    ierr = MatGetTypeFromOptions(MPI_COMM_WORLD,0,&mtype,&set); CHKERRQ(ierr);

    /*
       Load the matrix and vector; then destroy the viewer.
    */
    ierr = MatLoad(fd,mtype,&A); CHKERRA(ierr);
    /* ierr = VecLoad(fd,&b); CHKERRA(ierr); */
    ierr = ViewerDestroy(fd); CHKERRA(ierr);
    
    /*
      Create a Vector 
    */
    ierr = MatGetLocalSize(A,&m,&n); CHKERRA(ierr);
    ierr = VecCreateMPI(MPI_COMM_WORLD,m,PETSC_DECIDE,&b);
    VecSet(&none,b);
    VecDuplicate(b,&u);

    PLogStagePop();
    PLogStagePush(2*i+1);
    sprintf(stagename[2*i+1],"MatMult %d",i);
    PLogStageRegister(2*i+1,stagename[2*i+1]);

    for ( j=0; j< 20; j++ ) {
      ierr = MatMult(A,b,u);
    }
    PLogStagePop();

    ierr = MatDestroy(A); CHKERRA(ierr); ierr = VecDestroy(b); CHKERRA(ierr);
    ierr = VecDestroy(u); CHKERRA(ierr); 
  }
  /* -----------------------------------------------------------
                      End of linear solver loop
     ----------------------------------------------------------- */

  PetscFinalize();
#endif
  return 0;
}



