 The following documents were written by Kazushige Goto.  If you find
 mistakes or have suggestions, please let me know.

 If you're interested in the Alpha architecture, please refer to the
 21164 Hardware Reference Manual.

 Description:

   The techniques used in my routine are:

    1. Blocking algorithm.  This is the basic theory.

    2. To avoid cache congruence (conflict) problems, the routine
       first copies each matrix block into a small matrix (a 2nd-cache
       image), then calculates on these small matrices.  This approach
       is not the fastest (especially for small matrices), but it
       calculates at a constant speed: whether the matrices are large
       or small, it shows the same calculating speed.
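       The copy-then-compute idea can be sketched in C as follows (a
       minimal illustration only; the names sa/sb and the tiny block
       size are mine, not the routine's actual parameters):

```c
#define N  8     /* full matrix size (tiny, for illustration)   */
#define BS 4     /* block size: the "small matrix" cache image   */

/* C += A * B, computed block by block: each block of A and B is
 * first copied into the small contiguous arrays sa and sb, and the
 * multiplication then runs only on those copies, so rows of the big
 * matrices cannot conflict with each other in the cache. */
static void blocked_gemm(double A[N][N], double B[N][N], double C[N][N])
{
    static double sa[BS][BS], sb[BS][BS];   /* 2nd-cache images */

    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS) {
                /* copy one block of A and one block of B */
                for (int i = 0; i < BS; i++)
                    for (int k = 0; k < BS; k++)
                        sa[i][k] = A[ib + i][kb + k];
                for (int k = 0; k < BS; k++)
                    for (int j = 0; j < BS; j++)
                        sb[k][j] = B[kb + k][jb + j];
                /* compute on the small, contiguous copies */
                for (int i = 0; i < BS; i++)
                    for (int j = 0; j < BS; j++) {
                        double sum = 0.0;
                        for (int k = 0; k < BS; k++)
                            sum += sa[i][k] * sb[k][j];
                        C[ib + i][jb + j] += sum;
                    }
            }
}
```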

    3. The floating-point latency is 4 clocks.
       The matrix-matrix multiply routine requires "multiply and add".
       If you program it naively, each pair takes 8 clocks.  To avoid
       this, I use a software-pipelining technique.
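       The idea can be illustrated in C (a sketch only; the real
       routine does this in assembly, with deeper pipelining spread
       across the unrolled partial sums):

```c
#include <stddef.h>

/* Dot product with a one-stage software pipeline: the multiply for
 * iteration i runs while the add for iteration i-1 retires, so the
 * 4-clock multiply latency and the 4-clock add latency overlap
 * instead of running back-to-back (8 clocks per element). */
static double dot_pipelined(const double *a, const double *b, size_t n)
{
    if (n == 0)
        return 0.0;

    double sum  = 0.0;
    double prod = a[0] * b[0];          /* prologue: first multiply */

    for (size_t i = 1; i < n; i++) {
        double next = a[i] * b[i];      /* multiply for iteration i  */
        sum += prod;                    /* add for iteration i-1     */
        prod = next;
    }
    sum += prod;                        /* epilogue: last add        */
    return sum;
}
```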

    4. Taking the MAF (Miss Address File) into consideration.
       The 21164(A) has six MAF entries.  Each entry has a 32-byte
       buffer (4 double-precision floating-point values).  When we
       access data in the 2nd cache, we cannot have requests for more
       than 6 cache lines outstanding at once.
       For example,
                                    note: this program is only an example.
	for(k=0; k<x; k++){
	  for(j=0; j<y; j++){
	    double sum = 0.0;
	    for(i=0; i<z; i++)
	      sum += a[j][i]*b[k][i];
	    c[k][j] = sum;
	  }
	}

       At a glance, this looks like an appropriate access pattern,
       because memory is accessed in order.  If you are writing a fast
       inner-product routine, it may be a good algorithm.  But for a
       matrix-matrix multiply routine on the 21164(A), it is not the
       best algorithm.

       Let me explain. 

       First, unrolling over j,

        for(k=0; k<x; k++){
          for(j=0; j<y; j+=4){
            double sum1 = 0.0, sum2 = 0.0;
            double sum3 = 0.0, sum4 = 0.0;
            for(i=0; i<z; i++){
                 sum1 += a[j+0][i]*b[k][i];
                 sum2 += a[j+1][i]*b[k][i];
                 sum3 += a[j+2][i]*b[k][i];
                 sum4 += a[j+3][i]*b[k][i];
            }
            c[k][j+0] = sum1; c[k][j+1] = sum2;
            c[k][j+2] = sum3; c[k][j+3] = sum4;
          }
        }

       Second, unrolling over k,

        for(k=0; k<x; k+=2){
          for(j=0; j<y; j+=4){
            double sum1 = 0.0, sum2 = 0.0;
            double sum3 = 0.0, sum4 = 0.0;
            double sum5 = 0.0, sum6 = 0.0;
            double sum7 = 0.0, sum8 = 0.0;
            for(i=0; i<z; i++){
                 sum1 += a[j+0][i]*b[k+0][i];
                 sum2 += a[j+1][i]*b[k+0][i];
                 sum3 += a[j+2][i]*b[k+0][i];
                 sum4 += a[j+3][i]*b[k+0][i];

                 sum5 += a[j+0][i]*b[k+1][i];
                 sum6 += a[j+1][i]*b[k+1][i];
                 sum7 += a[j+2][i]*b[k+1][i];
                 sum8 += a[j+3][i]*b[k+1][i];
            }
            c[k+0][j+0] = sum1; c[k+0][j+1] = sum2;
            c[k+0][j+2] = sum3; c[k+0][j+3] = sum4;
            c[k+1][j+0] = sum5; c[k+1][j+1] = sum6;
            c[k+1][j+2] = sum7; c[k+1][j+3] = sum8;
          }
        }

       Please pay attention to the addressing of a and b.  Matrix a
       requires 4 cache lines and matrix b requires 2 lines (six lines
       in total).  The 21164 has six MAF entries, so they saturate
       easily.  The calculating ability is 1 Flop/Hz (600 MFlops) at
       most.

       So, we must use a different algorithm.

	for(k=0; k<x; k++){
	  for(j=0; j<y; j++){
	    double sum = 0.0;
	    for(i=0; i<z; i++)
	      sum += a[i][j]*b[k][i];
	    c[k][j] = sum;
	  }
	}

       I swapped the rows and columns of matrix a (this case is the
       Non-Transposed x Non-Transposed routine for FORTRAN).

       First, unrolling over j,

        for(k=0; k<x; k++){
          for(j=0; j<y; j+=4){
            double sum1 = 0.0, sum2 = 0.0;
            double sum3 = 0.0, sum4 = 0.0;
            for(i=0; i<z; i++){
                 sum1 += a[i][j+0]*b[k][i];
                 sum2 += a[i][j+1]*b[k][i];
                 sum3 += a[i][j+2]*b[k][i];
                 sum4 += a[i][j+3]*b[k][i];
            }
            c[k][j+0] = sum1; c[k][j+1] = sum2;
            c[k][j+2] = sum3; c[k][j+3] = sum4;
          }
        }

       Second, unrolling over k,

        for(k=0; k<x; k+=2){
          for(j=0; j<y; j+=4){
            double sum1 = 0.0, sum2 = 0.0;
            double sum3 = 0.0, sum4 = 0.0;
            double sum5 = 0.0, sum6 = 0.0;
            double sum7 = 0.0, sum8 = 0.0;
            for(i=0; i<z; i++){
                 sum1 += a[i][j+0]*b[k+0][i];
                 sum2 += a[i][j+1]*b[k+0][i];
                 sum3 += a[i][j+2]*b[k+0][i];
                 sum4 += a[i][j+3]*b[k+0][i];

                 sum5 += a[i][j+0]*b[k+1][i];
                 sum6 += a[i][j+1]*b[k+1][i];
                 sum7 += a[i][j+2]*b[k+1][i];
                 sum8 += a[i][j+3]*b[k+1][i];
            }
            c[k+0][j+0] = sum1; c[k+0][j+1] = sum2;
            c[k+0][j+2] = sum3; c[k+0][j+3] = sum4;
            c[k+1][j+0] = sum5; c[k+1][j+1] = sum6;
            c[k+1][j+2] = sum7; c[k+1][j+3] = sum8;
          }
        }

       In this case, the routine requires only 3 lines (a requires 1
       line, and b requires 2 lines).  The data is transferred 32
       bytes at a time, so there is no saturation, and we are also
       able to unroll over i.
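       Since only 3 of the 6 MAF entries are in use, the i direction
       can be unrolled too.  A C sketch of the resulting kernel (the
       flat row-major layout, the function name, and the assumption
       that z is even are mine, for illustration):

```c
/* 4(j) x 2(k) kernel with the i loop unrolled by 2.  a is stored
 * transposed (z rows of y doubles), b is x rows of z, c is x rows
 * of y, all row-major.  a[i][j..j+3] needs one cache line per i
 * step and b[k..k+1][i] two lines, so even the unrolled loop stays
 * within the six MAF entries.  y must divide by 4, x and z by 2. */
static void kernel_4x2(int x, int y, int z,
                       const double *a, const double *b, double *c)
{
    for (int k = 0; k < x; k += 2)
        for (int j = 0; j < y; j += 4) {
            double s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* row k   */
            double s4 = 0, s5 = 0, s6 = 0, s7 = 0;   /* row k+1 */
            for (int i = 0; i < z; i += 2) {
                const double *a0 = a + i * y + j;    /* a[i][j..]   */
                const double *a1 = a0 + y;           /* a[i+1][j..] */
                double b00 = b[k * z + i],       b01 = b[k * z + i + 1];
                double b10 = b[(k + 1) * z + i], b11 = b[(k + 1) * z + i + 1];
                s0 += a0[0] * b00 + a1[0] * b01;
                s1 += a0[1] * b00 + a1[1] * b01;
                s2 += a0[2] * b00 + a1[2] * b01;
                s3 += a0[3] * b00 + a1[3] * b01;
                s4 += a0[0] * b10 + a1[0] * b11;
                s5 += a0[1] * b10 + a1[1] * b11;
                s6 += a0[2] * b10 + a1[2] * b11;
                s7 += a0[3] * b10 + a1[3] * b11;
            }
            double *c0 = c + k * y + j, *c1 = c0 + y;
            c0[0] = s0; c0[1] = s1; c0[2] = s2; c0[3] = s3;
            c1[0] = s4; c1[1] = s5; c1[2] = s6; c1[3] = s7;
        }
}
```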

       note)

       This algorithm was used in the earlier routine, but I have
       since changed the transposed routine to access the matrix
       contiguously.  The transposed routine is pretty complex, but it
       has become significantly faster than before.


    5. Instruction alignment.
       The 21164(A) operates on a small block called a "slot", which
       consists of 4 instructions on a 16-byte alignment (4 x 4-byte
       instructions) and can issue in one clock.  But there are some
       limitations on the combination of instructions.

       Among the 4 instructions, the 21164(A) can issue 2 integer
       instructions and 2 floating-point instructions at once.  The
       integer combinations are pretty flexible, but the floating-point
       combinations are limited: the 21164(A) can only issue an
       Add-Multiply pair, so Add-Add or Multiply-Multiply cannot issue
       in one clock; it takes 2 clocks.  The only exception is data
       movement (e.g. cpys), which can issue 2 instructions in one
       clock.

    6. Avoiding the use of "nop" and "fnop".
       The 21164(A)'s nop (No OPeration) and fnop (floating No
       OPeration) are very convenient instructions, but they are dummy
       instructions.  Actually, the "nop" instruction is encoded as a
       "shift" or "addq" instruction using $31 (the integer zero
       register), and the "fnop" instruction is encoded as a "cpys"
       instruction using $f31 (the floating-point zero register).

       Therefore, "nop" and "fnop" instructions occupy the pipeline.
       Even if the CPU is busy, it cannot skip these instructions.

       However, there is a convenient non-operating instruction called
       "unop".  This instruction is also a dummy instruction
       (ldq_u $31, 0($31)), but it does not send anything down the
       pipeline.  If the CPU is busy, it skips this instruction;
       otherwise it can still issue 4 instructions in one clock.

       Normally, the GNU assembler (as) implements ".align" with some
       "nop" instructions, but those take one or two clocks.  I
       programmed it to use "unop" instead.

    7. Predicting data flow and prefetching.
       Matrices A and B are entirely blocked into the stack matrices
       "sa" and "sb", which can be accessed with no wait.  But matrix
       C is not blocked, so we must move parts of C into the 2nd cache
       in advance.

       The 21164(A) also has convenient instructions for prefetching.
       Originally, the Alpha architecture defined prefetch
       instructions called "fetch" and "fetch_m", but they are not
       fast (rather slow, in fact), and it is doubtful that they are
       really implemented (of course, Alpha can issue these
       instructions, but they have no effect on speed).

       Instead, the 21164(A) has two useful prefetch instructions:
       "ldq $31, 0($??)" and "ldt $f31, 0($??)".  There is no
       difference between them in terms of moving data into the
       cache, except that the "ldq" prefetch forces data into the
       8 kB first-level cache, while the "ldt" prefetch stops at the
       96 kB second-level cache instead of the first-level cache, or
       is dropped if the memory bus is busy (see the appendix of the
       Alpha Architecture Handbook).

       Unlike the "fetch" instruction, these "ldq" and "ldt"
       prefetches move only 64 bytes (the "fetch" instruction can move
       512 bytes at once).  So we must use them in loops with care:
       the innermost loop is so busy that the CPU cannot move data
       smoothly if an unexpected prefetch instruction is issued.

       note) These "ldq" and "ldt" prefetch instructions are treated
             as "nop" on the 21066 and 21064 (EV4 architecture), so
             we do not need to take the architecture difference into
             consideration.
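       These prefetch idioms are Alpha assembly; in C on GCC or
       Clang, the closest portable analogue is __builtin_prefetch,
       which is likewise dropped harmlessly on CPUs without a
       prefetch instruction.  A sketch (the 8-element distance, one
       64-byte line of doubles, is an illustrative choice):

```c
#include <stddef.h>

/* Sum an array while prefetching one cache line (64 bytes = 8
 * doubles) ahead.  Like the ldq/ldt idioms with a zero register,
 * __builtin_prefetch is only a hint and never changes the result. */
static double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 3 /* keep */);
        sum += a[i];
    }
    return sum;
}
```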

    8. Cache alignment.
       Alpha's stack is very simple (this is not so much an
       architectural feature as a convention of Digital UNIX), so we
       can get as much temporary working area as we need.

       ------------
       Usually this size is limited by the kernel, and the default
       value is maybe 2 MB.  My routine uses about 400 kB at maximum
       (depending on the situation, 136 kB at minimum).  If you do not
       have enough stack area, please change the stack limit with the
       ulimit (bash) or limit (csh) command.
       ------------

       The important thing is that the key working data placed in
       this stack area stands for the 2nd-cache image, so the area
       should be aligned for the first and second caches.  According
       to the Alpha Architecture Handbook, it should be aligned to
       8 kB.  But that causes "unstable" results, meaning different
       speeds on each run (slow, fast, slow, fast...).  I tested
       several values until the results were "stable".

       Thus, I found that this image should be aligned to 256 kB.
       Surprisingly, if the image is aligned to 256 kB, it shows a
       very consistent speed on each run, and furthermore it is 10 to
       20 MFlops faster than an unaligned image (maybe it reduces DTB
       misses).
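       A C sketch of obtaining this alignment (the over-allocate-and-
       round technique shown here is mine for illustration; the
       routine itself aligns its stack working area):

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN (256 * 1024)   /* 256 kB, the alignment found above */

/* Round a raw buffer pointer up to the next 256 kB boundary.  To
 * guarantee an aligned region of `size` bytes, the caller must
 * over-allocate by ALIGN - 1 extra bytes. */
static void *align_256k(void *raw)
{
    uintptr_t p = (uintptr_t)raw;
    return (void *)((p + (ALIGN - 1)) & ~(uintptr_t)(ALIGN - 1));
}
```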
       
       The following is a detailed explanation.


 P, Q, R:
   Internal BLOCK sizes (2nd-cache image).  The default values are
   P=24, Q=200, R=200 (dgemm).

   If you have another Alpha CPU, please modify these.  I strongly
   recommend that the values be multiples of 8.

   If you find better parameters, please let me know.


 *** cache align map ***

 sizeof sa = Q*LDA = Q*P  =  37.5 kB
 sizeof sb = R*LDB = R*Q  = 312   kB

At first, I designed this stack area as an image of the 2nd cache,
but through a lot of benchmarking I found these parameters to be the
best.  I do not know why (maybe due to a TLB problem).

So it seems there is no room to store sa and sb entirely in the
second cache (of course, we must take matrix C into consideration,
but it is small enough that we can ignore it).

 In fact, the size of the sa matrix is not sensitive in the
Non-transposed * Non-transposed routine.  But in the Transposed
routine it significantly affects the calculating speed.  I tested
many combinations of sa and sb matrix sizes and determined these
parameters.

 I think this comes from my algorithm.  My routine's main target is
the internal second-level cache, and all data in the second-level
cache is accessed with no wait.  The matrix sb is not so busy, so
part of sb may fall out of the second-level cache and move to the
3rd-level cache.

 When we need data from the sb matrix, we can move it into the
second-level cache easily using the floating-point prefetch ($f31)
instruction (though we must wait about 27 clocks for the Bcache, the
3rd-level cache).  The first-level cache is very busy, so we use the
floating-point prefetch rather than the integer prefetch ($31), which
forces data into the first-level cache.

When we access the C matrix, we must load carefully, because my
routine's code waits only 4 clocks, which is enough to load from the
first-level cache but not from the second-level cache.  So we must
load into the first-level cache in advance; the integer prefetch
instruction is convenient here.  To move C's data into the
first-level cache, I used a 2-stage prefetch.  First, using the
floating-point prefetch instruction, the data moves into the
second-level cache.  Then, using the integer prefetch instruction,
the data moves into the first-level cache within 8 clocks.
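The 2-stage idea can be mimicked in portable C with
__builtin_prefetch (a sketch only: the distances are illustrative,
and the locality hints only approximate the ldt/ldq cache-level
targeting):

```c
#include <stddef.h>

#define DIST_FAR  64   /* elements: start the outer-cache fill early */
#define DIST_NEAR  8   /* elements: one line ahead of the actual use */

/* Update a row of C with a two-stage prefetch: a far prefetch
 * starts the move toward the outer cache well in advance (stage 1,
 * like the floating-point prefetch), and a near prefetch pulls the
 * line the rest of the way just before use (stage 2, like the
 * integer prefetch).  Both are hints and never change the result. */
static void scale_row(double *c, size_t n, double alpha)
{
    for (size_t i = 0; i < n; i++) {
        if (i + DIST_FAR < n)
            __builtin_prefetch(&c[i + DIST_FAR], 1, 1);  /* stage 1 */
        if (i + DIST_NEAR < n)
            __builtin_prefetch(&c[i + DIST_NEAR], 1, 3); /* stage 2 */
        c[i] *= alpha;
    }
}
```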


    9. To accomplish these techniques, this routine uses loop
       unrolling.

   10. Performance Counter

       The 21164 has special counters called "Performance Counters"
       (PMCTR).  These counters give us various useful information,
       such as cycles, cache misses, traps, and so on.

       Usually we cannot access these counters, because only PAL mode
       can control them.  I modified the PALcode and the Linux kernel
       to handle the performance counters.

       The following tables are PMCTR results (PAL, OS, USER) for a
       1000 x 1000 operation.  One pair is "Total information" and
       the other is "Most inner loop information".  If you want to
       know what happens in my routine, please check these (the PMCTR
       start and end points are marked "PMCTR").

       For example, in double precision the most inner loop takes
       1159021923 cycles while issuing 2000000000 FP operations, so
       it runs at about 1.72 Flops/cycle (1035 MFlops at 600 MHz!);
       in single precision, 1.80 Flops/cycle (1077 MFlops at 600 MHz;
       the theoretical value is 1.94 Flops/cycle, or 1164 MFlops).
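       These figures can be re-derived directly from the counter
       values in tables 3 and 4 below; a quick check in C:

```c
/* Flops/cycle is "FPOps issued" divided by "Cycles"; MFlops at
 * 600 MHz is that ratio times 600.  The counts come from the
 * "most inner loop" PMCTR tables (USER column). */
static double mflops_at_600(double fpops, double cycles)
{
    return fpops / cycles * 600.0;
}
```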


                 1. TOTAL PERFORMANCE(Double Precision)

     I t e m          < P A L >           < O S >            < U S E R>
Cycles              :24040635(  1.6)   88757986(  5.9)      1391413460( 92.5)
Instructions        :10054893(  0.3)    6574276(  0.2)      3064465624( 99.5)
nonissue cycles     :11805424(  4.4)   18244225(  6.9)       235788592( 88.7)
split-issue cycles  :   38174(  0.4)     404458(  4.5)         8470208( 95.0)
pipe-dry cycles     : 6977519(  6.5)    1387619(  1.3)        98169336( 92.1)
replay trap         :  317868(  2.5)    1557948( 12.4)        10712742( 85.1)
single-issue cycles : 2065946(  7.4)    1122060(  4.0)        24873089( 88.6)
dual-issue cycles   : 3986780(  1.2)     326087(  0.1)       326973648( 98.7)
triple-issue cycles :       0(  0.0)      16333(  0.0)       469064912(100.0)
quad-issue cycles   :       0(  0.0)      94461(  0.0)       245124815(100.0)
flow-change insns   : 1980269(  5.2)     324247(  0.8)        35842746( 94.0)
IntOps issued       :  345642(  0.3)     954533(  0.9)       110849333( 98.8)
FPOps issued        :       0(  0.0)         16(  0.0)      2037152000(100.0)
loads issued        :  663822(  0.1)     983167(  0.1)       861625000( 99.8)
stores issued       :   17378(  0.1)     281093(  1.5)        19000161( 98.5)
Icache issued       : 4947642(  0.5)    4875580(  0.5)       966920447( 99.0)
Dcache accesses     :  681028(  0.1)    3859971(  0.4)       880625360( 99.5)
S cache access      : 1755517(  0.5)    1200971(  0.3)       382611552( 99.2)
long(>15) stalls    :  130102(  4.8)     191422(  7.0)         2403319( 88.2)
PC-mispredicts      :   25070( 40.2)      37339( 59.8)               0(  0.0)
BR-mispredicts      :      49(  0.0)       6720(  0.5)         1418693( 99.5)
Icache/RFB misses   :  308679( 34.3)     183935( 20.4)          407212( 45.3)
ITB misses          :      40(100.0)          0(  0.0)               0(  0.0)
Dcache LD misses    :  348236(  0.0)    1751006(  0.2)       797313848( 99.7)
DTB misses          :   12838(  2.0)         74(  0.0)          622846( 98.0)
LDs merged in MAF   :    4733(  0.0)     322860(  0.1)       428665533( 99.9)
LDU replay traps    :  315715( 66.7)     157372( 33.3)               0(  0.0)
WB/MAF replay traps :     155(  0.0)     268411(  2.5)        10630795( 97.5)
MB stall cycles     :   26439(  2.6)    1000964( 97.4)               0(  0.0)
LDx_L instructions  :       0(  0.0)      30693(100.0)               0(  0.0)
Scache misses       :  457459(  3.2)     510195(  3.6)        13143147( 93.1)


                 2. TOTAL PERFORMANCE(Single Precision)

     I t e m          < P A L >           < O S >            < U S E R>
Cycles              : 7320201(  0.6)   19849313(  1.6)      1205834574( 97.8)
Instructions        : 2464631(  0.1)    6188563(  0.2)      2955596001( 99.7)
nonissue cycles     : 2477460(  1.9)    3983745(  3.1)       122191708( 95.0)
split-issue cycles  :   15954(  0.3)     301108(  5.2)         5526437( 94.6)
pipe-dry cycles     : 2760518(  4.9)    5371666(  9.5)        48503371( 85.6)
replay trap         :   94757(  1.4)     383509(  5.5)         6500476( 93.1)
single-issue cycles :  529520(  2.5)    1822707(  8.6)        18745438( 88.9)
dual-issue cycles   :  971043(  0.3)     463624(  0.1)       330787590( 99.6)
triple-issue cycles :       0(  0.0)     167187(  0.0)       469672746(100.0)
quad-issue cycles   :       0(  0.0)     118404(  0.1)       216264337( 99.9)
flow-change insns   :  480347(  1.4)     292649(  0.8)        33754592( 97.8)
IntOps issued       :  310987(  0.3)     842205(  0.8)       103092900( 98.9)
FPOps issued        :       0(  0.0)          4(  0.0)      2020500000(100.0)
loads issued        :  157744(  0.0)     871543(  0.1)       789750199( 99.9)
stores issued       :   14770(  0.2)     255384(  2.9)         8500000( 96.9)
Icache issued       : 1451087(  0.2)    1661207(  0.2)       932178953( 99.7)
Dcache accesses     :  172398(  0.0)    1458506(  0.2)       798250360( 99.8)
S cache access      :  611916(  0.1)     993141(  0.2)       498446770( 99.7)
long(>15) stalls    :   36248(  1.2)     323825( 10.5)         2720491( 88.3)
PC-mispredicts      :   13457( 29.6)      31918( 70.3)              44(  0.1)
BR-mispredicts      :      62(  0.0)       6275(  0.8)          753817( 99.2)
Icache/RFB misses   :  139075( 34.6)     127264( 31.7)          135703( 33.8)
ITB misses          :       6(100.0)          0(  0.0)               0(  0.0)
Dcache LD misses    :  107625(  0.0)    1455598(  0.2)       771737700( 99.8)
DTB misses          :    2719(  1.9)       4798(  3.3)          139353( 94.9)
LDs merged in MAF   :    4321(  0.0)     169742(  0.1)       278756738( 99.9)
LDU replay traps    :   92134( 39.9)     138727( 60.1)             112(  0.0)
WB/MAF replay traps :       7(  0.0)     123794(  2.2)         5631519( 97.8)
MB stall cycles     :   22611(  2.8)     774628( 97.2)              12(  0.0)
LDx_L instructions  :       0(  0.0)       6665(100.0)               0(  0.0)
Scache misses       :  158439(  2.7)     794794( 13.4)         4968657( 83.9)


                 3. MOST INNER LOOP PERFORMANCE(Double Precision)

     I t e m          < P A L >           < O S >            < U S E R>
Cycles              : 8942107(  0.8)   20984396(  1.8)      1159021923( 97.5)
Instructions        : 6708602(  0.2)    5995120(  0.2)      2958627444( 99.6)
nonissue cycles     : 1284498(  1.4)    9295545( 10.0)        82020078( 88.6)
split-issue cycles  : 1142955( 12.2)     319981(  3.4)         7906002( 84.4)
pipe-dry cycles     : 2605269(  4.6)    2432302(  4.3)        51485984( 91.1)
replay trap         :   43999(  0.7)     802643( 12.6)         5542400( 86.7)
single-issue cycles : 3629087( 21.1)     911279(  5.3)        12632438( 73.6)
dual-issue cycles   : 1539791(  0.5)     380440(  0.1)       301712463( 99.4)
triple-issue cycles :       0(  0.0)      75877(  0.0)       463204003(100.0)
quad-issue cycles   :       0(  0.0)      29433(  0.0)       238233812(100.0)
flow-change insns   : 1328068(  3.9)     262959(  0.8)        32375265( 95.3)
IntOps issued       : 2436793(  2.4)     706059(  0.7)        97142318( 96.9)
FPOps issued        :       0(  0.0)         10(  0.0)      2000000000(100.0)
loads issued        :   65524(  0.0)     134050(  0.0)       829872447(100.0)
stores issued       :   14042(  5.7)     230650( 94.3)               0(  0.0)
Icache issued       : 4115170(  0.4)    4257726(  0.5)       910026433( 99.1)
Dcache accesses     :   80150(  0.0)    1370198(  0.2)       829125000( 99.8)
S cache access      : 1366347(  0.4)     945848(  0.3)       366755336( 99.4)
long(>15) stalls    :   17178(  1.4)     109193(  8.8)         1112076( 89.8)
PC-mispredicts      :   16665( 35.2)      30611( 64.7)              24(  0.1)
BR-mispredicts      :    1115(  0.1)       5552(  0.5)         1125014( 99.4)
Icache/RFB misses   :  116714( 17.7)     160151( 24.3)          381791( 58.0)
ITB misses          :      13(100.0)          0(  0.0)               0(  0.0)
Dcache LD misses    :   58902(  0.0)    1702718(  0.2)       773813770( 99.8)
DTB misses          :    3916(  7.9)         47(  0.1)           45771( 92.0)
LDs merged in MAF   :    3887(  0.0)     422379(  0.1)       414117219( 99.9)
LDU replay traps    :   46068( 26.8)     125666( 73.2)              40(  0.0)
WB/MAF replay traps :       0(  0.0)     263574(  5.0)         4969462( 95.0)
MB stall cycles     :   20592(  2.5)     802797( 97.5)              12(  0.0)
LDx_L instructions  :       0(  0.0)       6097(100.0)               0(  0.0)
Scache misses       :  137703(  1.3)     437914(  4.2)         9738174( 94.4)


                 4. MOST INNER LOOP PERFORMANCE(Single Precision)

     I t e m          < P A L >           < O S >            < U S E R>
Cycles              : 4820641(  0.4)   19033425(  1.7)      1114185626( 97.9)
Instructions        : 3636268(  0.1)     284799(  0.0)      2906832025( 99.9)
nonissue cycles     :  258612(  0.4)   11443352( 16.5)        57689054( 83.1)
split-issue cycles  :  637384( 11.0)     413759(  7.1)         4769107( 81.9)
pipe-dry cycles     : 1581592(  3.2)    4742599(  9.6)        43019180( 87.2)
replay trap         :   15724(  0.3)      48219(  0.8)         5889423( 98.9)
single-issue cycles : 2009237( 12.3)     827514(  5.1)        13531948( 82.7)
dual-issue cycles   :  816861(  0.3)     412295(  0.1)       318320733( 99.6)
triple-issue cycles :       0(  0.0)      35287(  0.0)       467131998(100.0)
quad-issue cycles   :       0(  0.0)      25762(  0.0)       212446897(100.0)
flow-change insns   :  717174(  2.2)     258451(  0.8)        31875436( 97.0)
IntOps issued       : 1397810(  1.4)     712877(  0.7)        95625418( 97.8)
FPOps issued        :       0(  0.0)         34(  0.0)      2000000000(100.0)
loads issued        :   27868(  0.0)     829052(  0.1)       773750097( 99.9)
stores issued       :   13502(  6.2)     204496( 93.7)             161(  0.1)
Icache issued       : 2182080(  0.2)    4636035(  0.5)       909099506( 99.3)
Dcache accesses     :   42175(  0.0)    1316166(  0.2)       773750176( 99.8)
S cache access      :  779090(  0.2)    1004147(  0.2)       490343510( 99.6)
long(>15) stalls    :    7337(  0.3)     121189(  5.4)         2128035( 94.3)
PC-mispredicts      :   11013( 10.6)      92718( 89.2)             174(  0.2)
BR-mispredicts      :       5(  0.0)      10601(  1.7)          625030( 98.3)
Icache/RFB misses   :   56879( 20.2)     124347( 44.3)           99758( 35.5)
ITB misses          :      36(100.0)          0(  0.0)               0(  0.0)
Dcache LD misses    :   21976(  0.0)    1587594(  0.2)       759713787( 99.8)
DTB misses          :    1232(  8.3)         55(  0.4)           13540( 91.3)
LDs merged in MAF   :    3822(  0.0)      87517(  0.0)       272338560(100.0)
LDU replay traps    :   15835( 11.7)      47437( 34.9)           72593( 53.4)
WB/MAF replay traps :       0(  0.0)      46680(  0.9)         4872651( 99.1)
MB stall cycles     :   21393(  2.9)     715356( 97.1)              12(  0.0)
LDx_L instructions  :       0(  0.0)       5836(100.0)               0(  0.0)
Scache misses       :   73594(  1.8)     324212(  7.7)         3805258( 90.5)

   11. Copying Overhead

       My routine uses the "copying method", so there is some
       overhead for copying into the small matrices.

-----------------------------------------------------------------------------
       | Outer Copy | Inner Copy | Calculation | C matrix access |   total
=======+============+============+=============+=================+===========
Clocks |   26961006 |  110197497 |  1166805571 |     57091631    | 1361055705
Ratio  |     2.0%   |     8.0%   |     85.7%   |       4.3%      |   100.0%
-----------------------------------------------------------------------------
                          (These values are not precise, but approximate.)

       There is only about a 10% copying overhead; thus, my routine
       could in principle reach 1028 MFlops (1200*0.857), but in fact
       it reaches only 860 MFlops (71.6%).  So there is about a 14%
       loss in the most inner loop (the PMCTR data shows the same
       result for the most inner loop).

       To get faster code, we must optimize/improve the most inner
       loop by reducing "MAF full replay traps".
