MemMXtest - A Memory Testing Environment for MMX PCs

   Copyright (C) 1999, 2000  J.A. Bezemer

   Version 2.0  16 Mar 2000


OVERVIEW
========

MemMXtest is an extendible computer-based memory testing system that is
built specifically for Intel or compatible processors that incorporate MMX
technology (Pentium w/MMX, Pentium II and III, NOT Pentium Pro). The MMX
instruction set allows reading and writing all 64 bits of the data words
provided by the memory modules at once. MemMXtest incorporates a large
number of well-known march tests and several pseudo-random tests. The
vital parts of each test use manually optimized machine code for maximal
speed.

In its physical form, the test is a floppy with which the "system under
test" should be booted. It may also be possible to boot the test from a
harddisk, bootrom or over a network, but this has not been tested. The
"system under test" does not need a harddisk; it doesn't even need a
keyboard or monitor, as a serial port can be used to control the system
remotely.

This document is meant to be read in order (at least the first time).
However, since many of the discussed subjects are highly related, there is
no obvious ordering which provides a linear buildup of knowledge. This
means that you might have to read this document twice in order to fully
understand it. 


QUICKSTART
==========

While MemMXtest is meant to be adjusted and recompiled for specific
situations, I have supplied a pre-compiled version of the test program.
This should allow you to get acquainted with the program quickly.

First unpack the memmxtest-1.0.tar.gz archive into an empty directory. On
UNIX systems use

  gunzip < memmxtest-1.0.tar.gz | tar xvf -

and on Windows systems use a recent version of WinZip. (Note: you probably
can't do any _development_ on Windows systems, but you can use the
pre-compiled version).

As said, the test is a bootable floppy. The file "image" in the base
directory of the archive is the pre-compiled version. Insert a formatted
1.44 MB floppy in the floppy drive. On Linux systems use a command like

  cat image > /dev/fd0

and on other UNIX systems

  dd if=image of=/dev/fd0 bs=512 conv=sync ; sync

(You may need to change the permissions of /dev/fd0 to be able to access
the floppy as a normal user.)

On a Windows system, open a DOS window, and change to the MemMXtest base
directory. There, give the command

  rawrite\rawrite2

and follow the prompts to write the file "image" to the floppy.

Then insert the floppy in the drive of a PC that has a processor with MMX
technology, and (re-)boot that computer. The words "Loading...." should
appear, followed by a clear-screen and the MemMXtest version number. After
this, the actual testing begins. Numerous notices appear so that you can
easily track what's being done. You can control various aspects of the
tests; see the next section.

(Note: if something goes wrong, it's likely that either your floppy is bad
or you're not using an MMX processor.) 


COMMANDS AND COMMAND MODE
=========================

There is a "command mode" that can be used for "on-line" altering of the
test parameters. (Some parameters are only changeable at compile time.)

These keys are usable during testing (i.e. _not_ in command mode):

  `,'   Switch to command mode as soon as possible, aborting the
        running test.

  `.'   Switch to command mode at the end of this pass. This is useful for
        automated control (via the serial port) that needs to set different
        options for the next pass. You should wait till the
        "CommandMode:" indicator appears before entering additional
        commands.

  `;' and `ScrollLock'   Stop the test output until the next press of
                         `;' or `ScrollLock'. Other keys are still
                         accepted and are placed in a buffer for later
                         processing.

(Note: these keys can either be pressed on the local keyboard or be sent
to the PC via the serial port, see under "THE SERIAL PORT" below.)

These keys can be used in command mode:

  `Escape' or `(Ctrl-Alt-)Del'   Reboot the system immediately ("warm" boot).

  `a'   Set the address bit maps.
        See under "ENTERING THE ADDRESS BIT MAPS" below for a detailed
        explanation.

  `c'   Set cache mode. Modes are:
          AutoToggle: change between On and Half every pass.
          Force On: always cache both program and tested memory.
          Force Half: always cache program, but not the tested memory.
          Force Off: never cache anything.
        A discussion about caching is under "CACHING" below.
        (See also cacherefr.c)

  `r'   Set refresh mode.
         ! NOTE: This has no effect on modern systems in which the host
           bridge chipset (like "VX" or "BX") provides the refresh.
           For those systems, you might be able to control the refresh
           rate from the BIOS setup menu (press Del during system startup).
        Modes are:
          AutoToggle: change between Normal and Extended every two passes
                      (BIOS default is used in passes 0x0 and 0x1; this is
                      supposed to be 15 ms).
          Force Normal: always use normal refresh rate, 15 ms.
          Force Extended: always use extended refresh rate, 50 ms.
          Force XLong: always use extra long refresh rate, 500 ms.
          KeepThis: don't change the refresh rate any more. When used in
                    pass 0x0 or 0x1, this will maintain the BIOS default
                    refresh rate (which is supposed to be 15 ms, but may
                    be different in recent systems).
        (See cacherefr.c)

  `m'   Set memory range that should be tested.
        For example: 01000000 - 02000000 tests 16M - 32M (the second 16 MB
        DIMM module). See under "MEMORY IN A PC" below for more details.
        (Note: these are addresses as the processor sees them; they will
        be translated into `memory addresses' automatically.)

  `s'   Select test set.
        See under "TEST SETS" below for a list.
        (See also tests.c)

  `p'   Select the data-background pattern generator.
        See under "PATTERN GENERATORS" below for a list.
        (See also patterngen.c)

  `.'   Stop the command mode and resume testing. This always starts
        a new pass.

Note: all input and output numbers are in hexadecimal!

When inputting a number, you can usually press `Escape' to quit without
changing anything. Don't just press Enter as this will be interpreted as
`0'. The Delete, Backspace and cursor keys are not supported, so try to
avoid typing errors. 

(Command-mode keys also "work" during normal testing. They are buffered
and processed at the end of the pass. This functionality may be removed in
future versions, so don't use it; use the command mode instead.)


TEST SETS
=========

A large number of tests and test sets are available (see tests.c). Per
pass, only one test set is used. You cannot specify more than one test set
per pass; if you want another combination of tests, or another sequence,
add a new test set to tests.c and recompile. You can also use the `.' key
at the beginning of each pass, and specify a new test set for the next
pass when the command mode is entered at the end of the pass (also see
under "COMMANDS AND COMMAND MODE" above).

Note that most tests are run several times. All march tests except WOM
(nr. 110) are run for all address updating schemes and for all
data-background patterns of the current pattern generator (see under
"PATTERN GENERATORS" below). For example, MATS+ with two addressing
schemes (fast-x and fast-y) and the counting pattern will run 2x7=14
times. The WOM test always runs only once; it provides its own data and
expects a fast-x scheme in address bit map #0, and a fast-y scheme in #1.
The Pseudo-Random tests also provide their own data patterns; they are run
once for each addressing mode. The "repeats" in the Pseudo-Random tests
are executed 5 times for the 2xx tests, and 10 times when called via set
1001 (hard-coded in tests.c).

Available test sets (as defined in tests.c; see below for more info):

  Single March-like tests:

  100   March A     {Any(w0);Up(r0,w1,w0,w1);Up(r1,w0,w1);Dn(r1,w0,w1,w0);
                     Dn(r0,w1,w0)}
  101   March B     {Any(w0);Up(r0,w1,r1,w0,r0,w1);Up(r1,w0,w1);
                     Dn(r1,w0,w1,w0);Dn(r0,w1,w0)}
  102   March C-    {Any(w0);Up(r0,w1);Up(r1,w0);Dn(r0,w1);Dn(r1,w0);Any(r0)}
  103   March C-R   {Any(w0);Up(r0,r0,w1);Up(r1,r1,w0);Dn(r0,r0,w1);
                     Dn(r1,r1,w0);Any(r0,r0)}
  104   March G     {Any(w0);Up(r0,w1,r1,w0,r0,w1);Up(r1,w0,w1);
                     Dn(r1,w0,w1,w0);Dn(r0,w1,w0);Delay;Any(r0,w1,r1);Delay;
                     Any(r1,w0,r0)}
  105   March LA    {Any(w0);Up(r0,w1,w0,w1,r1);Up(r1,w0,w1,w0,r0);
                     Dn(r0,w1,w0,w1,r1);Dn(r1,w0,w1,w0,r0);Dn(r0)}
  106   March LR    {Any(w0);Dn(r0,w1);Up(r1,w0,r0,w1);Up(r1,w0);
                     Up(r0,w1,r1,w0);Dn(r0)}
  107   MATS+       {Any(w0);Up(r0,w1);Dn(r1,w0)}
  108   MATS++      {Any(w0);Up(r0,w1);Dn(r1,w0,r0)}
  109   PMOVI       {Dn(w0);Up(r0,w1,r1);Up(r1,w0,r0);Dn(r0,w1,r1);
                     Dn(r1,w0,r0)}
  10A   PMOVI-R     {Dn(w0);Up(r0,w1,r1,r1);Up(r1,w0,r0,r0);Dn(r0,w1,r1,r1);
                     Dn(r1,w0,r0,r0)}
  10B   Scan        {Any(w0);Any(r0);Any(w1);Any(r1)}
  10C   March U     {Any(w0);Up(r0,w1,r1,w0);Up(r0,w1);Dn(r1,w0,r0,w1);
                     Dn(r1,w0)}
  10D   March U-R   {Any(w0);Up(r0,w1,r1,r1,w0);Up(r0,w1);Dn(r1,w0,r0,r0,w1);
                     Dn(r1,w0)}
  10E   March UD    {Any(w0);Up(r0,w1,r1,w0);Delay;Up(r0,w1);Delay;
                     Dn(r1,w0,r0,w1);Dn(r1,w0)}
  10F   March Y     {Any(w0);Up(r0,w1,r1);Dn(r1,w0,r0);Any(r0)}
  110   WOM         4-bit Word Oriented March test, UpX=fast-x, UpY=fast-y
                    {UpX(w0000,w1111,r1111);DnY(r1111,w0000,r0000);
                     DnX(r0000,w0111,r0111);
                     UpY(r0111,w1000,r1000);UpX(r1000,w0000);
                     DnX(w1011,r1011);DnY(r1011,w0100,r0100);UpX(r0100,w0000);
                     UpY(w1101,r1101);DnX(r1101,w0010,r0010);UpX(r0010,w0000);
                     DnY(w1110,r1110);UpY(r1110,w0001,r0001);DnY(r0001)} 

  Single Pseudo-Random tests:

  200   Pseudo-Random Scan equivalent      {Up(wA);Repeat[Up(rA);Up(wB)]}
  201   Pseudo-Random March C- equivalent  {Up(wA);Repeat[Up(rA,wB)]}
  202   Pseudo-Random PMOVI equivalent     {Up(wA);Repeat[Up(rA,wB,rB)]}

  Combined test sets:

  1000  Sequence of all March tests
  1001  Sequence of all Pseudo-Random tests


[Definitions for the majority of the implemented tests have been taken from
A.J. van de Goor & J. de Neef: "Industrial Evaluation of DRAM tests", IEEE
ref. 0-7695-0078-1/99$10.00]


MEMORY CHIPS
============

To aid the discussion of the tests, we first look at a simplified
schematic of a memory chip. This chip holds 16 bits of data.

                  a d  .-----------.
                  d e--+b0 b1 b2 b3|
             a2---d c--+b4 b5 b6 b7|
             a3---r o--+b8 b9 bA bB|
                  e d--+bC bD bE bF|
                  s e  `-+--+--+--+'
                  s r    |  |  |  |
                          address
                           decoder<------->d
                            |  | (=demux)
                           a0 a1

Where a0-a3 are 4 address bits, b0-bF are memory elements "remembering"
one bit of data, d is the 1-bit data in/out.

The memory elements are organized in a `square'. A few address bits select
the row, and the others select which column of that row to use. Together
they always specify exactly one memory location. 

Modern, high-capacity memory chips have a different layout. For enhanced
operation, the square is cut in halves or even quarters, called "banks".

                         a0
                          |
                        #cols
                       .-----.   .-----.
             a3--#rows |b0 b1|   |b2 b3|
                       |b8 b9|   |bA bB|
                       `-----'   `-----'
                             #banks=======a1,a2
                       .-----.   .-----.
                       |b4 b5|   |b6 b7|
                       |bC bD|   |bE bF|
                       `-----'   `-----'

Now there is only one address bit that determines the row, one for the
column, but two for the bank number. In the Intel architecture, the
address bits are customarily divided as indicated in the drawing above,
which means that rows are "circulating" through the banks (which is
exactly why higher speeds are achieved).

Note that the `squares' in this `set of squares' can perfectly well be
rectangular, when #rows != #cols.
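
As an illustration only, the address split in the drawing above can be
written out in a few lines of Python. The function name decode_banked is
made up for this sketch and is not part of MemMXtest.

```python
def decode_banked(addr):
    """Split a 4-bit address (a3 a2 a1 a0) as in the drawing above."""
    col = addr & 0x1           # a0 selects the column
    bank = (addr >> 1) & 0x3   # a1 and a2 select the bank
    row = (addr >> 3) & 0x1    # a3 selects the row
    return bank, row, col

for addr in range(16):
    bank, row, col = decode_banked(addr)
    print(f"b{addr:X}: bank {bank}, row {row}, col {col}")
```

The output shows that consecutive addresses move to the next bank after
every two locations, which is the "circulating" behaviour mentioned
above.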


There can be faults everywhere, for example in the address decoders, in
the memory elements, between several memory elements, and in wiring.


MEMORY IN A PC
==============

Intel/IBM compatible systems use the memory area between 640 kB and 1024
kB for memory-mapped I/O purposes (e.g. text mode video), and there is no
way to access the `hidden' memory in this region. If memory chips are to
be tested as thoroughly as possible, they need to be accessed in a
continuous manner (i.e. without "gaps"). Also, the testing program
requires some code and data memory which can be located well below the 640
kB boundary. So the structure of the memory will look like this:


      up to 4 GB --+----
                   |
                   |  memory under test
                   |
                   |
                   |
      >= 1024 kB --+----
                   |  `trusted' memory for test program/data and
                   |  to fill 640 kB - 1024 kB "gap"
              0  --+----


In the memory area above 1024 kB, there may be several memory modules that
can be tested in one run. The Intel processors provide 32-bit memory
addresses, resulting in an addressable space of 4 GB. The BIOS startup
code is located at the top of these 4 GB; depending on the hardware used,
the memory above a certain limit might not be accessible (the BX chip"set"
for example has a maximum of 1 GB; see the appropriate documentation). 

On a system board with four memory slots, a possible configuration would
be four 32 MB SDRAM modules. The first module will contain the test
program; the memory to be tested is from 32 MB to 128 MB. In case the
first module is also `untrusted', another test can be run with the modules
shuffled.


But there is more. The host bridge chip"set" (modern versions are actually
only one big chip) does some weird shuffling of the address lines. 

For example, the datasheet of the host bridge may show the following
connections:

       bit#   B0   2   1   0
           +----------------
       row |  a2  a4  a3  a5
       col |  a2      a1  a0


This means that e.g. address line `a4' (as the program sees it) is
connected to the row-bit #2, and `a1' to column bit #1. The actual
addressing (with a0 etc. the _processor's_ address lines, and b00 etc. the
_processor's_ memory references) is as shown below.


                      a1 a0       a2
                  col1|   |col0    |B0
                 ----decoder----   |        (id.)
                  |   |   |   |   MUX   |   |   |   |
         row0 d--b00 b01 b02 b03  | |  b04 b05 b06 b07
      a5------e--b20 b21 b22 b23<-' `->b24 b25 b26 b27
              c--b08 b09 b0A b0B       b0C b0D b0E b0F
      a3------o--b28 b29 b2A b2B       b2C b2D b2E b2F
         row1 d--b10 b11 b12 b13       b14 b15 b16 b17
              e--b30 b31 b32 b33       b34 b35 b36 b37
      a4------r--b18 b19 b1A b1B       b1C b1D b1E b1F
         row2 |--b38 b39 b3A b3B       b3C b3D b3E b3F


So, in order to address, for example, the first column top-down (called
"fast-x addressing"), the processor must use the address sequence 00, 20,
08, 28, 10, 30, 18, 38. Actually, to access the memory in any orderly way
(subsequent rows, subsequent columns, or subsequent banks), some
non-trivial counting and wrapping has to be done. 
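
The sequence above can be reproduced with a short Python sketch. It only
models the small example chip (row bits 2,1,0 wired to a4,a3,a5, column
bits to a1,a0, bank bit to a2); the function name fast_x_column is an
invention for this illustration.

```python
def fast_x_column(col=0, bank=0):
    """Processor addresses of one column, top-down, for the example chip.

    Wiring taken from the example table: row bits 2,1,0 = a4,a3,a5;
    column bits 1,0 = a1,a0; bank bit B0 = a2.
    """
    addresses = []
    for row in range(8):
        r2, r1, r0 = (row >> 2) & 1, (row >> 1) & 1, row & 1
        addr = (r2 << 4) | (r1 << 3) | (r0 << 5) | (bank << 2) | col
        addresses.append(addr)
    return addresses

print(" ".join(f"{a:02X}" for a in fast_x_column()))
# prints: 00 20 08 28 10 30 18 38
```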


ADDRESS BIT MAPS
================

To solve the issue of the shuffled address bits described in the previous
section, the notion of "memory addresses" is introduced. For any given
addressing scheme, the memory address is defined as starting from 0 and
increasing with 1 for each next location. For the previously used example
of "fast-x" addressing, the memory addresses are as indicated below. 

                 b00 b08 b10 b18       b20 b28 b30 b38
                 b01 b09 b11 b19       b21 b29 b31 b39
                 b02 b0A b12 b1A       b22 b2A b32 b3A
                 b03 b0B b13 b1B       b23 b2B b33 b3B
                 b04 b0C b14 b1C       b24 b2C b34 b3C
                 b05 b0D b15 b1D       b25 b2D b35 b3D
                 b06 b0E b16 b1E       b26 b2E b36 b3E
                 b07 b0F b17 b1F       b27 b2F b37 b3F

This means, that memory address 00 corresponds with "processor address" 
00, but memory address 01 with processor address 20, 02 with 08, 03 with
28 and so on, until 3F with 3F. (The notion "processor address" refers to
the address that the processor actually sees, and the program actually
uses to access the memory.) This mapping of memory address to processor
address is unique for this specific addressing mode and this specific
chip; we'll see some other examples below. First a more detailed analysis
of this mapping.

What is done in the program, is working internally with the memory address
(always increment by 1), and then "inverse shuffling" it to get the
processor address with which to do the reading/writing. This "inverse
shuffling" looks like this:


                memory address bit      5  4  3  2  1  0

                                        |  |  |  |  |  |
                                        V  V  V  V  V  V

     goes to processor address bit      2  1  0  4  3  5


For example: memory address 0C maps to processor address 11, and 33 maps
to 2E as follows: 


     memory address   0x0C = 0 0 1 1 0 0        0x33 = 1 1 0 0 1 1
       maps to bit#          2 1 0 4 3 5               2 1 0 4 3 5
                               `-----. |                   `---|-.
                             .-------|-' etc.              .---' | etc.
                             V       V                     V     V
     processor bit#          5 4 3 2 1 0               5 4 3 2 1 0
  processor address          0 1 0 0 0 1 = 0x11        1 0 1 1 1 0 = 0x2E


Alternatively, we can calculate the memory address equivalent of a given
processor address by applying the reverse procedure ("forward mapping").
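
Both directions can be sketched in Python as follows. This is not the
optimized code MemMXtest uses, just a plain illustration; the function
names to_processor and to_memory are made up here. The list holds the
destination bit numbers, leftmost (highest memory address bit) first.

```python
def to_processor(mem_addr, bitmap):
    """Inverse shuffling: memory address -> processor address."""
    n = len(bitmap)
    proc = 0
    for i, dest in enumerate(bitmap):
        mem_bit = n - 1 - i            # leftmost entry is the highest bit
        if (mem_addr >> mem_bit) & 1:
            proc |= 1 << dest
    return proc

def to_memory(proc_addr, bitmap):
    """Forward mapping: processor address -> memory address."""
    n = len(bitmap)
    mem = 0
    for i, dest in enumerate(bitmap):
        if (proc_addr >> dest) & 1:
            mem |= 1 << (n - 1 - i)
    return mem

FAST_X = [2, 1, 0, 4, 3, 5]
print(f"{to_processor(0x0C, FAST_X):02X}")   # 11
print(f"{to_processor(0x33, FAST_X):02X}")   # 2E
print(f"{to_memory(0x2E, FAST_X):02X}")      # 33
```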

We identify a mapping by its "bit destination list". So the fast-x scheme
in this example has an address bit map "2 1 0 4 3 5".

The address bit map can be determined easily from the table in the
chipset's specifications. Recall that we used

             B0   2   1   0
           +----------------
       row | a2  a4  a3  a5
       col | a2      a1  a0

in the current example. The address bit map now is:

      bank  -col-   ---row---
        2   1   0   4   3   5

        0   0   0   0   0   0   memory addresses
        0   0   0   0   0   1
        0   0   0   0   1   0
        0   0   0   0   1   1
        0   0   0   1   0   0
                  :
        0   0   0   1   1   1
        0   0   1   0   0   0   <-- start of second column
        0   0   1   0   0   1
                  :
        0   1   1   1   1   1
        1   0   0   0   0   0   <- start of second bank
        1   0   0   0   0   1
                  :
        1   1   1   1   1   1

In words, this means that we want to address subsequent rows of the first
column in the first bank, then subsequent rows of the second, third and
fourth columns in the first bank; then the same procedure for the second
bank. The order rows - columns - banks is visible clearly (from right to
left) in the address bit map.


Fast-y addressing has the address bit map "2 4 3 5 1 0". (Note the columns
- rows - banks order.) This is not as extreme as fast-x, for only two bits
are exchanged. For example: memory address 0C now maps to processor
address 28, and 33 now maps to 17. 


     memory address   0x0C = 0 0 1 1 0 0        0x33 = 1 1 0 0 1 1
       maps to bit#          2 4 3 5 1 0               2 4 3 5 1 0
                             | |                             | |
                             `-|---.     etc.          .-----' |   etc.
                               V   V                   V       V
     processor bit#          5 4 3 2 1 0               5 4 3 2 1 0
  processor address          1 0 1 0 0 0 = 0x28        0 1 0 1 1 1 = 0x17


The complete scheme is the following:

                 b00 b01 b02 b03       b20 b21 b22 b23
                 b04 b05 b06 b07       b24 b25 b26 b27
                 b08 b09 b0A b0B       b28 b29 b2A b2B
                 b0C b0D b0E b0F       b2C b2D b2E b2F
                 b10 b11 b12 b13       b30 b31 b32 b33
                 b14 b15 b16 b17       b34 b35 b36 b37
                 b18 b19 b1A b1B       b38 b39 b3A b3B
                 b1C b1D b1E b1F       b3C b3D b3E b3F

Other possibilities include "fast-bank,y" with map "4 3 5 1 0 2", but also
"fast-x, x-interleaved" with map "2 1 0 5 4 3" and "fast-x, y-interleaved" 
with map "2 0 1 4 3 5". 


As a real-world example, we now look at the Intel BX chipset and a 64MB
SDRAM DIMM module. The specs show this mapping:

             11  B1  B0  10   9   8   7   6   5   4   3   2   1   0
           +--------------------------------------------------------
       row | 25  13  12  23  14  24  22  21  20  19  18  17  16  15
       col |     13  12  AP      11  10   9   8   7   6   5   4   3

(The "AP" is not important here, so we simply ignore it.)

Following the rules above, for fast-x the address bit map should be:

       bank  --------col--------  ----------------row----------------
      13 12  11 10 9 8 7 6 5 4 3  25 23 14 24 22 21 20 19 18 17 16 15

and for fast-y:

       bank  ----------------row----------------  --------col--------
      13 12  25 23 14 24 22 21 20 19 18 17 16 15  11 10 9 8 7 6 5 4 3


Note that the lowest bit occurring is #3. This is because the modules
provide one 64-bit word at a time; the bits #0, #1 and #2 are used to
select one of the eight 8-bit sub-words (bytes). The test program still
increments the memory address with 1, which for the fast-y scheme results
in processor address increments of 8. 

The highest bit occurring is #25, so there are 26 address bits for this
module (counting from 0). This indeed results in a memory size of
2^26=64M.
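
The derivation of both maps, the increment of 8 and the module size can
be checked with the Python sketch below. The lists are copied from the
table above; the names ROW, COL, BANK and to_processor are only chosen
for this illustration.

```python
ROW = [25, 23, 14, 24, 22, 21, 20, 19, 18, 17, 16, 15]  # row pins 11..0
COL = [11, 10, 9, 8, 7, 6, 5, 4, 3]                     # column pins 8..0
BANK = [13, 12]                                         # pins B1, B0

FAST_X = BANK + COL + ROW    # rightmost (fastest) bits select the row
FAST_Y = BANK + ROW + COL    # rightmost (fastest) bits select the column

def to_processor(mem_addr, bitmap):
    """Map a memory address to a processor address (inverse shuffling)."""
    proc = 0
    for mem_bit, dest in enumerate(reversed(bitmap)):
        if (mem_addr >> mem_bit) & 1:
            proc |= 1 << dest
    return proc

print(" ".join(str(b) for b in FAST_X))
print(to_processor(1, FAST_Y) - to_processor(0, FAST_Y))   # 8
print(2 ** (max(FAST_X) + 1) // 2 ** 20, "MB")             # 64 MB
```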

In case of multiple consecutive modules, more bits should be added on the
left-hand side, for example for fast-x:

  --modules--  bank  --------col--------  ----------------row----------------
  29 28 27 26 13 12  11 10 9 8 7 6 5 4 3  25 23 14 24 22 21 20 19 18 17 16 15

However, MemMXtest will automatically do this for you. That is, it
automatically fills the left-hand side of the map starting from the
highest entered bit number plus one. (So it does not use bit numbers that
were skipped in the entered series.) 

For example, if you entered:

  5 4 3 9 8 7

it will be read as:

  32 31 30 29 28 ... 15 14 13 12 11 10 5 4 3 9 8 7
                                      |
   implicitly added by the program <--|--> entered values

Note that bit #6 is absent; the usefulness of that is questionable.
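
A rough Python sketch of this filling rule is given below. The function
name fill_map is made up, and the total of 29 entries is an assumption
inferred from the example above (six entered values plus the added bits
10 up to 32); the real program may use a different size.

```python
def fill_map(entered, total_entries=29):
    """Extend an entered bit map on the left-hand side.

    total_entries=29 is an assumption derived from the example in the
    text, not a value taken from the MemMXtest sources.
    """
    missing = total_entries - len(entered)
    start = max(entered) + 1           # highest entered bit plus one
    added = list(range(start + missing - 1, start - 1, -1))
    return added + list(entered)

print(fill_map([5, 4, 3, 9, 8, 7]))
```

The result starts with 32 31 30 and ends with 10 5 4 3 9 8 7; bit 6 does
not appear because only numbers above the highest entered bit are added.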


ENTERING THE ADDRESS BIT MAPS
=============================

With the `a' command (in command mode), the address bit maps can be
changed. 

The format for the address bit maps is as indicated in the previous
section, but note that all input and output values are in hexadecimal.

For example, the address bit map for fast-x as shown above,

       bank  --------col--------  ----------------row----------------
      13 12  11 10 9 8 7 6 5 4 3  25 23 14 24 22 21 20 19 18 17 16 15

should be entered as:

   0D 0C 0B 0A 09 08 07 06 05 04 03 19 17 0E 18 16 15 14 13 12 11 10 0F

You should just start with entering the `0D' (or just `D'; leading zeros
are optional) and finish with the `0F' (or `F'), from left to right. This
order might seem a bit unnatural at first sight.
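
If you prefer not to convert the numbers by hand, a few lines of Python
will do it. This helper is only an illustration (to_hex_entry is a
made-up name); it is not part of MemMXtest.

```python
def to_hex_entry(bitmap):
    """Format a decimal address bit map in hexadecimal, left to right."""
    return " ".join(f"{bit:02X}" for bit in bitmap)

fast_x = [13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3,
          25, 23, 14, 24, 22, 21, 20, 19, 18, 17, 16, 15]
print(to_hex_entry(fast_x))
```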

After entering the maps, they will be printed for verification purposes.
You'll see that the unspecified bit positions are filled with `FF'; these
will be calculated by the program at a later time.

By default, 5 address bit maps can be specified (MAX_ADRBITMAPS in
defines.h). During one pass, all march and/or pseudo-random tests will be
executed once for each `filled' address bit map (where `unfilled' = all
`FF's). This allows you to test with for example fast-x, fast-y,
fast-bank-y, fast-bank-x and x-interleaved addressing modes in one run,
without any interaction. 

When entering the address bit maps with the `a' command, you always get
prompted for all 5 maps. Possible responses are: 

  Escape or           Map remains unchanged.
  <numbers> Escape

  Enter               Map is cleared (i.e. set to all `FF's).

  <numbers> Enter     Map is set to entered numbers.

The numbers may be separated by any number of spaces. Pressing Escape on
every input line will just show the current address bit maps without
changing anything.

To test the working of an address bit map, you can specify a non-existent
memory region to be tested, for example a small range above 1 GB
(=0x40000000). The fault addresses that are printed are in the order in
which the memory is accessed. 


There is one default address bit map, which is simply the identity
mapping between memory and processor addresses (i.e. there is no
shuffling of the address lines). This works acceptably in most cases, but
with situation-specific maps a higher fault coverage is often possible.
The default is defined in defines.h (ADRBITMAP_LIST_INIT).


ABOUT MARCH TESTS
=================

March tests were originally designed to test for "permanent" faults in a
single-bit memory chip. This is in contrast to "non-permanent" faults that
are caused (or enhanced) by, for example, heat, high frequencies
(radiation, but also "overclocking") or specific usage patterns.

March tests can be guaranteed to detect all permanent faults of certain
classes. They may also detect "non-permanent" faults, but there is no
guarantee due to the very nature of these faults. 

Example of a march test ("MATS+" test):

  { Any(w0); Up(r0, w1); Down(r1, w0) }

This says: write a `0' in all memory locations in any address order. Then,
in ascending address order, for each memory location: first test (read) 
if there is a `0' and immediately write a `1', then continue with the next
location. Last, in descending address order, for each memory location: 
first test if there is a `1' and immediately write a `0'. (Remember that
we're still discussing a 1-bit memory.)

Or, as a program, for a 1-bit memory of 16 locations (0x0 to 0xF):

  Mem[0] = 0                 Any(w0)
  Mem[1] = 0
   :
  Mem[F] = 0

  Test(Mem[0] == 0)          Up(r0, w1)
  Mem[0]=1
  Test(Mem[1] == 0)
  Mem[1]=1
   :
  Test(Mem[F] == 0)
  Mem[F]=1

  Test(Mem[F] == 1)          Down(r1, w0)
  Mem[F]=0
  Test(Mem[E] == 1)
  Mem[E]=0
   :
  Test(Mem[0] == 1)
  Mem[0]=0

(The address order does not necessarily have to be ascending and
descending, but may be any order, as long as "Up" is the exact reverse of
"Down".)
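
To make this concrete, here is a small Python simulation of a march test
on a 1-bit memory. It is a sketch for illustration, not the optimized
code in MemMXtest; the names run_march and MATS_PLUS and the simulated
stuck-at-0 cell are assumptions of this example.

```python
MATS_PLUS = [("any", ["w0"]), ("up", ["r0", "w1"]), ("down", ["r1", "w0"])]

def run_march(test, mem_read, mem_write, size):
    """Run a march test on a 1-bit memory; return a list of failures.

    Each failure is (element index, address, expected, actual).
    """
    failures = []
    for index, (order, ops) in enumerate(test):
        # "any" is run ascending here; any fixed order would do.
        addresses = range(size - 1, -1, -1) if order == "down" else range(size)
        for addr in addresses:
            for op in ops:
                value = int(op[1])
                if op[0] == "w":
                    mem_write(addr, value)
                else:
                    actual = mem_read(addr)
                    if actual != value:
                        failures.append((index, addr, value, actual))
    return failures

# A fault-free memory, and one whose cell 5 is stuck at 0.
good = [0] * 16
print(run_march(MATS_PLUS, good.__getitem__, good.__setitem__, 16))

bad = [0] * 16
def write_stuck(addr, value):
    bad[addr] = 0 if addr == 5 else value
print(run_march(MATS_PLUS, bad.__getitem__, write_stuck, 16))
```

The first call reports no failures. The second reports one failure in
the last march element: cell 5 was expected to hold a `1' but still
holds a `0'.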


PATTERN GENERATORS
==================

March tests were originally designed for single-bit memories. With
MemMXtest, we are using 64-bit words. Each bit-position is still supposed
to be in one `set of squares' (so 64 sets, one for each bit), but that is
not guaranteed. So, instead of using all 1's where a test says "write 1"
and all 0's for "write 0", we substitute "write pattern p" and "write
pattern n". Then we run each test with several values of p and n (n is
usually the bitwise inverse of p). These values are generated with a
"pattern generator".

These pattern generators are the only data source for the march tests (nr.
1xx) except the WOM test, which provides its own data patterns. The march
tests can't use pseudo-random data. And vice versa, the pseudo-random
tests (nr. 2xx) can't use the pattern generators. 

Several pattern generators are available (see pattern_gen.c):

  0     Counting pattern (default): p gets these values (and n=inv(p)):

          0000000000000000000000000000000000000000000000000000000000000000
          0101010101010101010101010101010101010101010101010101010101010101
          0011001100110011001100110011001100110011001100110011001100110011
          0000111100001111000011110000111100001111000011110000111100001111
          0000000011111111000000001111111100000000111111110000000011111111
          0000000000000000111111111111111100000000000000001111111111111111
          0000000000000000000000000000000011111111111111111111111111111111

        This is the shortest pattern that tests all possible combinations
        of the values of any two bits.
        The name "counting" comes from the standard binary counting pattern
        that can be seen when rotating the table above by 90 degrees
        clockwise.

  1     Counting pattern, incl. inversion: p gets all values of the
        "original counting pattern" above, followed by the inverted
        values. According to theory, this should not be necessary, but
        in practice it might trigger some more faults.

  2     Walking one: p gets these values:

          0000000000000000000000000000000000000000000000000000000000000001
          0000000000000000000000000000000000000000000000000000000000000010
          0000000000000000000000000000000000000000000000000000000000000100
          0000000000000000000000000000000000000000000000000000000000001000
            :
          0010000000000000000000000000000000000000000000000000000000000000
          0100000000000000000000000000000000000000000000000000000000000000
          1000000000000000000000000000000000000000000000000000000000000000
          0000000000000000000000000000000000000000000000000000000000000000

        The total number of patterns is 64+1=65.

  3     Walking one per nibble: p gets these values:

          0001000100010001000100010001000100010001000100010001000100010001
          0010001000100010001000100010001000100010001000100010001000100010
          0100010001000100010001000100010001000100010001000100010001000100
          1000100010001000100010001000100010001000100010001000100010001000
          0000000000000000000000000000000000000000000000000000000000000000

  4     All ones: p always:

          1111111111111111111111111111111111111111111111111111111111111111

        So n always 000...000. This is the most basic way to extend a
        1-bit march test to 64 bits; it will not detect coupling faults
        between data lines.

  5     All ones, all zeros: p gets these values:

          1111111111111111111111111111111111111111111111111111111111111111
          0000000000000000000000000000000000000000000000000000000000000000
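
The counting and walking-one patterns are easy to generate, and the
claim that the counting pattern covers all combinations of any two bits
can be checked by brute force. The Python sketch below does both. It
assumes that the tables above are printed most significant bit first;
the function names are made up for this illustration.

```python
MASK64 = (1 << 64) - 1

def counting_patterns():
    """The seven values of p for generator 0, as 64-bit integers."""
    patterns = [0]
    for k in range(6):
        p = 0
        for i in range(64):
            if ((i >> k) & 1) == 0:    # blocks of 2^k ones, then 2^k zeros
                p |= 1 << i
        patterns.append(p)
    return patterns

def walking_one():
    """The 65 values of p for generator 2."""
    return [1 << i for i in range(64)] + [0]

def covers_all_pairs(patterns):
    """True if every pair of bit positions sees 00, 01, 10 and 11
    when each p is used together with n = inv(p)."""
    values = patterns + [p ^ MASK64 for p in patterns]
    for i in range(64):
        for j in range(i + 1, 64):
            seen = {((v >> i) & 1, (v >> j) & 1) for v in values}
            if len(seen) != 4:
                return False
    return True

for p in counting_patterns():
    print(f"{p:064b}")
print(covers_all_pairs(counting_patterns()))        # True
print(covers_all_pairs(counting_patterns()[:-1]))   # False
```

The last line shows that leaving out one of the seven values already
breaks the coverage, at least for this particular family of patterns.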


Note that several popular patterns like "row stripe", "column stripe" and
"checkerboard" are not available. This is because the patterns would have
to be swapped each or every few words, which is very time-consuming. 
Furthermore, there is no guarantee about the actual ordering of the bits on
the chip, which means there is no obvious way to ensure that the intended
patterns are indeed present. 


GENERATING PSEUDO-RANDOM SEQUENCES
==================================

A well-known way to generate pseudo-random numbers is using a Linear
Feedback Shift Register (LFSR). This is basically a one-bit shift register
with some logic around it. Many variants are available, but only one
software implementation is both easy and fast.


              .------------+-----------------------------.
              |            |d2                           |
              |            V                             |
              |d2.---.   .---. d1.---.         d0.---.   |d2
              `--| M |<--|XOR|<--| M |<----------| M |<--'
                 `---'   `---'   `---'           `---'


Above is an example of a 3-stage (or 3-bit) LFSR. The `M's are memory
elements; the `state' of the shift register is d2d1d0, which is also the
generated pseudo-random number. 

Say the state is 010 at some time. To determine the next state, calculate
the values at the inputs of the memory elements.

         new_d2 = d1 XOR d2         new_d1 = d0      new_d0 = d2
                =  1 XOR 0  = 1            = 0              = 0

So we get 100. The next step is 101:

         new_d2 = d1 XOR d2         new_d1 = d0      new_d0 = d2
                =  0 XOR 1  = 1            = 0              = 1

And so on.

This can be implemented easily in software. The d2, d1 and d0 bits are now
the least significant bits of a >=3-bit variable or register. The
procedure is as follows:

     Old content of register:    010
     Shift one place left:      0100
     If the shifted-out bit
       is 1, XOR with 101:      0100
     New content of register:    100

     Old content of register:    100
     Shift one place left:      1000
     If the shifted-out bit
       is 1, XOR with 101:      1101
     New content of register:    101

And so on. We get exactly the same values as in the `circuit'-variant.

The complete sequence is:

                                 010 = 2
                                 100 = 4
                                 101 = 5
                                 111 = 7
                                 011 = 3
                                 110 = 6
                                 001 = 1
                                 010 = 2 (back to start)
                                 100 = 4  etc.

So we have a pseudo-random sequence of length 7 = 2^3-1.
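
As an illustration, the shift-and-XOR procedure for the 3-bit example can
be written in C roughly as follows. This is only a sketch; the function
names lfsr3_step() and lfsr3_period() are made up for this example and do
not occur in the MemMXtest sources.

```c
#include <stdint.h>

/* One step of the 3-bit example LFSR with xor-mask 101. */
static uint8_t lfsr3_step(uint8_t state)
{
    uint8_t shifted_out = (state >> 2) & 1;   /* bit d2 leaves the register */

    state = (uint8_t)((state << 1) & 0x7);    /* shift left, keep 3 bits    */
    if (shifted_out)
        state ^= 0x5;                         /* XOR with mask 101          */
    return state;
}

/* Number of steps needed before the start state comes back. */
static unsigned lfsr3_period(uint8_t start)
{
    uint8_t s = start;
    unsigned steps = 0;

    do {
        s = lfsr3_step(s);
        steps++;
    } while (s != start);
    return steps;
}
```

Starting from 010, repeated calls of lfsr3_step() walk through the states
listed above, and lfsr3_period() counts 7 steps for any non-zero start.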

The placement of the XOR gates, or the equivalent value of the `xor-mask',
is very important because only very specific placements produce a
maximum-length (2^#bits-1) sequence. This is accomplished only if the mask
represents a `primitive polynomial'. The above example uses the polynomial

                 3    2            3      2      1      0
                x  + x  + 1  =  1*x  + 1*x  + 0*x  + 1*x
                                       -      -      -

When you forget the x^3, the 101 xor-mask is clearly visible (the
exponents are the bit-positions).

The `inverse' of a primitive polynomial is also primitive. It is
constructed by writing the exponents in the other direction:

             0      1      2      3        3      2      1      0
          1*x  + 1*x  + 0*x  + 1*x   =  1*x  + 0*x  + 1*x  + 1*x
             ^      ^      ^      ^            -      -      -

So the inverse xor-mask is 011.
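
For small register widths, both claims can be checked by brute force. The
sketch below is not part of MemMXtest (the Maple program mentioned below
does the real work); the names lfsr_step(), inverse_mask() and
lfsr_period() are invented for this example, and the register width n is
assumed to be at most 31 bits.

```c
#include <stdint.h>

/* One left-shifting LFSR step for an n-bit register (1 <= n <= 31). */
static uint32_t lfsr_step(uint32_t state, uint32_t mask, unsigned n)
{
    uint32_t out = (state >> (n - 1)) & 1u;     /* bit that is shifted out */

    state = (state << 1) & ((1u << n) - 1u);
    return out ? (state ^ mask) : state;
}

/* Xor-mask of the inverse polynomial: write the n+1 coefficients
   (x^n down to x^0) in the other direction, then drop x^n again. */
static uint32_t inverse_mask(uint32_t mask, unsigned n)
{
    uint32_t poly = (1u << n) | mask;           /* put the x^n term back */
    uint32_t rev = 0;

    for (unsigned i = 0; i <= n; i++)
        if (poly & (1u << i))
            rev |= 1u << (n - i);
    return rev & ((1u << n) - 1u);
}

/* Sequence length when starting from state 1; gives up after 2^n steps. */
static uint32_t lfsr_period(uint32_t mask, unsigned n)
{
    uint32_t s = 1, count = 0;

    do {
        s = lfsr_step(s, mask, n);
        count++;
    } while (s != 1 && count < (1u << n));
    return count;
}
```

With n = 3, both the mask 101 and its inverse 011 give the maximum length
of 7, while a non-primitive choice such as 111 gives a shorter sequence.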

The file doc/primitive_polys contains a Maple program that `scans' for
primitive polynomials. For 64-bit pseudo-random values, this gave the
polynomial

                         64    61    60    27
                        x   + x   + x   + x   + 1

with xor-mask

    0011000000000000000000000000000000001000000000000000000000000001

or as a hexadecimal number: 0x3000000008000001. This is defined as
PR_XORMASK64 in defines.h. A C version of the actual shifting and XOR'ing
is in update_currentpr64() in main.c. Because the shifted-out bit is lost
after shifting, its value is examined first to decide whether to shift
with or without XOR'ing.

Note that the state of all-zeros will never be reached (it's impossible to
leave, too). However, since it will take several years to generate the
entire 64-bit pseudo-random sequence, the fact of one missing value will
not be any problem. 
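
A minimal C sketch of such a 64-bit update step is given here. It is a
stand-in written for this manual, not a copy of update_currentpr64(); the
name lfsr64_step() is made up, and the mask is simply the value quoted
above.

```c
#include <stdint.h>

/* Same value as PR_XORMASK64 quoted above: x^64 + x^61 + x^60 + x^27 + 1 */
#define XORMASK64 UINT64_C(0x3000000008000001)

/* One LFSR step; hypothetical stand-in for the real update routine. */
static uint64_t lfsr64_step(uint64_t state)
{
    /* Look at bit 63 before the shift throws it away. */
    if (state & (UINT64_C(1) << 63))
        return (state << 1) ^ XORMASK64;
    return state << 1;
}
```

Note that lfsr64_step(0) returns 0 again, which is the all-zeros state
that cannot be left.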


ABOUT PSEUDO-RANDOM TESTS
=========================

There are various forms of pseudo-random tests, depending on where
pseudo-random information is used. Pseudo-random tests are usually
classified by their use of either pseudo-_r_andom or _d_eterministic
_d_ata ("RD" or "DD"), _a_ddresses ("RA" or "DA") and read/_w_rite
operations ("RW" or "DW", Random Write = Random Read). 

The three pseudo-random tests that are implemented in MemMXtest are
"DADWRD"-type tests. These tests behave much like march tests: the address
order and read/write operations are pre-defined, but the data patterns are
pseudo-random. 

The pseudo-random data is generated in the way described in section
"GENERATING PSEUDO-RANDOM SEQUENCES" above. The initial value (`state' of
the LFSR) is derived from the system time using the "hardware clock" (also
named "CMOS clock") at the start of the program (the "RandomSeed64:" 
value on screen), see pseudorandom_init() in main.c. 

All pseudo-random test elements use the same LFSR (the code is sometimes
different, but the xor-mask is the same, and the state is passed from one
element to the next). To enhance reproducibility, the starting value is
printed on screen at the beginning of each test element. However, to
actually have the program use a user-defined starting value (as opposed to
the clock-derived value), you'll have to change pseudorandom_init() in
main.c.

Example of a Pseudo-Random test (March C- equivalent):

  { Up(wA); Repeat[ Up(rA, wB) ] }

This says: initialize the memory with a sequence of 64-bit pseudo-random
values, using increasing address order. Then do multiple repeats of the
following: in increasing address order, for each memory location, first
test if the expected value is present, then immediately write a completely
unrelated new pseudo-random value. The "increasing address order" is
either fast-x or fast-y. 

So for 5 repeats, the test becomes

  {Up(wA); Up(rA,wB); Up(rB,wC); Up(rC,wD); Up(rD,wE); Up(rE,wF)}

with A, B, C, D, E and F completely unrelated pseudo-random values that
are different for each memory location. 
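
The structure of such a test can be sketched in plain C as shown here.
This is a simplified model and not the MemMXtest implementation: it runs
on an ordinary array, ignores caching and fast-x/fast-y addressing, and
the names pr_test() and lfsr64_step() are made up. The expected values
are not stored anywhere; a second copy of the LFSR state simply replays
the sequence that was written earlier.

```c
#include <stddef.h>
#include <stdint.h>

#define XORMASK64 UINT64_C(0x3000000008000001)

static uint64_t lfsr64_step(uint64_t s)
{
    return (s & (UINT64_C(1) << 63)) ? (s << 1) ^ XORMASK64 : s << 1;
}

/* Runs { Up(wA); Repeat[ Up(rA, wB) ] } on mem[0..len-1] and returns
   the number of mismatches found.  `seed' must be non-zero. */
static size_t pr_test(volatile uint64_t *mem, size_t len,
                      unsigned repeats, uint64_t seed)
{
    uint64_t expect = seed;   /* replays the previously written values */
    uint64_t fresh  = seed;   /* produces the values to be written     */
    size_t faults = 0;

    for (size_t i = 0; i < len; i++) {           /* Up(wA) */
        fresh = lfsr64_step(fresh);
        mem[i] = fresh;
    }
    for (unsigned r = 0; r < repeats; r++) {
        for (size_t i = 0; i < len; i++) {       /* Up(rA, wB) */
            expect = lfsr64_step(expect);
            if (mem[i] != expect)
                faults++;
            fresh = lfsr64_step(fresh);
            mem[i] = fresh;
        }
    }
    return faults;
}
```

Because both states advance by exactly one step per memory location, the
`expect' state always lags one full pass behind the `fresh' state.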

The values in subsequent memory locations, when seen as 64-bit words, are
highly related, because apart from the one-bit shift itself, either 0 or
only 4 of the 64 bits change value (there are only 4 `1'-bits in the
xor-mask). This fact may
lead to the masking of certain coupling faults in the data lines. To get
"more random" values, it is possible to use only one of every 2, 4 or up
to 64 values of the pseudo-random sequence. To achieve this, define
PR_SHIFTMOREBITS64 in defines.h and set PR_SHIFTBITSNO64 to for example 2,
4 or 64. Note that this may cause a (severe) slowdown of the program.
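
Both effects can be illustrated with a few lines of C. This is a sketch
only: the helper names are invented, and the `shifts' parameter merely
plays the role that PR_SHIFTBITSNO64 is described as having above; how
the real code applies that setting is not shown here.

```c
#include <stdint.h>

#define XORMASK64 UINT64_C(0x3000000008000001)

static uint64_t lfsr64_step(uint64_t s)
{
    return (s & (UINT64_C(1) << 63)) ? (s << 1) ^ XORMASK64 : s << 1;
}

/* Number of bits in which the new state differs from the plain one-bit
   left shift of the old state: 0 or 4 for this xor-mask. */
static unsigned bits_changed(uint64_t s)
{
    uint64_t diff = (s << 1) ^ lfsr64_step(s);
    unsigned count = 0;

    for (; diff != 0; diff &= diff - 1)     /* clear lowest set bit */
        count++;
    return count;
}

/* Advances the LFSR `shifts' steps, so that only one out of every
   `shifts' values of the sequence is actually used. */
static uint64_t lfsr64_skip(uint64_t s, unsigned shifts)
{
    while (shifts-- > 0)
        s = lfsr64_step(s);
    return s;
}
```

The extra steps in lfsr64_skip() are also where the slowdown comes from:
every value written now costs `shifts' LFSR updates instead of one.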

The `prmarchel' test elements use a pre-calculated table of pseudo-random
data. This allows writing the data at top speed to the current row or
column, while new data is calculated between these writes. The
`prdmarchel'
elements (Pseudo-Random Direct) write each value when it is calculated
(or, in other words, they calculate "on the fly"). [Note: currently there
are no prdmarchel's any more.]


Some provisions have been made for future "RADWRD"-type tests, which have
random data as well as random addresses. 16-bit to 32-bit xor-masks for
generating the addresses have been constructed and tested. They are
available in prepare_pr32() in main.c, but are not yet used by any test. 


THE SERIAL PORT
===============

One of the serial ports is used to provide both a second screen and a
second keyboard. The I/O-address of the serial port is defined as
SERIAL_ADR in defines.h, as is the choice for ending lines with either
CRLF or only LF (SERIAL_SEND_CRLF). The communication parameters (speed,
parity, start/stopbits) can be configured in serial_echo_init() in
mtest.c. Default is COM2 at 9600 baud, 8 bits, no parity, 1 stop bit, only
LF.

The intention is that this can be used by automatic processes that monitor
and control the testing parameters in this way. All data is displayed in
hexadecimal, and there are "line headers" ending in a colon, which should
be easily parsable.

For fault logging, the controlling device should remember at least the
current pass number, memory range, address increments, test identification
(and possibly all earlier test identifications of the current pass),
sub-test identification (and all earlier ones of this test), and of course
the printed sub-test read operand number, address and read/expected data.

To give commands via the serial line, the controller should use either `.'
or `,' as described under "COMMANDS AND COMMAND MODE" above, and wait
until the "CommandMode:" line header appears. 

The line header interpretation should be done case-insensitively.

An "Error:" line header indicates that the current test or test set could
not be completed successfully. This usually means that some test parameters
were entered incorrectly.
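
On the controlling side, recognizing a line header could look like the
sketch below. It runs on the controller (where a normal C library is
available), not in MemMXtest itself. The function name is made up, and it
is an assumption that the header sits at the very start of the received
line.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if `line' starts with the header `name' directly followed
   by a colon, compared case-insensitively; otherwise 0. */
static int has_line_header(const char *line, const char *name)
{
    size_t i = 0;

    for (; name[i] != '\0'; i++) {
        /* A too-short line ends here because '\0' never matches. */
        if (tolower((unsigned char)line[i]) !=
            tolower((unsigned char)name[i]))
            return 0;
    }
    return line[i] == ':';
}
```

A controller would typically call this for each received line, for
example with "CommandMode" before sending a command and with "Error" to
detect a failed test set.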


DEVELOPERS' INFO
================

In the following sections, information is given that is primarily of
interest to people wanting to change MemMXtest. 

MemMXtest is meant to be compiled under the Linux operating system. More
info on Linux can be found at

  http://www.linux.org

Using the Cygwin package, available from

  http://sourceware.cygnus.com/cygwin/

it might also be possible to compile under Windows 95/98/NT, but this is
not officially supported and not tested. 

Apart from a C development environment (gcc, as, ld, make), you also need
the "bin86" package to produce the startup section in "real mode" code. 

Note that the compiled code (boot floppy) doesn't need any operating
system at all. It's effectively its own OS. 


(RE-)COMPILING
==============

You can start compilation by giving the command

  make

in the src/ subdirectory. This will produce a file "image" in that
directory. (Note: the provided pre-compiled image is also called "image" 
but is in the package's base directory.) You can write that file to a
floppy as described under "QUICKSTART" above, or use the command

  make install

which does both compilation and writing.

As usual, the Makefile has been constructed in such a way that supposedly
unchanged object files won't be compiled again. 


SOURCE FILES
============

There are several types of source files:

  .c   C code that is preprocessed and compiled to "protected mode"
       code (.o).

  .S   Assembler code that is preprocessed (.s) and assembled to either
       "real mode" code (bootsect.o and setup.o) or "protected mode"
       code (head.o and *_ml.o).

  .h   Header files.


  marchel/marchel_*_ml.S   Assembler code for one march element
                           (ml=machine language). ~~~~~ ~~

  marchel/marchel_*.c      C file containing one function that calls the
                           corresponding _ml.S code. This is the only place
                           _ml.S code may be called from.

  march/march_*.c          C files calling marchel_*.c functions as
                           required by the march test.

  prmarch(el)/prmarch*     march(el)* equivalents for Pseudo-Random tests.


CALLING SEQUENCE
================

BIOS                                   Boot code, loads first sector of floppy
 |
 V
bootsect.S                         Loads rest of "image" to temporary location
 |
 V
setup.S          Moves "image" to where it belongs, switches to protected mode
 |
 V
head.S                                              Installs interrupt vectors
 |
 V
do_test() in mtest.c                                     Main loop for testing
 |
 +-> init() in mtest.c                  Initialization, only before first pass
 | 
 |then per pass:
 |
 +-> set_cache() and set_refresh() in cacherefr.c          Set per-pass cache/
 |                                                              refresh status
 +-> testseq() in tests.c                         "Dispatch" routine for tests
      |
      V
    (stdtestseq() in tests.c                  Runs test several times for each
      |                                       data pattern and addressing mode)
      V example:
     march_a() in march/march_marcha.c               Executes one March A test
      |
      +-> marchel_up_rp_wn_wp_wn() in marchel/marchel_rp_wn_wp_wn.c
      |    |
      |    V
      :   marchel_up_template() in marchel_template.c        Template function
      :    |
           +->  prepare_update_adr_up() in update_adr.c    Compiles addressing
           |                                                              code
           V
          marchel_rp_wn_wp_wn_ml in marchel/marchel_rp_wn_wp_wn_ml.S


SERVICE FUNCTIONS
=================

The fact that MemMXtest doesn't run under any operating system results in
quite severe limitations when compared to `normal' programs.

This is most obvious with the Input/Output operations, which are usually
taken care of by the OS. This support is totally unavailable to MemMXtest; 
even the standard BIOS routines (that are in the computer's ROM) can't be
used because they have 16-bit code, which doesn't run in the 32-bit mode
that MemMXtest uses. 

Complex functions normally provided by a C library can't be used either. 
Among those are for example all functions related to strings (strlen,
strchr), memory blocks (memcpy, malloc), mathematics (sin, log), date/time
(gettimeofday, alarm) and process/job control (system, fork, exit). 

When standard functions are unavailable, they have to be reprogrammed or
worked around. In MemMXtest, several service functions are available (in
mtest.c and .h). The most important are: 

   cprint()         Prints a character string (char*); special characters
                    are not treated specially.

   b/h/p64print()   Print 1/4/8 bytes in hexadecimal form, zero-padded.

   println()        Prints a newline (which is _not_ printed with cprint!).

   memdump()        Dumps specified range of memory, useful for debugging.

   delay()          Waits for specified number of seconds (-0/+1) by
                    looking at the hardware (CMOS) clock.

Input-related functions are used only internally in mtest.c. Several other
operations (like copying memory blocks) are mostly programmed ad-hoc. 
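
To give an idea of what `reprogramming' a standard function means, here
is a sketch of zero-padded hexadecimal formatting without any C library
support. It is not the actual p64print() code; the name hex64_to_str() is
made up, and the use of uppercase digits is an arbitrary choice.

```c
#include <stdint.h>

/* Writes `value' as 16 zero-padded hexadecimal digits plus a
   terminating NUL into buf, which must hold at least 17 bytes. */
static void hex64_to_str(uint64_t value, char *buf)
{
    static const char digits[] = "0123456789ABCDEF";

    for (int i = 15; i >= 0; i--) {     /* fill from the right */
        buf[i] = digits[value & 0xF];
        value >>= 4;
    }
    buf[16] = '\0';
}
```

The resulting string could then be handed to a routine like cprint().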

If other, more complex functions are needed, the source code of the GNU C
library (http://www.gnu.org/software/libc/libc.html) might be quite useful. 


CACHING
=======

In order to detect certain types of faults in memory chips, the order of
read and write accesses is important. When testing memory in a computer
system, the various caching mechanisms will cluster the read and write
operations. It is therefore required that the cache is turned off (or
ineffective) during these tests.

In Intel CPUs, there are two internal state bits that control the `global
caching' (bits CD and NW of the control register CR0). These bits set the
caching mode to either `fill' or `no-fill'. The `fill' mode is the normal
operation mode in which the cache is fully functional; in `no-fill' mode
the cache remains functional (read/write) for the memory locations it was
caching before the mode change, but it will not cache any new memory
locations. The `WBINVD' instruction will invalidate ("flush") the cache so
that, when in `no-fill' mode, the cache is effectively switched off.
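
In terms of bit values, the two modes can be sketched as follows. In the
Intel documentation CD is bit 30 and NW is bit 29 of CR0, and `no-fill'
corresponds to CD=1, NW=0. The helper names are made up; the sketch only
computes the new register value. Actually reading and writing CR0 (and
executing WBINVD) requires privileged instructions, which are left out
here.

```c
#include <stdint.h>

#define CR0_NW (UINT32_C(1) << 29)   /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30)   /* Cache Disable     */

/* CR0 value for normal (`fill') caching: CD=0, NW=0. */
static uint32_t cr0_fill(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* CR0 value for `no-fill' mode: CD=1, NW=0. */
static uint32_t cr0_no_fill(uint32_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}
```

All other CR0 bits (such as the protected mode enable bit) are left
untouched by both helpers.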

While the `fill'/`no-fill' caching modes affect all memory locations,
there are additional mechanisms for disabling caching for certain areas of
memory, without affecting other areas. Intel CPUs up to the Pentium (with
or without MMX technology) can only do this in hardware; the system board
sets a certain pin of the processor to a certain logic level to indicate
that the accessed memory address should not be cached. 

The P6 family (Pentium Pro, II and III) allow software control of memory
region specific caching, but to a limited extent. A number of Memory Type
Range Registers (MTRRs) are provided, which each set the caching mode for
a particular (fixed or variable) range of memory. Two problems arise when
trying to use MTRRs in application software:

  1. The BIOS sets the MTRRs to computer specific values at system boot.
     There may or may not be MTRRs left for use by application software,
     depending on system configuration. Additionally, defining an
     application specific MTRR may interfere with BIOS defined MTRRs,
     resulting in unspecified (hence unpredictable) behaviour.

  2. The MTRRs are in the set of `Model Specific Registers', which means
     that there is no guarantee for them to be supported in future Intel
     processors. Application software that uses these MTRRs may well become
     useless in a few years.

These considerations lead to the conclusion that memory range specific
caching settings should not be used in programs that should function on a
wide variety of systems. (Note that you can still customize the program
for your specific needs to get the highest performance possible. However,
this is not advisable.) 

MemMXtest therefore restricts its attention to the global caching
controls. The main disadvantage of using only this technique is that, when
data caching is switched off, code caching will be switched off too, so
the testing program will run slowly. However, there are ways to get around
this problem. 

The caching system is organized as sketched below.


   Processor <--- L1 cache <--- L2 cache <--- (L3 cache) <--- Main Memory
                 code / data


On all Pentium processors, there is a separate code and data part for the
L1 (Level 1) cache. Starting from the Pentium Pro, the L2 cache is `on
board' (i.e. very close to the processor) and runs at full processor
speed or a fixed fraction of it (half speed on most Pentium II models).
The L3 cache, if present, is located on the system board and runs at a
slower speed. 

Two rules that are of paramount importance for MemMXtest are not mentioned
in the Intel documentation:

- If there is free space in L1, the requested code/data is cached in L1
  only, and not in L2 or L3.

- Memory locations cannot be cached in L1-data and L1-code at the same
  time. When a memory location that was in L1-code is read as data, it
  will be marked as `invalid' in L1-code. This happens even when global
  caching is set to no-fill mode. (If caching is in fill mode, the
  location will be re-read from memory and cached in L1-data.)

In MemMXtest a structure is used that, after invalidating (=emptying) the
caches, pre-charges them by reading the required regions of data and
"pre"-executing the test code for a small test region just below 640kB.

The caching priority is as shown in the diagram below.

                           _
    Error reporting area   |  unimportant             data
                           |  (not used often)
    Update_adr code        |                          code
                           |
    Pseudo-random tables   |                          data
    Test code              V  most important          code


This leads to the following structure for calling a test element in
MemMXtest. 


  Cache: ON                       HALF                      OFF

                            get "pre"-params
                            copy "pre" upadr code
                              invalidate cache
                            read error report.area
                            call upadr_start
                            call test ("pre")
                              cache to no-fill
  get real params           get real params           get real params
  copy real upadr code      copy real upadr code      copy real upadr code
                              cache to fill
  call upadr_start          call upadr_start          call upadr_start
                            [read PR tables]          (read PR tables)
                              cache to no-fill          cache to no-fill
                                                        invalidate cache
  call test                 call test (real)          call test
    (cache to fill)           cache to fill             cache to fill
    (invalidate cache)        invalidate cache          invalidate cache
  end.                      end.                      end.


The "pre"-test always uses its own parameters (data patterns etc.) and
address updating scheme (also see under "THE ADDRESS UPDATING CODE" 
below). The upadr_start call sets up the start and end addresses (for both
"pre" and "real" tests), and also executes the address updating code that
is used by the test. The Pseudo-Random tables are read only by the
Pseudo-Random test elements. During the "real" test, all needed code and
data is available in the caches; the tested memory region however is not
in any cache. 

Aliasing is no problem here, because all caches are at least 2-way set
associative (each memory location can be cached in at least two different
cache lines), and the L1 and L2 caches also seem to complement each other. 

This structure is implemented in the general march element template
marchel_ml_template.S; see there for more info.


EXAMPLE CODE
============

Below is a `pseudo-assembler' version of an Up(rp, wn) march element. The
notation is a little unusual, but translates easily to a format that is
used by the GNU `as' assembler (this assembler has an "opcode  source,
destination" instruction format, this pseudo code has "opcode  source1,
source2 -> destination").


    #include "marchel_ml_template.S"


    loop:   movq    [EAX] -> MMin          |    Registers:
            movq    MMin -> MMinbak        |      32-bit:
            pcmpeqd MMin, MMp -> MMin      |    EAX: current address
 read       psrlq   MMin, MM16 -> MMin     |    EBX: autonomous increment
  &         movd    MMin -> EDX            |    ECX: <unused>
 test       not     EDX -> EDX             |    EDX: temporary values
            test    EDX, EBP               |    ESI: mem-adr start of current
            jnz     fault                  |         autonomous range
                                           |    EDI: proc-adr end of current
    write:  movq    MMp' -> [EAX]          |         autonomous range
                                           |    EBP: 0xFFFFFFFF  (AND value)
            add     EBX, EAX -> EAX        |      64-bit MMX:
 next       cmp     EAX, EDI               |    MMp    : p pattern
 addr       jbe     loop                   |    MMp'   : not-p pattern
            call    update_adr             |    MMin   : data read from memory
            jnc     loop                   |    MMinbak: backup of read value
            <return>                       |    MM16   : 16=0x10 (shift amount)


    fault:  store the fault pattern (Addr: EAX, Expect: MMp, Read: MMinbak)
            jmp   write


The marchel_ml_template.S file contains a template that uses the actual
test code as a subroutine. The functionality of the template is discussed
in the section "CACHING" above.

Before this routine is called, EAX is loaded with the start address, EBX
with the autonomous increment, EDI with the end of the autonomous range
(also see below), and of course the correct values are in the MMX
registers.

The `read & test' instructions read one 64-bit memory word, that is backed
up in case there is an error. The read value is then compared to the
expected value on a 32-bit basis. If both values are the same, all bits of
the result (MMin) will be 1's. If there are differences, either the left
or the right or both 32-bit sub-words of the result will be set to 0's. 
Since none of the MMX instructions set condition code bits, the result is
shifted 16 bits to the right and then transferred to a 32-bit register
(EDX). So for EDX it now holds that if there is no error, all bits will be
1's, and if there are errors, either the left or the right or both 16-bit
subwords of EDX will be set to 0's. EDX is then inverted which means that
all 0's indicate no errors, which is then tested. 
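
The effect of this instruction sequence can be modelled in portable C,
which may help when checking the reasoning above. This is an emulation
for illustration only (the function names are made up); the real test
uses the MMX instructions directly.

```c
#include <stdint.h>

/* Emulates pcmpeqd on a 64-bit value: each 32-bit half of the result
   is all 1's if the halves are equal, all 0's otherwise. */
static uint64_t pcmpeqd64(uint64_t a, uint64_t b)
{
    uint64_t lo = ((uint32_t)a == (uint32_t)b) ? UINT64_C(0xFFFFFFFF) : 0;
    uint64_t hi = ((a >> 32) == (b >> 32)) ? UINT64_C(0xFFFFFFFF) : 0;

    return (hi << 32) | lo;
}

/* Returns non-zero if `read' differs from `expected' in any bit. */
static uint32_t mmx_mismatch(uint64_t read, uint64_t expected)
{
    uint64_t eq  = pcmpeqd64(read, expected);   /* pcmpeqd            */
    uint32_t edx = (uint32_t)(eq >> 16);        /* psrlq 16, movd     */

    edx = ~edx;                                 /* not EDX            */
    return edx & UINT32_C(0xFFFFFFFF);          /* test EDX, EBP      */
}
```

The shift by 16 is what brings 16 bits of each 32-bit comparison result
into the single 32-bit register, so that one test covers both halves.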

The `write' instruction writes the `not-p' pattern.

The `next addr' instructions finally calculate the next address. This is
done in two parts: the autonomous increment, and, at the end of the
autonomous range, a call to update_adr which calculates the start of the
next autonomous range (see below). If update_adr returns with the carry
bit set, the end of the test is reached. The branching condition to
determine the end of the autonomous range is mutated at the beginning of
marchel_ml_template.S to allow only one test routine to be used for both
`up' and `down' addressing modes. 


Some notes about the optimization of this code for the Pentium processors:
 - Intel processors have a `write buffer' that writes data to memory in
   the background, but still `in order'. Therefore the `write' instruction
   should be placed immediately after the read instruction (labeled
   `loop:').
 - The `read & test' block has too many data dependencies and can't be
   rescheduled. Register renaming on software level is impossible because
   there are no free registers.
 - The branch prediction mechanism may be `pre-loaded' with correct
   default values; this is done by the cache-filling "pre"-test.
 - All instructions (except read/write) operate on registers only,
   resulting in shorter machine instructions and faster execution.
 - Branch target labels should be aligned on 16-byte boundaries for optimal
   performance.


THE ADDRESS UPDATING CODE
=========================

The code to update addresses during the tests is probably the conceptually
most difficult code in the program. Don't panic if you don't understand it
the first time. 

Basically the "problem" is that we're dealing with two different kinds of
addresses, the processor address and the memory address. See the section
"ADDRESS BIT MAPS" above.

During a test, the "address situation" is basically as follows:

  1. Processor start and end addresses are passed to the test (the march
     element, to be exact) from the main program
  2. Start and end are converted from processor to memory addresses
  3. `Current memory address' is memory start address
  4. Convert `current memory address' to `current processor address'
  5. Test location `current processor address'
  6. Increment `current memory address' by 1
  7. Go to 4 (until finished)

So we have to translate addresses both ways. The functions proc2mem() and
mem2proc() in update_adr.c do this the C way, and bit-by-bit. Both examine
the address bit map from right to left, and use that to decide where the
specific memory address bit comes from (proc2mem) or goes to (mem2proc).
The first bit that comes from/goes to a bit number >=32 (counting from 0; 
usually 0xFF) is seen as the end of the map, and subsequent bits come
from/go to positions following the largest previously used position. 
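
A bit-by-bit conversion along these lines could look like the sketch
below. It is not the code from update_adr.c: the names end in _sketch to
make that clear, and the in-memory layout of the map (a byte array in
which entry i, counted from the right of the printed map, gives the
processor address bit for memory address bit i) is an assumption made for
this example.

```c
#include <stdint.h>

#define MAP_END 0xFF   /* any entry >= 32 ends the map */

/* Memory address -> processor address. */
static uint32_t mem2proc_sketch(uint32_t mem, const uint8_t *map)
{
    uint32_t proc = 0;
    unsigned i = 0, next = 0;

    for (; i < 32 && map[i] < 32; i++) {
        if ((mem >> i) & 1u)
            proc |= UINT32_C(1) << map[i];
        if (map[i] + 1u > next)
            next = map[i] + 1u;         /* one past the largest used */
    }
    for (; i < 32 && next < 32; i++, next++)    /* implicit rest of map */
        if ((mem >> i) & 1u)
            proc |= UINT32_C(1) << next;
    return proc;
}

/* Processor address -> memory address (the reverse direction). */
static uint32_t proc2mem_sketch(uint32_t proc, const uint8_t *map)
{
    uint32_t mem = 0;
    unsigned i = 0, next = 0;

    for (; i < 32 && map[i] < 32; i++) {
        if ((proc >> map[i]) & 1u)
            mem |= UINT32_C(1) << i;
        if (map[i] + 1u > next)
            next = map[i] + 1u;
    }
    for (; i < 32 && next < 32; i++, next++)
        if ((proc >> next) & 1u)
            mem |= UINT32_C(1) << i;
    return mem;
}
```

With a map such as { 15, 16, 17, 18, 19, 20, 21, 22, 24, MAP_END },
memory address bit 0 ends up in processor address bit 15, bit 8 in bit 24
and bit 9 (past the end of the map) in bit 25.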

Using these C functions during the test (step 4) of course has disastrous
effects on the speed. Four speedup factors have been implemented:

  a. Machine code for mem2proc during test
  b. Converting ranges of bits instead of individual bits
  c. Converting only when necessary
  d. Run-time compilation: fast code and immediate values

ad a. Manually optimized machine code is used (effectively making the
mem2proc() C function useless).

ad b. The actual bit mappings show large bit ranges that can be converted
at once by shifting them together to their rightful place. If this is
enabled (`allowranges'), the address bit map is basically split into
ranges with consecutive bit numbers (sometimes only one bit in a range).
The memory address is then ANDed with a value that extracts just that
range, the result is shifted as appropriate and then ORed to the result.
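
In C, the AND/shift/OR scheme for ranges could be sketched like this. The
struct and function names are invented for the example, and only left
shifts are handled, which assumes that every memory address bit moves to
an equal or higher processor address bit.

```c
#include <stddef.h>
#include <stdint.h>

struct bit_range {
    uint32_t mask;    /* selects the memory address bits of this range */
    unsigned shift;   /* left shift that moves them into place         */
};

/* Converts a memory address using whole ranges instead of single bits. */
static uint32_t mem2proc_ranges(uint32_t mem,
                                const struct bit_range *r, size_t count)
{
    uint32_t proc = 0;

    for (size_t i = 0; i < count; i++)
        proc |= (mem & r[i].mask) << r[i].shift;    /* AND, shift, OR */
    return proc;
}
```

For a map in which memory address bits 0-7 go to processor bits 15-22 and
bit 8 goes to bit 24, two ranges suffice: { 0x0FF, 15 } and { 0x100, 16 }.
That replaces nine single-bit steps by two.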

ad c. Usually there will be a range of consecutive bit mappings at the
righthand side of the address bit map (like memory address bits #7 - #0,
with mappings 22 - 15, in the fast-x example discussed above). At the
start of the test they will be filled with all 0's, then increment by 1
each step (memory address!) until they are all 1's. The next increment by
one is the first step that changes anything outside the righthand-side
consecutive range (in the fast-x example, the bit with mapping 24). After
that increment, the consecutive bits are all 0's again, and so on.

Because the mappings are consecutive, there is also a range of bits in the
processor address that exhibits exactly the same behaviour. In the fast-x
example, bits #22 - #15 are all 0's at the start of the test, then bit #15
is incremented each step until all bits in the range #22 - #15 are 1's.
Then "something happens" which results in bits #22 - #15 being set back to
all 0's, and the counting starts again.

Since both the processor address increment (bit #15) and the "temporary" 
processor start and end addresses (bits #22 - #15 all 0's/1's) can easily
be determined in advance, a simple loop can just test the memory range
between a given processor start and end address with the given increment,
without bothering to convert to or from memory addresses. This is called
`autonomous testing' and is fully exploited in MemMXtest. During the test,
the machine code routine update_adr calculates the new processor start and
end addresses (procstartautonom and procendautonom) while also keeping
track of the current memory start address (memstartadr). The processor
address increment (autonominc) does not change during the test, and is
calculated in the C functions prepare_update_adr_up() and _dn(). 
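
The idea of autonomous testing can be modelled in a few lines of C. This
is a strongly simplified sketch with a hard-coded, hypothetical address
map (memory bits 0-7 to processor bits 15-22, all higher memory bits to
processor bits 24 and up). Instead of testing memory it just records the
processor addresses it would visit; the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed address map, used only for this example. */
static uint32_t mem2proc_example(uint32_t mem)
{
    return ((mem & 0xFFu) << 15) | ((mem >> 8) << 24);
}

/* Stores the processor addresses for memory addresses 0..count-1 in
   out[], converting only once per autonomous range of 256 locations.
   count must be a multiple of 256.  Returns the number of conversions. */
static size_t autonomous_walk(uint32_t *out, size_t count)
{
    const uint32_t autonominc = UINT32_C(1) << 15;
    size_t n = 0, conversions = 0;

    for (uint32_t memstart = 0; memstart < count; memstart += 256) {
        /* This part corresponds to what update_adr does. */
        uint32_t start = mem2proc_example(memstart);
        uint32_t end   = start + 255u * autonominc;

        conversions++;
        /* Autonomous range: plain add-and-compare, no conversion. */
        for (uint32_t adr = start; adr <= end; adr += autonominc)
            out[n++] = adr;
    }
    return conversions;
}
```

For 1024 memory locations this needs only 4 conversions, yet visits
exactly the same processor addresses, in the same order, as converting
every single memory address would.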

ad d. The machine language implementation of update_adr (with included
address conversion) could have been implemented elegantly by using
conversion tables, reserved addresses for variables and so on. This would
however have made the routine very slow because of extra data memory
accesses per instruction and many unnecessary loops with branches. 

These problems have been solved by applying a technique of run-time
compilation and immediate values. Instead of creating a simple table with
the address bit mappings, an "extended" table is created with additional
machine code that performs the conversion and also various other updates
and checks. This is the update_adr code, which in this way effectively has
the needed data handy right in the code itself (`immediate' values for the
instructions). This `compilation' is done in the C functions
prepare_update_adr_up() and _dn().

As indicated above, the prepare_update_adr_up/dn() functions are called
from marchel_up/dn_template(), which are called from each march element's
C function. In marchel_up/dn_template(), two distinct update_adr code
blocks are generated, one for the cache-filling `pre'-test and the other
for `real' use. Start addresses and lengths of both blocks
(pre_adrcode(_len) and real_adrcode(_len)) are passed to the machine
language test code via the `param' mechanism. The first part of the
machine code then copies these blocks to a special reserved memory region
to enable optimal caching and to provide one entry point to the actual test
routine (namely update_adr).


For more information regarding the functionality of update_adr, refer to
the comments at the beginning of update_adr_ml.S.


IMPLEMENTING A NEW MARCH ELEMENT
================================

To implement a new march element, use the following procedure.

1. Create a marchel/marchel_*_ml.S file by copying an existing one and
   adding/removing a few things.

2. Create a marchel/marchel_*.c file in the same way.

3. Add the name of the marchel/marchel_*.c file to the CSSRCS variable in
   the Makefile.

4. Add the prototype of the C function to tests.h.

The same procedure applies for a new pseudo-random test element; change
the file names as appropriate. 

Note that _every_ .S file is called *_ml.S and has an accompanying .c
file. The one and only exception is the head.S file.

The march element designation (like "rp_wn") always starts with the
`positive' pattern, as does the argument list of the C function. The march
test then calls the elements with either `normal' ("element(p,n)") or
`reversed' ("element(n,p)") arguments. 


IMPLEMENTING A NEW MARCH TEST
=============================

A march test does nothing more than calling several march elements. To
implement a new one, use the following procedure.

1. Create a march/march_*.c file by copying an existing one and
   adding/removing a few things.

2. Add the name of the march/march_*.c file to the CSRCS variable in the
   Makefile.

3. Add the prototype of the C function to tests.h.

4. In testseq() in tests.c, add a new 1xx test number that calls your
   new test, and also add your test to the "everything"-sequence (nr.
   1000).

5. Add a description of the test to this MANUAL and the CAP-SHEET.

A similar procedure applies for a new pseudo-random test.

As mentioned above, the first action (read or write) taken by a march
element always uses the pattern that was passed as first argument to the
C function. In the formal definitions of the march tests,
the first action of any element can use either a `0' or a `1'
(possibilities are `r0', `r1', `w0' and `w1'); in general the first
element of a complete test is a `w0' action. In the implementation, that
uses `p' and `n' instead of `0' and `1', the convention is that each march
test starts with a `wp' action. This leads to the situation that, in most
cases, any element starting with a `0'-action in the formal definition has
an "element(p,n)" calling scheme, while any element starting with a
`1'-action has "element(n,p)". However, some caution is recommended.
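
The calling convention can be illustrated with a toy version of a short
march test. MATS+, { Any(w0); Up(r0,w1); Down(r1,w0) }, is used here only
because it is brief. The sketch works on a plain array, and all function
names and signatures are made up; the real march elements have different
names and parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* Up(wp): write p everywhere, in increasing address order. */
static void el_up_w(uint64_t *mem, size_t len, uint64_t p)
{
    for (size_t i = 0; i < len; i++)
        mem[i] = p;
}

/* Up(rp, wn): returns the number of read mismatches. */
static size_t el_up_r_w(uint64_t *mem, size_t len, uint64_t p, uint64_t n)
{
    size_t faults = 0;

    for (size_t i = 0; i < len; i++) {
        if (mem[i] != p)
            faults++;
        mem[i] = n;
    }
    return faults;
}

/* Down(rp, wn): same element, decreasing address order. */
static size_t el_dn_r_w(uint64_t *mem, size_t len, uint64_t p, uint64_t n)
{
    size_t faults = 0;

    for (size_t i = len; i-- > 0; ) {
        if (mem[i] != p)
            faults++;
        mem[i] = n;
    }
    return faults;
}

/* MATS+ with `p' in the role of `0' and `n' = not-p in the role of `1'. */
static size_t march_mats_plus(uint64_t *mem, size_t len, uint64_t p)
{
    uint64_t n = ~p;
    size_t faults = 0;

    el_up_w(mem, len, p);                   /* w0    -> wp           */
    faults += el_up_r_w(mem, len, p, n);    /* r0,w1 -> element(p,n) */
    faults += el_dn_r_w(mem, len, n, p);    /* r1,w0 -> element(n,p) */
    return faults;
}
```

The last element starts with a `1'-action in the formal definition, so it
is called with reversed arguments, although it is the same kind of
element as the one before it.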


IMPLEMENTING A NEW PATTERN GENERATOR
====================================

To implement a new pattern generator, use this procedure.

1. In pattern_gen.c, create a new function pattern_*() taking one of the
   existing functions as an example.

2. At the beginning of pattern_gen.c, add a prototype for the new
   function.

3. In pattern_generate64() in pattern_gen.c, add a new number which calls
   the new pattern generator.

4. Add a description of the generator to this MANUAL and the CAP-SHEET.

Note that the generator must re-calculate the pattern on every call using
the patterncount variable; a generator is not supposed to remember
anything between calls. (In fact, that may well result in erroneous
behaviour.)
The generator should set *p and *n with the calculated patterns and return
1; except when patterncount is too high, in which case it should return 0.
The program will then reset the pattern counter to 0 and try again.
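
As an example, a generator for the "walking one per nibble" pattern could
be sketched as follows. The function name and signature are assumptions
made for this example (in particular, patterncount is passed as an
argument here); take the existing functions in pattern_gen.c as the real
reference.

```c
#include <stdint.h>

/* Hypothetical generator: 0001.., 0010.., 0100.., 1000.., then 0000..
   Sets *p and *n and returns 1, or returns 0 if patterncount is too
   high. */
static int pattern_walk1_nibble_sketch(unsigned patterncount,
                                       uint64_t *p, uint64_t *n)
{
    if (patterncount > 4)
        return 0;                           /* no more patterns        */
    if (patterncount < 4)
        *p = UINT64_C(0x1111111111111111) << patterncount;
    else
        *p = 0;                             /* final all-zeros pattern */
    *n = ~*p;                               /* n is the inverse of p   */
    return 1;
}
```

The pattern is derived from patterncount alone, so nothing has to be
remembered between calls.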