Title:                   DEC Rdb against the World
Moderator:               HERON::GODFRIND
Created:                 Fri Jun 12 1987
Last Modified:           Thu Feb 23 1995
Last Successful Update:  Fri Jun 06 1997
Number of topics:        1348
Total number of notes:   5438
1013.0. "Oracle, nCUBE and the 1073 tpsB benchmark" by TPS::ABBOTT (Robert Abbott) Mon Oct 21 1991 17:50
+---------------------------+ TM
| | | | | | | |
| d | i | g | i | t | a | l | I N T E R O F F I C E M E M O
| | | | | | | |
+---------------------------+
TO: Distribution DATE: October 18, 1991
FROM: Robert Abbott
DEPT: TNSG Software Performance Group
DTN: 227-4411
LOC/MAIL STOP: TAY1
ENET: TPS::ABBOTT
SUBJECT: Oracle, nCUBE and the 1073 tpsB Benchmark.
In July 1991, Oracle Corporation announced the results of a TPC-B benchmark
that achieved 1073 tpsB at a cost of $2,482 per tpsB. These are industry
leading results for both performance and price/performance.
This memo explains how these results were achieved. I first discuss the
architecture of the nCUBE 2 machine, then the architecture of the ORACLE
Parallel Server and how its implementation on the nCUBE differs from the
implementation on VAXclusters. Finally, I summarize the benchmark configuration
and present some conclusions.
Slides for a presentation on this material can be found at
TPSDOC::SYS$PUBLIC:[TP_PERFORMANCE]ORACLE_NCUBE_SLIDES.PS. This memo is an
abbreviated script for that presentation.
Thanks to John Sopka for supplying information about nCUBE and other massively
parallel machines.
1. The nCUBE 2 machine.
1.1 Architecture overview
The nCUBE 2 Scalar Supercomputer, built by the nCUBE Corporation, is a MIMD
parallel machine with loosely coupled processors connected via a hypercube
interconnect.
MIMD, or Multiple Instruction (streams), Multiple Data (streams) means that
processors execute independent threads of control. (In a SIMD architecture,
Single Instruction Multiple Data, processing elements operate synchronously,
executing the same instruction in lock-step.)
In a loosely coupled architecture, each block of memory is directly addressable
by only a single processor. Data sharing among processors occurs through message
passing. Each processor runs its own copy of the operating system. VAXclusters are an
example of a loosely coupled system. Loosely coupled parallel machines are
often called multi-computers. (Tightly coupled, or shared memory, systems are
often called multi-processors, e.g., VAX SMP.)
A hypercube is a point-to-point, multi-step, communication network that is
characterized by a dimension d. A dimension d cube has 2**d nodes.
A hypercube can be defined inductively: a cube of dimension d+1 is formed by
joining two cubes of dimension d using 2**d edges. Each edge connects one node
in one sub-cube with its companion node in the other sub-cube.
A hypercube of dimension 0, or 0-cube, consists of a single node. A 1-cube is
formed by connecting two 0-cubes with a single edge. A 2-cube is formed from
two 1-cubes and two additional edges; it looks like a square. Finally, a 3-cube
is formed by connecting two 2-cubes with four edges. The 3-cube has 8 nodes and
looks like a familiar cube. (If this seems confusing, try drawing some
pictures, or consult the slides.)
The dimension d tells us more about the hypercube than simply the number of
nodes. A d-cube has d*2**(d-1) edges. Each node in the cube is directly
connected to d neighbors (a point-to-point network.) The maximum distance
between any two nodes is d. Thus a message between two nodes can travel over at
most d links (edges) in the network (a multi-step network.)
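As a quick illustration (this Python sketch is mine, not part of the original
memo), node labels can be taken as d-bit numbers: two nodes are neighbors exactly
when their labels differ in one bit, and the number of links a message must cross
is the Hamming distance between the labels, which is at most d.

    # Illustrative sketch (not from the memo): basic d-cube properties.
    def hypercube_stats(d):
        nodes = 2 ** d              # a d-cube has 2**d nodes
        edges = d * 2 ** (d - 1)    # and d*2**(d-1) edges
        return nodes, edges

    def neighbors(node, d):
        # Flipping each of the d bits in a node's label gives its
        # d directly connected neighbors.
        return [node ^ (1 << bit) for bit in range(d)]

    def distance(a, b):
        # A message crosses one link per differing bit, so the maximum
        # distance between any two nodes is d.
        return bin(a ^ b).count("1")

    print(hypercube_stats(3))       # (8, 12): the familiar 3-cube
    print(neighbors(0, 3))          # [1, 2, 4]
    print(distance(0b000, 0b111))   # 3: opposite corners of the 3-cube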
1.2 Product overview
An nCUBE 2 machine can be configured with 32 to 8192 (13-cube) processing nodes
(p-nodes.) The machine can accommodate 128 processor array boards; each
board contains 64 p-nodes. A fully configured machine could have 512 GB of memory
and, according to nCUBE literature, attain processing speeds of 27 GigaFLOPS and
60K MIPS. There are 29 IO board slots available, one of which must be used for a
host interface. That leaves 28 IO boards, each with 16 IO processing nodes
(io-nodes.) Each io-node has one SCSI controller that serves 7 SCSI devices. Thus
the total storage capacity is up to 3136 SCSI devices (28*16*7.)
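The maximum-configuration figures above follow from simple arithmetic; the short
Python sketch below, written for this summary, just reproduces the counts from
the numbers quoted in this section.

    # Arithmetic check using only the figures quoted in this memo.
    P_NODES_PER_BOARD = 64
    MAX_PROCESSOR_BOARDS = 128
    MAX_MEMORY_PER_P_NODE_MB = 64       # see section 1.2.1
    IO_BOARDS = 28                      # 29 I/O slots minus one host interface
    IO_NODES_PER_BOARD = 16
    SCSI_DEVICES_PER_IO_NODE = 7

    max_p_nodes = P_NODES_PER_BOARD * MAX_PROCESSOR_BOARDS            # 8192 (a 13-cube)
    max_memory_gb = max_p_nodes * MAX_MEMORY_PER_P_NODE_MB // 1024    # 512 GB
    max_scsi = IO_BOARDS * IO_NODES_PER_BOARD * SCSI_DEVICES_PER_IO_NODE  # 3136

    print(max_p_nodes, max_memory_gb, max_scsi)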
1.2.1 Processing nodes
A p-node contains one custom VLSI, 64 bit, single-chip CPU. It is fabricated in
1.2 micron CMOS and has 460K transistors. The clock speed is 20MHz and the CPU
is rated at 7.5 MIPS. By today's standards, that is not a particularly fast
CPU. nCUBE claims that individual CPU speed was sacrificed for a lower chip
count and consequently higher system reliability.
A p-node can have a maximum of 64 MB memory. There is an independent 64-bit
IEEE floating point unit and an independent message routing unit implemented in
hardware. This is an improvement over earlier nCUBE models which did message
routing in software.
A p-node has 14 Direct Memory Access (DMA) channels, 28 uni-directional
channels arranged in pairs. Thirteen channels are used for the hypercube
interconnect. The remaining channel is the system I/O channel; a node talks to
the IO subsystem over this channel. Each channel transmits bit-serially at a
maximum speed of 2.5 MB/sec. DMA operation means that data transmission and CPU
processing can proceed simultaneously.
Processing nodes can be packaged 32 or 64 to a board. Processor array boards
plug into a backplane to form higher-dimension cubes. Thus two 6-cube boards
combine to form one 7-cube, or three boards can make one 7-cube and one 6-cube.
Packaging is dense, and the system is water-cooled.
1.2.2 I/O subsystem
An I/O node (io-node) is essentially identical to a processing node. An io-node
has 14 DMA channels, 8 of which connect to 8 p-nodes. The remaining 6
channels are used to interconnect the io-nodes. Thus each p-node is directly
connected to exactly one io-node and each io-node serves 8 p-nodes. Note that a
p-node does not have direct access to all storage devices, only to the devices
served by its one io-node. Other disks are reached through the network that
interconnects the io-nodes.
An io-node has one SCSI controller which serves 7 SCSI devices. Io-nodes are
packaged 16 to a board.
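To make the topology concrete, here is a hypothetical mapping (the actual
p-node/io-node assignment is not documented in this memo; consecutive groups of
eight are just an assumption):

    # Hypothetical p-node/io-node mapping; the grouping is assumed, not documented.
    def io_node_for(p_node):
        # Each io-node serves exactly 8 p-nodes.
        return p_node // 8

    def local_devices(io_node):
        # One SCSI controller per io-node, serving 7 devices.
        return ["io%d.scsi%d" % (io_node, unit) for unit in range(7)]

    # p-node 13 reaches only its own io-node's devices directly; all other
    # disks are reached through the io-node interconnect.
    print(io_node_for(13), local_devices(io_node_for(13)))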
1.2.3 Software
Each processing node runs its own copy of the Vertex operating system. Vertex
is a micro-kernel with support for process loading and scheduling and message
handling and routing. There is also support for debugging programs under host
control.
The full-blown operating system is called Axis and runs on a host board. (As
near as I can tell, a host board is configured with one processor node, memory
and connections for terminal devices. I do not know how many host boards can be
configured in a cube.) Axis is a UNIX-like OS with extensions for cube
management (e.g., process loading.) Axis views the cube as a device. There are
utilities that allow a cube to be divided into sub-cubes and managed as such.
Different applications would run in separate sub-cubes. Axis also contains a
special file system to deal with the unique IO sub-system.
The nCUBE can also be hosted by a SUN workstation, as was done in the benchmark
test.
2. The ORACLE Parallel Server
ORACLE Parallel Server technology first appeared in Oracle version 6.2. It
allows multiple database server instances to coherently share a single common
database in a loosely coupled processor configuration. This technology does not
employ two-phase commit.
Each Oracle instance:
- resides on its own node
- has its own memory
- serves multiple clients
- accesses the same database
Oracle employs a deferred update strategy, which means that the most recent
copy of a database page can be located in main memory, and not on the disk.
Since a single database can be updated concurrently on multiple nodes, some
mechanism is needed to ensure serializability of transaction execution
histories. Oracle employs a two-level locking scheme to solve the problem of
coordinating access to a single database from multiple, independent database
servers.
The first level is the Parallel Cache Manager which globally tracks the
physical location of every data block. The cache manager ensures that an
instance updates the most recent copy of a database page. The second level of
locking is an Oracle-native concurrency control manager. This type of locking
occurs only within a single node, or instance of Oracle.
2.1 VAXcluster implementation
On a VAXcluster the Parallel Cache Manager is implemented using the VMS
distributed lock manager (DLM.)
When an instance wants to modify a data block on behalf of a transaction, it
first checks to see if it already holds a DLM lock with the appropriate access
rights to that block. If it does, then no further DLM lock operations are
necessary. The instance will read the block from disk if it is not already in the
shared buffer cache.
If the instance does not hold the lock then it will attempt to acquire the lock
from the DLM. If no other instance holds the lock then the lock is granted
immediately. If another instance holds the DLM lock, then the DLM will ask that
instance to release the lock. Before releasing, the holding instance writes all
modified pages of that data block to disk. The lock is then released and granted
to the requesting instance, which can read the desired data block from disk into
its buffer cache. Since the modified pages were written to disk, the data block
will contain the most current updates.
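The following self-contained sketch (Python; all names invented, not Oracle's
code) shows the shape of this flow: a requesting instance forces the current
holder to flush and release before it reads the block from disk.

    # Simplified sketch of the block-ownership flow described above.
    DISK = {}                           # block_id -> block contents

    class Instance:
        def __init__(self, name):
            self.name = name
            self.cache = {}             # shared buffer cache: block_id -> contents
            self.dirty = set()

        def flush(self, block_id):
            # Write modified pages of the block back to disk.
            if block_id in self.dirty:
                DISK[block_id] = self.cache[block_id]
                self.dirty.discard(block_id)

    class DistributedLockManager:
        def __init__(self):
            self.holder = {}            # block_id -> Instance

        def acquire(self, block_id, requester):
            holder = self.holder.get(block_id)
            if holder is requester:
                return                  # lock already held: no DLM traffic needed
            if holder is not None:
                holder.flush(block_id)  # holder flushes, then releases
                holder.cache.pop(block_id, None)
            self.holder[block_id] = requester

    dlm = DistributedLockManager()

    def modify_block(instance, block_id, new_contents):
        dlm.acquire(block_id, instance)
        if block_id not in instance.cache:
            # The disk copy is current because the previous holder flushed.
            instance.cache[block_id] = DISK.get(block_id)
        instance.cache[block_id] = new_contents
        instance.dirty.add(block_id)

    a, b = Instance("A"), Instance("B")
    modify_block(a, 42, "rows updated by A")
    modify_block(b, 42, "rows updated by B")    # forces A to flush block 42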
Oracle claims that the transfer of ownership of a data block from instance A to
instance B can occur without forcing transactions on instance A to commit
before the transfer occurs. In other words, transactions on different nodes can
concurrently update different rows within the same datablock, even the same
database page. Since the VMS DLM is used only for cache management and not for
concurrency control, it appears that information about what types of locks are
currently held on a page is flushed to disk with the page when the transfer of
data block ownership occurs. The requesting instance can read this information
to learn which rows are locked by another instance.
Oracle also claims that the Parallel Server can perform an automatic on-line
recovery of the database in the event of a node (instance) failure.
2.2 nCUBE implementation
For the nCUBE machine, Oracle has implemented its own distributed lock manager
that runs in a separate process. More than one process can be used.
The basic algorithm is the same as the VAXcluster implementation with one
important difference: when ownership of a datablock is transferred, the
datablock is transferred directly from the cache of one instance to the cache
of the requesting instance. The transfer is done using the hypercube
interconnect and may require the data to pass through intermediary nodes in the
network. (Recall that a hypercube is a multi-step, point-to-point network.)
This ability to avoid 'bouncing the data off the disk' is very important in a
hypercube environment where processor nodes are not directly connected to all
(most) of the data disks.
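In terms of the sketch above, only the transfer step changes: the block moves
cache-to-cache instead of being written to and re-read from disk. This variant
extends the earlier illustrative classes and is again a sketch, not Oracle's
implementation.

    # Variant of the earlier DistributedLockManager: on ownership transfer,
    # the block travels directly from the holder's cache to the requester's
    # cache over the hypercube interconnect (possibly through intermediate
    # nodes), instead of being bounced off the disk.
    class NcubeLockManager(DistributedLockManager):
        def acquire(self, block_id, requester):
            holder = self.holder.get(block_id)
            if holder is requester:
                return
            if holder is not None and block_id in holder.cache:
                # The assignment below stands in for the message-passing
                # transfer over the hypercube interconnect.
                requester.cache[block_id] = holder.cache.pop(block_id)
                if block_id in holder.dirty:
                    holder.dirty.discard(block_id)
                    requester.dirty.add(block_id)
            self.holder[block_id] = requester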
I was unable to learn how Oracle handles failures in an nCUBE. Can the Oracle
lock manager rebuild the lock tree after a node failure? What happens if one of
the lock manager processes fails? Please forward any information in this area.
2.3 Performance
The Oracle Parallel Server will exhibit good performance (i.e., scaling) on
applications that have very low levels of inter-node data sharing. High levels
of intra-node sharing are not a problem. Low levels of inter-node sharing are
natural for applications that are mostly read-only (e.g., decision support) or
applications that randomly update records in a large database.
A low level of inter-node data sharing can also be achieved by partitioning the
database among nodes. The partitioning is accomplished by controlling where
transactions execute. A transaction that updates partition A is executed on
Node A, and a transaction that updates partition B is executed on Node B. Data
from a partition will be buffered nearly exclusively in the single node that
executes transactions against that partition. Partitioning does not mean that
there is more than one database.
This is how Oracle ran the TPC-B benchmark.
3. The TPC-B Benchmark
The benchmark used a single database scaled for 1120 tpsB. This database was
'partitioned' over 56 application nodes, i.e., nodes that executed
transactions. Thus there were 20 branches per node. Each application node
had 12 driver processes that generated teller IDs over a 20 branch range.
A driver generates transactions for tellers that are local to its node. Thus
the TELLER relation is completely partitioned among the 56 application nodes.
There is no inter-node data sharing for the TELLER relation. Since each teller
is associated with a single branch, the BRANCH relation is also partitioned
completely over the 56 application nodes. There is no inter-node data sharing
for the BRANCH relation. A driver generates accounts that are foreign to its
branch with probability .15. Thus the ACCOUNT relation is also partitioned
effectively, with inter-node sharing occurring infrequently.
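A sketch of one driver's input generation under this scheme (Python; the
20-branch range and the 0.15 foreign-account probability come from the
description above, the 10-tellers/100,000-accounts-per-branch figures are the
standard TPC-B scaling, and everything else is invented for illustration):

    # Illustrative driver sketch for the partitioning described above.
    import random

    BRANCHES_PER_NODE = 20
    TOTAL_BRANCHES = 56 * BRANCHES_PER_NODE     # database scaled for 1120 tpsB
    TELLERS_PER_BRANCH = 10                     # standard TPC-B scaling
    ACCOUNTS_PER_BRANCH = 100000                # standard TPC-B scaling

    def generate_transaction(node_id):
        # Branch and teller are always local to the driver's node, so the
        # BRANCH and TELLER relations are completely partitioned by node.
        branch = node_id * BRANCHES_PER_NODE + random.randrange(BRANCHES_PER_NODE)
        teller = branch * TELLERS_PER_BRANCH + random.randrange(TELLERS_PER_BRANCH)
        # With probability 0.15 the account belongs to a foreign branch, most
        # of which live on other nodes; this is the only source of the
        # (infrequent) inter-node data sharing.
        if random.random() < 0.15:
            account_branch = random.choice(
                [b for b in range(TOTAL_BRANCHES) if b != branch])
        else:
            account_branch = branch
        account = (account_branch * ACCOUNTS_PER_BRANCH
                   + random.randrange(ACCOUNTS_PER_BRANCH))
        return branch, teller, account

    print(generate_transaction(node_id=0))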
When a transaction accesses an account that is foreign to its branch, the
distributed lock manager will be invoked and ownership (i.e., a lock) of the
data block containing that account record will pass from the home branch node
to the requesting node. Since the ACCOUNT relation is very large, only a small
fraction (less than 8%) can be buffered in memory, and it is unlikely that the
home branch node will have the data block buffered. In that case the requesting
node simply reads the account data from disk. Thus the inter-node transfer of
data blocks seldom occurs in this benchmark, and inter-node data contention is
almost nil.
3.1 Benchmark details
A total of 113 nodes were used in the test. There were 64 p-nodes configured on
two processor array boards, forming a 6-cube. Each p-node had 16 MB of memory.
Forty-eight io-nodes (16 with 1 MB and 32 with 4 MB) were used for disk and tape
processing, and 205 disks were allocated across 48 SCSI controllers. One node was
used for the PIB
host interface, i.e., it talked to the SUN workstation which hosted the nCUBE.
A SUN SPARCstation 330 was used to host the nCUBE. Its functions were
- boot the nCUBE
- load Vertex on each p-node
- load device drivers
- start Oracle instances on the nCUBE
- load and start the benchmark driver
The SUN workstation did not do any database work and was not part of the
benchmark driver.
Of the 64 processing nodes, 56 ran the driver programs and executed
transactions. Six p-nodes each ran the Oracle distributed lock manager that
implements the parallel cache management. Two p-nodes ran an archiver server
process. This process copies redo logs to an archive space, see below.
Oracle manages redo logs with multiple log files. When the current log file
fills, a checkpoint is taken and the system begins logging to another log file.
Meanwhile, an archiver server process copies the data from the filled log to a
separate archive space. When the copy completes, the file is truncated and can
be re-used. The benchmark used 3 redo log files, each 2
GB, the maximum size. Two checkpoints spaced 23 minutes apart were taken during
the measurement interval. Throughput dropped to (only !) 500 TPS during the
checkpoints.
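A schematic of the log-switch and archive cycle just described (Python; file
names and the queue-based hand-off are invented for illustration, not Oracle
internals):

    # Schematic redo-log cycle: switch on fill, archive, truncate, re-use.
    from collections import deque

    LOG_FILES = ["redo1", "redo2", "redo3"]     # each 2 GB in the benchmark
    archive_queue = deque()
    current = 0

    def take_checkpoint():
        print("checkpoint (throughput dips here)")

    def log_switch():
        # When the current log file fills, a checkpoint is taken and logging
        # moves to the next file; the filled file is handed to the archiver.
        global current
        take_checkpoint()
        archive_queue.append(LOG_FILES[current])
        current = (current + 1) % len(LOG_FILES)

    def archiver():
        # Runs as a separate server process: copies each filled log file to
        # the archive space, then truncates it for re-use.
        while archive_queue:
            filled = archive_queue.popleft()
            print("archiving and truncating", filled)

    log_switch()
    archiver()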
All database data, logs and the log archives were striped. The redo logs were
mirrored.
4. Conclusions
TPC-B is a perfect benchmark for the ORACLE/nCUBE architecture. By controlling
where transactions execute, they can effectively partition the database among
56 processing nodes with very little inter-node data sharing.
NOTE: Digital has challenged the Oracle benchmark on exactly this point. We
believe that a database must be partitioned in the schema and not via a
restricted access pattern. Thus a partitioned database implies more than one
database; this would require that a two-phase commit protocol be used when
transactions update data in more than one partition.
The ORACLE/nCUBE combination by itself would not perform well on a TPC-A
benchmark. There are two reasons for this. First, like most massively parallel
machines, the nCUBE has a very limited capacity for attaching terminals.
Second, they could not partition the database with a pre-arranged access
pattern. However, a TP monitor could be exploited to solve both of these
problems. Terminal handling and transaction request initiation are done by
conventional front-end nodes. Transaction requests are sent to a TP monitor
that runs on the cube. The monitor routes the requests to the appropriate
database servers in order to achieve the desired partitioning.
Regardless of whether this solution conforms to the TPC-A specification, it
represents a very powerful OLTP engine for applications that can be partitioned
in this manner. There's every reason to believe that Oracle and nCUBE know this
as well.
The Oracle/nCUBE configuration may have other weaknesses:
- How does Oracle handle node failure in this environment?
- How does the nCUBE handle failures?
- How difficult is it to program in this environment? (Not too hard from my
reading of the Oracle benchmark code.)
My guess is that nCUBE system management utilities (Backup etc.) are probably
not up to production level requirements, but these can be improved.
What about Digital's own massively parallel machine DECMPP?
- DECMPP is a SIMD machine. In general, SIMD machines are less well suited
to OLTP applications than MIMD machines.