[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

8789.0. "Peterson's Algorithm/interprocess synchronization" by RHETT::SHEPPARD () Tue Feb 11 1997 09:42

[unix 3.2d, alpha 2100]

A customer has several processes running on a system.  One process
collects and sends information out over the network, communicates with
the other client processes using shared memory. The user tells us that
syncronization of this communication is accomplished using an
algorithm from Tanenbaum's book "Modern Operating Systems" (page 37),
called "Peterson's Solution to Interprocess Communication Synchronization."

The problem is subtle and difficult to reproduce: the user sees
occasional corruption of messages sent through these queues. He has
demonstrated several instances where messages are stuffed into the
shared memory segment intact, but are corrupt once pulled out.

The customer cannot identify how the corruption occurs. He suspects a
problem with synchronization, but says the same algorithm works just
fine on other platforms including 64 bit SUN machines. He thinks the
synchronization problem may even be at the CPU instruction level.

An extended search via AltaVista turned up pointers to a few white
papers on process synchronization issues, but no further data on
"Petersons Algorithm".  We're stumped -- we don't have access to
Tanenbaum's text and couldn't locate other references to the algorithm.

The customer asks:
o  With your knowledge of the alpha chip(s), digital Unix (including
   bugs) and Peterson's solution, can you find a case where Peterson's
   solution breaks down ?

o  Are there any known semaphore bugs ?
   Especially as may relate to shared memory or multi-processor systems ?

Can anyone summarize Peterson's Algorithm for us, or at least refer us
to other useful info ?
Thanks in advance for any information or pointers !

                        Steve Sheppard
                        Digital Unix / Ultrix Applications Support Team
                        [email protected]



[info follows: excerpted from customer email]
------------------------------------------------------------------------------
Background:
We have a product, called Distributed Message Passing System (D-MBX)
(similar to digital's DMQ).  This product allows users to create, read
from and write to mailboxes.  These mailboxes may be located
(transparently) on either the local node or on another networked node.
Part of the architecture of this product has a network reader
(dmnRead) task that reads messages from the TCP/IP network directly
into shared memory space.  These messages are passed to the local node
processing task (mbx).  There is also a network writer task (dmnWrite)
who receives messages that have been placed in shared memory for
writing to other nodes.

Platforms:
This is a Unix based product that (currently) runs on IBM's AIX, HPs
HP-UX and SUN's SunOS and Solaris as well as digital Unix.  We run on
single and multiprocessor boxes.

Shared Memory:
Shared memory is allocated in one hunk at startup and manged by a
library.  The shared memory is organized into two sections; one shared
between the network reader and the mbx and one shared between the mbx
and the network writer.  Access to shared memory is controlled by
Peterson's solution to the mutual exclusion problem (see page 37 of
Andrew S. Tanenbaum's Modern Operating Systems book, Prentice Hall,
1992, or if you can't find one, I'll mail you the code.

The Problem:
We have recently seen messages stored in shared memory become
corrupted.  Architecturally, the only way this could happen is if the
mutual exclusion is failing.  NOTE, we have not ruled out the
possibility that this is an applications problem.  However, we have
only seen this problem on the digital unix platform, first on MP
boxes, but later on single CPU boxes as well.

What I'm Doing:
Obviously, at the highest level, I'm trying to fix the message
corruption problem.  I've so far failed to finger the exact time or
task that is corrupting the memory.  My current strategy is to replace
Peterson's solution with semaphores.  If the problem does not go away
(and I don't hit any semaphore bugs) I can be pretty sure that the
problem is in fact in my application code.  If the problem does go
away, I can't really say anything, but can only guess that either the
problem was in the OS or that the problem is now hidden by the
changes.

What I Want From You:
With your knowledge of the alpha chip(s), digital Unix (including
bugs) and Peterson's solution, can you find a case where Peterson's
solution breaks down?  Are there any known semaphore bugs?  Especially
as may relate to shared memory or multi-processor systems?
------------------------------------------------------------------------------
T.RTitleUserPersonal
Name
DateLines
8789.1short/char and mbs....SMURF::WOODWARDTue Feb 11 1997 10:4714
    two alpha architecture issues to watch...  
    
    1) watch out for shorts and chars in the algorithm.  Alpha architecture
    	supports only integer or long atomic accesses (ok...byte/short
    	has recently been added).  Thus if two shorts are being written in
        the same aligned integer, corruption could take place.
    
    2) watch out for implied ordering of reads and writes in the
       algorithm.  If ordering is needed, them "mb" instructions need to
       be inserted at proper spots in the code.
    
    /jim
    
    
8789.2SMURF::DENHAMDigital UNIX KernelTue Feb 11 1997 12:422
    But if the problem is happening on uniprocessors, the an issue
    with mb's is very unlikely.