
Conference netcad::hub_mgnt

Title:DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice:Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator:NETCAD::COLELLADT
Created:Wed Nov 13 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:4455
Total number of notes:16761

2877.0. "Hub 900 reset" by LEMAN::PAIVA (Hawkeye - Network Support @GEO) Wed Oct 18 1995 13:37

    HI!
    
      One of our customers reports that his HUB900 has "crashed" on a
    UPS-protected installation, i.e. all the users were disconnected.
    Two DEC 7000s on the same sector didn't "suffer" anything.
    
      The MAM firmware's version is 4.0.0 and the Error log has the
    following entries:
    
      Entry = 4
      Time Stamp = 0 186569200
      Reset Count = 8
      Catch VO=00C SR=2200 PC=46A562 F=A562000
    
      Entry = 3
      Time Stamp = 0 218991500
      Reset Count = 7
      Catch VO=00C SR=2200 PC=4697C8 F=97C8000
    
      Entry = 2
      Time Stamp = 0 0
      Reset Count = 4
      Catch VO=07C SR=2709 PC=606B82
    
      Entry = 1
      Time Stamp = 0 0
      Reset Count = 3
      SW V3.1.0-> V4.0.0 ; Config retained.
    
    Any hint about what may have happened ?
    
    Cheers,
    
    Pedro
    
2877.1. "easy one" by NETCAD::MILLBRANDT (answer mam) Thu Oct 19 1995 10:42, 6 lines
Ah, yess...those PCs point to the dreaded "BTREE" crash.
You should be pleased to know that this is a known problem,
and the known fix is already available.  Just upgrade to
MAM_V4.1.0.

	Dotsie
2877.2. "Want more info please" by LEMAN::PAIVA (Hawkeye - Network Support @GEO) Mon Oct 23 1995 11:54, 9 lines
    Dotsie,
    
       Could you be a little more specific about those BTREE crashes? I
    already gave the customer the new firmware, but he wants to know more
    about it, and so do I.
    
       Thanks
    
       Pedro
2877.3. "BTREE bug: the gory details" by NETCAD::BRANAM (Steve, Hub Products Engineering, LKG2-2, DTN 226-6043) Mon Oct 23 1995 14:20, 67 lines
Generically, a B-Tree is a fairly complex data structure used to organize
ordered lists of data. A number of hub products use B-Trees internally to manage
the repeater MAC address database, the list of last addresses seen on each
repeater port. Every manageable repeater has this list for its own ports;
management agent devices which do repeater proxy management, such as the MAM,
DECagent 90, and proxying repeaters like the DECrepeater 90TS, also have this
list for all the repeaters they manage.

On any given module, entries get added to the MAC address database as they are
detected. Repeaters will immediately get these addresses from their MAC chips.
Management agents will get addresses by regular polling of the managed
repeaters, really just sampling the total stream of MAC addresses flowing
through a given port, since the polling rate is much lower than the packet rate.
Each time a module adds an entry to its database, it timestamps the entry. If
the module gets the same entry again, it updates the timestamp. Periodically,
the database is swept to check for old entries to remove, known as "aging out".
When one of these entries is found, it is deleted from the database. Thus an
address that is seen only once on a repeater port will soon age out and be
removed from the database. An address that is seen repeatedly will remain in
the database indefinitely, until that node remains quiet for a while.

B-Trees support add and remove operations. Under certain conditions in this
implementation of B-Trees, adding an entry requires more heap memory to be
allocated to the B-Tree structures. At other times, new memory does not need to
be allocated. Correspondingly, under certain conditions, when an entry is
removed, heap memory is freed from the B-Tree structures, while at other times
memory does not need to be freed. The frequency of adds and removals, and of
memory allocations and deallocations, is dependent on the turnover of MAC
addresses detected. The database holds up to 256 entries, and will not permit
new entries when full. Aging out frees up entries. 
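
The add / refresh / age-out cycle described above can be sketched roughly as follows. This is only a toy model, not the hub firmware: it uses a plain Python dict where the real modules use a B-Tree, and the names `AddressTable`, `see`, `sweep` and the `max_age` parameter are invented for illustration. The 256-entry limit is the one quoted above.

```python
class AddressTable:
    """Toy model of a 'last addresses seen' table with aging."""

    def __init__(self, max_age, capacity=256):
        self.max_age = max_age      # hypothetical aging interval, in ticks
        self.capacity = capacity    # 256-entry limit quoted in the note
        self.last_seen = {}         # MAC address -> tick of last sighting

    def see(self, mac, now):
        """Add or refresh an entry; return False if the table is full."""
        if mac not in self.last_seen and len(self.last_seen) >= self.capacity:
            return False            # full: new entries are not permitted
        self.last_seen[mac] = now   # new entry, or timestamp update
        return True

    def sweep(self, now):
        """Remove ('age out') entries not seen within max_age ticks."""
        stale = [m for m, t in self.last_seen.items() if now - t > self.max_age]
        for mac in stale:
            del self.last_seen[mac]
        return stale
```

An address that keeps being seen has its timestamp refreshed and survives every sweep; one that goes quiet is removed, which frees a slot for a new address.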

The MAC address database for a small network of continuously active nodes will
stabilize over a period of time, with the same entries remaining in the database
forever. However, if an address is not seen for too long a period (because a
node was down, or the MAC address polling of a port just happened to sample
different addresses on a port with multiple stations), an entry will be removed
occasionally. This may also happen with server nodes which periodically
advertise their presence, then stay quiet for a while. In more active networks,
or networks where nodes are bursty (send traffic for a while, then stay quiet
for a while), removals may take place more frequently. The degenerate case is a
large (specifically, much more than 256 MAC addresses), active network, where
whole batches of MAC addresses age out at the same time and new addresses
replace them rapidly, causing the database to repeatedly grow to maximum, then
deflate, then fill up again. I like to refer to this behavior as "churning the
database", because it causes so many continuous additions and removals.

As noted above, under certain conditions (which are non-deterministic, so don't
even *think* of trying to avoid it!), removal of an entry causes heap memory to
be freed. The B-Tree bug (a bug in this implementation, not in B-Trees as a data
structure) was that it would not free *all* of the memory it had allocated. Each
module has a fixed amount of heap memory that in normal use is repeatedly
allocated and freed by a number of operations; the amount of memory freed must
be equal to the amount allocated. The B-Tree code was consuming more memory than
it was releasing each time, so that over time it consumed all available heap
memory. This is a generic class of bug known as a "memory leak". The memory is
there, but all the software data structures have lost track of it, so it appears
to be gone. This would then generate the error reported in the log, when some
part of the MAM code attempted to allocate heap memory and found none available.
The fix was to free the correct amount of memory each time.
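
To make the leak arithmetic concrete, here is a minimal sketch of a fixed-size heap where each add/remove cycle gives back slightly less than it took. The heap size and byte counts are made-up illustration values, not figures from the MAM firmware, and `Heap` and `cycles_until_exhausted` are hypothetical names.

```python
class Heap:
    """Fixed-size heap modelled as a simple free-byte counter."""

    def __init__(self, size):
        self.free = size

    def alloc(self, n):
        if n > self.free:
            raise MemoryError("heap exhausted")
        self.free -= n

    def release(self, n):
        self.free += n


def cycles_until_exhausted(heap_size, alloc_bytes, freed_bytes, max_cycles=100_000):
    """Return how many add/remove cycles succeed before an allocation fails.

    Returns None if the heap survives max_cycles, i.e. no leak was observed.
    """
    heap = Heap(heap_size)
    for cycle in range(max_cycles):
        try:
            heap.alloc(alloc_bytes)      # an add that needs heap memory
        except MemoryError:
            return cycle
        heap.release(freed_bytes)        # the matching remove
    return None
```

With these illustration values, `cycles_until_exhausted(1000, 100, 90)` loses 10 bytes per cycle and fails after 91 cycles, while `cycles_until_exhausted(1000, 100, 100)` (the fixed behavior) never runs out.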

In a network with a fairly stable MAC address database, it takes a while to
consume all the heap. Entry 4 in .0 shows a timestamp of 186569200, in 100 ms
increments. This is 518 hours (21 days). Entry 3 is 608 hours (25 days).
Networks that really churn the database may cause a module to crash more
frequently. Predicting the failure time is virtually impossible, but once this
behavior is identified, it should be fairly repeatable given consistent network
traffic.
2877.4. "" by NETCAD::BRANAM (Steve, Hub Products Engineering, LKG2-2, DTN 226-6043) Mon Oct 23 1995 15:24, 2 lines
Minor correction to .3, log timestamps are in 10 ms increments, not 100 ms
increments. I calculated the hours/days correctly, though.
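
For anyone who wants to check the arithmetic, here is a small sketch of the conversion using the 10 ms tick. It assumes the second number on the "Time Stamp" line is the tick count; the meaning of the leading "0" is not stated in this thread, so it is ignored here. The function name is invented for illustration.

```python
TICKS_PER_SECOND = 100  # one tick = 10 ms


def ticks_to_uptime(ticks):
    """Convert an error-log tick count to (hours, days) as floats."""
    hours = ticks / (TICKS_PER_SECOND * 3600)
    return hours, hours / 24


for entry, ticks in ((4, 186569200), (3, 218991500)):
    hours, days = ticks_to_uptime(ticks)
    print(f"Entry {entry}: {int(hours)} hours ({int(days)} days)")
```

This reproduces the 518 hours (21 days) and 608 hours (25 days) quoted in .3.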
2877.5. "Thanks" by LEMAN::PAIVA (Hawkeye - Network Support @GEO) Thu Oct 26 1995 12:08, 11 lines
     Steve,
    
       Thanks a million for such a descriptive answer! That's what I call an
     engineer's answer. Although I did my Master's in Computer Engineering
     (in French, since it was in Switzerland), it didn't occur to me that BTREE
     was a binary tree... 8-( . Anyway I learned a lot from the rest of the
     explanation: the big chunk.
    
        Cheers,
    
        Pedro
2877.6. "You're welcome!" by NETCAD::BRANAM (Steve, Hub Products Engineering, LKG2-2, DTN 226-6043) Fri Oct 27 1995 14:46, 10 lines
Actually, a B-TREE is a "balanced multiway tree", in this case of order 16. This
means each node has up to 16 children, where a binary tree has only 2. Its
primary advantage over binary trees, and the major source of complexity, is that
it is always balanced, regardless of the order of adds or removes, with every
subtree having the same number of levels of children. This avoids the degenerate
case that binary trees are prone to where all children are down the same branch,
making the tree act in reality like a simple linear list and voiding all its
performance benefits.
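
The degenerate case is easy to demonstrate. The sketch below feeds sorted keys into a plain, unbalanced binary search tree and counts its levels, then computes the fewest levels a fully packed 16-way tree would need for the same number of keys. It is an illustration only: the helper names are invented, and a real B-Tree's nodes are generally not completely full, so it may need somewhat more levels than this best case.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert into a plain, unbalanced binary search tree (iteratively)."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def bst_levels(root):
    """Number of levels in the tree, counted breadth-first."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels


def packed_multiway_levels(n_keys, order=16):
    """Fewest levels needed if every node holds order-1 keys and order children."""
    levels = 0
    while order ** levels - 1 < n_keys:   # a full tree of L levels holds order**L - 1 keys
        levels += 1
    return levels
```

Inserting the keys 0 through 255 in order yields a binary tree 256 levels deep (effectively a linear list), whereas `packed_multiway_levels(256)` is 3.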

Sorry, couldn't resist the pedantic explanation!