
Conference netcad::hub_mgnt

Title:DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice:Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator:NETCAD::COLELLADT
Created:Wed Nov 13 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:4455
Total number of notes:16761

4210.0. "Decrepeater 900TM error log entries?" by KERNEL::WARDJO () Tue Feb 11 1997 09:47

    Hello,
    
    A customer just reported that he lost all network communication through
    his Dechub900.
    
    This has 2 Decrepeater 900TM, 2 Decserver 900TM, a Decrepeater 90C and
    a Decrepeater 90FL.
    
    The customer power-cycled the hub to get communications back; however, the
    error logs for the 900TM repeaters show the following:-
    
    Repeater 1 SW V1.1.0 - One entry
    
    ==============================================================================
    
                                    DUMP ERROR LOG
                                Current Reset Count: 31817
    
    ==============================================================================
    
    
    
            Entry        = 1
            Time Stamp   = 0
            Reset Count  = 7
            Fatal error 759 file h1dmr.c
    
    
    Repeater 2 SW V2.0.0 - Four Entries
    
    
            Entry        = 4
            Time Stamp   = 0 635447380
            Reset Count  = 7
            Fatal error: Bus/addr writeexc @ 4320438h
            PC 253374
    
            Entry        = 3
            Time Stamp   = 0 19762914
            Reset Count  = 6
            Fatal error: Bus/addr readexc @ 320032h
            PC 253162h
    
            Entry        = 2
            Time Stamp   = 0 91802966
            Reset Count  = 5
            Fatal error: Bus/addr readexc @ 32003ch
            PC 253656h
    
    Entry 1 was a firmware upgrade.
    
    There were no errors in the error log of the hub, which is running V4.1.1.
    
    Can anyone offer an explanation for these entries?
    
    Regards,
    
    Jon Ward
    
    UK CSC
    
           
    
    
T.R  Title  User  Personal Name  Date  Lines
4210.1  Can any one help?  KERNEL::WARDJO  Thu Feb 20 1997 08:32  8
    
    Can no one offer any explanation for .0 ?
    
    Any help appreciated.
    
    Jon
    
    
4210.2  NETCAD::MILLBRANDT  answer mam  Thu Feb 20 1997 10:11  5
None of the current repeater developers read notes, so you're out of
luck on internal firmware questions.  The rest of us can help you out
on operational issues.

	Dotsie
4210.3  900tm log:pc=253374,253162  NPSS::KILSDONK  Wed Mar 12 1997 09:22  22
Hello, I maintain the DETMM code.  This reset is the cause célèbre of CASE 42089, which we are working very hard
to solve.  CASE 47248 goes along with this.  It's a tough bug, but the theory is that when network utilization is
high and broadcasts (such as ARPs) make up the bulk of that utilization, the list of machine addresses stresses
the algorithm for maintaining 'current' machine addresses at the ports of the repeater.  We would
appreciate any traces which stop on the SNMP cold start packet which the box puts out when it resets.  It is
always the case that the error happens when the repeater is not relying on the hub backplane, nor should it, since
the machine address list algorithms aren't used for this operational case.

The bug is tough because the list of machine addresses has a node that points to outer space.  How it got there
is a question whose answer lies in the one-in-a-million possible paths through the code.  Obviously, the
designers design to preclude this very thing from happening.  For example, one process may be changing the list
while another goes to reference it.  Of course, replicating the exact error case is key to isolating a fix.
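
To make the hypothesis concrete, here is a minimal sketch, in Python for brevity, of a per-port address table in which learning, ageing, and lookup all take the same lock, so a reader never sees a half-updated structure. This is not the DETMM code: the class name, the method names, and the ageing rule are all hypothetical, and real firmware would more likely mask interrupts or use an RTOS primitive than a thread lock.

```python
import threading


class AddressTable:
    """Hypothetical table of 'current' machine addresses per repeater port."""

    def __init__(self, max_age):
        self._lock = threading.Lock()
        self._last_seen = {}   # (port, mac) -> time the address was last seen
        self._max_age = max_age

    def learn(self, port, mac, now):
        # Writer: record or refresh an address seen on a port.
        with self._lock:
            self._last_seen[(port, mac)] = now

    def age_out(self, now):
        # Writer: drop entries not seen within max_age; returns how many went.
        with self._lock:
            stale = [key for key, seen in self._last_seen.items()
                     if now - seen > self._max_age]
            for key in stale:
                del self._last_seen[key]
            return len(stale)

    def current(self, port):
        # Reader: takes the same lock, so it cannot run in the middle of
        # learn() or age_out().
        with self._lock:
            return sorted(mac for (p, mac) in self._last_seen if p == port)
```

If either writer ran without the lock while `current()` was walking the table, the reader could follow an entry that was being removed, which is the general shape of the failure described above.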

Our current test to replicate the problem is to snow the repeater with packets from random machine addresses and
then start a continuous ARP for the repeater's IP address.  We're having mixed success but refinement continues.
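
A rough sketch of the frame-building half of such a test, assuming Ethernet II framing and standard ARP request layout. The function names are made up for illustration, the IP addresses are documentation placeholders, and nothing here actually transmits; putting frames on the wire needs a raw socket or a packet tool and suitable privileges.

```python
import random
import socket
import struct


def random_unicast_mac(rng):
    """Return 6 random bytes forced to be a locally administered unicast MAC."""
    octets = [rng.randrange(256) for _ in range(6)]
    # Set the locally-administered bit (0x02), clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return bytes(octets)


def arp_request_frame(src_mac, src_ip, target_ip):
    """Build a broadcast Ethernet II frame carrying an ARP request."""
    # Ethernet header: broadcast destination, our source, EtherType 0x0806.
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + socket.inet_aton(src_ip)
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)  # target MAC unknown
    return eth + arp


rng = random.Random(1997)  # seeded so a run can be repeated
frames = [
    arp_request_frame(random_unicast_mac(rng), "192.0.2.10", "192.0.2.1")
    for _ in range(1000)
]
```

Seeding the generator means the same address sequence can be replayed if a particular run does provoke the reset.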

We are also mapping the assembly code back to the source to find the exact bug.  Any logs with other PCs reported
would be useful, as would the trace mentioned above and some sort of network characterization (e.g. % of broadcast
traffic), though I can get that from any trace that becomes available.
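
For anyone gathering logs, a small sketch that pulls the fault address and PC out of entries laid out like the dump in .0. The function name and the sample text are illustrative only. Values are kept as strings because the trailing "h" suggests hex, but the dump does not say so, and one PC in .0 is printed without it.

```python
import re

SAMPLE = """
        Entry        = 4
        Time Stamp   = 0 635447380
        Reset Count  = 7
        Fatal error: Bus/addr writeexc @ 4320438h
        PC 253374

        Entry        = 3
        Time Stamp   = 0 19762914
        Reset Count  = 6
        Fatal error: Bus/addr readexc @ 320032h
        PC 253162h
"""


def parse_error_log(text):
    """Return one dict per 'Entry = N' block found in the dump text."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        start = re.match(r"Entry\s*=\s*(\d+)$", line)
        if start:
            current = {"entry": int(start.group(1)),
                       "error": None, "fault_addr": None, "pc": None}
            entries.append(current)
            continue
        if current is None:
            continue  # banner lines before the first entry
        if line.startswith("Fatal error"):
            current["error"] = line
            addr = re.search(r"@\s*([0-9A-Fa-f]+)h?$", line)
            if addr:
                current["fault_addr"] = addr.group(1)
            continue
        pc = re.match(r"PC\s+([0-9A-Fa-f]+)h?$", line)
        if pc:
            current["pc"] = pc.group(1)
    return entries


for item in parse_error_log(SAMPLE):
    print(item["entry"], item["fault_addr"], item["pc"])
```

Entries without an exception line, such as the "Fatal error 759 file h1dmr.c" one, come back with `fault_addr` and `pc` left as `None`.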

The design engineers are busily preparing bold new products and we are busily trying to fix this sole bug in the
digital product set.
4210.4  Re-formatted for 80 column viewing folks, like myself :-)  NETCAD::BATTERSBY  Wed Mar 12 1997 10:26  40
            <<< NETCAD::KALI$USER3:[NOTES$LIBRARY]HUB_MGNT.NOTE;1 >>>
                   -< DEChub/HUBwatch/PROBEwatch CONFERENCE >-
================================================================================
Note 4210.3           Decrepeater 900TM error log entries?                3 of 3
NPSS::KILSDONK                                       22 lines  12-MAR-1997 09:22
                        -< 900tm log:pc=253374,253162 >-
--------------------------------------------------------------------------------
Hello, I maintain the DETMM code.  This reset is the cause célèbre of 
CASE 42089, which we are working very hard to solve.  CASE 47248 goes 
along with this.  It's a tough bug, but the theory is that when network 
utilization is high and broadcasts (such as ARPs) make up the bulk of 
that utilization, the list of machine addresses stresses the algorithm 
for maintaining 'current' machine addresses at the ports of the 
repeater.  We would appreciate any traces which stop on the SNMP cold 
start packet which the box puts out when it resets.  It is
always the case that the error happens when the repeater is not relying 
on the hub backplane, nor should it, since the machine address list 
algorithms aren't used for this operational case.

The bug is tough because the list of machine addresses has a node that 
points to outer space.  How it got there is a question whose answer 
lies in the one-in-a-million possible paths through the code. Obviously, 
the designers design to preclude this very thing from happening. For 
example, one process may be changing the list while another goes to 
reference it.  Of course, replicating the exact error case is key to 
isolating a fix.

Our current test to replicate the problem is to snow the repeater with 
packets from random machine addresses and then start a continuous ARP 
for the repeater's IP address.  We're having mixed success but 
refinement continues.

We are also mapping the assembly code back to the source to find the 
exact bug.  Any logs with other PCs reported would be useful, as would 
the trace mentioned above and some sort of network characterization
(e.g. % of broadcast traffic), though I can get that from any trace that 
becomes available.  

The design engineers are busily preparing bold new products and we are 
busily trying to fix this sole bug in the digital product set.
4210.5  Any more news?  KERNEL::WARDJO  Thu Apr 10 1997 10:50  6
    Any news on this problem?
    
    Thanks,
    
    Jon