
Conference netcad::hub_mgnt

Title:	DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice:	Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator:	NETCAD::COLELLADT

Created:	Wed Nov 13 1991
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4455
Total number of notes:	16761

4210.0. "Decrepeater 900TM error log entries?" by KERNEL::WARDJO () Tue Feb 11 1997 09:47

    Hello,
    
    A customer just reported that he lost all network communication through
    his Dechub900.
    
    This has 2 Decrepeater 900TM, 2 Decserver 900TM, a Decrepeater 90C and
    a Decrepeater 90FL.
    
    The customer re-powered the hub to get communications back; however, the
    error logs for the 900TM repeaters show the following:-
    
    Repeater 1 SW V1.1.0 - One entry
    
    ==============================================================================
    
                                    DUMP ERROR LOG
                                Current Reset Count: 31817
    
    ==============================================================================
    
    
    
            Entry        = 1
            Time Stamp   = 0
            Reset Count  = 7
            Fatal error 759 file h1dmr.c
    
    
    Repeater 2 SW V2.0.0 - Four Entries
    
    
            Entry        = 4
            Time Stamp   = 0 635447380
            Reset Count  = 7
            Fatal error: Bus/addr writeexc @ 4320438h
            PC 253374
    
                Entry        = 3
                Time Stamp   = 0 19762914
                Reset Count  = 6
                Fatal error: Bus/addr readexc @ 320032h
                PC 253162h
    
                    Entry        = 2
                    Time Stamp   = 0 91802966
                    Reset Count  = 5
                    Fatal error: Bus/addr readexc @ 32003ch
                    PC 253656h
    
    Entry 1 was a firmware upgrade.
    
    There were no errors in the hub error log which is running V4.1.1.
    
    Can anyone offer an explanation for these entries?
    
    Regards,
    
    Jon Ward
    
    UK CSC
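
For comparing entries across repeaters, a dump like the one above can be pulled into records. This is only a minimal sketch: the field names (Entry, Time Stamp, Reset Count, Fatal error, PC) are taken from the log text in .0, but the function name `parse_error_log` and the dictionary layout are illustrative assumptions, not any DEChub or HUBwatch tool. The PC value is kept as text because the log prints it both with and without the trailing `h`.

```python
import re

# Patterns for the fields seen in the dump in .0. The layout is inferred from
# that one sample only; other firmware versions may print it differently.
FIELD_PATTERNS = {
    "entry": re.compile(r"^Entry\s*=\s*(\d+)$"),
    "time_stamp": re.compile(r"^Time Stamp\s*=\s*(.+)$"),
    "reset_count": re.compile(r"^Reset Count\s*=\s*(\d+)$"),
    "fatal_error": re.compile(r"^Fatal error:?\s*(.+)$"),
    "pc": re.compile(r"^PC\s+([0-9A-Fa-f]+)h?$"),
}


def parse_error_log(text):
    """Return one dict per 'Entry = N' block found in the dump text."""
    entries = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match is None:
                continue
            value = match.group(1)
            if name == "entry":
                # An 'Entry' line always starts a new record.
                current = {"entry": int(value)}
                entries.append(current)
            elif current is not None:
                # Reset Count is numeric; everything else stays as text.
                current[name] = int(value) if name == "reset_count" else value
            break
    return entries
```

Calling `parse_error_log()` on the Repeater 2 dump gives a list of dictionaries, so the PC values and fault addresses from several hubs can be collected and compared side by side.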

T.R	Title	User	Personal Name	Date	Lines
4210.1	Can any one help?	KERNEL::WARDJO		`Thu Feb 20 1997 08:32`	8
	Can no one offer any explanation for .0 ? Any help appreciated. Jon
4210.2		NETCAD::MILLBRANDT	answer mam	`Thu Feb 20 1997 10:11`	5
	None of the current repeater developers read notes, so you're out of luck on internal firmware questions. The rest of us can help you out on operational issues. Dotsie
4210.3	900tm log:pc=253374,253162	NPSS::KILSDONK		`Wed Mar 12 1997 09:22`	22
	Hello, I maintain the DETMM code. This reset is the cause célèbre of CASE 42089, which we are working very hard to solve. CASE 47248 goes along with this. It's a tough bug, but the theory is that when network utilization is high and broadcasts (such as ARPs) make up the bulk of that utilization, the list of machine addresses stresses the algorithm for maintaining 'current' machine addresses at the ports of the repeater. We appreciate any traces which stop on the SNMP cold start packet which the box puts out when it resets. The error always happens when the repeater is not relying on the hub backplane, nor should it, since the machine address list algorithms aren't used for this operational case. The bug is tough because the list of machine addresses has a node that points to outer space. How it got there is a question whose answer lies in one of the one-in-a-million possible paths through the code. Obviously, the designers design to preclude this very thing from happening. For example, one process may be changing the list while another goes to reference it. Of course, replicating the exact error case is key to isolating a fix. Our current test to replicate the problem is to snow the repeater with packets from random machine addresses and then start a continuous ARP for the repeater's IP address. We're having mixed success but refinement continues. We are also mapping the assembly code back to the source to find the exact bug. Any logs with other PCs reported would be useful, as would the trace mentioned above and some sort of network characterization (e.g. % of broadcast traffic), but I can get that from any trace that becomes available. The design engineers are busily preparing bold new products and we are busily trying to fix this sole bug in the Digital product set.
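
The "random machine addresses" part of the replication test in .3 could be sketched roughly as follows. This is not the test tool the DETMM maintainers use. It only generates the source addresses, and the function names and the hyphenated address format are assumptions made for illustration. Each address has the group (multicast) bit cleared so it is a valid unicast source, and the locally administered bit set so it cannot collide with a vendor-assigned address.

```python
import random


def random_unicast_mac(rng):
    """Return one random, locally administered, unicast MAC address."""
    octets = [rng.randrange(256) for _ in range(6)]
    # Bit 0 of the first octet is the individual/group bit: clear it so the
    # address is unicast. Bit 1 is the universal/local bit: set it so the
    # address is marked as locally administered.
    octets[0] = (octets[0] & 0xFE) | 0x02
    return "-".join(f"{octet:02X}" for octet in octets)


def source_address_stream(count, seed=None):
    """Return `count` random source addresses; a seed makes runs repeatable."""
    rng = random.Random(seed)
    return [random_unicast_mac(rng) for _ in range(count)]
```

A fixed seed is worth having here: if a particular sequence of addresses does trigger the reset, the same sequence can be replayed.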
4210.4	Re-formatted for 80 column viewing folks, like myself :-)	NETCAD::BATTERSBY		`Wed Mar 12 1997 10:26`	40
	<<< NETCAD::KALI$USER3:[NOTES$LIBRARY]HUB_MGNT.NOTE;1 >>> -< DEChub/HUBwatch/PROBEwatch CONFERENCE >- ================================================================================ Note 4210.3 Decrepeater 900TM error log entries? 3 of 3 NPSS::KILSDONK 22 lines 12-MAR-1997 09:22 -< 900tm log:pc=253374,253162 >- -------------------------------------------------------------------------------- Hello, I maintain the DETMM code. This reset is the cause célèbre of CASE 42089, which we are working very hard to solve. CASE 47248 goes along with this. It's a tough bug, but the theory is that when network utilization is high and broadcasts (such as ARPs) make up the bulk of that utilization, the list of machine addresses stresses the algorithm for maintaining 'current' machine addresses at the ports of the repeater. We appreciate any traces which stop on the SNMP cold start packet which the box puts out when it resets. The error always happens when the repeater is not relying on the hub backplane, nor should it, since the machine address list algorithms aren't used for this operational case. The bug is tough because the list of machine addresses has a node that points to outer space. How it got there is a question whose answer lies in one of the one-in-a-million possible paths through the code. Obviously, the designers design to preclude this very thing from happening. For example, one process may be changing the list while another goes to reference it. Of course, replicating the exact error case is key to isolating a fix. Our current test to replicate the problem is to snow the repeater with packets from random machine addresses and then start a continuous ARP for the repeater's IP address. We're having mixed success but refinement continues. We are also mapping the assembly code back to the source to find the exact bug.
Any logs with other PCs reported would be useful, as would the trace mentioned above and some sort of network characterization (e.g. % of broadcast traffic), but I can get that from any trace that becomes available. The design engineers are busily preparing bold new products and we are busily trying to fix this sole bug in the Digital product set.
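
For the network characterization asked for above, the broadcast share can be estimated from the destination addresses in a trace. This is a rough sketch with assumptions: the input is a plain list of destination MAC strings already extracted from the trace, and the function names are hypothetical. It counts frames, not bytes, so it is not a utilization figure.

```python
BROADCAST = "FF-FF-FF-FF-FF-FF"


def normalize_mac(address):
    """Upper-case the address and use hyphens so formats compare equal."""
    return address.strip().upper().replace(":", "-")


def broadcast_percentage(destinations):
    """Return the percentage of frames sent to the broadcast address."""
    total = 0
    broadcasts = 0
    for destination in destinations:
        total += 1
        if normalize_mac(destination) == BROADCAST:
            broadcasts += 1
    # An empty trace has no meaningful share; report 0.0 instead of dividing by zero.
    return 100.0 * broadcasts / total if total else 0.0
```

Multicast frames are deliberately not counted here; only the all-ones broadcast address is, since ARP requests are sent to it.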
4210.5	Any more news?	KERNEL::WARDJO		`Thu Apr 10 1997 10:50`	6
	Any news on this problem? Thanks, Jon