Conference mvblab::alphaserver_4100

Title: AlphaServer 4100
Moderator: MOVMON::DAVISS
Created: Tue Apr 16 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 648
Total number of notes: 3158

525.0. "SCSI bus going offline" by TAINO::JFUSTE (JOSE FUSTE) Fri Mar 07 1997 13:54

    I have the following problem at one of my customers:
    
    Problem:
    	The customer is doing backups from a VAX4500 over Ethernet
    to the AlphaServer 4100. The disks on SCSI bus C generate 2 to 3
    errors as they go offline. The same operation was performed using
    the disks on SCSI bus D and the problem never happened. All the
    disks on the same SCSI bus report the errors.
    
    Actions done already:
    
       1. The SCSI controller C was replaced by a new one. 
       2. All the disks in the BA356 box on SCSI C were moved to the
    BA356 used by SCSI D, and the cabling was switched at the back of
    each controller, meaning that both the boxes and the cables were
    swapped.
    
    System Configuration:
    
    		(2) BA356 with 7 disks each box. 
    		(14) RZ28D-VW
    		(2) KZPDA-AA, one connected to each BA356. One KZPDA-AA is used
    as SCSI bus C and the other as SCSI bus D.
    		SCSI Bus A is the internal SCSI controlling only the CD. The
    SCSI bus B is a narrow controller connected to a BA356 with the system
    disk and the tape unit TZ88-VA. 
    
    The disks generating the problem are all the DKCxx units. The customer
    is running VMS 6.2-1H3. The only thing left to try now is to move the
    KZPDA-AA to a different PCI slot.
    
    Any input will be very helpful.
    
    Thanks,
    	José

T.R  Title  User  Personal Name  Date  Lines

525.1  1 KZPDA/PCI  MOVMON::DAVIS  Fri Mar 07 1997 16:39  4
    Make sure the KZPDAs are in separate PCI segments.  (probably not your
    problem, but an unsupported config...)
    
    Todd

525.2  KZPDA/PCI Config...  TAINO::JFUSTE  JOSE FUSTE  Sun Mar 09 1997 09:02  4
    Thanks for the input. I'll check that next Monday. Right now the
    system is configured as it came from the factory.
    
    José

525.3  HARMNY::CUMMINS  Mon Mar 10 1997 10:22  5
    If you find that the SCSI bus C controller behaves differently in one
    PCI versus another, we'd very much like to know about this. The two
    PCIs use different arbitration logic. PCI0 uses the EISA bridge's and
    PCI1 uses the IOD's arbitration (which is settable in more recent
    consoles via the PCI_ARBMODE EV and defaults to "Round-robin".)

525.4  a couple experiments, what is your network device  WONDER::CARLSON  Dave  Mon Mar 10 1997 16:26  12
    Try moving your network interface to Hose 1.
    Have one KZPDA in Hose 0 (shared by EISA) and the other in Hose 1.
    
    Confirm that the tape you are backing up to is on PKB, but the disk errors
    you are seeing are on PKC..
    
        Also could you please give us a "top to bottom" of what is in each
    module slot?
    
         Also if possible, shut off DECwindows and retry.
    
    		Dave

525.5  KZPDA-AA in PCI1 with errors  TAINO::JFUSTE  JOSE FUSTE  Tue Mar 11 1997 08:16  16
    Yesterday I checked the configuration, and both KZPDA-AAs are in
    different PCI segments. I moved the KZPDA-AA controlling the DKCxx
    disks to a different PCI slot in the same segment, and also swapped
    the two KZPDA-AAs to see if the problem follows the controller. This
    morning the customer called to report that the same disks generated
    errors.
    
    I checked using DECevent, and the first disk in the BA356 to report
    an error was DKC0, followed only milliseconds later by the others.
    
    The KZPDA-AA acting as SCSI C is in slot 4 of PCI1. The system
    firmware was updated to the latest version using the V3.8 CD.
    
    What does this parameter (the PCI_ARBMODE EV) need to be set to?
    
    Thanks,
    	José

525.6  System Actual Configuration  TAINO::JFUSTE  JOSE FUSTE  Tue Mar 11 1997 11:11  22
    This is the configuration, slot by slot:
    
    		PCI1 - slot 5 Empty
    		PCI1 - slot 4 Empty
    		PCI1 - slot 3 KZPDA-AA
    		PCI1 - slot 2 NCR810 (DKBxx,MKBxx)
    ------------------------------------------------
    		PCI0 - slot 5 KZPDA-AA (DKDxx)
    		EISA - slot 3 Empty
    		PCI0 - slot 4 DE435
    		EISA - slot 2 Empty
    		PCI0 - slot 3 Empty
    		EISA - slot 1 Empty
    		PCI0 - slot 2 S3 Trio64 (Video)
    
    PAL Code is 1.19-2 for VMS and 1.21-14 for Unix.
    
    AlphaBIOS is V5.24-0
    SRM is V3.0-10
    
    Thanks,
    	Jose

525.7  Known problem (93.21?)  MAY30::CUMMINS  Tue Mar 11 1997 11:49  11
    Could this problem be that described in note 93.21? Sounds like it may
    well be.. If not, I'm out of my area of expertise, but here are some
    other things to check:
    
    1. Is the customer sure that he/she has the termination correct? If I
       read your original note correctly, the disks and cables were swapped
       between DKC and DKD controllers (and the original DKC controller was
       replaced). When the disks/cables were swapped the problem stayed
       with the controller and not the shelf, thereby implying that the 
       problem is not termination. Still, this would be something to check.
    2. What version of FW are your RZ28D disks at?

525.8  HARMNY::CUMMINS  Tue Mar 11 1997 14:19  21
    Re: .3 and the PCI_ARBMODE EV.
    
    Have gotten a couple calls re: this EV. There is currently zero data to
    suggest there's any issue with the default PCI arbitration mode setting
    on 4100/4000. Earlier SRM console versions used the recommended Bridge
    mode and provided no EV for changing the mode. However, problems were
    seen in the lab with certain options and we were requested to init the
    bridge with a Round-robin setting.
    
    Since the release of V2.0-3 SRM console and the introduction of the
    PCI_ARBMODE EV, the PCI arbitration default has been Round-robin and no
    known or otherwise confirmed arbitration problems exist. It is highly
    recommended that PCI_ARBMODE not be altered when troubleshooting flaky
    hardware/applications. This may only serve to complicate the situation.
    
    As an FYI, note that PCI0 uses the EISA arbitration logic and so is
    immune from PCI_ARBMODE settings any way..
    
    Post-FRS 4100/4000 option support statements typically reflect the
    minimum SRM console version required, and this is typically V2.0-3
    or greater. One of the reasons is PCI arbitration mode.

525.9  leave the arb alone  WONDER::CARLSON  Dave  Tue Mar 11 1997 14:57  22
    I need to confirm a couple things.
    
    This only happens when doing a backup of the vax system, via the
    network (de435) right?
    
    You can do a backup of a local disk and you have no problems right?
    
    Could you try moving your de435 from PCI0 to any slot in PCI 1 and
    try it again?
    
    I meant to put something in regarding arbitration, but as Bill
    mentioned, DON'T MUCK WITH IT...
    
         I just confirmed that RSE qualified the TZ88 on the KZPAA under 
    V6.2-1H3 on the 4100...
    
    Could you describe the system activity, especially with respect to
    the drives/controller that is getting errors, when the backup is 
    taking place.
    
    		Thanks
    		Dave

525.10  HARMNY::CUMMINS  Tue Mar 11 1997 15:05  5
    The more I look at this, the more I believe this is the problem
    described in 93.21. How many power supplies in the BA356 shelf?
    Have you read 93.21 - does this look like the problem? Have you
    tried pulling out, say, three drives per shelf to see if that makes
    a difference?

525.11  More information....  TAINO::JFUSTE  JOSE FUSTE  Tue Mar 11 1997 15:58  21
    Well,
    	At the moment it is very difficult for me to pull out drives because
    the system is live in production. The only time I get is at lunchtime,
    from 12:00 to 1:00 PM.
    
    Each BA356 has only one power supply because each box is full with 7
    disks. Tomorrow the system manager is going to pull out the
    personality module on SCSI C to check its revision.
    
    The problem looked to occur only while the backup operation was in
    progress over Ethernet, but this morning the disks generated errors
    while in use without any network operation.
    
    The RZ28D-VW's firmware is 0010.
    
    As soon as my customer calls me back tomorrow I'll let you know the
    revision of the personality modules.
    
    Thanks for the help Bill and Dave.
    
    José

525.12  MAY30::CUMMINS  Tue Mar 11 1997 17:42  8
    I would say from personal experience that the problem you're running
    into is the one described in note 93.21. I was actually involved in
    trying to figure out why the disks would go out to lunch and return
    errors depending upon configuration. You could pursue this further in
    the ASK_SSAG notes conference, though I'm assuming adding a second
    supply should resolve your customer's problem.
    
    BC

525.13  Personalities in rev A03  TAINO::JFUSTE  JOSE FUSTE  Wed Mar 12 1997 09:27  13
    Bill,
    	This morning we found that the personality modules are at revision
    A03. The article says they need to be at B01 minimum. I'm going to
    replace all the personality modules and see how it works.
    My question is:
    Why are only the disks on SCSI C showing the problem, when the disks
    on SCSI D have the same personality module revision? Is it because
    the two PCIs behave differently, or something else?
    
    I'll let you know as soon as I have replaced the personality modules.
    
    Thanks,
    	Jose

525.14  HARMNY::CUMMINS  Wed Mar 12 1997 10:31  6
    We saw the same thing in the lab. Certain shelves were apparently more
    sensitive (read: more susceptible to the problem) than others. I would
    say that this has nothing to do with which PCI the shelf/disks are
    hanging off of, but cannot say for sure..
    
    BC

525.15  DWZZB-VW and 70-31490-01  TAINO::JFUSTE  JOSE FUSTE  Mon Apr 07 1997 15:51  18
    Bill,
    	Note 93.21 referred to the DWZZB-VW. What the customer has is the
    following part number:
    			70-31490-01 rev. A03
    
    I tried to order revision B01 of this type of personality module via
    logistics, and they told us that the latest revision is A04. Going by
    note 93.21, the two parts are totally different. It took logistics a
    long time to find out that this is the latest revision for this module.
    The customer has revision A03. They are still getting errors from the
    same SCSI bus (C). Does revision A04 of this module do the same as
    revision B01 of the DWZZB-VW?
    
    Please let me know what to try next.
    
    Thanks,
    	Jose