Conference turris::microvax

Title: MicroVAX, VAXstation, VAX 4000 Systems
Moderator: QUARK::LIONEL
Created: Fri Jan 03 1986
Last Modified: Fri May 30 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4370
Total number of notes: 18956

4366.0. "DBL ERR2 on a VAX 4000 with KA690" by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Thu May 01 1997 14:14

 We have been chasing our tails on a problem with a VAX 4600 that had been
 running flawlessly until facilities brought the office area up to 85 F for
 the weekend, a couple of weeks ago.

 After replacing the cooked disk drives and toasted VCB02 and bringing the
 machine up, it would get a DBL ERR2, try to reboot and get RESTART FAILURE
 then a CONSOLE OVERRUN.  So field service came back and swapped the KA690.

 After this the machine would just stop/freeze w/o warning or provocation;
 when the screen froze, the halt switch would not flip the VCB02 display
 over to console mode.  Totally froze several times.  So field service came
 back and swapped the KA690.

 After this the machine was back into the DBL ERR2 fault at normal room
 temperature.  From a cold power off state, the machine would run 2-6
 hours and fault.  A verification was made that both cooling fans were
 running.  A double check on the environmental limits (105 F +) was done.

 A couple of conversations with the tech support assistance line in Field
 Service led to the recommendation to replace the KA690 again as it appeared
 to be the only item that could source the DBL ERR2 fault.

 Either we have a terrible run of KA690s or there is something else toasted
 in the system that we have not located yet.

 Anyone have another place to look/consider with this problem?

 thanks,
 Bruce

4366.1. "85F shouldn't have hurt it, but.." by TEKVAX::KOPEC (Tom Kopec W1PF) Fri May 02 1997 07:44
    What other devices are in the system?
    
    I think you've swapped enough KA690s for now.. It has been a way long
    time since I've done much troubleshooting on an Omega, but my next step
    would be to swap out the power supply, then Qbus widgets, then memory.
    
    Somewhat of a long shot: make sure the new VCB02 is up to rev.
    
    ...tom

4366.2. "VCB02? ...will check" by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Fri May 02 1997 09:32
 The Qbus has only a 3 board VCB02, this had been replaced just before the
 air conditioner failure, the rev of the replacement was higher, but will
 check this.

 There are three memories, 1-32MB and 2-64MB; at the moment the system is
 running with only a 64MB module installed and the service door open.

 In the storage bay are a dual RF36, 2-RF74 and a TF86 tape, running at the
 moment w/o one of the RF74 and w/o the TF.

 Strange as it seems, the current common thread in configurations that work
 is that the service door for the CPU/memory bay is open.  

 thanks
  Bruce

4366.3. "Definitely Check The VCB02 Revision..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Fri May 02 1997 11:34
   An out-of-revision VCB02 was known to cause all manner of weird problems
   with the older KA650-class processors; I'd expect similar weird problems
   with the faster KA690.  These problems could be *very* flaky, too -- there
   was a simple ECO to the VCB02 board to adjust the Q-bus timing, and this
   ECO cleaned up the timing problems.  I encountered this on a workstation
   upgrade from a KA630 to a KA650, about six or seven years ago.

4366.4. "arghhhhh....so close, but not quite there" by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Fri May 02 1997 17:15
 Well, the `new' vcb02 controller card was rev D and needed to be at least
 E1; field service just re-replaced it.

 After playing around somewhat with the configuration, we are now sure that
 the front panel cover for the CPU/memory bay is playing some role in the
 machine checks, so field service replaced it...no change.

 If the machine is run with the CPU cover open it seems fine.  If the door
 is closed the machine usually checks before it completes the boot startup
 process, or soon thereafter.

 When the new cover was fitted and the CPU ribbon cable plugged in and the
 other 4 pin molex cable plugged in and the door closed the system would
 not finish boot.  Opening the door would not let it boot afterwards.
 Pulling and resetting (gingerly) the cpu and memory cards did allow
 boot and operation with the door open.

 Anyone else recall a mechanism for destroying an Omega backplane by
 putting the cpu/memory cards in their slots and pushing too hard on
 the tension handles?  

 It would seem that we are down to the backplane being the next 
 most likely candidate for replacement.

 Bruce

4366.4. "well, it was not the vcb02, nor a lot of other things...but" by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Sun May 04 1997 12:07
 Well, we pulled the vcb02 and plugged in a VT220 console, removed all
 disks, and replaced the memory.  Then borrowed a system disk from another
 machine and retested.  Still machine checks, usually within a couple of
 minutes of booting.

 So we now have a machine with empty Q-bus, one known good RF74 system
 disk, a replacement CPU and memory and replacement power supply,
 and no external storage that machine checks as it did during the
 air conditioning failure two weeks ago.  

 Is there anything about the backplanes, upper and/or lower, that could
 result in a machine check?  Everything else has been replaced at least
 once  or removed from the configuration.

 Bruce

4366.5. "Swap The Box..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Mon May 05 1997 14:40
: So we now have a machine with empty Q-bus, one known good RF74 system
: disk, a replacement CPU and memory and replacement power supply,
: and no external storage that machine checks as it did during the
: air conditioning failure two weeks ago.  

    Are there loading requirements on the Q-bus?  Some systems required
    a couple of boards in the Q-bus to prevent the power supply from
    tripping out -- the supply assumed the unloaded box was a problem.

: Is there anything about the backplanes, upper and/or lower, that could
: result in a machine check?  Everything else has been replaced at least
: once  or removed from the configuration.

    Swap the system into another box -- that looks like it's the only
    remaining possibility.  (Yes, there have been (a very few) cases
    where the backplane has caused problems.)

4366.6. "" by TEKVAX::KOPEC (Tom Kopec W1PF) Tue May 06 1997 07:29
    BTW, you probably shouldn't run the machine with the CPU door open
    (that's the thing with the console connector etc on it).. the cooling
    airflow gets screwed up.
    
    I don't think that's THE problem, though.. (we used to run them like
    that in the early debug)
     
    Is there any interesting info in the crash (e.g. something that looks
    like an address)? I don't remember what gets dumped on that error..
    
    ...tom
    
    
4366.7. "getting better..." by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Tue May 06 1997 10:59
 Field Service came yesterday afternoon and replaced the backplane
 and VTerm regulator.  The machine ran all night with two RF36 drives,
 KA690, 2-64MB memories and a 32MB memory, no errors.  Field Service
 got out the door before the machine crashed.

 I swapped the power supply and fans between this machine and another
 contracted machine that was running.  This machine immediately 
 crashed again.  I then swapped the power cord and VTerm regulators
 between the two machines, and placed a piece of paper in the bottom
 of the Qbus cage to block the air intake, forcing the air to be
 dragged through the CPU/memory cage.  Both machines then ran all night.

 This morning we added an RF74 (user disk) and the VCB02 and ran it 
 for 3 hours without incident.  (We did leave a piece of paper on the
 bottom of the unused Qbus slots to increase the cooling in the
 cpu/memory cage....is there a way to lock the fans in a `high speed'
 mode?)

 So with so much good news we shut down and added the second RF74, the
 machine failed to reboot, it just froze part way through the boot and
 ignored the halt switch.

 We pulled the disk and installed the TF tape drive, the boot faulted on
 a double machine check.

 We pulled the other RF74 and rebooted w/o incident, and the machine is
 running now, missing two RF74 drives.

 Bruce

4366.8. "Questions..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Tue May 06 1997 16:08
  There are black plastic "blank modules" that can be inserted into
  the Q-bus backplane; blanks which may help with the airflow.

  Can you move the disks into another DSSI enclosure for testing,
  and install the cover over the tape cavity?

  What kind of environmental sensors are in this system?  Thermal? 
  Airflow?  None?  (If there are sensors, consider swapping them.)

  You say "supply".  I'm not as familiar with these boxes as I used
  to be, but there used to be at least two power supplies in these
  boxes...

4366.9. "more..." by SUBSYS::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Tue May 06 1997 16:53
 re: two supplies were in the 3000 machines, the 4000 `tombstone'
     used just one supply, somewhat larger than the 3000

 We ripped everything out of the machine and went to the computer room
 to install the parts in another machine that runs fine.  All the
 storage worked fine on the alternate machine, as did the memory,
 the CPU failed.  A second brand new spare failed as well.  
 Putting the original machine back together with the cpu board
 from the alternate machine allows the original to run.

 If the machine runs over night, we will put the original cpu
 board back in and retest to see if the real problem was just
 the backplane.

 Bruce

4366.10. "" by PROXY::J_EVANS Wed May 07 1997 10:12
    We had an IPMT case where the cable connecting the CPU to the CPU panel
    went bad. Did you try swapping the cable and panel assembly?
    
    jim
    
4366.11. "fixed...machine seems to be running now" by FIEVEL::FILGATE (Bruce Filgate SHR3-2/W4 237-6452) Wed May 07 1997 11:19
As always, there were no complicated problems, merely a number
of simple problems masquerading as a single complex problem.


 1) replaced the dead VCB02 controller, but the replacement
    was too down rev to work and we missed this.  The second
    replacement was procured and checked.  Our assumption was
    that the VCB02 eco was so long ago that all the product 
    would be at adequate revision...wrong.

 2) the first, second, and third replacement KA690-AA CPU modules
    were defective, in two machines they only ran for a few
    hours at best. In general they would run until the machines
    warmed up.  We assumed that 3 bad CPU
    boards in a row would be improbable unless the same process
    error was involved in each...it probably was.  The original
    CPU board appears to be fine and is currently running this
    workstation.

 3) the backplane did have a problem, after its replacement we
    could run the machine enough to troubleshoot the other
    problem with the cpu.
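
 A rough sketch of the arithmetic behind item 2, assuming board
 failures are independent.  The defect rates below are made-up
 illustrative values, not measured figures for the KA690, and the
 function name is hypothetical.

```python
def prob_all_defective(defect_rate: float, boards: int) -> float:
    """Probability that `boards` independently chosen boards are all defective."""
    if not 0.0 <= defect_rate <= 1.0:
        raise ValueError("defect_rate must be between 0 and 1")
    if boards < 0:
        raise ValueError("boards must be non-negative")
    return defect_rate ** boards


# Hypothetical per-board defect rates (assumptions, not field data).
for rate in (0.01, 0.05, 0.20):
    p = prob_all_defective(rate, 3)
    print(f"defect rate {rate:.0%}: P(3 bad in a row) = {p:.6f}")
```

 Even at an assumed 20% defect rate, three independent failures in a
 row would be rare, which is why a shared process error in the
 replacement boards looks like the more plausible explanation.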


thanks all for the help!  I'll try to stop by and drop an update
if the machine is still running after the weekend (SHR plant has
a problem that heats this 10% of the building 10-15 degrees above 
the rest of the plant on weekends...while it raises havoc with
MTBF on computers, if the machine is still running on Monday
morning it will probably run all week!)

Bruce