T.R | Title | User | Personal Name | Date | Lines |
---|
4366.1 | 85F shouldn't have hurt it, but.. | TEKVAX::KOPEC | Tom Kopec W1PF | Fri May 02 1997 07:44 | 9 |
| What other devices are in the system?
I think you've swapped enough KA690s for now.. It has been a way long
time since I've done much troubleshooting on an Omega, but my next step
would be to swap out the power supply, then Qbus widgets, then memory.
Somewhat of a long shot: make sure the new VCB02 is up to rev.
...tom
|
4366.2 | VCB02? ...will check | SUBSYS::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Fri May 02 1997 09:32 | 16 |
|
The Qbus has only a 3 board VCB02, this had been replaced just before the
air conditioner failure, the rev of the replacement was higher, but will
check this.
There are three memories, 1-32MB and 2-64MB; at the moment the system is
running with only a 64MB installed and the service door open.
In the storage bay are a dual RF36, 2-RF74 and a TF86 tape, running at the
moment w/o one of the RF74 and w/o the TF.
Strange as it seems, the current common thread in configurations that work
is that the service door for the CPU/memory bay is open.
thanks
Bruce
|
4366.3 | Definitely Check The VCB02 Revision... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Fri May 02 1997 11:34 | 8 |
|
An out-of-revision VCB02 was known to cause all manner of weird problems
with the older KA650-class processors; I'd expect similar weird problems
with the faster KA690. These problems could be *very* flaky, too -- there
was a simple ECO to the VCB02 board to adjust the Q-bus timing, and this
ECO cleaned up the timing problems. I encountered this on a workstation
upgrade from a KA630 to a KA650, about six or seven years ago.
|
4366.4 | arghhhhh....so close, but not quite there | SUBSYS::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Fri May 02 1997 17:15 | 26 |
|
Well the `new' vcb02 controller card was rev D, and needed to be at least
E1, field service just re-replaced it.
After playing around somewhat with the configuration, we are now sure that
the front panel cover for the CPU/memory bay is playing some role in the
machine checks, so field service replaced it...no change.
If the machine is run with the CPU cover open it seems fine. If the door
is closed the machine usually checks before it completes the boot startup
process, or soon thereafter.
When the new cover was fitted and the CPU ribbon cable plugged in and the
other 4 pin molex cable plugged in and the door closed the system would
not finish boot. Opening the door would not let it boot afterwards.
Pulling and resetting (gingerly) the cpu and memory cards did allow
boot and operation with the door open.
Anyone else recall a mechanism for destroying an Omega backplane by
putting the cpu/memory cards in their slots and pushing too hard on
the tension handles?
It would seem that we are down to the backplane being the next
most likely candidate for replacement.
Bruce
|
4366.4 | well, it was not the vcb02, nor a lot of other things...but | SUBSYS::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Sun May 04 1997 12:07 | 16 |
|
Well, we pulled the vcb02 and plugged in a VT220 console, removed all
disks, and replaced the memory. Then borrowed a system disk from another
machine and retested. Still machine checks, usually within a couple of
minutes of booting.
So we now have a machine with empty Q-bus, one known good RF74 system
disk, a replacement CPU and memory and replacement power supply,
and no external storage that machine checks as it did during the
air conditioning failure two weeks ago.
Is there anything about the backplanes, upper and/or lower, that could
result in a machine check? Everything else has been replaced at least
once or removed from the configuration.
Bruce
|
4366.5 | Swap The Box... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Mon May 05 1997 14:40 | 17 |
| : So we now have a machine with empty Q-bus, one known good RF74 system
: disk, a replacement CPU and memory and replacement power supply,
: and no external storage that machine checks as it did during the
: air conditioning failure two weeks ago.
Are there loading requirements on the Q-bus? Some systems required
a couple of boards in the Q-bus to prevent the power supply from
tripping out -- the supply assumed the unloaded box was a problem.
: Is there anything about the backplanes, upper and/or lower, that could
: result in a machine check? Everything else has been replaced at least
: once or removed from the configuration.
Swap the system into another box -- that looks like it's the only
remaining possibility. (Yes, there have been (a very few) cases
where the backplane has caused problems.)
|
4366.6 | | TEKVAX::KOPEC | Tom Kopec W1PF | Tue May 06 1997 07:29 | 13 |
| BTW, you probably shouldn't run the machine with the CPU door open
(that's the thing with the console connector etc on it).. the cooling
airflow gets screwed up.
I don't think that's THE problem, though.. (we used to run them like
that in the early debug)
Is there any interesting info in the crash (e.g. something that looks
like an address)? I don't remember what gets dumped on that error..
...tom
|
4366.7 | getting better... | SUBSYS::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Tue May 06 1997 10:59 | 31 |
|
Field Service came yesterday afternoon and replaced the backplane
and VTerm regulator. The machine ran all night with two RF36 drives,
KA690, 2-64MB memories and a 32MB memory, no errors. Field Service
got out the door before the machine crashed.
I swapped the power supply and fans between this machine and another
contracted machine that was running. This machine immediately
crashed again. I then swapped the power cord and VTerm regulators
between the two machines, and placed a piece of paper in the bottom
of the Qbus cage to block the air intake, forcing the air to be
dragged through the CPU/memory cage. Both machines then ran all night.
This morning we added an RF74 (user disk) and the VCB02 and ran it
for 3 hours without incident. (We did leave a piece of paper on the
bottom of the unused Qbus slots to increase the cooling in the
cpu/memory cage....is there a way to lock the fans in a `high speed'
mode?)
So with so much good news we shut down and added the second RF74, the
machine failed to reboot, it just froze part way through the boot and
ignored the halt switch.
We pulled the disk and installed that TF tape drive, the boot faulted on
a double machine check.
We pulled the other RF74 and rebooted w/o incident, and the machine is
running now, missing two RF74 drives.
Bruce
|
4366.8 | Questions... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Tue May 06 1997 16:08 | 14 |
|
There are black plastic "blank modules" that can be inserted into
the Q-bus backplane; blanks which may help with the airflow.
Can you move the disks into another DSSI enclosure for testing,
and install the cover over the tape cavity?
What kind of environmental sensors are in this system? Thermal?
Airflow? None? (If there are sensors, consider swapping them.)
You say "supply". I'm not as familiar with these boxes as I used
to be, but there used to be at least two power supplies in these
boxes...
|
4366.9 | more... | SUBSYS::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Tue May 06 1997 16:53 | 16 |
|
re: two supplies were in the 3000 machines, the 4000 `tombstone'
used just one supply, somewhat larger than the 3000
We ripped everything out of the machine and went to the computer room
to install the parts in another machine that runs fine. All the
storage worked fine on the alternate machine, as did the memory;
the CPU failed. A second brand new spare failed as well.
Putting the original machine back together with the cpu board
from the alternate machine allows the original to run.
If the machine runs overnight, we will put the original cpu
board back in and retest to see if the real problem was just
the backplane.
Bruce
|
4366.10 | | PROXY::J_EVANS | | Wed May 07 1997 10:12 | 5 |
| We had an IPMT case where the cable connecting the CPU to the CPU panel
went bad. Did you try swapping the cable and panel assembly?
jim
|
4366.11 | fixed...machine seems to be running now | FIEVEL::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Wed May 07 1997 11:19 | 33 |
| As always, there were no complicated problems, merely a number
of simple problems masquerading as a single complex problem.
1) replaced the dead VCB02 controller, but the replacement
was too down rev to work and we missed this. The second
replacement was procured and checked. Our assumption was
that the VCB02 eco was so long ago that all the product
would be at adequate revision...wrong.
2) the first, second, and third replacement KA690-AA CPU modules
were defective; in two machines they only ran for a few
hours at best. In general they would run until the machines
warmed up. We assumed that the probability of 3 bad CPU
boards in a row would be improbable unless the same process
error was involved in each...it probably was. The original
CPU board appears to be fine and is currently running this
workstation.
3) the backplane did have a problem, after its replacement we
could run the machine enough to troubleshoot the other
problem with the cpu.
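The reasoning in item 2 can be put as a quick back-of-envelope
calculation. This is only a sketch: the 2% per-board defect rate below
is an assumed, illustrative figure, not something measured in this
thread, and `prob_all_defective` is a hypothetical helper name. If
spares failed independently, three bad ones in a row would be very
unlikely, which is why a shared process error looks like the more
plausible explanation.

```python
# Illustrative only: the 2% per-board defect rate is an ASSUMED figure,
# not a number taken from this thread.
ASSUMED_DEFECT_RATE = 0.02


def prob_all_defective(per_board_rate: float, boards: int) -> float:
    """Chance that `boards` independently drawn spares are all defective."""
    return per_board_rate ** boards


if __name__ == "__main__":
    p = prob_all_defective(ASSUMED_DEFECT_RATE, 3)
    # With independent failures the chance of three bad boards in a row
    # is the per-board rate cubed.
    print(f"three bad boards in a row: {p:.6f}")
```

The independence assumption is what makes the number so small. Boards
from the same batch or the same repair process would fail together, and
the simple product above would no longer apply.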
thanks all for the help! I'll try to stop by and drop an update
if the machine is still running after the weekend (SHR plant has
a problem that heats this 10% of the building 10-15 degrees above
the rest of the plant on weekends...while it raises havoc with
MTBF on computers, if the machine is still running on Monday
morning it will probably run all week!)
Bruce
|