T.R | Title | User | Personal Name | Date | Lines |
---|
6400.1 | Does not look right... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Mon Feb 17 1997 15:51 | 14 |
|
The analysis shows no sense keys. It does not seem right. Maybe Chris Loane will
chime in, but for some reason you got an error with no data to support it in the
log.
As for busting open the SBB, you should not have to, and you may end up eating
the cost if the dispensation we requested on the warranty sticker has not made
it in to manufacturing yet. Have your logistics folk find out why they can't get
the correct variant in stock.
Were/are the termination power jumpers installed in the device as per TIMA BLITZ
TD2041?
roger.
|
6400.2 | | KERNEL::LOANE | Comfortably numb!! | Tue Feb 18 1997 02:31 | 23 |
| >Decoded Instance Code is:-
> The disk device reported standard SCSI Sense Data. Check the service
> manual for the device for further instructions.
The Instance code SUGGESTS that the HSJ is about to log all the
extended Sense data, but....
> LONGWORD 16. 0070802A
> /*.p./
> LONGWORD 17. 00000000
> /..../
> LONGWORD 18. 00000A00
> /..../
> LONGWORD 19. 00000000
> /..../
> LONGWORD 20. 00000000
> /..../
......it's all zero.....this is very strange (i.e. nothing
useful/no further help). Were there ANY other errors logged at or
around the same time??
Chris
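For anyone reading the longword dump above: the text between the slashes under each longword looks like the same four bytes shown as ASCII, low byte first, with a dot for anything non-printable. Here is a minimal Python sketch of that rendering. The little-endian byte order and the function name `ascii_dump` are assumptions made for illustration, not anything taken from the HSJ or errorlog tools.

```python
import struct

def ascii_dump(longword_hex: str) -> str:
    """Render a 32-bit longword as 4 ASCII characters, low byte first.

    Assumption: the dump lists bytes in little-endian order and prints
    '.' for any byte outside the printable ASCII range.
    """
    value = int(longword_hex, 16)
    raw = struct.pack("<I", value)  # 4 bytes, least significant first
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)

# Longwords 16 to 20 as quoted in the note above.
LONGWORDS = {
    16: "0070802A",
    17: "00000000",
    18: "00000A00",
    19: "00000000",
    20: "00000000",
}

for index, text in LONGWORDS.items():
    print(f"LONGWORD {index}. {text}  /{ascii_dump(text)}/")
```

Under that assumption, longword 16 renders as `*.p.`, which matches the line printed beneath it in the log.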
|
6400.3 | GOOD, it wasn't just me ;^) | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Tue Feb 18 1997 07:43 | 7 |
|
You may want to hook up a printer to the console of the controller and see if
any events are being dumped to the console. It sounds like either something is
being incorrectly reported OR we are missing some magic key to unlock what IS
coming back.
roger.
|
6400.4 | Action plan | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Tue Feb 18 1997 13:55 | 24 |
| The customer's merged errorlog and the text output of SWEAT V2.7 are now
available on node MQOU27, decnet account (FAL$server). I found out that
the errorlog can be analysed on VAX VMS6.2 even though the errorlog is
from a 5.5-2 system. File names are MSE_errlog.sys & MSE_sweat.txt. I
will try to run DECEVENT on that file this afternoon.
By the way, I checked the jumper on the drive that was replaced this
weekend, and the jumper was missing. The only jumper present was 1-2. I
ran SCSIpro on that drive at the office and found no growing list out
of the drive. I tried running read scan, write verify, etc., but I
cannot get past block number 48000. I'm working on this right now. I ran
format successfully but still cannot get past block 48000.
The plan for tonight is to run DILX on the HSJ50 to exercise the disk
to see if we can get more accurate info out of the test. Then, maybe
format the drive if necessary. After that, we should change the drive's
slot in the BA356 and recreate the unit. FMU on both HSJs has not shown
any problem so far. I will hook up a printer to the HSJ also.
Regards,
Denis Ormaechea
DTN-632-7942
|
6400.5 | Troubleshooting results. | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Wed Feb 19 1997 19:49 | 365 |
| --------------------------------------------------------------------------------
DENIS ORMAECHEA <Troubleshooting results.> 19-FEB-1997 21:30
--------------------------------------------------------------------------------
18-Feb-97 Action
I was onsite almost all day to gather all possible information in the logs
about the problem since it started. The most important info was in the node TS4
errorlog, but the file had corrupted entries that needed to be fixed by RMS.
After fixing everything, I merged all errorlog info from the cluster since the
problem started (20-jan-97). Brought the merged errorlog and SWEAT text output to
the office on tape cartridge.
Copied the files to MQOU27 decnet's account and asked RDC to run DECEVENT
on them to get more info. The files are called MSE_errlog.sys, MSE_sweat.txt, and
MSE_decevent.txt.
My first action onsite at 17:30 was to run DILX on both drives to
get better info from the SCSI ASC/ASCQ status. The dua1000 disk showed errors
within 2 minutes with the following results:
This is the config :
*****************************************************************************
Controller:
HSJ50-AX ZG63300559 Firmware V50J-2, Hardware A01
Configured for dual-redundancy with ZG63100486
In dual-redundant configuration
SCSI address 6
Time: 18-FEB-1997 17:38:50
Host port:
Node name: HSJ010, valid CI node 6, 16 max nodes
System ID 420010061122
Path A is ON
Path B is ON
MSCP allocation class 1
TMSCP allocation class 1
CI_ARBITRATION = ASYNCHRONOUS
MAXIMUM_HOSTS = 31
NOCI_4K_PACKET_CAPABILITY
Cache:
128 megabyte write cache, version 3
Cache is GOOD
Battery is GOOD
No unflushed data in cache
CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
CACHE_POLICY = A
NOCACHE_UPS
HSJ010 > sho d1000
MSCP unit Uses
--------------------------------------------------------------
D1000 DISK150
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
State:
AVAILABLE
No exclusive access
PREFERRED_PATH = THIS_CONTROLLER
Size: 523366 blocks
HSJ010 > sho disk150
Name Type Port Targ Lun Used by
------------------------------------------------------------------------------
DISK150 disk 1 5 0 D1000
DEC EZ32 (C) DEC V064
Switches:
NOTRANSPORTABLE
TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
Size: 523366 blocks
Configuration being backed up on this container
HSJ010 > sho d1100
MSCP unit Uses
--------------------------------------------------------------
D1100 DISK210
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
State:
ONLINE to the other controller
No exclusive access
PREFERRED_PATH = OTHER_CONTROLLER
Size: 523366 blocks
HSJ010 > sho disk210
Name Type Port Targ Lun Used by
------------------------------------------------------------------------------
DISK210 disk 2 1 0 D1100
DEC EZ32 (C) DEC V064
Switches:
NOTRANSPORTABLE
TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
Size: 523366 blocks
Configuration being backed up on this container
These are the results:
*******************************************************************************
HSJ010 > run dilx
Disk Inline Exerciser - version 2.0
Note: DILX will only test units with a single physical device.
The Auto-Configure option will automatically select, for testing, half or
all of the disk units configured. It will perform a very thorough test with
*WRITES* enabled. Only disk units with a single physical device will be
tested. The user will only be able to select the run time and
performance summary options and whether to test a half or full configuration.
The user will not be able to specify specific units to test.
The Auto-Configure option is only recommended for initial installations.
Do you wish to perform an Auto-Configure (y/n) [n] ?
Use all defaults and run in read only mode (y/n) [y] ?n
Enter execution time limit in minutes (1:65535) [10] ?30
Enter performance summary interval in minutes (1:65535) [10] ?
Include performance statistics in performance summary (y/n) [n] ?
Display hard/soft errors (y/n) [n] ?y
Display hex dump of Error Information Packet Requester Specific
information (y/n) [n] ?
When the hard error limit is reached, the unit will be dropped from testing.
Enter hard error limit (1:65535) [65535] ?
When the soft error limit is reached, soft errors will no longer be
displayed but testing will continue for the unit.
Enter soft error limit (1:65535) [32] ?
Enter IO queue depth (1:12) [4] ?
*** Available tests are:
1. Basic Function
2. User Defined
Use the Basic Function test 99.9% of the time. The User Defined
test is for special problems only.
Enter test number (1:2) [1] ?1
**CAUTION**
If you answer yes to the next question, user data WILL BE destroyed.
Write enable disk unit(s) to be tested (y/n) [n] ?y
The write percentage will be set automatically.
Enter read percentage for Random IO and Data Intensive phase (0:100) [67] ?
Enter data pattern number 0=ALL, 19=USER_DEFINED, (0:19) [0] ?
Perform initial write (y/n) [n] ?y
The erase percentage will be set automatically.
Enter access percentage for Seek Intensive phase (0:100) [90] ?
Perform data compare (y/n) [n] ?y
Enter compare percentage (1:100) [5] ?50
Disk unit numbers available for testing on this controller include:
1000
1100
Enter unit number to be tested ?1000
Unit 1000 will be write enabled.
Do you still wish to add this unit (y/n) [n] ?y
Enter start block number (0:523365) [0] ?
Enter end block number (0:523365) [523365] ?
Unit 1000 successfully allocated for testing
Select another unit (y/n) [n] ?
DILX testing started at: 18-FEB-1997 17:57:01
Test will run for 30 minutes
Type ^T(if running DILX through VCS) or ^G(in all other cases)
to get a current performance summary
Type ^C to terminate the DILX test prematurely
Type ^Y to terminate DILX prematurely
Error Information Packet in hex
Cmd Ref Number 000010D5
Unit Number 000003E8
Log Sequence 0000002F
Format 02
Flags 40
Event Code 0000000B
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0005
Unit ID[0] 00000000
Unit ID[1] 02FF0000
Unit Software Rev 01
Unit Hardware Rev 34
Recovery Level 01
Retry Count 00
Serial Number 05590004
Header Code 00022B8F
Instance 0328450A
Template Type 51
Requestor Information Size 3C
Sense Key 01
ASC 17
ASQ 07
Error Information Packet in hex
Cmd Ref Number 000010D5
Unit Number 000003E8
Log Sequence 00000030
Format 02
Flags 80
Event Code 0000000B
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0005
Unit ID[0] 00000000
Unit ID[1] 02FF0000
Unit Software Rev 01
Unit Hardware Rev 34
Recovery Level 01
Retry Count 00
Serial Number 05590004
Header Code 00022B8F
Instance 0328450A
Template Type 51
Requestor Information Size 3C
Sense Key 01
ASC 17
ASQ 07
Error Information Packet in hex
Cmd Ref Number 00000000
Unit Number 00000000
Log Sequence 00000032
Format 00
Flags 02
Event Code 0000016A
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0000
Instance 03F40064
Template Type 41
Requestor Information Size 04
Bad Value Added Completion Status for unit 1000, end message in hex
Event Code 0043
Op Code 21
Cmd Ref Number 000017CE
Byte Count 00005A00
Error Byte Count 00000000
Sequence Number 0000
Flags 00
Error Information Packet in hex
Cmd Ref Number 000017CE
Unit Number 000003E8
Log Sequence 00000031
Format 02
Flags 40
Event Code 0000002B
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0005
Unit ID[0] 00000000
Unit ID[1] 02FF0000
Unit Software Rev 01
Unit Hardware Rev 34
Recovery Level 01
Retry Count 00
Serial Number 05590004
Header Code 00026B8B
Instance 031A4002
Template Type 51
Requestor Information Size 3C
Sense Key 04
ASC B0
ASQ 00
Error Information Packet in hex
Cmd Ref Number 00000000
Unit Number 00000000
Log Sequence 00000034
Format 00
Flags 02
Event Code 0000016A
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0000
Instance 03F40064
Template Type 41
Requestor Information Size 04
Error Information Packet in hex
Cmd Ref Number 000017CE
Unit Number 000003E8
Log Sequence 00000033
Format 02
Flags 00
Event Code 0000012B
Controller ID 63300559 012D0009
Controller SW ver 50
Controller HW ver 01
Multi Unit Code 0005
Unit ID[0] 00000000
Unit ID[1] 02FF0000
Unit Software Rev 01
Unit Hardware Rev 34
Recovery Level 01
Retry Count 00
Serial Number 05590004
Header Code 00026B8B
Instance 03134002
Template Type 51
Requestor Information Size 3C
Sense Key 04
ASC E0
ASQ 06
The unit status and/or the unit device type changed unexpectedly.
Unit 1000 dropped from testing
DILX Summary at 18-FEB-1997 17:58:34
Test minutes remaining: 29, expired: 1
Cnt err in HEX IC:03F40064 PTL:01/05/FF Key:06 ASC/Q:00/00 HC:0 SC:2
Total Cntrl Errs Hard Cnt 0 Soft Cnt 2
Unit 1000 Total IO Requests 6098
Err in Hex: IC 0328450A PTL:01/05/00 Key:01 ASC/Q:17/07 HC:0 SC:2
Err in Hex: IC 031A4002 PTL:01/05/00 Key:04 ASC/Q:B0/00 HC:0 SC:1
Err in Hex: IC 03134002 PTL:01/05/00 Key:04 ASC/Q:E0/06 HC:1 SC:0
Total Errs Hard Cnt 1 Soft Cnt 3
The unit status and/or the unit device type changed unexpectedly.
Unit 1000 dropped from testing
Reuse Parameters (stop, continue, restart, change_unit) [stop] ?
DILX - Normal Termination
************************************************************************
Also had these errors:
Unit 1100 Total IO Requests 1136
Err in Hex: IC 0326450A PTL:02/01/00 Key:03 ASC/Q:80/00 HC:1 SC:0
Err in Hex: IC 031A4002 PTL:02/01/00 Key:04 ASC/Q:B0/00 HC:0 SC:1
Err in Hex: IC 03134002 PTL:02/01/00 Key:04 ASC/Q:E0/06 HC:1 SC:0
Total Errs Hard Cnt 2 Soft Cnt 1
The unit status and/or the unit device type changed unexpectedly.
Unit 1100 dropped from testing
******************************************************************************
Troubleshooting eliminated the following:
HSJ50 Controller       : By running tests from both controllers on both drives.
SCSI cables, BA356 bus : By interchanging drives (on different buses) and
                         again running DILX on both drives from both controllers.
BA356 slot             : Same drive was failing.
SCSI terminators
After T/S, I put the previously replaced drive back in the SBB and ran the
same tests. All tests ran fine. The customer ran INIT/erase on both units and
put them back in their respective shadowsets.
I have ordered an EZ32-VW (whole SBB swap unit) from SR17 with an ETA of 3-March.
Also contacted the customer today and there were still no errors.
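The Error Information Packets in the DILX output in this note are printed in hex, so some fields are easier to follow once converted. For example, Unit Number 000003E8 is 1000 decimal, i.e. unit D1000. Here is a minimal Python sketch that reads one packet as pasted here into a dictionary. The regular expression and the name `parse_packet` are assumptions for illustration and only cover the "name, then hex value" layout seen in this note.

```python
import re

# A field line is a name followed by one hex value of 2 to 8 digits,
# optionally followed by further 8-digit values (e.g. Controller ID).
FIELD = re.compile(r"(?P<name>.+?)\s+(?P<value>[0-9A-F]{2,8}(?:\s+[0-9A-F]{8})*)")

def parse_packet(lines):
    """Return {field name: hex text} for lines that look like fields."""
    fields = {}
    for line in lines:
        match = FIELD.fullmatch(line.strip())
        if match:  # header lines such as "... Packet in hex" do not match
            fields[match.group("name")] = match.group("value")
    return fields

# Excerpt of the first packet logged for unit 1000 in this note.
PACKET = """\
Error Information Packet in hex
Cmd Ref Number 000010D5
Unit Number 000003E8
Log Sequence 0000002F
Format 02
Flags 40
Event Code 0000000B
Controller ID 63300559 012D0009
Instance 0328450A
Template Type 51
Sense Key 01
ASC 17
ASQ 07
""".splitlines()

fields = parse_packet(PACKET)
unit = int(fields["Unit Number"], 16)
sequence = int(fields["Log Sequence"], 16)
print(f"unit {unit}, log sequence {sequence}, "
      f"key {fields['Sense Key']}, ASC/ASCQ {fields['ASC']}/{fields['ASQ']}")
```

Keeping the values as hex text and converting only the fields of interest avoids guessing at the meaning of fields such as Flags or Template Type.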
|
6400.6 | ok, | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Thu Feb 20 1997 11:11 | 42 |
| Let's see...
DILX Summary at 18-FEB-1997 17:58:34
Test minutes remaining: 29, expired: 1
Cnt err in HEX IC:03F40064 PTL:01/05/FF Key:06 ASC/Q:00/00 HC:0 SC:2
Total Cntrl Errs Hard Cnt 0 Soft Cnt 2
Unit 1000 Total IO Requests 6098
Err in Hex: IC 0328450A PTL:01/05/00 Key:01 ASC/Q:17/07 HC:0 SC:2
Err in Hex: IC 031A4002 PTL:01/05/00 Key:04 ASC/Q:B0/00 HC:0 SC:1
Err in Hex: IC 03134002 PTL:01/05/00 Key:04 ASC/Q:E0/06 HC:1 SC:0
Total Errs Hard Cnt 1 Soft Cnt 3
The unit status and/or the unit device type changed unexpectedly.
Unit 1000 dropped from testing
Reuse Parameters (stop, continue, restart, change_unit) [stop] ?
Unit 1100 Total IO Requests 1136
Err in Hex: IC 0326450A PTL:02/01/00 Key:03 ASC/Q:80/00 HC:1 SC:0
Err in Hex: IC 031A4002 PTL:02/01/00 Key:04 ASC/Q:B0/00 HC:0 SC:1
Err in Hex: IC 03134002 PTL:02/01/00 Key:04 ASC/Q:E0/06 HC:1 SC:0
Total Errs Hard Cnt 2 Soft Cnt 1
The unit status and/or the unit device type changed unexpectedly.
Unit 1100 dropped from testing
Unit 1000 had a couple of recoverable errors then disappeared. (E0 and B0 are HSJ events)
Unit 1100 had 1 hard error then disappeared. (E0 and B0 are HSJ events)
Seems strange. Both units are just dropping out of sight. I see from the logs
they are on different ports, so that kinda rules out a power/bus issue.
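The reading above can be reproduced mechanically from the DILX summary lines. Here is a minimal Python sketch, assuming the line layout quoted in this note. The sense key names are the standard SCSI ones, and ASC values of 80 hex and up are vendor-specific in SCSI, which fits the B0 and E0 codes being HSJ events. The regular expression, the function names, and treating the HC/SC counts as hex are assumptions for illustration.

```python
import re

# Standard SCSI sense key names (subset); not HSJ-specific.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

SUMMARY_LINE = re.compile(
    r"IC:?\s*(?P<ic>[0-9A-F]{8})\s+"
    r"PTL:(?P<port>[0-9A-F]{2})/(?P<target>[0-9A-F]{2})/(?P<lun>[0-9A-F]{2})\s+"
    r"Key:(?P<key>[0-9A-F]{2})\s+"
    r"ASC/Q:(?P<asc>[0-9A-F]{2})/(?P<ascq>[0-9A-F]{2})\s+"
    r"HC:(?P<hc>[0-9A-F]+)\s+SC:(?P<sc>[0-9A-F]+)"
)

def parse_summary_line(line):
    """Return the fields of one summary line as integers, or None."""
    match = SUMMARY_LINE.search(line)
    if match is None:
        return None
    # Assumption: every field, including the counts, is printed in hex.
    return {name: int(text, 16) for name, text in match.groupdict().items()}

def describe(rec):
    key_name = SENSE_KEYS.get(rec["key"], "other")
    origin = "vendor-specific ASC" if rec["asc"] >= 0x80 else "standard ASC"
    return (f"port {rec['port']} target {rec['target']}: "
            f"key {rec['key']:02X} ({key_name}), "
            f"ASC/Q {rec['asc']:02X}/{rec['ascq']:02X} ({origin}), "
            f"hard {rec['hc']} soft {rec['sc']}")

SAMPLE = [
    "Err in Hex: IC 0328450A PTL:01/05/00 Key:01 ASC/Q:17/07 HC:0 SC:2",
    "Err in Hex: IC 031A4002 PTL:01/05/00 Key:04 ASC/Q:B0/00 HC:0 SC:1",
    "Err in Hex: IC 03134002 PTL:01/05/00 Key:04 ASC/Q:E0/06 HC:1 SC:0",
    "Err in Hex: IC 0326450A PTL:02/01/00 Key:03 ASC/Q:80/00 HC:1 SC:0",
]

records = [parse_summary_line(line) for line in SAMPLE]
for rec in records:
    print(describe(rec))
```

Grouping the records by port shows the two PTL values (01/05/00 and 02/01/00) side by side, and the per-port hard and soft counts can be checked against the "Total Errs" lines.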
I see you did some moving around and reseating of hardware, did you change
anything or did the units just start running?
Did you have a "regular" disk to also use to make sure you did not have a non-ez
problem?
If these units fail again, escalate a case to engineering and have those units
analyzed to make sure you are not fighting a symptom of "something else".
roger.
|
6400.7 | Confusion here... Sorry! | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Fri Feb 21 1997 09:46 | 26 |
|
Roger,
Let me apologize for the confusion here. With the cut and paste
I did from my document, my intention was to show you that I had two
kinds of error string out of DILX on the first instance code line:
ASC/Q:80/00 & 17/07.
The two units that you see in the report are actually the same
physical drive, but in a different configuration during the
troubleshooting step. To answer your question about the hardware
change, the unit that I was troubleshooting had a solid problem even
after moving the unit around and putting it back the way it was. I've
put the original unit back in the SBB because I had it with me,
and the ETA for the new EZ32-VW is March-03. The original unit failed
on Feb-13, but may only have an intermittent problem (with the same
symptoms), so I think that this unit is not reliable.
In conclusion, I think that I had 3 bad units in a row !!!!
Regards,
Denis
|
6400.8 | I hope not. | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Fri Feb 21 1997 09:50 | 7 |
|
Three bad is REALLY BAD luck or I'm about to get a LOT busier ;^)
I am going on vacation next week (yes, even I do take vacations ;^) but if you
want me to look at the bad unit, send me mail offline.
roger.
|