Title: HSJ30/40 Product Conference
Moderator: SSDEVO::EDMONDS
Created: Mon Jul 12 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1264
Total number of notes: 4958
1245.0. "instance code=01cc3002/0122330a ?" by MANM01::NOELGESMUNDO () Mon May 05 1997 03:46
Hello!
The other day, we installed a new HSJ40C on a separate BA350-MB in Smart's
SW800. This controller is connected to 5 BA356 bays; the last cable is left
unconnected. The controller has a 32 MB cache but no battery, so we set it to
read cache only pending the arrival and installation of the battery. We
installed the third member of each of the 3 shadow sets plus 2 TZ87 tape drives
on one of the bays.
In the early morning of 29 April, we installed the said controller and upgraded
the firmware of all the disks to the latest version. On the evening of 30 April
(20:04), the customer called to report that the shadow members residing on
the second HSJ40 controller had suddenly become 'online' instead of the usual
'member of DSAx'. The tape drives, though, were still available, and a
standalone disk on the same controller remained 'mounted'. They manually
remounted the members into their shadow sets and no further errors were
encountered.
This morning, I investigated the problem and found the following information:
1. OPERATOR.LOG
%%%%%%%%%%% OPCOM 30-apr-1997 20:04:33.77 %%%%%%%%%%
Message from user INTERnet on SMART
TELNET Login Request from remote Host: 31.0.1.2 Port:1066
%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.40 %%%%%%%%%%
DSA3: shadow set has changed state.
Mount verification in progress
%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.40 %%%%%%%%%%
DSA1: shadow set has changed state.
Mount verification in progress
%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.43 %%%%%%%%%%
DSA2: shadow set has changed state.
Mount verification in progress
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA12: (HSJ410) has been removed from shadow set.
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA9: (HSJ410) has been removed from shadow set.
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA6: (HSJ410) has been removed from shadow set.
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA3:
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA1:
%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA2
2. ERROR LOG:
*************************ENTRY 36292. **************************
error sequence 2165. logged on: cpu_type 00000005
date/time: 30-apr-1997 20:04:51.39 SYS_TYPE 0000000C
SYSTEM UPTIME : 8 DAYS 14:39:09
SCS NODE: SMART OPENVMS AXP 6.2
HW_MODEL : 0000044E HARDWARE MODEL = 1102
ERL$LOGMESSAGE ALPHASERVER 8400 5/300
CIXCD SUB-SYSTEM _SMART$PNA0:
PORT HAS CLOSED VIRTUAL CIRCUIT
LOCAL STATION ADDRESS, 6(X)
LOCAL SYSTEM ID, 408(X)
REMOTE STATION ADDRESS, 4(X)
REMOTE SYSTEM ID, 4200100400122(X)
.
.
.
*************************ENTRY 36296. **************************
error sequence 2169. logged on: cpu_type 00000005
date/time: 30-apr-1997 20:05:21.73 SYS_TYPE 0000000C
SYSTEM UPTIME : 8 DAYS 14:39:09
SCS NODE: SMART OPENVMS AXP 6.2
HW_MODEL : 0000044E HARDWARE MODEL = 1102
ERL$LOGMESSAGE ALPHASERVER 8400 5/300
MESSAGE TYPE 0000B DATAGRAM FOR NON-EXISTING "UCB"
CLASS DRIVER 4B534944 /DISK/
.
. UNIQUE IDENTIFIER, 964401993(X)
MASS STORAGE CONTROLLER
MODEL = 40
SEQUENCE #11
CONTROLLER LOG
NON-ERROR/INFORMATIONAL EVENT
CONTROLLER ERROR
DEVICE INTERFACE HW ERROR
CONTROLLER VERSION #39
CONTROLLER HARDWARE VERSION #11
Above message appears several times after this entry.
3. DECEVENT:
***************************** ENTRY 36292 ***********************
TIMESTAMP 30-APR-1997 20:04:51
SYSTEM UPTIME IN SECONDS 743949
FLAGS X0001 DYNAMIC RECOGNITION PRESENT
----DEVICE PROFILE----
PRODUCT NAME CIMNA XMI TO CI PORT
UNIT NAME SMART$PNA
UNIT NUMBER 0
DEVICE CLASS CONTROLLER
***************************** ENTRY 36293 ************************
SOFTWARE PARAMETERS
HSX01 MSCP VIRTUAL DISK
HSJ410$DUA
UNIT NUMBER 12.
UCB$X_STS X08001010 ONLINE
UNLOAD AT DISMOUNT
UNIT SUPPORTS THE EXTENDED FUNCTION BIT
***************************** ENTRY 36294 ************************
SOFTWARE PARAMETERS
HSX01 MSCP VIRTUAL DISK
HSJ410$DUA
UNIT NUMBER 9.
UCB$X_STS X08001010 ONLINE
UNLOAD AT DISMOUNT
UNIT SUPPORTS THE EXTENDED FUNCTION BIT
***************************** ENTRY 36295 ************************
SOFTWARE PARAMETERS
HSX01 MSCP VIRTUAL DISK
HSJ410$DUA
UNIT NUMBER 6.
UCB$X_STS X08001010 ONLINE
UNLOAD AT DISMOUNT
UNIT SUPPORTS THE EXTENDED FUNCTION BIT
***************************** ENTRY 36296 ************************
LOGGED MSCP MESSAGE
FM DEVICE CLASS NOT DEFINED
NO UNIT IN DATAGRAM MESSAGE
LOGGED MESSAGE FORMAT 0 CONTROLLER ERROR
MSCP FLAGS X02 INFORMATIONAL
MSCP EVENT CODE X016A MAJOR EVENT = CONTROLLER ERROR
SUB-EVENT = DRIVE INTERFACE HARDWARE ERROR
INSTANCE CODE X03F40064 DEVICE SERVICES HAD TO RESET THE PORT TO
CLEAR A BAD CONDITION. NOTE THAT IN THIS
INSTANCE THE ASSOCIATED TARGET, ASSOCIATED
ASC, AND ASSOCIATED ASCQ FIELDS ARE UNDEFINED
COMPONENT ID = DEVICE SERVICES
EVENT NUMBER = X000000F4
REPAIR ACTION = X000000
NR THRESHOLD = X 00000064
TEMPLATE X41 DEVICE NON-TRANSFER ERROR
****************************ENTRY 36297 ***************************
INSTANCE CODE X01010302 AN UNRECOVERABLE HARDWARE DETECTED FAULT
OCCURRED
COMPONENT ID = EXECUTIVE SERVICES
EVENT NUMBER = X0000001
REPAIR ACTION = X000003
NR THRESHOLD = X 0000002
TEMPLATE X01 LAST FAILURE EVENT
LAST FAILURE CODE X018B2580 COMPONENT ID = EXECUTIVE SERVICES
EVENT NUMBER = X0000008B
REPAIR ACTION = X000025
FLAG = 1, HARDWARE DETECTED FAULT.
RESTART CODE = FULL FIRMWARE RESTART
PARAMETER COUNT = 0.
AN NMI INTERRUPT WAS GENERATED WITH AN
INDICATION THAT A MEMORY SYSTEM PROBLEM
OCCURRED.
**************************** ENTRY 36298 **********************************
INSTANCE CODE X01CC3002 THE CACHE10 DRAB DETECTED A WRITE DATA
PARITY ERROR DURING A HOST PORT ATTEMPT
COMPONENT ID = EXECUTIVE SERVICES
EVENT NUMBER = X000000CC
REPAIR ACTION = X0000030
NR THRESHOLD = X 0000002
**************************** ENTRY 36299 **********************************
INSTANCE CODE X0122330A AN ERROR CONDITION DETECTED BY ONE
COMPONENT ID = EXECUTIVE SERVICES
EVENT NUMBER = X00000022
REPAIR ACTION = X0000032
NR THRESHOLD = X 000000A
TEMPLATE X14 MEMORY SYSTEM FAILURE
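In the DECevent entries above, each instance code appears to break down into four one-byte fields: component ID, event number, repair action, and NR threshold. The byte layout below is inferred from these listings only and is not taken from the controller documentation; the function and field names are made up for illustration. A minimal Python sketch:

```python
def decode_instance_code(code: int) -> dict:
    """Split a 32-bit instance code into four byte-wide fields.

    Assumed layout (inferred from the DECevent output in this note):
    bits 31-24 component ID, 23-16 event number,
    15-8 repair action, 7-0 NR threshold.
    """
    return {
        "component_id": (code >> 24) & 0xFF,
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


# Instance codes copied from entries 36296, 36297 and 36298
for code in (0x03F40064, 0x01010302, 0x01CC3002):
    fields = decode_instance_code(code)
    print(f"X{code:08X}: " + ", ".join(f"{k}=X{v:02X}" for k, v in fields.items()))
```

Applied to X0122330A, this layout gives a repair action of X33, while entry 36299 as typed above shows X0000032, so one of the two values may have been mistyped.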
We surfed the web on COMET and found some suggestions:
1 - replace cache module
2 - replace controller
3 - upgrade to 16-port star coupler
I could not see the logic for upgrading to a 16-port coupler, but it seems to
have worked for the similar problems reported. The thing is, only one of the
controllers (the new one without a battery) is having problems. The other
controller seems okay.
Any help would be appreciated.
Thanks.
Noel Gesmundo
MCS/Digital Equipment Filipinas Inc.
T.R | Title | User | Personal Name | Date | Lines |
---|
1245.1 | 16 node should fix things for you | SSDEVO::RMCLEAN | | Mon May 05 1997 10:54 | 3 |
| The 16 node coupler does usually help. It's all black magic! It seems that
the different coupler changes the characteristics of the CI load and makes
things work better/different.
|
1245.2 | Shadow Member Timeout too low to "ride through" | MSE1::BURKE | | Mon May 05 1997 12:00 | 10 |
| Hi,
As Ron stated, the incidence of this problem may well go down or go
away with a 16-node coupler. However, the loss of shadow members is
likely to have been due to the setting of the SYSGEN parameter
SHADOW_MBR_TMO. It looks from your console entries as though this is
set to 20 seconds. For Shadowing to "ride through" events like this,
the recommended setting when HSJs are used is 120 seconds.
|
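The 20-second estimate in 1245.2 can be checked against the OPERATOR.LOG timestamps in 1245.0: mount verification began at 20:04:51.40 and the members were removed at 20:05:11.54. A small Python sketch of that arithmetic follows; the helper name is hypothetical, and it assumes both stamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two OPCOM time-of-day stamps, e.g. '20:04:51.40'.

    Assumes both stamps fall on the same calendar day.
    """
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Times copied from the OPERATOR.LOG excerpt in 1245.0
verification_started = "20:04:51.40"  # DSA1/DSA3: mount verification in progress
members_removed = "20:05:11.54"       # $6$DUA6/9/12 removed from shadow sets

elapsed = seconds_between(verification_started, members_removed)
print(f"{elapsed:.2f} s")
```

The gap works out to about 20.14 seconds, which is consistent with a member timeout of 20 seconds expiring before the controller finished restarting.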
1245.3 | any connection to battery? | MANM01::NOELGESMUNDO | | Tue May 06 1997 06:34 | 9 |
| Hi!
Thank you for the input. How about the fact that the HSJ40 has no
battery? Could this have caused the memory failure and affected the
shadow members? Would a battery have prevented this problem?
Thanks again.
Noel
|
1245.4 | Nope | SSDEVO::RMCLEAN | | Tue May 06 1997 09:31 | 3 |
| A battery will make no difference in this case. A battery only allows you to
use write-back caching, which improves performance. Setting the timeout value
will do more than anything else for you.
|