
Conference ssdevo::hsz40_product

Title:	HSZ40 Product Conference

Moderator:	SSDEVO::EDMONDS

Created:	Mon Apr 11 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	902
Total number of notes:	3319

859.0. "disk error monitoring from Unix ?" by PANTER::MARTIN (Be vigilant...) Fri May 02 1997 08:15

    Is there a nice way to monitor devices connected behind an HSZ
    under Digital UNIX?
    
    The customer has stripe sets on HSZ40/50s and is using LSM mirroring
    for data safety (SAP application)!
    
    He wants to be able to detect HW errors on physical disks,
    e.g. bad blocks or a dead disk...
    
    Currently LSM does detect the bad plex and automatically removes
    it from the volume, but the error log can only show which
    LOGICAL disk is at fault (from the bus #, target # and LUN # info).
    
    So the customer then has to go to the HSZ's console and type
    	HSZ> show disk full 
    (or run it from hszterm) to find out which PHYSICAL device is
    at fault!
    
    Currently he uses a ksh script which runs a "show disk full"
    from hszterm every 30 minutes!!!
    
    These "show disk full" commands regularly cause SCSI bus
    resets on both SCSI buses (the system has 2 KZPSAs connected
    to 2 HSZ50s); this was already discussed in note #851 in this conference.
    
    I told him to use Polycenter Console Manager for logging HSZ
    console messages, but he says it's too expensive!
    I suggested connecting a terminal with a hard copy printer, but
    he says that would be going back 10 years...
    
    We thought about using a "tip" connection from the UNIX machine
    through a serial line to get the HSZ's console messages.
    Does this look like a reasonable/feasible approach?
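
For illustration only, here is a minimal sketch of what logging console output from a serial line could look like. It is not from the customer's setup: the function name `stamp_lines`, the device `/dev/tty01` and the log path are assumptions, and the serial port would still need to be configured (speed, parity) to match the HSZ maintenance port.

```shell
#!/bin/ksh
# Hypothetical sketch: prefix every line read from standard input
# with a timestamp, so console messages can be appended to a log file.
stamp_lines() {
    # Console lines usually end in CR LF; drop the carriage returns first.
    tr -d '\r' | while IFS= read -r line
    do
        printf '%s  %s\n' "$(date '+%d-%b-%Y:%H:%M:%S')" "$line"
    done
}

# Example use (assumed device and log path, adjust for the site):
#   stamp_lines < /dev/tty01 >> /usr/spool/hsz/console.log &
```

Unlike polling with "show disk full", a passive reader like this sends nothing to the controller; whether the HSZ reports the events of interest on its console unprompted would need to be checked.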
    
    Is there a better way to monitor the devices from HSZ/UNIX?
    Something like "swxcrmon" would be welcome.
    
    What we are actually trying to fix are these intermittent SCSI
    bus resets; the SCSI buses have been checked (cables, terminators, etc.).
    
    Thanks in advance for your advice. I have to confess that I personally
    have almost no HSZ experience, as we don't have such equipment here;
    we learn on the job, with the help of customers...
    
    Cheers,
    
    				============================
    				Alain MARTIN/SSG Switzerland
    
    
    
    ************************ customer script ***************************
    
    #!/bin/ksh
    #
    #
    # The following script is intended to look over the HSZ controllers
    # and monitor stripe sets and disks.
    # This monitoring is done through the SCSI channels, using the utility
    # hszterm
    #
    # Author : Felix Hassine, EDC, Philip MORRIS EU SA, 12th march 1997
    #
    # To test this script, set the TRACE variable to "1"
    #
    LIST_VOLUMES=$*                 #
    TRAPNUMBER=1102
    TRAPMIB="1.2.3"
    datestamp=`date +"%H%M"`
    #
    #
    trace() {
            if [ "$TRACE" = "1" ]; then
                    echo $*
            fi
    }
    
    COMMAND=`basename $0`
    LOGFILE="/usr/spool/hsz/$COMMAND.`date +%H%M`"
    LOGFILE1="${LOGFILE}.1"
    LOGFILE2="${LOGFILE}.2"
    
    log() {
            echo "`date +%d-%b-%Y:%H:%M:%S`  -- LOG -- $* "
    }
    
    sendalarm() {
            TRAPNUMBER=1100
            TRAPMIB="1.2.3"
            trace " Sendalarm() : sending an alarm with text [$*] "
            if [ "$TRACE" != "1" ]; then
                           send_trap $TRAPNUMBER $TRAPMIB "`hostname` : Critical: $* "
            fi
            echo "ALARM TO SEND  $* "
    }
    #
    if [ ! -d /usr/spool/hsz/ ]; then
            mkdir -p  /usr/spool/hsz/
    fi
    #
    
    LOG_DISK_HSZ1=/usr/spool/hsz/hsz1_log.$datestamp
    TEMPLATE_DISK_HSZ1=/usr/spool/hsz/hsz1_template
    #
    LOG_DISK_HSZ2=/usr/spool/hsz/hsz2_log.$datestamp
    TEMPLATE_DISK_HSZ2=/usr/spool/hsz/hsz2_template
    #
    #
    HSZ1=" -b3 -t1 -l7 "    # SCSI coordinates for controller #1
    HSZ2=" -b1 -t1 -l7 "    # SCSI coordinates for controller #2
    
    PATH=/usr/bin ; export PATH
    
    #
    # Step 0 : set up the TIME at same value as host's value
    #
    log " Step 0 : set up the TIME at same value as host's value "
    controller=$HSZ1
    time="`date +%d-%b-%Y:%H:%M:%S`"
    trace "executing : [hszterm -o$LOGFILE1 $controller] "
    hszterm -o$LOGFILE1 $controller <<AAC  > /dev/null 2>&1
            set this time=$time
            sho disk
    AAC
    
    sleep 5
    controller=$HSZ2
    time="`date +%d-%b-%Y:%H:%M:%S`"
    trace "executing : [hszterm -o$LOGFILE2 $controller] "
    hszterm -o$LOGFILE2 $controller <<AAB  > /dev/null 2>&1
            set this time=$time
            sho disk
    AAB
    
    #
    # Step 1: Check for power supply failures
    #
    log " Step 1: Check for power supply failures"
    for logfile in $LOGFILE1 $LOGFILE2
    do
            if [ `grep "bad power" $logfile | wc -l` -ne 0 ] ; then
                    if [ "$logfile" = "$LOGFILE1" ] ; then
                            sendalarm  " Bad power detected in DISK CABINET/HSZ-1. More information in `hostname`:$logfile "
                            echo " Bad power detected in DISK CABINET/HSZ-1. More information in `hostname`:$logfile " >> /usr/spool/hsz/alarm_sent_1.$$
                    else
                            sendalarm  " Bad power detected in DISK CABINET/HSZ-2. More information in `hostname`:$logfile "
                            echo " Bad power detected in DISK CABINET/HSZ-2. More information in `hostname`:$logfile " >> /usr/spool/hsz/alarm_sent_2.$$
                    fi
            fi
    done
    rm $LOGFILE1 $LOGFILE2
    
    #
    # Step 2: Check for failed disks
    #
    
    controller=$HSZ1
    if [ ! -f $TEMPLATE_DISK_HSZ1 ] ; then
            trace "creating new TEMPLATE $TEMPLATE_DISK_HSZ1 "
            hszterm -o $TEMPLATE_DISK_HSZ1  $HSZ1 "sho disk full"  > /dev/null 2>&1
            exit 0
    fi
    
    sleep 5
    controller=$HSZ2
    if [ ! -f $TEMPLATE_DISK_HSZ2 ] ; then
            trace "creating new TEMPLATE $TEMPLATE_DISK_HSZ2 "
            hszterm -o $TEMPLATE_DISK_HSZ2  $HSZ2 "sho disk full"  > /dev/null 2>&1
            exit 0
    fi
    
     rm -f $LOG_DISK_HSZ1 $LOG_DISK_HSZ2
     hszterm -o $LOG_DISK_HSZ1  $HSZ1  "sho disk full" > /dev/null 2>&1
     sleep 5
     hszterm -o $LOG_DISK_HSZ2  $HSZ2  "sho disk full" > /dev/null 2>&1
    
     trace "diff  $LOG_DISK_HSZ1 $TEMPLATE_DISK_HSZ1 "
     count=`diff $LOG_DISK_HSZ1 $TEMPLATE_DISK_HSZ1 | wc -l`
     if [ $count -ne 0 ]; then
            # THERE IS A DIFFERENCE ! WARN OPERATOR
            sendalarm " Bad disk detected on HSZ controller 1 . Please call piquet person NOW  "
            echo " Bad disk detected on HSZ controller 1 . Please call piquet person NOW " >>  /usr/spool/hsz/alarm_sent_1.$$
    else
            trace " No alarm sent"
            rm $LOG_DISK_HSZ1
     fi
    
     trace "diff  $LOG_DISK_HSZ2 $TEMPLATE_DISK_HSZ2 "
     count=`diff  $LOG_DISK_HSZ2 $TEMPLATE_DISK_HSZ2 | wc -l`
     if [ $count -ne 0 ]; then
            # THERE IS A DIFFERENCE ! WARN OPERATOR
            sendalarm " Bad disk detected on HSZ controller 2 . Please call piquet person NOW  "
            echo " Bad disk detected on HSZ controller 2 . Please call piquet person NOW " >>  /usr/spool/hsz/alarm_sent_2.$$
     else
            trace " No alarm sent"
            rm $LOG_DISK_HSZ2
     fi
    
     exit 0
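
The script above only reports that the "sho disk full" listing differs from the saved template, not which entry changed. As a hedged sketch (the helper name `compare_with_template` is hypothetical and not part of the customer script), the comparison could be factored into one function that also prints the differing lines, so the alarm text can carry the affected entries:

```shell
# Hypothetical helper: compare a fresh listing with a saved baseline.
# Prints the differing lines and returns 1 if the files differ,
# 0 if they are identical, and 2 if either file is missing.
compare_with_template() {
    current=$1
    template=$2
    if [ ! -f "$current" ] || [ ! -f "$template" ]; then
        echo "missing file: $current or $template" >&2
        return 2
    fi
    if cmp -s "$current" "$template"; then
        return 0
    fi
    # '<' lines come from the baseline, '>' lines from the fresh listing
    diff "$template" "$current" | grep '^[<>]'
    return 1
}

# Example use with the script's own variables:
#   changes=`compare_with_template $LOG_DISK_HSZ1 $TEMPLATE_DISK_HSZ1` ||
#           sendalarm " HSZ controller 1 listing changed: $changes "
```

This assumes the listing is stable between runs when nothing has failed; any field that changes on every run would trigger a false alarm and would have to be filtered out first.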

T.R	Title	User	Personal Name	Date	Lines
859.1		SMURF::KNIGHT	Fred Knight	Fri May 02 1997 09:20	15
	I know of 2 options:

	1) The SWCC product. It is an online monitor for the HSZ*, much like
	   the SWXCRMON program. It does require a Windows or NT system on
	   the network (last I knew, anyway).

	2) Look at the Digital UNIX error log. As of V4.0A, virtually all HSZ
	   events are put into the error log (you do need DECevent to
	   understand them, however; uerf just dumps the HSZ data in hex).
	   For V3.2G and earlier, get the latest cam_disk.o patches
	   (which include this new logging code as well).

	Fred Knight
859.2		SSDEVO::T_GONZALES		Fri May 02 1997 12:45	4
	I agree with Fred: the UNIX error logging will show errors that occur
	on devices on the HSZ, even if they are part of a RAID set.
	You could use tip, but it is very expensive in terms of CPU.
859.3	Thanks for your inputs...	LEMAN::MARTIN_A	Be vigilant...	Mon May 05 1997 06:03	10
	Thanks for your input.

	The cam_disk.o patch was provided a few days ago; I hope it will fit my
	customer's needs. He'll run emergency tests soon, so I'll update this
	note if we encounter any surprises by then...

	Cheers,

	============================
	Alain MARTIN/SSG Switzerland