Title: | HSZ40 Product Conference |
|
Moderator: | SSDEVO::EDMONDS |
|
Created: | Mon Apr 11 1994 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 902 |
Total number of notes: | 3319 |
859.0. "disk error monitoring from Unix ?" by PANTER::MARTIN (Be vigilant...) Fri May 02 1997 09:15
Is there a nice way to monitor devices connected behind a HSZ
under Digital Unix ?
The customer has stripesets on HSZ40/50s and uses LSM mirroring
for data safety (SAP application).
He wants to be able to detect hardware errors on the physical disks,
e.g. bad blocks or a dead disk.
LSM does detect the bad plex and automatically removes
it from the volume, but the error log only shows which
LOGICAL disk is at fault (from the bus #, target # and LUN # info).
So the customer then has to go to the HSZ's console and type
HSZ> show disk full
(or run it from hszterm) to find out which PHYSICAL device is
at fault!
At present he uses a ksh script which runs a "show disk full"
from hszterm every 30 minutes!
These "show disk full" commands regularly cause SCSI bus
resets on both SCSI buses (the system has 2 KZPSAs connected
to 2 HSZ50s); this was already discussed in note #851 in this conference.
I told him to use Polycenter Console Manager for logging HSZ
console messages, but he says it's too expensive!
I suggested connecting a terminal with a hard-copy printer, but
he says that would be going back 10 years...
We thought about using a "tip" connection from the UNIX machine
through a serial line to get the HSZ's console messages.
Does this look like a reasonable/feasible approach?
Is there a better way to monitor the devices from HSZ/Unix? A
kind of "swxcrmon" would be welcome.
What we are actually trying to fix are these intermittent SCSI
bus resets; the SCSI buses have been checked (cables, terminators, etc.).
Thanks in advance for your advice. I have to confess that I personally
have almost no HSZ experience, as we don't have such equipment here;
we learn on the job, with the help of customers...
Cheers,
============================
Alain MARTIN/SSG Switzerland
************************ customer script ***************************
#!/bin/ksh
#
#
# The following script is intended to look over the HSZ controllers
# and monitor stripe sets and disks.
# This monitoring is done through the SCSI channels, using the utility
# hszterm
#
# Author : Felix Hassine, EDC, Philip MORRIS EU SA, 12th march 1997
#
# To test this script, set the TRACE variable to "1"
#
LIST_VOLUMES=$* #
TRAPNUMBER=1102
TRAPMIB="1.2.3"
datestamp=`date +"%H%M"`
#
#
trace() {
if [ "$TRACE" = "1" ]; then
echo $*
fi
}
COMMAND=`basename $0`
LOGFILE="/usr/spool/hsz/$COMMAND.`date +%H%M`"
LOGFILE1="${LOGFILE}.1"
LOGFILE2="${LOGFILE}.2"
log() {
echo "`date +%d-%b-%Y:%H:%M:%S` -- LOG -- $* "
}
sendalarm() {
TRAPNUMBER=1100
TRAPMIB="1.2.3"
trace " Sendalarm() : sending an alarm with text [$*] "
if [ "$TRACE" != "1" ]; then
send_trap $TRAPNUMBER $TRAPMIB "`hostname` : Critical: $* "
fi
echo "ALARM TO SEND $* "
}
#
if [ ! -d /usr/spool/hsz/ ]; then
mkdir -p /usr/spool/hsz/
fi
#
LOG_DISK_HSZ1=/usr/spool/hsz/hsz1_log.$datestamp
TEMPLATE_DISK_HSZ1=/usr/spool/hsz/hsz1_template
#
LOG_DISK_HSZ2=/usr/spool/hsz/hsz2_log.$datestamp
TEMPLATE_DISK_HSZ2=/usr/spool/hsz/hsz2_template
#
#
HSZ1=" -b3 -t1 -l7 " # SCSI coordinates for controller #1
HSZ2=" -b1 -t1 -l7 " # SCSI coordinates for controller #2
PATH=/usr/bin ; export PATH
#
# Step 0 : set up the TIME at same value as host's value
#
log " Step 0 : set up the TIME at same value as host's value "
controller=$HSZ1
time="`date +%d-%b-%Y:%H:%M:%S`"
trace "executing : [hszterm -o$LOGFILE1 $controller] "
hszterm -o$LOGFILE1 $controller <<AAC > /dev/null 2>&1
set this time=$time
sho disk
AAC
sleep 5
controller=$HSZ2
time="`date +%d-%b-%Y:%H:%M:%S`"
trace "executing : [hszterm -o$LOGFILE2 $controller] "
hszterm -o$LOGFILE2 $controller <<AAB > /dev/null 2>&1
set this time=$time
sho disk
AAB
#
# Step 1: Check for power supply failures
#
log " Step 1: Check for power supply failures"
for logfile in $LOGFILE1 $LOGFILE2
do
if [ `grep "bad power" $logfile | wc -l` -ne 0 ] ; then
if [ "$logfile" = "$LOGFILE1" ] ; then
sendalarm " Bad power detected in DISK CABINET/HSZ-1. More information in `hostname`:$logfile "
echo " Bad power detected in DISK CABINET/HSZ-1. More information in `hostname`:$logfile " >> /usr/spool/hsz/alarm_sent_1.$$
else
sendalarm " Bad power detected in DISK CABINET/HSZ-2. More information in `hostname`:$logfile "
echo " Bad power detected in DISK CABINET/HSZ-2. More information in `hostname`:$logfile " >> /usr/spool/hsz/alarm_sent_2.$$
fi
fi
done
rm $LOGFILE1 $LOGFILE2
#
# Step 2: Check for failed disks
#
controller=$HSZ1
if [ ! -f $TEMPLATE_DISK_HSZ1 ] ; then
trace "creating new TEMPLATE $TEMPLATE_DISK_HSZ1 "
hszterm -o $TEMPLATE_DISK_HSZ1 $HSZ1 "sho disk full" > /dev/null 2>&1
exit 0
fi
sleep 5
controller=$HSZ2
if [ ! -f $TEMPLATE_DISK_HSZ2 ] ; then
trace "creating new TEMPLATE $TEMPLATE_DISK_HSZ2 "
hszterm -o $TEMPLATE_DISK_HSZ2 $HSZ2 "sho disk full" > /dev/null 2>&1
exit 0
fi
rm -f $LOG_DISK_HSZ1 $LOG_DISK_HSZ2
hszterm -o $LOG_DISK_HSZ1 $HSZ1 "sho disk full" > /dev/null 2>&1
sleep 5
hszterm -o $LOG_DISK_HSZ2 $HSZ2 "sho disk full" > /dev/null 2>&1
trace "diff $LOG_DISK_HSZ1 $TEMPLATE_DISK_HSZ1 "
count=`diff $LOG_DISK_HSZ1 $TEMPLATE_DISK_HSZ1 | wc -l`
if [ $count -ne 0 ]; then
# THERE IS A DIFFERENCE ! WARN OPERATOR
sendalarm " Bad disk detected on HSZ controller 1 . Please call piquet person NOW "
echo " Bad disk detected on HSZ controller 1 . Please call piquet person NOW " >> /usr/spool/hsz/alarm_sent_1.$$
else
trace " No alarm sent"
rm $LOG_DISK_HSZ1
fi
trace "diff $LOG_DISK_HSZ2 $TEMPLATE_DISK_HSZ2 "
count=`diff $LOG_DISK_HSZ2 $TEMPLATE_DISK_HSZ2 | wc -l`
if [ $count -ne 0 ]; then
# THERE IS A DIFFERENCE ! WARN OPERATOR
sendalarm " Bad disk detected on HSZ controller 2 . Please call piquet person NOW "
echo " Bad disk detected on HSZ controller 2 . Please call piquet person NOW " >> /usr/spool/hsz/alarm_sent_2.$$
else
trace " No alarm sent"
rm $LOG_DISK_HSZ2
fi
exit 0
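The script above only counts the lines that diff reports, so the alarm says that something changed but not what. Since the goal is to find out which PHYSICAL device is at fault, a small helper along the following lines could pass the changed lines of the listing on to the operator. This is only a sketch: the function name changed_lines is made up for illustration, and it assumes that the "sho disk full" output is otherwise identical from one run to the next (anything in it that varies between runs would show up as a change).

```shell
#!/bin/sh
# changed_lines TEMPLATE CURRENT
# Print the lines of CURRENT that are absent from, or differ from, TEMPLATE.
# The function name is hypothetical; both arguments are plain-text files,
# for example the template and the latest listing saved by hszterm -o.
changed_lines() {
    template=$1
    current=$2
    # In default diff output, lines found only in the second file
    # are prefixed with "> "; keep those and strip the prefix.
    diff "$template" "$current" | sed -n 's/^> //p'
}
```

In the script it could be called as changed_lines $TEMPLATE_DISK_HSZ1 $LOG_DISK_HSZ1 and its output appended to the alarm text. Note that the template goes first here, which is the reverse of the argument order used in the count above.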
T.R | Title | User | Personal Name | Date | Lines |
---|
859.1 | | SMURF::KNIGHT | Fred Knight | Fri May 02 1997 10:20 | 15 |
| I know of 2 options:
1) The SWCC product. It is an online monitor for the HSZ*
much like the SWXCRMON program. It does require a
Windows or NT system on the network (last I knew
anyway).
2) Look at the Digital UNIX error log. As of V4.0A virtually
all HSZ events are put into the error log (you do need
DECevent to understand them however; uerf just dumps
the HSZ data in hex). For V3.2G and earlier, then get
the latest cam_disk.o patches (which includes this new
logging code as well).
Fred Knight
|
859.2 | | SSDEVO::T_GONZALES | | Fri May 02 1997 13:45 | 4 |
| I agree with Fred: the UNIX error logging will show errors that occur
on devices on the HSZ, even if they are part of a RAID set.
You could use tip, but it is very expensive in terms of CPU.
|
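If the console output does end up captured in a plain-text file (via tip or by any other means), a filter like the one below could pull out the lines of interest without touching the SCSI bus. This is a rough sketch under assumptions: the function name scan_console_log and the capture file are hypothetical, and the only keyword taken from this thread is the "bad power" string that the customer script already searches for; any other pattern would have to be chosen from real HSZ console output.

```shell
#!/bin/sh
# scan_console_log FILE PATTERN
# Print every line of FILE that matches PATTERN (an extended regular
# expression, matched case-insensitively), prefixed with its line number.
# FILE is assumed to be a plain-text capture of console output.
scan_console_log() {
    grep -n -i -E "$2" "$1"
}
```

For example, scan_console_log /usr/spool/hsz/console.log 'bad power' would list the matching lines; the path here is made up. grep returns a non-zero status when nothing matches, so the caller can use the exit status to decide whether an alarm is needed.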
859.3 | Thanks for your inputs... | LEMAN::MARTIN_A | Be vigilant... | Mon May 05 1997 07:03 | 10 |
| Thanks for your inputs.
The cam_disk.o patch was provided a few days ago; I hope it will
meet my customer's needs. He'll run emergency tests soon, so I'll
update this note if we encounter any surprises by then...
Cheers,
============================
Alain MARTIN/SSG Switzerland
|