T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
324.1 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Fri Feb 28 1997 09:24 | 9 |
| Michael,
are you sure the VAX is running V2.3 of RAID? Any cluster configuration
is a legal RAID Software configuration (see the SPD for details). In this
scenario none of the members should have been removed under V2.3 RAID.
I'll run a test here.
Guenther
|
324.2 | RAID5 config with problems? | TAGEIN::GRUENWALD | | Mon Mar 03 1997 06:01 | 12 |
| Guenther,
> are you sure the VAX is running V2.3 of RAID?
Yes, we are. That customer site is located in Hungary. I just spoke to
the colleague there. He will get the data I asked for.
There weren't any ECOs installed!
Regards
Michael
|
324.3 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Mon Mar 03 1997 16:54 | 6 |
| Michael,
I could, to my surprise, reproduce this behavior here. Now I have to find
out where RAID lost its mind and does this.
Guenther
|
324.4 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Tue Mar 04 1997 11:37 | 22 |
| There's a quick solution to that. Disable the timeout action on all
RAID 5 arrays with:
$ RAID MODIFY/NOTIMEOUT array-ID
/TIMEOUT is the default and allows the driver to decide to
remove a member device if an I/O to the disk driver does not return
within the timeout period (default is 30 sec.).
The disadvantage of using /NOTIMEOUT is that a RAID 5 array may hang
unnecessarily. Assume the member devices are connected to different
controllers and just one member device becomes unavailable. With
/NOTIMEOUT the RAID driver waits for the device to become accessible
again, which may take minutes or hours. Until then the whole array is
inaccessible. With /TIMEOUT=n in place the array would be reduced after
n seconds and be accessible again. This is a tradeoff the
users have to make.
Instead of choosing /NOTIMEOUT they can use /TIMEOUT=n with n large
enough so that a serving node can reboot.
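For example (just a sketch - array-ID stands for the actual array name, and
300 seconds is merely an assumed upper bound for a serving node's reboot time):
$ RAID MODIFY/TIMEOUT=300 array-ID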
Guenther
|
324.5 | SCSI patch solved the problem | BPSTGA::TORONY | Gábor Tornyossy | Wed Mar 05 1997 07:18 | 25 |
| Guenther,
I assume that the SCSI patch (ALPSCSI02_070) solved the problem.
We installed the patches suggested by Michael step by step, and when we arrived
at that patch this "member_0_reconstructing" behaviour disappeared (there were
writes to that raidset while the Alpha was taken down, ...).
Some additional things: at the office I set up a test environment where
(without any patches) I couldn't reproduce the problem.
My environment: a DEC 3000/300 with a TURBOchannel SCSI adapter (PMAZB) and a
VAX 3100/78 in an NI cluster (the software versions were the same as at the customer site).
The differences between this and the site are the machines, the PCI/SCSI
adapter (KZPSA) and the DSSI as an additional cluster interconnect.
It taught me a lesson: we should somehow be able to declare which patches go
where before a problem arises (= which patches are MUP-like). We know (or believe
we know) of the existence of patches, and there are summaries for the different
operating system versions. But there are too many of them to read through them all...
How do you solve it? Can you help?
Thanks,
Gábor
(the involved engineer in Hungary)
|
324.6 | | NABSCO::FROEHLIN | Let's RAID the Internet! | Wed Mar 05 1997 09:33 | 34 |
| Gábor,
the problem mentioned by Michael in .0 has nothing to do with hardware
or patches. Let me explain:
RAID$DPRIVER does the RAID member management. If an I/O to the underlying
driver (DS/DU/DKdriver) returns with an error and the RAID 5 array is
in normal state, the member will be removed. When a disk server
disappears with outstanding I/Os, those I/Os are not returned while
the disk is in mount verification.
But RAID$DPRIVER can time out the I/Os if told to do so (RAID MODIFY/TIMEOUT=n).
An error status returned for a member disk I/O starts a removal
request for an array. The driver function goes through all members of an
array with such a request and checks members 0 to n. If all disks of
this array were served by a disappearing disk server and there
were active I/Os, it is likely that more than one member has a removal
request.
The driver function starts with the first member and, if the array is
in normal state, removes it. Then it checks the next member.
But since the array is now reduced, no further member can be removed, and
the DPA device enters mount verification.
Starting with V2.3 the driver now waits for 2 seconds before working on
a removal request. If more than one member needs to be removed, the driver
skips all removal requests. But if after 2 seconds there's exactly one
member with a removal request, it will be removed. If there have been no
I/Os at all during the time the disk server reboots, no member will be
removed.
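If it helps, here is a minimal sketch in C of that V2.3 decision logic (all
names and types are illustrative assumptions of mine, not the actual
RAID$DPRIVER source):

#include <stdio.h>
#include <stdbool.h>

#define NMEMBERS 4                     /* illustrative RAID 5 member count */

typedef enum { NORMAL, REDUCED } array_state_t;

typedef struct {
    bool removal_requested;            /* an I/O to this member failed */
    bool removed;
} member_t;

/* Runs about 2 seconds after the first removal request arrived (V2.3). */
static void process_removal_requests(member_t m[], array_state_t *state)
{
    int pending = 0;
    for (int i = 0; i < NMEMBERS; i++)
        if (m[i].removal_requested)
            pending++;

    if (pending != 1) {
        /* Zero or several failing members (e.g. a rebooting disk server
         * took them all away at once): skip every removal request, since
         * removing more than one member is fatal for a RAID 5 set.  The
         * DPA unit enters mount verification instead. */
        for (int i = 0; i < NMEMBERS; i++)
            m[i].removal_requested = false;
        return;
    }

    /* Exactly one member with a removal request: reduce the array. */
    for (int i = 0; i < NMEMBERS; i++) {
        if (m[i].removal_requested && *state == NORMAL) {
            m[i].removed = true;
            m[i].removal_requested = false;
            *state = REDUCED;
            printf("member %d removed, array reduced\n", i);
        }
    }
}

int main(void)
{
    member_t m[NMEMBERS] = {{false, false}};
    array_state_t state = NORMAL;

    /* Disk server reboot with active I/Os: every member errors out,
     * so nothing gets removed. */
    for (int i = 0; i < NMEMBERS; i++)
        m[i].removal_requested = true;
    process_removal_requests(m, &state);

    /* A single bad disk: exactly one request after the 2-second wait,
     * so that member is removed. */
    m[2].removal_requested = true;
    process_removal_requests(m, &state);
    return 0;
}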
Hope this helps!
Guenther
|
324.7 | | BPSTGA::TORONY | Gábor Tornyossy | Thu Mar 06 1997 10:23 | 36 |
| Guenther,
thank you for the info. It's like a "technical liberal education" in the topic
of software RAID. This is why one reads the notes. I'm serious.
But it still leaves the problem (and the customer) in a state of uncertainty:
As of .1 - it should work without reconstructing,
.2 - no, it doesn't (= it's a feature),
.4 - use /NOTIMEOUT. ---> Changing it dynamically? Too strange to leave this to the
customer without giving any suggestions.
.6 - it's obvious that it removes the first member... ---> How is it to be used in a
cluster environment then, anyhow?
Still not clear: is it raining or is the sun shining? We smile, o.k., but in a
swimsuit or under an umbrella?
. What are the expectations in a cluster environment? Is it normal that when one
cluster member comes back (for example after an AUTOGEN/reboot) while the other
node keeps working, the RAID set will be reconstructing? Yes or no, and how do we handle it?
. There were tests with and without that patch (or patches), causing changes in
the behaviour. Right, nothing to do with that - then why? And shall we say to the
customer that although the problem seems to have disappeared nothing has been
solved - so don't use it? What's the answer to his next obvious question?
The customer has decided to use it (after the successful test, as we said it's
okay). What will happen during the next reboot?
This is all the more important since he is about to decide whether to leave or keep Digital
customer services and, what's more, the platform!
Regards,
Gábor
|
324.8 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Thu Mar 06 1997 11:58 | 53 |
| Gábor,
>Still not clear: is it raining or is the sun shining? We smile, o.k., but in a
>swimsuit or under an umbrella?
If it's raining reconstructs, use the /NOTIMEOUT umbrella to bring out
your smile ;-).
>. What are the expectations in a cluster environment? Is it normal that when one
>cluster member comes back (for example after an AUTOGEN/reboot) while the other
>node keeps working, the RAID set will be reconstructing? Yes or no, and how do we handle it?
The point is not a general cluster member but a disk server. Some
cluster nodes might serve their local disks to other nodes and are
therefore disk servers as well. Rebooting such a "disk server" with
active I/Os to the disks will cause the disks (I'm talking about physical disks
like the RAID set members) to enter mount verification. The RAID software
has a special feature built in for RAID 5 arrays, which can tolerate the
loss of one member. This feature is a timeout on disk I/Os which is
turned on by default. The idea behind it is to remove a hindering member
quickly and continue with the remaining members instead of stalling
the whole RAID set. Most customers might not need/like this feature;
they can therefore turn it off dynamically at any time for specific arrays
using the RAID MODIFY command.
>. There were tests with and without that patch (or patches), causing changes in
>the behaviour. Right, nothing to do with that - then why? And shall we say to the
Then why WHAT?
>customer that although the problem seems to have disappeared nothing has been
>solved - so don't use it? What's the answer to his next obvious question?
Patches are typically early point fixes for problems. If a system is
severely impacted by such a problem, then the patch should be
installed. Otherwise wait until the fix has been incorporated into the
next release of the product (e.g. OpenVMS) and has hence passed a full
qualification test.
>The customer has decided to use it (after the successful test, as we said it's
>okay). What will happen during the next reboot?
What is expected?
>This is all the more important since he is about to decide whether to leave or keep Digital
>customer services and, what's more, the platform!
Because of the RAID reconstruct issue?
Or have so many little flames been started in the past that it's now a
wildfire the customer thinks we, Digital, cannot extinguish?
Guenther
|
324.9 | | BPSTGA::TORONY | Gábor Tornyossy | Fri Mar 07 1997 06:57 | 34 |
| Guenther,
I'm glad that you joined in. It's unfortunate that for geographical
reasons this cannot be deepened over a mug of beer.
The customer wants this software RAID not to react to a server shutdown/reboot
(not to rebuild the set) and moreover not to cause any corruption in the file system on
it. We can recommend or not recommend using the software RAID we sold. To be more
precise, we have to give them a procedure for how to use it safely while meeting
their expectations.
Therefore let me summarize, to check whether I understood your letter well:
. If you have a cluster (like in .0) and you have to shutdown/reboot one
member (the server of the disks in question) while the other should keep running
(the process doing the I/O to that disk will hang - no problem):
- if you don't want to allow the removal of the first member (even if you have a
spare disk), then switch both drivers (on both involved nodes) not to
recognise events (/NOTIMEOUT) and switch them back again as the disks
become online again (see the command sketch after this list), or
- you may live with the state change in your raidset (use a spare, ...).
. If there is a crash-like event, the removal of the first member will take
place (the actions are the same as in the line above).
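Something like this in DCL is what I mean (just a sketch - array-ID stands for
the actual array name, and 30 seconds is the default timeout mentioned in .4):
$ RAID MODIFY/NOTIMEOUT array-ID    ! on both nodes, before the shutdown/reboot
$ ! ... shutdown/reboot of the disk-serving node ...
$ RAID MODIFY/TIMEOUT=30 array-ID   ! once the member disks are back online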
If this is the case, the behaviour with or without the patch is only of technical
interest: what changes in the SCSI-related driver code make the RAID software
react differently.
Thanks a lot,
Gábor
(True, the last sentence - about the customer's decision to leave
Digital - turned out too theatrical. The story is grotesque, but
Digital's role (local and general) and conscience are quite clear... I just wanted
to underline the importance of the problem.)
|
324.10 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Fri Mar 07 1997 09:25 | 34 |
| Gábor,
>reasons this cannot be deepened over a mug of beer.
Czech mana...ah!
>precise, we have to give them a procedure for how to use it safely while meeting
>their expectations.
I assume /NOTIMEOUT does it. Your summarization is correct. Use
/NOTIMEOUT and no reconstructs happen when a disk server reboots, whether
caused by a crash or by a shutdown/reboot.
>(not to rebuild the set) and moreover not to cause any corruption in the file system on
^^^^^^^^^^^^^^^^^^^^^^^^^
You didn't mention this before. Any more details?
>interest: what changes in the SCSI-related driver code make the RAID software
>react differently.
Nothing I could think of. The RAID driver just fabricates one or more I/O
request packets for the real disk driver (DU/DS/DKdriver) and queues them
to its I/O queue. It's up to the real disk driver to perform the I/O.
The RAID driver just orchestrates that.
>Digital's role (local and general) and conscience are quite clear... I just wanted
>to underline the importance of the problem.)
I want to get this customer back on the road with his computing
equipment...Digital Equipment.
Guenther
|