T.R | Title | User | Personal Name | Date | Lines |
---|
3009.1 | | STAR::TSPEER | | Mon Feb 10 1997 11:24 | 6 |
| Unless RMS has a bug (that's been known ;-), BUSY
should only be returned from unsynchronized RMS ASY
calls. Is this application calling RMS from
an AST routine, or from exec or kernel mode?
Tom
|
3009.2 | More often seen with C code | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Mon Feb 10 1997 15:21 | 3 |
| FAB$V_ASY and RAB$V_ASY are bit 0 in their respective FOP/ROP fields.
About half the time, users who use $V_xxx when they mean $M_xxx hit
ASY by mistake -- If they include any odd-numbered options...
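
A minimal sketch of the mix-up described in this note, written in Python
rather than C so it can be run anywhere. Only "ASY is bit 0" comes from the
note; the two other option bit numbers are hypothetical, chosen to show one
odd-numbered and one even-numbered option.

```python
# Only V_ASY = 0 is taken from the note; V_OPT_A and V_OPT_B are hypothetical.
V_ASY = 0      # bit position of ASY in the FOP/ROP field
V_OPT_A = 3    # hypothetical odd-numbered option
V_OPT_B = 6    # hypothetical even-numbered option


def mask(bit):
    """Turn a bit position ($V_ style) into a mask ($M_ style)."""
    return 1 << bit


def asy_set(options_field):
    """True if bit 0 (ASY) is set in an options field."""
    return bool(options_field & mask(V_ASY))


# Intended: OR the masks together.
correct = mask(V_OPT_A) | mask(V_OPT_B)
# Mistake: OR the bit positions together as if they were masks.
wrong = V_OPT_A | V_OPT_B

print(asy_set(correct))  # False
print(asy_set(wrong))    # True
```

Any odd bit number has its own low-order bit set, so ORing it in as if it
were a mask turns on bit 0, which is ASY.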
|
3009.3 | Start with FOP and ROP first | STAR::EWOODS | | Mon Feb 10 1997 18:39 | 19 |
| John --
I would suggest that before you spend much time instrumenting
your code, you assume that they have inadvertently set the
low-order bit in FAB$L_FOP (for $close it should be the FAB) and
first take an easy approach. You want to get the value in
FAB$L_FOP, and it wouldn't hurt while you're at it to do
RAB$L_ROP as well, and verify whether bit 0 is set.
RMS$_BUSY wouldn't be returned if it was someone else
"doing an operation on the file at the same point." It is
something within this process and the most likely source is
unintended setting of bit 0 (and as .2 notes, not necessarily
due to use of the ASY symbol) or, as Tom in .1 suggests,
some AST-level or mode mishandling in their code.
Do you know what version of OVMS? Whether VAX or Alpha?
-- Elinor
|
3009.4 | | TLE::REAGAN | All of this chaos makes perfect sense | Mon Feb 10 1997 22:45 | 15 |
| It's OpenVMS VAX V6.1 and V6.2.
They don't set ASY at all (neither does the Pascal RTL). The first
thing I did was to check their code for $V_ vs $M_ and I didn't see
anything. Their code is in Pascal, my code in the RTL is in BLISS.
I've instrumented an RTL for them to dump out the ASY field
in the FOP to see if it got accidentally set somewhere. I'm also
checking the value in the FAB$B_BLN field to see if they didn't
accidentally walk on it (which from my limited understanding of RMS
might give false BUSY signals, right?)
I'm hoping to get something back from them tomorrow.
-John
|
3009.5 | | TLE::REAGAN | All of this chaos makes perfect sense | Tue Feb 11 1997 10:59 | 13 |
| I got data back from the user. When $CLOSE returned RMS$_BUSY,
the FAB$V_ASY bit was 0 and the FAB$B_BLN field was 51 (hex).
That sure looks like a real "busy", right?
I'll check the corresponding RAB$V_ASY (we don't do multi-connect)
but the BLN of 51 (hex) with an ASY of 0 sure looks odd to me.
Can I assume that the set/clear of the "busy" bit is done in
exec-mode so a user-mode AST can't see a "busy" file that really
isn't "busy"?
-John
|
3009.6 | | TLE::REAGAN | All of this chaos makes perfect sense | Tue Feb 11 1997 13:34 | 11 |
| OK, here's the data I have,
Right after the $CLOSE returned RMS$_BUSY,
FAB$B_BLN = 51 (hex), FAB$V_ASY = 0
RAB$B_BLN = 44 (hex), RAB$V_ASY = 0
Any idea where to go next?
-John
|
3009.7 | next step | DIMOND::EWOODS | | Tue Feb 11 1997 18:18 | 113 |
|
John, I know how able you are so I know that if you were able to get a
reproducer, you would have. So I presume this is not reproducible.
Has this application ever worked (perhaps on VAX 5.5-2)? Or is it a
new application or a major revision of an old application? Do they
get the BUSY error every time they run the application, or 1 out of
10 times (or what)? Has anyone tried to see whether they can pull
out the sequence of operations into a simple program and still reproduce
the error? Have they tried to create a debuggable version of the application,
and if so, were they able to step through and still reproduce the problem?
(This would allow us to use SDA before and after without building in any
pauses.) Not asking you to answer all this. Just making sure other
avenues have been pursued. (Feel free to phone me offline on this.)
First, responses to some of your specific questions; and then, where
do we go from here.
<< When I get back a RMS$_BUSY from a $CLOSE, is there anything I
<< can dig out from the FAB/RAB/IFAB/etc. that can tell me what the
<< other operation is?
No.
<< Can I assume that the set/clear of the "busy" bit is done in
<< exec-mode so a user-mode AST can't see a "busy" file that really
<< isn't "busy"?
The set/clear of the "busy" bit is done in exec-mode with an uninterruptible
instruction, so your assumption is correct.
But does this mean that application has some AST handlers?
<< I'm also
<< checking the value in the FAB$B_BLN field to see if they didn't
<< accidentally walk on it (which from my limited understanding of RMS
<< might give false BUSY signals, right?)
That is correct. Right at beginning of $close, it does a call
that will end up in a FAB check routine in RM0FABCHK. One of its
first checks is whether the low-order bit in FAB$B_BLN is set. If
it is, then RMS$_BUSY gets returned. But you cannot assume that it
is a "false" BUSY -- depends whether BUSY bit in IFAB is also set.
(This makes me think if this problem reproduces frequently at
the customer site, you could have him add a LIB$WAIT for 20 seconds
and then retry the $close. If still RMS$_BUSY error, then no more
retries (this would suggest the BUSY bit in BLN indicates a false
busy state). If it succeeds on retry, this suggests it was a for-real
busy state.)
Hex 44 is exactly the size of a RAB. The size of a FAB is hex 50, so the
fact that we see x51 (hex 50 with bit 0 toggled) suggests we aren't just
looking at garbage in this structure. And the FAB is the structure that
would be used for $close.
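
The arithmetic behind that reading of the dumps can be sketched as follows.
This is an illustrative Python helper, not RMS code: decode_bln is a
hypothetical name, and the only values used are the two sizes and the two
dumped bytes quoted in this thread.

```python
FAB_SIZE = 0x50   # FAB length quoted in this note
RAB_SIZE = 0x44   # RAB length quoted in this note


def decode_bln(bln):
    """Split a dumped block-length byte into (length, low_bit_set)."""
    return bln & ~1 & 0xFF, bool(bln & 1)


# Values reported in .6: FAB$B_BLN = 51 (hex), RAB$B_BLN = 44 (hex).
for name, bln, expected in (("FAB", 0x51, FAB_SIZE), ("RAB", 0x44, RAB_SIZE)):
    length, low_bit = decode_bln(bln)
    print(name, hex(length), "bit 0 set:", low_bit, "size matches:", length == expected)
```

With bit 0 masked off, both dumped values match the expected structure
sizes, which is what makes the x51 look like a deliberate flag rather than
corruption.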
The first thing that would be interesting to learn is whether the IFAB has
its BUSY bit set. It will be very useful if we learn that IFB$V_BUSY is set.
If it isn't set, we can't tell much: between the time $close fails and
returns to the application with RMS$_BUSY and the time you dump some of the
IFAB to a log, another operation that had this FAB busy could have completed
and cleared the IFB$V_BUSY bit. If FAB$B_BLN is x51 "AND" the IFB$V_BUSY bit
is set, then that consistency would suggest that the FAB was already busy.
It appears that you have instrumented a RTL for them to use. If it
displays BLN when RMS$_BUSY is returned for a $CLOSE, then we are more
than halfway there. You want to add the immediate display of the
first 8 bytes of the IFAB (you indicated you are able to locate it).
Bit 0 at offset 0 of IFAB is the IFB$V_BUSY bit. I take it that you will
be using the IFI in FAB to locate IFB so also display the IFI (if we get
SDA log, it will be useful).
Once you have done that, how about calling SYS$HIBER to put process to
sleep -- if customer is able to monitor process to detect this. Or if
the customer will be running interactively and will be able to detect
BUSY occurrence, then you could do it with a LIB$WAIT for a large
number of seconds (user could detect output has stopped coming to screen).
I am looking for a way to freeze the process at this moment, so customer
can execute ANAL/SYSTEM in another process and gather some data on
this process.
If this is possible, then I would suggest the customer be instructed to
get following into log file that can be transmitted to us. I have included
much more than we will likely need, but I figure it is best to get more
than less (so we don't keep going back to the customer). I have my
fingers crossed that this is not an application that has a large number
of files opened.
$ ! assumes have pid of process
$ !
$ ANALYZE/SYSTEM
> SET PROC/INDEX=pid
> SHOW PROC ! confirm correct process
> SET OUTPUT sys$login:dec_busy.log ! or name of their choice
> SHOW PROC
> SHOW PROC/PCB
> SHOW PROC/RMS=ALL
> SHOW STACK/ALL
> SHOW PROC/IMAGE
> SHOW PROC/CHAN
> SHOW PROC/LOCKS
> EXIT
If this were occurring on Alpha (say they were migrating an old application
from VAX), I would be more open to the possibility of a timing problem being
unearthed. The fact that it is a VAX makes me more inclined to suspect user
error.
Hopefully there will be some traces left in stacks or structures to give
us some hint.
I'll be glad to analyze this log if you are able to get it, John.
Just contact me by mail.
-- Elinor
|
3009.8 | | TLE::REAGAN | All of this chaos makes perfect sense | Wed Feb 12 1997 09:41 | 31 |
| Thanks for the lengthy response, I'll read it over a couple of times
and do some work.
The customer is Innosoft (you know them well I think as do we).
It is one of their detached programs that does mail stuff. I
can't reproduce it here, Innosoft can't reproduce it there, we can
only reproduce it consistently on a machine in the middle of Ohio
somewhere. I make an RTL, send it to Innosoft, then they push it to
Ohio, and the .log files make the reverse trip.
More detail of the application: It only fails on VAX, not on Alpha.
It is very DECthreads heavy. This error has been intermittent for
quite some time, but a customer seems to be finding it dozens of times
per day, hence the increased interest in figuring out the problem.
The application creates/writes/closes a .LOG file during its running.
Most of the time they close the file themselves. On occasion, their
application does something that gets the RTL's exit-handler running.
The RTL's exit handler goes down and tries to close all open files.
When it tries to close the .LOG file, it gets this error.
At first, I was wondering if DECthreads could be switching threads
in the middle of an RMS operation. But since DECthreads uses
a user-mode timer AST to switch threads, it shouldn't be able
to switch in the middle of an RMS exec-mode operation.
I'll dump out some IFAB bits and see what I can do.
-John
|
3009.9 | DECthreads and exit handler | STAR::EWOODS | | Wed Feb 12 1997 14:57 | 55 |
|
The addition of DECthreads to the picture was helpful info. I just
consulted with someone who knows DECthreads -- particularly from VAX
perspective. He says that DECthreads and exit handlers are particularly
problematic and very susceptible to user error since there is no
straight-line code for either.
One scenario he could hypothesize for error is:
1) A DECthread is kicked off to do a create or write (both of which
would use FAB) on the file that later gets RMS$_BUSY for the $close
operation.
2) DECthreads also sets up a timer for kicking off next one.
3) If before the first thread completes, timer goes off and the first
one is stalled in usermode waiting for completion AST, DECthreads
will kick off in user-mode a second one.
4) The second thread does some operation -- most likely to a different
file -- but encounters an error -- which triggers the exit-handler.
5) The exit-handler tries to close the file associated with first
thread (you said RMS$_BUSY error was coming back to exit-handler)
but first thread operation is still stalled so bit in FAB$B_BLN
still toggled (appropriately) and RMS$_BUSY error is (appropriately)
returned by RMS when FAB is checked at beginning of $CLOSE
operation.
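
The hypothesized sequence can be walked through with a toy model. This is
only a sketch in Python: nothing here calls RMS or DECthreads, the class and
function names are hypothetical, and the single modelled behavior (bit 0 of
the block-length byte marks a FAB with an operation in progress, and $CLOSE
checks it first) is the one described in .7.

```python
# Toy model of steps 1-5; all names are hypothetical and nothing calls RMS.
FAB_LEN = 0x50                      # FAB size quoted in .7


class ToyFab:
    def __init__(self):
        self.bln = FAB_LEN          # bit 0 doubles as the "in progress" flag


def begin_stalled_op(fab):
    fab.bln |= 1                    # steps 1 and 3: operation started, thread stalls


def complete_op(fab):
    fab.bln &= ~1                   # the stalled thread got to run to completion


def toy_close(fab):
    # Models the FAB check at the start of $CLOSE described in .7.
    return "RMS$_BUSY" if fab.bln & 1 else "RMS$_NORMAL"


log_fab = ToyFab()
begin_stalled_op(log_fab)                     # thread 1 is stalled mid-operation
status_in_exit_handler = toy_close(log_fab)   # step 5: exit handler tries to close
complete_op(log_fab)                          # only if thread 1 is ever resumed
status_after_completion = toy_close(log_fab)

print(status_in_exit_handler, status_after_completion)
```

In the model the close only stops returning RMS$_BUSY once the stalled
operation is allowed to finish, which is the condition the hypothesis turns
on.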
So the question is whether the logs for this application would have any
record of whether just before the $close RMS$_BUSY error returned to
exit-handler, there was an error returned to a thread.
I would think one of my suggestions in .7 could add some support for
this hypothesis. If the exit-handler went into a wait (lib$wait), I
would think this would give the first thread a chance to get
rescheduled and complete its operation. Then when wait ended if
exit-handler retried $close operation, it either will succeed (if
first thread was doing a $create) or fail but with a different error
than RMS$_BUSY if first thread had been doing a $close -- something that
indicates there is no file to close. But if this system is really
taxed for memory, etc., you want to put in a decent wait to get anything
conclusive. I would say at least 1 minute.
This actually would dove-tail nicely with only one machine experiencing
problem. It may be that it is low on some quota -- something that is
making this system have a higher probability of errors occurring
that would increase its chances of breaking through this hole.
Not to get too far ahead, but if it does turn out this hypothesis is
correct, my understanding is that there are ways in user application
to safeguard against the collision of DECthreads and exit-handler.
Hope this helps,
-- Elinor
|
3009.10 | "Synchronous" doesn't mean "uninterruptable" | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Wed Feb 12 1997 15:46 | 6 |
| > But since DECthreads uses
> a user-mode timer AST to switch threads, it shouldn't be able
> to switch in the middle of an RMS exec-mode operation.
But remember that RMS does its waiting in user-mode, specifically
to allow user-mode AST's to get delivered during long waits.
|
3009.11 | | TLE::REAGAN | All of this chaos makes perfect sense | Fri Feb 14 1997 10:09 | 34 |
| Thanks for the extra info and ideas. The fact that RMS waits in
user-mode with the "busy" bit still set is different than what I
had imagined. I can see why, but it does lead to some interesting
situations where multiple threads are trying to write to the same
file. The interesting part is that the code is essentially the
same on Alpha and the customer hasn't seen anything like this there.
As for cleanly hooking DECthreads, AST-driven programs, and
exit-handlers, my experience from reading the various notesfiles
is that you are on your own. The problem really is that the PASRTL
has the exit-handler. It has no idea whether the program is using
DECthreads. The scenario in .9 could easily be done without DECthreads
by having your own timer AST firing at just the right place waiting
for a $PUT to complete and the AST routine trying to do a $CLOSE.
My biggest question with regard to DECthreads (perhaps I should
be asking there instead, but I think I have over the years and not
been satisfied with the answers) is what else is going on while
an exit-handler is running. My experience is that other threads
may continue to run while another thread is in the exit-handler.
At one point, I even convinced myself that multiple copies of
the exit-handler could be running. The DECthreads manuals are
quite vague on these points.
At the point in the exit-handler that I'm doing the close, I have
disabled ASTs. If I do a LIB$WAIT, it doesn't matter since DECthreads'
timer AST won't get to fire and the other thread that is waiting for
its $PUT to finish won't get to run. The exit-handler could reenable
ASTs during its call to LIB$WAIT, but in the meantime any ol' AST could
happen and do Pascal RTL work, upsetting the data structures that the
exit-handler thinks it has exclusive access to.
-John
|
3009.12 | | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Fri Feb 14 1997 11:14 | 4 |
| John, when you disable user-mode AST's, that doesn't prohibit
RMS's exec-mode AST from getting delivered. RMS will still get
its chance to clear the flag in the IFAB (assuming it does that
in exec mode).
|
3009.13 | | TLE::REAGAN | All of this chaos makes perfect sense | Mon Feb 17 1997 15:13 | 5 |
| Sure, but who (and in which mode) clears the bottom bit in the
FAB$B_BLN field? RMS' exec-mode AST, or the user-mode code that
continues after the $PUT waits for the AST to be delivered?
-John
|
3009.14 | PASRTL exit handler solution | STAR::EWOODS | | Tue Feb 18 1997 13:51 | 43 |
|
I sent a mail message to John to the following effect:
----------------------------------------------------------------------------
I was wrong in thinking a LIB$WAIT would help you. The thread stalled
in user mode has to get back to WAIT_IO_DONE, and it is never going to
get there from an exit handler.
Consulted with a guru, who feels there is a "right thing" for the PASRTL
exit handler to do:
1. In your exit handler do not use the user's FAB structures.
For your $close use a new FAB with the IFI in the user's FAB
structure copied in. Now it is impossible for the FAB$B_BLN
bit 0 to be set.
2. If the IFAB's busy bit is set, the $close will return the
error: RMS$_ACT (file activity precludes operation).
3. If this error is returned, you can use RMS service SYS$WAIT with
your new FAB. When wait completes, you reissue your $close.
Note: If your exit handler does anything with user's RAB structures,
it should follow same practice.
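
The three steps can be sketched as a toy model. This is Python, not BLISS,
and nothing here calls RMS: ToyFab, ToyIfab, toy_close and toy_wait are
hypothetical stand-ins that encode only what this note states (a fresh FAB
cannot have bit 0 of its block length set, a busy IFAB yields RMS$_ACT, and
a wait on the new FAB lets the close be reissued).

```python
# Toy stand-ins only; the names and the dictionary of IFABs are hypothetical.
class ToyIfab:
    def __init__(self, busy):
        self.busy = busy            # models IFB$V_BUSY


class ToyFab:
    def __init__(self, ifi, bln=0x50):
        self.ifi = ifi              # internal file identifier
        self.bln = bln              # bit 0 set means "operation in progress"


def toy_close(fab, ifabs):
    if fab.bln & 1:
        return "RMS$_BUSY"          # FAB check at the start of $CLOSE
    if ifabs[fab.ifi].busy:
        return "RMS$_ACT"           # file activity precludes operation
    del ifabs[fab.ifi]
    return "RMS$_NORMAL"


def toy_wait(fab, ifabs):
    # Models the wait returning once the outstanding operation has completed.
    ifabs[fab.ifi].busy = False


def exit_handler_close(user_fab, ifabs):
    private = ToyFab(ifi=user_fab.ifi)        # step 1: new FAB, IFI copied in
    status = toy_close(private, ifabs)
    if status == "RMS$_ACT":                  # step 2
        toy_wait(private, ifabs)              # step 3: wait, then reissue
        status = toy_close(private, ifabs)
    return status


ifabs = {7: ToyIfab(busy=True)}
user_fab = ToyFab(ifi=7, bln=0x51)            # state seen at the customer site
direct_status = toy_close(user_fab, ifabs)
handler_status = exit_handler_close(user_fab, ifabs)
print(direct_status, handler_status)
```

The design point the model shows is that the exit handler never reads the
user's FAB$B_BLN at all, so a bit left set by a thread that will never
resume cannot block the close.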
Bit 0 in FAB$B_BLN is set when exiting from RMS exec-level code
when stall is required. There is a module in VAX SYS facility that
handles the WAIT_IO_DONE (system_service_exit). The bit is set in
caller's mode, and when a status has been filled into FAB structure itself,
the bit is cleared in caller's mode. RMS has an AST on the $QIO operation
itself that it is waiting to complete. This exec-mode AST can interrupt
caller mode code (assuming called from user or supervisor mode) at anytime
and complete any RMS processing remaining for the operation after $QIO
itself completes. In the normal state of affairs it would eventually get
back to the user-mode code in the system-service exit path and clear bit 0
in the FAB$B_BLN. But from an exit-handler it will never return.
I would think the change suggested above wouldn't involve too much
recoding in your PASRTL exit handler. Then you can test it on the VAX
machine, which is consistently getting RMS$_BUSY from their DECthreads
application.
-- Elinor
|
3009.15 | | TLE::REAGAN | All of this chaos makes perfect sense | Mon Feb 24 1997 13:31 | 16 |
| Just to follow up on the final resolution of this.
The application creates multiple detached threads. Their main program
does not use pthread_join to wait for these detached threads but just
uses a condition variable to tell the main program that it is done. The
main program then returns to the system where the RTL's exit handler is
called. However, the user's thread wasn't quite finished. They forgot
about the thread-specific destructor. In that destructor (running in
parallel with the main routine which is trying to return back to the
system), they explicitly close a file. That destructor didn't quite
finish before the main program returned and the exit-handler then tried
to close the same file and got a busy.
The customer has modified their code to remove the race condition.
-John
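
The shape of that race can be illustrated with Python's threading module.
This is an analogy rather than the customer's code: Python threads are not
DECthreads, the Event named release is a hook added only to make the timing
deterministic, and the final join is just one way to close the window in
this sketch (the note does not say what the customer's actual fix was).

```python
import threading

log = []                       # stands in for the state of the shared .LOG file
done = threading.Event()       # the "I'm done" signal sent to the main program
release = threading.Event()    # hook that holds the worker before its cleanup


def worker():
    log.append("work finished")
    done.set()                 # main is told the thread is done...
    release.wait()             # ...but the cleanup below has not run yet
    log.append("destructor closed file")


t = threading.Thread(target=worker)
t.start()

done.wait()
seen_at_signal = list(log)     # what an exit handler would find if main left now
release.set()
t.join()                       # waiting for the thread itself closes the window
seen_after_join = list(log)

print(seen_at_signal)
print(seen_after_join)
```

The "done" signal only says the work is finished; it says nothing about
cleanup that runs afterward, which is the gap the exit handler fell into.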
|