[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference vmszoo::rms_openvms

Title:RMS asks, 'R U Journaled?'
Moderator:STAR::TSPEERUVEL
Created:Tue Mar 11 1986
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:3031
Total number of notes:12302

3009.0. "Tracking down a RMS$_BUSY problem" by TLE::REAGAN (All of this chaos makes perfect sense) Fri Feb 07 1997 15:25

    I'm trying to help debug a customer problem.  In their program,
    they try to close a file and are getting RMS$_BUSY back.  They
    don't intentionally set FAB$V_ASY or RAB$V_ASY, but I'm adding some
    code to verify that.
    
    When I get back a RMS$_BUSY from a $CLOSE, is there anything I
    can dig out from the FAB/RAB/IFAB/etc. that can tell me what the
    other operation is?  I can't get at the system with ANAL/SYS to
    check that way, but I can instrument my code to print out whatever
    bits might be needed to help figure out who else is doing an operation
    on the file at the same point.
    
    				-John
T.RTitleUserPersonal
Name
DateLines
3009.1STAR::TSPEERMon Feb 10 1997 11:246
Unless RMS has a bug (that's been known ;-), BUSY 
should only be returned from unsynchronized RMS ASY
calls.  Is this application calling RMS from
an AST routine, or from exec or kernel mode?

Tom
3009.2More often seen with C codeWIBBIN::NOYCEPulling weeds, pickin' stonesMon Feb 10 1997 15:213
FAB$V_ASY and RAB$V_ASY are bit 0 in their respective FOP/ROP fields.
About half the time, users who use $V_xxx when they mean $M_xxx hit
ASY by mistake  --  If they include any odd-numbered options...
3009.3Start with FOP and ROP firstSTAR::EWOODSMon Feb 10 1997 18:3919
  John --

  I would suggest before you spend much time instrumenting
  your code, you assume that they have inadvertently set the 
  low-order bit in FAB$L_FOP (for $close should be FAB) and
  first take an easy approach.  You want to get value in 
  FAB$L_FOP and it wouldn't hurt while you're at it to do 
  RAB$L_ROP as well and verify whether bit 0 is set.

  RMS$_BUSY wouldn't be returned if it was someone else 
  "doing an operation on the file at the same point."  It is
  something within this process and the most likely source is
  unintended setting of bit 0 (and as .2 notes, not necessarily
  due to use of the BUSY symbol) or as Tom in .1 suggests
  some AST-level or mode mishandling in their code.

  Do you know what version of OVMS?  Whether VAX or Alpha?

  -- Elinor
3009.4TLE::REAGANAll of this chaos makes perfect senseMon Feb 10 1997 22:4515
    Its OpenVMS VAX V6.1 and V6.2.
    
    They don't set ASY at all (neither does the Pascal RTL). The first
    thing I did was to check their code for $V_ vs $M_ and I didn't see 
    anything. Their code is in Pascal, my code in the RTL is in BLISS.
    
    I've instrumented an RTL for them to dump out the ASY field
    in the FOP to see if it got accidently set somewhere.  I'm also
    checking the value in the FAB$B_BLN field to see if they didn't
    accidently walk on it (which from my limited understanding of RMS
    might give false BUSY signals, right?)
    
    I'm hoping to get something back from them tomorrow.
    
    				-John
3009.5TLE::REAGANAll of this chaos makes perfect senseTue Feb 11 1997 10:5913
    I got data back from the user.  When $CLOSE returned RMS$_BUSY,
    the FAB$V_ASY bit was 0 and the FAB$B_BLN field was 51 (hex).
    That sure looks like a real "busy", right?
    
    I'll check the corresponding RAB$V_ASY (we don't do multi-connect)
    but the BLN of 51(hex) with a ASY of 0 sure looks odd to me.
    
    Can I assume that the set/clear of the "busy" bit is done in
    exec-mode so a user-mode AST can't see a "busy" file that really
    isn't "busy"?
    
    				-John
    
3009.6TLE::REAGANAll of this chaos makes perfect senseTue Feb 11 1997 13:3411
    OK, here's the data I have,
    
    Right after the $CLOSE returned RMS$_BUSY,
    
    FAB$B_BLN = 51 (hex), FAB$V_ASY = 0
    RAB$B_BLN = 44 (hex), RAB$V_ASY = 0
    
    Any idea where to go next?
    
    				-John
    
3009.7next stepDIMOND::EWOODSTue Feb 11 1997 18:18113
  John, I know how able you are so I know that if you were able to get a 
  reproducer, you would have.  So I presume this is not reproducable.
  Has this application ever worked (perhaps on VAX 5.5-2)?  Or is it a
  new application or a major revision of old application?  Do they 
  consistently get the BUSY error whenever run application or 1 out of
  10 times (or what)?   Has anyone tried to see whether they can pull
  out the sequence of operations into a simple program and still reproduce
  the error?  Have they tried to create a debuggable version of application,
  and if so, were they able to step through and still reproduce the problem?
  (This would allow us to use SDA before and after without building in any
  pauses.)   Not asking you to answer all this.  Just making sure other
  avenues have been pursued.  (Feel free to phone me offline on this.)

  First, responses to some of your specific questions; and then, where
  do we go from here.

<<    When I get back a RMS$_BUSY from a $CLOSE, is there anything I
<<    can dig out from the FAB/RAB/IFAB/etc. that can tell me what the
<<    other operation is?  

  No.

<<    Can I assume that the set/clear of the "busy" bit is done in
<<    exec-mode so a user-mode AST can't see a "busy" file that really
<<    isn't "busy"?

  The set/clear of the "busy" bit is done in exec-mode with uninterruptable
  instruction so your assumption is correct.

  But does this mean that application has some AST handlers?
  
<<  I'm also
<<  checking the value in the FAB$B_BLN field to see if they didn't
<<  accidently walk on it (which from my limited understanding of RMS
<<  might give false BUSY signals, right?)

  That is correct.  Right at beginning of $close, it does a call
  that will end up in a FAB check routine in RM0FABCHK.  One of its
  first checks is whether the low-order bit in FAB$B_BLN is set.  If
  it is, then RMS$_BUSY gets returned.  But you cannot assume that it
  is a "false" BUSY -- depends whether BUSY bit in IFAB is also set.
  (This makes me think if this problem reproduces frequently at
  the customer site, you could have him add a LIB$WAIT for 20 seconds
  and then retry the $close.  If still RMS$_BUSY error, then no more
  retries (this would suggest the BUSY bit in BLN indicates a false
  busy state).  If it succeeds on retry, this suggests it was a for-real
  busy state.)

  Hex 44 is exactly size of RAB.  The size of FAB is hex 50 so fact that
  x51 indicates bit 0 got toggled suggests we aren't just looking at
  garbage in this structure.  And FAB is the structure that would be used
  for $close.

  First thing that would be interesting to learn is whether IFAB has its
  BUSY bit set.  Will be very useful to learn that IFB$V_BUSY is set.  If
  it isn't set, we can't tell much because between time $close failed,
  returns to application with RMS$_BUSY and you dump some of IFAB to a log,
  if this FAB is busy because of another operation, it could have
  completed by now and cleared the IFB$V_BUSY bit.  If FAB-BLN
  is x51 "AND" IFB$V_BUSY bit is set, then such a consistency would suggest 
  that FAB was already busy.

  It appears that you have instrumented a RTL for them to use.  If it
  displays BLN when RMS$_BUSY is returned for a $CLOSE, then we are more
  than halfway there.  You want to add the immediate display of the
  first 8 bytes of the IFAB (you indicated you are able to locate it).
  Bit 0 at offset 0 of IFAB is the IFB$V_BUSY bit.  I take it that you will
  be using the IFI in FAB to locate IFB so also display the IFI (if we get
  SDA log, it will be useful).

  Once you have done that, how about calling SYS$HIBER to put process to
  sleep -- if customer is able to monitor process to detect this.  Or if
  the customer will be running interactively and will be able to detect
  BUSY occurrence, then you could do it with a LIB$WAIT for a large
  number of seconds (user could detect output has stopped coming to screen).  
  I am looking for a way to freeze the process at this moment, so customer 
  can execute ANAL/SYSTEM in another process and gather some data on
  this process.

  If this is possible, then I would suggest the customer be instructed to
  get following into log file that can be transmitted to us.  I have included
  much more than we will likely need, but I figure it is best to get more 
  than less (so we don't keep going back to the customer).  I have my
  fingers crossed that this is not an application that has a large number
  of files opened.

      $ ! assumes have pid of process 
      $ !
      $ ANALYZE/SYSTEM
	> SET PROC/INDEX=pid
	> SHOW PROC  ! confirm correct process
	> SET OUTPUT sys$login:dec_busy.log  ! or name of their choice
	> SHOW PROC
	> SHOW PROC/PCB
	> SHOW PROC/RMS=ALL
	> SHOW STACK/ALL
	> SHOW PROC/IMAGE
	> SHOW PROC/CHAN
	> SHOW PROC/LOCKS
	> EXIT
  
  If this were occurring on Alpha (they were migrating an old application from
  VAX), I would be more open to possibility of timing problem unearthed.
  Fact that it is a VAX makes me more inclined to suspect user error.
  Hopefully there will be some traces left in stacks or structures to give
  us some hint.

  I'll be glad to analyze this log if you are able to get it, John.
  Just contact me by mail.

  -- Elinor
  
3009.8TLE::REAGANAll of this chaos makes perfect senseWed Feb 12 1997 09:4131
    Thanks for the lengthy response, I'll read it over a couple of times
    and do some work.
    
    The customer is Innosoft (you know them well I think as do we).
    It is one of their detacted programs that does mail stuff.  I
    can't reproduce it here, Innosoft can't reproduce there, we can
    only reproduce it consistenly on a machine in the middle of Ohio
    somewhere.  I make an RTL send it to Innosoft, then push it to
    Ohio, and the .log files make the reverse trip.  
    
    More detail of the application:  It only fails on VAX, not on Alpha. 
    It is very DECthreads heavy.  This error has been intermittant for
    quite some time, but a customer seems to be finding it dozens of times
    per day hence the increased interest of figuring out the problem.
    
    The application creates/writes/closes a .LOG file during its running. 
    Most of the time they close the file themselves.  On occasion, their
    application does something that gets the RTL's exit-handler running. 
    The RTL's exit handler goes down and tries to close all open files. 
    When it tries to close the .LOG file, it gets this error.  
    
    At first, I was wondering if DECthreads could be switching threads
    in the middle of an RMS operation.  But since DECthreads uses
    a user-mode timer AST to switch threads, it shouldn't be able
    to switch in the middle of an RMS exec-mode operation.
    
    I'll dump out some IFAB bits and see what I can do.
    
    				-John
    
    
3009.9DECthreads and exit handlerSTAR::EWOODSWed Feb 12 1997 14:5755
   The addition of DECthreads to the picture was helpful info.  I just 
   consulted with someone who knows DECthreads -- particularly from VAX 
   perspective.  He says that DECthreads and exit handlers are particularly 
   problematic and very susceptible to user error since there is no 
   straight-line code for either.

   One scenario he could hypothesize for error is:

	1) A DECthread is kicked off to do create or write (both of which 
	   would use FAB) of file getting RMS$_BUSY for $close operation.

	2) DECthreads also sets up a timer for kicking off next one.  

	3) If before the first thread completes, timer goes off and the first 
	   one is stalled in usermode waiting for completion AST, DECthreads 
	   will kick off in user-mode a second one.  

	4) The second thread does some operation -- most likely to a different
   	   file -- but encounters an error -- which triggers the exit-handler.  

	5) The exit-handler tries to close the file associated with first 
	   thread (you said RMS$_BUSY error was coming back to exit-handler) 
	   but first thread operation is still stalled so bit in FAB$B_BLN 
	   still toggled (appropriately) and RMS$_BUSY error is (appropriately)
           returned by RMS when FAB is checked at beginning of $CLOSE 
	   operation.

   So the question is whether the logs for this application would have any
   record of whether just before the $close RMS$_BUSY error returned to
   exit-handler, there was an error returned to a thread.  

   I would think one of my suggestions in .7 could add some support for
   this hypothesis.  If the exit-handler went into a wait (lib$wait), I
   would think this would give the first thread a chance to get 
   rescheduled and complete its operation.  Then when wait ended if 
   exit-handler retried $close operation, it either will succeed (if
   first thread was doing a $create) or fail but with a different error
   than RMS$_BUSY if first thread had been doing a $close -- something that 
   indicates there is no file to close.  But if this system is really
   taxed for memory, etc., you want to put in a decent wait to get anything 
   conclusive.  I would say at least 1 minute.
   
   This actually would dove-tail nicely with only one machine experiencing 
   problem.  It may be that it is low on some quota -- something that is 
   making this system have a higher probability of errors occurring 
   that would increase its chances of breaking through this hole.

   Not to get too far ahead, but if it does turn out this hypothesis is 
   correct, my understanding is that there are ways in user application 
   to safeguard against the collision of DECthreads and exit-handler.
   
   Hope this helps,

   -- Elinor
3009.10"Synchronous" doesn't mean "uninterruptable"WIBBIN::NOYCEPulling weeds, pickin&#039; stonesWed Feb 12 1997 15:466
>   But since DECthreads uses
>   a user-mode timer AST to switch threads, it shouldn't be able
>   to switch in the middle of an RMS exec-mode operation.

But remember that RMS does its waiting in user-mode, specifically
to allow user-mode AST's to get delivered during long waits.
3009.11TLE::REAGANAll of this chaos makes perfect senseFri Feb 14 1997 10:0934
    Thanks for the extra info and ideas.  The fact that RMS waits in
    user-mode with the "busy" bit still set is different than what I
    had imagined.  I can see why, but it does lead to some interesting
    situations where multiple threads are trying to write to the same
    file.  The interesting part is that the code is essentially the
    same on Alpha and the customer hasn't seen anything like this there.
    
    As for cleanly hooking DECthreads, AST-driven programs, and
    exit-handlers, my experience from reading the various notesfiles
    is that you are on your own.  The problem really is that the PASRTL
    has the exit-handler.  It has no idea whether the program is using
    DECthreads.  The scenario in .9 could easily be done without DECthreads
    by having you own timer AST firing at just the right place waiting
    for a $PUT to complete and the AST routine trying to do a $CLOSE.
    
    My biggest question with regards to DECthreads (perhaps I should
    be asking there instead, but I think I have over the years and not
    been satisified with the answers) is what else is going on while
    an exit-handler is running.  My experience is that other threads
    may continue to run while another thread is in the exit-handler.
    At one point, I even convinced myself that multiple copies of
    the exit-handler could be running.  The DECthreads manuals are
    quite vague on these points.
    
    At the point in the exit-handler that I'm doing the close, I have
    disabled ASTs.  If I do a LIB$WAIT, it doesn't matter since DECthreads'
    timer AST won't get to fire and the other thread that is waiting for
    its $PUT to finish won't get to run.  The exit-handler could renable
    ASTs during its call to LIB$WAIT, but in the meantime any 'ol AST could
    happen and do Pascal RTL work upsetting the data structures that the
    exit-handler thinks it has exclusive access to.
    
    				-John
    
3009.12WIBBIN::NOYCEPulling weeds, pickin&#039; stonesFri Feb 14 1997 11:144
John, when you disable user-mode AST's, that doesn't prohibit
RMS's exec-mode AST from getting delivered.  RMS will still get
its chance to clear the flag in the IFAB (assuming it does that
in exec mode).
3009.13TLE::REAGANAll of this chaos makes perfect senseMon Feb 17 1997 15:135
    Sure, but who (and in which mode) clears the bottom bit in the
    FAB$B_BLN field?  RMS' exec-mode AST or the user-mode code that
    continues after the $PUT waits for the AST to be delivered.
    
    				-John
3009.14PASRTL exit handler solutionSTAR::EWOODSTue Feb 18 1997 13:5143
    
  I sent a mail message to John to following effect:
----------------------------------------------------------------------------

  I was wrong in thinking a LIB$WAIT would help you.  The thread stalled
  in user mode has to get back to WAIT_IO_DONE and it is never going to
  from an exit handler.

  Consulted with a guru, who feels there is a "right thing" for the PASRTL 
  exit handler to do:

    1.  In your exit handler do not use the user's FAB structures.  
	For your $close use a new FAB with the IFI in the user's FAB 
	structure copied in.  Now it is impossible for the FAB$B_BLN 
	bit 0 to be set.

    2.  If the IFAB's busy bit is set, the $close will return the
	error:  RMS$_ACT (file activity precludes operation).  

    3.  If this error is returned, you can use RMS service SYS$WAIT with
        your new FAB.   When wait completes, you reissue your $close.

    Note:  If your exit handler does anything with user's RAB structures,
    it should follow same practice.

  Bit 0 in FAB$B_BLN is set when exiting from RMS exec-level code 
  when stall is required.  There is a module in VAX SYS facility that
  handles the WAIT_IO_DONE (system_service_exit).  The bit is set in 
  caller's mode, and when a status has been filled into FAB structure itself, 
  the bit is cleared in caller's mode.  RMS has an AST on the $QIO operation 
  itself that it is waiting to complete.  This exec-mode AST can interrupt 
  caller mode code (assuming called from user or supervisor mode) at anytime 
  and complete any RMS processing remaining for the operation after $QIO 
  itself completes.  In the normal state of affairs it would eventually get 
  back to the user-mode code in the system-service exit path and clear bit 0 
  in the FAB$B_BLN.  But from an exit-handler it will never return.

  I would think the change suggested above, wouldn't involve too much
  recoding in your PASRTL exit handler.  Then you can test it on the VAX 
  machine, which is consistently getting RMS$_BUSY from their DECthread's 
  application.  

  -- Elinor  
3009.15TLE::REAGANAll of this chaos makes perfect senseMon Feb 24 1997 13:3116
    Just to follow up on the final resolution of this.  
    
    The application creates multiple detached threads.  Their main program
    does not use pthread_join to wait for these detached threads but just
    uses a condition variable to tell the main program that it is done. The
    main program then returns to the system where the RTL's exit handler is
    called.  However, the user's thread wasn't quite finished.  They forgot
    about the thread-specific destructor.  In that destructor (running in
    parallel with the main routine which is trying to return back to the
    system), they explicitly close a file.  That destructor didn't quite
    finish before the main program returned and the exit-handler then tried
    to close the same file and got a busy.
    
    The customer has modified their code to remove the race condition.
    
    				-John