[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference vmszoo::rms_openvms

Title:	RMS asks, 'R U Journaled?'

Moderator:	STAR::TSPEERUVEL

Created:	Tue Mar 11 1986
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	3031
Total number of notes:	12302

3009.0. "Tracking down a RMS$_BUSY problem" by TLE::REAGAN (All of this chaos makes perfect sense) Fri Feb 07 1997 15:25

    I'm trying to help debug a customer problem.  In their program,
    they try to close a file and are getting RMS$_BUSY back.  They
    don't intentionally set FAB$V_ASY or RAB$V_ASY, but I'm adding some
    code to verify that.
    
    When I get back a RMS$_BUSY from a $CLOSE, is there anything I
    can dig out from the FAB/RAB/IFAB/etc. that can tell me what the
    other operation is?  I can't get at the system with ANAL/SYS to
    check that way, but I can instrument my code to print out whatever
    bits might be needed to help figure out who else is doing an operation
    on the file at the same point.
    
    				-John

T.R	Title	User	Personal Name	Date	Lines
3009.1		STAR::TSPEER		`Mon Feb 10 1997 11:24`	6
	Unless RMS has a bug (that's been known ;-), BUSY should only be returned from unsynchronized RMS ASY calls. Is this application calling RMS from an AST routine, or from exec or kernel mode? Tom
3009.2	More often seen with C code	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Mon Feb 10 1997 15:21`	3
	FAB$V_ASY and RAB$V_ASY are bit 0 in their respective FOP/ROP fields. About half the time, users who use $V_xxx when they mean $M_xxx hit ASY by mistake -- If they include any odd-numbered options...
3009.3	Start with FOP and ROP first	STAR::EWOODS		`Mon Feb 10 1997 18:39`	19
	John -- I would suggest before you spend much time instrumenting your code, you assume that they have inadvertently set the low-order bit in FAB$L_FOP (for $close should be FAB) and first take an easy approach. You want to get value in FAB$L_FOP and it wouldn't hurt while you're at it to do RAB$L_ROP as well and verify whether bit 0 is set. RMS$_BUSY wouldn't be returned if it was someone else "doing an operation on the file at the same point." It is something within this process and the most likely source is unintended setting of bit 0 (and as .2 notes, not necessarily due to use of the BUSY symbol) or as Tom in .1 suggests some AST-level or mode mishandling in their code. Do you know what version of OVMS? Whether VAX or Alpha? -- Elinor
3009.4		TLE::REAGAN	All of this chaos makes perfect sense	`Mon Feb 10 1997 22:45`	15
	Its OpenVMS VAX V6.1 and V6.2. They don't set ASY at all (neither does the Pascal RTL). The first thing I did was to check their code for $V_ vs $M_ and I didn't see anything. Their code is in Pascal, my code in the RTL is in BLISS. I've instrumented an RTL for them to dump out the ASY field in the FOP to see if it got accidently set somewhere. I'm also checking the value in the FAB$B_BLN field to see if they didn't accidently walk on it (which from my limited understanding of RMS might give false BUSY signals, right?) I'm hoping to get something back from them tomorrow. -John
3009.5		TLE::REAGAN	All of this chaos makes perfect sense	`Tue Feb 11 1997 10:59`	13
	I got data back from the user. When $CLOSE returned RMS$_BUSY, the FAB$V_ASY bit was 0 and the FAB$B_BLN field was 51 (hex). That sure looks like a real "busy", right? I'll check the corresponding RAB$V_ASY (we don't do multi-connect) but the BLN of 51(hex) with a ASY of 0 sure looks odd to me. Can I assume that the set/clear of the "busy" bit is done in exec-mode so a user-mode AST can't see a "busy" file that really isn't "busy"? -John
3009.6		TLE::REAGAN	All of this chaos makes perfect sense	`Tue Feb 11 1997 13:34`	11
	OK, here's the data I have, Right after the $CLOSE returned RMS$_BUSY, FAB$B_BLN = 51 (hex), FAB$V_ASY = 0 RAB$B_BLN = 44 (hex), RAB$V_ASY = 0 Any idea where to go next? -John
3009.7	next step	DIMOND::EWOODS		`Tue Feb 11 1997 18:18`	113
	John, I know how able you are so I know that if you were able to get a reproducer, you would have. So I presume this is not reproducable. Has this application ever worked (perhaps on VAX 5.5-2)? Or is it a new application or a major revision of old application? Do they consistently get the BUSY error whenever run application or 1 out of 10 times (or what)? Has anyone tried to see whether they can pull out the sequence of operations into a simple program and still reproduce the error? Have they tried to create a debuggable version of application, and if so, were they able to step through and still reproduce the problem? (This would allow us to use SDA before and after without building in any pauses.) Not asking you to answer all this. Just making sure other avenues have been pursued. (Feel free to phone me offline on this.) First, responses to some of your specific questions; and then, where do we go from here. << When I get back a RMS$_BUSY from a $CLOSE, is there anything I << can dig out from the FAB/RAB/IFAB/etc. that can tell me what the << other operation is? No. << Can I assume that the set/clear of the "busy" bit is done in << exec-mode so a user-mode AST can't see a "busy" file that really << isn't "busy"? The set/clear of the "busy" bit is done in exec-mode with uninterruptable instruction so your assumption is correct. But does this mean that application has some AST handlers? << I'm also << checking the value in the FAB$B_BLN field to see if they didn't << accidently walk on it (which from my limited understanding of RMS << might give false BUSY signals, right?) That is correct. Right at beginning of $close, it does a call that will end up in a FAB check routine in RM0FABCHK. One of its first checks is whether the low-order bit in FAB$B_BLN is set. If it is, then RMS$_BUSY gets returned. But you cannot assume that it is a "false" BUSY -- depends whether BUSY bit in IFAB is also set. (This makes me think if this problem reproduces frequently at the customer site, you could have him add a LIB$WAIT for 20 seconds and then retry the $close. If still RMS$_BUSY error, then no more retries (this would suggest the BUSY bit in BLN indicates a false busy state). If it succeeds on retry, this suggests it was a for-real busy state.) Hex 44 is exactly size of RAB. The size of FAB is hex 50 so fact that x51 indicates bit 0 got toggled suggests we aren't just looking at garbage in this structure. And FAB is the structure that would be used for $close. First thing that would be interesting to learn is whether IFAB has its BUSY bit set. Will be very useful to learn that IFB$V_BUSY is set. If it isn't set, we can't tell much because between time $close failed, returns to application with RMS$_BUSY and you dump some of IFAB to a log, if this FAB is busy because of another operation, it could have completed by now and cleared the IFB$V_BUSY bit. If FAB-BLN is x51 "AND" IFB$V_BUSY bit is set, then such a consistency would suggest that FAB was already busy. It appears that you have instrumented a RTL for them to use. If it displays BLN when RMS$_BUSY is returned for a $CLOSE, then we are more than halfway there. You want to add the immediate display of the first 8 bytes of the IFAB (you indicated you are able to locate it). Bit 0 at offset 0 of IFAB is the IFB$V_BUSY bit. I take it that you will be using the IFI in FAB to locate IFB so also display the IFI (if we get SDA log, it will be useful). Once you have done that, how about calling SYS$HIBER to put process to sleep -- if customer is able to monitor process to detect this. Or if the customer will be running interactively and will be able to detect BUSY occurrence, then you could do it with a LIB$WAIT for a large number of seconds (user could detect output has stopped coming to screen). I am looking for a way to freeze the process at this moment, so customer can execute ANAL/SYSTEM in another process and gather some data on this process. If this is possible, then I would suggest the customer be instructed to get following into log file that can be transmitted to us. I have included much more than we will likely need, but I figure it is best to get more than less (so we don't keep going back to the customer). I have my fingers crossed that this is not an application that has a large number of files opened. $ ! assumes have pid of process $ ! $ ANALYZE/SYSTEM > SET PROC/INDEX=pid > SHOW PROC ! confirm correct process > SET OUTPUT sys$login:dec_busy.log ! or name of their choice > SHOW PROC > SHOW PROC/PCB > SHOW PROC/RMS=ALL > SHOW STACK/ALL > SHOW PROC/IMAGE > SHOW PROC/CHAN > SHOW PROC/LOCKS > EXIT If this were occurring on Alpha (they were migrating an old application from VAX), I would be more open to possibility of timing problem unearthed. Fact that it is a VAX makes me more inclined to suspect user error. Hopefully there will be some traces left in stacks or structures to give us some hint. I'll be glad to analyze this log if you are able to get it, John. Just contact me by mail. -- Elinor
3009.8		TLE::REAGAN	All of this chaos makes perfect sense	`Wed Feb 12 1997 09:41`	31
	Thanks for the lengthy response, I'll read it over a couple of times and do some work. The customer is Innosoft (you know them well I think as do we). It is one of their detacted programs that does mail stuff. I can't reproduce it here, Innosoft can't reproduce there, we can only reproduce it consistenly on a machine in the middle of Ohio somewhere. I make an RTL send it to Innosoft, then push it to Ohio, and the .log files make the reverse trip. More detail of the application: It only fails on VAX, not on Alpha. It is very DECthreads heavy. This error has been intermittant for quite some time, but a customer seems to be finding it dozens of times per day hence the increased interest of figuring out the problem. The application creates/writes/closes a .LOG file during its running. Most of the time they close the file themselves. On occasion, their application does something that gets the RTL's exit-handler running. The RTL's exit handler goes down and tries to close all open files. When it tries to close the .LOG file, it gets this error. At first, I was wondering if DECthreads could be switching threads in the middle of an RMS operation. But since DECthreads uses a user-mode timer AST to switch threads, it shouldn't be able to switch in the middle of an RMS exec-mode operation. I'll dump out some IFAB bits and see what I can do. -John
3009.9	DECthreads and exit handler	STAR::EWOODS		`Wed Feb 12 1997 14:57`	55
	The addition of DECthreads to the picture was helpful info. I just consulted with someone who knows DECthreads -- particularly from VAX perspective. He says that DECthreads and exit handlers are particularly problematic and very susceptible to user error since there is no straight-line code for either. One scenario he could hypothesize for error is: 1) A DECthread is kicked off to do create or write (both of which would use FAB) of file getting RMS$_BUSY for $close operation. 2) DECthreads also sets up a timer for kicking off next one. 3) If before the first thread completes, timer goes off and the first one is stalled in usermode waiting for completion AST, DECthreads will kick off in user-mode a second one. 4) The second thread does some operation -- most likely to a different file -- but encounters an error -- which triggers the exit-handler. 5) The exit-handler tries to close the file associated with first thread (you said RMS$_BUSY error was coming back to exit-handler) but first thread operation is still stalled so bit in FAB$B_BLN still toggled (appropriately) and RMS$_BUSY error is (appropriately) returned by RMS when FAB is checked at beginning of $CLOSE operation. So the question is whether the logs for this application would have any record of whether just before the $close RMS$_BUSY error returned to exit-handler, there was an error returned to a thread. I would think one of my suggestions in .7 could add some support for this hypothesis. If the exit-handler went into a wait (lib$wait), I would think this would give the first thread a chance to get rescheduled and complete its operation. Then when wait ended if exit-handler retried $close operation, it either will succeed (if first thread was doing a $create) or fail but with a different error than RMS$_BUSY if first thread had been doing a $close -- something that indicates there is no file to close. But if this system is really taxed for memory, etc., you want to put in a decent wait to get anything conclusive. I would say at least 1 minute. This actually would dove-tail nicely with only one machine experiencing problem. It may be that it is low on some quota -- something that is making this system have a higher probability of errors occurring that would increase its chances of breaking through this hole. Not to get too far ahead, but if it does turn out this hypothesis is correct, my understanding is that there are ways in user application to safeguard against the collision of DECthreads and exit-handler. Hope this helps, -- Elinor
3009.10	"Synchronous" doesn't mean "uninterruptable"	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Wed Feb 12 1997 15:46`	6
	> But since DECthreads uses > a user-mode timer AST to switch threads, it shouldn't be able > to switch in the middle of an RMS exec-mode operation. But remember that RMS does its waiting in user-mode, specifically to allow user-mode AST's to get delivered during long waits.
3009.11		TLE::REAGAN	All of this chaos makes perfect sense	`Fri Feb 14 1997 10:09`	34
	Thanks for the extra info and ideas. The fact that RMS waits in user-mode with the "busy" bit still set is different than what I had imagined. I can see why, but it does lead to some interesting situations where multiple threads are trying to write to the same file. The interesting part is that the code is essentially the same on Alpha and the customer hasn't seen anything like this there. As for cleanly hooking DECthreads, AST-driven programs, and exit-handlers, my experience from reading the various notesfiles is that you are on your own. The problem really is that the PASRTL has the exit-handler. It has no idea whether the program is using DECthreads. The scenario in .9 could easily be done without DECthreads by having you own timer AST firing at just the right place waiting for a $PUT to complete and the AST routine trying to do a $CLOSE. My biggest question with regards to DECthreads (perhaps I should be asking there instead, but I think I have over the years and not been satisified with the answers) is what else is going on while an exit-handler is running. My experience is that other threads may continue to run while another thread is in the exit-handler. At one point, I even convinced myself that multiple copies of the exit-handler could be running. The DECthreads manuals are quite vague on these points. At the point in the exit-handler that I'm doing the close, I have disabled ASTs. If I do a LIB$WAIT, it doesn't matter since DECthreads' timer AST won't get to fire and the other thread that is waiting for its $PUT to finish won't get to run. The exit-handler could renable ASTs during its call to LIB$WAIT, but in the meantime any 'ol AST could happen and do Pascal RTL work upsetting the data structures that the exit-handler thinks it has exclusive access to. -John
3009.12		WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Fri Feb 14 1997 11:14`	4
	John, when you disable user-mode AST's, that doesn't prohibit RMS's exec-mode AST from getting delivered. RMS will still get its chance to clear the flag in the IFAB (assuming it does that in exec mode).
3009.13		TLE::REAGAN	All of this chaos makes perfect sense	`Mon Feb 17 1997 15:13`	5
	Sure, but who (and in which mode) clears the bottom bit in the FAB$B_BLN field? RMS' exec-mode AST or the user-mode code that continues after the $PUT waits for the AST to be delivered. -John
3009.14	PASRTL exit handler solution	STAR::EWOODS		`Tue Feb 18 1997 13:51`	43
	I sent a mail message to John to following effect: ---------------------------------------------------------------------------- I was wrong in thinking a LIB$WAIT would help you. The thread stalled in user mode has to get back to WAIT_IO_DONE and it is never going to from an exit handler. Consulted with a guru, who feels there is a "right thing" for the PASRTL exit handler to do: 1. In your exit handler do not use the user's FAB structures. For your $close use a new FAB with the IFI in the user's FAB structure copied in. Now it is impossible for the FAB$B_BLN bit 0 to be set. 2. If the IFAB's busy bit is set, the $close will return the error: RMS$_ACT (file activity precludes operation). 3. If this error is returned, you can use RMS service SYS$WAIT with your new FAB. When wait completes, you reissue your $close. Note: If your exit handler does anything with user's RAB structures, it should follow same practice. Bit 0 in FAB$B_BLN is set when exiting from RMS exec-level code when stall is required. There is a module in VAX SYS facility that handles the WAIT_IO_DONE (system_service_exit). The bit is set in caller's mode, and when a status has been filled into FAB structure itself, the bit is cleared in caller's mode. RMS has an AST on the $QIO operation itself that it is waiting to complete. This exec-mode AST can interrupt caller mode code (assuming called from user or supervisor mode) at anytime and complete any RMS processing remaining for the operation after $QIO itself completes. In the normal state of affairs it would eventually get back to the user-mode code in the system-service exit path and clear bit 0 in the FAB$B_BLN. But from an exit-handler it will never return. I would think the change suggested above, wouldn't involve too much recoding in your PASRTL exit handler. Then you can test it on the VAX machine, which is consistently getting RMS$_BUSY from their DECthread's application. -- Elinor
3009.15		TLE::REAGAN	All of this chaos makes perfect sense	`Mon Feb 24 1997 13:31`	16
	Just to follow up on the final resolution of this. The application creates multiple detached threads. Their main program does not use pthread_join to wait for these detached threads but just uses a condition variable to tell the main program that it is done. The main program then returns to the system where the RTL's exit handler is called. However, the user's thread wasn't quite finished. They forgot about the thread-specific destructor. In that destructor (running in parallel with the main routine which is trying to return back to the system), they explicitly close a file. That destructor didn't quite finish before the main program returned and the exit-handler then tried to close the same file and got a busy. The customer has modified their code to remove the race condition. -John