Conference iosg::all-in-1_v30

Title:	OLD ALL-IN-1 (tm) Support Conference
Notice:	Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:	IOSG::PYE

Created:	Thu Jan 30 1992
Last Modified:	Tue Jan 23 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4343
Total number of notes:	18308

2473.0. "SMU and FCS problems" by JOCKEY::MARSHALLJ (Glad that the devil is red ......) Thu Mar 25 1993 11:49

                 *****  Caught by Catch-22 ?!?!  *****

Hi,

Since upgrading to V3.0, my customer has been experiencing problems that seem 
to be related to the File Cabinet Server.  They are running a 5 node cluster 
with >1000 concurrent users.   A lot of the problems have been overcome by FCS 
tuning but a couple remain.  I've done the usual DIR/TITLE but can't seem to 
find a match, so can anyone help?

1.  A user performs an SMU and works merrily away.  However when they have 
    finished and try to SMU back to their own account, they get a message that 
    "Drawer is already in use by another User".  Investigation shows that the 
    FCS process still has open the user's own DOCDB, DAF and RESERVATIONS.DAT 
    files of their MAIN drawer.

2.  The solution to this problem would seem to be to do a SM MFC MS MSC and from 
    the Index select the users that are affected and disconnect them.  This is 
    where Catch 22 comes in.  When this operation is attempted, an error message 
    "Client Buffer not big enough for Requested Operation" appears and no Index 
    is displayed.  Consequently, the remaining alternative is to stop the FCS in 
    its entirety, which then affects everyone.

Are these known problems ?  Any workarounds ?  Any fixes now or in a PFP/PFR ?

Thanks in advance,
John

T.R	Title	User	Personal Name	Date	Lines
2473.1	What version are you running?	CHRLIE::HUSTON		Thu Mar 25 1993 14:17	30
	I just did it and it worked fine.

	>1.  A user performs an SMU and works merrily away.  However when they have
	>    finished and try to SMU back to their own account, they get a message that
	>    "Drawer is already in use by another User".  Investigation shows that the
	>    FCS process still has open the user's own DOCDB, DAF and RESERVATIONS.DAT
	>    files of their MAIN drawer.
	>
	>2.  The solution to this problem would seem to be to do a SM MFC MS MSC and from
	>    the Index select the users that are affected and disconnect them.  This is
	>    where Catch 22 comes in.  When this operation is attempted, an error message
	>    "Client Buffer not big enough for Requested Operation" appears and no Index
	>    is displayed.  Consequently, the remaining alternative is to stop the FCS in
	>    its entirety, which then affects everyone.
	>
	>Are these known problems ?  Any workarounds ?  Any fixes now or in a PFP/PFR ?

	Killing the client connections will not close down the drawer files. The FCS keeps drawers open for performance reasons.

	Are you by chance running V2.4 of ALL-IN-1?

	There is no workaround for the "client buffer not big enough..." problem; it has to be fixed in the UI.

	What happens if, while you are SMU'd to another user, you try to go into ALL-IN-1 into your account from another terminal?

	--Bob
2473.2		FROIS1::HOFMANN	Stefan Hofmann, LC Frankfurt, ISE	Thu Mar 25 1993 14:32	4
	Bob,

	John must be using V3, since V2.4 didn't provide an SMU option.

	Stefan
2473.3		IOSG::MAURICE	Because of the architect the building fell down	Thu Mar 25 1993 18:07	27
	Hi,

	Here's how I think the scenario is:

	1. User does an SMU and so the current drawer is the Manager's drawer.

	2. A cross-drawer operation is done which involves the user's MAIN drawer - perhaps a message is refiled to it for example. The FCS now has to access the user's MAIN drawer, and as a performance optimisation attempts first to get an exclusive lock on the drawer. As only the FCS is accessing the drawer this is successful.

	3. The user now wishes to SMU back to the MAIN drawer. The ALL-IN-1 File Cabinet code attempts to get a lock on the drawer. In normal working the FCS is triggered to release the exclusive lock and downgrade to a read lock. Your symptom suggests that the FCS is not reacting to the downgrade request. Note that no client/server dialogue is required - it is the VMS lock manager which should trigger the FCS into performing the downgrade.

	Since this is an abnormal situation I recommend you look in the FCS log files to see if any errors have been recorded there.

	Cheers
	Stuart
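
	The three steps in .3 can be sketched as a toy model. This is a simulation in Python only: the DrawerLock class, the "EX" and "READ" mode names and the callback functions are made up for illustration, and nothing here calls the VMS lock manager or reflects the real FCS code. It just shows why a holder that ignores the downgrade request would produce the "Drawer is already in use" symptom.

```python
class DrawerLock:
    """Toy model of a single drawer lock (hypothetical, not a VMS API)."""

    def __init__(self):
        self.holders = {}    # holder name -> "EX" (exclusive) or "READ"
        self.callbacks = {}  # holder name -> function called when it blocks someone

    def request(self, who, mode, on_blocking=None):
        # Step 1: notify every holder whose lock conflicts with this request.
        for holder, held in list(self.holders.items()):
            if holder != who and "EX" in (held, mode):
                callback = self.callbacks.get(holder)
                if callback is not None:
                    callback(self, holder)
        # Step 2: grant only if no conflicting lock is left.
        for holder, held in self.holders.items():
            if holder != who and "EX" in (held, mode):
                return False
        self.holders[who] = mode
        if on_blocking is not None:
            self.callbacks[who] = on_blocking
        return True

    def downgrade(self, who, mode):
        self.holders[who] = mode


def cooperative(lock, holder):
    # Normal working: give up the exclusive lock, keep a read lock.
    lock.downgrade(holder, "READ")


def unresponsive(lock, holder):
    # The suspected fault: the downgrade request is ignored.
    pass


def smu_back(fcs_callback):
    """Return True if the user gets the MAIN drawer back after the SMU."""
    lock = DrawerLock()
    assert lock.request("FCS", "EX", fcs_callback)  # cross-drawer operation
    return lock.request("USER", "READ")             # user returns to MAIN


print(smu_back(cooperative))   # user gets the drawer
print(smu_back(unresponsive))  # user is refused
```

	In the model the only difference between the working and the failing case is whether the holder reacts to the notification, which matches the point in .3 that no client/server dialogue is involved.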
2473.4	Intermittent problem - will post logs soon	JOCKEY::MARSHALLJ	Glad that the devil is red ......	Tue Mar 30 1993 10:21	11
	** awaiting further info **

	Re .1,.3

	Thanks for the ideas so far. The problem isn't reproducible at will, so I have asked the customer to copy the log files and also turn on FCS tracing as soon as the next occurrence is reported. I will post them here.

	John
2473.5	More FCS Problems (moved from 2585.0)	TENTO1::MARSHALLJ	Glad that the devil is red ......	Sat Apr 17 1993 15:49	205
	Hi,

	Unfortunately these haven't gone away, and below I include more detailed problem statements plus the associated FCS logs containing the relevant error messages etc. Any help would be gratefully appreciated.

	Is there anything else we can set to receive more debug/error type information ?

	Just out of curiosity, some of the errors listed are MCC-E-*******

	Does MCC mean that hooks are in the FCS so that it can be managed/monitored by DECmcc (POLYCENTER Framework) ? If so, any details on what I need to do to enable this ?

	Thanks in advance,
	John

	______________________________________________________________

	We have again experienced problems with the A1 file cab servers this week. These problems have not all been the same but generally require the file cab server in question to be shut down and restarted. Details are as follows:-

	PROBLEM 1:-

	User did a reserve on a document then unreserved it. At this point the user got a DOCUMENT IN USE. We were able to use the MSC option to show the users on the file cab server but this user did not show as a client. Looking at the files held open on the user's disk, the file cab server had the user's DOCDB, RESERVATIONS etc held open as well as the .WPL file of the document the user was trying to access. Shutting down and restarting cleared the problem.

	PROBLEM 2:-

	Over the past couple of days we have had a few users reporting problems with SMU. They have SMU'd successfully to another user and attempted to create a new email. At this point they enter the EMHEAD information and attempt to enter WPS. It is then that they are taken back to the EMAIL menu with a message UNABLE TO CREATE DOCUMENT. Investigating the file cab servers we found one that was rejecting requests. Its channel count was up to 356 out of a max of 400 with about 35 attached clients and approx 30 more threads allocated than deleted. There should be ample channel count to accommodate the number of users on this server.
	What appears to be happening, and this is also reflected in PROBLEM 3 below, is that the file cab server is holding open channels and not releasing them.

	PROBLEM 3:-

	This morning a user logged into ALLIN1 and attempted to access his main drawer for WP and got DRAWER CURRENTLY BEING USED BY ANOTHER USER. None of this user's drawers are shared and he does not have access to any other drawer. Investigation of the files open for him showed him logged on to the GRFH9 node of the cluster whilst the file cab server on the GRFH12 node in the cluster was holding open his DOCDB.DAT, RESERVATIONS.DAT and DAF.DAT. Looking at the SAI option on the Manage servers screen for the GRFH12 server we could see that the channel count was up to 356 out of 400 and it was rejecting requests to it. Again it appears that channels are being held open. A bit of a guess would say that the user in question was probably logged on to the GRFH12 node yesterday and the server has held onto him.

	Below are the server log files from each node in our cluster since we last rebooted on the 11th April. They show various internal errors and problems as well as the shutdown/restarts.

	11-APR-1993 15:42:38.52  Server: GRFH8::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete
	13-APR-1993 13:40:50.44  Server: GRFH8::"73="
	    Error: %OAFC-E-INTERR, Internal error in File Cabinet Server
	    Message: FCS has access violated, please submit an SPR.
	13-APR-1993 22:54:56.38  Server: GRFH8::"73="
	    Error: %MCC-E-ALERT_TERMREQ, thread termination requested
	    Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
	13-APR-1993 22:54:57.33  Server: GRFH8::"73="
	    Error: %MCC-E-ALERT_TERMREQ, thread termination requested
	    Message: SrvTimeoutSysMan; receive alert to terminate thread
	13-APR-1993 22:55:54.36  Server: GRFH8::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete
	14-APR-1993 18:47:02.91  Server: GRFH8::"73="
	    Error: %MCC-E-IN_USE_ERROR, in use error
	    Message: CsiCacheFlushDrawerAccess; Error from mcc_mutex_try_lock

	11-APR-1993 15:37:08.05  Server: GRFH9::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete

	11-APR-1993 15:52:50.90  Server: GRFH10::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete
	13-APR-1993 17:22:52.86  Server: GRFH10::"73="
	    Error: %MCC-E-IN_USE_ERROR, in use error
	    Message: CsiCacheFlushDrawerAccess; Error from mcc_mutex_try_lock

	11-APR-1993 15:54:35.70  Server: GRFH11::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete

	11-APR-1993 15:39:16.06  Server: GRFH12::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete
	14-APR-1993 09:11:05.46  Server: GRFH12::"73="
	    Error: %OAFC-E-INTERR, Internal error in File Cabinet Server
	    Message: FCS has access violated, please submit an SPR.
	15-APR-1993 10:36:35.55  Server: GRFH12::"73="
	    Error: %MCC-E-EXISTENCE_ERROR, object does not exist
	15-APR-1993 10:51:26.76  Server: GRFH12::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete

	11-APR-1993 15:37:25.91  Server: GRFH13::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete
	13-APR-1993 15:49:12.96  Server: GRFH13::"73="
	    Error: %OAFC-E-INTERR, Internal error in File Cabinet Server
	    Message: FCS has access violated, please submit an SPR.
	13-APR-1993 16:10:15.94  Server: GRFH13::"73="
	    Error: %MCC-E-EXISTENCE_ERROR, object does not exist
	14-APR-1993 14:56:34.56  Server: GRFH13::"73="
	    Error: %MCC-E-EXISTENCE_ERROR, object does not exist
	14-APR-1993 14:57:02.50  Server: GRFH13::"73="
	    Error: %MCC-E-EXISTENCE_ERROR, object does not exist
	14-APR-1993 15:00:35.38  Server: GRFH13::"73="
	    Message: Startup for File Cabinet Server V1.0-2 complete

	Below is an extract from one of the file cab servers' error logs (OAFC$SERVER_ERROR.LOG). The information in this log is typical of what is in all six of our file cab server logs on our cluster. The manual says that errors should be reported to Digital if they occur in this log. Can you throw any light on them?

	Is it also possible to move the location of this log file from SYS$MANAGER to our own location and perform some form of new version processing? At present the file cab servers have been appending to the same file since we brought up version 3 of ALL-IN-1 last October.

	The lock on the following drawer has become invalidated by another process. Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]�S
	Drawer owner: DIRECTUW

	The lock on the following drawer has become invalidated by another process. Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]��
	Drawer owner: DIRECTUW

	The lock on the following drawer has become invalidated by another process.
	Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$OANDG:[OANDGSD.ALLIN1.OGIPOL]
	Drawer owner: OANDGSD

	ALL-IN-1 Index Server Internal Error: Error locking DAB during cache garbage collection:
	The lock on the following drawer has become invalidated by another process. Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$OANDG:[OANDGSD.ALLIN1.OGIPOL])�
	Drawer owner: OANDGSD

	The lock on the following drawer has become invalidated by another process. Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$ITNLUSER:[ALEXANDERMM.ALLIN1]ab.dat
	Drawer owner: ALEXANDERMM

	The lock on the following drawer has become invalidated by another process. Note that the lock has been granted and OafcNormal will be returned to the client, however, all other processes wishing to share this lock will also be granted invalid locks until all processes sharing this lock are terminated.
	Drawer directory: DIR$DIV36:[RIUKIPS.ALLIN1.SAH_SECTION_INFO]EMO!000874
	Drawer owner: RIUKIPS
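
	The server log entries in .5 follow a regular shape (timestamp, "Server:", then an optional "Error:" and an optional "Message:"), so errors can be tallied per node with a short script. The following is a rough Python sketch, assuming the entry layout seen in this topic; the function name and the sample text are illustrative, and the real log file may contain entry types not covered here.

```python
import re
from collections import Counter

# One entry starts at a VMS-style timestamp and runs to the next timestamp.
ENTRY = re.compile(
    r'(?P<when>\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2})\s+'
    r'Server: (?P<node>\w+)::"[^"]*"'
    r'(?P<rest>.*?)(?=\d{2}-[A-Z]{3}-\d{4} \d{2}:|\Z)',
    re.DOTALL,
)
# Status codes look like %FACILITY-S-IDENT, e.g. %OAFC-E-INTERR.
ERROR = re.compile(r'Error: (%[A-Z]+-[A-Z]-\w+)')


def tally_errors(text):
    """Count (node, error code) pairs; entries without an Error: are skipped."""
    counts = Counter()
    for entry in ENTRY.finditer(text):
        error = ERROR.search(entry.group("rest"))
        if error:
            counts[(entry.group("node"), error.group(1))] += 1
    return counts


# Three entries copied from the GRFH8 log in .5.
SAMPLE = (
    '11-APR-1993 15:42:38.52 Server: GRFH8::"73=" '
    'Message: Startup for File Cabinet Server V1.0-2 complete '
    '13-APR-1993 13:40:50.44 Server: GRFH8::"73=" '
    'Error: %OAFC-E-INTERR, Internal error in File Cabinet Server '
    'Message: FCS has access violated, please submit an SPR. '
    '14-APR-1993 18:47:02.91 Server: GRFH8::"73=" '
    'Error: %MCC-E-IN_USE_ERROR, in use error '
    'Message: CsiCacheFlushDrawerAccess; Error from mcc_mutex_try_lock'
)

for (node, code), count in sorted(tally_errors(SAMPLE).items()):
    print(node, code, count)
```

	Because an entry is delimited by the next timestamp rather than by line ends, the same function works whether each entry is on one line or spread over several.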
2473.6	A few comments	CHRLIE::HUSTON		Mon Apr 19 1993 15:06	77
	re .5

	>Is there anything else we can set to receive more debug/error type
	>information ?

	The only other thing you can do is turn on FCS tracing for the users that are having problems. Not sure if it will show anything, and it will get large quick, but worth a shot.

	>Just out of curiosity, some of the errors listed are MCC-E-*******
	>
	>Does MCC mean that hooks are in the FCS so that it can be
	>managed/monitored by DECmcc (POLYCENTER Framework) ? If so, any details
	>on what I need to do to enable this ?

	MCC is the threads package used by the FCS. There is nothing you can do to get more information from it.

	>User did a reserve on a document then unreserved it. At this point the user got
	>a DOCUMENT IN USE. We were able to use the MSC option to show the users on the
	>file cab server but this user did not show as a client. Looking at the files
	>held open on the user's disk, the file cab server had the user's DOCDB,
	>RESERVATIONS etc held open as well as the .WPL file of the document the user was
	>trying to access. Shutting down and restarting cleared the problem.

	Having an FCS trace of this would be helpful to see what FCS calls are being made and what status is being returned. It sounds like there is a bit of non-cooperation between the FCS and IOS with respect to locking.

	>Is it also possible to move the location of this log file from SYS$MANAGER to
	>our own location and perform some form of new version processing? At present the
	>file cab servers have been appending to the same file since we brought up version
	>3 of ALL-IN-1 last October.

	You can move the log simply by renaming it; the FCS opens the file, and if it is not there it creates a new one. Sorry, but the location of oafc$server_error.log is hard coded in the FCS.

	> The lock on the following drawer has become invalidated by another
	> process. Note that the lock has been granted and OafcNormal will be
	> returned to the client, however, all other processes wishing to share
	> this lock will also be granted invalid locks until all processes
	> sharing this lock are terminated.
	> Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]�S
	> Drawer owner: DIRECTUW

	The only time I have seen this is when IOS has a MAIN drawer open (not by using the FCS) and then the FCS tries to access it. There is code in place to allow the locks to be managed properly. What happened is that the FCS had an exclusive lock on the drawer, and IOS (or someone else) also requested access. Background ASTs and the VMS lock manager work together to tell the guy with the exclusive lock to loosen up its hold on the resource (drawer name). This sounds like something went corrupt in the lock resource. The drawer directory looks like garbage. In fact all the drawer directory fields in that log appear to have a couple of bytes of garbage on the end.

	As for the channels, the only thing I can think of is that when the FCS access violates it is not letting go of the channels that that thread had. Probably due to channels being process allocated, and there is no map of what thread has how many channels. The condition handler will attempt to close down files/drawers; not sure if it is smart enough to let go of the channels as well.

	Also you seem to have a lot of users for only 400 channels. Each drawer takes 4 channels, and I seem to recall you having a lot of users (could be confusing you with someone else though). If so, bump up the channel count and see if that problem goes away.

	I also don't see any messages in the log file about the FCS thinking it is low on channels and trying to release some. Whenever the FCS hits 90% used channels, it tries to close some drawers/files down to free up channels; when it does this it writes a message to the server log file (sys$manager:oafc$server.log).

	--Bob
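
	The channel figures quoted in .5 and .6 can be checked with a little arithmetic. The Python sketch below uses only numbers from this topic (400 channels, 4 channels per drawer, a 90% release threshold, 356 channels observed in use). The function names are made up, and reading "hits 90%" as "at least 90% of the maximum" is an assumption about how the FCS applies its threshold.

```python
MAX_CHANNELS = 400           # configured maximum, from .5
CHANNELS_PER_DRAWER = 4      # from .6
RELEASE_THRESHOLD_PCT = 90   # from .6
OBSERVED_IN_USE = 356        # channel count reported in .5


def max_open_drawers(max_channels, per_drawer):
    """How many drawers fit if every channel is used for drawers."""
    return max_channels // per_drawer


def release_trigger(max_channels, threshold_pct):
    """Smallest channel count that is at least threshold_pct of the maximum."""
    return (max_channels * threshold_pct + 99) // 100


drawers = max_open_drawers(MAX_CHANNELS, CHANNELS_PER_DRAWER)
trigger = release_trigger(MAX_CHANNELS, RELEASE_THRESHOLD_PCT)

print("drawers that fit:", drawers)
print("release attempted at:", trigger, "channels")
print("observed:", OBSERVED_IN_USE, "channels,",
      "below trigger" if OBSERVED_IN_USE < trigger else "at or above trigger")
```

	With these numbers there is room for 100 open drawers, and 356 channels in use is 4 short of the 360 at which a release would be attempted. That could be one reason no low-channel message shows up in the server log, but it is only a guess and is not confirmed anywhere in this topic.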
2473.7	Any news here ???	VNABRW::EHRLICH_K	Ronnie James DIO, vocals!	Wed Jun 30 1993 11:01	64
	Hi Bob, Kevin,

	I was at a customer (ABB Vienna) this morning because they've had some trouble with SMU and back again (DWRLOCKED!). (The same as John mentioned in Re .1!) Also some users had problems with creating a mail. They filled in TO's, CC's and a subject, and after the subject they hung. Having a look in the trace I found the following:

	![SCRIPT] WP_SYS_EDIT Line 7: GET #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA$CU
	! RDOC_FOLDER '".' OA$CURDOC_DOCNUM
	![FUNC] Function = GET, Cmd line = #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA
	! $CURDOC_FOLDER '".' OA$CURDOC_DOCNUM
	![A1LOG] Entry = %OA-I-LOGFUN, Funktion: GET #DOC_FULLPATH = #DRAWE
	! R_FULLPATH "." '"' OA$CURDOC_FOLDER '".' OA$CURDOC_DOCNUM
	![SYMBOL] Symbol = #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA$CURDOC_FOLDER '".
	! ' OA$CURDOC_DOCNUM, Value = OFFICE::."[PINCZOLITS JOSEF]STANDARD
	! "."AUSGANG".000437
	![SCRIPT] WP_SYS_EDIT Line 8: FILECAB GET_ATTRIBUTES (DOCUMENT = #DOC_FULLPATH,
	! #MS = MAIL_STATUS, #MF = MODIFY)
	![FUNC] Function = FILECAB, Cmd line = GET_ATTRIBUTES (DOCUMENT = #DOC_FULLPAT
	! H, #MS = MAIL_STATUS, #MF = MODIFY)
	![A1LOG] Entry = %OA-I-LOGFUN, Funktion: FILECAB GET_ATTRIBUTES (DOCUME
	! NT = #DOC_FULLPATH, #MS = MAIL_STATUS, #MF = MODIFY)
	![SYMBOL] Symbol = #DOC_FULLPATH, Value = OFFICE::."[PINCZOLITS JOSEF]STANDARD".
	! "AUSGANG".000437
	![IO] FILECAB Server Request = LIST
	![IO] Getting field CODE from OA$FOLDERS, Value = DEDE
	![A1LOG] Entry = %OA-I-LOGERROR, %OA-W-SUBTERM, Fehler beim Ablauf des Subproze
	! sses "20801C18".
	![A1LOG] Entry = %OA-I-LOGERROR, -NONAME-W-NOMSG, Message number >>>>>> A6E83240 <<<<<<

	Here I had to STOP/ID the process! A SHOW DEVICE /FILES showed that the files were locked by the FCS.

	If you're interested in the whole trace file you'll find it on VNOTSC:: (49790::)ABB_TRACE.LOG

	Now my question is, have you both found anything? Is there any news about the FCS? I've told ABB to install ICF #10, which solves some problems with SMU.

	ABB will tune the FCS as described in the Management Guide; maybe this will help ??? But stopping and restarting the FCS cannot be a solution. No fun, I know.

	Best regards and greetings from Vienna
	Charly
2473.8	some SMU problems have been fixed	CHRLIE::HUSTON		Wed Jun 30 1993 12:39	13
	There were problems in the FCS that would restrict SMU. They have been fixed and put into some sort of patch (MUP or ICF, not sure which; I just build 'em, don't ship 'em :-) ). ICF 10 does sound like about the right timeframe though.

	Also, the trace you showed is very hard to use for diagnosing FCS problems. If you could show that and the FCS trace on the user in question, things may make more sense.

	--Bob
2473.9	Yes, but it's difficult to trace ...	VNABRW::EHRLICH_K	Ronnie James DIO, vocals!	Wed Jun 30 1993 13:05	20
	Bob,

	first, it's great to get such a fast response. - Thank you very much!

	It's difficult to trace things that happened in the past. And enabling tracing after an FCS restart for 500 ALL-IN-1 users will also be a challenge, but I will tell ABB this. But mostly, the problems occur when no one is reachable.

	It looks as if there is sometimes some 'unserious' behaviour between the FCS and the VMS lock manager. Who knows? ABB has restarted the FCS and all problems have gone (at the moment; hopefully they never come back!).

	Best regards
	Charly_from_CSC_Vienna
2473.10	Doing. . .	IOSG::STANDAGE		Wed Jun 30 1993 13:44	14
	Charly,

	Some file-locking problems similar to what you are experiencing have been investigated to some degree here in IOSG. The good news is that progress is being made, but the extent of the changes means that you won't see a fixed version of the FCS for a while yet.

	Thanks for your feedback,

	Kevin.
2473.11	Yes, I know (2934.0)	VNABRW::EHRLICH_K	Ronnie James DIO, vocals!	Wed Jun 30 1993 14:18	15
	Kevin,

	yes, I understand what you mean by

	>The good news is that
	>progress is being made, but the extent of the changes means that you
	>won't see a fixed version of the FCS for a while yet.

	after the announcement in note 2934 by GAP. Is there really no way to get an 'ICF' for this? If there's a need, I'll come over to you and help you!

	Good luck to you (as we say in Austria: toi, toi, toi!)

	Charly_who's_happy_and_a_little_bit_sad_now.
2473.12	Clarifying...	IOSG::PYE	Graham - ALL-IN-1 Sorcerer's Apprentice	Thu Jul 01 1993 15:26	6
	Well actually, (putting words in Kevin's mouth!) I think he meant that the fix is sufficiently complicated that we might not be doing it straight away. Besides, the FCS team (which was unaffected by the 2934 announcement) is flat out on our commitments for TeamLinks connection.

	Graham
2473.13	We'll get there eventually!	IOSG::STANDAGE		Fri Jul 02 1993 09:37	11
	Yes. As usual, Graham is very accurate !

	The changes to the server are rather extensive, so we want to take our time and get it right, plus the fact that there are other commitments which are taking priority.

	Thanks,

	Kevin.