Conference iosg::all-in-1_v30

Title: *OLD* ALL-IN-1 (tm) Support Conference
Notice: Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator: IOSG::PYE
Created: Thu Jan 30 1992
Last Modified: Tue Jan 23 1996
Last Successful Update: Fri Jun 06 1997
Number of topics: 4343
Total number of notes: 18308

2473.0. "SMU and FCS problems" by JOCKEY::MARSHALLJ (Glad that the devil is red ......) Thu Mar 25 1993 11:49

                 *****  Caught by Catch-22 ?!?!  *****

Hi,

Since upgrading to V3.0, my customer has been experiencing problems that seem 
to be related to the File Cabinet Server.  They are running a 5 node cluster 
with >1000 concurrent users.  A lot of the problems have been overcome by FCS 
tuning but a couple remain.  I've done the usual DIR/TITLE but can't seem to 
find a match, so can anyone help?

1.  A user performs an SMU and works merrily away.  However when they have 
    finished and try to SMU back to their own account, they get a message that 
    "Drawer is already in use by another User".  Investigation shows that the 
    FCS process still has the user's own DOCDB, DAF and RESERVATIONS.DAT files 
    for their MAIN drawer open.

2.  The solution to this problem would seem to be to do a SM MFC MS MSC and, from 
    the Index, select the users that are affected and disconnect them.  This is 
    where Catch 22 comes in.  When this operation is attempted, the error message 
    "Client Buffer not big enough for Requested Operation" appears and no Index 
    is displayed.  Consequently, the remaining alternative is to stop the FCS in 
    its entirety, which then affects everyone.

Are these known problems ?  Any workarounds ?  Any fixes now or in a PFP/PFR ?

Thanks in advance,
John
T.R  Title  User  Personal Name  Date  Lines
2473.1  What version are you running?  CHRLIE::HUSTON  Thu Mar 25 1993 14:17  30
    
    I just did it and it worked fine.
    
>1.  A user performs an SMU and works merrily away.  However when they have 
>    finished and try to SMU back to their own account, they get a message that 
>    "Drawer is already in use by another User".  Investigation shows that the 
>    FCS process still has the user's own DOCDB, DAF and RESERVATIONS.DAT files 
>    for their MAIN drawer open.
>
>2.  The solution to this problem would seem to be to do a SM MFC MS MSC and, from 
>    the Index, select the users that are affected and disconnect them.  This is 
>    where Catch 22 comes in.  When this operation is attempted, the error message 
>    "Client Buffer not big enough for Requested Operation" appears and no Index 
>    is displayed.  Consequently, the remaining alternative is to stop the FCS in 
>    its entirety, which then affects everyone.
>
>Are these known problems ?  Any workarounds ?  Any fixes now or in a PFP/PFR ?
    
    Killing the client connections will not close down the drawer files.
    The FCS keeps drawers open for performance reasons. Are you by chance
    running V2.4 of ALL-IN-1?
    
    There is no workaround for the "client buffer not big enough..." 
    problem; it has to be fixed in the UI.
    
    What happens if, while you are SMU'd to another user, you try to 
    go into ALL-IN-1 under your own account from another terminal?
    
    --Bob
    
2473.2  FROIS1::HOFMANN  Stefan Hofmann, LC Frankfurt, ISE  Thu Mar 25 1993 14:32  4
    Bob,
    
    John must be using V3, since V2.4 didn't provide an SMU option.
    	Stefan
2473.3  IOSG::MAURICE  Because of the architect the building fell down  Thu Mar 25 1993 18:07  27
    Hi,
    
    Here's what I think the scenario is:
    
    1. User does an SMU and so the current drawer is the Manager's drawer.
    
    2. A cross-drawer operation is done which involves the user's MAIN
       drawer; for example, a message might be refiled to it. The FCS 
       now has to access the user's MAIN drawer and, as a performance
       optimisation, first attempts to get an exclusive lock on the
       drawer. As only the FCS is accessing the drawer, this is successful.
    
    3. The user now wishes to SMU back to the MAIN drawer. The ALL-IN-1
       File Cabinet code attempts to get a lock on the drawer. In normal 
       working the FCS is triggered to release the exclusive lock and 
       downgrade to a read lock. Your symptom suggests that the FCS is
       not reacting to the downgrade request. Note that no client/server
       dialogue is required - it is the VMS lock manager which should 
       trigger the FCS into performing the downgrade.
    
    Since this is an abnormal situation I recommend you look in the FCS log
    files to see if any errors have been recorded there.
    
    Cheers
    
    Stuart
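
The three steps above can be modelled with a toy lock manager. This is only a sketch of the idea, assuming a simplified two-mode lock ("exclusive" and "read") and a plain callback standing in for the lock manager's blocking notification; `ToyLockManager`, `switch_back` and the drawer name are hypothetical and are not the VMS lock manager or FCS interfaces.

```python
from typing import Callable, Dict, Optional, Tuple


class ToyLockManager:
    """Hypothetical stand-in for a lock manager; NOT the VMS API."""

    def __init__(self) -> None:
        # resource name -> {holder name: lock mode}
        self.modes: Dict[str, Dict[str, str]] = {}
        # (resource, holder) -> callback fired when that holder blocks someone
        self.blocking: Dict[Tuple[str, str], Callable[[], None]] = {}

    def request(self, resource: str, who: str, mode: str,
                on_blocking: Optional[Callable[[], None]] = None) -> bool:
        held = self.modes.setdefault(resource, {})
        # Tell every holder that conflicts with this request to loosen its hold.
        for other, other_mode in list(held.items()):
            if other != who and "exclusive" in (mode, other_mode):
                callback = self.blocking.get((resource, other))
                if callback is not None:
                    callback()
        # Re-check after the callbacks ran: grant only if nothing conflicts now.
        conflict = any(other != who and "exclusive" in (mode, other_mode)
                       for other, other_mode in held.items())
        if conflict:
            return False
        held[who] = mode
        if on_blocking is not None:
            self.blocking[(resource, who)] = on_blocking
        return True

    def downgrade(self, resource: str, who: str) -> None:
        """Holder drops from exclusive to read access."""
        self.modes[resource][who] = "read"


def switch_back(server_reacts: bool) -> bool:
    """Return True if the user's session gets its MAIN drawer back."""
    manager = ToyLockManager()
    drawer = "USER MAIN"  # hypothetical resource name
    if server_reacts:
        react = lambda: manager.downgrade(drawer, "FCS")
    else:
        react = lambda: None  # server ignores the downgrade request
    # Step 2: the server takes an exclusive lock as an optimisation.
    assert manager.request(drawer, "FCS", "exclusive", on_blocking=react)
    # Step 3: the user's own session now asks for the drawer.
    return manager.request(drawer, "user session", "read")
```

In this model `switch_back(True)` succeeds because the server downgrades when asked, while `switch_back(False)` fails, which corresponds to the "Drawer is already in use by another User" symptom.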
    
2473.4  Intermittent problem - will post logs soon  JOCKEY::MARSHALLJ  Glad that the devil is red ......  Tue Mar 30 1993 11:21  11
    
    		****	awaiting further info	****
    
    Re .1,.3
    
    Thanks for the ideas so far.  The problem isn't reproducible at will, so
    I have asked the customer to copy the log files and also to turn on FCS
    tracing as soon as the next occurrence is reported.  I will post the
    results here.
    
    John
2473.5  More FCS Problems (moved from 2585.0)  TENTO1::MARSHALLJ  Glad that the devil is red ......  Sat Apr 17 1993 16:49  205
    Hi,
    
    Unfortunately these problems haven't gone away. Below I include more
    detailed problem statements plus the associated FCS logs containing the
    relevant error messages etc.
    
    Any help would be gratefully appreciated.
    
    Is there anything else we can set to receive more debug/error type
    information ?
    
    Just out of curiosity, some of the errors listed are MCC-E-*******
    
    Does MCC mean that hooks are in the FCS so that it can be
    managed/monitored by DECmcc (Polycenter Framework) ?  If so, any details
    on what I need to do to enable this ?
    
    Thanks in advance,
    John
    
    	______________________________________________________________
    

We have again experienced problems with the A1 file cab servers this week.
These problems have not all been the same but generally require the file cab 
server in question to be shut down and restarted. Details are as follows:-

PROBLEM 1:-

User did a reserve on a document then unreserved it. At this point the user got 
a DOCUMENT IN USE. We were able to use the MSC option to show the users on the 
file cab server but this user did not show as a client. Looking at the files 
held open on the user's disk, the file cab server had the user's DOCDB, 
RESERVATIONS etc. held open as well as the .WPL file of the document the user 
was trying to access. Shutting down and restarting cleared the problem.

PROBLEM 2:-

Over the past couple of days we have had a few users reporting problems with 
SMU. They have SMU'd successfully to another user and attempted to create a new 
email. At this point they enter the EMHEAD information and attempt to enter WPS. 
It is then that they are taken back to the EMAIL menu with a message UNABLE TO 
CREATE DOCUMENT. Investigating the file cab servers, we found one that was 
rejecting requests. Its channel count was up to 356 out of a max of 400, with 
about 35 attached clients and approx 30 more threads allocated than deleted.
There should be ample channel count to accommodate the number of users on this 
server. What appears to be happening, and this is also reflected in PROBLEM 3 
below, is that the file cab server is holding open channels and not releasing 
them. 

PROBLEM 3:-

This morning a user logged into ALLIN1 and attempted to access his main drawer 
for WP and got DRAWER CURRENTLY BEING USED BY ANOTHER USER. None of this user's 
drawers are shared and he does not have access to any other drawer. 
Investigation of the files open for him showed him logged on to the GRFH9 node 
of the cluster whilst the file cab server on the GRFH12 node was holding open 
his DOCDB.DAT, RESERVATIONS.DAT and DAF.DAT. Looking at the SAI option on the 
Manage servers screen for the GRFH12 server, we could see that the channel count 
was up to 356 out of 400 and it was rejecting requests. Again it appears that 
channels are being held open. At a guess, the user in question was probably 
logged on to the GRFH12 node yesterday and the server has held onto him.


Below are the server log files from each node in our cluster since we last 
rebooted on the 11th April. They show various internal errors and problems as 
well as the shutdown/restarts.



11-APR-1993 15:42:38.52  Server: GRFH8::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete

13-APR-1993 13:40:50.44  Server: GRFH8::"73="  Error: %OAFC-E-INTERR, Internal 
error in File Cabinet Server  Message: FCS has access violated, please submit 
an SPR.

13-APR-1993 22:54:56.38  Server: GRFH8::"73="  Error: %MCC-E-ALERT_TERMREQ, 
thread termination requested  Message: CsiCacheBlockAstService; Error from 
mcc_astevent_receive

13-APR-1993 22:54:57.33  Server: GRFH8::"73="  Error: %MCC-E-ALERT_TERMREQ, 
thread termination requested  Message: SrvTimeoutSysMan; receive alert to 
terminate thread

13-APR-1993 22:55:54.36  Server: GRFH8::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete

14-APR-1993 18:47:02.91  Server: GRFH8::"73="  Error: %MCC-E-IN_USE_ERROR, in 
use error  Message: CsiCacheFlushDrawerAccess; Error from mcc_mutex_try_lock



11-APR-1993 15:37:08.05  Server: GRFH9::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete



11-APR-1993 15:52:50.90  Server: GRFH10::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete

13-APR-1993 17:22:52.86  Server: GRFH10::"73="  Error: %MCC-E-IN_USE_ERROR, in 
use error  Message: CsiCacheFlushDrawerAccess; Error from mcc_mutex_try_lock



11-APR-1993 15:54:35.70  Server: GRFH11::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete



11-APR-1993 15:39:16.06  Server: GRFH12::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete

14-APR-1993 09:11:05.46  Server: GRFH12::"73="  Error: %OAFC-E-INTERR, 
Internal error in File Cabinet Server  Message: FCS has access violated, 
please submit an SPR.

15-APR-1993 10:36:35.55  Server: GRFH12::"73="  Error: %MCC-E-EXISTENCE_ERROR, 
object does not exist  

15-APR-1993 10:51:26.76  Server: GRFH12::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete




11-APR-1993 15:37:25.91  Server: GRFH13::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete

13-APR-1993 15:49:12.96  Server: GRFH13::"73="  Error: %OAFC-E-INTERR, 
Internal error in File Cabinet Server  Message: FCS has access violated, 
please submit an SPR.

13-APR-1993 16:10:15.94  Server: GRFH13::"73="  Error: %MCC-E-EXISTENCE_ERROR, 
object does not exist  

14-APR-1993 14:56:34.56  Server: GRFH13::"73="  Error: %MCC-E-EXISTENCE_ERROR, 
object does not exist  

14-APR-1993 14:57:02.50  Server: GRFH13::"73="  Error: %MCC-E-EXISTENCE_ERROR, 
object does not exist  

14-APR-1993 15:00:35.38  Server: GRFH13::"73="  Message: Startup for File 
Cabinet Server V1.0-2 complete
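
Server log entries like the ones above can be tallied per node and error code with a short script. This is a minimal sketch, assuming each entry starts with a `DD-MMM-YYYY HH:MM:SS.cc` timestamp and that wrapped continuation lines belong to the preceding entry; `split_entries`, `count_errors` and `SAMPLE` are illustrative names, and the sample text is copied from the GRFH8 entries in this note.

```python
import re
from collections import Counter

# An entry begins with a timestamp such as "13-APR-1993 13:40:50.44".
ENTRY_START = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2}")
SERVER = re.compile(r"Server:\s+(\w+)::")
ERROR = re.compile(r"Error:\s+%(\w+-\w-\w+)")


def split_entries(text):
    """Join wrapped lines so that each list item is one complete log entry."""
    entries, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) and current:
            entries.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        entries.append(" ".join(current))
    return entries


def count_errors(text):
    """Count (node, error code) pairs; entries without an error are skipped."""
    counts = Counter()
    for entry in split_entries(text):
        server = SERVER.search(entry)
        error = ERROR.search(entry)
        if server and error:
            counts[(server.group(1), error.group(1))] += 1
    return counts


SAMPLE = '''\
11-APR-1993 15:42:38.52  Server: GRFH8::"73="  Message: Startup for File
Cabinet Server V1.0-2 complete

13-APR-1993 13:40:50.44  Server: GRFH8::"73="  Error: %OAFC-E-INTERR, Internal
error in File Cabinet Server  Message: FCS has access violated, please submit
an SPR.

13-APR-1993 22:54:56.38  Server: GRFH8::"73="  Error: %MCC-E-ALERT_TERMREQ,
thread termination requested  Message: CsiCacheBlockAstService; Error from
mcc_astevent_receive
'''

if __name__ == "__main__":
    for (node, code), n in sorted(count_errors(SAMPLE).items()):
        print(node, code, n)
```

Run over the full listing, a tally like this makes it easier to see which nodes logged access violations before their restarts.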


Below is an extract from one of the file cab server error logs 
(OAFC$SERVER_ERROR.LOG). The information in this log is typical of what is in 
all six of our file cab server logs on our cluster. The manual says that errors 
should be reported to Digital if they occur in this log. 

Can you throw any light on them? 

Is it also possible to move the location of this log file from SYS$MANAGER to 
our own location and perform some form of new version processing? At present the 
file cab servers have been appending to the same file since we brought up version 
3 of ALL-IN-1 last October.


	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]�S
	Drawer owner: DIRECTUW                      
	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]��
	Drawer owner: DIRECTUW                      
	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$OANDG:[OANDGSD.ALLIN1.OGIPOL]
	Drawer owner: OANDGSD                       
ALL-IN-1 Index Server Internal Error:
    Error locking DAB during cache garbage collection:
	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$OANDG:[OANDGSD.ALLIN1.OGIPOL])�
	Drawer owner: OANDGSD                       
	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$ITNLUSER:[ALEXANDERMM.ALLIN1]ab.dat
	Drawer owner: ALEXANDERMM                   
	The lock on the following drawer has become invalidated by another
	process.  Note that the lock has been granted and OafcNormal will be
 	returned to the client, however, all other processes wishing to share
	this lock will also be granted invalid locks until all processes
	sharing this lock are terminated.
	Drawer directory: DIR$DIV36:[RIUKIPS.ALLIN1.SAH_SECTION_INFO]EMO!000874
	Drawer owner: RIUKIPS                       

2473.6  A few comments  CHRLIE::HUSTON  Mon Apr 19 1993 16:06  77
    re .5
    
    >Is there anything else we can set to receive more debug/error type
    >information ?
    
    The only other thing you can do is turn on FCS tracing for the 
    users that are having problems. I'm not sure if it will show anything,
    and it will get large quickly, but it's worth a shot.
    
    >Just out of curiosity, some of the errors listed are MCC-E-*******
    >
    >Does MCC mean that hooks are in the FCS so that it can be
    >managed/monitored by DECmcc (Polycenter Framework) ?  If so, any details
    >on what I need to do to enable this ?
    
    MCC is the threads package used by the FCS. There is nothing you can
    do to get more information from it.
    
>User did a reserve on a document then unreserved it. At this point the user got 
>a DOCUMENT IN USE. We were able to use the MSC option to show the users on the 
>file cab server but this user did not show as a client. Looking at the files 
>held open on the user's disk, the file cab server had the user's DOCDB, 
>RESERVATIONS etc. held open as well as the .WPL file of the document the user 
>was trying to access. Shutting down and restarting cleared the problem.
    
    Having an FCS trace of this would be helpful to see what FCS calls are
    being made and what status is being returned. It sounds like there is
    a bit of non-cooperation between the FCS and IOS with respect to
    locking.
    
>Is it also possible to move the location of this log file from SYS$MANAGER to 
>our own location and perform some form of new version processing? At present the 
>file cab servers have been appending to the same file since we brought up version 
>3 of ALL-IN-1 last October.
    
    You can move the log simply by renaming it: the FCS opens the file and,
    if it is not there, creates a new one. Sorry, but the location of
    oafc$server_error.log is hard-coded in the FCS.
    
    >	The lock on the following drawer has become invalidated by another
    >	process.  Note that the lock has been granted and OafcNormal will be
    >	returned to the client, however, all other processes wishing to share
    >	this lock will also be granted invalid locks until all processes
    >	sharing this lock are terminated.
    >	Drawer directory: DIR$BROKACCT:[DIRECTUW.ALLIN1.CREDIT_CONTROL]�S
    >	Drawer owner: DIRECTUW                      
    
    The only time I have seen this is when IOS has a MAIN drawer open (not
    by using the FCS) and then the FCS tries to access it. There is code
    in to allow the locks to be managed properly.  What happened is that
    the FCS had an exclusive lock on the drawer and IOS (or someone else)
    also requested access. Background ASTs and the VMS lock manager work
    together to tell the guy with the exclusive lock to loosen up its hold
    on the resource (drawer name). This sounds like something went corrupt
    in the lock resource. The drawer directory looks like garbage.
    
    In fact all the drawer directory fields in that log appear to have
    a couple of bytes of garbage on the end. 
    
    As for the channels, the only thing I can think of is that when the FCS
    access violates it is not letting go of the channels that that thread
    had, probably because channels are allocated per process and there is no
    map of which thread has how many channels. The condition handler will
    attempt to close down files/drawers; I'm not sure if it is smart enough
    to let go of the channels as well.
    
    Also, you seem to have a lot of users for only 400 channels. Each drawer
    takes 4 channels, and I seem to recall you having a lot of users (could
    be confusing you with someone else though). If so, bump up the channel
    count and see if that problem goes away.  I also don't see any messages
    in the log file about the FCS thinking it is low on channels and trying
    to release some. Whenever the FCS hits 90% used channels, it tries to
    close some drawers/files down to free up channels; when it does this
    it writes a message to the server log file
    (sys$manager:oafc$server.log).
    
    --Bob
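
The channel figures in this thread can be checked with a little arithmetic. The sketch below uses only the numbers quoted here (4 channels per drawer, a 90% release threshold, 356 of 400 channels in use). It assumes that "hits 90%" means "at or above 90%" and that every channel is spent on drawers, neither of which is confirmed; the function names are made up for illustration.

```python
CHANNELS_PER_DRAWER = 4   # from the note: "each drawer takes 4 channels"
RELEASE_PERCENT = 90      # from the note: cleanup starts at 90% used channels


def drawers_supported(max_channels):
    """Upper bound on open drawers if every channel goes to drawers (assumed)."""
    return max_channels // CHANNELS_PER_DRAWER


def release_point(max_channels):
    """Smallest channel count at or above the threshold (integer ceiling)."""
    return (max_channels * RELEASE_PERCENT + 99) // 100


def would_try_release(used, max_channels):
    """True if the server would be expected to start closing drawers/files."""
    return used >= release_point(max_channels)


if __name__ == "__main__":
    print(drawers_supported(400))       # at most 100 drawers open at once
    print(release_point(400))           # cleanup expected from 360 channels
    print(would_try_release(356, 400))  # reported count: just under threshold
```

Under these assumptions the reported 356 channels sit just below the 360-channel release point, which would be consistent with no low-channel messages appearing in the server log.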

2473.7  Any news here ???  VNABRW::EHRLICH_K  Ronnie James DIO, vocals!  Wed Jun 30 1993 12:01  64
    Hi Bob, Kevin,
    
    	I was at a customer (ABB Vienna) this morning because
    they've had some trouble with SMU and switching back again (DWRLOCKED!),
    the same as John mentioned in Re .1.
    
    Some users also had problems creating a mail message. They filled in
    the TO's, CC's and a subject, and after the subject they hung.
    
    Looking at the trace, I found the following:
    
    ![SCRIPT] WP_SYS_EDIT Line 7: GET #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA$CU
    !               RDOC_FOLDER '".' OA$CURDOC_DOCNUM
    ![FUNC]   Function = GET, Cmd line = #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA
    !               $CURDOC_FOLDER '".' OA$CURDOC_DOCNUM
    ![A1LOG]  Entry = %OA-I-LOGFUN, Funktion: GET             #DOC_FULLPATH = #DRAWE
    !               R_FULLPATH "." '"' OA$CURDOC_FOLDER '".' OA$CURDOC_DOCNUM
    ![SYMBOL] Symbol = #DOC_FULLPATH = #DRAWER_FULLPATH "." '"' OA$CURDOC_FOLDER '".
    !               ' OA$CURDOC_DOCNUM, Value = OFFICE::."[PINCZOLITS JOSEF]STANDARD
    !               "."AUSGANG".000437
    ![SCRIPT] WP_SYS_EDIT Line 8: FILECAB GET_ATTRIBUTES (DOCUMENT = #DOC_FULLPATH,
    !               #MS = MAIL_STATUS, #MF = MODIFY)
    ![FUNC]   Function = FILECAB, Cmd line = GET_ATTRIBUTES (DOCUMENT = #DOC_FULLPAT
    !               H, #MS = MAIL_STATUS, #MF = MODIFY)
    ![A1LOG]  Entry = %OA-I-LOGFUN, Funktion: FILECAB         GET_ATTRIBUTES (DOCUME
    !               NT = #DOC_FULLPATH, #MS = MAIL_STATUS, #MF = MODIFY)
    ![SYMBOL] Symbol = #DOC_FULLPATH, Value = OFFICE::."[PINCZOLITS JOSEF]STANDARD".
    !               "AUSGANG".000437
    ![IO]     FILECAB Server Request = LIST
    ![IO]     Getting field CODE from OA$FOLDERS, Value = DEDE
    ![A1LOG]  Entry = %OA-I-LOGERROR, %OA-W-SUBTERM, Fehler beim Ablauf des Subproze
    !               sses "20801C18".
    ![A1LOG]  Entry = %OA-I-LOGERROR, -NONAME-W-NOMSG, Message number
    >>>>>>   A6E83240 <<<<<<
    
    At this point I had to STOP/ID the process! A SHOW DEVICE /FILES showed
    that the files were locked by the FCS.
    
    If you're interested in the whole Tracefile you'll find it on VNOTSC::
    (49790::)ABB_TRACE.LOG
    
    Now my question is: have you both found something? Is there any news
    about the FCS? I've told ABB to install ICF #10, which solves some
    problems with SMU.
    
    ABB will tune the FCS as described in the Management Guide; maybe this
    will help ??? But stopping and restarting the FCS cannot be a solution.
    
    No fun, I know.
    
    Best regards and greetings from Vienna
    Charly  
     
2473.8  some SMU problems have been fixed  CHRLIE::HUSTON  Wed Jun 30 1993 13:39  13
    
    There were problems in the FCS that would restrict SMU. They have been
    fixed and put into some sort of patch (MUP or ICF, not sure which; I
    just build 'em, don't ship 'em :-) ).
    
    ICF 10 does sound about right for the timeframe though.
    
    Also, the trace you showed is very hard to use to diagnose FCS problems.
    If you could show that together with the FCS trace for the user in
    question, things may make more sense.
    
    --Bob
    
2473.9  Yes, but it's difficult to trace ...  VNABRW::EHRLICH_K  Ronnie James DIO, vocals!  Wed Jun 30 1993 14:05  20
    Bob,
    
    	first, it's great to get such a fast response. - Thank you very
    much!
    
    It's difficult to trace things that happened in the past. And enabling
    tracing after an FCS restart for 500 ALL-IN-1 users will also be a 
    challenge, but I will pass this on to ABB. Mostly, though, the problems
    occur when no one is reachable.
    
    It looks as if there is sometimes some 'unserious' behaviour
    between the FCS and the VMS lock manager. Who knows?
    
    ABB has restarted the FCS and all problems have gone (at the moment;
    hopefully they never come back!).
    
    Best regards
    Charly_from_CSC_Vienna
    
    
2473.10  Doing. . .  IOSG::STANDAGE  Wed Jun 30 1993 14:44  14
    
    Charly,
    
    Some filelocking problems similar to what you are experiencing have
    been investigated to some degree here in IOSG. The good news is that
    progress is being made, but the extent of the changes means that you
    won't see a fixed version of the FCS for a while yet.
    
    
    Thanks for your feedback,
    
    Kevin.
    
    
2473.11  Yes, I know (2934.0)  VNABRW::EHRLICH_K  Ronnie James DIO, vocals!  Wed Jun 30 1993 15:18  15
    Kevin,
    
    	yes, I understand what you mean by
     
    >The good news is that
    >progress is being made, but the extent of the changes means that you
    >won't see a fixed version of the FCS for a while yet.
    
    after the announcement in note 2934 by GAP.
    
    Is there really no way to get an 'ICF' for this? If there's a need, I'll 
    come over to you and help you! 
    
    Good luck to you (as we say in Austria: toi, toi, toi!)
    Charly_who's_happy_and_a_little_bit_sad_now.
2473.12  Clarifying...  IOSG::PYE  Graham - ALL-IN-1 Sorcerer's Apprentice  Thu Jul 01 1993 16:26  6
    Well actually, (putting words in Kevin's mouth!) I think he meant
    that the fix is sufficiently complicated that we might not be doing it
    straight away. Besides, the FCS team (which was unaffected by the 2934
    announcement) is flat out on our commitments for TeamLinks connection.
    
    Graham
2473.13  We'll get there eventually!  IOSG::STANDAGE  Fri Jul 02 1993 10:37  11
    
    Yes. As usual, Graham is very accurate !
    
    The changes to the server are rather extensive, so we want to take our
    time and get it right; there are also other commitments
    which are taking priority.
    
    Thanks,
    Kevin.