
Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

448.0. "CPU bottleneck with plenty of CPU available" by TAY2P1::HOWARD (Whoever it takes) Thu Apr 10 1997 18:06

    I have a VAX 4000-300 running VMS V6.1.  The main purpose of this
    system is to process print jobs from DQS.  I installed PATHWORKS V5 a
    few months ago, and ran AUTOGEN.  It is possible that some SYSGEN
    parameters were not written to MODPARAMS.DAT.  The problem is that the
    system hesitates.  A MONITOR PROCESS/TOPCPU reveals no processes using
    a lot of CPU time.  In fact the system has never been overloaded; there
    are no interactive users, and PATHWORKS is only used for print serving;
    there are no file shares.  The PSPA report consistently has a note like
    this:
    
                                                                    {C0020}
              There is an apparent bottleneck at the CPU due  to  the
              large  number  of  COM/COMO processes.  There is also a
              process consuming at least 40 percent of the CPU time.
    
              Examine the process which  is  consuming  the  CPU  for
              faulty design, mismanaged priorities, or other possible
              reasons.
    
    Further, there are lists of processes, such as this:

 # Proc       Process receiving most CPU         COM Process
in COM   -----------------------------------  ----------------      Time of
or COMO     USERNAME     IMAGE     %CPU PRIB   USERNAME   PRIB     occurrence
------   ------------ ------------ ---- ---- ------------ ----  ----------------
     9   SYSTEM       NSCHED         20    6 SYSTEM          4   09-APR 08:00:00
    10   SYSTEM       NSCHED         21    6 SYSTEM          4   09-APR 08:02:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 08:04:00
     8   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 08:22:00
     8   SYSTEM       NSCHED          9    6 SYSTEM          4   09-APR 08:24:00
     8   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 08:42:00
     9   SYSTEM       NSCHED          9    6 SYSTEM          4   09-APR 08:44:00
     8   SYSTEM       NSCHED         21    6 SYSTEM          4   09-APR 09:02:00
     7   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 09:32:00
    10   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 09:42:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 09:44:00
    14   SYSTEM       NSCHED         18    6 SYSTEM          4   09-APR 10:02:00
    13   SYSTEM       NSCHED         12    6 SYSTEM          4   09-APR 10:04:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:22:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:24:00
     7   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:42:00
    16   SYSTEM       NSCHED         20    6 SYSTEM          4   09-APR 11:02:00
    15   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 11:04:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:22:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:24:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:32:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:42:00

    
    The obvious conclusion would be that the Scheduler is going crazy,
    but NSCHED is not always the culprit.  I reinstalled it, to no
    avail.  Sometimes it is PATHWORKS processes.  I reduced the
    priority of some detached processes, but others take their place.
    Even 40% utilization by one process should not hurt the system,
    since even during peak times with these processes running there
    is no more than about 60% utilization.  I have other systems
    running similar configurations, and nothing like this happens on
    them.  Is there more that I can look at?  I have lots of PSPA
    reports which I could make available.  I have rebooted the system
    since it started doing this, and it continues to behave the same
    way.  I'm not sure where to turn.
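    
    One thing I can check on my own is whether the PATHWORKS-related
    parameters actually made it into MODPARAMS.DAT and into the active
    system.  Something along these lines (the parameter names here are
    only examples of things PATHWORKS typically touches, not a
    definitive list):

$ SEARCH SYS$SYSTEM:MODPARAMS.DAT "NPAGEDYN","GBLPAGES","GBLSECTIONS"
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE          ! look at the running values
SYSGEN> SHOW NPAGEDYN
SYSGEN> SHOW GBLPAGES
SYSGEN> EXIT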
    
    Ben

448.1. by UTRTSC::utoras-198-48-146.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Fri Apr 11 1997 02:10
If you have DECps data then start looking at it.  It contains a lot of info
on the system's state which should point you in the right direction.  It may
be possible to sample at a finer granularity than the default 2-minute DECps
interval, which should give you the info you need.
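
From memory (check HELP ADVISE for the exact qualifiers on your DECps
version), something along these lines pulls an analysis report for just the
suspect window:

$ ADVISE PERFORMANCE REPORT ANALYSIS -
        /BEGINNING="09-APR-1997 08:00" /ENDING="09-APR-1997 12:00" -
        /OUTPUT=ANALYSIS.RPT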

Jur.

448.2. by BSS::JILSON (WFH in the Chemung River Valley) Fri Apr 11 1997 10:08
You cannot change the main collection (CPD) interval.  It is fixed at 2 
min.  You can create an alternate collection at a smaller interval.

I would look at just the specific evidence time periods and see whether the
CPU is overworked at just those times.  This conclusion is saying that you
have too many COM processes that are above their default priority AND the
process consuming the most CPU time is using > 40% AND the top process is a
high-priority process.  It would appear you have time periods where the CPU
cannot keep up with the demand, or the top-priority process is going
compute bound.  Might be time to add another CPU, if possible.
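
If you want to watch one of those suspect periods directly rather than wait
for the next report, plain MONITOR will do; for instance (the 5-second
interval and file name are just suggestions):

$ MONITOR PROCESSES /TOPCPU /INTERVAL=5
$ MONITOR STATES,MODES /INTERVAL=5
$ ! or record a morning's worth of data for later playback
$ MONITOR STATES,MODES,PROCESSES /INTERVAL=5 /NODISPLAY /RECORD=MON0409.DAT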

Jilly

448.3. "Timers falling into sync?" by EPS::VANDENHEUVEL (Hein) Fri Apr 11 1997 11:26
    
> The problem is that the system hesitates. 
    :
> there are no interactive users, 
    
    Can you refine that problem description, please?
    Who notices the hesitations, since there are no interactive
    users to give feedback on, say, echo times during an edit session :-)
    Should we think about sub-second hiccups while editing?
    Monitor screens freezing for several seconds?
    No printers going for several minutes?
    
    Perhaps you 'simply' have a clock-sync resonance problem?
    
    Are there some timers that are prone to go off (milli)seconds apart
    on a set time interval?  For the sake of argument, let's have
    10 schedulers of sorts waking themselves up every 10 seconds and
    each requiring 0.5 seconds of CPU.  If they manage to do so
    perfectly spread out, you'll see no COM queue and the CPU 50% busy.
    If the timers go off at the same time, you'll see a COM queue of 9
    at that moment.  Now QUANTUM starts to play an important role.
    If each of those tasks is allowed to go from start to finish
    without being pre-empted, then every 0.5 seconds one will be done
    and there will be an average COM queue of about 2.5 (half of the
    10 processes waiting for 5 seconds out of every 10 seconds), and
    still an average CPU busy of 50%!
    If QUANTUM is low, then the CPU will be handed around all the time
    and all the tasks will finish within a quantum of each other, near
    the end of those 5 seconds.  There would be a COM queue of 9 for
    5 seconds out of every 10 and 0 for the other 5, giving an average
    COM queue of 4-ish.  Yuck.
    
    In the latter case, even if the start-time spread is, say, 2 seconds
    wide around a central time, there will still be a clump of activity
    where you are likely to hit the COM queue, and once you hit the
    queue, you'll be part of that queue and compound the problem.  It
    only takes a little external sync point, like a disk volume
    allocation lock, for them to start syncing more and more and for
    the queue to spike more and more.
    
    You might need a bigger QUANTUM to allow processes to start their
    jobs and finish them.  If you can tweak the timers, you may try to
    set them not to coincide all the time (7, 11, and 13 seconds
    instead of 3 times 10).  You might find an external sync event
    between seemingly unrelated processes that you can spread out, for
    example a log file looking for space on a disk all the time,
    fighting with a process creating files.  Solutions: spread
    activities over disks and directories, and make files extend by
    serious chunks instead of the silly 5 blocks found too often; see
    the sketch below.
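    
    For the extend part, something like this raises the RMS default
    system-wide (1000 blocks is an arbitrary example, pick what suits
    the files involved):

$ SET RMS_DEFAULT /SYSTEM /EXTEND_QUANTITY=1000
$ ! or make it permanent via the RMS_EXTEND_SIZE parameter in MODPARAMS.DAT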
    
    Get a second CPU!  That should flatten out those peaks tremendously.
    
    fwiw,
    	Hein.
    
    
    
448.4. "Too many scheduled jobs?" by TAY2P1::HOWARD (Whoever it takes) Fri Apr 11 1997 18:39
    The only interactive users are a few people logging into SYSTEM to
    monitor queues and check on the system.  I see the hesitation, but
    people printing see jobs that seem to take forever to print.  People
    printing the same job to the same printer via an NT server see it
    come out in about a quarter of the time.  Usually, I find that DECps
    gives very concrete suggestions, such as "increase NPAGEDYN" or
    "reduce file fragmentation".  The initial report said to run
    LIBDECOMP, which I did.  It also suggested rewriting the
    applications, which are all standard Digital or former Digital
    products.  I also installed DCPS$SMB, since that is used a great
    deal.  File fragmentation is good to excellent on all drives.
    
    I like the idea of adding a CPU, but that isn't going to happen unless
    I can find an idle asset somewhere.  
    
    The current report gives massive lists of image activations, e.g.,

            # of   Page Faults  Avg.  % of     % of           Uptime/   Cputim/
             activ-  per Actvtn   Ws   Direct Buffered % of     image     image
  Image      ations -Soft--Hard  size   I/O     I/O   Cputim    (sec)     (sec)
  --------  ------- ------ ---- ------ ------  ------ ------  -------  --------
. . .
DQS$CLIENT       4     325   22    374   0.08    0.05   0.01       11      0.56



    Would INSTALLing this image be likely to make a difference?  325 page
    faults per activation does not seem like many over a 24-hour period.
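    
    If I do try it, I assume the command would be something along these
    lines (I'm guessing at the image's home directory):

$ INSTALL ADD SYS$SYSTEM:DQS$CLIENT.EXE /OPEN /HEADER_RESIDENT /SHARED
    
    and I'd add it to the startup procedure so it survives a reboot.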
    
    There are a lot of Scheduler jobs.  Some of them run periodically
    through the day.  I'm not sure why they would cause undue strain,
    since they mostly look for stopped queues or error messages.  They
    are mostly DCL and mostly running at normal priority.  Is it just
    that the Scheduler is waking up to check them?
    
    .3 suggests increasing QUANTUM.  Is that what you mean?  I will
    review these jobs to try to keep them from running into each other.
    
    Ben

448.5. by ZIMBRA::BERNARDO (Dave Bernardo, VMS Engineering) Fri Apr 11 1997 19:22
    I would be tempted to reduce QUANTUM before I'd increase it...
    If you do, reduce AWSTIME as well.
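    
    If you experiment, the usual route is MODPARAMS.DAT plus AUTOGEN;
    the values below are placeholders, not recommendations:

    ! add to SYS$SYSTEM:MODPARAMS.DAT
    QUANTUM = 10     ! placeholder; the VAX default is 20 (units of 10 ms)
    AWSTIME = 10     ! placeholder; keep it roughly in step with QUANTUM

$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT FEEDBACK
    
    I believe both are dynamic parameters, so for a quick trial you can
    also poke the active values in SYSGEN (USE ACTIVE, SET, WRITE ACTIVE).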
    
    d.

448.6. "How many DCPS queues? Version of DCPS?" by KEIKI::WHITE (MIN(2¢,FWIW)) Sat Apr 12 1997 20:40
    
    	There is an issue with DCPS where all the symbionts wake up
    every tenth of a second whether the queues are active or not.
    
    	This is a known problem that occurs because DCPS uses DECthreads
    in its operation.  DECthreads uses a timer AST which expires every
    tenth of a second and then resets itself.  This causes the process
    to become computable (COM) to handle the AST.
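    
    	You can see the effect on an otherwise idle system just by
    watching one of the symbiont processes (they show up in SHOW SYSTEM
    with names like SYMBIONT_nnnn, if I remember right); the CPU time
    and buffered I/O keep creeping even with every queue quiet:

$ SHOW SYSTEM                            ! note a symbiont's PID
$ SHOW PROCESS /CONTINUOUS /ID=xxxxxxxx  ! substitute that PID here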
    

    	How many DCPS queues are there, and what version are they?  This
    would probably throw DECps into fits.
    
    						Bill
    
    PS - Comet V4.3 search criteria: "dcps decthreads cpu time"
    were the four words used.
    
    
448.7. "Running 1 symbiont process per queue" by TAY2P1::HOWARD (Whoever it takes) Tue Apr 15 1997 17:39
    Thanks for the input.  I will start with the number of DCPS processes
    and look at QUANTUM after that.  PSPA was very happy over the weekend
    after I CONVERTed NSCHED$:VSS.DAT. It went from 4200 to 52 blocks.  But
    Monday's report is pretty much as before.  Not sure if this was
    related, but it probably was a good idea anyway.
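    
    In case anyone wants the recipe, the usual way to compact a bloated
    RMS file is roughly the following (with NSCHED shut down while the
    file is rebuilt):

$ ANALYZE /RMS_FILE /FDL /OUTPUT=VSS.FDL NSCHED$:VSS.DAT
$ CONVERT /FDL=VSS.FDL NSCHED$:VSS.DAT NSCHED$:VSS.DAT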
    
    >How many DCPS queues and what version are they? 
    
    There are 58 DCPS queues running DCPS V1.3.  14 were inactive last week.
    
    Ben

448.8. "DCPS$MAX_STREAMS" by FUNYET::ANDERSON (Exchange *this*) Tue Apr 15 1997 21:28
You can reduce the number of DCPS symbiont processes by changing the value of
the logical name DCPS$MAX_STREAMS as described in DCPS$STARTUP.COM.  This may
help your situation.
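
If you go the logical name route rather than editing the startup file, it is
just something like this (the exact mode and table should match whatever
DCPS$STARTUP.COM itself uses):

$ DEFINE /SYSTEM /EXECUTIVE_MODE DCPS$MAX_STREAMS 4
$ ! then stop and restart the DCPS queues so the symbionts pick it up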

Paul

448.9. "Looks like DCPS$MAX_STREAMS was the solution" by TAY2P1::HOWARD (Whoever it takes) Fri Apr 18 1997 18:34
    I set DCPS$MAX_STREAMS to 4 and PSPA is now reporting no bottleneck. 
    It had been 8 before the problem began, but I had removed the logical
    in an attempt to get PATHWORKS working.  I don't know if it helped that
    problem, since I did several things at the same time to get things
    going again.  
    
    I appreciate the help, because I did not see the relationship between
    that and the problems. 
    
    Ben