
Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

448.0. "CPU bottleneck with plenty of CPU available" by TAY2P1::HOWARD (Whoever it takes) Thu Apr 10 1997 18:06

    I have a VAX 4000-300 running VMS V6.1.  The main purpose of this
    system is to process print jobs from DQS.  I installed PATHWORKS V5 a
    few months ago, and ran AUTOGEN.  It is possible that some SYSGEN
    parameters were not written to MODPARAMS.DAT.  The problem is that the
    system hesitates.  A MONITOR PROCESS/TOPCPU reveals no processes using
    a lot of CPU time.  In fact the system has never been overloaded; there
    are no interactive users, and PATHWORKS is only used for print serving;
    there are no file shares.  The PSPA report consistently has a note like
    this:
    
                                                                    {C0020}
              There is an apparent bottleneck at the CPU due  to  the
              large  number  of  COM/COMO processes.  There is also a
              process consuming at least 40 percent of the CPU time.
    
              Examine the process which  is  consuming  the  CPU  for
              faulty design, mismanaged priorities, or other possible
              reasons.
    
    Further, there are lists of processes, such as this:

 # Proc       Process receiving most CPU         COM Process
in COM   -----------------------------------  ----------------      Time of
or COMO     USERNAME     IMAGE     %CPU PRIB   USERNAME   PRIB     occurrence
------   ------------ ------------ ---- ---- ------------ ----  ----------------
     9   SYSTEM       NSCHED         20    6 SYSTEM          4   09-APR 08:00:00
    10   SYSTEM       NSCHED         21    6 SYSTEM          4   09-APR 08:02:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 08:04:00
     8   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 08:22:00
     8   SYSTEM       NSCHED          9    6 SYSTEM          4   09-APR 08:24:00
     8   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 08:42:00
     9   SYSTEM       NSCHED          9    6 SYSTEM          4   09-APR 08:44:00
     8   SYSTEM       NSCHED         21    6 SYSTEM          4   09-APR 09:02:00
     7   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 09:32:00
    10   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 09:42:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 09:44:00
    14   SYSTEM       NSCHED         18    6 SYSTEM          4   09-APR 10:02:00
    13   SYSTEM       NSCHED         12    6 SYSTEM          4   09-APR 10:04:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:22:00
     9   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:24:00
     7   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 10:42:00
    16   SYSTEM       NSCHED         20    6 SYSTEM          4   09-APR 11:02:00
    15   SYSTEM       NSCHED         11    6 SYSTEM          4   09-APR 11:04:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:22:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:24:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:32:00
     8   SYSTEM       NSCHED         10    6 SYSTEM          4   09-APR 11:42:00

    
    The obvious conclusion would be that the Scheduler is going crazy,
    but NSCHED is not always the culprit.  I reinstalled it, to no
    avail.  Sometimes it is PATHWORKS processes.  I reduced the
    priority of some detached processes, but others take their place.
    Even 40% utilization by one process should not hurt the system,
    since even during peak times with these processes running there
    is no more than about 60% utilization.  I have other systems
    running similar configurations, and nothing like this happens on
    them.  Is there more that I can look at?  I have lots of PSPA
    reports which I could make available.  I have rebooted the system
    since it started doing this, and it continues to behave the same
    way.  I'm not sure where to turn.
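    
    One thing I can check on my own is whether the PATHWORKS-related
    parameters actually made it into MODPARAMS.DAT and into the active
    system.  Something along these lines (the parameter names here are
    only examples of things PATHWORKS typically touches, not a
    definitive list):

$ SEARCH SYS$SYSTEM:MODPARAMS.DAT "NPAGEDYN","GBLPAGES","GBLSECTIONS"
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE          ! look at the running values
SYSGEN> SHOW NPAGEDYN
SYSGEN> SHOW GBLPAGES
SYSGEN> EXIT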
    
    Ben

448.1. by UTRTSC::utoras-198-48-146.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Fri Apr 11 1997 02:10
If you have DECps data then start looking at it.  It contains a lot of info
on the system's state which should point you in the right direction.  It may
be possible to sample at a finer granularity than the default 2-minute DECps
interval, which should give you the info you need.
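
From memory (check HELP ADVISE for the exact qualifiers on your DECps
version), something along these lines pulls an analysis report for just the
suspect window:

$ ADVISE PERFORMANCE REPORT ANALYSIS -
        /BEGINNING="09-APR-1997 08:00" /ENDING="09-APR-1997 12:00" -
        /OUTPUT=ANALYSIS.RPT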

Jur.

448.2. by BSS::JILSON (WFH in the Chemung River Valley) Fri Apr 11 1997 10:08
You cannot change the main collection (CPD) interval.  It is fixed at 2 
min.  You can create an alternate collection at a smaller interval.

I would look at just the specific evidence time periods and see whether the
CPU is overworked at just those times.  This conclusion is saying that you
have too many COM processes that are above their default priority AND the
process consuming the most CPU time is using > 40% AND the top process is a
high-priority process.  It would appear you have time periods where the CPU
cannot keep up with the demand, or the top-priority process is going
compute bound.  Might be time to add another CPU, if possible.
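
If you want to watch one of those suspect periods directly rather than wait
for the next report, plain MONITOR will do; for instance (the 5-second
interval and file name are just suggestions):

$ MONITOR PROCESSES /TOPCPU /INTERVAL=5
$ MONITOR STATES,MODES /INTERVAL=5
$ ! or record a morning's worth of data for later playback
$ MONITOR STATES,MODES,PROCESSES /INTERVAL=5 /NODISPLAY /RECORD=MON0409.DAT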

Jilly

448.3. "Timers falling into sync?" by EPS::VANDENHEUVEL (Hein) Fri Apr 11 1997 11:26
    
> The problem is that the system hesitates. 
    :
> there are no interactive users, 
    
    Can you refine that problem description, please?
    Who notices the hesitations, since there are no interactive
    users to give feedback on, say, echo times during an edit session :-)
    Should we think about sub-second hiccups while editing?
    Monitor screens freezing for several seconds?
    No printers going for several minutes?
    
    Perhaps you 'simply' have a clock-sync resonance problem?
    
    Are there some timers that are prone to go off (milli)seconds apart
    on a set time interval?  For the sake of argument, let's have
    10 schedulers of sorts waking themselves up every 10 seconds and
    each requiring 0.5 seconds of CPU.  If they manage to do so
    perfectly spread out, you'll see no COM queue and the CPU 50% busy.
    If the timers go off at the same time, you'll see a COM queue of 9
    at that moment.  Now QUANTUM starts to play an important role.
    If each of those tasks is allowed to go from start to finish
    without being pre-empted, then every 0.5 seconds one will be done
    and there will be an average COM queue of about 2.5 (half of the
    10 processes waiting for 5 seconds out of every 10 seconds), and
    still an average CPU busy of 50%!
    If QUANTUM is low, then the CPU will be handed around all the time
    and all the tasks will finish within a quantum of each other, near
    the end of those 5 seconds.  There would be a COM queue of 9 for
    5 seconds out of every 10 and 0 for the other 5, giving an average
    COM queue of 4-ish.  Yuck.
    
    In the latter case, even if the start-time spread is, say, 2 seconds
    wide around a central time, there will still be a clump of activity
    where you are likely to hit the COM queue, and once you hit the
    queue, you'll be part of that queue and compound the problem.  It
    only takes a little external sync point, like a disk volume
    allocation lock, for them to start syncing more and more and for
    the queue to spike more and more.
    
    You might need a bigger QUANTUM to allow processes to start their
    jobs and finish them.  If you can tweak the timers, you may try to
    set them not to coincide all the time (7, 11, and 13 seconds
    instead of 3 times 10).  You might find an external sync event
    between seemingly unrelated processes that you can spread out, for
    example a log file looking for space on a disk all the time,
    fighting with a process creating files.  Solutions: spread
    activities over disks and directories, and make files extend by
    serious chunks instead of the silly 5 blocks found too often; see
    the sketch below.
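    
    For the extend part, something like this raises the RMS default
    system-wide (1000 blocks is an arbitrary example, pick what suits
    the files involved):

$ SET RMS_DEFAULT /SYSTEM /EXTEND_QUANTITY=1000
$ ! or make it permanent via the RMS_EXTEND_SIZE parameter in MODPARAMS.DAT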
    
    Get a second CPU!  That should flatten out those peaks tremendously.
    
    fwiw,
    	Hein.
    
    
    
448.4. "Too many scheduled jobs?" by TAY2P1::HOWARD (Whoever it takes) Fri Apr 11 1997 18:39
    The only interactive users are a few people logging into SYSTEM to
    monitor queues and check on the system.  I see the hesitation, but
    people printing see jobs that seem to take forever to print.  People
    printing the same job to the same printer via an NT server see it
    come out in about a quarter of the time.  Usually, I find that DECps
    gives very concrete suggestions, such as "increase NPAGEDYN" or
    "reduce file fragmentation".  The initial report said to run
    LIBDECOMP, which I did.  It also suggested rewriting the
    applications, which are all standard Digital or former Digital
    products.  I also installed DCPS$SMB, since that is used a great
    deal.  File fragmentation is good to excellent on all drives.
    
    I like the idea of adding a CPU, but that isn't going to happen unless
    I can find an idle asset somewhere.  
    
    The current report gives massive lists of image activations, e.g.,

            # of   Page Faults  Avg.  % of     % of           Uptime/   Cputim/
             activ-  per Actvtn   Ws   Direct Buffered % of     image     image
  Image      ations -Soft--Hard  size   I/O     I/O   Cputim    (sec)     (sec)
  --------  ------- ------ ---- ------ ------  ------ ------  -------  --------
. . .
DQS$CLIENT       4     325   22    374   0.08    0.05   0.01       11      0.56



    Would INSTALLing this image be likely to make a difference?  325 page
    faults per activation does not seem like many over a 24-hour period.
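    
    If I do try it, I assume the command would be something along these
    lines (I'm guessing at the image's home directory):

$ INSTALL ADD SYS$SYSTEM:DQS$CLIENT.EXE /OPEN /HEADER_RESIDENT /SHARED
    
    and I'd add it to the startup procedure so it survives a reboot.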
    
    There are a lot of Scheduler jobs.  Some of them run periodically
    through the day.  I'm not sure why they would cause undue strain,
    since they mostly look for stopped queues or error messages.  They
    are mostly DCL and mostly running at normal priority.  Is it just
    that the Scheduler is waking up to check them?
    
    .3 suggests increasing QUANTUM.  Is that what you mean?  I will
    review these jobs to try to keep them from running into each other.
    
    Ben

448.5. by ZIMBRA::BERNARDO (Dave Bernardo, VMS Engineering) Fri Apr 11 1997 19:22
    I would be tempted to reduce QUANTUM before I'd increase it...
    If you do, reduce AWSTIME as well.
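    
    If you experiment, the usual route is MODPARAMS.DAT plus AUTOGEN;
    the values below are placeholders, not recommendations:

    ! add to SYS$SYSTEM:MODPARAMS.DAT
    QUANTUM = 10     ! placeholder; the VAX default is 20 (units of 10 ms)
    AWSTIME = 10     ! placeholder; keep it roughly in step with QUANTUM

$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT FEEDBACK
    
    I believe both are dynamic parameters, so for a quick trial you can
    also poke the active values in SYSGEN (USE ACTIVE, SET, WRITE ACTIVE).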
    
    d.

448.6. "How many DCPS queues? Version of DCPS?" by KEIKI::WHITE (MIN(2¢,FWIW)) Sat Apr 12 1997 20:40
    
    	There is an issue with DCPS where all the symbionts wake up
    every tenth of a second whether the queues are active or not.
    
    	This is a known problem that occurs because DCPS uses DECthreads
    in its operation.  DECthreads uses a timer AST which expires every
    tenth of a second and then resets itself.  This causes the process
    to become computable (COM) to handle the AST.
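    
    	You can see the effect on an otherwise idle system just by
    watching one of the symbiont processes (they show up in SHOW SYSTEM
    with names like SYMBIONT_nnnn, if I remember right); the CPU time
    and buffered I/O keep creeping even with every queue quiet:

$ SHOW SYSTEM                            ! note a symbiont's PID
$ SHOW PROCESS /CONTINUOUS /ID=xxxxxxxx  ! substitute that PID here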
    

    	How many DCPS queues are there, and what version are they?  This
    would probably throw DECps into fits.
    
    						Bill
    
    PS - Comet V4.3 search criteria: "dcps decthreads cpu time"
    were the four words used.
    
    
448.7. "Running 1 symbiont process per queue" by TAY2P1::HOWARD (Whoever it takes) Tue Apr 15 1997 17:39
    Thanks for the input.  I will start with the number of DCPS processes
    and look at QUANTUM after that.  PSPA was very happy over the weekend
    after I CONVERTed NSCHED$:VSS.DAT. It went from 4200 to 52 blocks.  But
    Monday's report is pretty much as before.  Not sure if this was
    related, but it probably was a good idea anyway.
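    
    In case anyone wants the recipe, the usual way to compact a bloated
    RMS file is roughly the following (with NSCHED shut down while the
    file is rebuilt):

$ ANALYZE /RMS_FILE /FDL /OUTPUT=VSS.FDL NSCHED$:VSS.DAT
$ CONVERT /FDL=VSS.FDL NSCHED$:VSS.DAT NSCHED$:VSS.DAT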
    
    >How many DCPS queues and what version are they? 
    
    There are 58 DCPS queues running DCPS V1.3.  14 were inactive last week.
    
    Ben

448.8. "DCPS$MAX_STREAMS" by FUNYET::ANDERSON (Exchange *this*) Tue Apr 15 1997 21:28
You can reduce the number of DCPS symbiont processes by changing the value of
the logical name DCPS$MAX_STREAMS as described in DCPS$STARTUP.COM.  This may
help your situation.
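
If you go the logical name route rather than editing the startup file, it is
just something like this (the exact mode and table should match whatever
DCPS$STARTUP.COM itself uses):

$ DEFINE /SYSTEM /EXECUTIVE_MODE DCPS$MAX_STREAMS 4
$ ! then stop and restart the DCPS queues so the symbionts pick it up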

Paul

448.9. "Looks like DCPS$MAX_STREAMS was the solution" by TAY2P1::HOWARD (Whoever it takes) Fri Apr 18 1997 18:34
    I set DCPS$MAX_STREAMS to 4 and PSPA is now reporting no bottleneck. 
    It had been 8 before the problem began, but I had removed the logical
    in an attempt to get PATHWORKS working.  I don't know if it helped that
    problem, since I did several things at the same time to get things
    going again.  
    
    I appreciate the help, because I did not see the relationship between
    that and the problems. 
    
    Ben