
Conference orarep::nomahs::dectrace_v20

Title:	DECtrace V2.0 and All-in-1 Perf Rpts conf.
Notice:	Kits+Doc, 2 | Patches, 3
Moderator:	OMYGOD::LAVASH

Created:	Mon Apr 26 1993
Last Modified:	Mon Jun 02 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	467
Total number of notes:	2058

456.0. "Collect monitor crashes cluster..." by M5::BLITTIN () Thu Mar 27 1997 17:44

    
    Ct running Trace 2.2 on vax/vms 6.1. Two node cluster. Raid array.
    Rdb 6.1. PC runs 'excursions' emulator.
    
    When the ct does a 'collect monitor' against a db data file, after
    about 5-10 minutes the swapper starts consuming about 30% cpu time
    and eventually locks users out.  The ct's terminal hangs.  The first
    time this lasted about 10-15 min, then apparently cleared up.  The
    ct then retried the same process with the same results, but this
    time it crashed the cluster. 
    
    No dump file that he could find.  I read an earlier note that 
    suggested using the /interval qualifier.  
    
    Things to look at...?
    
    Thank You

T.R	Title	User	Personal Name	Date	Lines
456.1		DUCATI::LASTOVICA	Is it possible to be totally partial?	`Thu Mar 27 1997 18:22`	4
	> time it crashed the cluster.

	I'd suggest calling Digital for some VMS analysis of why the cluster crashed. I'd hate to think that it was collect.
456.2	Active vs Static monitoring?	M5::BLITTIN		`Fri Mar 28 1997 13:29`	6
	re: .1 Ct will contact DEC. In the meantime, ct reran the monitor against a static file and everything seemed to run ok. Since the problem occurred while the collection was active, does the monitor have any problem identifying the end of the active collection if/when it hits it?
456.3	end of file information in a lock value block	OMYGOD::LAVASH	Same as it ever was...	`Fri Mar 28 1997 14:37`	29
	If you are monitoring a collection in progress you should really be using the /interval qualifier. If not, you are looking at all kinds of bogus data.

	We have a 32K default cache that gets flushed when full. If you don't use a flush interval you can get "old" data on the flush, which makes looking at it in real time pointless. The flush interval keeps data flushed to disk at a regular interval, which keeps it all consistent for the monitor. For static data we can pre-sort the records in the file and pick them off as needed.

	Monitor is actually 2 processes. One, the data channel, tries to stay at the end of the .dat file, reading records in as fast as possible and updating global sections that the monitor process reads from. If the data channel hits end of file, it checks the lock and lock value block for the file to see if any new data has come in. Actually, it may issue a blocking AST to be automatically notified when the file contents have changed. Can't remember exactly, it's been about 5 years...

	Anyway, they should use interval if they are doing on-line monitoring. If that makes their problem go away then I'd say ignore the other problem.

	George
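To illustrate why a cache that flushes only when full can hand a live monitor stale data, here is a minimal Python sketch of a buffer that flushes either when full or when a flush interval has elapsed. It is not DECtrace code: the class name `FlushingBuffer`, its methods, and the injectable `clock` are all hypothetical, and only the 32K default size is taken from the note above.

```python
import time


class FlushingBuffer:
    """Hypothetical model of a write cache; not the actual DECtrace cache."""

    def __init__(self, capacity=32 * 1024, flush_interval=None,
                 clock=time.monotonic):
        self.capacity = capacity              # bytes held before a forced flush
        self.flush_interval = flush_interval  # seconds; None = flush only when full
        self.clock = clock                    # injectable so the timing can be tested
        self.pending = []                     # records still sitting in the cache
        self.pending_bytes = 0
        self.flushed = []                     # stands in for records on disk
        self.last_flush = clock()

    def write(self, record):
        # Flush first if this record would overflow the cache.
        if self.pending_bytes + len(record) > self.capacity:
            self.flush()
        self.pending.append(record)
        self.pending_bytes += len(record)

    def tick(self):
        # Call periodically; flushes once the interval has elapsed.
        if (self.flush_interval is not None
                and self.clock() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        self.flushed.extend(self.pending)
        self.pending.clear()
        self.pending_bytes = 0
        self.last_flush = self.clock()
```

With `flush_interval=None`, a record written on a quiet system stays in `pending` until enough later records arrive to fill the cache, so a reader of the "disk" side sees it arbitrarily late. With an interval set, the same record becomes visible within roughly one interval, at the cost of more frequent writes.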
456.4	/flush=00:00:02	M5::BLITTIN		`Fri Mar 28 1997 14:57`	6
	They are using the /flush set to 00:00:02. I'm having him contact DEC to evaluate the crash dump... Thank you for the reply...
456.5	couple things to try	OMYGOD::LAVASH	Same as it ever was...	`Fri Mar 28 1997 16:38`	10
	Then again, if it's a heavily loaded system and they are using the 2 second interval, perhaps all the concentrated writing is causing the problems... Have them change the flush interval to 5, and bump the monitoring interval to 5 or 10... See if that helps. Or possibly they may need to tune some process/system parameters...

	George