[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference clt::cma

Title:	DECthreads Conference

Moderator:	PTHRED::MARYSTEON

Created:	Mon May 14 1990
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1553
Total number of notes:	9541

1491.0. "Performance drop using 1003.4a- V3.2C to v4.0a ?" by MUFFIT::gerry (Gerry Reilly) Fri Feb 21 1997 11:49

I am testing the new release of our product and I am seeing a significant
performance degradation from V3.2C to V4.0a.  Investigation appears
to indicate that this is due to mutex performance degradation between
the releases.

The product is using the 1003.4a compatability.  Therefore, we were
not obviously expecting to gain the performance improvements available
by moving to 1003.1c.  However, the test code shows approximately
25% drop in performance- this we were also not expecting!

My questions-

1. Is this drop in performance inline with expectation from those 
   who understand the internal of threads and the compatability
   support ?

2. Any good suggestion on getting round it.  A quick migration of everything
   to the 1003.1c interface is not practical because of the reliance
   on code outsider of our immediate control.  However, we could modify
   the mutex handling (can we bypass the compatability routines for
   just this one area ?).

Any help will be really appreciated.

-gerry

1003.4 Test Code
----------------
/***************************************************************************
 testthread.c
 first threaded program .
 To test the mutex locking etc .
 Spawns TOTAL_WORKERS threads, and each threads sits in a loop until the
 endflag, and the loop consists of mutex lock, count++ , mutex unlock.
 At the end of the time (currently 60 secs) the count is printed out.
 The compiler options are taken from the encina example used as the
 basis for the raw sfs tpca benchmark test.
***************************************************************************/
#include     <stdio.h>
#include     <pthread.h>
#include     <errno.h>

#define   TOTAL_WORKERS   50

/***************************************************************************
 GV's
***************************************************************************/



pthread_mutex_t    countmutex ;
pthread_mutex_t    workercountmutex  ;
pthread_mutexattr_t mutex_attr;

long     count      = 0 ;        /* the count used a measure  ,
                                    protected by countmutex */
int      workercount = 0 ;       /* number of current worker threads
                                    protected by workercountmutex */

int      endflag = 0 ;          /* used by main thread to terminate workers */
/***************************************************************************
  start of code
***************************************************************************/
                                                /*************************
                                                  getsleeptime
                                                 *************************/
int    getsleeptime( void ) {
       return 60 ;
}
                                                /*************************
                                                  checkend
                                                 *************************/
int   checkend( void ) {
      if( ! endflag)  return 0 ;
      return  1;
}
                                                /*************************
                                                  updatecount
                                                 *************************/
void  updatecount( void ) {
       pthread_mutex_lock( &countmutex ) ;
       count++ ;
       pthread_mutex_unlock( &countmutex ) ;
}
                                                /*************************
                                                  incworkercount
                                                 *************************/
void  incworkercount( void ) {
       pthread_mutex_lock( &workercountmutex ) ;
       workercount++ ;
       pthread_mutex_unlock(& workercountmutex ) ;
}
                                                /*************************
                                                  decworkercount
                                                 *************************/
void  decworkercount( void ) {
       pthread_mutex_lock( & workercountmutex ) ;
       workercount-- ;
       pthread_mutex_unlock( & workercountmutex ) ;
}
                                                /*************************
                                                  workerthread
                                                 *************************/
void  *workerthread( void * data) {
       incworkercount() ;
       while( !checkend() ) {
          updatecount() ;
       }
       decworkercount() ;
       return NULL ;
}
                                                /*************************
                                                 main
                                                 *************************/
int  main( int argc , char **argvp , char **envpp) {
       pthread_t    *workerthreadp ;
       int           i , rc ;

       count = 0 ;
       workercount = 0 ;
       errno = 0;
       pthread_mutexattr_setkind_np(&mutex_attr, MUTEX_FAST_NP);
       pthread_mutex_init( & workercountmutex , mutex_attr );
       pthread_mutex_init( & countmutex , mutex_attr );
       for( i=0 ; i< TOTAL_WORKERS ; i++ ) {
              workerthreadp = (pthread_t*)malloc(sizeof(pthread_t));
              rc = pthread_create( workerthreadp , pthread_attr_default ,
                                       workerthread , NULL ) ;
              pthread_detach( workerthreadp );
              free( workerthreadp );  /* !!?? (from book)  */
       }
       sleep( getsleeptime() ) ;
       endflag = 1 ;
       while(  workercount != 0 )  ;
       printf("totalcount=(%li) workercount(%i) \n",count , workercount );
       exit(0);
}

T.R	Title	User	Personal Name	Date	Lines
1491.1	It's unlikely to be "just mutexes"...	WTFN::SCALES	Despair is appropriate and inevitable.	`Fri Feb 21 1997 14:41`	34
	.0> I am seeing a significant performance degradation from V3.2C to V4.0a. How much? 25%? .0> Investigation appears to indicate that this is due to mutex performance .0> degradation between the releases. It seems unlikely that the answer is so simple. That is, unless your application does _nothing_ but lock and unlock mutexes, a change in mutex performance alone could not possibly have such a grave effect. .0> However, the test code shows approximately 25% drop in performance Um, what do you mean by "performance"? As measured in "mutex lock/unlocks per second"? Could you post your compile command lines, as well as sample runs for your test? Also, could you tell us some basic configuration information about the two test machines, such as hardware type, number of CPUs, and OS rev (i.e., do you have all the pertinent patches?), as well as an indication of the system load present when you ran the test? How much do the results of the test vary if you run it several times? .0> 1. Is this drop in performance inline with expectation from those .0> who understand the internal of threads and the compatability .0> support ? No...25% is a little much to ask.... .0> 2. Any good suggestion on getting round it. I'd recommend finding the source of the performance sink and fixing it. (I.e., I doubt it has all that much to do with mutexes -- what else have you looked at?) Webb
1491.2	but the mutexes hurt	MUFFIT::gerry	Gerry Reilly	`Mon Feb 24 1997 08:55`	154
	Webb, Thanks for the quick reply. We have run a lot of tests because our initial concern was around our the product. However, though gradually reducing the number of components involved, plus profiling, plus codepath analysis, we felt that the impact is coming from- a. direct use of the pthread calls and in particular mutex activity b. indirect use via the DCE The application does not just do mutex locks/unlocks, however it does do an awful lot of them. Several of our processes have > 100 threads and need to use mutex locks extensively. Overall this looks to result in a 20% performance drop in the application. The sample code I posted in .0 does not show a 25% drop, it is actually much worse. V4.0 performance for this test code is about 25% of V3.2c performance (performance = lock/unlock calls per second). Compilation ----------- cc -I. -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__ -c testthread.c cc -g -o testthread testthread.o -threads -lc -lm -laio Average runs using two identically configured 3000 M500 with 320MB of memory give- V3.2C 18 million lock/unlocks in 60s V4.0 4.5 million lock/unlocks in 60s If I then rewrite the code to use the 1003.1c interface (see below), I get back all the V3.2c performance (and gain a bit more). What I am looking for is any advice on how to minimise the impact of this performance change, if possible. -gerry 1003.1c code ------------ /************************************************************************* testthread.c first threaded program . To test the mutex locking etc . Spawns TOTAL_WORKERS threads, and each threads sits in a loop until the endflag, and the loop consists of mutex lock, count++ , mutex unlock. At the end of the time (currently 60 secs) the count is printed out. The compiler options are taken from the encina example used as the basis for the raw sfs tpca benchmark test. ***********************************************************************/ #include <stdio.h> #include <pthread.h> #include <errno.h> #define TOTAL_WORKERS 50 /*********************************************************************** GV's ************************************************************************/ pthread_mutex_t countmutex ; pthread_mutex_t workercountmutex ; pthread_mutexattr_t mutex_attr; long count = 0 ; / the count used a measure , protected by countmutex / int workercount = 0 ; / number of current worker threads protected by workercountmutex / int endflag = 0 ; / used by main thread to terminate workers / /************************************************************************ start of code ***********************************************************************/ /********************* getsleeptime *********************/ int getsleeptime( void ) { return 60 ; } /********************* checkend *********************/ int checkend( void ) { if( ! endflag) return 0 ; return 1; } void updatecount( void ) { pthread_mutex_lock( &countmutex ) ; count++ ; pthread_mutex_unlock( &countmutex ) ; } /********************* incworkercount *********************/ void incworkercount( void ) { pthread_mutex_lock( &workercountmutex ) ; workercount++ ; pthread_mutex_unlock(& workercountmutex ) ; } /********************* decworkercount *********************/ void decworkercount( void ) { pthread_mutex_lock( & workercountmutex ) ; workercount-- ; pthread_mutex_unlock( & workercountmutex ) ; } /********************* workerthread **********************/ void workerthread( void * data) { incworkercount() ; while( !checkend() ) { updatecount() ; } decworkercount() ; return NULL ; } /*********************** main *********************/ int main( int argc , char argvp , char *envpp) { pthread_t workerthreadp ; int i , rc ; count = 0 ; workercount = 0 ; errno = 0; pthread_mutexattr_settype_np(&mutex_attr, PTHREAD_MUTEX_NORMAL_NP); pthread_mutex_init( & workercountmutex , &mutex_attr ); pthread_mutex_init( & countmutex , &mutex_attr ); for( i=0 ; i< TOTAL_WORKERS ; i++ ) { workerthreadp = (pthread_t)malloc(sizeof(pthread_t)); rc = pthread_create( workerthreadp , NULL , workerthread , NULL ) ; pthread_detach( workerthreadp ); free( workerthreadp ); /* !!?? (from book) */ } sleep( getsleeptime() ) ; endflag = 1 ; while( workercount != 0 ) ; printf("totalcount=(%li) workercount(%i) \n",count , workercount ); exit(0); }
1491.3		COL01::LINNARTZ		`Mon Feb 24 1997 09:39`	21
	Gerry, it's only indirect related to your question, but currently you're using update (function call time) mu lock increment counter mu unlock on int/longs which are handled atomic on alpha. I don't say just use counter++/-- but I would use the inlining example used in the Digital technical journal. (http://www.europe.digital.com/info/DTJN05/DTJN05HM.HTM) It's easy to enhance to SMP_AINCR/SMP_ADECR and in my view it's quite cheaper and should be also reliable on a SMP machine (discussion is welcome). Of course. if you update bigger codeparts, the mutex lock/unlock is the way to go. Pit
1491.4		MUFFIT::gerry	Gerry Reilly	`Mon Feb 24 1997 10:52`	13
	RE: -.1 Thanks for the suggestion, but.. Unfortunately, we don't have the opportunity to go significantly modify the code too much. Most of it (including the piece doing the vast majority of the mutex work) is from a external party. We could get some changes made but it needs to not distrupt the code base too much as the code is built on several platforms. However all thoughts are as always appreciated. -gerry
1491.5	So, there would seem to be a problem in the .4a support?	WTFN::SCALES	Despair is appropriate and inevitable.	`Mon Feb 24 1997 11:09`	27
	Re .3: Pit, Gerry is pursuing what he thinks is a problem in mutexes; thus, providing a way to remove the mutexes from his test code is not helpful. The increments are intended to track how many mutex operations have occurred -- that is, they are a mechanism and not the purpose itself of the code -- replacing them with atomic operations would allow us to remove the mutex lock/unlocks. .2> If I then rewrite the code to use the 1003.1c interface (see below), I get .2> back all the V3.2c performance (and gain a bit more). That's a very interesting factiod. We'll have to look at that. We rewrote the "legacy" support after V4.0, but I don't know what the exact release was. (Is there some reason why you are on V4.0a instead of V4.0b?) Does the behavior of your test program change with the number of threads? My expectation is that it's basically one thread which is doing all of the mutex locking, and the other threads are just "in the way".... .2> cc -I. -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__ -c testthread.c Gerry, are you aware that you can (and should) use the -threads or -pthread switch on the compile line? It will provide the -D_REENTRANT and -DPTHREAD_USE_D4 (as appropriate) for you. Webb
1491.6		COL01::LINNARTZ		`Mon Feb 24 1997 11:29`	12
	.-1. Yes sure, that's why I said indirect. But nevertheless I wanted to mention it as I've seen the increment counter wrapped by mutex_lock/unlock in a couple of designs, and I'm suggesting the atomic increment always, as it saves library switching, a couple of function calls and at least one MB. If those blocks are heavily used, I've seen performance gains in the area of about 5 percent, which made me happy and due to this I wanted to share it. (in general, one could reduce it even further, but I'm always scared of cache implementation in SMP systems). I think you don't object regarding your library using this approach. Pit
1491.7	Rework not in V4.0b either	PTHRED::MARYS	Mary Sullivan, OpenVMS Development	`Mon Feb 24 1997 13:03`	11
	Webb, > That's a very interesting factiod. We'll have to look at that. We rewrote > the "legacy" support after V4.0, but I don't know what the exact release was. > (Is there some reason why you are on V4.0a instead of V4.0b?) The post-V4 rework of the "legacy" interfaces will be introduced in the next release of Digital UNIX. It might be interesting to compare the test programs on a V4.0b baselevel and a PTmin baselevel to see if that helps.. -Mary S.
1491.8	Beware of subtle effects of using atomic operations	WTFN::SCALES	Despair is appropriate and inevitable.	`Mon Feb 24 1997 13:35`	18
	.6> I think you don't object regarding your library using this approach. I'm afraid that I cannot make a simple "yes" or "no" reply to that. While it's true that, in the abstract, I have no objection to people using hardware operations to synchronize between threads, the problem is that attempting to do so can often introduce other problems. For instance, if the target of the increment in Gerry's program had been a condition variable predicate, then removing the mutex lock/unlock would have introduced a bug, since it's the interplay of the mutex and condition variable which prevent the wake-up/waiter race in the condition variable wait. So, I generally recommend that people not use atomic operations to synchronize between threads. When people try to to cut corners, they often cut off something which later they discover that they needed. Webb
1491.9		COL01::LINNARTZ		`Mon Feb 24 1997 14:05`	6
	thanks much for the reminder, Webb! even I've ifdef'd them by USE_FAST_CNTR, I'll double check that I don't hit a scenario you've pointed at. Pit
1491.10	Results with differing no of threads	MUFFIT::gerry	Gerry Reilly	`Tue Feb 25 1997 07:30`	58
	.5> Does the behavior of your test program change with the number of .5> threads? My expectation is that it's basically one thread which is .5> doing all of the mutex locking, and the other threads are just "in the .5> way".... The mutexes are being hit by all threads. The updatecount() routine in the testcase is being run from each thread. I have rerun the test using differing numbers of threads and using both interfaces. The tests were run on an AlphaStation 255/300 with 300MB memory, running Digital UNIX V4.0b. +---------------+-----------------------------------+ \| \| Lock/unlocks in 60 secounds \| \| No of threads \| 1003.1c i/f \| 1003.4a D4 i/f \| +---------------+-----------------+-----------------+ \| 1 \| 22.7M * \| 20.4M \| \| 2 \| 30.2M \| 11.7M \| \| 5 \| 29.1M \| 7.1M \| \| 50 \| 17.4M \| 6.7M \| \| 500 \| 16.5M \| 4.5M \| +---------------+-----------------+-----------------+ * Interesting result but I guess I am not really worried about single-threaded lock performance. .2> cc -I. -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__ -c testthread.c .5> Gerry, are you aware that you can (and should) use the -threads or -pthread .5> switch on the compile line? It will provide the -D_REENTRANT and .5> -DPTHREAD_USE_D4 (as appropriate) for you. The real application build uses the proper flag, it's only the testcase makefile that explicitly set the defines. However, thanks anyway. .8> So, I generally recommend that people not use atomic operations to .8> synchronize between threads. When people try to to cut corners, they .8> often cut off something which later they discover that they needed. In the real code the mutexes are being used for many things, including control around the predicates for condition variables, so even if we rewrote the code some of the control would need to be through mutexes. .7> The post-V4 rework of the "legacy" interfaces will be introduced in the .7> next release of Digital UNIX. It might be interesting to compare the .7> test programs on a V4.0b baselevel and a PTmin baselevel to see if that .7> helps.. I have tested on both V4.0a and V4.0b, there was little or no difference in performance. However, I would be very interested in getting test results from the new "legacy" code if someone has access to a system running a PTmin baselevel. Alternatively, if the new "legacy" code is just replacement libraries I would happy to test myself. Thanks. -gerry
1491.11	One-thread shouldn't be slower than multiple-threads, here	WTFN::SCALES	Despair is appropriate and inevitable.	`Tue Feb 25 1997 10:50`	27
	.10> The updatecount() routine in the testcase is being run from each thread. Certainly: running the test for as long as 60 seconds virtually assures that each thread will reach the updatecount() routine. However, simply reaching the routine does not imply that a thread gets to lock the mutex (more than once). Yes, the mutex is being hit by all threads, but I still assert that it is basically one thread (or a few) which is doing all the locking. .10> +---------------+-----------------------------------+ .10> \| \| Lock/unlocks in 60 secounds \| .10> \| No of threads \| 1003.1c i/f \| 1003.4a D4 i/f \| .10> +---------------+-----------------+-----------------+ .10> \| 1 \| 22.7M * \| 20.4M \| .10> \| 2 \| 30.2M \| 11.7M \| .10> \| 5 \| 29.1M \| 7.1M \| .10> \| 50 \| 17.4M \| 6.7M \| .10> \| 500 \| 16.5M \| 4.5M \| .10> +---------------+-----------------+-----------------+ .10> .10> * Interesting result but I guess I am not really worried about single-threaded .10> lock performance. Actually, having the one-thread result poorer than the two-thread result is shocking... Webb
1491.12	Testcase is pretty 'fair' when less than 50 threads	MUFFIT::gerry	Gerry Reilly	`Tue Feb 25 1997 14:55`	44
	.11> Certainly: running the test for as long as 60 seconds virtually .11> assures that each thread will reach the updatecount() routine. .11> However, simply reaching the routine does not imply that a thread gets .11> to lock the mutex (more than once). Yes, the mutex is being hit by all .11> threads, but I still assert that it is basically one thread (or a few) .11> which is doing all the locking. I decided to see how much balance there was between the threads. I therefore modified the test cases to update an array where each thread updated a seperate element in the array. Interestingly, while the number of threads was low (< 50ish) then the spread of updates across the threads was pretty fair. When the number of threads was high (500) then the spread was much less even. The actually charateristics of the spread differs depending on which interface is used. Using the legacy (1003.4) interface the spread is fairly random but with some threads - though not those created first necessarily - get many more locks onto the mutex. Using the 1003.1c interface the spread is biased in the order of thread creation. This is not surprising as the thread creation appears (Disclaimer - I havn't done any timings on pthread_create- this is just perception) much slower (using 1003.1c) when there are very large numbers of threads in the process. Therefore as threads starting hitting the updates as soon as they start running, and the timer isn't started until after the last pthread_create, then the early threads get much longer to run and so to do updates to their count. .11> Actually, having the one-thread result poorer than the two-thread .11> result is shocking... I checked this again, and yes with the new 1003.1c interface I consistently get degraded performance when I go from two threads to one. This is not the case if I go through the legacy library. Two Threads =========== thread 0 count=(14420093) thread 1 count=(14420985) One Thread ========== thread 0 count=(21038465) -gerry
1491.13	Your threads are cheating!!	WTFN::SCALES	Despair is appropriate and inevitable.	`Wed Feb 26 1997 12:51`	17
	.12> the timer isn't started until after the last pthread_create Gerry! This has the possibility of radically skewing the results, I'm afraid. It's critical that the threads not be able to increment the counter until after the timer starts!! I'll code up a modified version of your test and post it here. .12> I checked this again, and yes with the new 1003.1c interface I .12> consistently get degraded performance when I go from two threads to one. I _think_ your above comment explains this: the first (of the two threads) gets to "cheat" and start incrementing the counter before the test starts, so the result looks better than when you have only one thread which doesn't get to cheat. Webb
1491.14	A (hopefully) more reliable test (.1c interface)	WTFN::SCALES	Despair is appropriate and inevitable.	`Wed Feb 26 1997 14:26`	109
	Gerry, try the program below and see if it gives you more consistent results (i.e., with various numbers of threads, but more especially on the various platforms...sorry, I guess you'll have to recast it back to the .4a interface for V3.2....) Thanks, Webb ----------- #include <pthread.h> #include <stdio.h> #include <errno.h> #define TOTAL_WORKERS 50 struct timespec sleeptime = {60, 0}; /* Run test for 60 seconds / pthread_mutex_t mutex; pthread_cond_t condvar; int endflag = FALSE; #define check(_status_, _string_) do { \ int __Status = (_status_); \ if (__Status != 0) fprintf (stderr, \ "%s at line %d failed with %d (%s)", \ _string_, __LINE__, __Status, strerror (__Status)); \ } while (0); void workerthread (void arg) { long count = 0; int quit = FALSE; int status; do { check (pthread_mutex_lock (&mutex), "pthread_mutex_lock"); if (!endflag) count++; else quit = TRUE; check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock"); } while (!quit); return (void )count; } int main (int argc, char argvp, char envpp) { pthread_t workers[TOTAL_WORKERS]; int i, status; void partial_count; long total_count = 0; struct timespec waketime; check (pthread_mutex_init (&mutex, NULL), "pthread_mutex_init"); check (pthread_cond_init (&condvar, NULL), "pthread_cond_init"); / * Lock the mutex now and hold it throughout the * thread-creates to prevent the threads which are * created early from starting to count prematurely. / check (pthread_mutex_lock (&mutex), "pthread_mutex_lock"); for (i = 0; i < TOTAL_WORKERS; i++) check ( pthread_create(&workers[i], NULL, workerthread, (void )i), "pthread_create"); printf( "\nCreated %d threads; starting %d second run.\n\n", TOTAL_WORKERS, sleeptime.tv_sec); /* * Establish the end time for the test. The condition wait will * atomically block the caller and release the mutex, thereby * allowing the threads to start counting. Once the time elapses * the initial thread will reaquire the mutex on wake-up and * stop the threads from counting. */ check (pthread_get_expiration_np (&sleeptime, &waketime), "pthread_get_expiration_np"); while (status = pthread_cond_timedwait (&condvar, &mutex, &waketime) == 0); if (status != ETIMEDOUT) check (status, "pthread_cond_timedwait"); endflag = 1; check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock"); for (i = 0; i < TOTAL_WORKERS; i++) { check (pthread_join (workers[i], &partial_count), "pthread_join"); printf ("Thread #%d: count = %li\n", i, (long)partial_count); total_count += (long)partial_count; } printf("\nTotal count for a %d second run = %li\n", sleeptime.tv_sec, total_count); }
1491.15	Results for modified program - more consistent on V4.0b	MUFFIT::gerry	Gerry Reilly	`Thu Feb 27 1997 16:32`	128
	Webb, Thanks for the new test program, and yes you are most certainly right letting the thread starting hitting the mutexes certainly distorts the results. My mistake. I modified the test program (see below), so that it will compile and run for either the 1003.1c or Draft 4 libraries. Re-running the test then show that through the 1003.1c i/f you get about 10% better performance whilst the number of threads is low. Once the number is high (500) the gain is much higher - approx 40%. This is good news. The relative performance is fine - I expect some penalty from not using the new interface. Unfortunately, the bad new is that the test program still shows a degradation between v3.2c and v4.0b. I'll mail you the details. Thanks for all the help. Gerry --------- #include <pthread.h> #include <stdio.h> #include <errno.h> #define TOTAL_WORKERS 500 struct timespec sleeptime = {60, 0}; /* Run test for 60 seconds / pthread_mutex_t mutex; pthread_cond_t condvar; int endflag = FALSE; #define check(_status_, _string_) do { \ int __Status = (_status_); \ if (__Status != 0) fprintf (stderr, \ "%s at line %d failed with %d (%s)", \ _string_, __LINE__, __Status, strerror (__Status)); \ } while (0); void workerthread (void arg) { long count = 0; int quit = FALSE; int status; do { check (pthread_mutex_lock (&mutex), "pthread_mutex_lock"); if (!endflag) count++; else quit = TRUE; check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock"); } while (!quit); return (void )count; } int main (int argc, char argvp, char envpp) { pthread_t workers[TOTAL_WORKERS]; int i, status; void partial_count; long total_count = 0; struct timespec waketime; #ifdef PTHREAD_USE_D4 check (pthread_mutex_init (&mutex, pthread_mutexattr_default), "pthread_mutex_init"); check (pthread_cond_init (&condvar, pthread_condattr_default), "pthread_cond_init"); #else check (pthread_mutex_init (&mutex, NULL), "pthread_mutex_init"); check (pthread_cond_init (&condvar, NULL), "pthread_cond_init"); #endif / * Lock the mutex now and hold it throughout the * thread-creates to prevent the threads which are * created early from starting to count prematurely. / check (pthread_mutex_lock (&mutex), "pthread_mutex_lock"); for (i = 0; i < TOTAL_WORKERS; i++) #ifdef PTHREAD_USE_D4 check ( pthread_create(&workers[i], pthread_attr_default, workerthread, (void )i), "pthread_create"); #else check ( pthread_create(&workers[i], NULL, workerthread, (void )i), "pthread_create"); #endif printf( "\nCreated %d threads; starting %d second run.\n\n", TOTAL_WORKERS, sleeptime.tv_sec); / * Establish the end time for the test. The condition wait will * atomically block the caller and release the mutex, thereby * allowing the threads to start counting. Once the time elapses * the initial thread will reaquire the mutex on wake-up and * stop the threads from counting. */ check (pthread_get_expiration_np (&sleeptime, &waketime), "pthread_get_expiration_np"); while (status = pthread_cond_timedwait (&condvar, &mutex, &waketime) == 0); if (status != ETIMEDOUT) check (status, "pthread_cond_timedwait"); endflag = 1; check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock"); for (i = 0; i < TOTAL_WORKERS; i++) { check (pthread_join (workers[i], &partial_count), "pthread_join"); printf ("Thread #%d: count = %li\n", i, (long)partial_count); total_count += (long)partial_count; } printf("\nTotal count for a %d second run = %li\n", sleeptime.tv_sec, total_count); }
1491.16	10% is OK...	WTFN::SCALES	Despair is appropriate and inevitable.	`Thu Feb 27 1997 17:50`	48
	.15> I modified the test program (see below), so that it will compile and .15> run for either the 1003.1c or Draft 4 libraries. smile You got pretty close... You need to provide an alternate definition for the check() macro, since the D4 interface returns -1 on error and requires you to look in errno for the error number. And, to be neat, you should to add a call to pthread_detach() after the call to pthread_join() when compiling for D4. But what you've got is probably sufficient to the task. (I don't _think_ there are subtler problems, but I'm sure Dave will point one out... ;-) For those of you listening at home, here are the results Gerry saw comparing the .1c interface to the .4a/D4 one. Unfortunately, Gerry couldn't get two machines in the same class to compare V3.2g to V4.0b, so we'll have to wait for those results. > Digital UNIX V4.0b, AlphaStation 255/300, 300MB > ----------------------------------------------- > d4: cc -o test -threads fair_thread.c > 1003.1c: cc -o test -pthread fair_thread.c > > Threads 1003.1c d4 > (total count in millions) > 1 23.6 21.0 > 2 12.6 11.7 > 5 7.7 7.5 > 50 7.4 6.9 > 500 6.6 4.7 .15> through the 1003.1c i/f you get about 10% better performance That's acceptable. (It's not great, but it's acceptable; I'll be interested to hear what you find on Ptmin.) Gerry made the following observation in his mail to me: > Observing both system with test with 500 threads show markedly different > characteristics. On v3.2g there was very high context switch rate (124K) > and runnable threads reported through vmstat. On V4.0b the context > switch rate was only showing about 270 and few runnable threads. That's good to hear. Your test would tend to generate alot of thread context switches. On V3.2g they are all kernel context switches; on V4.0(b) they are all user-mode context switches, thanks to the new, two-level scheduling. (Now, we need to check on why the user-mode ones seem to be slower....) Webb
1491.17	V3.2c vs V4.0a (1003.1) both on 3000 M500	MUFFIT::gerry	Gerry Reilly	`Fri Feb 28 1997 05:46`	13
	One mistake, late at night..I'll settle for that- anyway who checks error returns anyway -:) As promised, I now have results from physically similar systems. The systems are both 3000 Model 500 with 320MB of memory. Threads V3.2c V4.0a (using 1003.1c to get highest count) ------------------------------------------------- 5 28.4M 6.1M 50 24.9M 5.5M 500 1.1M 4.9M -gerry
1491.18		SMURF::DENHAM	Digital UNIX Kernel	`Fri Feb 28 1997 14:34`	12
	I too am plenty curious about how a completely user-space context switch can be slower than a kernel switch. Sure the kernel code is good, but man, what the hell happened out there? Can we check the firmware/pal code rev on the test machines? It needs for at least 1.45 pal code for ev4, 1.21 for ev5. Doubt this is the cause though. In my first prototype code for 2-level, the thread-to-thread yield times was in the couple of usecs range on a modest ev5 system. Needless the say the overhead has grown from that toy benchmark (and library).... Is our library quantum to quick or something?