
Conference turris::digital_unix

Title:	DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:	Welcome to the Digital UNIX Conference
Moderator:	SMURF::DENHAM

Created:	Thu Mar 16 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	10068
Total number of notes:	35879

10014.0. "30 to 40% performance loss when running on D_UNIXV4.0B compared to DUNIX V3.2G" by TAEC::URAGO () Mon Jun 02 1997 11:03

Hello,

Running performance tests of TeMIP (a network management application) on
D-UNIX V4.0B, we have observed a degradation of performance compared to the
results obtained on D-UNIX V3.2.

For info : TeMIP is a distributed application that makes heavy use of threads.
The main difference between the two versions, DU3.2 and DU4.0, is that the
thread technology has changed : we were using CMA threads on DU3.2 and we now
use POSIX threads on DU4.0.
But we think that this change of technology is not the cause of our problems.

During our investigations we have written several pieces of code that simulate
parts of our application. These pieces of code do not use the thread API but
show very bad results (20% to 30% performance loss) on D_UNIX V4.0 when
compiled with the -pthread cc option.

The tests have been done on a DEC 3000/700 in dual-boot mode, one system
running V4.0B, the other running V3.2G.
 

First Test :
===========

The first example is a two-process (client/server) application that exchanges
data over an AF_UNIX socket. Each message written by the client is read by the
server, and an acknowledgement is sent back to the client.

Programs and makefiles are given at the end of the note.

We measure the time spent sending 100000 messages of 600 bytes.
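
The exchange described above can be sketched in a single process. This is only an illustration, not the test program itself: it assumes socketpair() in place of the named socket "my_sockets", and read_full() and round_trip() are hypothetical helper names that do not appear in the programs at the end of the note.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define BUFF_LEN 600

/* Read exactly len bytes; a stream socket may return fewer per call.
 * Returns 0 on success, -1 on error or premature end of stream. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* One message/acknowledge exchange. The "client" end sends a length
 * followed by the payload; the "server" end reads both and echoes the
 * length back as the acknowledge. Returns the acknowledged length,
 * or -1 on failure. */
int round_trip(int client_fd, int server_fd)
{
    char message[BUFF_LEN], buffer[BUFF_LEN];
    int msg_len = BUFF_LEN, got_len = 0, ack = 0;

    memset(message, 'x', sizeof(message));

    /* client side: length, then data */
    if (write(client_fd, &msg_len, sizeof(msg_len)) != (ssize_t)sizeof(msg_len))
        return -1;
    if (write(client_fd, message, (size_t)msg_len) != (ssize_t)msg_len)
        return -1;

    /* server side: read both parts, then acknowledge */
    if (read_full(server_fd, &got_len, sizeof(got_len)) != 0)
        return -1;
    if (got_len < 0 || got_len > BUFF_LEN)
        return -1;
    if (read_full(server_fd, buffer, (size_t)got_len) != 0)
        return -1;
    if (write(server_fd, &got_len, sizeof(got_len)) != (ssize_t)sizeof(got_len))
        return -1;

    /* client side: wait for the acknowledge */
    if (read_full(client_fd, &ack, sizeof(ack)) != 0)
        return -1;
    return ack;
}
```

Calling round_trip() on the two descriptors returned by socketpair(AF_UNIX, SOCK_STREAM, 0, sv) performs one exchange. In the real test the two ends live in separate processes, so every exchange forces each process to block and be woken once.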

Test 1 :
-------
OS : DUNIX V3.2G
client and server are compiled with no specific options.

results : An average of 6150 messages / second

Test 2 :
-------

OS : DUNIX V3.2G
client and server are linked with the thread libraries (-lpthreads -lmach -lc_r)

results : An average of 6150 messages / second
          This is the same value as Test 1

Test 3 :
-------

OS : DUNIX V4.0B
client and server are compiled with no specific options.

results : An average of 5600 messages / second
          We observe a loss of 9% compared to the same test on DUNIX V3.2G

Test 4 :
-------

OS : DUNIX V4.0B
client and server are compiled with -pthread

results : An average of 3800 messages / second
          We observe a performance loss of 32% compared to the same
          program compiled without the -pthread option, and of 38% compared
          to the same test on DUNIX V3.2G
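
The percentages quoted for Tests 3 and 4 follow directly from the message rates. A minimal check, where percent_loss() is a hypothetical helper name introduced only for this illustration:

```c
/* Relative throughput loss, in percent, when the rate drops from
 * `before` to `after` messages per second. */
double percent_loss(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With the figures above, percent_loss(6150, 5600) is about 8.9, percent_loss(5600, 3800) is about 32.1 and percent_loss(6150, 3800) is about 38.2, which match the 9%, 32% and 38% reported.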


Second Test 
===========

The second program is a simple loop that calls the select() system call with
only the timeout parameter set. This is a way to emulate a 'portable'
pthread_delay_np().

--------------------------------------
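#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

int main( void )
{
    struct timeval tv;

    while(1)
    {
        /* set on every pass: some systems modify the timeout in select() */
        tv.tv_sec = 0;
        tv.tv_usec = 1;
        select(0,0,0,0,&tv);
    }
}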
--------------------------------------
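
For reference, the "portable delay" idea can be wrapped in a small function. This is a sketch only: portable_delay() is a hypothetical name, and the <sys/select.h> header is assumed to be available (on older systems select() is declared through <sys/time.h> and <unistd.h> instead).

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Sleep for roughly `usec` microseconds using select() with no
 * descriptor sets. Returns 0 when the timeout expired, -1 on error
 * (for example EINTR if a signal arrived first). */
int portable_delay(long usec)
{
    struct timeval tv;

    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, NULL, NULL, NULL, &tv);
}
```

The test loop above is equivalent to calling portable_delay(1) forever, so each pass costs one system call and one block/unblock of the calling thread.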

The results on DUNIX V4.0 are as follows.

When this program is compiled with no specific option, we can run up to 30
instances of it simultaneously before the CPU load reaches 100%.

When it is compiled with -pthread, we can run only 5 instances simultaneously.
Adding a 6th instance makes the CPU load jump from 5% to 100% for no
apparent reason.

We have also noticed that the number of context switches is 4 times higher when
the program is compiled in -pthread mode.
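
The note does not say how the context switches were counted. One possible way to observe them from inside a process, shown here purely as an assumption, is getrusage(); the ru_nvcsw and ru_nivcsw fields exist on BSD-derived systems and Linux, and whether a given Digital UNIX release fills them in is not confirmed by this thread. context_switches() is a hypothetical helper name.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Total context switches (voluntary + involuntary) charged to the
 * calling process so far, or -1 if getrusage() fails. */
long context_switches(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_nvcsw + ru.ru_nivcsw;
}
```

Sampling this before and after a fixed number of select() calls, once in the plain build and once in the -pthread build, would give a per-call figure to compare.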

Now my questions :
================

Has anybody else already noticed this kind of performance degradation ?
Are there any hints, patches, system configuration settings or whatever that
could improve it ?

Any ideas or information are welcome.

best regards,
Jean-marie


SERVER and client programs :
=============================

Makefile :
-------------------------------------------
du40: sendto_du40 recvfrom_du40
du40_thread: sendto_du40_t recvfrom_du40_t
du32: sendto_du32 recvfrom_du32
du32_thread: sendto_du32_t recvfrom_du32_t

sendto_du32: sendto.c
        cc -o sendto_du32 sendto.c

recvfrom_du32: recvfrom.c
        cc -o recvfrom_du32 recvfrom.c

sendto_du32_t: sendto.c
        cc -threads -o sendto_du32_t sendto.c -lpthreads -lmach -lc_r

recvfrom_du32_t: recvfrom.c
        cc -threads -o recvfrom_du32_t recvfrom.c -lpthreads -lmach -lc_r

sendto_du40: sendto.c
        cc  -o sendto_du40 sendto.c

recvfrom_du40: recvfrom.c
        cc  -o recvfrom_du40 recvfrom.c

sendto_du40_t: sendto.c
        cc  -pthread -o sendto_du40_t sendto.c

recvfrom_du40_t: recvfrom.c
        cc  -pthread -o recvfrom_du40_t recvfrom.c

-----------------------------------------------------------------------------
recvfrom.c :
===========
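#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strcpy() */
#include <unistd.h>     /* read(), write(), close(), unlink() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>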


#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

int main(void)
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    int addr[2];
    size_t addr_len;
    size_t nbbytes;
    int status;
    char buffer[INBUFF_LEN];
    double ftime1, ftime2;
    struct timespec time1, time2;
    unsigned long nbpacket;
    int msg_len;



    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }
    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,1);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");

    printf(" accepting conn\n");
    addr_len = sizeof(acc_address);   /* accept() reads this value on entry */
    socket_id2 = accept(socket_id1,&acc_address,&addr_len);
    if ( socket_id2 == -1)
    {
        printf("accept error %d\n",errno);
        exit(-1);
    }


    printf(" accept ok\n");
    printf("reading ...\n");

    getclock(TIMEOFDAY,&time1);

    nbpacket=0;
    while (1)
    {
        nbbytes = read(socket_id2,&msg_len,sizeof(msg_len));
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        nbbytes = read(socket_id2,buffer,msg_len);
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        /* send acknowledge */

        nbbytes = write(socket_id2,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
            printf("            write error : %d\n",errno);

        nbpacket++;
    }

    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;

   
    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);
    printf(" number of packet sent : %lu\n",nbpacket);
    printf(" packets per second : %f\n", (float)nbpacket/ftime1);

    close(socket_id1);
    close(socket_id2);
    unlink(SOCKET_NAME);

}

-------------------------------------------------------------------
sendto.c :
==========

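#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strcpy() */
#include <unistd.h>     /* read(), write(), close() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>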


#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

int main(void)
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    size_t nbbytes;
    int msg_len;
    int i;


    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
                exit(-1);
    }

    status = connect(socket_id1,&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
                exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            exit(-1);
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            exit(-1);
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes == 0)
        {
            printf("         read error : %d\n",errno);
            exit(-1);
        }

    }

    close(socket_id1);
}

T.R	Title	User	Personal Name	Date	Lines
10014.1	Three problems; one is due to threads for sure.	WTFN::SCALES	Despair is appropriate and inevitable.	Mon Jun 02 1997 15:21	46

I believe that you are reporting three problems:

Problem #1: you are seeing a 9% degradation in performance between V3.2G
and V4.0B in the responsiveness of socket I/O when used in a non-threaded
program.

Problem #2: you are seeing an additional 30%+ degradation in performance on
V4.0B in the responsiveness of socket I/O when the program is linked with
the threads libraries.

Problem #3: you are seeing poor scaling characteristics when running
multiple instances of a multithreaded program which makes heavy use of
select().

Problem #1 is clearly unrelated to threads (perhaps someone who knows about
the I/O system will comment). The source of Problem #3 is unclear (someone
should make an attempt to diagnose it further).

Problem #2, contrary to your assertion, is entirely due to the new thread
scheduling model introduced in V4.0. Despite the fact that your code makes
no calls to the threading library, simply linking your program with the
threads library changes it from a "non-threaded program" into a
"multithreaded program which happens to use only one thread", which is
completely different. In our efforts to improve the integration between
threaded programs and the system, we sacrificed the performance of the
degenerate case where a program links in threads but doesn't use them.

There are various overhead costs imposed by your election to use threads.
The presumption is that the benefits of using threads will more than make
up for the costs. However, in your test case, where you are not actually
using threads, you get none of the benefits, while you are forced to pay
all of the costs.

With special support from the Digital Unix kernel, we've moved thread
scheduling out into user mode (for the most part). This makes system calls
more expensive in terms of CPU usage, while relieving various
reschedule-latency problems and system scaling issues. It also makes the
case of blocking (and unblocking) the process (i.e., when there is no work
to do) marginally more expensive; but, working on the idea that there are a
number of threads active, this would be an atypical event.

However, your test scenario manages to hit both of these costs head on --
your processes are consistently blocking in system calls leaving the
process with nothing else to do, so you incur the added cost of the thread
scheduling and the added latency of the process block/unblock.

Since your real application is multithreaded, you might want to consider
constructing a more representative (i.e., multithreaded) benchmark.

		Webb
10014.2	not best with threaded program	TAEC::URAGO		Tue Jun 03 1997 09:48	396

Hi Webb,

from your previous note :

1- "In our efforts to improve the integration between threaded programs and
    the system, we sacrificed the performance of the degenerate case where
    a program links in threads but doesn't use them."

>>> You did it well .... (joke :-)

2- "However, in your test case, where you are not actually using threads,
    you get none of the benefits, while you are forced to pay all of the
    costs."

>>> I returned to my desk and have written the same kind of program but using
    threads. The idea is still the same : packets of data are exchanged from
    one process to another and are acknowledged to force synchronization.
    The number of threads may be statically configured (#define MAX_THREADS).
    I have done some tests using 3, 6, 12 and 24 threads on each side.
    The results obtained are quite similar to the ones obtained with my first
    non-threaded program :-<
    (programs, makefiles etc. at the end of the note)

    The following results give an idea of the number of packets/second
    exchanged in the different configurations.

    DUNIX V3.2G :
    ------------
    3 threads    6 threads    12 threads   24 threads
    6100 pkts    6120 pkts    6000 pkts    6010 pkts

    You can see that the global number of packets exchanged between the two
    processes is quite constant and near 6000/sec.

    DUNIX V4.0B :
    ------------
    3 threads    6 threads    12 threads   24 threads
    3400 pkts    3600 pkts    3450 pkts    3200 pkts

    On DUNIX V4.0 the number of packets exchanged per second is never higher
    than 3600, whatever the number of threads in the process. This still
    represents a loss of performance of 40%.

    So what are the conditions to get the benefits of the scheduling
    improvements ?

3- "This makes system calls more expensive in terms of CPU usage, while
    relieving various reschedule-latency problems and system scaling issues.
    It also makes the case of blocking (and unblocking) the process (i.e.,
    when there is no work to do) marginally more expensive; but, working on
    the idea that there are a number of threads active, this would be an
    atypical event."

>>>> If I understand well, the performance has been optimized for the case
     where a thread is blocked and hands over to another thread of the same
     process (case of several threads locking/unlocking a mutex), right ?

>>>> In our application this is usually not the case. The threads are waiting
     on I/O (sockets) and sometimes use a mutex when accessing global data.
     This means that most of the time, when a thread is blocked (on a read
     system call for example), it doesn't hand over to another thread of the
     same process but unblocks the receiver thread of the target process.

     This example IS NOT an atypical event, but just real life in the
     telecom world and probably in some others.

Then, what do we do next ?

Jean-marie.

-----------------------------------------------------------------------
Makefile:
--------
du40: send_thr_du40 recv_thr_du40
du32: send_thr_du32 recv_thr_du32

send_thr_du32: send_thr.c
        cc -o send_thr_du32 send_thr.c -lpthreads -lmach -lc_r

recv_thr_du32: recv_thr.c
        cc -o recv_thr_du32 recv_thr.c -lpthreads -lmach -lc_r

send_thr_du40: send_thr.c
        cc -pthread -o send_thr_du40 send_thr.c

recv_thr_du40: recv_thr.c
        cc -pthread -o recv_thr_du40 recv_thr.c

-----------------------------------------------------------------------
send_thr.c :
------------
#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 3

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

void connect_and_emit()
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    size_t nbbytes;
    int msg_len;
    int i;

    printf(" Thread: The thread begin!\n");
    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }

    status = connect(socket_id1,&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
        exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
        {
            printf(" write error : %d\n",errno);
            break;
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( !nbbytes )
        {
            printf(" write error : %d\n",errno);
            break;
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes == 0)
        {
            printf(" read error : %d\n",errno);
            break;
        }
    }

    close(socket_id1);
    printf(" Thread: The thread end!\n");
}

main(int argc, char **argv)
{
    struct timespec sleep_time;
    pthread_t thread_id[MAX_THREADS];
    int terror=0;
    int exitstatus;
    void *result;
    int i;

    for (i=0; i< MAX_THREADS; i++)
    {
        printf("Main: Create the thread \n");

        /* Create the thread */
#if _POSIX_C_SOURCE == 199506L
        terror = pthread_create (&thread_id[i], NULL,
                                 (void *(*)(void *)) connect_and_emit, NULL);
#else
        terror = pthread_create (&thread_id[i], pthread_attr_default,
                                 (void *(*)(void *)) connect_and_emit, NULL);
#endif
        if (terror != 0)
        {
            printf("pthread_create() failed\n");
            exit(-1);
        }
    }

    /* wait for termination of threads */
    for (i=0; i< MAX_THREADS; i++)
        pthread_join(thread_id[i], &result);

    exit(0);
}

------------------------------------------------------------------------------
recv_thr.c
----------
#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 24

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

struct param {
    int indx;
    int socket;
} ;

struct param param[MAX_THREADS];
int pkt_received[MAX_THREADS];

receive(struct param *param)
{
    size_t nbbytes;
    char buffer[INBUFF_LEN];
    unsigned long nbpacket;
    int msg_len;

    printf("reading ...\n");

    nbpacket=0;
    while (1)
    {
        nbbytes = read(param->socket,&msg_len,sizeof(msg_len));
        if (nbbytes == 0)
        {
            printf(" read error : %d\n",errno);
            break;
        }

        nbbytes = read(param->socket,buffer,msg_len);
        if (nbbytes == 0)
        {
            printf(" read error : %d\n",errno);
            break;
        }

        /* send acknowledge */
        nbbytes = write(param->socket,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
            printf(" write error : %d\n",errno);

        nbpacket++;
    }

    pkt_received[param->indx] = nbpacket;
    close(param->socket);
}

main(int argc, char **argv)
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    size_t addr_len;
    int status;
    struct timespec sleep_time;
    pthread_t thread_id[MAX_THREADS];
    int terror;
    int exitstatus;
    void *result;
    int i;
    long pckts_per_sec;
    double ftime1, ftime2;
    struct timespec time1, time2;

    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }
    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,SOMAXCONN);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");

    for (i=0; i<MAX_THREADS;i++)
    {
        printf(" accepting conn\n");
        socket_id2 = accept(socket_id1,&acc_address,&addr_len);
        if ( socket_id2 == -1)
        {
            printf("accept error %d\n",errno);
            exit(-1);
        }
        printf(" accept ok\n");

        /* get starting time */
        if (i==0) getclock(TIMEOFDAY,&time1);

        param[i].indx = i;
        param[i].socket = socket_id2;

        printf("Main: Create the thread \n");

        /* Create the thread */
#if _POSIX_C_SOURCE == 199506L
        terror = pthread_create (&thread_id[i], NULL,
                                 (void *(*)(void *)) receive, &param[i]);
#else
        terror = pthread_create (&thread_id[i], pthread_attr_default,
                                 (void *(*)(void *)) receive, &param[i]);
#endif
        if (terror != 0)
        {
            printf("pthread_create() failed\n");
            exit(-1);
        }
    }

    /* wait for termination of threads */
    for (i=0;i<MAX_THREADS;i++)
    {
        pthread_join(thread_id[i], &result);
    }

    /* compute average */
    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;

    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);

    pckts_per_sec = 0;
    for (i=0;i<MAX_THREADS;i++)
    {
        pckts_per_sec += pkt_received[i];
    }
    printf(" number of packet sent : %ld\n",pckts_per_sec);
    printf(" packets per second : %f\n", (float)pckts_per_sec/ftime1);

    close(socket_id1);
    unlink(SOCKET_NAME);
    exit(0);
}
10014.3	Very fast I/O?	WTFN::SCALES	Despair is appropriate and inevitable.	Tue Jun 03 1997 12:40	52

.2> You did it well .... (joke :-)

Unfortunately, sometimes, in order to make a step forward, you have to take
a step backward; in this case, it was a big step forward, and we apparently
haven't yet uncovered or recovered from our various steps backward... :-}

.2> I returned to my desk and have written the same kind of program but using
.2> threads.

Your program looks reasonable to me. (My command of socket programming is
not strong, but it looks like you avoided the obvious pitfalls.) I'm a
little surprised and disappointed to find that it doesn't affect your
performance.

.2> If I understand well, the performance has been optimized for the case
.2> where a thread is blocked and hands over to another thread of the
.2> same process (case of several threads locking/unlocking a mutex), right ?

The thrust of our model is that user mode synchronization, such as blocking
on a mutex, should be _very_ fast, while kernel mode synchronization, such
as blocking for I/O, should not decrease the level of concurrency in the
process and should be as efficient as possible, the expectation being that
there should be no increase in latency and that the extra CPU cost would be
recouped in concurrent execution and in decrease of scheduling latency for
other threads in the application.

.2> most of the time, when a thread is blocked (on a read system call for
.2> example), it doesn't hand over to another thread of the same process
.2> but unblocks the receiver thread of the target process.

Is the target process typically on the same machine as the sending process?

That is, it was _very_ interesting that the I/O throughput was almost
unaffected (less than a few percent) by the number of threads you used --
this suggests that the I/O has somehow been optimized to the extent where
it is so fast that it is thread scheduling, and not the I/O system, which
is the bottleneck. And, any time that the overhead of thread scheduling
itself is the bottleneck, then using threads is unlikely to be a win for
you (unless performance is not an issue or unless you have some sort of
contention problem in your application).

Thus, if your expected deployment is always on a single machine, you might
want to investigate using shared memory to communicate rather than sockets.
Otherwise, if your typical deployment involves multiple machines, you might
want to try running your benchmark in a more representative environment.

In the meantime, please feel free to enter a QAR on the V4.0B performance.
Please include your pair of multithreaded test programs and the numbers
that you saw on both platforms. (If you decide to QAR either of the other
performance/scaling problems you saw, please enter them as separate QARs.)

		Webb
10014.4		DCETHD::BUTENHOF	Dave Butenhof, DECthreads	Tue Jun 03 1997 14:35	44

>.2> I returned to my desk and have written the same kind of program but using
>.2> threads.
>
>Your program looks reasonable to me. (My command of socket programming is
>not strong, but it looks like you avoided the obvious pitfalls.) I'm a
>little surprised and disappointed to find that it doesn't affect your
>performance.

I'm not surprised or disappointed. This is a "degenerate" case, that,
unfortunately, has become much more common than we'd expected.

Our assumption (and these assumptions are very difficult to test) was that
most threaded code would do most synchronization in user mode, within the
process -- that is, primarily blocking on condition variables and mutexes.
Furthermore, there will usually be threads ready to perform work that
haven't been able to get a VP. The upcall protocol for kernel-blocking I/O
allows the kernel to tell us that a VP is free to handle one of those ready
threads.

As Jean-Marie noted in .2, this is NOT common of "telecom" applications.
They do most of their blocking in the kernel, primarily in I/O, and suffer
greatly from the overhead of upcalls. Furthermore, because they rarely have
threads that are READY to do work, we gain no useful concurrency from the
upcalls. Absolutely worst-case performance.

This sort of application won't ever benefit from 2-level scheduling,
because only one level is doing (nearly) all of the scheduling anyway. You
should be creating your threads with POSIX system contention scope. Doing
so would drop our user-mode scheduler (which isn't helping you) out of the
loop. Your mutex and condition variable synchronization will be more
expensive, because they will require kernel calls -- but you don't do many
anyway, relative to the number of I/O calls, so that shouldn't affect your
performance much.

System contention scope is implemented for Digital UNIX 4.0D.

Oh, one more thing... you're building wrong in your 3.2 lines. You should
be using -threads, not "-lpthreads -lmach -lc_r". Although you have the
right libraries, they're also provided by -threads. And you're missing the
critical -D_REENTRANT compilation option.

I was about to try a comparison, but I see that your test programs don't
write any performance metric, nor did you say how you arrived at the
numbers you've posted. What command sequence do you actually use to run
these test programs and to generate the performance numbers?

	/dave
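
Requesting system contention scope goes through the standard POSIX thread attribute calls. A minimal sketch, assuming a POSIX 1003.1c thread library; worker() and run_scs_thread() are hypothetical names, and worker() is only a placeholder for a real I/O loop:

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder thread body; a real worker would run the I/O loop. */
static void *worker(void *arg)
{
    return arg;
}

/* Create one thread with system contention scope and join it.
 * Returns 0 on success, otherwise the pthread error number. */
int run_scs_thread(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    void *result;
    int status;

    status = pthread_attr_init(&attr);
    if (status != 0)
        return status;

    /* ask for kernel-level scheduling of this thread */
    status = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (status == 0)
        status = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    if (status != 0)
        return status;

    return pthread_join(tid, &result);
}
```

pthread_attr_setscope() returns an error on implementations that do not support the requested scope, so the return value is worth checking rather than assuming the attribute took effect.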
10014.5		TAEC::URAGO		Wed Jun 04 1997 05:12	33

Webb, Dave,

Thanks for all your explanations; the impact of the 2-level scheduling in
our case is now clearer to me. Happy to see that we are a "degenerate"
case .... :-(.

Anyway, as the performance AND the support of DUNIX V4.0 are among the
goals of our TeMIP Vnext, I will try D-UNIX V4.0D and the system contention
scope.

Concerning my test programs, sorry for the lack of explanations.

To use them on D-UNIX V4.0, just start recv_thr_du40 in one session and then
start send_thr_du40 in another session.

Each sending thread (from the send_thr_du40 process) emits 100000 packets
and exits. On the receiver side, when all packets have been received (the
end is signalled by a read error), the total number of received packets is
divided by the elapsed time, which gives the number of packets received per
second. This is done by recv_thr_du40. I know that we could do better in
terms of statistics, but this gives an idea of the overall performance.

The number of threads of each process is set by the #define MAX_THREADS
(it must be the same on each side).

Running the tests with 3 threads takes 1 min 30 sec on my 3000/700,
running them with 6 --> 3 minutes, etc ... be patient !

Do not hesitate to ask me for more info off-line if you have any trouble
using them.

Regards,
Jean-marie
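
The throughput figure described above is simply total packets divided by elapsed wall-clock time. A small sketch of that calculation, with hypothetical helper names; on current systems the two timestamps could come from clock_gettime() rather than the getclock(TIMEOFDAY, ...) call used in the test programs, which is an assumption about the reader's platform rather than something from the thread:

```c
#include <time.h>

/* Seconds elapsed between two timestamps. */
double elapsed_seconds(const struct timespec *start,
                       const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) * 1e-9;
}

/* Packets per second over the interval, or 0.0 if no time elapsed. */
double packets_per_second(unsigned long packets,
                          const struct timespec *start,
                          const struct timespec *end)
{
    double secs = elapsed_seconds(start, end);

    return secs > 0.0 ? (double)packets / secs : 0.0;
}
```

As a consistency check on the figures in this thread: 3 threads sending 100000 packets each in about 90 seconds gives roughly 3333 packets per second, in line with the 3400 reported for 3 threads on V4.0B.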
10014.6	Data and new code...	DCETHD::BUTENHOF	Dave Butenhof, DECthreads	Wed Jun 04 1997 09:04	440

OK, I understand the problem I had running it. The two programs must use
the same number of threads for successful completion -- and in the source
you posted, one used 3 threads and one used 24 threads. So it didn't
complete, and didn't print the statistical information.

I added setup code to take the number of threads as an argument (-t<n>),
and to handle an option (-s) to create system contention scope threads.

On "Digital UNIX 4.0D" (actually, it's not: it's a 4.0B system with a
special kernel and my latest [debug] private sandbox thread library), with
my AlphaStation 600 5/266 workstation, I get:

    Process Contention Scope (default):

    3 threads    6 threads    12 threads   24 threads
    10439 pkts   8488 pkts    7957 pkts    7164 pkts

    System Contention Scope:

    3 threads    6 threads    12 threads   24 threads
    13431 pkts   12726 pkts   12270 pkts   12575 pkts

I don't know what (if anything) it means that I'm getting substantially
better numbers than yours even for PCS. Faster CPU? For comparison, an
identical (hardware) system running stock Digital UNIX 4.0 showed only 7076
packets/sec with 3 threads -- but that's still twice your numbers.

    PCS on 4-CPU AlphaServer 2100A 5/300:

    3 threads    6 threads    12 threads   24 threads
    3293 pkts    4573 pkts    5083 pkts    4982 pkts

    SCS on 4-CPU AlphaServer 2100A 5/300:

    3 threads    6 threads    12 threads   24 threads
    4823 pkts    14799 pkts   15621 pkts   16115 pkts

My code follows: (Note that it won't compile on 3.2 -- I didn't even
pretend to support both interfaces.) As well as adding the options and
improving some messages, I also cleaned up the termination protocol to
avoid the annoying spurious receive error messages (the sender sends a
message length of 0 to terminate).

/*
 * send_thr.c
 */
#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 24
#define ARGS "t:s"

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

void connect_and_emit()
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    size_t nbbytes;
    int msg_len;
    int i;

    printf(" Thread %d begins!\n", pthread_getselfseq_np());
    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }

    status = connect(socket_id1,&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
        exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
        {
            printf(" (len) write error : %d\n",errno);
            break;
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( !nbbytes )
        {
            printf(" (data) write error : %d\n",errno);
            break;
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes == 0)
        {
            printf(" (ack) read error : %d\n",errno);
            break;
        }
    }

    msg_len = 0;
    nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
    if ( !nbbytes )
    {
        printf(" (done) write error : %d\n",errno);
    }

    close(socket_id1);
    printf(" Thread %d done!\n", pthread_getselfseq_np());
}

main(int argc, char **argv)
{
    struct timespec sleep_time;
    pthread_t thread_id[MAX_THREADS];
    int terror=0,threadcount=MAX_THREADS;
    int exitstatus;
    void *result;
    int i, errflg, c, status;
    pthread_attr_t attr;

    status = pthread_attr_init (&attr);
    if (status != 0) {
        printf ("Attr init failed\n");
        exit (-1);
    }

    optarg = NULL;
    errflg = 0;

    while (!errflg && ((c = getopt (argc, argv, ARGS)) != -1))
        switch (c) {
        case 't':
            threadcount = atoi (optarg);
            printf ("Using %d threads\n", threadcount);
            break;
        case 's':
            printf ("Setting SCS\n");
            status = pthread_attr_setscope (&attr, PTHREAD_SCOPE_SYSTEM);
            if (status != 0) {
                printf ("Error setting scope\n");
                exit (-1);
            }
            break;
        default:
            errflg++;
        }

    if (errflg) {
        printf ("%s: usage %s\n", argv[0], ARGS);
        exit (-1);
    }

    if (threadcount > MAX_THREADS) {
        printf ("Too many threads (%d): using %d\n",
                threadcount, MAX_THREADS);
        threadcount = MAX_THREADS;
    }

    for (i=0; i< threadcount; i++)
    {
        printf("Main: Create the thread \n");

        /* Create the thread */
#if _POSIX_C_SOURCE == 199506L
        terror = pthread_create (&thread_id[i], &attr,
                                 (void *(*)(void *)) connect_and_emit, NULL);
#else
        terror = pthread_create (&thread_id[i], pthread_attr_default,
                                 (void *(*)(void *)) connect_and_emit, NULL);
#endif
        if (terror != 0)
        {
            printf("pthread_create() failed\n");
            exit(-1);
        }
    }

    /* wait for termination of threads */
    for (i=0; i< threadcount; i++)
        pthread_join(thread_id[i], &result);

    exit(0);
}

/*
 * recv_thr.c
 */
#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 24

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024
#define ARGS "t:s"

#define BUFF_LEN 600
char message[BUFF_LEN];

struct param {
    int indx;
    int socket;
} ;

struct param param[MAX_THREADS];
int pkt_received[MAX_THREADS];

receive(struct param *param)
{
    size_t nbbytes;
    char buffer[INBUFF_LEN];
    unsigned long nbpacket;
    int msg_len;

    printf(" Thread %d begins!\n", pthread_getselfseq_np());

    nbpacket=0;
    while (1)
    {
        nbbytes = read(param->socket,&msg_len,sizeof(msg_len));
        if (nbbytes == 0)
        {
            printf(" (len) read error : %d\n",errno);
            break;
        }

        if (msg_len == 0)       /* Done */
            break;

        nbbytes = read(param->socket,buffer,msg_len);
        if (nbbytes == 0)
        {
            printf(" (buf) read error : %d\n",errno);
            break;
        }

        /* send acknowledge */
        nbbytes = write(param->socket,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
            printf(" (ack) write error : %d\n",errno);

        nbpacket++;
    }

    pkt_received[param->indx] = nbpacket;
    close(param->socket);
    printf(" Thread %d done!\n", pthread_getselfseq_np());
}

main(int argc, char **argv)
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    size_t addr_len;
    int status,threadcount=MAX_THREADS;
    struct timespec sleep_time;
    pthread_t thread_id[MAX_THREADS];
    int terror;
    int exitstatus;
    void *result;
    int i,errflg,c;
    long pckts_per_sec;
    double ftime1, ftime2;
    struct timespec time1, time2;
    pthread_attr_t attr;

    status = pthread_attr_init (&attr);
    if (status != 0) {
        printf ("Attr init failed\n");
        exit (-1);
    }

    optarg = NULL;
    errflg = 0;

    while (!errflg && ((c = getopt (argc, argv, ARGS)) != -1))
        switch (c) {
        case 't':
            threadcount = atoi (optarg);
            printf ("Using %d threads\n", threadcount);
            break;
        case 's':
            printf ("Setting SCS\n");
            status = pthread_attr_setscope (&attr, PTHREAD_SCOPE_SYSTEM);
            if (status != 0) {
                printf ("Error setting scope\n");
                exit (-1);
            }
            break;
        default:
            errflg++;
        }

    if (errflg) {
        printf ("%s: usage %s\n", argv[0], ARGS);
        exit (-1);
    }

    if (threadcount > MAX_THREADS) {
        printf ("Too many threads (%d): using %d\n",
                threadcount, MAX_THREADS);
        threadcount = MAX_THREADS;
    }

    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }
    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,SOMAXCONN);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");
    printf("Main: Create the threads\n");

    for (i=0; i<threadcount;i++)
    {
        printf(" accepting conn\n");
        socket_id2 = accept(socket_id1,&acc_address,&addr_len);
        if ( socket_id2 == -1)
        {
            printf("accept error %d\n",errno);
            exit(-1);
        }
        printf(" accept ok\n");

        /* get starting time */
        if (i==0) getclock(TIMEOFDAY,&time1);

        param[i].indx = i;
        param[i].socket = socket_id2;

        /* Create the thread */
#if _POSIX_C_SOURCE == 199506L
        terror = pthread_create (&thread_id[i], &attr,
                                 (void *(*)(void *)) receive, &param[i]);
#else
        terror = pthread_create (&thread_id[i], pthread_attr_default,
                                 (void *(*)(void *)) receive, &param[i]);
#endif
        if (terror != 0)
        {
            printf("pthread_create() failed\n");
            exit(-1);
        }
    }

    /* wait for termination of threads */
    for (i=0;i<threadcount;i++)
    {
        pthread_join(thread_id[i], &result);
    }

    /* compute average */
    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;

    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);

    pckts_per_sec = 0;
    for (i=0;i<threadcount;i++)
    {
        pckts_per_sec += pkt_received[i];
    }
    printf(" number of packet sent : %ld\n",pckts_per_sec);
    printf(" packets per second : %f\n", (float)pckts_per_sec/ftime1);

    close(socket_id1);
    unlink(SOCKET_NAME);
    exit(0);
}
10014.7		TAEC::URAGO		`Wed Jun 04 1997 10:15`	18
	"I don't know what (if anything) it means that I'm getting substantially
	better numbers than yours even for PCS. Faster CPU? For comparison, an
	identical (hardware) system running stock Digital UNIX 4.0 showed only
	7076 packets/sec with 3 threads -- but that's still twice your numbers."

	>>> identical hardware ... to yours ?

	If yes, your PCS results with 3 threads on your 4.0D show 10439 pkts.
	Does that mean that you gain 30% with 4.0D, even in PCS, compared to
	4.0B?

	For info, I have tested the following hardware with 4.0B:

	DEC 3000 - M700       : 3 threads : 3400 pkts
	AlphaStation 255/300  : 3 threads : 3650 pkts
	AlphaStation 500/400  : 3 threads : 8000 pkts

	Jean-marie
10014.8		DCETHD::BUTENHOF	Dave Butenhof, DECthreads	`Wed Jun 04 1997 10:33`	22
	>>> identical hardware ... to yours ?

	Yes, I mean two AlphaStation 600 5/266 systems (ALCOR). But remember, I'm
	not talking "4.0B" to "4.0D". One was a "stock", unpatched 4.0 file
	server -- the other (my workstation) is a hacked-up 4.0B system with a
	nightly-build 4.0D kernel and a debug thread library from my sandbox.

	Nevertheless, to the limited extent that a comparison can be considered
	valid, I gained about 48% (7076 -> 10439) moving from the 4.0 system to
	the "4.0D" system, with PCS threads. I'm not sure how my AlphaStation 600
	5/266 and your AlphaStation 500/400 compare, and I assume that's the
	basis of your "30%", so I can't say whether that's even as valid as my
	comparison.

	Actually, the interesting thing about my results, in case anyone didn't
	notice, is the SMP numbers. The AlphaServer 2100A numbers (with a faster
	chip) are substantially worse than the AlphaStation 600 numbers --
	ALWAYS, for PCS, and also for SCS until the processors are "saturated"
	(there's over 300% improvement from 3 threads to 6 threads). The SMP 12
	and 24 thread numbers go UP from there, whereas the uniprocessor numbers
	go down as contention increases. I don't pretend to understand this, and
	I'm not inclined, at this time, to even try.

	/dave
10014.9	EV4 -> EV5 doubles performance at same clock	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Wed Jun 04 1997 15:40`	9
	If you must compare Alphas to oranges...

	System            CPU    MHz    approx SPECint95
	DEC 3000/700      EV45   225?    4
	Astn 255/300      EV45   300     5
	Astn 600 5/266    EV5    266     8
	Asvr2100A 5/300   EV5    300     9
	Astn 500/400      EV56   400    12
10014.10		DCETHD::BUTENHOF	Dave Butenhof, DECthreads	`Thu Jun 05 1997 09:00`	19
	> DEC 3000/700      EV45   225?    4
	> Astn 255/300      EV45   300     5
	> Astn 600 5/266    EV5    266     8
	> Asvr2100A 5/300   EV5    300     9
	> Astn 500/400      EV56   400    12

	So none of Jean-Marie's numbers are comparable to any of mine, even if
	the versions were identical. The numbers in .7 are EV4, EV45, and EV56
	respectively, while mine are EV5.

	Thanks, Bill. Bevin Brett & I looked up the AlphaStation 255 series in
	AltaVista the other day for a different problem, and we couldn't find
	anything that identified the actual chip. (Annoying.)

	I like the "<model> <chip>/<fudge-speed>" convention, despite the fact
	that the speeds are fudged and it doesn't distinguish "4" from "45" or
	"5" from "56". I wish they'd just clean that up and use it consistently
	instead of switching back and forth between that and the
	"<model>/<fudge-speed>" style!

	/dave
10014.11	SPEC disclosures come in handy	PERFOM::HENNING		`Fri Jun 06 1997 10:14`	5
	The workstation group managed to implement three DIFFERENT naming conventions in a single year (sigh). But if you want to find out what's really inside any specific box, you might want to bookmark http://www.specbench.org/cgi-bin/osgresults?conf=cpu95