Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

10014.0. "30 to 40% performance loss when running on D_UNIX V4.0B compared to DUNIX V3.2G" by TAEC::URAGO () Mon Jun 02 1997 11:03

Hello,

While running performance tests of TeMIP (a network management application) on
D-UNIX V4.0B, we observed a performance degradation compared to the results
obtained on D-UNIX V3.2.

For info: TeMIP is a distributed application that makes heavy use of threads.
The main difference between the two versions (DU3.2 / DU4.0) is that the thread
technology has changed: we were using CMA threads on DU3.2 and we now use
POSIX threads on DU4.0.
But we think that this change of technology is not the cause of our problems.

During our investigations we wrote several pieces of code that simulate
parts of our application. These pieces of code do not use the thread API but show
very bad results (20% to 30% performance loss) on D_UNIX V4.0 when compiled with
the -pthread cc option.

The tests were done on a DEC 3000/700 in dual-boot mode, one system running V4.0B,
the other running V3.2G.
 

First Test :
===========

The first example is a two-process (client/server) application that exchanges
data over an AF_UNIX socket. Each message written by the client is read by the
server, and an acknowledgment is sent back to the client.

The programs and makefiles are given at the end of the note.

We measure the time spent sending 100000 messages of 600 bytes.

Test 1 :
-------
OS : DUNIX V3.2 G
client and server are compiled with no specific options.

results : An average of 6150 messages / second 

Test 2 :
-------

OS : DUNIX V3.2 G
client and server are linked with thread libraries (-lpthreads -lmach -lc_r)

results : An average of 6150 messages / second 
          This is the same value as Test1

Test 3 :
-------

OS : DUNIX V4.0B
client and server are compiled with no specific options.

results : An average of 5600 messages / second 
          We can observe a loss of 9% compared to the same test on DUNIX V3.2G

Test 4 :
-------

OS : DUNIX V4.0B
client and server are compiled with -pthread

results: An average of 3800 messages /second 
         We observe a loss of performance of 32% compared to the same
         program compiled without the -pthread option and 38% compared
         to the same test on DUNIX V3.2 


Second Test 
===========

The second program is a simple loop that calls the select() system call with 
only the timeout parameter set. This program emulates a 'portable'
pthread_delay_np().

--------------------------------------
#include <sys/types.h>
#include <sys/time.h>

int main( void )
{
    struct timeval tv;

    tv.tv_sec = 0;
    tv.tv_usec = 1;
    while(1)
    {
        select(0,0,0,0,&tv);
    }
}
--------------------------------------

The results on DUNIX V4.0 are as follows.

When this program is compiled with no specific option, we can run up to 30
instances of it simultaneously before the CPU load reaches 100%.

When it is compiled with -pthread, we can only run 5 instances simultaneously.
Adding a 6th instance makes the CPU load jump from 5% to 100% for no
apparent reason.

We have also noticed that the number of context switches is 4 times higher when the
program is compiled in -pthread mode.

Now my questions :
================

Has anybody else already noticed this kind of performance degradation ?
Are there any hints, patches, system configuration settings or anything else
that could improve it ?

Any ideas or information are welcome.

best regards,
Jean-marie


SERVER and client programs :
=============================

Makefile :
-------------------------------------------
du40: sendto_du40 recvfrom_du40
du40_thread: sendto_du40_t recvfrom_du40_t
du32: sendto_du32 recvfrom_du32
du32_thread: sendto_du32_t recvfrom_du32_t

sendto_du32: sendto.c
        cc -o sendto_du32 sendto.c

recvfrom_du32: recvfrom.c
        cc -o recvfrom_du32 recvfrom.c

sendto_du32_t: sendto.c
        cc -threads -o sendto_du32_t sendto.c -lpthreads -lmach -lc_r

recvfrom_du32_t: recvfrom.c
        cc -threads -o recvfrom_du32_t recvfrom.c -lpthreads -lmach -lc_r

sendto_du40: sendto.c
        cc  -o sendto_du40 sendto.c

recvfrom_du40: recvfrom.c
        cc  -o recvfrom_du40 recvfrom.c

sendto_du40_t: sendto.c
        cc  -pthread -o sendto_du40_t sendto.c

recvfrom_du40_t: recvfrom.c
        cc  -pthread -o recvfrom_du40_t recvfrom.c

-----------------------------------------------------------------------------
recvfrom.c :
===========
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>


#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

main()
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    int addr[2];
    size_t addr_len;
    size_t nbbytes;
    int status;
    char buffer[INBUFF_LEN];
    double ftime1, ftime2;
    struct timespec time1, time2;
    unsigned long nbpacket;
    int msg_len;



    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }
    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,1);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");

    printf(" accepting conn\n");
    addr_len = sizeof(acc_address);  /* accept() reads this as the buffer size */
    socket_id2 = accept(socket_id1,&acc_address,&addr_len);
    if ( socket_id2 == -1)
    {
        printf("accept error %d\n",errno);
        exit(-1);
    }


    printf(" accept ok\n");
    printf("reading ...\n");

    getclock(TIMEOFDAY,&time1);

    nbpacket=0;
    while (1)
    {
        nbbytes = read(socket_id2,&msg_len,sizeof(msg_len));
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        nbbytes = read(socket_id2,buffer,msg_len);
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        /* send acknowledge */

        nbbytes = write(socket_id2,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
            printf("            write error : %d\n",errno);

        nbpacket++;
    }

    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;

   
    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);
    printf(" number of packet sent : %lu\n",nbpacket);
    printf(" packets per second : %f\n", (float)nbpacket/ftime1);

    close(socket_id1);
    close(socket_id2);
    unlink(SOCKET_NAME);

}

-------------------------------------------------------------------
sendto.c :
==========

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>


#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

int main(void)
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    size_t nbbytes;
    int msg_len;
    int i;


    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
                exit(-1);
    }

    status = connect(socket_id1,&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
                exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            exit(-1);
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            exit(-1);
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes == 0)
        {
            printf("         read error : %d\n",errno);
            exit(-1);
        }

    }

    close(socket_id1);
}


10014.1. "Three problems; one is due to threads for sure." by WTFN::SCALES (Despair is appropriate and inevitable.) Mon Jun 02 1997 15:21, 46 lines
I believe that you are reporting three problems:

Problem #1:  you are seeing a 9% degradation in performance between V3.2G and 
		V4.0B in the responsiveness of socket I/O when used in a 
		non-threaded program.
Problem #2:  you are seeing an additional 30%+ degradation in performance on 
		V4.0B in the responsiveness of socket I/O when the program is 
		linked with the threads libraries.
Problem #3:  you are seeing poor scaling characteristics when running multiple
		instances of a multithreaded program which makes heavy use of
		select().

Problem #1 is clearly unrelated to threads (perhaps someone who knows about the
I/O system will comment).  The source of Problem #3 is unclear (someone should
make an attempt to diagnose it further).

Problem #2, contrary to your assertion, is entirely due to the new thread
scheduling model introduced in V4.0.  Despite the fact that your code makes no
calls to the threading library, simply linking your program with the threads
library changes it from a "non-threaded program" into a "multithreaded program
which happens to use only one thread", which is completely different.  

In our efforts to improve the integration between threaded programs and the
system, we sacrificed the performance of the degenerate case where a program
links in threads but doesn't use them.  There are various overhead costs imposed
by your election to use threads.  The presumption is that the benefits of using
threads will more than make up for the costs.  However, in your test case, where
you are not actually using threads, you get none of the benefits, while you are
forced to pay all of the costs.

With special support from the Digital Unix kernel, we've moved thread scheduling
out into user mode (for the most part).  This makes system calls more expensive
in terms of CPU usage, while relieving various reschedule-latency problems and
system scaling issues.  It also makes the case of blocking (and unblocking) the
process (i.e., when there is no work to do) marginally more expensive; but,
working on the idea that there are a number of threads active, this would be an
atypical event.  However, your test scenario manages to hit both of these costs
head on -- your processes are consistently blocking in system calls leaving the
process with nothing else to do, so you incur the added cost of the thread
scheduling and the added latency of the process block/unblock.

Since your real application is multithreaded, you might want to consider
constructing a more representative (i.e., multithreaded) benchmark.


				Webb

10014.2. "not best with threaded program" by TAEC::URAGO () Tue Jun 03 1997 09:48, 396 lines
Hi Webb,

from your previous note :

1- "In our efforts to improve the integration between threaded programs and the
system, we sacrificed the performance of the degenerate case where a program
links in threads but doesn't use them."

>>> You did it well .... (joke :-)

2- "However, in your test case, where you are not actually using threads, you
 get none of the benefits, while you are forced to pay all of the costs."

>>> I returned to my desk and wrote the same kind of program, but using
threads. The idea is still the same: packets of data are exchanged from one
process to another and are acknowledged to force synchronization. The number
of threads may be statically configured (#define MAX_THREADS).
I have done some tests using 3, 6, 12 and 24 threads on each side.
The results obtained are quite similar to those obtained with my first
 non-threaded program :-<

(programs, makefiles etc.. at the end of the note )

The following results give an idea of the number of packets/second 
exchanged in the different configurations.

DUNIX V3.2G :
------------
    3 threads       6 threads      12 threads      24 threads

    6100 pkts        6120 pkts      6000 pkts       6010 pkts

You can see that the global number of packets exchanged between the two
 processes is quite constant and near 6000/sec.


DUNIX V4.0B :
------------
    3 threads       6 threads      12 threads      24 threads

    3400 pkts        3600 pkts      3450 pkts       3200 pkts

On DUNIX V4.0 the number of packets exchanged per second is never higher than
3600, whatever the number of threads in the process. This still represents a
performance loss of 40%.
So what are the conditions to get the benefits of the scheduling improvements ?

3- "This makes system calls more expensive in terms of CPU
usage, while relieving various reschedule-latency problems and system scaling
issues.  It also makes the case of blocking (and unblocking) the
process (i.e., when there is no work to do) marginally more expensive; but,
working on the idea that there are a number of threads active, this would be an
atypical event."
>>>> If I understand correctly, performance has been optimized for the case
where a thread blocks and hands over to another thread of the same
process (the case of several threads locking/unlocking a mutex), right ?
>>>> In our application this is usually not the case. The threads are waiting
on I/O (sockets) and sometimes use a mutex when accessing global data. This
means that most of the time when a thread is blocked
(on a read system call, for example) it does not hand over to another thread
of the same process but unblocks the receiver thread of the target process.
This example IS NOT an atypical event, but just real life in the 
telecom world and probably in some others.

then, what do we do next ?

Jean-marie.


-----------------------------------------------------------------------
Makefile:
--------
du40: send_thr_du40 recv_thr_du40
du32: send_thr_du32 recv_thr_du32

send_thr_du32: send_thr.c
        cc  -o send_thr_du32 send_thr.c -lpthreads -lmach -lc_r

recv_thr_du32: recv_thr.c
        cc  -o recv_thr_du32 recv_thr.c -lpthreads -lmach -lc_r

send_thr_du40: send_thr.c
        cc  -pthread -o send_thr_du40 send_thr.c

recv_thr_du40: recv_thr.c
        cc  -pthread -o recv_thr_du40 recv_thr.c
-----------------------------------------------------------------------
send_thr.c :
------------
#include <pthread.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 3

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

void connect_and_emit()
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    size_t nbbytes;
    int msg_len;
    int i;

    printf("            Thread: The thread begin!\n");

    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
                exit(-1);
    }

    status = connect(socket_id1,&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
                exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            break;
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( !nbbytes )
        {
            printf("            write error : %d\n",errno);
            break;
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes == 0)
        {
            printf("         read error : %d\n",errno);
            break;
        }

    }

    close(socket_id1);
    printf("            Thread: The thread end!\n");
}


main(int argc, char **argv)
{
    struct timespec sleep_time;
    pthread_t               thread_id[MAX_THREADS];
    int                     terror=0;
    int                     exitstatus;
    void*       result;
    int i;


    for (i=0; i< MAX_THREADS; i++)
    {
       printf("Main: Create the thread \n");


       /* Create the thread
        */
#if _POSIX_C_SOURCE == 199506L
       terror = pthread_create (&thread_id[i], NULL,
                                (void *(*)(void*)) connect_and_emit,
                                NULL);
#else
       terror = pthread_create (&thread_id[i], pthread_attr_default,
                                (void *(*)(void*)) connect_and_emit,
                                NULL);
#endif

       if (terror != 0)
       {
             printf("pthread_create() failed\n");
             exit(-1);
       }

    }
    /*
     * wait for termination of threads
     */
    for (i=0; i< MAX_THREADS; i++)
       pthread_join(thread_id[i], &result);

    exit(0);
}

------------------------------------------------------------------------------
recv_thr.c
----------
#include <pthread.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 24

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

struct param {
    int indx;
    int socket;
} ;

struct param            param[MAX_THREADS];
int                   pkt_received[MAX_THREADS];

receive(struct param *param)
{
    size_t nbbytes;
    char buffer[INBUFF_LEN];
    unsigned long nbpacket;
    int msg_len;

    printf("reading ...\n");

    nbpacket=0;
    while (1)
    {
        nbbytes = read(param->socket,&msg_len,sizeof(msg_len));
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        nbbytes = read(param->socket,buffer,msg_len);
        if (nbbytes == 0)
        {
               printf("         read error : %d\n",errno);
               break;
        }

        /* send acknowledge */

        nbbytes = write(param->socket,&msg_len, sizeof(msg_len));
        if ( !nbbytes )
            printf("            write error : %d\n",errno);

        nbpacket++;
    }

    pkt_received[param->indx] = nbpacket;
    close(param->socket);

}


main(int argc, char **argv)
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    size_t addr_len;
    int status;
    struct timespec sleep_time;
    pthread_t               thread_id[MAX_THREADS];
    int                     terror;
    int                     exitstatus;
    void*       result;
    int i;
    long pckts_per_sec;
    double ftime1, ftime2;
    struct timespec time1, time2;


    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }

    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,SOMAXCONN);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");

    for (i=0; i<MAX_THREADS;i++)
    {
        printf(" accepting conn\n");
        addr_len = sizeof(acc_address);  /* accept() reads this as the buffer size */
        socket_id2 = accept(socket_id1,&acc_address,&addr_len);
        if ( socket_id2 == -1)
        {
            printf("accept error %d\n",errno);
            exit(-1);
        }
        printf(" accept ok\n");

        /* get starting time */
        if (i==0)
            getclock(TIMEOFDAY,&time1);

       param[i].indx = i;
       param[i].socket = socket_id2;

       printf("Main: Create the thread \n");


       /* Create the thread
        */
#if _POSIX_C_SOURCE == 199506L
       terror = pthread_create (&thread_id[i], NULL,
                                (void *(*)(void*)) receive,
                                &param[i]);
#else
       terror = pthread_create (&thread_id[i], pthread_attr_default,
                                (void *(*)(void*)) receive,
                                &param[i]);
#endif
       if (terror != 0)
       {
             printf("pthread_create() failed\n");
             exit(-1);
       }
    }


    /*
     * wait for termination of threads
     */
    for (i=0;i<MAX_THREADS;i++)
    {
        pthread_join(thread_id[i], &result);
    }

    /* compute average */

    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;
    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);

    pckts_per_sec = 0;
    for (i=0;i<MAX_THREADS;i++)
    {
        pckts_per_sec += pkt_received[i];
    }
    printf(" number of packet sent : %ld\n",pckts_per_sec);
    printf(" packets per second : %f\n", (float)pckts_per_sec/ftime1);

    close(socket_id1);
    unlink(SOCKET_NAME);
    exit(0);
}


10014.3. "Very fast I/O?" by WTFN::SCALES (Despair is appropriate and inevitable.) Tue Jun 03 1997 12:40, 52 lines
.2> You did it well .... (joke :-)

Unfortunately, sometimes, in order to make a step forward, you have to take a
step backward; in this case, it was a big step forward, and we apparently
haven't yet uncovered or recovered from our various steps backward...  :-}

.2> I returned to my desk and wrote the same kind of program, but using
.2> threads.

Your program looks reasonable to me.  (My command of socket programming is
not strong, but it looks like you avoided the obvious pitfalls.)  I'm a
little surprised and disappointed to find that it doesn't affect your
performance.

.2> If I understand correctly, performance has been optimized for the case
.2> where a thread blocks and hands over to another thread of the same
.2> process (the case of several threads locking/unlocking a mutex), right ?

The thrust of our model is that user mode synchronization, such as blocking
on a mutex, should be _very_ fast, while kernel mode synchronization, such as
blocking for I/O, should not decrease the level of concurrency in the process
and should be as efficient as possible, the expectation being that there
should be no increase in latency and that the extra CPU cost would be
recouped in concurrent execution and in decrease of scheduling latency for
other threads in the application.

.2> most of the time when a thread is blocked (on a read system call, for
.2> example) it does not hand over to another thread of the same process
.2> but unblocks the receiver thread of the target process.

Is the target process typically on the same machine as the sending process?
That is, it was _very_ interesting that the I/O throughput was almost
unaffected (less than a few percent) by the number of threads you used --
this suggests that the I/O has somehow been optimized to the extent where it
is so fast that it is thread scheduling, and not the I/O system, which is the
bottleneck.  And, any time that the overhead of thread scheduling itself is
the bottleneck, then using threads is unlikely to be a win for you (unless
performance is not an issue or unless you have some sort of contention
problem in your application).

Thus, if your expected deployment is always on a single machine, you might
want to investigate using shared memory to communicate rather than sockets.
Otherwise, if your typical deployment involves multiple machines, you might
want to try running your benchmark in a more representative environment.

In the meantime, please feel free to enter a QAR on the V4.0B performance.
Please include your pair of multithreaded test programs and the numbers that
you saw on both platforms.  (If you decide to QAR either of the other
performance/scaling problems you saw, please enter them as separate QARs.)


				Webb

10014.4. "" by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Tue Jun 03 1997 14:35, 44 lines
>.2> I returned to my desk and wrote the same kind of program, but using
>.2> threads.
>
>Your program looks reasonable to me.  (My command of socket programming is
>not strong, but it looks like you avoided the obvious pitfalls.)  I'm a
>little surprised and disappointed to find that it doesn't affect your
>performance.

I'm not surprised or disappointed. This is a "degenerate" case, that,
unfortunately, has become much more common than we'd expected. Our assumption
(and these assumptions are very difficult to test) was that most threaded
code would do most synchronization in user mode, within the process -- that
is, primarily blocking on condition variables and mutexes. Furthermore, there
will usually be threads ready to perform work that haven't been able to get a
VP. The upcall protocol for kernel-blocking I/O allows the kernel to tell us
that a VP is free to handle one of those ready threads.

As Jean-Marie noted in .2, this is NOT typical of "telecom" applications. They
do most of their blocking in the kernel, primarily in I/O, and suffer greatly
from the overhead of upcalls. Furthermore, because they rarely have threads
that are READY to do work, we gain no useful concurrency from the upcalls.
Absolutely worst-case performance.

This sort of application won't ever benefit from 2-level scheduling, because
only one level is doing (nearly) all of the scheduling anyway. You should be
creating your threads with POSIX system contention scope. Doing so would drop
our user-mode scheduler (which isn't helping you) out of the loop. Your mutex
and condition variable synchronization will be more expensive, because they
will require kernel calls -- but you don't do many anyway, relative to the
number of I/O calls, so that shouldn't affect your performance much.

System contention scope is implemented for Digital UNIX 4.0D.

Oh, one more thing... you're building wrong in your 3.2 lines. You should be
using -threads, not "-lpthreads -lmach -lc_r". Although you have the right
libraries, they're also provided by -threads. And you're missing the critical
-D_REENTRANT compilation option.

I was about to try a comparison, but I see that your test programs don't
write any performance metric, nor did you say how you arrived at the numbers
you've posted. What command sequence do you actually use to run these test
programs and to generate the performance numbers?

	/dave

10014.5. "" by TAEC::URAGO () Wed Jun 04 1997 05:12, 33 lines
Webb, dave,

Thanks for all your explanations; the impact of the 2-level scheduling in our
case is now clearer to me.

happy to see that we are a "degenerate" case .... :-(.


Anyway, as both performance AND support of DUNIX V4.0 are goals
of our TeMIP Vnext, I will try D-UNIX V4.0D and the system contention scope.

Concerning my test programs, sorry for the lack of explanation.

To use them on D-UNIX V4.0, just start recv_thr_du40 in one session and then 
start send_thr_du40 in another session.

Each sending thread (in the send_thr_du40 process) emits 100000 packets and exits.
On the receiver side, when all packets have been received (signalled by a read
error), the total number of received packets is divided by the elapsed time,
which gives the number of packets received per second. This is done by
recv_thr_du40.
I know that we could do something better in terms of statistics, but this gives
an idea of the overall performance.

The number of threads in each process is set by the #define MAX_THREADS (it
must be the same on each side).

Running the test with 3 threads takes 1min 30sec on my 3000/700;
running it with 6 takes 3 minutes, etc ... be patient !

Do not hesitate to ask me for more info off-line if you have any trouble using it.

Regards,
Jean-marie

10014.6. "Data and new code..." by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Wed Jun 04 1997 09:04, 440 lines
OK, I understand the problem I had running it. The two programs must use the
same number of threads for successful completion -- and in the source you
posted, one used 3 threads and one used 24 threads. So it didn't complete,
and didn't print the statistical information.

I added setup code to take the number of threads as an argument (-t<n>), and
to handle an option (-s) to create system contention scope threads.

On "Digital UNIX 4.0D" (actually, it's not: it's a 4.0B system with a special
kernel and my latest [debug] private sandbox thread library), with my
AlphaStation 600 5/266 workstation, I get:

Process Contention Scope (default):

    3 threads       6 threads      12 threads      24 threads

    10439 pkts      8488 pkts      7957 pkts       7164 pkts

System Contention Scope:

    3 threads       6 threads      12 threads      24 threads

    13431 pkts      12726 pkts     12270 pkts      12575 pkts

I don't know what (if anything) it means that I'm getting substantially
better numbers than yours even for PCS. Faster CPU? For comparison, an
identical (hardware) system running stock Digital UNIX 4.0 showed only 7076
packets/sec with 3 threads -- but that's still twice your numbers.

PCS on 4-CPU AlphaServer 2100A 5/300:

    3 threads       6 threads      12 threads      24 threads

    3293 pkts       4573 pkts      5083 pkts       4982 pkts

SCS on 4-CPU AlphaServer 2100A 5/300:

    3 threads       6 threads      12 threads      24 threads

    4823 pkts       14799 pkts     15621 pkts      16115 pkts

My code follows: (Note that it won't compile on 3.2 -- I didn't even pretend
to support both interfaces.) As well as adding the options and improving some
messages, I also cleaned up the termination protocol to avoid the annoying
spurious receive error messages (the sender sends a message length of 0 to
terminate).

/*
 * send_thr.c
 */
#include <pthread.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#include <sys/un.h>
#include <sys/socket.h>

#define MAX_THREADS 24
#define ARGS "t:s"
#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024

#define BUFF_LEN 600
char message[BUFF_LEN];

void connect_and_emit()
{
    int socket_id1;
    struct sockaddr_un address;
    int addr_len;
    char buffer[INBUFF_LEN];
    int status;
    ssize_t nbbytes;    /* signed, so a -1 return from read/write is visible */
    int msg_len;
    int i;

    printf("            Thread %d begins!\n", pthread_getselfseq_np());

    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }

    status = connect(socket_id1,(struct sockaddr *)&address,sizeof(address));
    if ( status == -1 )
    {
        printf("connect error %d\n",errno);
        exit(-1);
    }

    msg_len = BUFF_LEN;
    for (i=0 ;i<100000;i++)
    {
        /* each message is a length word followed by the payload */
        nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
        if ( nbbytes <= 0 )
        {
            printf("            (len) write error : %d\n",errno);
            break;
        }
        nbbytes = write(socket_id1,message,msg_len);
        if ( nbbytes <= 0 )
        {
            printf("            (data) write error : %d\n",errno);
            break;
        }

        /* read acknowledge */
        nbbytes = read(socket_id1,buffer,4);
        if (nbbytes <= 0)
        {
            printf("         (ack) read error : %d\n",errno);
            break;
        }

    }

    /* a message length of 0 tells the receiver we are done */
    msg_len = 0;
    nbbytes = write(socket_id1,&msg_len, sizeof(msg_len));
    if ( nbbytes <= 0 )
    {
	printf("            (done) write error : %d\n",errno);
    }
    close(socket_id1);
    printf("            Thread %d done!\n", pthread_getselfseq_np());
}


main(int argc, char **argv)
{
    struct timespec sleep_time;
    pthread_t               thread_id[MAX_THREADS];
    int                     terror=0,threadcount=MAX_THREADS;
    int                     exitstatus;
    void*       result;
    int i, errflg, c, status;
    pthread_attr_t	attr;


    status = pthread_attr_init (&attr);
    if (status != 0) {
	printf ("Attr init failed\n");
	exit (-1);
	}
    optarg = NULL;
    errflg = 0;
    while (!errflg && ((c = getopt (argc, argv, ARGS)) != -1))
	switch (c) {
	    case 't':
		threadcount = atoi (optarg);
		printf ("Using %d threads\n", threadcount);
		break;
	    case 's':
		printf ("Setting SCS\n");
		status = pthread_attr_setscope (&attr, PTHREAD_SCOPE_SYSTEM);
		if (status != 0) {
		    printf ("Error setting scope\n");
		    exit (-1);
		    }
		break;
	    default:
		errflg++;
	    }

    if (errflg) {
	printf ("%s: usage %s\n", argv[0], ARGS);
	exit (-1);
	}
	

    if (threadcount > MAX_THREADS) {
	printf ("Too many threads (%d): using %d\n", threadcount, MAX_THREADS);
	threadcount = MAX_THREADS;
	}

    for (i=0; i< threadcount; i++)
    {
       printf("Main: Create the thread \n");


       /* Create the thread
        */
#if _POSIX_C_SOURCE == 199506L
       terror = pthread_create (&thread_id[i], &attr,
                                (void *(*)(void*)) connect_and_emit,
                                NULL);
#else
       terror = pthread_create (&thread_id[i], pthread_attr_default,
                                (void *(*)(void*)) connect_and_emit,
                                NULL);
#endif

       if (terror != 0)
       {
             printf("pthread_create() failed\n");
             exit(-1);
       }

    }
    /*
     * wait for termination of threads
     */
    for (i=0; i< threadcount; i++)
       pthread_join(thread_id[i], &result);

    exit(0);
}

/*
 * recv_thr.c
 */
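#include <pthread.h>

#include <stdio.h>
#include <stdlib.h>     /* exit, atoi */
#include <string.h>     /* strcpy */
#include <unistd.h>     /* read, write, close, unlink, getopt */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#include <sys/un.h>
#include <sys/socket.h>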

#define MAX_THREADS 24

#define SOCKET_NAME "my_sockets"
#define INBUFF_LEN 1024
#define ARGS "t:s"
#define BUFF_LEN 600
char message[BUFF_LEN];

struct param {
    int indx;
    int socket;
} ;

struct param            param[MAX_THREADS];
int                   pkt_received[MAX_THREADS];

receive(struct param *param)
{
    ssize_t nbbytes;    /* signed, so a -1 return from read/write is visible */
    char buffer[INBUFF_LEN];
    unsigned long nbpacket;
    int msg_len;

    printf("            Thread %d begins!\n", pthread_getselfseq_np());

    nbpacket=0;
    while (1)
    {
        /* read() returns 0 at end-of-file and -1 on error; stop on either */
        nbbytes = read(param->socket,&msg_len,sizeof(msg_len));
        if (nbbytes <= 0)
        {
               printf("         (len) read error : %d\n",errno);
               break;
        }

	if (msg_len == 0)			/* Done */
	    break;

        /* assumes the whole payload arrives in a single read() */
        nbbytes = read(param->socket,buffer,msg_len);
        if (nbbytes <= 0)
        {
               printf("         (buf) read error : %d\n",errno);
               break;
        }

        /* send acknowledge */

        nbbytes = write(param->socket,&msg_len, sizeof(msg_len));
        if ( nbbytes <= 0 )
            printf("            (ack) write error : %d\n",errno);

        nbpacket++;
    }

    pkt_received[param->indx] = nbpacket;
    close(param->socket);
    printf("            Thread %d done!\n", pthread_getselfseq_np());
}


main(int argc, char **argv)
{
    int socket_id1,socket_id2;
    struct sockaddr_un address;
    struct sockaddr_un acc_address;
    size_t addr_len;
    int status,threadcount=MAX_THREADS;
    struct timespec sleep_time;
    pthread_t               thread_id[MAX_THREADS];
    int                     terror;
    int                     exitstatus;
    void*       result;
    int i,errflg,c;
    long pckts_per_sec;
    double ftime1, ftime2;
    struct timespec time1, time2;
    pthread_attr_t	attr;


    status = pthread_attr_init (&attr);
    if (status != 0) {
	printf ("Attr init failed\n");
	exit (-1);
	}
    optarg = NULL;
    errflg = 0;
    while (!errflg && ((c = getopt (argc, argv, ARGS)) != -1))
	switch (c) {
	    case 't':
		threadcount = atoi (optarg);
		printf ("Using %d threads\n", threadcount);
		break;
	    case 's':
		printf ("Setting SCS\n");
		status = pthread_attr_setscope (&attr, PTHREAD_SCOPE_SYSTEM);
		if (status != 0) {
		    printf ("Error setting scope\n");
		    exit (-1);
		    }
		break;
	    default:
		errflg++;
	    }

    if (errflg) {
	printf ("%s: usage %s\n", argv[0], ARGS);
	exit (-1);
	}
	

    if (threadcount > MAX_THREADS) {
	printf ("Too many threads (%d): using %d\n", threadcount, MAX_THREADS);
	threadcount = MAX_THREADS;
	}

    address.sun_family = AF_UNIX;
    strcpy(address.sun_path,SOCKET_NAME);

    socket_id1 = socket(AF_UNIX,SOCK_STREAM,0);
    if ( socket_id1 == -1)
    {
        printf("socket error %d\n",errno);
        exit(-1);
    }

    status = bind(socket_id1,&address,sizeof(address));
    if ( status == -1)
    {
        printf("bind error %d\n",errno);
        exit(-1);
    }

    status = listen(socket_id1,SOMAXCONN);
    if ( status == -1)
    {
        printf("listen error %d\n",errno);
        exit(-1);
    }
    printf(" listen ok\n");

    printf("Main: Create the threads\n");

    for (i=0; i<threadcount;i++)
    {
        printf(" accepting conn\n");
        /* accept() expects the size of the address buffer on input */
        addr_len = sizeof(acc_address);
        socket_id2 = accept(socket_id1,&acc_address,&addr_len);
        if ( socket_id2 == -1)
        {
            printf("accept error %d\n",errno);
            exit(-1);
        }
        printf(" accept ok\n");

        /* get starting time */
        if (i==0)
            getclock(TIMEOFDAY,&time1);

       param[i].indx = i;
       param[i].socket = socket_id2;


       /* Create the thread
        */
#if _POSIX_C_SOURCE == 199506L
       terror = pthread_create (&thread_id[i], &attr,
                                (void *(*)(void*)) receive,
                                &param[i]);
#else
       terror = pthread_create (&thread_id[i], pthread_attr_default,
                                (void *(*)(void*)) receive,
                                &param[i]);
#endif
       if (terror != 0)
       {
             printf("pthread_create() failed\n");
             exit(-1);
       }
    }


    /*
     * wait for termination of threads
     */
    for (i=0;i<threadcount;i++)
    {
        pthread_join(thread_id[i], &result);
    }

    /* compute average */

    getclock(TIMEOFDAY,&time2);

    ftime1 = time1.tv_sec + time1.tv_nsec * 0.000000001;
    ftime2 = time2.tv_sec + time2.tv_nsec * 0.000000001;
    ftime1 = ftime2 - ftime1;
    printf(" elapsed time : %f\n",ftime1);

    pckts_per_sec = 0;
    for (i=0;i<threadcount;i++)
    {
        pckts_per_sec += pkt_received[i];
    }
    printf(" number of packet sent : %ld\n",pckts_per_sec);
    printf(" packets per second : %f\n", (float)pckts_per_sec/ftime1);

    close(socket_id1);
    unlink(SOCKET_NAME);
    exit(0);
}
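
The test programs rely on each read() or write() moving a whole length word or a whole payload in one call. On a stream socket that is not guaranteed in general, so here is a minimal sketch of the same length-prefixed protocol (a length of 0 still means "done") with loops that retry on short transfers. The helper names read_full, write_full, send_msg and recv_msg are made up for this illustration and do not appear in the programs above.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes. Returns 0 on success, -1 on error or early EOF. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0)
            return -1;                  /* peer closed before we were done */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: just retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes. Returns 0 on success, -1 on error. */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            if (n < 0 && errno == EINTR)
                continue;               /* interrupted: just retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a length word, then the payload. A length of 0 is the terminator. */
int send_msg(int fd, const void *data, int len)
{
    if (write_full(fd, &len, sizeof(len)) != 0)
        return -1;
    return len > 0 ? write_full(fd, data, (size_t)len) : 0;
}

/* Receive one message into buf (capacity cap bytes).
 * Returns the payload length, 0 for the terminator, -1 on error. */
int recv_msg(int fd, void *buf, int cap)
{
    int len;

    if (read_full(fd, &len, sizeof(len)) != 0)
        return -1;
    if (len < 0 || len > cap)
        return -1;                      /* refuse to overrun the buffer */
    if (len > 0 && read_full(fd, buf, (size_t)len) != 0)
        return -1;
    return len;
}
```

A receiver would call recv_msg() in a loop and stop when it returns 0 or -1. The extra loop iterations cost almost nothing when transfers are complete, so this should not change the benchmark numbers noticeably, though I have not measured it.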

10014.7. "" by TAEC::URAGO () Wed Jun 04 1997 10:15 (18 lines)

"I don't know what (if anything) it means that I'm getting substantially
better numbers than yours even for PCS. Faster CPU? For comparison, an
identical (hardware) system running stock Digital UNIX 4.0 showed only 7076
packets/sec with 3 threads -- but that's still twice your numbers."

>>> identical hardware ... to yours ?
    If yes, your PCS results with 3 threads on your 4.0D system show 10439 pkts.
    Does that mean that you gain 30% with 4.0D, even in PCS, compared to 4.0B?


For info, I have tested the following hardware with 4.0B:

DEC 3000 - M700      : 3 threads : 3400 pkts/sec
AlphaStation 255/300 : 3 threads : 3650 pkts/sec
AlphaStation 500/400 : 3 threads : 8000 pkts/sec

Jean-Marie

10014.8. "" by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Wed Jun 04 1997 10:33 (22 lines)

>>> identical hardware ... to yours ?

Yes, I mean two AlphaStation 600 5/266 systems (ALCOR). But remember, I'm not
talking "4.0B" to "4.0D". One was a "stock", unpatched 4.0 file server -- the
other (my workstation) is a hacked-up 4.0B system with a nightly-build 4.0D
kernel and a debug thread library from my sandbox. Nevertheless, to the
limited extent that a comparison can be considered valid, I gained about 48%
(7076 -> 10439) moving from the 4.0 system to the "4.0D" system, with PCS
threads. I'm not sure how my AlphaStation 600 5/266 and your AlphaStation
500/400 compare, and I assume that's the basis of your "30%", so I can't say
whether that's even as valid as my comparison.
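
For anyone who wants to check the arithmetic, here is a small sketch of how the percentages fall out. The function name percent_gain is made up for this illustration.

```c
#include <assert.h>

/* Relative throughput gain, in percent, going from `before` to `after`
 * (both in packets/sec). */
double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

percent_gain(7076, 10439) comes out at roughly 47.5, which is the "about 48%" above. percent_gain(8000, 10439) comes out at roughly 30.5, which matches the "30%" only if that figure was taken against the 8000 packets/sec AlphaStation 500/400 result in .7, and that is a guess.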

Actually, the interesting thing about my results, in case anyone didn't
notice, is the SMP numbers. The AlphaServer 2100A numbers (with a faster chip)
are substantially worse than the AlphaStation 600 numbers -- ALWAYS, for PCS, and
also for SCS until the processors are "saturated" (there's over 300%
improvement from 3 threads to 6 threads). The SMP 12 and 24 thread numbers go
UP from there, whereas the uniprocessor numbers go down as contention
increases. I don't pretend to understand this, and I'm not inclined, at this
time, to even try.

	/dave

10014.9. "EV4 -> EV5 doubles performance at same clock" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Wed Jun 04 1997 15:40 (9 lines)

If you must compare Alphas to oranges...

System            CPU    MHz    approx SPECint95

DEC 3000/700      EV45   225?    4
Astn 255/300      EV45   300     5
Astn 600 5/266    EV5    266     8
Asvr2100A 5/300   EV5    300     9
Astn 500/400      EV56   400    12
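
To make the "same clock" comparison concrete, here is a rough sketch that divides the approximate SPECint95 figures above by clock speed. The struct and function names are made up for this illustration, the ratings are only the approximate values from the table, and the 225 MHz entry is the one marked with a question mark.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_row {
    const char *system;
    const char *cpu;
    int mhz;            /* clock speed from the table */
    int specint95;      /* approximate SPECint95 from the table */
};

/* Values copied from the table above. */
static const struct cpu_row rows[] = {
    { "DEC 3000/700",    "EV45", 225,  4 },   /* MHz uncertain in the table */
    { "Astn 255/300",    "EV45", 300,  5 },
    { "Astn 600 5/266",  "EV5",  266,  8 },
    { "Asvr2100A 5/300", "EV5",  300,  9 },
    { "Astn 500/400",    "EV56", 400, 12 },
};

/* Approximate SPECint95 delivered per MHz of clock. */
double specint_per_mhz(const struct cpu_row *r)
{
    assert(r != NULL && r->mhz > 0);
    return (double)r->specint95 / (double)r->mhz;
}
```

The EV45 rows work out to about 0.017 to 0.018 SPECint95 per MHz, and the EV5 and EV56 rows to about 0.030, so the per-clock ratio is roughly 1.7 to 1.8 with these rounded figures.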

10014.10. "" by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Thu Jun 05 1997 09:00 (19 lines)

>	DEC 3000/700	EV45	225?		 4
>	Astn 255/300	EV45	300		 5
>	Astn 600 5/266	EV5	266		 8
>	Asvr2100A 5/300 EV5	300		 9
>	Astn 500/400	EV56	400		12

So none of Jean-Marie's numbers are comparable to any of mine, even if the
versions were identical. Going by the table, the systems in .7 are EV45, EV45,
and EV56 respectively, while mine are EV5.

Thanks, Bill. Bevin Brett & I looked up the AlphaStation 255 series in
AltaVista the other day for a different problem, and we couldn't find
anything that identified the actual chip. (Annoying.) I like the "<model>
<chip>/<fudge-speed>" convention, despite the fact that the speeds are fudged
and it doesn't distinguish "4" from "45" or "5" from "56". I wish they'd just
clean that up and use it consistently instead of switching back and forth
between that and the "<model>/<fudge-speed>" style!

	/dave

10014.11. "SPEC disclosures come in handy" by PERFOM::HENNING () Fri Jun 06 1997 10:14 (5 lines)

    The workstation group managed to implement three DIFFERENT naming
    conventions in a single year (sigh).  But if you want to find out
    what's really inside any specific box, you might want to bookmark
    
        http://www.specbench.org/cgi-bin/osgresults?conf=cpu95