T.R | Title | User | Personal Name | Date | Lines |
---|
1515.1 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Thu Apr 03 1997 07:59 | 5 |
| No useful comment can possibly be made without knowing which VERSION of the
O/S you're using. At least, I can infer that you're using UNIX, which is more
information than some people manage to convey...
/dave
|
1515.2 | Lotsa possibilities... | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 03 1997 11:45 | 10 |
| I'd say that it's most likely an inter-process communication issue (i.e., a
race condition resulting from the order in which your processes are running).
However, remember that the EV5 chip takes more liberties with memory writes
than did the EV4, so if you're not using mutexes around memory which is
shared between threads (or processes??) you'll be subject to
read/write-ordering problems.
Webb
|
1515.3 | 3.2G | TUXEDO::CHUBB | | Thu Apr 03 1997 13:46 | 7 |
| Thanks for the info Webb. The fact that there's a practical
software-behavior difference between EV5 and EV4 is news to me, and may
explain why we don't hit the problem on every SMP machine.
I certainly meant to include the Unix version: it's 3.2G.
-- brandon
|
1515.4 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Thu Apr 03 1997 14:07 | 24 |
| > The fact that there's a practical
> software-behavior difference between EV5 and EV4 is news to me
Just wait 'til you see an EV6. EV4 was a good uniprocessor, but a poor
multiprocessor. EV5 is pretty slick. EV6 blazes. A good part of the
improvements come from mining the deliberate looseness of the memory ordering
and latency rules in the Alpha architecture (of which EV4 took no advantage).
If you touch any shared memory without a mutex or low-level hardware
operations, you're in big trouble!
Use a mutex for every access to shared memory (not just for WRITES... also
for READS) and you'll be fine. You can skip the mutex for READS when you are
only reading ONE value and you don't care if you get the LATEST value. If you
care about getting the latest value, or you're reading more than one value,
you always need a mutex. (The next level of optimization is to figure out the
proper use of MB instructions when you care about sequence but not latency.)
The main reason I was so concerned about the version, by the way, is that
bind_to_cpu is completely meaningless for a threaded process on 4.0 (up until
4.0D comes out) -- that would have made your apparent differences based on
CPU completely coincidental. But, of course, you should always give the
version anyway!
/dave
|
1515.5 | bind_to_cpu meaningless before 4.0D? | HYDRA::SOUZA | For Internal Use Only | Sat May 10 1997 14:58 | 8 |
| re: -1
Can you elaborate on bind_to_cpu being meaningless before 4.0D?
thanks
bob
|
1515.6 | Why do you think you need to request a binding? | WTFN::SCALES | Despair is appropriate and inevitable. | Mon May 12 1997 14:52 | 15 |
| .5> Can you elaborate on bind_to_cpu being meaningless before 4.0D?
For process contention scope threads (i.e., all threads as of V4.0, and the
default threads in V4.0D) there is no way to force a particular thread to run on
a particular CPU. That is, calling bind_to_cpu() will not have the effect that
you expect. (It may or may not return an error, depending on what version you
are calling it on.)
In the next major functional release, we expect to provide a mechanism for
restricting threads' access to the available CPUs. However, as I said, there is
no way to do that in V4.0[A|B|C], and you have to resort to using system
contention scope threads to do it on V4.0D.
Webb
|
1515.7 | scheduling description? | HYDRA::SOUZA | For Internal Use Only | Mon May 12 1997 15:01 | 5 |
| Thanks.
Is there a description of how all this scheduling works anywhere?
|
1515.8 | What do you want to know? | WTFN::SCALES | Despair is appropriate and inevitable. | Mon May 12 1997 15:31 | 19 |
| .7> Is there a description of how all this scheduling works anywhere?
Which "this scheduling"? Do you mean "process contention scope" vs. "system
contention scope"? Do you mean V3 vs V4? Do you mean "how does two-level
scheduling work"? Do you mean how do threads' scheduling policies and priorities
affect their access to the processor(s)? ...
Yours is kind of an open question... Yes, I think there are descriptions of
various aspects of "thread scheduling" in the DECthreads documentation and
elsewhere in this conference. However, none of it is too detailed, since many
of the capabilities are new and most of them have been changing in various small
ways in the past three years and will continue to do so at least through the
next release of each of the major operating systems.
I don't have the luxury of being able to write a comprehensive review here, but
if you have a specific question or two, I'd be happy to try to address it.
Webb
|
1515.9 | scheduling | HYDRA::SOUZA | For Internal Use Only | Mon May 12 1997 16:07 | 18 |
| I guess a more specific question would be better.
A customer who is familiar with what Sun calls lightweight threads asked me:
How are 'user threads' (which I think is what we would call process contention
scope) mapped to 'kernel threads' (which I think is what we would call
system contention scope), and is it possible to control this mapping?
How are threads bound to a CPU, and is it possible to control the binding?
They are running Digital Unix 4.0B.
Whether these are reasonable questions is not clear.
Thanks
bob
|
1515.10 | | SMURF::DENHAM | Digital UNIX Kernel | Mon May 12 1997 18:22 | 5 |
| Well, as starters about how this stuff works in general, see
note 9772 in tle::digital_unix.
Before V4.0D, there is no way to control any mapping of a thread
to anything, kernel thread or CPU.
|
1515.11 | It's too bad that people think they need this stuff... | WTFN::SCALES | Despair is appropriate and inevitable. | Mon May 12 1997 20:37 | 69 |
| .9> A customer who is familiar with what Sun calls lightweight threads asked me
We should have a very good story to tell, relative to Sun...however, as we well
know, Sun tends to be much better at spinning things than we are...
.9> How are 'user threads' (which I think is what we would call process
.9> contention scope) mapped to 'kernel threads' (which I think is what we
.9> would call system contention scope), and is is possible to control this
.9> mapping?
Both Digital Unix (in either V3.2 or again starting in V4.0D) and current
versions of Solaris offer the ability to create a thread which is scheduled by
the kernel. It sounds like Sun calls these "kernel threads"; we use the POSIX
term, "system contention scope thread". Either way, when a processor selects
something to run, it selects the kernel-scheduled thread with the highest
scheduling precedence and switches to its context, without regard to which
process the thread is in.
Both Digital Unix (as of V4.0) and current versions of Solaris offer threads
which are scheduled by the threads library. It sounds like Sun calls these
"user threads"; we use the POSIX term, "process contention scope thread".
(Since both process contention scope and system contention scope threads can
execute application (i.e., "user") code in non-privileged (i.e., "user") mode,
the terms "user threads" and "kernel threads" can be confusing...) These
threads are executed in the context of one or more kernel-scheduled entities,
each of which selects a process contention scope thread from those available in
the process, based on the PCS threads' scheduling parameters. (Meanwhile, the
OS kernel selects which of the "kernel-scheduled" entities to run at any given
time, based on a separate set of scheduling parameters.)
The big difference between Digital Unix and Solaris is that, when using process
contention scope threads, you don't have to guess a priori how many
kernel-scheduled entities (we call them "virtual processors" (VPs)) you will
need if you're running on Digital Unix. DECthreads creates one VP (as needed)
for each processor which is available to the process. If you have fewer VPs
than that, then your process cannot take full advantage of the machine; if you
have more VPs than that, then they fight among themselves for access to the
processors and you lose throughput to the context-switch overhead. However, Sun
doesn't currently have the ability to replace VPs when they block (e.g., in a
system call, for I/O, or for page faults). This means that sometimes your
process will have too few VPs, unless you set the "concurrency level" up, in
which case at times you'll have too many. (This is one of the major selling
points of having the full two-level scheduling model -- you can have analogous
problems if you rely solely on system contention scope threads.)
Anyway, to get back toward your question... Digital's implementation of threads
is targeted at maximizing throughput. Thus, we have tried to obviate the
need for use of big hammers like binding. Instead, we try to schedule
application threads wherever and however is most efficient from a throughput
perspective. (Also, binding tends to be very sensitive to the sort of machine
you're running on, so it's not very flexible in terms of a single executable
running on different machines; whereas, DECthreads is completely adaptive.)
Nevertheless, we are sensitive to application-providers' interest in being able
to control this stuff. In V4.0D an application is able to bind a given system
contention scope thread to a specific physical processor so that it cannot run
anywhere else (using bind_to_cpu()). Not that I would recommend it, but by
careful use of this stuff (and exhaustive knowledge of your application and
system configuration), you can arrange it so that a specific system contention
scope thread can get exclusive use of a specific processor (which is a
horrendous waste, but there you go). In the following functional release of
Digital Unix, there will be a new interface which allows you to do analogous
things with process contention scope threads (although, hopefully, the interface
will prove much more flexible and effective than bind_to_cpu()). So, if your
customer really wants to shoot his application in the foot, you can now tell him
when the bullets will be available...
Webb
|
1515.12 | ask the right question, get the right answer... | HYDRA::SOUZA | For Internal Use Only | Mon May 12 1997 20:52 | 4 |
| Thanks very much, that's very helpful.
bob
|
1515.13 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Mon May 19 1997 09:34 | 59 |
| I found Webb's reply a little confusing. So even though this is a bit late (I
was out last week), I'm going to try my own spin -- perhaps more legible
given that I know more about Solaris than Webb ;-)
Solaris:
1) The kernel provides LWPs, "light weight processes", which the kernel
schedules onto processors. Solaris supports a form of realtime scheduling
control over these LWPs, though not the POSIX 1003.1b APIs, and it also
timeslices non-realtime LWPs.
2) The Solaris thread libraries (libpthread for POSIX and libthread for UI
threads) initially create an LWP for each processor available to the process.
You can create "bound" threads (THR_BOUND flag in the UI interface, or
PTHREAD_SCOPE_SYSTEM in the POSIX interface) which are permanently attached
to a new (private) LWP. Or you can create "unbound" threads (which are the
default in both interfaces).
3) The thread library does user-mode context switching of unbound threads,
when a thread blocks on a mutex, condition variable, or read/write lock,
scheduling them among the various LWPs that aren't attached to BOUND threads.
Note that there is no support for realtime scheduling OR timeslicing in the
user mode scheduler. Furthermore, when a thread blocks in the kernel, e.g.,
on a read(2) call, the LWP on which it is currently scheduled remains bound
to the thread until it returns.
4) When the last unbound LWP in a process is blocked in the kernel, the
kernel issues a special signal to allow the library to create additional
LWPs. This still, however, reduces the concurrency of the process to 1.
Digital UNIX:
1) The kernel provides Mach threads, of which the user thread library
utilizes a special subset termed "scheduler threads" to implement what we
call "virtual processors". The kernel supports full POSIX realtime scheduling
for Mach threads, but (prior to 4.0D) this isn't useful to threaded programs.
The kernel also timeslices non-realtime Mach threads.
2) The thread library (libpthread) initially creates a scheduler thread for
each processor available to the process. Prior to 4.0D, you can create only
process contention scope threads (PCS). In 4.0D and later, you can create
either PCS threads or SCS threads (system contention scope). PCS is "unbound"
in Solaris terms, while SCS is "bound". (Digital UNIX uses the POSIX terms
exclusively, while Solaris still clings to the proprietary UI thread terms.)
3) The thread library does context switching of PCS threads among the
scheduler threads it controls. Blocking on process synchronization objects
(mutexes, condition variables, etc.) occurs completely in user mode. When a
PCS thread blocks in the kernel, an "upcall" occurs -- the kernel provides a
"replacement virtual processor" on which the library immediately schedules a
new PCS thread, if any are ready to run, instead of reducing the process
concurrency. The thread library supports the full POSIX scheduling model, and
timeslicing, in user mode.
4) The process concurrency can be artificially reduced only when the process
(or user) has exceeded the allowed quota of kernel threads, so that no
additional replacement VPs can be provided by the kernel, in which case the
process simply waits until a blocked kernel call completes. (The limitation
is entirely resource-bounded, never arbitrary.)
|