
Conference clt::cma

Title:	DECthreads Conference

Moderator:	PTHRED::MARYSTEON

Created:	Mon May 14 1990
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1553
Total number of notes:	9541

1529.0. "How to identify thread (unix)" by RDGENG::CHAMBERLIN (Danger! Do not Reverse Polarity) Wed Apr 23 1997 09:09

I'm trying to help a partner fix their multithreaded X25 application.

They are having difficulty identifying and understanding the thread IDs
reported by a number of sources:

-  dbx on a core dump (something like 0xfffffc0001e87c20)
-  the two fields given by the thread_self() function.
-  The output from ladebug or dbx on a live run.

Also, is it possible to relate these to each other?

They are currently on UNIX 3.2D and 3.2G, but hope to move to V4.0x, so I would
appreciate knowing how that would change the answer.

Thanks,
		Ian Chamberlin
		Software Partner Eng (UK)

T.R	Title	User	Personal Name	Date	Lines
1529.1	No simple correlation...	QUARRY::petert	rigidly defined areas of doubt and uncertainty	`Wed Apr 23 1997 10:05`	33
	Well, the thread IDs all mean different things.

	For dbx, live or core, the thread ID is the ID of the kernel thread, or the handle that dbx uses to access the various threads. This has changed in V4.0: not in the way that dbx views it, but the kernel interface has changed so that the handles are now usually small integers instead of some huge 64-bit value.

	> - the two fields given by the thread_self() function.

	I suspect this would be the DECthreads ID number. The DECthreads routines use a handle to manipulate the threads, which is usually a small integer assigned sequentially, more or less matching the number of threads you have open at any one time. It is basically an abstraction on top of the kernel thread IDs, which are what the thread routines really manipulate. The same kernel thread may not always be mapped to the same DECthreads number, though this may not be very apparent under 3.2. At 4.0, DECthreads introduced 2-level scheduling, which is very aggressive in the reuse of kernel threads, so no association should be taken for granted between kernel thread ID and DECthreads ID. It will change quickly.

	Ladebug tends to report the DECthreads ID unless you select a native thread mode from the debugger variables; then the thread IDs should look much like dbx's. Confusing? Well, yes. For V4.0 and above, ladebug should be used for thread debugging. Dbx made no real changes for 2-level scheduling, so it only reports the number of active threads, which is generally lower than the number of threads the user thinks he has.

	Hope that helps a little,
	PeterT
1529.2	Make up your own thread ID...we already have lots...	WTFN::SCALES	Despair is appropriate and inevitable.	`Wed Apr 23 1997 11:37`	22
	.0> They are having difficulty in identifying and understanding the thread id
	.0> which is given by a number of sources...

	I'm having trouble formulating a compact response to your query, partly because the answer changes from V3 to V4, and partly because there are lots of possibilities for what the customer is seeing on either platform. So, let's try a different tack...

	Why exactly does your customer want to do this? The "identity" of a thread is based upon what it does, not some bit-pattern identifier. That is, if you want to know which thread is which when looking with a debugger, look at the thread's stack trace, and look for the thread's start routine and its argument. Likewise, if the customer wants to log information and be able to trace it back to the logging thread, have the thread include some application-defined identifier (i.e., there's no particular need to get the ID from DECthreads).

	Hopefully, this approach will be simpler than trying to explain and reconcile all of the existing thread identifiers....

	Webb
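Webb's suggestion can be sketched in portable POSIX C. This is a minimal illustration, not code from the thread: `struct worker_ctx`, `worker()`, `worker_set_ctx()`, `worker_get_ctx()` and `app_log()` are hypothetical names. Each thread receives its application-defined ID as its start-routine argument, stores it in thread-specific data, and the log routine prefixes every line with it. Because the ID is the start routine's argument, it is also what you would look for at the base of the thread's stack in a debugger.

```c
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical per-thread context: the application, not DECthreads,
 * chooses the identifier and records what the thread is for. */
struct worker_ctx {
    int app_id;         /* application-defined thread identifier */
    const char *role;   /* e.g. "x25-call" or "timer" */
};

static pthread_key_t ctx_key;
static pthread_once_t ctx_once = PTHREAD_ONCE_INIT;

static void make_ctx_key(void)
{
    pthread_key_create(&ctx_key, NULL);
}

/* Attach ctx to the calling thread; call at the top of each start routine. */
static void worker_set_ctx(struct worker_ctx *ctx)
{
    pthread_once(&ctx_once, make_ctx_key);
    pthread_setspecific(ctx_key, ctx);
}

/* Return the calling thread's context, or NULL if none was attached. */
static struct worker_ctx *worker_get_ctx(void)
{
    pthread_once(&ctx_once, make_ctx_key);
    return pthread_getspecific(ctx_key);
}

/* Log routine: each line is prefixed with the caller's own role and ID. */
static void app_log(FILE *out, const char *fmt, ...)
{
    char line[256];
    struct worker_ctx *ctx = worker_get_ctx();
    va_list ap;
    int n = snprintf(line, sizeof line, "[%s #%d] ",
                     ctx ? ctx->role : "main", ctx ? ctx->app_id : 0);

    if (n < 0)
        n = 0;
    if ((size_t)n >= sizeof line)
        n = (int)sizeof line - 1;
    va_start(ap, fmt);
    vsnprintf(line + n, sizeof line - (size_t)n, fmt, ap);
    va_end(ap);
    fprintf(out, "%s\n", line);  /* one stdio call per log line */
}

/* Example start routine: the context arrives as the pthread_create()
 * argument, so a debugger can also find it at the base of the stack. */
static void *worker(void *arg)
{
    struct worker_ctx *ctx = arg;

    worker_set_ctx(ctx);
    app_log(stderr, "started");
    return (void *)(long)ctx->app_id;
}
```

A caller would start each thread with something like `pthread_create(&tid, NULL, worker, &ctx)`, where `ctx` must outlive the thread. Building the whole line first and writing it with a single `fprintf()` keeps lines from different threads from interleaving, since POSIX stdio calls lock the stream.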
1529.3	Confused? So am I :-)	RDGENG::CHAMBERLIN	Danger! Do not Reverse Polarity	`Thu Apr 24 1997 04:48`	45
	Thanks for the discussion so far. I was trying to keep this short, but it looks like we're in for a longer discussion, so I'll try to explain a bit more about what is happening....

	Historically, the app was developed for Digital UNIX V4.0 (but by someone with Solaris experience :-! ). On finding that the end customer was only running V3.2D/G (and I think this is partly due to X25 availability, but that's another issue), they had to do some complicated redesign: instead of being able to use thread_kill() they had to resort to polling and timers.

	This part of the application performs X25 call management, and sits between a call originating/handling system and a GSM network which connects to field operatives. Unfortunately, it's closing connections before they are done, putting multiple calls on the same X25 connection ID, trying to close X25 IDs which don't exist, etc. I'm not sure what testing is done offline, probably very little.

	They are trying to debug and fix the live system, so can't run under the debugger because of performance, and all sorts of things would time out elsewhere when threads are stopped. They are trying to debug by using the X25 CTF trace facility, information written to a log file by debug statements in the app, and debugging of the core files and the threads crash log when they occur.

	They were using the thread_self() fields to identify the thread in their own logs (because it seemed to give an "official id"), but couldn't relate these to the stack traces from the core dumps. Of course it was difficult to relate the core dump stack traces to what was actually going on because of the V3.2D threads SEGV exception handling, which they didn't appreciate. Many of the traces didn't go back into the application, and there were threads marked (noname). I've pointed them to try reinstalling the default signal action, so this may help.

	Basically, they didn't understand what the core dumps and crash log were (not?) telling them, and couldn't relate this to the information in their own debug log files. Like many, they probably have little experience of building and debugging threaded apps, and need some method of tying together information logged while the application is running with that given in core dumps. Building is covered somewhat in the developer's docs and guide; maybe this (and previous notes) has identified a need for some docs on debugging?

	Many thanks,
	Ian.
1529.4	Thread IDs: user and kernel	DCETHD::BUTENHOF	Dave Butenhof, DECthreads	`Thu Apr 24 1997 07:19`	55
	> They were using the thread_self() fields to identify the thread in their own logs (because it seemed to give an "official id")

	No, pthread_self() does not give an "id" -- it's a handle. In POSIX terms, it's an opaque identifier. In Digital's POSIX implementation, it's a 64-bit value. In our CMA and DCE thread implementations (the only interfaces available prior to Digital UNIX 4.0), it's a 128-bit value. Any interpretation of the handle is erroneous (both legally, according to the standard, and in practice, according to the implementation).

	Our <pthread.h> header includes the definition of several non-portable functions. One of these is pthread_getselfseq_np(), which returns the "sequence number" of the current thread. This is the number displayed by the DECthreads debug command -- and by ladebug in "decthreads" mode. It has no relationship to the kernel thread IDs displayed by either debugger -- nor are the kernel thread IDs of much value when debugging threaded programs. You can also use pthread_getsequence_np(pthread_t id) to return the sequence number for any thread handle. In Digital UNIX 4.0D, we're adding interfaces that allow creating threads with real (char*) names, which will be displayed in the debug interfaces and can also be retrieved by user code.

	The DCE thread interface has pthread_getunique_np(), which, like pthread_getsequence_np in the POSIX interface, returns the debug sequence number for a thread handle.

	Note that prior to 4.0, you cannot look at DECthreads information in a callback, because the debugger had to make a call inside the target process (which isn't "real" in a core file analysis). Thus, you can only see kernel thread IDs.

	Unfortunately, there is NO WAY to relate kernel thread IDs to user threads. As Peter T said in .1, prior to 4.0 the kernel thread IDs were shown as large hex numbers -- actually the kernel address of the thread structure. Even the "live" DECthreads scheduler has no way to know what this value is while running -- we only had access to the process-specific Mach "port ID" for the thread. On Digital UNIX 4.0, the proc interfaces changed to use this user port ID as the kernel thread identifier (which is why they're now small integers). While we DO know those values within the scheduler, it's no longer that interesting since ladebug lets you examine the state of user threads directly, even within a core file.

	Anyway, the result is that a pre-4.0 core file has NO real information about "thread identity", and no such information can be extracted, unless you can guess by looking at the stacks. There is no practical way to examine the user thread state within the core file, and, even if there was, there would be no way to determine which kernel thread belonged to which user thread. (You'd need a matching kernel dump, and you'd have to do a lot of manual tracing around in both the program and kernel address spaces to find all the port numbers for the user threads, and then traverse kernel data structures to translate each port number into a kernel address -- I would never attempt this myself without a debug DECthreads that gave me structure definitions, and I wouldn't recommend it to anyone else. And, as I said, if you don't have a matching kernel dump, you don't have a hope anyway, because the port to address translation information doesn't exist within the process.)

	/dave
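pthread_getselfseq_np() and pthread_getsequence_np() exist only in DECthreads; POSIX itself has no notion of a thread sequence number. If the same logging code must also build elsewhere, one option is a hypothetical `app_selfseq()` that hands each thread a small integer the first time it asks, using only standard POSIX calls. This is a sketch under that assumption: the numbers are assigned in order of first call, so they will not match what the DECthreads debug command or ladebug display. On Digital UNIX you would log pthread_getselfseq_np() instead, or as well.

```c
#include <pthread.h>
#include <stdint.h>

static pthread_key_t seq_key;
static pthread_once_t seq_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_seq = 1;   /* 0 is never handed out */

static void make_seq_key(void)
{
    pthread_key_create(&seq_key, NULL);
}

/* Hypothetical portable stand-in for pthread_getselfseq_np(): returns a
 * small integer unique to the calling thread, assigned the first time the
 * thread calls this function. These numbers do NOT match the sequence
 * numbers shown by the DECthreads debug command or ladebug. */
static unsigned long app_selfseq(void)
{
    void *v;
    unsigned long seq;

    pthread_once(&seq_once, make_seq_key);
    v = pthread_getspecific(seq_key);
    if (v != NULL)
        return (unsigned long)(uintptr_t)v;

    pthread_mutex_lock(&seq_lock);
    seq = next_seq++;                /* never reused within the process */
    pthread_mutex_unlock(&seq_lock);
    pthread_setspecific(seq_key, (void *)(uintptr_t)seq);
    return seq;
}

/* Start routine used to show that another thread gets its own number. */
static void *seq_probe(void *out)
{
    *(unsigned long *)out = app_selfseq();
    return NULL;
}
```

Storing the number directly in the thread-specific value avoids a per-thread allocation; because the counter starts at 1, a NULL value always means "not yet assigned".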
1529.5	Log SP and RA along with other info	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Thu Apr 24 1997 08:27`	13
	Given .4, I'm not sure whether this is helpful, but here are some things I've found helpful in writing log files.

	If you print the address of some local variable, you'll get an address that is within your current stack frame. This can help match the log entry to a stack, and from there to a particular thread.

	If inside your log routine you write

	    void * caller = asm("mov r26, r0");

	then caller will contain the return address from the log routine -- in other words, it identifies who called the log routine. This syntax is supported even though the return address need not actually be in r26 at the time the asm() gets executed -- the compiler understands what you're trying to do.
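Both tricks from .5 can be sketched as one helper. The asm() form above is specific to DEC C on Alpha, so this version swaps in the GCC/Clang builtin `__builtin_return_address(0)` to obtain the return address; `log_here()` is a hypothetical name, and `noinline` is there so the helper keeps a real call frame. As .9 later points out, the return address only says where the call came from (two threads running the same code report the same value), so it complements a thread identifier rather than replacing one.

```c
#include <stdio.h>

/* Hypothetical log helper. It prints:
 *   - the address of one of its own locals, which lies in the calling
 *     thread's stack and so helps match the entry to a stack (and thread);
 *   - its return address, which identifies the call site.
 * __builtin_return_address(0) is a GCC/Clang builtin standing in for the
 * DEC C asm() idiom described in the note. */
__attribute__((noinline))
static void *log_here(FILE *out, const char *msg)
{
    int marker = 0;                          /* lives in this thread's stack */
    void *caller = __builtin_return_address(0);

    fprintf(out, "stack~%p caller=%p %s\n", (void *)&marker, caller, msg);
    return caller;
}
```

The printed caller address can be looked up in the executable's symbol table (or in a debugger) to find the calling function, and the stack address can be compared against each thread's stack range in a core dump.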
1529.6	Why do people insist on using interrupt-driven programming with threads?!!!	WTFN::SCALES	Despair is appropriate and inevitable.	`Thu Apr 24 1997 15:40`	33
	.3> They were using the thread_self() fields to identify the thread in their own
	.3> logs (because it seemed to give an "official id")

	When you say "thread_self()", you really mean the undocumented, unsupported MACH function, and not the documented, supported POSIX function, "pthread_self()", right? (This confusion was a big part of why I didn't try a direct response in my previous reply...)

	.3> instead of being able to use thread_kill() they had to resort to polling and
	.3> timers.

	<SARCASM> So they had to replace part of their initial bad design with a hack, fighting kicking-and-screaming to avoid using appropriate multithreading techniques? </SARCASM>

	.3> They are trying to debug and fix the live system, so can't run under
	.3> debugger because of performance, and all sorts of things would time-out
	.3> elsewhere when threads are stopped.

	Well, for starters, I'd recommend that they run under the debugger anyway. All sorts of things will time out elsewhere when their program crashes anyway, and by running it under the debugger they can catch a SEGV at the point where it happens (and they wouldn't have to worry about the default signal handling or its effects). When they run under the debugger they don't have to use breakpoints or tracepoints, and if they don't, their performance will not be affected by the debugger (except when fatal signals occur).

	Beyond that, I'll reiterate my previous suggestion that they log an ID of their own creation, one that they can relate to their own threads themselves (i.e., something that they can find by using the debugger to look in the thread's start routine at the base of the thread's stack).

	Webb
1529.7	clever compiler..	COMEUP::SIMMONDS	loose canon	`Tue Apr 29 1997 01:22`	10
	.5> void * caller = asm("mov r26, r0");
	.5> [...] This syntax is supported even though the return address need not
	.5> actually be in r26 at the time the asm() gets executed -- the compiler
	.5> understands what you're trying to do.

	Valuable built-in insight there! Would you care to share any other similar examples you know of in the latest compilers, please, Bill?

	Thanks!
	John.
1529.8	Now lets see if I've got this right....	RDGENG::CHAMBERLIN	Danger! Do not Reverse Polarity	`Wed Apr 30 1997 07:33`	76
	Thanks for the helpful suggestions so far.

	Re .6: I have to admit to error and confusion. Their project manager told me they were using thread_self(), but they really are using pthread_self(). So much for listening to administrators!!!

	Now to summarise what I understand about thread identities and debugging. Please comment and correct as necessary.

	I use the terms DEC threads to mean POSIX 1003.1c and DCE threads to mean POSIX 1003.4a. Also, V3.2 refers to V3.2C through V3.2G (I assume there are no differences?), and V4.0 refers to V4.0 through V4.0C.

	1. V3.2 only supports DCE threads. Use the -threads flag when building apps. User threads are scheduled with a one-to-one mapping onto kernel threads.

	2. V4.0 supports both DCE threads (build with -threads) and DEC threads (build with -pthreads). V4.0 scheduling uses kernel threads more aggressively than V3.2: kernel threads may be shared between user threads, so there may not be a one-to-one mapping. (Same for both DEC threads and DCE threads.)

	3. pthread_self() returns a handle to the thread (like a Windows handle?). For DCE threads on V3.2 and on V4.0, this is a 128-bit value (strictly, a pthread_t struct containing .field1 and .field2). For DEC threads on V4.0, this is a 64-bit quantity (strictly, a pointer to a larger pthread_t structure).

	4. There is a sequence number, which is unique to each thread, and is used by ladebug to identify threads, both online and from core dumps, when in "decthreads" mode (the default), and also by the built-in threads debugger command.
	   For DEC threads on V4.0, use pthread_getselfseq_np() or pthread_getsequence_np(pthread_t id). (I think the man page is wrong, because it shows ..._np(pthread_t *id), whereas pthread.h shows ..._np(pthread_t id).)
	   pthread_getunique_np() returns the sequence number for DCE threads on V3.2 or V4.0.
	   All these are non-portable (including pthread_getsequence_np(pthread_t)?)

	5. The thread identifiers shown by dbx are kernel thread identifiers, which have no relation to the sequence numbers. Whilst on V3.2 there is a one-to-one mapping between user and kernel threads, this mapping is not known to the thread or any debugger. On V4.0, with its more aggressive use of kernel threads, there is not a one-to-one mapping, and scheduling is not known to the thread or any debugger. So with dbx there is no way of identifying threads except by their stack trace.

	6. On V3.2, SEGV is handled by an exception handler which causes the stack to be unwound, so stack traces in a core dump are meaningless. On V4.0, SEGV is handled correctly, so the core dump stack trace represents the threads running at SEGV.

	7. On both V3.2 and V4.0, debuggers will trap the SEGV, enabling the true stack to be viewed. [The stack seems true for the faulting thread, but I couldn't work it out for the others?]

	8. It's also possible to reinstall the default signal handler, signal(SIGSEGV, SIG_DFL), to produce a meaningful stack dump on V3.2.

	9. As suggested in .5, I tried identifying the caller. I had to use asm("mov %r26, %r0"); to get it to compile, based on the example in c_asm.h. It worked OK on V4.0, but on V3.2, calls from two different threads had the same caller address??

	Thanks for the help so far,
	Ian.
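Point 3 is where the original logging went astray: a pthread_t is an opaque handle, so its bits should not be printed or decoded (see .4). If code holding a handle needs to know which application thread it refers to, a portable approach is a small registry that compares handles only with pthread_equal(). This is a hypothetical sketch; `registry_add()`, `registry_lookup()`, `registry_remove()` and the fixed 64-entry table are assumptions for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary limit for this sketch */

/* Maps opaque pthread_t handles to application-defined IDs. Handles are
 * only ever compared with pthread_equal(); their bits are never printed
 * or interpreted. */
static struct {
    pthread_t handle;
    int app_id;
} registry[MAX_THREADS];
static size_t registry_len;
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Record handle -> app_id; returns 0 on success, -1 if the table is full. */
static int registry_add(pthread_t handle, int app_id)
{
    int rc = -1;

    pthread_mutex_lock(&registry_lock);
    if (registry_len < MAX_THREADS) {
        registry[registry_len].handle = handle;
        registry[registry_len].app_id = app_id;
        registry_len++;
        rc = 0;
    }
    pthread_mutex_unlock(&registry_lock);
    return rc;
}

/* Look up the app ID for a handle (e.g. pthread_self()); -1 if unknown. */
static int registry_lookup(pthread_t handle)
{
    int id = -1;
    size_t i;

    pthread_mutex_lock(&registry_lock);
    for (i = 0; i < registry_len; i++) {
        if (pthread_equal(registry[i].handle, handle)) {
            id = registry[i].app_id;
            break;
        }
    }
    pthread_mutex_unlock(&registry_lock);
    return id;
}

/* Forget a handle, e.g. after pthread_join(); the handle may be reused. */
static void registry_remove(pthread_t handle)
{
    size_t i;

    pthread_mutex_lock(&registry_lock);
    for (i = 0; i < registry_len; i++) {
        if (pthread_equal(registry[i].handle, handle)) {
            registry[i] = registry[--registry_len];  /* move last entry here */
            break;
        }
    }
    pthread_mutex_unlock(&registry_lock);
}
```

Entries should be removed once a thread has been joined or has exited, because an implementation may hand the same pthread_t value to a later thread.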
1529.9		DCETHD::BUTENHOF	Dave Butenhof, DECthreads	`Wed Apr 30 1997 08:41`	230
	> Thanks for the helpful suggestions so far.
	>
	> re .6 I have to admit to error and confusion -their project manager told me they
	> were using thread_self(), but they really are using pthread_self(). So much for
	> listening to administrators!!!
	>
	> Now to summarise what I understand about thread identities and debugging..
	>
	> Please comment and correct as necessary.
	>
	> I use the terms DEC threads to mean Posix 1003.1c and DCE threads to mean Posix
	> 1003.4a.

	"DECthreads" is the name of the "product". A little confusing, because it's not really a separate product. "POSIX threads" is the POSIX standard 1003.1c pthread interface. "DCE threads" is not a standard, and can no longer usefully or correctly be termed a "POSIX" interface. It was a draft, long ago superseded by other drafts, which have now been superseded by a standard.

	DECthreads supports a plethora of interfaces: POSIX threads, DCE threads (which is really two separate interfaces, a "pure" draft 4 and an "exception returning" draft 4), CMA (on VMS, two variants of CMA, the "open" cma_ and the "VMS calling standard" cma$), and TIS (actually 3 variants of TIS -- one modelled on POSIX, and an "open" and "VMS calling standard" variant of the original TIS). And then there's also the "CMA library services" (CMA$LIB_SHR or libcmalib), which is a trivial atomic queue package built on top of the CMA interface (once intended to grow to become much more, but now a moss-growing, dusty library mouldering on a shelf in a closet somewhere off the main basement).

	> Also, V3.2 refers to V3.2C through V3.2G (I assume there are no differences?),

	I don't even recall for sure whether we made any substantial checkins for 3.2C -- but I don't think it matters in this context.

	> V4.0 refers to V4.0 through V4.0C

	That's fine, too. Every patch we make for pre-4.0D has gone into all of those support streams. So, yeah, they're effectively identical.

	> 1. V3.2 only supports DCE threads.
	>    Use the -threads flag when building apps.
	>    User threads are scheduled with a one to one mapping on kernel threads.

	On 3.2, DECthreads supports the "legacy" interfaces: DCE threads (both varieties), CMA, and "TIS classic". (Actually, I'd better watch out for that one, since everyone still prefers "Coke classic" over "new Coke", if it even still exists -- nevertheless, the terms seem natural.) And, of course, the "library services".

	> 2. V4.0 supports both DCE threads (build with -threads) and Dec threads (build
	>    with -pthreads).
	>    V4.0 scheduling uses kernel threads more aggressively than V3.2 - kernel
	>    threads may be shared between user threads - there may not be a one to
	>    one mapping. (Same for both Dec threads and DCE threads).

	On 4.0, DECthreads supports POSIX threads, "new TIS" (an improved and streamlined TIS interface that follows the POSIX style), plus all of the legacy interfaces. We rely on new kernel support to provide 2-level scheduling so that we use kernel threads as "virtual processors". The association between user thread and kernel thread is as dynamic as the association between traditional kernel threads and physical processors. (Actually more so, since we don't yet support "affinity" between user thread and either virtual or physical processor.)

	> 3. pthread_self() returns a handle to the thread (like a windows handle?)
	>    For DCE threads on V3.2 and on V4.0, this is a 128 bit value
	>    (strictly a pthread_t struct containing .field1 and .field2)
	>    For Dec threads on V4.0, this is a 64 bit quantity (strictly a pointer to
	>    a larger pthread_t structure)

	pthread_self() returns a pthread_t. POSIX states that this is an opaque value that cannot be used for anything except the defined POSIX interfaces. In 3.2, this was also true for the implementation. On 4.0, we have provided an "architected" definition. You're better off ignoring that in most cases and sticking to the defined interfaces, but a pthread_t is a pointer to a TEB (Thread Environment Block). The definition of the TEB is public, in <sys/types.h>, and you can legally write code to reference the public fields of the TEB. (The structure is well commented, and be careful to follow the rules.) In particular, the sequence number is available. (The pthread_getsequence_np and pthread_getselfseq_np "interfaces" are macros that reference the TEB.) Of course the TEB is only part of the real thread structure, but the rest is purely internal information.

	> 4. There is a sequence number, which is unique to each thread, and is
	>    used by Ladebug to identify threads, both on-line and from coredumps,
	>    when in "decthreads" mode (the default), and also by the inbuilt threads
	>    debugger command.

	Yes, it's a field in the TEB.

	>    For Dec threads on V4.0, use pthread_getselfseq_np() or
	>    pthread_getsequence_np(pthread_t id) (I think the man page is wrong,
	>    because it shows ..._np(pthread_t *id), whereas pthread.h shows
	>    ..._np(pthread_t id)

	Uh huh. We'll have to make sure this gets to our writer. The pthread_getsequence_np man page also claims conformance to IEEE Std 1003.1c-1995, which is incorrect.

	>    pthread_getunique_np() returns the sequence number for DCE threads on V3.2
	>    or for V4.0.
	>    All these are non portable (including pthread_getsequence_np(pthread_t) ?)

	Yup. POSIX doesn't have the concept of "sequence number". (Too bad.) Of course, "portability" is a relative term. All implementations of DECthreads (OpenVMS, Digital UNIX, Win32, and even ULTRIX) provide the DCE thread extension pthread_getunique_np and the POSIX extension pthread_getsequence_np. Additionally, all implementations (by other vendors) of the DCE thread interface should have pthread_getunique_np.

	> 5. The thread identifiers shown by dbx are kernel thread identifiers, which have
	>    no relation to the sequence numbers.
	>    Whilst on V3.2 there is a one to one mapping between user and kernel
	>    threads, this mapping is not known to the thread or any debugger.

	That's not entirely true. We do know the mapping between user thread and kernel thread, and the cma_debug "thread -f" command will show the thread's kernel thread. The problem is that the proc filesystem (and therefore dbx, and ladebug in "native" thread mode) uses a DIFFERENT identification for the kernel thread. We have no way to translate between them, and the debuggers have only very limited ways to translate. While we know only the process-specific Mach port ID, proc uses the kernel port ID. The kernel knows how to translate between them, but it's a tedious process (each Mach task has a queue of port translation records -- you have to traverse the queue until you find the one containing the user or kernel port ID you want to translate).

	In 4.0, proc was changed to use the process port ID. But of course, at the same time, the mapping between user and kernel threads became dynamic and, in general, much less interesting and useful.

	>    On V4.0, with its more aggressive use of Kernel threads, there is not a
	>    one to one mapping, and scheduling is not known to the thread or any
	>    debugger.

	I'm not sure what you mean by "scheduling" here. DECthreads always knows on which kernel thread each user thread is currently running. The translation is essential to the ladebug "decthreads" mode, in fact, because the user mode data structures don't contain much of the state relevant to a thread that's currently running or blocked within the kernel. However, dbx doesn't know how to use the libpthreaddebug library that provides all this information, and when you set the ladebug thread mode to "native" you're explicitly telling it not to use the library.

	>    So with dbx there is no way of identifying threads except by their stack
	>    trace.

	Definitely true in 3.2 (and, unfortunately, in 4.0, because when we initially moved the debug subsystem into libpthreaddebug.so we didn't take the time to capture all of the information). In 4.0D, it's not quite true. Although dbx doesn't support libpthreaddebug, you can "call pthread_debug()" to get at the internal command parser. This dlopens libpthreaddebug inside the process you're debugging -- the environment is very fragile and you can get into trouble, but, most of the time, it more or less works. You can then use "thread -f", which will show the "vp ID" for currently running threads. This is a decimal number that will correspond to one of the hex numbers dbx shows in a "tlist" command. (Neither dbx nor ladebug were changed to use decimal numbers for kernel threads when proc changed to use "low integer" process port IDs instead of kernel address (large number) kernel port IDs.)

	> 6. On V3.2, SEGV is handled by an exception handler which causes the stack to be
	>    unwound, so stack traces in a coredump are meaningless.
	>    On V4.0, SEGV is handled correctly, so the coredump stack trace represents
	>    the threads running at SEGV.

	That depends a lot on how you define "correctly". There were limitations in the exception model we used in 3.2, which prevented us from aborting the process on an unhandled exception with the stack intact. While that was unfortunate for some debugging situations, it was not "incorrect".

	When we moved to 4.0, we changed over to use the libexc "standard" exception mechanism, which allows us to detect an unhandled exception without unwinding the stack. But although it's possible, it's not trivial, and we messed up in 4.0. This has been fixed in a patch, and will be correct in 4.0D.

	> 7. On both V3.2 and V4.0, debuggers will trap the SEGV, so enabling the true
	>    stack to be viewed. [The stack seems true for the faulting thread, but I
	>    couldn't work it out for the others?].

	If you don't see a reasonable stack for other threads, something in your process is probably corrupted.

	> 8. Its also possible to re-install the default signal handler - signal(SIGSEGV,
	>    SIG_DFL), to produce a meaningful stack dump on V3.2.

	That'll work on 4.0, though you don't need it if you have the patch. It's not nearly so easy as that sounds on 3.2, because signal handlers were per-thread, not per-process. You'd need to change the signal handler FOR THE THREAD THAT GETS THE SEGV. (After the thread starts running.)

	> 9. As suggested in .5 I tried identifying the caller - had to use
	>    asm("mov %r26, %r0"); to get it to compile - based on the example in
	>    c_asm.h - It worked OK on V4.0, but on V3.2, calls from two different
	>    threads had the same caller address??

	Of course you can have the same caller address from different threads. They're all running the same code. You use the ra to get some idea of where you are in your code, and the thread sequence number to tell in which thread you're there.
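Dave's reply to point 8 (the SIGSEGV action being per-thread on 3.2, so it must be reset in the thread that faults, after it starts) suggests wrapping every start routine so that each new thread resets the action itself before doing anything else. This is a minimal sketch assuming the POSIX 1003.1c pthread_create() signature; the V3.2 DCE (draft 4) interface declares pthread_create() differently, so the wrapper would need adapting there. `create_thread_segv_default()`, `segv_default_start()` and `echo_worker()` are hypothetical names. On a 4.0 system with the exception fix the reset should not be needed at all.

```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>

/* Real start routine and argument, handed to the wrapper below. */
struct start_args {
    void *(*fn)(void *);
    void *arg;
};

/* Wrapper start routine: restores the default SIGSEGV action in the new
 * thread before running the real routine. On DECthreads V3.2 signal
 * actions were per-thread, so this has to happen in each thread after it
 * starts; under POSIX threads the action is per-process, so it is merely
 * redundant. */
static void *segv_default_start(void *p)
{
    struct start_args a = *(struct start_args *)p;

    free(p);
    signal(SIGSEGV, SIG_DFL);
    return a.fn(a.arg);
}

/* Hypothetical drop-in replacement for pthread_create(). */
static int create_thread_segv_default(pthread_t *t, const pthread_attr_t *attr,
                                      void *(*fn)(void *), void *arg)
{
    struct start_args *p = malloc(sizeof *p);
    int rc;

    if (p == NULL)
        return EAGAIN;
    p->fn = fn;
    p->arg = arg;
    rc = pthread_create(t, attr, segv_default_start, p);
    if (rc != 0)
        free(p);          /* the wrapper never ran, so free it here */
    return rc;
}

/* Trivial start routine for demonstration: returns its argument. */
static void *echo_worker(void *arg)
{
    return arg;
}
```

The wrapper copies the arguments and frees the heap block before calling the real routine, so the real routine's return value passes straight through to pthread_join().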
1529.10	Thanks a LOT for the info	EDSCLU::GARROD	IBM Interconnect Engineering	`Wed Apr 30 1997 13:30`	9
	Re .-1: I don't think I've ever seen a note before with quite so high a density of really useful information. I'm definitely saving this one off for future reference.

	Many thanks,
	Dave
1529.11	Thanks indeed - I agree with .10	RDGENG::CHAMBERLIN	Danger! Do not Reverse Polarity	`Thu May 01 1997 09:20`	11
	I agree with .10. Wholehearted thanks to Dave and everyone for their advice, suggestions and patience.

	One last request: is there a pointer to the V4.0 exception fix Dave mentioned in .9? Is this needed for all of V4.0 to V4.0C?

	Many thanks,
	Ian.
1529.12	Some unwinding may still occur.	WTFN::SCALES	Despair is appropriate and inevitable.	`Mon May 05 1997 14:25`	19
	.9> When we moved to 4.0, we changed over to use the libexc "standard"
	.9> exception mechanism, which allows us to detect an unhandled exception
	.9> without unwinding the stack.

	However, this capability depends on functionality implemented by macros in the application code. So, if your application contains code which was not compiled on V4, some stack unwinding will occur during exception propagation, and the stack will show the frame at which the last raise or reraise occurred (as opposed to where the original exception occurred).

	.8> The stack seems true for the faulting thread, but I couldn't work it out for
	.8> the others?

	It's also possible that these threads were blocked in system calls -- I believe that in this case the stack trace will appear truncated.

	Webb