T.R | Title | User | Personal Name | Date | Lines |
---|
1529.1 | No simple correlation... | QUARRY::petert | rigidly defined areas of doubt and uncertainty | Wed Apr 23 1997 11:05 | 33 |
| Well, the thread IDs all mean different things.
For dbx, live or core, the thread ID is the ID of the kernel thread, that is,
the handle that dbx uses to access the various threads. This changed in
V4.0: not in the way that dbx views it, but in the kernel interface, so that
the handles are now usually small integers instead of some huge 64-bit
value.
- the two fields given by the thread_self() function.
I suspect this would be the DECthreads ID number. The DECthreads routines
use a handle to manipulate the threads, which is usually a small integer
value assigned sequentially, more or less matching the number of
threads you have open at any one time. It is basically an abstraction on
top of the kernel thread IDs, which are what the thread routines really
manipulate. The same kernel thread may not always be mapped
to the same DECthreads number, though this may not be very apparent under
the 3.2 system. At 4.0, DECthreads introduced two-level scheduling, which
is very aggressive in the reuse of kernel threads, so no association should
be taken for granted between the kernel thread ID and the DECthreads ID. It
will change quickly.
Ladebug tends to report the DECthreads ID unless you select a native thread
mode from the debugger variables. Then the thread IDs should look much like
dbx's.
Confusing? Well, yes. For V4.0 and above, ladebug should be used for
thread debugging. Dbx made no real changes for two-level scheduling, so
it only reports the number of active threads, which is generally lower than
the number of threads the user thinks he has.
Hope that helps a little,
PeterT
|
1529.2 | Make up your own thread ID...we already have lots... | WTFN::SCALES | Despair is appropriate and inevitable. | Wed Apr 23 1997 12:37 | 22 |
| .0> They are having difficulty in identifying and understanding the thread id
.0> which is given by a number of sources...
I'm having trouble formulating a compact response to your query. Partly
because the answer changes from V3 to V4, and partly because there are lots
of possibilities for what the customer is seeing on either platform. So,
let's try a different tack...
Why exactly does your customer want to do this? The "identity" of a thread
is based upon what it does, not some bit-pattern identifier. That is, if you
want to know which thread is which when looking with a debugger, look at the
thread's stack trace, and look for the thread's start routine and its
argument. Likewise, if the customer wants to log information and be able to
trace it back to the logging thread, have the thread include some
application-defined identifier (i.e., there's no particular need to get the
ID from DECthreads).
Hopefully, this approach will be simpler than trying to explain and reconcile
all of the existing thread identifiers....
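
A minimal sketch of the application-defined identifier idea, assuming the
POSIX 1003.1c interfaces (the draft 4 DCE thread equivalents differ in
detail). The names thread_ctx, thread_ctx_set, log_format and worker are
hypothetical, made up for illustration. The context is passed as the start
routine's argument, so it is visible at the base of the thread's stack in a
debugger, and it is also stored as thread-specific data so the log routine
can find it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-thread context. The application, not the thread
 * library, chooses the name and number. */
typedef struct {
    const char *name;          /* e.g. "call-handler" */
    int         app_id;        /* application-assigned number */
    char        last_line[80]; /* last formatted log line (demo only) */
} thread_ctx;

static pthread_key_t  ctx_key;
static pthread_once_t ctx_once = PTHREAD_ONCE_INIT;

static void make_ctx_key(void)
{
    int rc = pthread_key_create(&ctx_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Attach the context to the calling thread. */
void thread_ctx_set(thread_ctx *ctx)
{
    pthread_once(&ctx_once, make_ctx_key);
    pthread_setspecific(ctx_key, ctx);
}

/* Format a log line prefixed with the calling thread's own identifier.
 * Returns an snprintf-style length, or a negative value on error. */
int log_format(char *buf, size_t len, const char *fmt, ...)
{
    const thread_ctx *ctx;
    va_list ap;
    int n, m;

    pthread_once(&ctx_once, make_ctx_key);
    ctx = pthread_getspecific(ctx_key);
    n = snprintf(buf, len, "[%s#%d] ",
                 ctx ? ctx->name : "unnamed", ctx ? ctx->app_id : -1);
    if (n < 0 || (size_t)n >= len)
        return n;
    va_start(ap, fmt);
    m = vsnprintf(buf + n, len - (size_t)n, fmt, ap);
    va_end(ap);
    return m < 0 ? m : n + m;
}

/* Start routine: its argument is the context, so a stack trace of this
 * thread shows the application's identifier at the base of the stack. */
void *worker(void *arg)
{
    thread_ctx *ctx = arg;

    memset(ctx->last_line, 0, sizeof ctx->last_line);
    thread_ctx_set(ctx);
    log_format(ctx->last_line, sizeof ctx->last_line,
               "closing connection %d", 7);
    return NULL;
}
```

A thread created with a context of { "call-handler", 1 } would format its
line as "[call-handler#1] closing connection 7". A real application would
write the line to its log file rather than keep it in the context.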
Webb
|
1529.3 | Confused? So am I :-) | RDGENG::CHAMBERLIN | Danger! Do not Reverse Polarity | Thu Apr 24 1997 05:48 | 45 |
| Thanks for the discussion so far. I was trying to keep this short, but it looks
like we're in for a longer discussion, so I'll try and explain a bit more what
is happening....
Historically, the app was developed for Digital UNIX V4.0 (but by someone with
Solaris experience :-! ). On finding that the end customer was only running
V3.2D/G (and I think this is partly due to X25 availability, but that's
another issue), they had to do some complicated redesign: instead of being able
to use thread_kill() they had to resort to polling and timers.
This part of the application performs X25 call management, and sits between a
call originating/handling system, and a GSM network which connects to field
operatives. Unfortunately, it's closing connections before they are done, putting
multiple calls on the same X25 connection ID, trying to close X25 IDs which
don't exist, etc. I'm not sure what testing is done off line - probably very
little. They are trying to debug and fix the live system, so can't run under
debugger because of performance, and all sorts of things would time-out elsewhere
when threads are stopped.
They are trying to debug by using the X25 CTF trace facility, information
written to a log file by debug statements in the app, debug of the core files
and the threads crash log when they occur. They were using the thread_self()
fields to identify the thread in their own logs (because it seemed to give an
"official id"), but couldn't relate these to the stack trace from the coredumps.
Of course it was difficult to relate the coredump stack traces to what was
actually going on because of the V3.2D threads SEGV exception handling, which
they didn't appreciate. Many of the traces didn't go back into the application,
and there were threads marked (noname). I've pointed them to try re-installing
the default signal action, so this may help.
Basically, they didn't understand what the core dumps and crash log were (not?)
telling them, and couldn't relate this to the information in their own debug log
files. Like many, they probably have little experience of building and
debugging threaded apps, and need some method of tying together information
logged when the application is running, with that given in coredumps. Building
is covered somewhat in the developers docs and guide - maybe this (and previous
notes) has identified a need for some docs on debugging?
Many thanks,
Ian.
|
1529.4 | Thread IDs: user and kernel | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Thu Apr 24 1997 08:19 | 55 |
| > They were using the thread_self()
> fields to identify the thread in their own logs (because it seemed to give
>an "official id")
No, pthread_self() does not give an "id" -- it's a handle. In POSIX terms,
it's an opaque identifier. In Digital's POSIX implementation, it's a 64-bit
value. In our CMA and DCE thread implementations (the only interfaces
available prior to Digital UNIX 4.0), it's a 128-bit value. Any
interpretation of the handle is erroneous (both legally, according to the
standard, and in practice, according to the implementation).
Our <pthread.h> header includes the definition of several non-portable
functions. One of these is pthread_getselfseq_np(), which returns the
"sequence number" of the current thread. This is the number displayed by the
DECthreads debug command -- and by ladebug in "decthreads" mode. It has no
relationship to the kernel thread IDs displayed by either debugger -- nor are
the kernel thread IDs of much value when debugging threaded programs. You can
also use pthread_getsequence_np(pthread_t id) to return the sequence number
for any thread handle. In Digital UNIX 4.0D, we're adding interfaces that
allow creating threads with real (char*) names, which will be displayed in
the debug interfaces and can also be retrieved by user code.
The DCE thread interface has pthread_getunique_np(), which, like
pthread_getsequence_np in the POSIX interface, returns the debug sequence
number for a thread handle.
Note that prior to 4.0, you cannot look at DECthreads information in a
callback, because the debugger had to make a call inside the target process
(which isn't "real" in a core file analysis). Thus, you can only see kernel
thread IDs. Unfortunately, there is NO WAY to relate kernel thread IDs to
user threads. As Peter T said in .1, prior to 4.0 the kernel thread IDs were
shown as large hex numbers -- actually the kernel address of the thread
structure. Even the "live" DECthreads scheduler has no way to know what this
value is while running -- we only had access to the process-specific Mach
"port ID" for the thread. On Digital UNIX 4.0, the proc interfaces changed to
use this user port ID as the kernel thread identifier (which is why they're
now small integers). While we DO know those values within the scheduler, it's
no longer that interesting since ladebug lets you examine the state of user
threads directly, even within a core file.
Anyway, the result is that a pre-4.0 core file has NO real information about
"thread identity", and no such information can be extracted, unless you can
guess by looking at the stacks. There is no practical way to examine the user
thread state within the core file, and, even if there was, there would be no
way to determine which kernel thread belonged to which user thread. (You'd
need a matching kernel dump, and you'd have to do a lot of manual tracing
around in both the program and kernel address spaces to find all the port
numbers for the user threads, and then traverse kernel data structures to
translate each port number into a kernel address -- I would never attempt
this myself without a debug DECthreads that gave me structure definitions,
and I wouldn't recommend it to anyone else. And, as I said, if you don't have
a matching kernel dump, you don't have a hope anyway, because the port to
address translation information doesn't exist within the process.)
/dave
|
1529.5 | Log SP and RA along with other info | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Thu Apr 24 1997 09:27 | 13 |
| Given .4 I'm not sure whether this is helpful, but here are some things
I've found helpful in writing log files.
If you print the address of some local variable, you'll get an address that
is within your current stack frame. This can help match the log entry to
a stack, and from there to a particular thread.
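
A minimal sketch of the local-variable trick, in standard C. The names
stack_hint, format_log_line and hint_in_stack are hypothetical. The logged
value only approximates the stack pointer, but it falls inside the calling
thread's stack, so it can be compared against the stack ranges the debugger
shows for each thread.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Return an address inside the current thread's stack. It is returned as
 * an integer so that no dangling pointer escapes the function. */
uintptr_t stack_hint(void)
{
    volatile int marker = 0;   /* lives on the calling thread's stack */

    return (uintptr_t)&marker;
}

/* Prefix a log message with the stack hint of the calling thread. */
int format_log_line(char *buf, size_t len, const char *msg)
{
    return snprintf(buf, len, "[sp~0x%" PRIxPTR "] %s", stack_hint(), msg);
}

/* When reading a dump: does the logged hint fall inside the stack that
 * starts at stack_low and is stack_size bytes long? */
int hint_in_stack(uintptr_t hint, uintptr_t stack_low, size_t stack_size)
{
    return hint >= stack_low && hint - stack_low < stack_size;
}
```

The exact address depends on how deep the log call is nested, so it is
matched against a thread's stack range rather than an exact frame.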
If inside your log routine you write
void * caller = asm("mov r26, r0");
then caller will contain the return address from the log routine -- in other
words, it identifies who called the log routine. This syntax is supported
even though the return address need not actually be in r26 at the time the
asm() gets executed -- the compiler understands what you're trying to do.
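
For compilers without the DEC C asm() support, a rough equivalent can be
sketched with __builtin_return_address(0). This is a swapped-in technique,
a GCC/Clang extension, not the asm() form above. The name log_caller is
hypothetical. Note that the value identifies the call site, not the thread.

```c
#include <assert.h>
#include <stdio.h>

/* noinline keeps the log routine in its own frame, so the return address
 * really is that of the code which called it. */
__attribute__((noinline))
void *log_caller(char *buf, size_t len, const char *msg)
{
    void *caller = __builtin_return_address(0);

    snprintf(buf, len, "[ra=%p] %s", caller, msg);
    return caller;
}
```

Two calls made from different places in the source give different
addresses, which a debugger or symbol table can map back to a line.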
|
1529.6 | Why do people insist on using interrupt-driven programming with threads?!!! | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 24 1997 16:40 | 33 |
| .3> They were using the thread_self() fields to identify the thread in their own
.3> logs (because it seemed to give an "official id")
When you say "thread_self()", you really mean the undocumented, unsupported MACH
function, and not the documented, supported POSIX function, "pthread_self()",
right? (This confusion was a big part of why I didn't try a direct response in
my previous reply...)
.3> instead of being able to use thread_kill() they had to resort to polling and
.3> timers.
<SARCASM> So they had to replace part of their initial bad design with a hack,
fighting kicking-and-screaming to avoid using appropriate multithreading
techniques? </SARCASM>
.3> They are trying to debug and fix the live system, so can't run under
.3> debugger because of performance, and all sorts of things would time-out
.3> elsewhere when threads are stopped.
Well, for starters, I'd recommend that they run under the debugger anyway. All
sorts of things will time out elsewhere when their program crashes anyway, and by
running it under the debugger they can catch a SEGV at the point where it
happens (and they wouldn't have to worry about the default signal handling or
its effects). When they run under the debugger they don't have to use
breakpoints or tracepoints, and if they don't then their performance will not be
affected by the debugger (except when fatal signals occur). Beyond that, I'll
reiterate my previous suggestion that they log an ID of their own creation, one
that they can relate to their own threads themselves (i.e., something that they
can find by using the debugger to look in the thread's start routine at the base
of the thread's stack).
Webb
|
1529.7 | clever compiler.. | COMEUP::SIMMONDS | loose canon | Tue Apr 29 1997 02:22 | 10 |
| .5> void * caller = asm("mov r26, r0");
.5> [...] This syntax is supported
.5> even though the return address need not actually be in r26 at the time the
.5> asm() gets executed -- the compiler understands what you're trying to do.
Valuable built-in insight there! Would you Care to Share any other
similar examples which you know of in the latest compilers, please Bill?
Thanks!
John.
|
1529.8 | Now lets see if I've got this right.... | RDGENG::CHAMBERLIN | Danger! Do not Reverse Polarity | Wed Apr 30 1997 08:33 | 76 |
| Thanks, for the helpful suggestions so far.
re .6 I have to admit to error and confusion - their project manager told me they
were using thread_self(), but they really are using pthread_self(). So much for
listening to administrators!!!
Now to summarise what I understand about thread identities and debugging..
Please comment and correct as necessary.
I use the terms DEC threads to mean Posix 1003.1c and DCE threads to mean Posix
1003.4a.
Also, V3.2 refers to V3.2C through V3.2G (I assume there are no differences?),
V4.0 refers to V4.0 through V4.0C
1. V3.2 only supports DCE threads.
Use the -threads flag when building apps.
User threads are scheduled with a one to one mapping on kernel threads.
2. V4.0 supports both DCE threads (build with -threads) and Dec threads (build
with -pthreads).
V4.0 scheduling uses kernel threads more aggressively than V3.2 - kernel
threads may be shared between user threads - there may not be a one to
one mapping. (Same for both Dec threads and DCE threads).
3. pthread_self() returns a handle to the thread (like a windows handle?)
For DCE threads on V3.2 and on V4.0, this is a 128 bit value
(strictly a pthread_t struct containing .field1 and .field2)
For Dec threads on V4.0, this is a 64 bit quantity (strictly a pointer to
a larger pthread_t structure)
4. There is a sequence number, which is unique to each thread, and is
used by Ladebug to identify threads, both on-line and from coredumps,
when in "decthreads" mode (the default), and also by the inbuilt threads
debugger command.
For Dec threads on V4.0, use pthread_getselfseq_np() or
pthread_getsequence_np(pthread_t id) (I think the man page is wrong,
because it shows ..._np(pthread_t *id), whereas pthread.h shows
..._np(pthread_t id))
pthread_getunique_np() returns the sequence number for DCE threads on V3.2
or for V4.0.
All these are non portable (including pthread_getsequence_np(pthread_t) ?)
5. The thread identifiers shown by dbx are kernel thread identifiers, which have
no relation to the sequence numbers.
Whilst on V3.2 there is a one to one mapping between user and kernel
threads, this mapping is not known to the thread or any debugger.
On V4.0, with its more aggressive use of kernel threads, there is not a
one to one mapping, and scheduling is not known to the thread or any
debugger.
So with dbx there is no way of identifying threads except by their stack
trace.
6. On V3.2, SEGV is handled by an exception handler which causes the stack to be
unwound, so stack traces in a coredump are meaningless.
On V4.0, SEGV is handled correctly, so the coredump stack trace represents
the threads running at SEGV.
7. On both V3.2 and V4.0, debuggers will trap the SEGV, so enabling the true
stack to be viewed. [The stack seems true for the faulting thread, but I
couldn't work it out for the others?].
8. It's also possible to re-install the default signal handler - signal(SIGSEGV,
SIG_DFL), to produce a meaningful stack dump on V3.2.
9. As suggested in .5 I tried identifying the caller - had to use asm("mov %r26,
%r0"); to get it to compile - based on the example in c_asm.h - It
worked OK on V4.0, but on V3.2, calls from two different threads had the
same caller address??
Thanks for the help so far,
Ian.
|
1529.9 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Wed Apr 30 1997 09:41 | 230 |
| >Thanks, for the helpful suggestions so far.
>
>re .6 I have to admit to error and confusion - their project manager told me
>they were using thread_self(), but they really are using pthread_self(). So
>much for listening to administrators!!!
>
>Now to summarise what I understand about thread identities and debugging..
>
>Please comment and correct as necessary.
>
>I use the terms DEC threads to mean Posix 1003.1c and DCE threads to mean
>Posix 1003.4a.
"DECthreads" is the name of the "product". A little confusing, because
it's not really a separate product. "POSIX threads" is the POSIX
standard 1003.1c pthread interface. "DCE threads" is not a standard,
and can no longer usefully or correctly be termed a "POSIX"
interface. It was a draft, long ago superseded by other drafts, which
have now been superseded by a standard.
DECthreads supports a plethora of interfaces: POSIX threads, DCE
threads (which is really two separate interfaces, a "pure" draft 4
and an "exception returning" draft 4), CMA (on VMS, two variants of
CMA, the "open" cma_ and the "VMS calling standard" cma$), and TIS
(actually 3 variants of TIS -- one modelled on POSIX, and an "open"
and "VMS calling standard" variant of the original TIS). And then
there's also the "CMA library services" (CMA$LIB_SHR or libcmalib)
which is a trivial atomic queue package built on top of the CMA
interface (once intended to grow to become much more, but now a
moss-growing, dusty library mouldering on a shelf in a closet
somewhere off the main basement).
>Also, V3.2 refers to V3.2C through V3.2G (I assume there are no
>differences?),
I don't even recall for sure whether we made any substantial checkins
for 3.2C -- but I don't think it matters in this context.
> V4.0 refers to V4.0 through V4.0C
That's fine, too. Every patch we make for pre-4.0D has gone into all
of those support streams. So, yeah, they're effectively identical.
>1. V3.2 only supports DCE threads.
> Use the -threads flag when building apps.
> User threads are scheduled with a one to one mapping on kernel threads.
On 3.2, DECthreads supports the "legacy" interfaces. DCE threads (both
varieties), CMA, and "TIS classic". (Actually, I'd better watch out
for that one, since everyone still prefers "Coke classic" over "new
Coke", if it even still exists -- nevertheless, the terms seem
natural.) And, of course, the "library services".
>2. V4.0 supports both DCE threads (build with -threads) and Dec threads
> (build with -pthreads).
> V4.0 scheduling uses kernel threads more aggressively than V3.2 - kernel
> threads may be shared between user threads - there may not be a one to
> one mapping. (Same for both Dec threads and DCE threads).
On 4.0, DECthreads supports POSIX threads, "new TIS" (an improved and
streamlined TIS interface that follows the POSIX style), plus all of
the legacy interfaces. We rely on new kernel support to provide
2-level scheduling so that we use kernel threads as "virtual
processors". The association between user thread and kernel thread is
as dynamic as the association between traditional kernel threads and
physical processors. (Actually more so, since we don't yet support
"affinity" between user thread and either virtual or physical
processor.)
>3. pthread_self() returns a handle to the thread (like a windows handle?)
> For DCE threads on V3.2 and on V4.0, this is a 128 bit value
> (strictly a pthread_t struct containing .field1 and .field2)
> For Dec threads on V4.0, this is a 64 bit quantity (strictly a pointer to
> a larger pthread_t structure)
pthread_self() returns a pthread_t. POSIX states that this is an
opaque value that cannot be used for anything except the defined POSIX
interfaces. In 3.2, this was also true for the implementation. On 4.0,
we have provided an "architected" definition. You're better off
ignoring that in most cases and sticking to the defined interfaces,
but a pthread_t is a pointer to a TEB (Thread Environment Block). The
definition of the TEB is public, in <sys/types.h>, and you can legally
write code to reference the public fields of the TEB. (The structure
is well commented, and be careful to follow the rules.) In particular,
the sequence number is available. (The pthread_getsequence_np and
pthread_getselfseq_np "interfaces" are macros that reference the TEB.)
Of course the TEB is only part of the real thread structure, but the
rest is purely internal information.
>4. There is a sequence number, which is unique to each thread, and is
> used by Ladebug to identify threads, both on-line and from coredumps,
> when in "decthreads" mode (the default), and also by the inbuilt threads
> debugger command.
Yes, it's a field in the TEB.
> For Dec threads on V4.0, use pthread_getselfseq_np() or
> pthread_getsequence_np(pthread_t id) (I think the man page is wrong,
> because it shows ..._np(pthread_t *id), whereas pthread.h shows
> ..._np(pthread_t id))
Uh huh. We'll have to make sure this gets to our writer. The
pthread_getsequence_np man page also claims conformance to IEEE Std
1003.1c-1995, which is incorrect.
> pthread_getunique_np() returns the sequence number for DCE threads on V3.2
> or for V4.0.
> All these are non portable (including pthread_getsequence_np(pthread_t) ?)
Yup. POSIX doesn't have the concept of "sequence number". (Too bad.)
Of course, "portability" is a relative term. All implementations of
DECthreads (OpenVMS, Digital UNIX, Win32, and even ULTRIX) provide the
DCE thread extension pthread_getunique_np and the POSIX extension
pthread_getsequence_np. Additionally, all implementations (by other
vendors) of the DCE thread interface should have pthread_getunique_np.
>5. The thread identifiers shown by dbx are kernel thread identifiers, which
> have no relation to the sequence numbers.
> Whilst on V3.2 there is a one to one mapping between user and kernel
> threads, this mapping is not known to the thread or any debugger.
That's not entirely true. We do know the mapping between user thread
and kernel thread, and the cma_debug "thread -f" command will show the
thread's kernel thread. The problem is that the proc filesystem (and
therefore dbx, and ladebug in "native" thread mode) uses a DIFFERENT
identification for the kernel thread. We have no way to translate
between them, and the debuggers have only very limited ways to
translate. While we know only the process-specific Mach port id, proc
uses the kernel port id. The kernel knows how to translate between
them, but it's a tedious process (each Mach task has a queue of port
translation records -- you have to traverse the queue until you find
the one containing the user or kernel port id you want to translate).
In 4.0, proc was changed to use the process port id. But of course, at
the same time, the mapping between user and kernel threads became
dynamic and, in general, much less interesting and useful.
> On V4.0, with its more aggressive use of kernel threads, there is not a
> one to one mapping, and scheduling is not known to the thread or any
> debugger.
I'm not sure what you mean by "scheduling" here. DECthreads always
knows on which kernel thread each user thread is currently
running. The translation is essential to the ladebug "decthreads"
mode, in fact, because the user mode data structures don't contain
much of the state relevant to a thread that's currently running or
blocked within the kernel. However, dbx doesn't know how to use the
libpthreaddebug library that provides all this information, and when
you set the ladebug thread mode to "native" you're explicitly telling
it not to use the library.
> So with dbx there is no way of identifying threads except by their stack
> trace.
Definitely true in 3.2 (and, unfortunately, in 4.0, because when we
initially moved the debug subsystem into libpthreaddebug.so we didn't
take the time to capture all of the information). In 4.0D, it's not
quite true. Although dbx doesn't support libpthreaddebug, you can
"call pthread_debug()" to get at the internal command parser. This
dlopens libpthreaddebug inside the process you're debugging -- the
environment is very fragile and you can get into trouble, but, most of
the time, it more or less works. You can then use "thread -f", which
will show the "vp ID" for currently running threads. This is a decimal
number that will correspond to one of the hex numbers dbx shows in a
"tlist" command. (Neither dbx nor ladebug were changed to use decimal
numbers for kernel threads when proc changed to use "low integer"
process port ids instead of kernel address (large number) kernel port
ids.)
>6. On V3.2, SEGV is handled by an exception handler which causes the stack
> to be unwound, so stack traces in a coredump are meaningless.
> On V4.0, SEGV is handled correctly, so the coredump stack trace represents
> the threads running at SEGV.
That depends a lot on how you define "correctly". There were
limitations in the exception model we used in 3.2, which prevented us
from aborting the process on an unhandled exception with the stack
intact. While that was unfortunate for some debugging situations, it
was not "incorrect".
When we moved to 4.0, we changed over to use the libexc "standard"
exception mechanism, which allows us to detect an unhandled exception
without unwinding the stack. But although it's possible, it's not
trivial, and we messed up in 4.0. This has been fixed in a patch, and
will be correct in 4.0D.
>7. On both V3.2 and V4.0, debuggers will trap the SEGV, so enabling the true
> stack to be viewed. [The stack seems true for the faulting thread, but I
> couldn't work it out for the others?].
If you don't see a reasonable stack for other threads, something in
your process is probably corrupted.
>8. It's also possible to re-install the default signal handler -
> signal(SIGSEGV, SIG_DFL), to produce a meaningful stack dump on V3.2.
That'll work on 4.0, though you don't need it if you have the
patch. It's not nearly so easy as that sounds on 3.2, because signal
handlers were per-thread, not per-process. You'd need to change the
signal handler FOR THE THREAD THAT GETS THE SEGV. (After the thread
starts running.)
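
A minimal sketch of that approach, using the same signal(SIGSEGV, SIG_DFL)
call, made from the thread's own start routine once the thread is running.
The names restore_default_segv and call_manager_thread are hypothetical. On
current POSIX systems signal dispositions are process-wide and sigaction()
is the usual interface, so treat this as an illustration of the per-thread
3.2 behaviour described above, not a general recipe.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Put SIGSEGV back to its default action (terminate and dump core).
 * Returns 0 on success, -1 on failure. */
int restore_default_segv(void)
{
    return signal(SIGSEGV, SIG_DFL) == SIG_ERR ? -1 : 0;
}

/* Hypothetical start routine: the handler is reset first thing, in the
 * thread that may later take the SEGV. */
void *call_manager_thread(void *arg)
{
    restore_default_segv();
    /* ... application work would go here ... */
    return arg;
}
```

Every thread whose stack should survive into the core file would need the
same call at the top of its start routine.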
>9. As suggested in .5 I tried identifying the caller - had to use
> asm("mov %r26, %r0"); to get it to compile - based on the example in
> c_asm.h - It worked OK on V4.0, but on V3.2, calls from two different
> threads had the same caller address??
Of course you can have the same caller address from different
threads. They're all running the same code. You use the ra to get some
idea of where you are in your code, and the thread sequence number to
tell in which thread you're there.
|
1529.10 | Thanks a LOT for the info | EDSCLU::GARROD | IBM Interconnect Engineering | Wed Apr 30 1997 14:30 | 9 |
| Re .-1
I don't think I've ever seen a note before with quite so high a density
of really useful information. I'm definitely saving this one off for
future reference.
Many thanks,
Dave
|
1529.11 | Thanks indeed - I agree with .10 | RDGENG::CHAMBERLIN | Danger! Do not Reverse Polarity | Thu May 01 1997 10:20 | 11 |
| I agree with .10
Wholehearted thanks to Dave and everyone for their advice, suggestions and
patience.
One last request - is there a pointer to the V4.0 exception fix Dave mentioned
in .9 ? Is this needed for all V4.0 to V4.0C ?
many thanks,
Ian.
|
1529.12 | Some unwinding may still occur. | WTFN::SCALES | Despair is appropriate and inevitable. | Mon May 05 1997 15:25 | 19 |
| .9> When we moved to 4.0, we changed over to use the libexc "standard"
.9> exception mechanism, which allows us to detect an unhandled exception
.9> without unwinding the stack.
However, this capability depends on functionality implemented by macros in the
application code. So, if your application contains code which was not compiled
on V4, some stack unwinding will occur during exception propagation, and the
stack will show the frame at which the last raise or reraise occurred (as
opposed to where the original exception occurred).
.8> The stack seems true for the faulting thread, but I couldn't work it out for
.8> the others?
It's also possible that these threads were blocked in system calls -- I believe
that in this case the stack trace will appear truncated.
Webb
|