T.R | Title | User | Personal Name | Date | Lines |
---|
8792.1 | Well, that's pretty funny. I think.... | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Feb 11 1997 18:09 | 30 |
| .0> We have a customer using pthreads on DU 4.0b and wishes to use the
.0> longjmp command.
Actually it looks like he wishes to ABuse the longjmp() function. I don't know
whether what he's trying to do would be considered "supported" or not, but
longjmp() certainly was not INTENDED to be used in this way.
The customer is using longjmp() to switch the calling thread to a new stack.
This is sheer hackery, and, as far as I can guess, this is a violation of the
Alpha calling standard.
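For contrast, here is a minimal sketch of what longjmp() was designed for: abandoning nested calls and returning to a setjmp() point in a frame that is still active on the same stack. The names recovery_point, descend, and run_with_recovery are invented for this illustration and do not come from the customer's test program.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical names; only setjmp() and longjmp() are real library calls. */
static jmp_buf recovery_point;

/* Recurse 'depth' levels down, then either return normally or bail out. */
static void descend(int depth, int fail)
{
    assert(depth >= 0);
    if (depth > 0) {
        descend(depth - 1, fail);
        return;
    }
    if (fail)
        longjmp(recovery_point, 1);   /* unwind to the setjmp() in the caller */
}

/* Returns 0 if descend() returned normally, -1 if it bailed out. */
int run_with_recovery(int depth, int fail)
{
    if (setjmp(recovery_point) != 0)
        return -1;   /* reached via longjmp(); the nested frames are discarded */
    descend(depth, fail);
    return 0;
}
```

In this pattern the target of the longjmp() is always a frame further up the stack the thread is already running on, so the stack pointer never leaves the thread's own stack.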
I suspect that the DECthreads bugcheck is arising from the threads library
trying to check for a possible stack overflow and finding the SP is completely
out of bounds for the current thread. (This is a WAG, since you didn't post any
of the bugcheck information, such as the "reason" string.) Also, the test
program never restores the stack pointer to the original value, so when the
main() function returns the registers will be filled with garbage, and that
could cause all manner of "interesting" effects.
Why on earth is the customer trying to switch to a private stack?? What's wrong
with the original stack, or why can't the customer simply create a whole
separate thread instead of just a separate stack?
There are a number of things currently and in the future which care a great deal
that the stack be properly managed. Stuffing addresses of arbitrary chunks of
memory into the stack pointer is not going to work very well, and it will work
even less well in the future.
Webb
|
8792.2 | Workaround? Possible solutions? | RHETT::HALETKY | | Wed Feb 12 1997 10:26 | 12 |
| Hello,
Can you suggest an alternative? Perhaps more than one as you already
mentioned creating a new thread.
They claim the code works on 3.2g and /all/ other operating systems.
The company is Sybase so I would presume it does work on other
platforms. As to /why/ they are doing this, I don't know. But if a
valid SP is needed, which they supposedly capture, what solutions are
available?
-ed Haletky
|
8792.3 | What are they trying to do?? | WTFN::SCALES | Despair is appropriate and inevitable. | Wed Feb 12 1997 14:01 | 13 |
| .2> As to /why/ they are doing this, I don't know.
Well, without knowing why they are doing this, I can't really suggest any
alternative. (My suggestion of creating a thread was based on a presumption
of what they might be trying to accomplish.)
.2> They claim the code works on 3.2g and /all/ other operating systems.
I guess they've been lucky...
Webb
|
8792.4 | More info from customer. Suggestions? | RHETT::HALETKY | | Fri Feb 14 1997 15:58 | 70 |
| Customer's info and reasoning: I don't think it will work this way.
Sybase's Open Server product implements its own user-mode
multithreading model. In the description that follows I will refer to
these user-mode threads as Sybase threads. Digital UNIX's threads will
be referred to as pthreads.
Description of the Sybase thread implementation:
In the Sybase thread model we have our own thread scheduler that runs as
part of the user's process. The UNIX kernel is unaware of this.
A Sybase thread is created by allocating a Sybase thread control
structure and allocating a stack for the thread (by default this is done
using malloc). The thread control structure contains a jmp_buf structure
and we initialise this with a call to setjmp. Then we patch the new
stack base into the jmp_buf. The entry point for the new Sybase thread
is a function pointer and this also gets patched into the jmp_buf.
This Sybase thread control structure is then added to a run queue in the
Sybase scheduler.
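The bookkeeping described above can be sketched roughly as follows. This is an illustration under assumptions, not Sybase's code: the names sy_thread, sy_runq, and the sy_* functions are hypothetical, and the step that patches a stack base and entry point into the jmp_buf is left out because it depends on the undocumented layout of jmp_buf.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*sy_entry_fn)(void);

/* Hypothetical thread control structure, modelled on the description. */
typedef struct sy_thread {
    jmp_buf ctx;              /* saved context, initialised with setjmp() */
    void *stack;              /* private stack obtained from malloc() */
    size_t stack_size;
    sy_entry_fn entry;        /* entry point of the user-mode thread */
    struct sy_thread *next;   /* run-queue link */
} sy_thread;

/* Hypothetical FIFO run queue owned by the user-mode scheduler. */
typedef struct sy_runq {
    sy_thread *head;
    sy_thread *tail;
} sy_runq;

/* Allocate the control structure and the stack. The jmp_buf patching that
   the description mentions is intentionally not reproduced here. */
sy_thread *sy_thread_new(sy_entry_fn entry, size_t stack_size)
{
    sy_thread *t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->stack = malloc(stack_size);
    if (t->stack == NULL) {
        free(t);
        return NULL;
    }
    t->stack_size = stack_size;
    t->entry = entry;
    return t;
}

void sy_thread_free(sy_thread *t)
{
    if (t != NULL) {
        free(t->stack);
        free(t);
    }
}

/* Append a thread to the tail of the run queue. */
void sy_runq_put(sy_runq *q, sy_thread *t)
{
    assert(q != NULL && t != NULL);
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Remove and return the thread at the head, or NULL if the queue is empty. */
sy_thread *sy_runq_get(sy_runq *q)
{
    sy_thread *t = q->head;
    if (t != NULL) {
        q->head = t->next;
        if (q->head == NULL)
            q->tail = NULL;
        t->next = NULL;
    }
    return t;
}
```

Everything in this sketch is ordinary portable C; the unportable part of the scheme is confined to what gets written into ctx afterwards.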
The Sybase scheduler selects a Sybase thread to run, saves the Sybase
scheduler's own context (using a call to setjmp) and does the context
switch by calling longjmp with the jmp_buf of the selected thread. This
causes the previously allocated stack to be loaded and control jumps to
the entry point patched into the jmp_buf earlier. Note that the Sybase
scheduler has its own Sybase thread context which is set up during
startup of the Sybase Open Server. At this point the process is now
executing the code of the new thread with the stack residing in the
memory previously obtained through malloc.
The Sybase thread model is non-preemptive so a Sybase thread has to
yield control back to the Sybase scheduler either directly or implicitly
via an Open Server API call. The API call does this by first saving the
yielding Sybase thread's context in the jmp_buf for the current thread
using setjmp. Then it does a longjmp using the jmp_buf of the Sybase
scheduler, thus resuming the scheduler context.
Finally, when the Open Server is shut down it restores the original
process context from that saved at the start, so main() will have its
correct stack.
The sample provided to DEC emulates the Sybase scheduler's context
switching.
Sybase's Open Server uses the Sybase transport control library
(netlib) to make network IO requests. In Sybase System 11 netlib comes
in two versions, the first is basically the same as used in System 10.
The alternate version creates multiple pthreads to service the requests.
When netlib is initialised it creates a fixed number of pthread worker
threads. But it can also create additional pthreads as the number of IO
requests increases. When a Sybase thread makes an IO request it gets put
on the request queue of one of the worker threads. However, if there are
no free request queues, netlib creates a new request queue and a new
pthread to service it. The pthread_create call to create the new pthread
is made in the context of the Sybase thread making the netlib IO
request. This causes the problem we are seeing.
Is there any way we can turn off the stack overflow checking in the
pthread library?
|
8792.5 | Customer should use a supported mechanism. | WTFN::SCALES | Despair is appropriate and inevitable. | Mon Feb 17 1997 16:35 | 16 |
| .4> Is there any way we can turn off the stack overflow checking in the
.4> pthread library?
No.
On the other hand, if the customer were to use a supported mechanism for "user
level context switching" we could probably make everything work together [see
makecontext(2) and swapcontext(2), although the man pages are almost worse than
useless :-( ].
Note, I'm not saying that using these functions instead of setjmp()/longjmp()
WILL solve the problem. However, before we _can_ solve the problem, the
customer must be using a supported mechanism.
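Since the man pages are thin, here is a minimal sketch of the documented getcontext()/makecontext()/swapcontext() pattern: a task runs on a malloc'ed stack, yields once to a scheduler context, and is then resumed until it returns. The names sched_ctx, task_ctx, task_body, and run_task_to_completion are hypothetical, the 64 KB stack size is an arbitrary assumption, and this has not been tried against DECthreads.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t sched_ctx;   /* where the scheduler's context is saved */
static ucontext_t task_ctx;    /* the user-mode task's context */
static int steps;              /* counts the slices the task has executed */

static void task_body(void)
{
    steps++;                              /* first slice */
    swapcontext(&task_ctx, &sched_ctx);   /* yield to the scheduler */
    steps++;                              /* second slice, after being resumed */
    /* Returning from here resumes uc_link, i.e. the scheduler context. */
}

/* Runs task_body() on a private stack; returns the slice count or -1. */
int run_task_to_completion(void)
{
    const size_t stack_size = 64 * 1024;   /* assumed size for this sketch */
    void *stack = malloc(stack_size);

    if (stack == NULL)
        return -1;
    steps = 0;
    if (getcontext(&task_ctx) == -1) {
        free(stack);
        return -1;
    }
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = stack_size;
    task_ctx.uc_link = &sched_ctx;          /* resumed when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&sched_ctx, &task_ctx);     /* run the task until it yields */
    swapcontext(&sched_ctx, &task_ctx);     /* resume it until it returns */

    free(stack);
    return steps;
}
```

Unlike the jmp_buf patch, the stack and entry point are handed to the system through documented fields and calls, which is what gives the library a chance to know about the switch.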
Webb
|
8792.6 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Tue Feb 18 1997 08:20 | 22 |
| I'll also point out that using the makecontext/swapcontext solution would
(possibly, potentially) work ONLY if they make sure that each "Sybase thread"
runs only on one specific POSIX thread, and never on a different POSIX thread.
The problem with the current setjmp/longjmp hack (and almost certainly with
the current implementation of makecontext/swapcontext) is that they're
designed to assume that the underlying system knows nothing about threads.
But when you try to use them on top of threads, that's no longer true -- the
underlying system DOES know about threads. And pulling the rug out from under
the thread may be hazardous. It's not a matter of stacks or stack limit
checking -- it's much more than that. It's a matter of basic thread identity.
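One way to enforce the "one specific POSIX thread" restriction in application code is to record the owning thread when a context is created and refuse to resume it from anywhere else. The sketch below assumes that approach; bound_ctx and its functions are hypothetical names, and the check only guards against misuse, it does not make user-mode switching supported.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <ucontext.h>

/* Hypothetical wrapper tying a user-mode context to one POSIX thread. */
typedef struct bound_ctx {
    ucontext_t ctx;
    pthread_t owner;   /* the POSIX thread allowed to run this context */
} bound_ctx;

/* Record the calling POSIX thread as the owner of the context. */
void bound_ctx_claim(bound_ctx *b)
{
    assert(b != NULL);
    b->owner = pthread_self();
}

/* Nonzero only if the calling POSIX thread is the recorded owner. */
int bound_ctx_owned_by_caller(const bound_ctx *b)
{
    return pthread_equal(b->owner, pthread_self()) != 0;
}

/* Switch to 'to' only from its owning thread; returns 0 or an errno value. */
int bound_ctx_resume(ucontext_t *from, bound_ctx *to)
{
    if (!bound_ctx_owned_by_caller(to))
        return EPERM;   /* would move the context to another POSIX thread */
    return swapcontext(from, &to->ctx) == 0 ? 0 : errno;
}
```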
Personally, I believe that such user-mode context switching should simply be
unsupported when using threads. But should the business decision be
otherwise, then a substantial amount of development effort will be required
to make it work. And it can only be made to work if supported mechanisms are
used. Because setjmp and longjmp are "lighter weight" and intended for a lot
of things where changing thread identity would not be desirable, it's very
unlikely that we could support a "Sybase thread" model based on those
primitives.
/dave
|
8792.7 | More suggestions needed | RHETT::HALETKY | | Tue Mar 04 1997 10:30 | 10 |
|
Even so, why would it work on 'every other platform' so sayeth the
customer.
It appears that makecontext/setcontext will not work for the customer.
Any other suggestions?
-ed haletky
|
8792.8 | | SMURF::DENHAM | Digital UNIX Kernel | Tue Mar 04 1997 10:55 | 5 |
| The problem report from the customer reached DECthreads
last night. They have an engineer working the stack issues
involved. I think those issues are pretty well understood
at this point, so stay tuned for a test library most likely.
|
8792.9 | The suggestion is the same: use a supported mechanism | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Mar 04 1997 12:07 | 36 |
| .7> Even so, why would it work on 'every other platform' so sayeth the
.7> customer.
This sounds a lot like a ten-year-old saying, "But, mom, all the other kids
get to stay up late..." :-)
The fact that it works on many other platforms means either that those
platforms haven't tried or needed to do anything sophisticated with call
stacks or procedure invocations or that the customer has simply been lucky.
(Just because it appears to execute correctly doesn't mean that it's correct
code, that it's supportable or robust, or that it will continue to work on
the next version of the operating system or on any other operating system;
the only way to ensure correctness is to write the code properly and not
depend on undocumented features or implementation details.)
The standards don't place any requirements on the contents of a jmp_buf.
(They say only that it must be an array.) The Digital Unix calling standard
points out that on Digital Unix all that really needs to be in the jmp_buf is
a frame pointer (NOT a stack pointer) and a PC -- the things necessary for an
"unwind". If we had used the minimal implementation, then the customer would
have found that their hack couldn't be ported to Digital Unix at all when
they made their initial attempt. :-}
.7> It appears that makecontext/setcontext will not work for the customer.
Appearances can be deceiving. If the customer were willing to switch and use
a supported mechanism in lieu of their longjmp() hack, then I believe it
would be trivial to make swapcontext() work for them. (They need to pass a
status along with the context switch, so pass it in the thread control
structure of the thread being scheduled.) This is not a case of "it won't
work"; this is a case of "we got a cool demo threads package from some
academic site on the Internet, and we used it in our product, and now we want
Digital to support it, even though it's a hack..." :-(
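The parenthetical about passing a status can be sketched like this. longjmp() hands a value to the resumed setjmp(), and swapcontext() has no equivalent argument, so the scheduler stores the value in the control structure of the thread it is about to resume. All names here (task, wake_status, deliver_status) are hypothetical, and the 64 KB stack size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Hypothetical control structure for one user-mode thread. */
typedef struct task {
    ucontext_t ctx;
    int wake_status;   /* stands in for the value longjmp() used to carry */
} task;

static ucontext_t sched_ctx;   /* scheduler's saved context */
static task worker;
static int seen_status;        /* what the worker observed after resuming */

static void worker_body(void)
{
    swapcontext(&worker.ctx, &sched_ctx);   /* yield and wait to be resumed */
    seen_status = worker.wake_status;       /* pick up the scheduler's status */
}

/* Resume the worker with 'status'; returns what the worker saw, or -1. */
int deliver_status(int status)
{
    const size_t stack_size = 64 * 1024;    /* assumed size for this sketch */
    void *stack = malloc(stack_size);

    if (stack == NULL)
        return -1;
    seen_status = -1;
    if (getcontext(&worker.ctx) == -1) {
        free(stack);
        return -1;
    }
    worker.ctx.uc_stack.ss_sp = stack;
    worker.ctx.uc_stack.ss_size = stack_size;
    worker.ctx.uc_link = &sched_ctx;        /* resumed when worker_body returns */
    makecontext(&worker.ctx, worker_body, 0);

    swapcontext(&sched_ctx, &worker.ctx);   /* let the worker run to its yield */
    worker.wake_status = status;            /* leave the status for the worker */
    swapcontext(&sched_ctx, &worker.ctx);   /* reschedule it */

    free(stack);
    return seen_status;
}
```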
Webb
|
8792.10 | Not acceptable, Try again | RHETT::HALETKY | | Wed Mar 05 1997 14:12 | 31 |
|
Setcontext/makecontext will not work as they would be using it to span
threads.
setcontext/makecontext will not work because it would add about 3 years to
their development cycle. This is Sybase folks. Does Digital want to
risk not having their products work on our machines because of some
academic reason?
I'll agree it's a hack, they agree it's a hack. But it does work with
other Posix Threads implementations, but it does not work with ours.
The answers we have given them are /not/ acceptable.
In .8 it says the problem report reached engineering. Excuse me, I did
report the problem via this notes conference and got back /It's a
hack/. Care to explain what is going on? I'm looking for a solution to
give to a company who sells $millions and could sell $millions for
Digital. The answer 'it's a hack and they are lucky' is not acceptable
to me and definitely not to the customer.
I'm out of Ideas. There is /nothing/ in our documentation that says It
won't work. Hence customers will think it does work, whether we
consider it to be a hack or not. In essence, give me a solution that
will work for this customer. Makecontext/swapcontext has been rejected
as useable by them for a number of reasons. Are there any others?
Regards,
Ed Haletky
Digital CSC (Customer Service Center): the ones who bring you bugs and
problems via this Notes conference and IPMTs.
|
8792.11 | let's hear those reasons | VAXCPU::michaud | Jeff Michaud - ObjectBroker | Wed Mar 05 1997 14:44 | 8 |
| > Makecontext/swapcontext has been rejected
> as useable by them for a number of reasons.
^^^^^^^
I assume you meant "unuseable".
Care to elaborate on those "number of reasons", or is this just
a case of a stubborn customer?
|
8792.12 | If it's not acceptable, they will be having problems again...and again... | WTFN::SCALES | Despair is appropriate and inevitable. | Wed Mar 05 1997 16:05 | 52 |
| .10> Excuse me, I did report the problem via this notes conference
Ed, first off, this notes conference is not a problem reporting mechanism.
(This is repeated over and over in this and many other notes conferences.)
If you want to make a formal problem report, you must either enter a QAR (for
problems encountered by internal groups) or open an IPMT case (for problems
experienced by specific external customers). Notes conferences are for exchange
of information, only.
.10> There is /nothing/ in our documentation that says It won't work.
Likewise, there is nothing in our _documentation_ which implies that it -would-
work. Furthermore, there is nothing in any of the pertinent standards
specifications which implies that this would work, either.
Just because you try something and it appears to work doesn't make it a good
solution. Just because you have access to the internals of a function, doesn't
mean that the function will never be changed.
Relying on undocumented, unspecified, or internal implementation details will
not make your code reliable or supportable.
.10> Does Digital want to risk not having their products work on our machines
.10> because of some academic reason?
The reason is a practical one, not an academic one. Even if we smooth over
whatever wrinkle Sybase hit this time, there is no guarantee that their code
will continue to work on the -next- version of Digital Unix. The customer is
using an unsupported mechanism, and there is no way that we can ensure that it
will work with each successive version of Digital Unix.
Having the customer use a supported mechanism is in their best interest.
.10> Setcontext/makecontext will not work as they would be using it to span
.10> threads.
I'm not exactly sure what this means, but if swapcontext() won't work, then
their existing hack is already dead in the water!! Beyond that, it strikes me
as really unlikely that replacing the existing code which calls longjmp() with
code calling swapcontext() could possibly add anything like 3 years to their
development cycle. This is straight FUD.
We understand that this is a big customer. We understand that they mean a lot of
business to Digital. However, it will make life better for everyone if they use
facilities which we can ensure will work. If all of our major customers dive
into our internals and make use of little bits, in the future we will be unable
to enhance our software, and we will be quickly left behind in the industry.
So, the result is pretty much the same either way: either we find people
willing to work with us and follow the rules, or we're on our way out.
Webb
|