Conference turris::digital_unix

Title: DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1)
Notice: Welcome to the Digital UNIX Conference
Moderator: SMURF::DENHAM
Created: Thu Mar 16 1995
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 10068
Total number of notes: 35879

8792.0. "pthreads and longjmp" by RHETT::HALETKY () Tue Feb 11 1997 12:58

    Hello,
    
    We have a customer using pthreads on DU 4.0b who wishes to use the
    longjmp command. However, it appears that this causes a bugcheck in
    threads. Are there any caveats when using longjmp?
    
    Perhaps he is finding the stack pointer incorrectly? Any suggestions?
    
    Here is the code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <setjmp.h>
    
    
    jmp_buf x;
    pthread_t thread1, thread2;
    
    void do_thread1(int i)
    {
            printf("thread 1 started\n");
    }
    
    void do_thread2(int i)
    {
            printf("thread 2 started\n");
    }
    
    foo()
    {
            char *s;
    
            printf("inside foo; stack near %x\n", &s);
            pthread_create(&thread2, NULL, (void *) do_thread2, (void *) &s);
    }
    
    main()
    {
            int i;
            char *s = (char *) malloc(32760);
            char *stackptr = (s +32760- 40);
    
            if ((long) stackptr & 0xf)
                    stackptr -= (long)stackptr % 16;
    
            for (i=0; i<32760; i++)
                    s[i] = 0;
    
            printf("new stack base = %x, top = %x\n", stackptr, s);
    
            pthread_create(&thread1, NULL, (void *) do_thread1, (void *) &i);
    
            if (_setjmp(x) == 0)
            {
                    printf("old stack = %x\n", x[34]);
                    x[34] = (long) stackptr;
            }
    
            for (i=0; i<32760; i++)
                    s[i] = 0;
    
            printf("new stack base = %x, top = %x\n", stackptr, s);
    
            pthread_create(&thread1, NULL, (void *) do_thread1, (void *) &i);
    
            if (_setjmp(x) == 0)
            {
                    printf("old stack = %x\n", x[34]);
                    x[34] = (long) stackptr;
                    _longjmp(x,1);
            }
            else
            {
                    foo();
                    pthread_join(thread1, NULL);
                    pthread_join(thread2, NULL);
            }
    }
    

8792.1. "Well, that's pretty funny. I think...." by WTFN::SCALES (Despair is appropriate and inevitable.) Tue Feb 11 1997 18:09 (30 lines)

.0> We have a customer using pthreads on DU 4.0b who wishes to use the
.0> longjmp command. 

Actually it looks like he wishes to ABuse the longjmp() function.  I don't know
whether what he's trying to do would be considered "supported" or not, but
longjmp() certainly was not INTENDED to be used in this way.

The customer is using longjmp() to switch the calling thread to a new stack. 
This is sheer hackery, and, as far as I can guess, this is a violation of the
Alpha calling standard.

I suspect that the DECthreads bugcheck is arising from the threads library
trying to check for a possible stack overflow and finding the SP is completely
out of bounds for the current thread.  (This is a WAG, since you didn't post any
of the bugcheck information, such as the "reason" string.)  Also, the test
program never restores the stack pointer to the original value, so when the
main() function returns the registers will be filled with garbage, and that
could cause all manner of "interesting" effects.

Why on earth is the customer trying to switch to a private stack??  What's wrong
with the original stack, or why can't the customer simply create a whole
separate thread instead of just a separate stack?

There are a number of things currently and in the future which care a great deal
that the stack be properly managed.  Stuffing addresses of arbitrary chunks of
memory into the stack pointer is not going to work very well, and it will work
even less well in the future.


				Webb
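
A minimal sketch of the "separate thread" alternative raised in .1: instead of patching a malloc'ed address into a jmp_buf, hand the memory to the threads library as the stack of a new thread. This assumes a current POSIX system that provides pthread_attr_setstack(); that call postdates this discussion, and nothing here claims it is available in DECthreads on Digital UNIX 4.0B. The names worker() and run_on_private_stack() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical worker: stores a marker so the caller can see that it ran. */
static void *worker(void *arg)
{
    *(int *)arg = 42;
    return NULL;
}

/*
 * Run worker() on a caller-allocated stack and wait for it to finish.
 * The threads library is told about the stack, so its bounds checks
 * see a stack pointer that belongs to the running thread.
 * Returns 0 on success or an errno-style code on failure.
 */
int run_on_private_stack(size_t stack_size, int *result)
{
    pthread_attr_t attr;
    pthread_t tid;
    void *stack = NULL;
    long page = sysconf(_SC_PAGESIZE);
    int rc;

    /* Assumption: a page-aligned, page-multiple stack satisfies the
     * alignment rules of the implementation in use. */
    assert(page > 0 && stack_size >= (size_t)PTHREAD_STACK_MIN);
    assert(stack_size % (size_t)page == 0);

    rc = posix_memalign(&stack, (size_t)page, stack_size);
    if (rc != 0)
        return rc;

    rc = pthread_attr_init(&attr);
    if (rc == 0) {
        rc = pthread_attr_setstack(&attr, stack, stack_size);
        if (rc == 0)
            rc = pthread_create(&tid, &attr, worker, result);
        pthread_attr_destroy(&attr);
    }
    if (rc == 0)
        rc = pthread_join(tid, NULL);

    free(stack);   /* safe: the thread was never created or has been joined */
    return rc;
}
```

Calling run_on_private_stack(1 << 20, &value) runs the worker on a 1 MB private stack. The point of the design is that the stack is registered with the library rather than hidden from it.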

8792.2. "Workaround? Possible solutions?" by RHETT::HALETKY () Wed Feb 12 1997 10:26 (12 lines)

    Hello,
    
    Can you suggest an alternative? Perhaps more than one as you already
    mentioned creating a new thread.
    
    They claim the code works on 3.2g and /all/ other operating systems.
    The company is Sybase, so I would presume it does work on other
    platforms. As to /why/ they are doing this, I don't know. But if a
    valid SP is needed, which they supposedly capture, what solutions are
    available?
    
    -ed Haletky

8792.3. "What are they trying to do??" by WTFN::SCALES (Despair is appropriate and inevitable.) Wed Feb 12 1997 14:01 (13 lines)

.2> As to /why/ they are doing this, I don't know.

Well, without knowing why they are doing this, I can't really suggest any
alternative.  (My suggestion of creating a thread was based on a presumption
of what they might be trying to accomplish.)

.2> They claim the code works on 3.2g and /all/ other operating systems.

I guess they've been lucky...



				Webb

8792.4. "More info from customer. Suggestions?" by RHETT::HALETKY () Fri Feb 14 1997 15:58 (70 lines)

    Customer's info and reasoning: I don't think it will work this way.
    
    
    Sybase's Open Server product implements its own user mode
    multithreading model. In the description that follows I will refer to
    these user mode threads as Sybase threads. Digital UNIX's threads
    will be referred to as pthreads.
    
    Description of the Sybase thread implementation:
    In the Sybase thread model we have our own thread scheduler that runs
    as part of the user's process. The UNIX kernel is unaware of this.
    A Sybase thread is created by allocating a Sybase thread control
    structure and allocating a stack for the thread (by default this is
    done using malloc). The thread control structure contains a jmp_buf
    structure and we initialise this with a call to setjmp. Then we patch
    the new stack base into the jmp_buf. The entry point for the new
    Sybase thread is a function pointer and this also gets patched into
    the jmp_buf. This Sybase thread control structure is then added to a
    run queue in the Sybase scheduler.
    
    The Sybase scheduler selects a Sybase thread to run, saves the Sybase
    scheduler's own context (using a call to setjmp) and does the context
    switch by calling longjmp with the jmp_buf of the selected thread. This
    causes the previously allocated stack to be loaded and control jumps to
    the entry point patched into the jmp_buf earlier. Note that the Sybase
    scheduler has its own Sybase thread context which is set up during
    startup of the Sybase Open Server. At this point the process is now
    executing the code of the new thread with the stack residing in the
    memory previously obtained through malloc.
    
    The Sybase thread model is non-preemptive so a Sybase thread has to
    yield control back to the Sybase scheduler either directly or
    implicitly via an Open Server API call. The API call does this by
    first saving the yielding Sybase thread's context in the jmp_buf for
    the current thread using setjmp. Then it does a longjmp using the
    jmp_buf of the Sybase scheduler, thus resuming the scheduler context.
    
    Finally, when the Open Server is shut down it restores the original
    process context from that saved at the start, so main() will have its
    correct stack.
    
    The sample provided to DEC emulates the Sybase scheduler's context
    switching.
    
    Sybase's Open Server uses the Sybase transport control library
    (netlib) to make network IO requests. In Sybase System 11 netlib comes
    in two versions; the first is basically the same as used in System 10.
    The alternate version creates multiple pthreads to service the
    requests. When netlib is initialised it creates a fixed number of
    pthread worker threads. But it can also create additional pthreads as
    the number of IO requests increases. When a Sybase thread makes an IO
    request it gets put on the request queue of one of the worker threads.
    However, if there are no free request queues, netlib creates a new
    request queue and a new pthread to service it. The pthread_create call
    to create the new pthread is made in the context of the Sybase thread
    making the netlib IO request. This causes the problem we are seeing.
    
    Is there any way we can turn off the stack overflow checking in the
    pthread library?
    
    

8792.5. "Customer should use a supported mechanism." by WTFN::SCALES (Despair is appropriate and inevitable.) Mon Feb 17 1997 16:35 (16 lines)

.4> Is there any way we can turn off the stack overflow checking in the
.4> pthread library?

No.

On the other hand, if the customer were to use a supported mechanism for "user
level context switching" we could probably make everything work together [see
makecontext(2) and swapcontext(2), although the man pages are almost worse than
useless :-( ].  

Note, I'm not saying that using these functions instead of setjmp()/longjmp()
WILL solve the problem.  However, before we _can_ solve the problem, the
customer must be using a supported mechanism.


				Webb
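
Since the man pages are called nearly useless in .5, here is a minimal sketch of user-level context switching with getcontext(), makecontext() and swapcontext() on a malloc'ed stack. It assumes a system that ships <ucontext.h> (the interfaces were later marked obsolescent in POSIX but remain widely available); it is not Sybase's code, and the names task_body() and run_task_in_two_steps() and the 64 KB stack size are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define TASK_STACK_SIZE (64 * 1024)   /* assumed large enough for the demo */

static ucontext_t scheduler_ctx;  /* context of the caller ("scheduler") */
static ucontext_t task_ctx;       /* context of the user-level thread */
static int steps_run;

/* Body of the user-level thread: one step, a yield, one more step. */
static void task_body(void)
{
    steps_run++;
    swapcontext(&task_ctx, &scheduler_ctx);   /* yield to the scheduler */
    steps_run++;
    /* Returning from here resumes uc_link, i.e. scheduler_ctx. */
}

/* Returns the number of steps the task executed, or -1 on setup failure. */
int run_task_in_two_steps(void)
{
    void *stack = malloc(TASK_STACK_SIZE);

    if (stack == NULL)
        return -1;
    if (getcontext(&task_ctx) == -1) {
        free(stack);
        return -1;
    }
    steps_run = 0;

    /* The stack and entry point are set through documented fields and
     * calls, not by patching slots of a jmp_buf. */
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = TASK_STACK_SIZE;
    task_ctx.uc_link = &scheduler_ctx;
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&scheduler_ctx, &task_ctx);   /* run until the first yield */
    assert(steps_run == 1);
    swapcontext(&scheduler_ctx, &task_ctx);   /* resume until it returns */

    free(stack);
    return steps_run;
}
```

Whether this cooperates with a given threads library is a separate question; the sketch only shows the documented way to express "new stack plus entry point".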

8792.6. "" by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Tue Feb 18 1997 08:20 (22 lines)

I'll also point out that using the makecontext/swapcontext solution would
(possibly, potentially) work ONLY if they make sure that each "Sybase thread"
runs only on one specific POSIX thread, and never on a different POSIX thread.

The problem with the current setjmp/longjmp hack (and almost certainly with
the current implementation of makecontext/swapcontext) is that they're
designed to assume that the underlying system knows nothing about threads.
But when you try to use them on top of threads, that's no longer true -- the
underlying system DOES know about threads. And pulling the rug out from under
the thread may be hazardous. It's not a matter of stacks or stack limit
checking -- it's much more than that. It's a matter of basic thread identity.

Personally, I believe that such user-mode context switching should simply be
unsupported when using threads. But should the business decision be
otherwise, then a substantial amount of development effort will be required
to make it work. And it can only be made to work if supported mechanisms are
used. Because setjmp and longjmp are "lighter weight" and intended for a lot
of things where changing thread identity would not be desirable, it's very
unlikely that we could support a "Sybase thread" model based on those
primitives.

	/dave
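
The restriction in .6, that each "Sybase thread" must only ever run on one specific POSIX thread, could be enforced with a check like the following. This is a sketch under stated assumptions: struct uthread_owner and both function names are hypothetical, and calls for a given structure are assumed to be serialized by the caller (no locking is shown).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical ownership record kept in a user-level thread's control
 * structure. */
struct uthread_owner {
    pthread_t owner;   /* POSIX thread that first dispatched it */
    bool bound;        /* false until the first dispatch */
};

/*
 * Bind the user-level thread to the calling POSIX thread on first use.
 * Afterwards, report whether the caller is that same POSIX thread.
 */
bool uthread_may_run_here(struct uthread_owner *u)
{
    pthread_t self = pthread_self();

    assert(u != NULL);
    if (!u->bound) {
        u->owner = self;
        u->bound = true;
        return true;
    }
    return pthread_equal(u->owner, self) != 0;
}

/* Thread start routine: returns non-NULL if the check passed. */
static void *probe(void *arg)
{
    return uthread_may_run_here(arg) ? arg : NULL;
}

/* Ask the same question from a freshly created POSIX thread. */
bool probe_from_other_thread(struct uthread_owner *u)
{
    pthread_t tid;
    void *res = NULL;

    if (pthread_create(&tid, NULL, probe, u) != 0)
        return false;
    pthread_join(tid, &res);
    return res != NULL;
}
```

A scheduler would call uthread_may_run_here() before every resume and refuse to switch when it returns false.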

8792.7. "More suggestions needed" by RHETT::HALETKY () Tue Mar 04 1997 10:30 (10 lines)

    
    
    Even so, why would it work on 'every other platform', so sayeth the
    customer?
    
    It appears that makecontext/setcontext will not work for the customer.
    
    Any other suggestions?
    
    -ed haletky

8792.8. "" by SMURF::DENHAM (Digital UNIX Kernel) Tue Mar 04 1997 10:55 (5 lines)

    The problem report from the customer reached DECthreads
    last night. They have an engineer working the stack issues
    involved. I think those issues are pretty well understood
    at this point, so stay tuned for a test library most likely.
    

8792.9. "The suggestion is the same: use a supported mechanism" by WTFN::SCALES (Despair is appropriate and inevitable.) Tue Mar 04 1997 12:07 (36 lines)

.7> Even so, why would it work on 'every other platform', so sayeth the
.7> customer?

This sounds a lot like a ten-year-old saying, "But, mom, all the other kids
get to stay up late..."  :-)

The fact that it works on many other platforms means either that those
platforms haven't tried or needed to do anything sophisticated with call
stacks or procedure invocations or that the customer has simply been lucky.
(Just because it appears to execute correctly doesn't mean that it's correct
code, that it's supportable or robust, or that it will continue to work on
the next version of the operating system or on any other operating system;
the only way to ensure correctness is to write the code properly and not
depend on undocumented features or implementation details.)

The standards don't place any requirements on the contents of a jmp_buf.
(They say only that it must be an array.)  The Digital Unix calling standard
points out that on Digital Unix all that really needs to be in the jmp_buf is
a frame pointer (NOT a stack pointer) and a PC -- the things necessary for an
"unwind".  If we had used the minimal implementation, then the customer would
have found that their hack couldn't be ported to Digital Unix at all when
they made their initial attempt.  :-}
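
For contrast, a minimal sketch of the use the standards do cover: an "unwind" to a frame that is still active on the same stack, with the jmp_buf treated as opaque. The names fail_at_depth(), run_with_recovery() and ERR_TOO_DEEP are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

#define ERR_TOO_DEEP 3           /* hypothetical error code */

static jmp_buf recover_point;    /* opaque: never index into it */

/* Recurse `depth` frames down, then abandon all of them at once. */
static void fail_at_depth(unsigned depth)
{
    if (depth == 0)
        longjmp(recover_point, ERR_TOO_DEEP);
    fail_at_depth(depth - 1);
}

/*
 * The frame that called setjmp() is still live when longjmp() runs,
 * so this is a plain unwind on the current stack.
 */
int run_with_recovery(unsigned depth)
{
    switch (setjmp(recover_point)) {
    case 0:                      /* direct return from setjmp() */
        fail_at_depth(depth);
        return 0;                /* not reached: the call above always jumps */
    case ERR_TOO_DEEP:           /* arrived here via longjmp() */
        return ERR_TOO_DEEP;
    default:
        assert(!"unexpected longjmp() value");
        return -1;
    }
}
```

Nothing in this pattern depends on where a stack pointer might be stored in the jmp_buf, so it would keep working even under the minimal implementation described above.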

.7> It appears that makecontext/setcontext will not work for the customer.
    
Appearances can be deceiving.  If the customer were willing to switch and use
a supported mechanism in lieu of their longjmp() hack, then I believe it
would be trivial to make swapcontext() work for them.  (They need to pass a
status along with the context switch, so pass it in the thread control
structure of the thread being scheduled.)  This is not a case of "it won't
work"; this is a case of "we got a cool demo threads package from some
academic site on the Internet, and we used it in our product, and now we want
Digital to support it, even though it's a hack..."  :-(


					Webb
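
The parenthetical in .9 about passing a status can be sketched as follows. longjmp() delivers a value as the return of setjmp(), whereas swapcontext() has no such channel, so the status travels in the thread control structure instead. This assumes <ucontext.h> is available; struct uthread, its fields, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define UTHREAD_STACK_SIZE (64 * 1024)   /* assumed adequate for the demo */

struct uthread {            /* hypothetical thread control structure */
    ucontext_t ctx;
    int wake_status;        /* written by the scheduler before each resume */
    int seen[2];            /* statuses the thread observed, for the demo */
};

static ucontext_t sched_ctx;
static struct uthread *current;

/* Yield to the scheduler; returns the status supplied on the next resume. */
static int uthread_yield(void)
{
    struct uthread *self = current;

    assert(self != NULL);
    swapcontext(&self->ctx, &sched_ctx);
    return self->wake_status;
}

static void uthread_main(void)
{
    struct uthread *self = current;

    self->seen[0] = self->wake_status;   /* status of the first dispatch */
    self->seen[1] = uthread_yield();     /* status of the second dispatch */
}

/* Scheduler side: store the status, then switch to the thread. */
static void dispatch(struct uthread *t, int status)
{
    t->wake_status = status;
    current = t;
    swapcontext(&sched_ctx, &t->ctx);
}

/* Dispatch one thread twice; report the statuses it saw in out[]. */
int demo_status_passing(int first, int second, int out[2])
{
    struct uthread t;
    void *stack = malloc(UTHREAD_STACK_SIZE);

    if (stack == NULL)
        return -1;
    if (getcontext(&t.ctx) == -1) {
        free(stack);
        return -1;
    }
    t.ctx.uc_stack.ss_sp = stack;
    t.ctx.uc_stack.ss_size = UTHREAD_STACK_SIZE;
    t.ctx.uc_link = &sched_ctx;          /* resumed when uthread_main returns */
    makecontext(&t.ctx, uthread_main, 0);

    dispatch(&t, first);
    dispatch(&t, second);

    out[0] = t.seen[0];
    out[1] = t.seen[1];
    free(stack);
    return 0;
}
```

The scheduler writes wake_status before every switch, so the resumed thread always reads the value meant for that particular resume.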

8792.10. "Not acceptable, Try again" by RHETT::HALETKY () Wed Mar 05 1997 14:12 (31 lines)

    
    
    Setcontext/makecontext will not work as they would be using it to span
    threads. 
    
    setcontext/makecontext will not work because it would add about 3 years
    to their development cycle. This is Sybase, folks. Does Digital want to
    risk not having their products work on our machines because of some
    academic reason?
    
    I'll agree it's a hack, they agree it's a hack. But it does work with
    other Posix Threads implementations, but it does not work with ours.
    The answers we have given them are /not/ acceptable.
    
    In .8 it says the problem report reached engineering. Excuse me, I did
    report the problem via this notes conference and got back /It's a
    hack/. Care to explain what is going on? I'm looking for a solution to
    give to a company who sells $millions and could sell $millions for
    Digital. The answer 'it's a hack and they are lucky' is not acceptable
    to me and definitely not to the customer. 
    
    I'm out of ideas. There is /nothing/ in our documentation that says it
    won't work. Hence customers will think it does work, whether we
    consider it to be a hack or not. In essence, give me a solution that
    will work for this customer. Makecontext/swapcontext has been rejected
    as useable by them for a number of reasons. Are there any others?
    
    Regards,
    Ed Haletky
    Digital CSC (Customer Service Center) The ones who bring you bugs and
    problems via this Notes Conference and IPMTs. 

8792.11. "let's hear those reasons" by VAXCPU::michaud (Jeff Michaud - ObjectBroker) Wed Mar 05 1997 14:44 (8 lines)

> Makecontext/swapcontext has been rejected
> as useable by them for a number of reasons.
     ^^^^^^^

	I assume you meant "unuseable".

	Care to elaborate on those "number of reasons", or is this just
	a case of a stubborn customer?

8792.12. "If it's not acceptable, they will be having problems again...and again..." by WTFN::SCALES (Despair is appropriate and inevitable.) Wed Mar 05 1997 16:05 (52 lines)

.10> Excuse me, I did report the problem via this notes conference

Ed, first off, this notes conference is not a problem reporting mechanism. 
(This is repeated over and over in this and many other notes conferences.)
If you want to make a formal problem report, you must either enter a QAR (for
problems encountered by internal groups) or open an IPMT case (for problems
experienced by specific external customers).  Notes conferences are for exchange
of information, only.

.10> There is /nothing/ in our documentation that says it won't work.

Likewise, there is nothing in our _documentation_ which implies that it -would-
work.  Furthermore, there is nothing in any of the pertinent standards
specifications which implies that this would work, either.

Just because you try something and it appears to work doesn't make it a good
solution.  Just because you have access to the internals of a function, doesn't
mean that the function will never be changed.

Relying on undocumented, unspecified, or internal implementation details will
not make your code reliable or supportable.

.10> Does Digital want to risk not having their products work on our machines 
.10> because of some academic reason?

The reason is a practical one, not an academic one.  Even if we smooth over
whatever wrinkle Sybase hit this time, there is no guarantee that their code
will continue to work on the -next- version of Digital Unix.  The customer is
using an unsupported mechanism, and there is no way that we can ensure that it
will work with each successive version of Digital Unix.

Having the customer use a supported mechanism is in their best interest.

.10> Setcontext/makecontext will not work as they would be using it to span
.10> threads. 

I'm not exactly sure what this means, but if swapcontext() won't work, then
their existing hack is already dead in the water!!  Beyond that, it strikes me
as really unlikely that replacing the existing code which calls longjmp() with
code calling swapcontext() could possibly add anything like 3 years to their
development cycle.  This is straight FUD.

We understand that this is a big customer.  We understand that they mean a lot of
business to Digital.  However, it will make life better for everyone if they use
facilities which we can ensure will work.  If all of our major customers dive
into our internals and make use of little bits, in the future we will be unable
to enhance our software, and we will be quickly left behind in the industry. 
So, the result is pretty much the same either way:  either we find people
willing to work with us and follow the rules, or we're on our way out.


				Webb