T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
8357.1 | Our customers don't, why should we??!!! | WTFN::SCALES | Despair is appropriate and inevitable. | Mon Jan 06 1997 11:00 | 9 |
8357.2 | | LABC::RU | | Mon Jan 06 1997 14:09 | 7 |
8357.3 | | QUARRY::neth | Craig Neth | Mon Jan 06 1997 14:46 | 7 |
8357.4 | Shared library performance is no justification. | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Jan 07 1997 10:04 | 13 |
8357.5 | there is a cost | SMURF::RICKABAUGH | Mike Rickabaugh Quo flamma est? | Tue Jan 07 1997 16:25 | 41 |
8357.6 | | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Jan 07 1997 18:02 | 46 |
8357.7 | | NETRIX::"[email protected]" | [email protected] | Fri Jan 10 1997 13:47 | 12 |
8357.8 | An update | TAEC::SABIANI | | Tue Jan 14 1997 06:40 | 28 |
8357.9 | case study? | RHETT::PARKER | | Mon Jan 27 1997 17:29 | 64 |
|
Hi Folks,
Just to add a real-world example of why one customer is going to
stick with static archive libraries, I'm placing this note here.
Since these folks are a Partner, they might be willing to work
with us on improving the situation. I can check with them or QAR
this for them if someone thinks it would be helpful.
Lee
-----------------------------------------------------------------
We're currently on Digital UNIX 3.2e. We have various applications
which use an in-house C++ class library. To use memory more
efficiently, we placed our class library in a Unix shared library. For
purposes of this message I'll call the archive version of our class
library "hdb.a" and our shared library version "hdb.so".
When doing a performance analysis of our apps (using Polycenter
and our own internal clocks measuring CPU), we've noticed a
dramatic difference in CPU time between an application linked
to "hdb.a" and the same application linked to "hdb.so". The
application linked to "hdb.a" is much faster. They are both built
debug; "hdb.so" is about 25 MB and "hdb.a" is around 85 MB.
The performance test is such that the application runs the same code
over and over (so the impact of fixups for the shared library should
not be a factor).
Using Polycenter, we don't see any differences in other performance
metrics (e.g., paging, swapping, etc.) except in CPU.
Do you have any information about shared libraries having this type of
negative performance impact?
------------------------------------------------------------------------
I replied and basically gave him some of the info, not directly, that
Mike had replied with. Here is his response.
Hi Lee,
Tx for the info. I am aware of some of the memory trade-offs (i.e.,
location of text, use of -taso, etc.) and image start-up (e.g.,
using so_locations, which I guess is "quickstart"). But it's
the actual amount of CPU being used by a running application
(that's running the same code in a repetitive manner) that's
my main question. We're seeing a 2-3x improvement when we use
the archive over the shared library.
Also, in the text that you included it is mentioned:
> overhead can grow at a faster rate than the savings (and remember that
> -non_shared shares .text too!).
Is non-shared .text shared too? For example, if 2 apps link to our
big archive (e.g., hdb.a), the archive is mostly in .text. Are these
two applications actually sharing the same copy of hdb.a in physical
memory? I didn't think this could be done, but if it is then we don't
need a shared library to reduce memory consumption.
|
8357.10 | | QUARRY::neth | Craig Neth | Tue Jan 28 1997 10:03 | 12 |
| >I can check with them or QAR
>this for them if someone thinks it would be helpful.
Yes, I think it would be helpful if you could QAR this - make sure you include
the reproducers. Something sounds very fishy - shared libraries should not
be 2-3x slower than non-shared.
>Is non-shared .text shared too?
Yes.
|
8357.11 | Shared for a single app | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Tue Jan 28 1997 10:42 | 18 |
| > >Is non-shared .text shared too?
>
> Yes.
Let's be careful about what question is being answered.
If two different applications, A and B, are linked with
the same shared library, then no matter how many users
are running A and B simultaneously, there's only one copy
of the shared library's text in memory, and one copy each
of the text of A and the text of B.
If A and B are linked non-shared (against an archive version
of the same library), then the text of A and its library
routines are shared among all the users of A; the text of
B and its library routines are shared among all the users
of B. If there's one user of A and one user of B, nothing
is shared between them.
|
8357.12 | note collision | SMURF::RICKABAUGH | Mike Rickabaugh Quo flamma est? | Tue Jan 28 1997 13:01 | 139 |
| >Is non-shared .text shared too?
Yes it is... but in the situation you asked about, with hdb.a, the
answer is no.
The sharing always occurs at the granularity of a ZMAGIC file.
If you create a1.out and a2.out where both of them are linked
-non_shared against hdb.a, there will be no sharing between the
two even though they linked against hdb.a.
What I was referring to about .text being shared with -non_shared was
when you have N processes on the system all running a1.out. The .text
region for a1.out will be shared across all N processes running it--not
with the processes running a2.out.
All executables (except for the kernel) are ZMAGIC files. Every
time a ZMAGIC file is loaded into a process,
the .text region is shared with other processes of that ZMAGIC file.
Now when you pull in hdb.a linking -non_shared, all of its .text gets
combined with the other modules giving a single .text region. It is that
.text that gets shared--not just the hdb.a part.
Shared libraries are also ZMAGIC files. When a process loads in
a shared library, it shares that .text with all of the other processes
that loaded that file. However, shared libraries aren't all of the
program's .text--only that library's portion.
What shared libraries try to do is to allow you to partition
the code into shareable partitions (ZMAGIC files) so that all
different programs that use that library can share that library's
.text. The belief is that a library is a reasonable point to
develop such partitions since most libraries provide interfaces that
are stable from consumer to consumer.
The other benefit to such partitioning is that it allows libraries
to be updated independently from the consumers.
In terms of the .text sharing benefit, you gain with shared libraries
when the application references a lot of common code within the library
that would be referenced many times by all of the other *dissimilar*
processes on the system. The libc.so library is an example of this.
Almost every dissimilar running process needs common code in libc.
The X shared libraries are another example.
(I use the word "dissimilar" because if the running processes are the
same running image, then it's better to use the -non_shared .text
sharing since it will only pull from the library what is needed.)
There is some .text overhead to make sharing work. The overhead
tends to grow linearly with the number of global symbols referenced/
defined in the main image (for example, the .dynsym and .dynstr
sections). This usually means that the overhead grows linearly
with the size of a program. But a program can only use so
much of a library (at most all of it)--that means there is a
fixed amount of .text, and hence a maximum sharing benefit. At some
point, the linearly increasing overhead will overtake this sharing
benefit. That is why I made the statement a number of replies back
about large programs losing with shared libraries.
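The trade-off above can be put into a toy model. All the constants below are invented for illustration, not measurements: sharing saves at most the library's .text size, while the dynamic-symbol overhead grows with the number of global symbols in the main image.

```c
/* Toy model of the trade-off described above. The constants are
 * invented for illustration: sharing a library's .text saves at most
 * max_saving_mb, while .dynsym/.dynstr overhead costs
 * overhead_per_ksym_mb for every 1000 global symbols in the image. */
static double net_saving_mb(double max_saving_mb,
                            double overhead_per_ksym_mb,
                            int ksymbols)
{
    return max_saving_mb - overhead_per_ksym_mb * (double)ksymbols;
}
```

With these invented numbers (25 MB of shareable .text, 0.5 MB of overhead per thousand symbols), the sharing benefit is gone once the main image carries about 50,000 global symbols; beyond that the overhead wins, which is the point about large programs above.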
However, this overhead problem will show itself in the paging/system
resources needed on the system (at link time and at application
runtime), and in the CPU and elapsed time needed to process any symbol
resolution dynamically (quickstart tries to address this).
-mike
|
8357.13 | Now it's much more clear | RHETT::PARKER | | Tue Jan 28 1997 13:35 | 44 |
|
Thanks for the clarification. I followed most of it! :-)
I explained to the customer that there must be something else
going on since using shared versus non-shared libraries in itself
would not result in only greater CPU usage for shared. I also
asked if he would/could provide us with a reproducer to investigate
the problem further. Here is his reply. If they can somehow work it
out, I'll follow up with a QAR. I hope they can because it looks
like shared libraries are getting blamed for causing a problem that
they may not be responsible for.
Many thanks to all !!
------------------------------------------------------------------------
A couple of follow-ups:
1. Our application is pretty big because it includes not only our
in-house stuff but also two other third-party packages (Rogue Wave's
C++ class library and Object Design Inc.'s object-oriented database, ObjectStore).
The class library is not a big deal, but the OODB is, because you need the
db server, etc. I'll have to think about this for a bit.
2. The question I had about non-shared .text being shared might not
have been very clear. Let's say we have two totally different applications
and they both link to a huge archive -- say, a 500 MB archive. Also assume
that each application touches all of the archive. Now, each application has
the archive scattered in its own non-shared .text segment. If I run both
at the same time, won't I need 1 GB of memory if I don't want any paging/
swapping?
In other words, I doubt that code in non-shared .text segments for different
programs can be shared -- only when different users run the same program
is the non-shared .text shared.
>> I responded basically giving him the same information that Webb
>> did a few replies back. In this case, as mentioned by Mike, the
>> hdb.a .text region will not be shared because they have two totally
>> different applications linking to it.
|
8357.14 | | QUARRY::neth | Craig Neth | Tue Jan 28 1997 14:19 | 15 |
| >> I responded basically giving him the same information that Webb
>> did a few replies back. In this case, as mentioned by Mike, the
>> hdb.a .text region will not be shared because they have two totally
>> different applications linking to it.
You got it exactly right.
>I also asked if he would/could provide us with a reproducer to investigate
>the problem further. Here is his reply. If they can somehow work it
>out, I'll follow up with a QAR. I hope they can because it looks
>like shared libraries are getting blamed for causing a problem that
>they may not be responsible for.
Well, something fishy is going on, that's for sure. We'd like the reproducer
so we can find and hopefully fix whatever is causing this.
|
8357.15 | Be thankful Digital UNIX has shared libraries! | VAXCPU::michaud | Jeff Michaud - ObjectBroker | Wed Jan 29 1997 01:51 | 10 |
| And for those too young to remember, in the days before shared
libraries (i.e., ULTRIX), DECwindows and other products used poor
man's sharing by putting a bunch of programs together into one
giant executable; main() would branch to the appropriate
main program based on argv[0], and there were multiple hard or soft links
to the giant executable.
This is because the DECwindows libraries were so huge that having
real distinct executables whose .text segments couldn't be shared
would have dragged even a DECstation to its knees.
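This argv[0] trick still survives today in multi-call binaries such as busybox. A minimal sketch, with made-up program names, where each link to the binary selects a different entry point:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the argv[0] trick described above (the program names
 * "dxclock" and "dxcalc" are made up). One executable holds every
 * program's code; hard or soft links give it several names, and the
 * name it was invoked under picks the real main routine. */
static int clock_main(void) { puts("running the clock"); return 0; }
static int calc_main(void)  { puts("running the calculator"); return 0; }

static int dispatch(const char *argv0)
{
    const char *name = strrchr(argv0, '/');
    name = name ? name + 1 : argv0;      /* strip any directory part */

    if (strcmp(name, "dxclock") == 0)
        return clock_main();
    if (strcmp(name, "dxcalc") == 0)
        return calc_main();
    fprintf(stderr, "unknown program name: %s\n", name);
    return 1;
}
```

The real main() would then just be `return dispatch(argv[0]);`, so invoking the same file through different links runs different programs while all of them share one .text segment.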
|
8357.16 | More info | TAEC::SABIANI | | Fri Jan 31 1997 11:54 | 40 |
| One more update about my specific problem with the startup time of an
application linked with (a lot of) shared libraries.
The good news is that this has nothing to do with quickstart.
In fact, the third-party libraries I'm using are accessing the
shared libraries the program is linked with.
I got some source code from them if you are interested.
To summarize:
void wuEnumDLLs(char *argv0, void (*pfn)())
{
    /* first is "MAIN", so skip it */
    char *cp = _rld_first_pathname();

    while ((cp = _rld_next_pathname()) != NULL) {
        (*pfn)(cp);
    }
}
The function that is called (*pfn) processes the .so using dlsym (and dlopen/dlclose).
The dlopen call is very, very slow (between 1 and 2 s for one library), which
explains why the startup of the application is that long (I have 36
shared libraries!).
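One way to confirm where the time goes is to clock each dlopen individually. A hedged sketch using only standard dlfcn and gettimeofday (pass in whichever library path you want to measure; RTLD_LAZY matches the snippet above):

```c
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>

/* Hedged sketch: wall-clock seconds spent in a single dlopen of the
 * named library. The handle is closed again afterwards, so repeated
 * calls measure a fresh open only if the library was not already a
 * dependency of the process. */
static double dlopen_seconds(const char *path)
{
    struct timeval t0, t1;
    void *h;

    gettimeofday(&t0, NULL);
    h = dlopen(path, RTLD_LAZY);
    gettimeofday(&t1, NULL);

    if (h == NULL)
        fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
    else
        dlclose(h);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
}
```

Calling this once per library from the loop above would show whether the 1-2 s cost is uniform across all 36 libraries or concentrated in a few of them.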
Now they say (at the third-party company) that dlopen is much more efficient
on their system, which is quite surprising since I have an AlphaStation 255 and
they have a DEC 3000.
We are both running DUnix 3.2x (I'm using 3.2C, but I'm not sure about their exact
version).
Hope you understand most of what I'm saying...
Is there a way to optimize the dlopen call? Can someone help me track
this problem down?
Stephane.
|
8357.17 | dlopen performance | AOSG::LOWELL | | Mon Feb 03 1997 12:14 | 35 |
|
V4.0 includes major improvements in dlopen() time; however, no attempt
has yet been made to speed up dlclose(). Prior to V4.0, the speed of these
routines varies with the number of libraries already loaded
by a process. In V4.0 the speed of dlopen() has very little to do with
the number of libraries already loaded. For the test cases I used to
verify my changes to dlopen(), calls that previously took 1-2 seconds
were reduced to noise level.
I don't know if it's reasonable for the particular set of libraries
you're dealing with, but if it's always dlopen'ing the same 36 libraries
in the same order there are some tricks you could employ to speed it up.
1. Assuming you can rebuild the application, include those 36 libraries
on the link line.
2. If you can't rebuild the application, you can still instruct the
loader to include those 36 libraries as additional dependencies
when you invoke the application using the environment variable
_RLD_LIST
_RLD_LIST="DEFAULT:library1.so:library2.so:library3.so"
This example will append the three shared library dependencies
listed to any executable you invoke from shell in which the env
var is set.
If you could use either of the two solutions above, this would allow the
loader to do symbol resolution once and be done with it. The program
could still do dlopen's and dlclose's of those added dependencies. The
only difference would be that no new libraries would ever have to be loaded
or unloaded. dlopen would increment reference counts and return a handle.
dlclose would decrement reference counts and return.
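The reference-count behavior described above can be checked directly with standard dlfcn calls. In this sketch "libm.so.6" is just a stand-in library name (on Digital UNIX the names differ):

```c
#include <dlfcn.h>
#include <stddef.h>

/* Sketch of the behavior described above: a second dlopen of an
 * already-loaded library just bumps a reference count and returns
 * the same handle; dlclose drops the count, and the library stays
 * loaded while any reference remains. */
static int dlopen_is_refcounted(const char *path)
{
    void *h1 = dlopen(path, RTLD_LAZY);
    void *h2 = dlopen(path, RTLD_LAZY);   /* no new load happens here */
    int same = (h1 != NULL && h1 == h2);

    if (h2 != NULL)
        dlclose(h2);                      /* library is still mapped */
    if (h1 != NULL)
        dlclose(h1);
    return same;
}
```

This is why pre-loading the 36 libraries as dependencies helps: the later dlopen calls degenerate into cheap reference-count updates instead of real loads.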
|
8357.18 | The end of the headache ?? | TAEC::SABIANI | | Tue Feb 04 1997 05:15 | 15 |
| Thank you for your quick answer.
We have finally found something interesting...
When we set:
_RLD_LIST=/usr/shlib/libc.so:DEFAULT
it makes our program just fly (it takes about 2 s to complete instead of
1 minute!!).
Apparently -lc should be loaded before -lc_r!
Any comment is welcome,
Stephane.
|
8357.19 | You don't need correctness, right? ;-) | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Feb 04 1997 10:17 | 18 |
| .18> Apparently -lc should be loaded before -lc_r !
Only if you want unreliable results!!
If your image is linked against libc_r, it means that your code uses or is
prepared to be used by threads. If you load libc first, it will short
circuit (i.e., preempt) the thread-safe functions provided by libc_r,
replacing them with ones which are not thread-safe.
It's a well-known fact that parallel programs run much faster if you remove
all that nasty synchronization. However, they also have a tendency to
produce unreliable results (assuming that they don't just crash outright) if
you remove the synchronization.
So, which do you want: correct, reliable operation or whizzy performance? :-)
Webb
|
8357.20 | Ok, the Champagne bottle is back in the fridge... ;-(( | TAEC::SABIANI | | Tue Feb 04 1997 13:03 | 75 |
| Hi,
Why did I check if there was a new reply to this !@#$%^&* note !!! ;-))
>>You don't need correctness, right?
I have some problems with your definition of 'correctness' (see below).
ok, here is the nasty parallel program :
----------------------------------------------------------------------
#include <stdio.h>
#include <dlfcn.h>

void wuEnumDLLs(char *argv0)
{
    /* first is "MAIN", so skip it */
    char *cp = _rld_first_pathname();

    while ((cp = _rld_next_pathname()) != NULL)
    {
        printf("Opening : %s \n", cp);
        (void) dlopen(cp, RTLD_LAZY);
    }
}

int main(int argc, char *argv[])
{
    printf("Starting\n");
    wuEnumDLLs(argv[0]);
    return 0;
}
----------------------------------------------------------------------
I have linked it with a number of random libs :
----------------------------------------------------------------------
$ odump -Dl wutest4
***LIBRARY LIST SECTION***
Name Time-Stamp CheckSum Flags Version
wutest4:
libpthreads.so Jul 25 03:25:47 1995 0xdc0717c3 0 osf.1
libc_r.so Jul 25 03:22:17 1995 0x78d1b6fe 0 osf.1
libX11.so Jul 25 05:36:43 1995 0x4bd08197 0
libXm.so Jul 25 06:37:59 1995 0x41912f7a 0 motif1.2
libDXm.so Jul 25 07:35:09 1995 0x37e3646a 0 motif1.2
librt.so Jul 25 03:34:14 1995 0xb00f11fd 0 osf.1
libXt.so Jul 25 05:49:16 1995 0x5d677bf7 0
libm.so Jul 25 02:57:33 1995 0x1cac520b 0 osf.1
libsys5.so Jul 25 03:02:00 1995 0x68a59465 0 osf.1
libtli.so Jul 25 03:33:03 1995 0x5d1cde50 0 osf.1
libdna.so Sep 27 21:20:58 1995 0xdc09e256 0
libcxx.so Nov 30 15:02:03 1995 0x1ee4c01f 0
libcurses.so Jul 25 03:32:03 1995 0xfbc5d81d 0 osf.1
libcomplex.so Nov 30 15:02:05 1995 0xb0d4c339 0
libcmalib.so Jul 25 03:32:24 1995 0x0caef850 0 osf.1
libcda.so Jul 25 03:00:08 1995 0xc5ef4fae 0 osf.1
libcdrom.so Jul 25 03:34:22 1995 0x2f360a2d 0 osf.1
libbkr.so Jul 25 07:28:55 1995 0x6a5f910a 0 motif1.2
libaud.so Jul 25 03:02:09 1995 0x47a87493 0 osf.1
libdvr.so Jul 25 07:50:18 1995 0x65f39f83 0 motif1.2
libdvs.so Jul 25 03:01:35 1995 0xebdb7aeb 0 osf.1
libftam.so Aug 22 20:45:47 1995 0x0e07ed60 0
libc.so Jul 25 02:57:21 1995 0xec02d574 0 osf.1
----------------------------------------------------------------------
The results are on my system :
- 2s with the hack (see .18)
- 35s without the hack
35 s can seem acceptable, but it is not (it is even worse when we link this
program with our own libraries: ~1 minute).
I still don't understand what happens exactly... is this a bug or what?
Stephane.
PS : I'm still using DUnix V3.2C.
|
8357.21 | | QUARRY::neth | Craig Neth | Tue Feb 04 1997 13:38 | 16 |
| What Webb is saying is that the threads libraries, libc_r and libc, *must*
be loaded in a certain order or threads just won't work right. If you're
not using threads, you shouldn't be linking against the threads libraries.
If you are, you need to make sure your link line (or your _RLD_LIST, if
you're going to do that) has the libraries in the correct order,
or you'll get bad results. In particular, libc_r.so MUST be loaded before
libc.so.
Your test program isn't threaded so it's not going to see any differences
in behavior.
FWIW, your test program runs in under 1 second on V4.0B no matter what order
the libraries are loaded in...
|
8357.22 | | TAEC::BALLADELLI | Surfing with the Alien | Wed Feb 05 1997 02:28 | 25 |
| >================================================================================
>Note 8357.21 Shared libraries and startup time 21 of 21
>QUARRY::neth "Craig Neth" 16 lines 4-FEB-1997 13:38
>--------------------------------------------------------------------------------
>
>What Webb is saying is that the threads libraries, libc_r and libc *must*
>be loaded in a certain order or threads just won't work right. If you're
>not using threads, you shouldn't be linking against the threads libraries.
>If you are, you need to make sure your link line (or your _RLD_LIST, if
>you're going to do that) needs to have the libraries in the correct order,
>or you'll get bad results. In particular, libc_r.so MUST be loaded before
>libc.so.
The fact that dlopen takes so long when libc_r is loaded first is still puzzling,
though, no matter whether we're using threads or not. Is this a bug?
>FWIW, your test program runs in under 1 second on V4.0B no matter what order
>the libraries are loaded in...
This is probably the real piece of good news, since our product will ship on DUnix
V4.0 *and* will use threads.
Regards,
Micky
|
8357.23 | | TAEC::SABIANI | | Wed Feb 05 1997 03:08 | 14 |
| >What Webb is saying is that the threads libraries, libc_r and libc
>*must* be loaded in a certain order or threads just won't work right.
Yes, maybe this was not clear enough in my previous reply but we do
not plan to use this hack with _RLD_LIST. BTW, this hack pinpoints
something that looks like a bug (that seems fixed on V4).
Can we expect a patch or a workaround for DUnix V3.2 (the same problem
exists on V3.2G) ? Does it help if we QAR it ?
Stephane.
|
8357.24 | | QUARRY::neth | Craig Neth | Wed Feb 05 1997 09:48 | 6 |
| > Can we expect a patch or a workaround for DUnix V3.2 (the same problem
> exists on V3.2G) ? Does it help if we QAR it ?
| If you file a QAR with a reproducer, we will look into it and see what can
be done for V3.2x, but didn't .22 just say you would be shipping on V4.0?
Fiddling around with V3.2x and making patches isn't free....
|
8357.25 | | TAEC::SABIANI | | Thu Feb 06 1997 04:52 | 28 |
| >but didn't .22 just say you would be shipping on V4.0?
I hope we will ship on V4 too but we won't ship tomorrow and
we will certainly have to live with this bug for a while (i.e. on
V3.2x).
Maybe we should stop beating around the bush. This is a problem we
have when we use a third-party (Bristol) product on Digital UNIX. We had
to answer the following questions:
1) Is this a bug in the third-party product ?
2) Is this a bug in Digital Unix ?
3) Is there a work-around ?
If I read between the lines, I get :
1) No
2) Yes
3) No (at least not yet but we are still interested)
Bristol is ready to help us but we do not have much information to give
them and the problem is on Digital's side (our side too).
We are continuing our investigations and will try to avoid the DUnix V3.2
patch. We simply thought that it would be simpler for DUnix
Engineering (since the bug was identified and fixed in V4 and since you
have the code) to tell us what solutions we have.
There is enough information in the previous notes to identify the problem.
|
8357.26 | | QUARRY::neth | Craig Neth | Thu Feb 06 1997 09:28 | 40 |
| I'm sorry, I'm not trying to antagonize you. My apologies if my
statements have come off that way.
>We simply thought that it would be simplier for DUnix
> Engineering (since the bug was identified and fixed in V4 and since you
> have the code) to tell us what solutions we have.
As far as I know there is no 'bug fix'. The loader was extensively
modified as part of V4.0 development and one of the benefits of these
modifications was an increase in dlopen() performance. We have not
spent any time evaluating what it would take to move those improvements
back into the V3.2 stream.
In general, patches have a high cost for engineering - there are several
V3.2x streams that would need to be verified, and plenty of paperwork to do.
We are more than willing to do it if there is a documented business impact
(or some sort of obvious disaster), but perhaps you can understand that we
don't undertake that process when the impact is less clear.
If this is a big issue for your product (and it sounds like it might be),
please take the time to open a formal problem report and we'll work with
you to resolve it to the best of our ability.
Craig
|
8357.27 | | SMURF::LOWELL | | Fri Feb 14 1997 10:44 | 12 |
| I'd like to back up Craig's answer, since I'm the one that made
the performance improvements to the V4.0 dlopen.
No. The slowdown you saw in the libc_r V3.2x dlopen was not a bug.
Dynamic symbol resolution for dlopen is simply time-consuming when
there are lots of shared library dependencies. The dlopen in libc_r
is the "multi-threaded" dlopen, which imposes immediate binding
mode. Prior to V4.0, immediate binding can be much slower than
lazy binding. In V4.0, dlopen becomes equally fast for both lazy and
immediate binding.
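The lazy vs. immediate distinction described here maps directly onto the standard dlfcn flags. A small sketch ("libm.so.6" is a stand-in library name; Digital UNIX names differ):

```c
#include <dlfcn.h>
#include <stddef.h>

/* RTLD_NOW resolves every undefined symbol before dlopen returns --
 * the mode the libc_r ("multi-threaded") dlopen imposes -- while
 * RTLD_LAZY defers resolution until each function is first called.
 * With many dependencies, that up-front resolution work is what made
 * the pre-V4.0 immediate-binding opens slow. */
static void *open_immediate(const char *path)
{
    return dlopen(path, RTLD_NOW);
}

static void *open_deferred(const char *path)
{
    return dlopen(path, RTLD_LAZY);
}
```

Both calls return the same kind of handle; the difference is only when the symbol-resolution cost is paid.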
Randy
|