T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
8357.1 | Our customers don't, why should we??!!! | WTFN::SCALES | Despair is appropriate and inevitable. | Mon Jan 06 1997 11:00 | 9 |
8357.2 | | LABC::RU | | Mon Jan 06 1997 14:09 | 7 |
8357.3 | | QUARRY::neth | Craig Neth | Mon Jan 06 1997 14:46 | 7 |
8357.4 | Shared library performance is no justification. | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Jan 07 1997 10:04 | 13 |
8357.5 | there is a cost | SMURF::RICKABAUGH | Mike Rickabaugh Quo flamma est? | Tue Jan 07 1997 16:25 | 41 |
8357.6 | | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Jan 07 1997 18:02 | 46 |
8357.7 | | NETRIX::"[email protected]" | [email protected] | Fri Jan 10 1997 13:47 | 12 |
8357.8 | An update | TAEC::SABIANI | | Tue Jan 14 1997 06:40 | 28 |
8357.9 | case study? | RHETT::PARKER | | Mon Jan 27 1997 17:29 | 64 |
|
Hi Folks,
Just to add a real-world example of why one customer is going to
stick with static archive libraries, I'm placing this note here.
Since these folks are a Partner, they might be willing to work
with us on improving the situation. I can check with them or QAR
this for them if someone thinks it would be helpful.
Lee
-----------------------------------------------------------------
We're currently on Digital UNIX 3.2e. We have various applications
which use an in-house C++ class library. To use memory more
efficiently, we placed our class library in a Unix shared library. For
purposes of this message I'll call the archive version of our class
library "hdb.a" and our shared library version "hdb.so".
When doing a performance analysis of our apps (using Polycenter
and our own internal clocks measuring CPU), we've noticed a
dramatic difference in CPU time between an application linked
to "hdb.a" and the same application linked to "hdb.so". The
application linked to "hdb.a" is much faster. They are both built
debug; "hdb.so" is about 25 MB and "hdb.a" is around 85 MB.
The performance test is such that the application runs the same code
over and over (so the impact of fixups for the shared library should
not be a factor).
Using Polycenter, we don't see any differences in other performance
metrics (e.g., paging, swapping, etc.) except in CPU.
Do you have any information about shared libraries having this type of
negative performance impact?
------------------------------------------------------------------------
I replied and basically gave him some of the info, not directly, that
Mike had replied with. Here is his response.
Hi Lee,
Tx for the info. I am aware of some of the memory trade-offs (i.e.,
location of text, use of -taso, etc.) and image start-up (e.g.,
using so_locations, which I guess is "quickstart"). But it's
the actual amount of CPU being used by a running application
(that's running the same code in a repetitive manner) that's
my main question. We're seeing a 2-3x improvement when we use
the archive over the shared library.
Also, in the text that you included it is mentioned:
> overhead can grow at a faster rate than the savings (and remember that
> -non_shared shares .text too!).
Is non-shared .text shared too? For example, if 2 apps link to our
big archive (e.g., hdb.a), the archive is mostly in .text. Are these
two applications actually sharing the same copy of hdb.a in physical
memory? I didn't think this could be done, but if it is then we don't
need a shared library to reduce memory consumption.
|
8357.10 | | QUARRY::neth | Craig Neth | Tue Jan 28 1997 10:03 | 12 |
| >I can check with them or QAR
>this for them if someone thinks it would be helpful.
Yes, I think it would be helpful if you could QAR this - make sure you include
the reproducers. Something sounds very fishy - shared libraries should not
be 2-3x slower than non-shared.
>Is non-shared .text shared too?
Yes.
|
8357.11 | Shared for a single app | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Tue Jan 28 1997 10:42 | 18 |
| > >Is non-shared .text shared too?
>
> Yes.
Let's be careful about what question is being answered.
If two different applications, A and B, are linked with
the same shared library, then no matter how many users
are running A and B simultaneously, there's only one copy
of the shared library's text in memory, and one copy each
of the text of A and the text of B.
If A and B are linked non-shared (against an archive version
of the same library), then the text of A and its library
routines are shared among all the users of A; the text of
B and its library routines are shared among all the users
of B. If there's one user of A and one user of B, nothing
is shared between them.
|
8357.12 | note collision | SMURF::RICKABAUGH | Mike Rickabaugh Quo flamma est? | Tue Jan 28 1997 13:01 | 139 |
| >Is non-shared .text shared too?
Yes it is... but in the situation you asked about, with hdb.a, the
answer is no.
The sharing always occurs at the granularity of a ZMAGIC file.
If you create a1.out and a2.out where both of them are linked
-non_shared against hdb.a, there will be no sharing between the
two even though they linked against hdb.a.
What I was referring to about .text being shared with -non_shared was
when you have N processes on the system all running a1.out. The .text
region for a1.out will be shared across all N processes running it--not
with the processes running a2.out.
All executables (except for the kernel) are ZMAGIC files. Every
time a ZMAGIC file is loaded into a process,
the .text region is shared with other processes of that ZMAGIC file.
Now when you pull in hdb.a linking -non_shared, all of its .text gets
combined with the other modules giving a single .text region. It is that
.text that gets shared--not just the hdb.a part.
Shared libraries are also ZMAGIC files. When a process loads in
a shared library, it shares that .text with all of the other processes
that loaded that file. However, shared libraries aren't all of the
program's .text--only that library's portion.
What shared libraries try to do is to allow you to partition
the code into shareable partitions (ZMAGIC files) so that all
different programs that use that library can share that library's
.text. The belief is that a library is a reasonable point to
develop such partitions since most libraries provide interfaces that
are stable from consumer to consumer.
The other benefit to such partitioning is that it allows libraries
to be updated independently from the consumers.
In terms of the .text sharing benefit, you gain with shared libraries
when the application references a lot of common code within the library
that would be referenced many times by all of the other *dissimilar*
processes on the system. The libc.so library is an example of this.
Almost every dissimilar running process needs common code in libc.
The X shared libraries are another example.
(I use the word "dissimilar" because if the running processes are the
same running image, then it's better to use the -non_shared .text
sharing since it will only pull from the library what is needed.)
There is some .text overhead to make sharing work. The overhead
tends to grow linearly with the number of global symbols referenced/
defined in the main image (for example, the .dynsym and .dynstr
sections). This usually means that the overhead grows linearly
with the size of a program. But a program can only use so
much of a library (at most all of it)--that means there is a
fixed amount of .text, and hence a maximum sharing benefit. At some
point, the linearly increasing overhead will overtake this sharing
benefit. That is why I made the statement a number of replies back
about large programs losing with shared libraries.
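The trade-off above can be put into a toy model. All the constants below are invented for illustration, not measurements: sharing saves at most the library's .text size, while the dynamic-symbol overhead grows with the number of global symbols in the main image.

```c
/* Toy model of the trade-off described above. The constants are
 * invented for illustration: sharing a library's .text saves at most
 * max_saving_mb, while .dynsym/.dynstr overhead costs
 * overhead_per_ksym_mb for every 1000 global symbols in the image. */
static double net_saving_mb(double max_saving_mb,
                            double overhead_per_ksym_mb,
                            int ksymbols)
{
    return max_saving_mb - overhead_per_ksym_mb * (double)ksymbols;
}
```

With these invented numbers (25 MB of shareable .text, 0.5 MB of overhead per thousand symbols), the sharing benefit is gone once the main image carries about 50,000 global symbols; beyond that the overhead wins, which is the point about large programs above.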
However, this overhead problem will show itself in the paging/system
resources needed on the system (at link time and at application
runtime), and in the CPU and elapsed time needed to process any symbol
resolution dynamically (quickstart tries to address this).
-mike
|
8357.13 | Now it's much more clear | RHETT::PARKER | | Tue Jan 28 1997 13:35 | 44 |
|
Thanks for the clarification. I followed most of it! :-)
I explained to the customer that there must be something else
going on since using shared versus non-shared libraries in itself
would not result in only greater CPU usage for shared. I also
asked if he would/could provide us with a reproducer to investigate
the problem further. Here is his reply. If they can somehow work it
out, I'll follow up with a QAR. I hope they can because it looks
like shared libraries are getting blamed for causing a problem that
they may not be responsible for.
Many thanks to all !!
------------------------------------------------------------------------
A couple of follow-ups:
1. Our application is pretty big because it includes not only our
in-house stuff but also two other third-party packages (Rogue Wave's
C++ class library and Object Design Inc.'s object-oriented database, ObjectStore).
The class library is not a big deal, but the OODB is, because you need the
db server, etc. I'll have to think about this for a bit.
2. The question I had about non-shared .text being shared might not
have been very clear. Let's say we have two totally different applications
and they both link to a huge archive -- say, a 500 MB archive. Also assume
that each application touches all of the archive. Now, each application has
the archive scattered in its own non-shared .text segment. If I run both
at the same time, won't I need 1 GB of memory if I don't want any paging/
swapping?
In other words, I doubt that code in non-shared .text segments for different
programs can be shared -- only when different users run the same program
is the non-shared .text shared.
>> I responded basically giving him the same information that Webb
>> did a few replies back. In this case, as mentioned by Mike, the
>> hdb.a .text region will not be shared because they have two totally
>> different applications linking to it.
|
8357.14 | | QUARRY::neth | Craig Neth | Tue Jan 28 1997 14:19 | 15 |
| >> I responded basically giving him the same information that Webb
>> did a few replies back. In this case, as mentioned by Mike, the
>> hdb.a .text region will not be shared because they have two totally
>> different applications linking to it.
You got it exactly right.
>I also asked if he would/could provide us with a reproducer to investigate
>the problem further. Here is his reply. If they can somehow work it
>out, I'll follow up with a QAR. I hope they can because it looks
>like shared libraries are getting blamed for causing a problem that
>they may not be responsible for.
Well, something fishy is going on, that's for sure. We'd like the reproducer
so we can find and hopefully fix whatever is causing this.
|
8357.15 | Be thankful Digital UNIX has shared libraries! | VAXCPU::michaud | Jeff Michaud - ObjectBroker | Wed Jan 29 1997 01:51 | 10 |
| And for those too young to remember, in the days before shared
libraries (i.e., ULTRIX), DECwindows and other products used poor
man's sharing by putting a bunch of programs together into one
giant executable; main() would branch to the appropriate
main program based on argv[0], and there were multiple hard or soft links
to the giant executable.
This is because the DECwindows libraries were so huge that having
real distinct executables whose .text segments couldn't be shared
would have dragged even a DECstation to its knees.
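This argv[0] trick still survives today in multi-call binaries such as busybox. A minimal sketch, with made-up program names, where each link to the binary selects a different entry point:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the argv[0] trick described above (the program names
 * "dxclock" and "dxcalc" are made up). One executable holds every
 * program's code; hard or soft links give it several names, and the
 * name it was invoked under picks the real main routine. */
static int clock_main(void) { puts("running the clock"); return 0; }
static int calc_main(void)  { puts("running the calculator"); return 0; }

static int dispatch(const char *argv0)
{
    const char *name = strrchr(argv0, '/');
    name = name ? name + 1 : argv0;      /* strip any directory part */

    if (strcmp(name, "dxclock") == 0)
        return clock_main();
    if (strcmp(name, "dxcalc") == 0)
        return calc_main();
    fprintf(stderr, "unknown program name: %s\n", name);
    return 1;
}
```

The real main() would then just be `return dispatch(argv[0]);`, so invoking the same file through different links runs different programs while all of them share one .text segment.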
|
8357.16 | More info | TAEC::SABIANI | | Fri Jan 31 1997 11:54 | 40 |
| One more update about my specific problem with the startup time of an
application linked with (a lot of) shared libraries.
The good news is that this has nothing to do with quickstart.
In fact, the third-party libraries I'm using are accessing the
shared libraries the program is linked with.
I got some source code from them if you are interested.
To summarize:
void wuEnumDLLs(char *argv0, void (*pfn)())
{
    /* first is "MAIN", so skip it */
    char *cp = _rld_first_pathname();

    while ((cp = _rld_next_pathname()) != NULL) {
        (*pfn)(cp);
    }
}
The function that is called (*pfn) processes the .so using dlsym (and dlopen/dlclose).
The dlopen call is very, very slow (between 1 and 2 s for one library), which
explains why the startup of the application is that long (I have 36
shared libraries!).
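One way to confirm where the time goes is to clock each dlopen individually. A hedged sketch using only standard dlfcn and gettimeofday (pass in whichever library path you want to measure; RTLD_LAZY matches the snippet above):

```c
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>

/* Hedged sketch: wall-clock seconds spent in a single dlopen of the
 * named library. The handle is closed again afterwards, so repeated
 * calls measure a fresh open only if the library was not already a
 * dependency of the process. */
static double dlopen_seconds(const char *path)
{
    struct timeval t0, t1;
    void *h;

    gettimeofday(&t0, NULL);
    h = dlopen(path, RTLD_LAZY);
    gettimeofday(&t1, NULL);

    if (h == NULL)
        fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
    else
        dlclose(h);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
}
```

Calling this once per library from the loop above would show whether the 1-2 s cost is uniform across all 36 libraries or concentrated in a few of them.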
Now they say (at the third-party company) that dlopen is much more efficient
on their system, which is quite surprising since I have an AlphaStation 255 and
they have a DEC 3000.
We are both running DUnix 3.2x (I'm using 3.2C, but I'm not sure about their exact
version).
Hope you understand most of what I'm saying...
Is there a way to optimize the dlopen call? Can someone help me track
this problem down?
Stephane.
|
8357.17 | dlopen performance | AOSG::LOWELL | | Mon Feb 03 1997 12:14 | 35 |
|
V4.0 includes major improvements in dlopen() time; however, no attempt
has yet been made to speed up dlclose(). Prior to V4.0, the speed of these
routines varies with the number of libraries already loaded
by a process. In V4.0 the speed of dlopen() has very little to do with
the number of libraries already loaded. For the test cases I used to
verify my changes to dlopen(), calls that previously took 1-2 seconds
were reduced to noise level.
I don't know if it's reasonable for the particular set of libraries
you're dealing with, but if it's always dlopen'ing the same 36 libraries
in the same order there are some tricks you could employ to speed it up.
1. Assuming you can rebuild the application, include those 36 libraries
on the link line.
2. If you can't rebuild the application, you can still instruct the
loader to include those 36 libraries as additional dependencies
when you invoke the application using the environment variable
_RLD_LIST
_RLD_LIST="DEFAULT:library1.so:library2.so:library3.so"
This example will append the three shared library dependencies
listed to any executable you invoke from shell in which the env
var is set.
If you could use either of the two solutions above, this would allow the
loader to do symbol resolution once and be done with it. The program
could still do dlopen's and dlclose's of those added dependencies. The
only difference would be that no new libraries would ever have to be loaded
or unloaded. dlopen would increment reference counts and return a handle.
dlclose would decrement reference counts and return.
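The reference-count behavior described above can be checked directly with standard dlfcn calls. In this sketch "libm.so.6" is just a stand-in library name (on Digital UNIX the names differ):

```c
#include <dlfcn.h>
#include <stddef.h>

/* Sketch of the behavior described above: a second dlopen of an
 * already-loaded library just bumps a reference count and returns
 * the same handle; dlclose drops the count, and the library stays
 * loaded while any reference remains. */
static int dlopen_is_refcounted(const char *path)
{
    void *h1 = dlopen(path, RTLD_LAZY);
    void *h2 = dlopen(path, RTLD_LAZY);   /* no new load happens here */
    int same = (h1 != NULL && h1 == h2);

    if (h2 != NULL)
        dlclose(h2);                      /* library is still mapped */
    if (h1 != NULL)
        dlclose(h1);
    return same;
}
```

This is why pre-loading the 36 libraries as dependencies helps: the later dlopen calls degenerate into cheap reference-count updates instead of real loads.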
|
8357.18 | The end of the headache ?? | TAEC::SABIANI | | Tue Feb 04 1997 05:15 | 15 |
| Thank you for your quick answer.
We have finally found something interesting...
When we set:
_RLD_LIST=/usr/shlib/libc.so:DEFAULT
it makes our program just fly (it takes about 2 s to complete instead of
1 minute!!).
Apparently -lc should be loaded before -lc_r!
Any comment is welcome,
Stephane.
|
8357.19 | You don't need correctness, right? ;-) | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Feb 04 1997 10:17 | 18 |
| .18> Apparently -lc should be loaded before -lc_r !
Only if you want unreliable results!!
If your image is linked against libc_r, it means that your code uses or is
prepared to be used by threads. If you load libc first, it will short
circuit (i.e., preempt) the thread-safe functions provided by libc_r,
replacing them with ones which are not thread-safe.
It's a well-known fact that parallel programs run much faster if you remove
all that nasty synchronization. However, they also have a tendency to
produce unreliable results (assuming that they don't just crash outright) if
you remove the synchronization.
So, which do you want: correct, reliable operation or whizzy performance? :-)
Webb
|
8357.20 | Ok, the Champagne bottle is back in the fridge... ;-(( | TAEC::SABIANI | | Tue Feb 04 1997 13:03 | 75 |
| Hi,
Why did I check if there was a new reply to this !@#$%^&* note !!! ;-))
>>You don't need correctness, right?
I have some problems with your definition of 'correctness' (see below).
ok, here is the nasty parallel program :
----------------------------------------------------------------------
#include <stdio.h>
#include <dlfcn.h>

void wuEnumDLLs(char *argv0)
{
    /* first is "MAIN", so skip it */
    char *cp = _rld_first_pathname();

    while ((cp = _rld_next_pathname()) != NULL)
    {
        printf("Opening : %s \n", cp);
        (void) dlopen(cp, RTLD_LAZY);
    }
}

int main(int argc, char *argv[])
{
    printf("Starting\n");
    wuEnumDLLs(argv[0]);
    return 0;
}
----------------------------------------------------------------------
I have linked it with a number of random libs :
----------------------------------------------------------------------
$ odump -Dl wutest4
***LIBRARY LIST SECTION***
Name Time-Stamp CheckSum Flags Version
wutest4:
libpthreads.so Jul 25 03:25:47 1995 0xdc0717c3 0 osf.1
libc_r.so Jul 25 03:22:17 1995 0x78d1b6fe 0 osf.1
libX11.so Jul 25 05:36:43 1995 0x4bd08197 0
libXm.so Jul 25 06:37:59 1995 0x41912f7a 0 motif1.2
libDXm.so Jul 25 07:35:09 1995 0x37e3646a 0 motif1.2
librt.so Jul 25 03:34:14 1995 0xb00f11fd 0 osf.1
libXt.so Jul 25 05:49:16 1995 0x5d677bf7 0
libm.so Jul 25 02:57:33 1995 0x1cac520b 0 osf.1
libsys5.so Jul 25 03:02:00 1995 0x68a59465 0 osf.1
libtli.so Jul 25 03:33:03 1995 0x5d1cde50 0 osf.1
libdna.so Sep 27 21:20:58 1995 0xdc09e256 0
libcxx.so Nov 30 15:02:03 1995 0x1ee4c01f 0
libcurses.so Jul 25 03:32:03 1995 0xfbc5d81d 0 osf.1
libcomplex.so Nov 30 15:02:05 1995 0xb0d4c339 0
libcmalib.so Jul 25 03:32:24 1995 0x0caef850 0 osf.1
libcda.so Jul 25 03:00:08 1995 0xc5ef4fae 0 osf.1
libcdrom.so Jul 25 03:34:22 1995 0x2f360a2d 0 osf.1
libbkr.so Jul 25 07:28:55 1995 0x6a5f910a 0 motif1.2
libaud.so Jul 25 03:02:09 1995 0x47a87493 0 osf.1
libdvr.so Jul 25 07:50:18 1995 0x65f39f83 0 motif1.2
libdvs.so Jul 25 03:01:35 1995 0xebdb7aeb 0 osf.1
libftam.so Aug 22 20:45:47 1995 0x0e07ed60 0
libc.so Jul 25 02:57:21 1995 0xec02d574 0 osf.1
----------------------------------------------------------------------
The results are on my system :
- 2s with the hack (see .18)
- 35s without the hack
35 s can seem acceptable, but it is not (it is even worse when we link this
program with our own libraries: ~1 minute).
I still don't understand what happens exactly... is this a bug or what?
Stephane.
PS : I'm still using DUnix V3.2C.
|
8357.21 | | QUARRY::neth | Craig Neth | Tue Feb 04 1997 13:38 | 16 |
| What Webb is saying is that the threads libraries, libc_r and libc, *must*
be loaded in a certain order or threads just won't work right. If you're
not using threads, you shouldn't be linking against the threads libraries.
If you are, you need to make sure your link line (or your _RLD_LIST, if
you're going to do that) has the libraries in the correct order,
or you'll get bad results. In particular, libc_r.so MUST be loaded before
libc.so.
Your test program isn't threaded so it's not going to see any differences
in behavior.
FWIW, your test program runs in under 1 second on V4.0B no matter what order
the libraries are loaded in...
|
8357.22 | | TAEC::BALLADELLI | Surfing with the Alien | Wed Feb 05 1997 02:28 | 25 |
| >================================================================================
>Note 8357.21 Shared libraries and startup time 21 of 21
>QUARRY::neth "Craig Neth" 16 lines 4-FEB-1997 13:38
>--------------------------------------------------------------------------------
>
>What Webb is saying is that the threads libraries, libc_r and libc *must*
>be loaded in a certain order or threads just won't work right. If you're
>not using threads, you shouldn't be linking against the threads libraries.
>If you are, you need to make sure your link line (or your _RLD_LIST, if
>you're going to do that) needs to have the libraries in the correct order,
>or you'll get bad results. In particular, libc_r.so MUST be loaded before
>libc.so.
The fact that dlopen takes so long when libc_r is loaded first is still puzzling,
though, no matter whether we're using threads or not. Is this a bug?
>FWIW, your test program runs in under 1 second on V4.0B no matter what order
>the libraries are loaded in...
This is probably the real piece of good news, since our product will ship on DUnix
V4.0 *and* will use threads.
Regards,
Micky
|
8357.23 | | TAEC::SABIANI | | Wed Feb 05 1997 03:08 | 14 |
| >What Webb is saying is that the threads libraries, libc_r and libc
>*must* be loaded in a certain order or threads just won't work right.
Yes, maybe this was not clear enough in my previous reply but we do
not plan to use this hack with _RLD_LIST. BTW, this hack pinpoints
something that looks like a bug (that seems fixed on V4).
Can we expect a patch or a workaround for DUnix V3.2 (the same problem
exists on V3.2G) ? Does it help if we QAR it ?
Stephane.
|
8357.24 | | QUARRY::neth | Craig Neth | Wed Feb 05 1997 09:48 | 6 |
| > Can we expect a patch or a workaround for DUnix V3.2 (the same problem
> exists on V3.2G) ? Does it help if we QAR it ?
| If you file a QAR with a reproducer, we will look into it and see what can
be done for V3.2x, but didn't .22 just say you would be shipping on V4.0?
Fiddling around with V3.2x and making patches isn't free....
|
8357.25 | | TAEC::SABIANI | | Thu Feb 06 1997 04:52 | 28 |
| >but didn't .22 just say you would be shipping on V4.0?
I hope we will ship on V4 too but we won't ship tomorrow and
we will certainly have to live with this bug for a while (i.e. on
V3.2x).
Maybe we should stop beating around the bush. This is a problem we
have when we use a third-party (Bristol) product on Digital UNIX. We had
to answer the following questions:
1) Is this a bug in the third-party product ?
2) Is this a bug in Digital Unix ?
3) Is there a work-around ?
If I read between the lines, I get :
1) No
2) Yes
3) No (at least not yet but we are still interested)
Bristol is ready to help us but we do not have much information to give
them and the problem is on Digital's side (our side too).
We are continuing our investigations and will try to avoid the DUnix V3.2
patch. We simply thought that it would be simpler for DUnix
Engineering (since the bug was identified and fixed in V4 and since you
have the code) to tell us what solutions we have.
There is enough information in the previous notes to identify the problem.
|
8357.26 | | QUARRY::neth | Craig Neth | Thu Feb 06 1997 09:28 | 40 |
| I'm sorry, I'm not trying to antagonize you. My apologies if my
statements have come off that way.
>We simply thought that it would be simplier for DUnix
> Engineering (since the bug was identified and fixed in V4 and since you
> have the code) to tell us what solutions we have.
As far as I know there is no 'bug fix'. The loader was extensively
modified as part of V4.0 development and one of the benefits of these
modifications was an increase in dlopen() performance. We have not
spent any time evaluating what it would take to move those improvements
back into the V3.2 stream.
In general, patches have a high cost for engineering - there are several
V3.2x streams that would need to be verified, and plenty of paperwork to do.
We are more than willing to do it if there is a documented business impact
(or some sort of obvious disaster), but perhaps you can understand that we
don't undertake that process when the impact is less clear.
If this is a big issue for your product (and it sounds like it might be),
please take the time to open a formal problem report and we'll work with
you to resolve it to the best of our ability.
Craig
|
8357.27 | | SMURF::LOWELL | | Fri Feb 14 1997 10:44 | 12 |
| I'd like to back up Craig's answer, since I'm the one that made
the performance improvements to the V4.0 dlopen.
No. The slowdown you saw in the libc_r V3.2x dlopen was not a bug.
Dynamic symbol resolution for dlopen is simply time-consuming when
there are lots of shared library dependencies. The dlopen in libc_r
is the "multi-threaded" dlopen, which imposes immediate binding
mode. Prior to V4.0, immediate binding can be much slower than
lazy binding. In V4.0, dlopen becomes equally fast for both lazy and
immediate binding.
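The lazy vs. immediate distinction described here maps directly onto the standard dlfcn flags. A small sketch ("libm.so.6" is a stand-in library name; Digital UNIX names differ):

```c
#include <dlfcn.h>
#include <stddef.h>

/* RTLD_NOW resolves every undefined symbol before dlopen returns --
 * the mode the libc_r ("multi-threaded") dlopen imposes -- while
 * RTLD_LAZY defers resolution until each function is first called.
 * With many dependencies, that up-front resolution work is what made
 * the pre-V4.0 immediate-binding opens slow. */
static void *open_immediate(const char *path)
{
    return dlopen(path, RTLD_NOW);
}

static void *open_deferred(const char *path)
{
    return dlopen(path, RTLD_LAZY);
}
```

Both calls return the same kind of handle; the difference is only when the symbol-resolution cost is paid.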
Randy
|