
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1175.0 "Help needed with formula for 'MEAN'" by EXIT26::ZIKA () Thu Jan 04 1990 10:31

    I must determine the "mean" of 324 file sizes and data access times.
    
    The average file size is 3084 blocks, and the average access times are
    80 seconds using method one and 71 seconds using method two.
    
    The median file size is 1210 blocks, with access times of 38 seconds
    using method one and 29 seconds using method two.
    
    The sum of the squares of the block sizes for all 324 is 1.31939E10
    The sum of the squares of method 1 access times for all 324 is 18,090,568
    The sum of the squares of method 2 access times for all 324 is 16,988,112
    
    Not having a probability reference book available, I'll venture a guess
    that the formula to determine a "Mean" is
    
    | [(A**2) + (B**2) + (C**2) ...] |
    | ------------------------------ | ** .5
    |             n                  |
    
    or written otherwise square_root_of((sum_of_squares_of_all_numbers)/n)
    
    Could someone please either produce the "means" or verify the above
    formula?  -- Thanks in advance --- Chris
T.R  Title  User  Personal Name  Date  Lines
1175.1  BEING::POSTPISCHIL "Always mount a scratch monkey."  Thu Jan 04 1990 10:34  (7 lines)
    Re .0:
    
    The mean of a set of numbers is the average of the numbers -- the total
    of the numbers divided by the number of numbers.
    
    
    				-- edp 
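For instance, here is how that arithmetic mean compares with the root-mean-square formula guessed in .0. The block sizes below are invented for illustration, chosen so the arithmetic mean comes out to 3084:

```python
# Contrast the arithmetic mean (.1) with the sum-of-squares
# formula proposed in .0, which is actually the root mean square.
sizes = [100, 250, 1210, 3000, 10860]   # invented file sizes in blocks

mean = sum(sizes) / len(sizes)                          # total / count
rms = (sum(x * x for x in sizes) / len(sizes)) ** 0.5   # formula from .0

print(mean)   # 3084.0
print(rms)    # larger than the mean for any non-constant data
```

The two agree only when every value is identical; for skewed data like these, the root mean square is pulled well above the arithmetic mean by the large values.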
1175.2 "more means"  PULSAR::WALLY "Wally Neilsen-Steinhardt"  Thu Jan 04 1990 12:15  (6 lines)
    .1 gives the definition of the arithmetic mean, which is almost always
    meant when the word mean is unqualified.  Note that this is the same as
    the average, in the usual use of both terms.
    
    There is also a geometric mean, the nth root of the product of n
    numbers.  There is also a harmonic mean, whose definition I forgot.
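All three classical means can be computed with Python's standard library; the sample values here are made up just to show the relationship between them:

```python
import math

xs = [2.0, 4.0, 8.0]   # invented sample

arithmetic = sum(xs) / len(xs)               # (2 + 4 + 8) / 3
geometric = math.prod(xs) ** (1 / len(xs))   # nth root of the product
harmonic = len(xs) / sum(1 / x for x in xs)  # n over the sum of reciprocals

# For positive data: arithmetic >= geometric >= harmonic.
print(arithmetic, geometric, harmonic)
```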
1175.3 "Don't be average"  VMSDEV::HALLYB "The Smart Money was on Goliath"  Thu Jan 04 1990 13:11  (19 lines)
    The harmonic mean is the sample size divided by the sum of the
    reciprocals of the data points.  But that's not important here.
    
    Note that the mean size is 3084 and the median is 1210.  This is
    typical of a skewed distribution, with lots of small files and a
    few very large files.  If you were to plot a histogram you would
    get something like:				       .
    						     ..
    						 ....
    					.........
    			  ..............
    ......................
    
    Where the Y-axis is the size and the X-axis is one point per file,
    sorted by increasing file size.  The average is probably the wrong
    number to use; I would prefer the median.  Or, perhaps the 90th
    percentile if you're looking at upper bounds.
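    Pulling the median and a nearest-rank 90th percentile out of a skewed
    sample might look like this; the sizes below are invented, many small
    and a few huge, to mimic the distribution sketched above:

```python
import math
import statistics

# Invented, heavily skewed "file sizes": many small, a few huge.
sizes = sorted([30, 40, 55, 80, 120, 200, 350, 900, 2500, 12000])

med = statistics.median(sizes)
# Nearest-rank 90th percentile: the smallest value with at least
# 90% of the data at or below it.
p90 = sizes[math.ceil(0.9 * len(sizes)) - 1]

print(med)  # 160.0
print(p90)  # 2500
```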
    
      John
1175.4 "didn't think it could get this complicated?"  PULSAR::WALLY "Wally Neilsen-Steinhardt"  Fri Jan 05 1990 15:04  (39 lines)
    re:                        <<< Note 1175.0 by EXIT26::ZIKA >>>

>    I must determine the "mean" of 324 file sizes and data access times.
    
    Based on the comments in .3, you probably need to think more about 
    *why* you must determine this.
    
    Assuming this is a typical business problem, you are evaluating two
    file access methods, and want to know which is "better".  Depending on
    circumstances, this could be a very difficult question.
    
>    The average file size is 3084 blocks, and the average access times are
>    80 seconds using method one and 71 seconds using method two.
    
    [ just a nit: it can't really be taking you 80 seconds to access a
    file, can it?  Or is this over a wide area net? ]
    
>    
>    The median file size is 1210 blocks, with access times of 38 seconds
>    using method one and 29 seconds using method two.
    
    This makes method two look better by both measures.  Just for fun, you
    could throw in the third common measure, the mode (defined as the most
    common value).  Determine the relative performance of the two methods
    for the most common file size.
    
    Then if you have not been statted out, you could divide all files into
    quartiles (the smallest one-fourth, the next smallest, the next and the
    largest one fourth).  If method two is better for all quartiles, then
    you can be pretty sure that it is the better method in general.
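    The quartile comparison might be sketched like this; the records are
    invented (and arranged so method two wins in every quartile, the easy
    case described above):

```python
# (file size in blocks, method-1 seconds, method-2 seconds) -- invented
records = [
    (120, 5, 4), (300, 9, 7), (800, 20, 15), (1210, 38, 29),
    (2000, 55, 48), (3500, 90, 80), (6000, 140, 120), (12000, 300, 260),
]
records.sort(key=lambda r: r[0])   # order by file size

q = len(records) // 4              # files per quartile
for i in range(4):
    chunk = records[i * q:(i + 1) * q]
    m1 = sum(r[1] for r in chunk) / len(chunk)
    m2 = sum(r[2] for r in chunk) / len(chunk)
    print(f"quartile {i + 1}: method one {m1:.1f}s, method two {m2:.1f}s")
```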
    
    Trouble comes if one method is better for the most common files, but
    another is better for uncommon files.  It gets really nasty if the
    uncommon files are the ones you care most about.  Then you have to back
    off and ask what does "better" mean in the global sense.  For example,
    you may want to maximize the throughput of the system.  Or you may want
    to minimize the upper bound of the response time.  Your concept of
    goodness then determines which kind of average or other statistic you 
    should use.