T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
512.1 | No problem that I can see | SWAMI::LAMIA | Free radicals of the world, UNIONIZE! | Fri Jul 17 1987 13:42 | 19 |
| > access it later.) The data must be "sorted" in order of date/time
> stamps, but the records can arrive out of time order (i.e. a 12:00
> record can arrive before an 11:00 record). But, later viewing (both
Hmm, this isn't very clear, but I assume that you understand what
you mean and you know how to collect and distinguish the dates.
> data collection without interruption. The quantity of data isn't
> all that big, and it's not an extremely time-critical situation.
> (For instance, they may collect 200 bytes per minute and want
> to retain the most recent 2-hours' worth of activity.)
Let's see... 200 bytes/min * 60 min/hr * 12 working hr/day * 7 day/wk * 2 wk
= 2,016,000 bytes = 3938 blocks every 2 weeks.
I don't think this is big enough to worry about using CONVERT to
reclaim deleted space any more than once every couple of weeks, or even
once a month! Just make sure you tune the RMS file carefully for good
insertion performance of records in roughly sorted key order.
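For what it's worth, here is that arithmetic as a trivial C program (the 512-byte
disk block size is standard; the 12 working hours/day and the 2-week CONVERT
interval are just the assumptions above):

    /* sizing_sketch.c - rough growth estimate between CONVERT runs,
     * assuming 200 bytes/minute, 12 working hours/day, 7 days/week,
     * and a CONVERT every 2 weeks.                                   */
    #include <stdio.h>

    int main(void)
    {
        const long bytes_per_min = 200;
        const long min_per_day   = 60 * 12;            /* 12 working hours/day */
        const long days          = 7 * 2;              /* two weeks            */
        const long bytes         = bytes_per_min * min_per_day * days;
        const long blocks        = (bytes + 511) / 512; /* 512-byte disk blocks */

        printf("%ld bytes = %ld blocks every 2 weeks\n", bytes, blocks);
        return 0;
    }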
|
512.2 | Go ahead: Use indexed files with delete AND Convert/reclaim | CASEE::VANDENHEUVEL | Formerly known as BISTRO::HEIN | Fri Jul 17 1987 13:46 | 1 |
|
|
512.3 | how about this way... | DYO780::DYSERT | Barry Dysert | Fri Jul 17 1987 14:09 | 10 |
| I see that I shouldn't have even provided what I think is the obvious
solution because no other ideas have yet been presented. Let me
try this one: how about using the date/time stamp as an alternate
key, using anything else as the primary and doing REWRITEs, modifying
the alternate key. This would prevent the file from growing, no?
What I don't know is if this would cause performance problems (bucket
splits or something) and eventually require a CONVERT anyway.
Are there any other ideas, or at least some discussion on this second
method versus the one presented in .0? Thank you!
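To make that concrete, here is roughly the record layout and rewrite logic I have
in mind (plain C structures only, not actual RMS calls; the field names and slot
count are made up for illustration):

    /* Sketch of the fixed-pool / REWRITE idea: the primary key is a slot
     * number that never changes, and the date/time stamp is an alternate
     * key that gets modified on each REWRITE, so the file never grows.   */
    #include <time.h>

    #define NSLOTS 120            /* e.g. 2 hours at one record per minute */

    struct sample_rec {
        unsigned short slot;      /* primary key: 0 .. NSLOTS-1, fixed     */
        time_t         stamp;     /* alternate key: date/time of sample    */
        char           data[200]; /* the collected data                    */
    };

    /* Pick which existing record to REWRITE for a new sample: always the
     * slot holding the oldest timestamp.  In the real file you would GET
     * the record with the lowest timestamp via the alternate key rather
     * than scanning; this scan only illustrates the choice being made.   */
    static unsigned short oldest_slot(const struct sample_rec pool[])
    {
        unsigned short best = 0;
        for (unsigned short i = 1; i < NSLOTS; i++)
            if (pool[i].stamp < pool[best].stamp)
                best = i;
        return best;
    }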
|
512.4 | Alternate key does not `feel' right. | CASEE::VANDENHEUVEL | Formerly known as BISTRO::HEIN | Sat Jul 18 1987 05:22 | 35 |
| Using an alternate key will cause a lot more I/Os: for every record
updated, not only will the primary bucket be updated, but the
old AND new SIDR (secondary index data record) buckets will also be
read and updated. When retrieving records by an alternate key you
almost guarantee an I/O per record, unless the alternate key order
largely follows the primary key order (which might be the case here)
or you cache the whole data level of the file in global buffers.
Go indexed. Once you have a good solution you need no other. Right?
Nevertheless, given the relatively small and limited amount of data,
there are probably several alternatives. One idea that might prove
interesting is to make use of the fact that the records will probably
be arriving largely in key order. That opens the opportunity to
handle out-of-sequence records through an exception procedure such
as a forward/backward pointer. Thus you could use a relative file
(or even a fixed-length record sequential file) as follows:
Record 0 -> record number of record with lowest timestamp in file &
record number of record with highest timestamp in file
Record i is logically followed by record i+1 UNLESS diverted by the
presence of a key value in the pointer field.
You might be able to use sequential puts to the relative file to have
RMS handle the free-slot handling... until EOF. At EOF you must
wrap around to a low key value.
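A minimal sketch of what that layout might look like, with plain C structures
standing in for the relative file's numbered records (the names and sizes are
invented, and this shows only the logic, not actual RMS calls):

    /* Circular log in a relative file, per the scheme above: record 0
     * holds the lowest/highest pointers, records 1..NRECS are data
     * slots, and each slot carries an optional "diverted" link for the
     * occasional out-of-sequence record.                               */
    #define NRECS 1440                     /* data records 1..NRECS     */

    struct header_rec {                    /* relative record 0         */
        unsigned long lowest;              /* rec# of oldest timestamp  */
        unsigned long highest;             /* rec# of newest timestamp  */
    };

    struct data_rec {                      /* relative records 1..NRECS */
        long          stamp;               /* date/time of the sample   */
        unsigned long next;                /* 0 => record i+1 follows;
                                              nonzero => diverted to
                                              this record number        */
        char          data[200];
    };

    /* Advance to the logically next record, wrapping at EOF. */
    static unsigned long next_rec(const struct data_rec *cur,
                                  unsigned long curno)
    {
        if (cur->next != 0)                /* out-of-sequence divert    */
            return cur->next;
        return (curno % NRECS) + 1;        /* wrap NRECS -> 1           */
    }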
With a sequential file RMS cannot tell you whether a record
exists or is deleted, so you might consider a record bitmap
to find free space.
Hein.
|
512.5 | Piece of cake! | ALBANY::KOZAKIEWICZ | You can call me Al... | Sun Jul 19 1987 11:56 | 31 |
| An application I wrote a number of years ago sounds like what you are trying
to do. It collected data from a process control system on a time domain
basis and stored the data in several files. The data was used for process
optimization and we wanted to purge "old" data on an automatic basis. This
resulted in two classes of files - high resolution data which was to be
retained for 7 days and lower resolution data which was kept for 6 months.
The solution was to size and populate the files with null records to their
eventual capacity up front. A hashing algorithm was applied to the date and
time in such a manner as to "wrap around" upon itself after 7 days or 6
months. The result of this was used as the primary key. The null records
inserted into the file had all the possible combinations of this key
represented. For example, on the 7 day file, we collected data every 15
minutes. The hashed key became the day-of-week and the hour and minute of day.
3:30 PM Wednesday would yield 41530, for example. The rest of the record
consisted of the "real" date, time, and process data. The application which
stored the data would fetch the record with the appropriate primary key,
modify all the other fields, and rewrite the record. Using DTR or whatever to
analyze the data in the file was straightforward because the date and time (the
primary way of accessing the data from a user's standpoint) were represented in
a normal fashion in alternate keys.
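To make the hashing concrete, the key computation amounted to something like
this (assuming Sunday = 1 through Saturday = 7, so Wednesday = 4; the function
name is made up):

    /* Hash a date/time into the wrap-around primary key described above:
     * day-of-week, hour, and minute packed as Dhhmm.
     * Wednesday 3:30 PM -> 4*10000 + 15*100 + 30 = 41530.               */
    #include <time.h>

    long hashed_key(time_t when)
    {
        struct tm t = *localtime(&when);
        int dow = t.tm_wday + 1;           /* tm_wday: 0=Sun .. 6=Sat    */
        return (long)dow * 10000 + t.tm_hour * 100 + t.tm_min;
    }

Because the key only encodes day-of-week, hour, and minute, it repeats itself
every seven days, which is what lets the file wrap around without growing.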
The original version of this system was done with RMS-11 prologue 1 files, so
I didn't have the luxury of on-line garbage collection. By populating the
file in advance, and never changing the primary key, I was able to realize
the goal of a stable file which didn't require occasional cleanup. I have
used this same technique elsewhere, always based on the date/time. I can speak
from experience when I point out that any period that doesn't roughly
correspond to some interval on a calendar (week, month, year) is a real bitch
to implement because of the hashing algorithm (try to do 10 days, for
instance!).
|
512.6 | Beware of pre-loading `empty' records with compression. | CASEE::VANDENHEUVEL | Formerly known as BISTRO::HEIN | Mon Jul 20 1987 05:29 | 11 |
| Re .5
Beware of data compression when trying to preload records
into an indexed file, intending to update them later with
no change to the structure:
The `empty' records that are typically used are in fact long
strings of a single character (space or zero, probably).
Such records will be compressed to repeat counts only, and
subsequent updates are guaranteed to increase the size of
the record in the buckets, thus potentially causing splits!
|
512.7 | | ALBANY::KOZAKIEWICZ | You can call me Al... | Mon Jul 20 1987 09:38 | 7 |
| re: -1
Yes, bucket splits will occur until all the records have "real" data in them.
Actually, since the original application was written under RSX, this wasn't
a problem (no compression). When transferred to VMS, data compression was
disabled.
|
512.8 | thanks to all | DYO780::DYSERT | Barry Dysert | Mon Jul 20 1987 11:05 | 5 |
| I really like your suggestion, Al (.5). Although I haven't yet
coded a test program, I presume that you won't incur any eventual
bucket splits or continual file growth. I'll discuss the various
ideas presented by everyone and let the customer decide what he
thinks is best. Thanks for everyone's input!
|
512.9 | TRIED GLOBAL SECTIONS ? | TROPPO::RICKARD | Doug Rickard - waterfall minder. | Sun Aug 02 1987 22:49 | 15 |
|
I had a similar problem one time but after several tries I finally
gave up on ISAM files. Instead I mapped a global section file which
was big enough to hold the window of data and used it as a circular
buffer. Because of the simultaneous access capabilities, other
processes could be accessing the same data at the same time as the
data acquisition program was putting it in. Every entry was time
stamped, and I wrote my own code to work through the window and put
sliding averages, etc. into external ISAM files. Worked a treat, and I
can highly recommend the particular approach. Otherwise, the hashed
approach mentioned earlier is a real neat way to go.
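If it helps, the mapped window boils down to something like this (shown with
POSIX mmap as a portable stand-in for the VMS global-section mapping; the names
and sizes are invented):

    /* Circular buffer laid over a mapped section file: the writer bumps
     * 'head' after filling a slot, readers scan the window between 'tail'
     * and 'head'.  mmap is used here only as a stand-in for mapping a
     * VMS global section.                                               */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define NSLOTS 120                       /* size of the data window  */

    struct slot { time_t stamp; char data[200]; };

    struct section {
        volatile unsigned long head;         /* next slot to write       */
        volatile unsigned long tail;         /* oldest slot still valid  */
        struct slot ring[NSLOTS];
    };

    struct section *map_section(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;
        void *p = mmap(NULL, sizeof(struct section),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : (struct section *)p;
    }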
Doug Rickard.
|