Date: Sat, 29 Aug 1998 21:47:16 +0200
From: John Walker
To: rdnelson
CC: Greg Nelson
Subject: Re: Proposed Revisions to Egg and Basket Software
Here are replies to questions and issues raised in Greg
and Roger's replies to the original egg changes design
specifications.
Replies to Greg's comments:
>> We were originally thinking of sorting the data into one file per egg
>> per month. The original thinking was that this would keep the files
>> down to a modest size. Your proposal should do this as well. My
>> concerns are:
>>
>> 1) Since each of the eggs might (in principle) communicate with a
>> secondary basket if the primary basket does not respond, it is
>> unclear where the data might end up, or whether multiple copies
>> might be generated.
>>
>> 2) If the desire was to analyze different eggs independently (for
>> example, all Orion vs. all PEAR eggs, or eggs by continent) it
>> would be necessary to search all the files for the appropriate
>> data.
The advantage I saw in one all-eggs file per day is that
the day is the most probable unit of analysis and reporting,
so a program which wishes to examine the data for a given day
can do so efficiently if all the data for a day appear in
a file bearing its name. Otherwise, the analysis program would
have to scan multiple files, one per egg, and extract the
data for the selected day.
Basically, I've been proceeding on the assumption that it
doesn't really matter how the basket stores its data (as
long as nothing is lost and it's unambiguous) because
all the analysis I'm anticipating doing will work not on the
raw data but on the merged, time-aligned extracts produced
by the eggtable.c program or some equivalent. Eggtable is
able to digest data from one or more files in any order
or organisation whatsoever, discard duplicate samples (as may
be collected when multiple baskets interrogate eggs), mark
missing data, and emit complete data for all eggs, second by
second, for a specified time interval. I would envision
these composite summaries, produced daily (already--they're in
the noosphere:/home/httpd/html/data/eggsummary directory
as basketdata-yyyy-mm-dd.csv.gz files) as being the primary
grist for analysis programs. No information is lost in transforming
the binary data collected by the basket into this tabulated,
time-aligned CSV form--it's just more tractable since you don't
have to worry about when the basket happened to collect data
from this or that egg. You might think of CSV as a terribly
inefficient ASCII file format, which it is, but gzip compresses
these files to a size comparable to the raw binary basket data
files. Analysis programs can "pipe them on board" with "zcat",
so the full CSV file need never exist on disc.
>> If the data duplication could be solved easily, this would have the
>> advantage that mirroring the data would be as simple as transferring
>> the complete files. I was initially expecting each of the baskets to
>> retrieve the data from other eggs through the regular data transfer
>> protocol.
As long as you use eggtable as a front-end, all you need to
do is feed it the basket data files from all available
baskets and it will automatically throw out duplicates
received by more than one basket and make a composite of
data collected by all baskets. (The existing version of
eggtable will produce many warnings about duplicate data
if you do this; I can add an option to suppress these
warnings if and when we routinely start to do this.)
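
The duplicate handling described above can be sketched roughly as
follows. This is only an illustration, not the eggtable.c code: the
sample structure, merge_samples(), and the MAX_EGGS table width are
hypothetical, and the sketch assumes a genuine sample never takes the
value 255 reserved for missing data.

```c
#include <string.h>

#define MAX_EGGS 4     /* hypothetical table width */
#define MISSING  255   /* missing-data marker used in the CSV summaries */

/* One sample as it might be extracted from a basket file (assumed layout). */
struct sample {
    long t;                /* sample time, in seconds */
    int egg;               /* column index of the egg, 0..MAX_EGGS-1 */
    unsigned char value;   /* trial sum for that second */
};

/* Fill table[second][egg] for the interval [t0, t0 + nsec) from samples
   given in any order.  Cells with no sample stay MISSING.  A sample for
   a cell that is already filled (for example, collected by a second
   basket) is discarded.  Returns the number of duplicates discarded. */
int merge_samples(const struct sample *in, int n, long t0, int nsec,
                  unsigned char table[][MAX_EGGS])
{
    int dups = 0;

    memset(table, MISSING, (size_t)nsec * MAX_EGGS);
    for (int i = 0; i < n; i++) {
        long row = in[i].t - t0;

        if (row < 0 || row >= nsec || in[i].egg < 0 || in[i].egg >= MAX_EGGS)
            continue;                  /* outside the requested interval */
        if (table[row][in[i].egg] != MISSING) {
            dups++;                    /* already supplied by another file */
            continue;
        }
        table[row][in[i].egg] = in[i].value;
    }
    return dups;
}
```

Because every cell starts out as missing and is filled at most once, the
order in which the basket files are presented does not affect the result.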
>> Second, I would lean toward adding extra '%' phrases rather than
>> having the interpretation depend on which program was running. We
>> could use %e for egg name, %E for egg number, and %k for basket name.
>> Do we want to have those parse numeric terms as well, e.g. "%06E" for
>> your sample egg file name?
Yes, this is cleaner. I decided to re-use %E because
strftime() already defines most of the other obvious letters,
for example:
    %e    day of month (1-31; single digits are preceded by a blank)
How about defining distinct phrases for each item we might
want to insert into the name, but introducing them with a
different character, for example:
    $e    Egg name
    $08E  Egg number, right justified, zero filled in 8 characters
    $h    Hostid of basket machine
    $k    Basket name
etc? No sane Unix programmer would want a dollar sign in
a file name, anyway.
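
As a minimal sketch of how such dollar-sign phrases might be expanded
(before or after the name is handed to strftime() for the date
phrases): the function expand_name() and its argument list are
hypothetical, and the rule that an unrecognised phrase is copied
through unchanged is an assumption, not part of the proposal.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Expand $e, $E, $h and $k phrases in tmpl into out (hypothetical helper).
   An optional decimal width may follow the '$'; a leading 0 requests zero
   fill for $E.  Pieces longer than 63 characters are truncated in this
   sketch.  Returns 0 on success, -1 if out is too small. */
int expand_name(char *out, size_t outsz, const char *tmpl,
                const char *egg_name, unsigned egg_num,
                const char *hostid, const char *basket_name)
{
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    while (*tmpl != '\0') {
        char piece[64];
        const char *p = tmpl + 1;
        int zero = 0, width = 0, known = 0;
        size_t n;

        if (*tmpl == '$') {
            zero = (*p == '0');
            while (isdigit((unsigned char)*p)) {
                width = width * 10 + (*p++ - '0');
                if (width > 32)
                    width = 32;        /* keep the field inside piece[] */
            }
            known = 1;
            switch (*p) {
            case 'e': snprintf(piece, sizeof piece, "%*s", width, egg_name); break;
            case 'E': snprintf(piece, sizeof piece, zero ? "%0*u" : "%*u",
                               width, egg_num); break;
            case 'h': snprintf(piece, sizeof piece, "%*s", width, hostid); break;
            case 'k': snprintf(piece, sizeof piece, "%*s", width, basket_name); break;
            default:  known = 0; break;   /* not one of our phrases */
            }
        }
        if (known) {
            tmpl = p + 1;              /* skip the whole phrase */
        } else {
            piece[0] = *tmpl++;        /* copy one literal character */
            piece[1] = '\0';
        }
        n = strlen(piece);
        if (used + n >= outsz)
            return -1;
        memcpy(out + used, piece, n + 1);
        used += n;
    }
    return 0;
}
```

With this sketch, the template "egg-$08E-$k.dat" with egg number 1021
and basket name "noosphere" expands to "egg-00001021-noosphere.dat".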
>> > .eggrc file. If eggsample receives a SIGHUP signal, it
>> > re-reads the .eggprotocolrc file and, if the protocol
>> > has changed, begins to use the new protocol starting
>> > with the next packet. Eggsample logs its process ID
>> > in the file eggsample.pid.
>>
>> This sounds good... I imagine a slightly different mechanism would be
>> necessary under Windoze, but I don't feel like worrying about what
>> that might be at the present time.
Under Windows, the re-reading of .eggprotocolrc would work
exactly the same, but instead of sending a signal eggnet
would queue a message to eggsample, which it would process
in the slack time after collecting the next sample. Sending
a message from one application to another is a straightforward
(albeit characteristically Microsoft-ugly) matter.
>> Eggnet will still need to initiate the communications with the
>> baskets, as is happening now.
Absolutely. I've added a mention of this to the design
spec. Eggnet will send 0x0303 AWAKE_PACKETs precisely as
eggsh does now.
>> > If the next request from the basket consists of the next
>> > packet, in time, after the last requested, eggnet simply
>> > reads the next packet from the open file.
>>
>> Will this work if the last file read resulted in feof() true, and new
>> data has been appended since then? This is a very likely scenario.
>> I'm just not sure of the filesystem semantics here...
Yes, this will work fine. The Unix utility "tail -f" counts
on these semantics obtaining. The other implicit assumption
is that unbuffered file writes are atomic, and except for
huge (multi-megabyte) I/O on "enhanced" file systems (such
as RAID, SGI XFS, etc.), this is also the case. Actually,
non-atomic I/O will cause no problems for this code as
long as it treats an incomplete packet read as equivalent to
end of file and resets the read pointer to the start of
the packet; that is my intent in writing the code.
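
A minimal sketch of that approach, assuming for simplicity that
packets have a fixed, known length (the function name
read_next_packet() is hypothetical). The clearerr() calls are there
because stdio's end-of-file indicator is sticky on some C libraries,
so it must be cleared before trying again after data has been
appended.

```c
#include <stdio.h>

/* Try to read one packet of len bytes from the current position of fp.
   Returns 1 if a complete packet was read, 0 if the file does not yet
   hold a complete packet (treated the same as end of file), and -1 on
   a positioning error.  On a 0 return the read pointer is back at the
   start of the packet, so the call can simply be repeated later. */
int read_next_packet(FILE *fp, void *buf, size_t len)
{
    long start = ftell(fp);
    size_t got;

    if (start < 0)
        return -1;
    clearerr(fp);                 /* forget an end of file seen earlier */
    got = fread(buf, 1, len, fp);
    if (got == len)
        return 1;
    clearerr(fp);                 /* short read: end of file or partial packet */
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    return 0;
}
```

A caller that gets 0 back has lost nothing; it reports "no more data"
to the basket and retries on the next request.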
>> > I propose eggsample adopt this synchronisation technique. As
>> > in the synchromesh program, a status display will be available
>> > in debug builds to monitor the accuracy of synchronisation.
>>
>> That sounds like the right mechanism overall. But... rather than just
>> putting the accuracy info in the debug builds, should we make it
>> available in some way (yet another file?) so that the regular screen
>> display can show the information?
An excellent idea. I'll make it log synchronisation statistics
(for example, last, one- and ten-minute mean and extreme) in a
file and make the user interface in eggnet report this.
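
One way such statistics might be kept is a small ring buffer of recent
synchronisation errors, sketched below. The structure and function
names are hypothetical, and the sketch assumes one error measurement
per second, so that 60 and 600 entries correspond to one and ten
minutes.

```c
#define HISTORY 600    /* ten minutes at an assumed one measurement/second */

struct sync_stats {
    double err[HISTORY];   /* most recent errors, as a ring buffer */
    int count;             /* number of valid entries, up to HISTORY */
    int next;              /* slot the next measurement will occupy */
};

/* Record one synchronisation error measurement. */
void sync_record(struct sync_stats *s, double err)
{
    s->err[s->next] = err;
    s->next = (s->next + 1) % HISTORY;
    if (s->count < HISTORY)
        s->count++;
}

/* Compute the mean and the largest-magnitude error over the most recent
   n measurements (fewer if not that many are stored).  Returns the number
   of measurements actually used. */
int sync_window(const struct sync_stats *s, int n, double *mean, double *extreme)
{
    double sum = 0.0, ext = 0.0;

    if (n > s->count)
        n = s->count;
    for (int i = 1; i <= n; i++) {
        double e = s->err[(s->next - i + HISTORY) % HISTORY];

        sum += e;
        if ((e < 0 ? -e : e) > (ext < 0 ? -ext : ext))
            ext = e;               /* keep the sign of the extreme */
    }
    *mean = n > 0 ? sum / n : 0.0;
    *extreme = ext;
    return n;
}
```

Calling sync_window() with n of 1, 60 and 600 would give the last,
one-minute and ten-minute figures to write to the status file.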
>> > When a new packet is initialised, the 10 second time
>> > intervals within the minute are filled in and all data
>> > fields in each record are set to 255, representing missing
>> > data...
>>
>> Can we use 254 as the marker instead, since we're using 255 as a flag
>> to be able to scan for packet checksums more easily? We don't expect
>> to ever go *above* 200 bits/trial, although there has been some
>> consideration of going down. (This was the reason we felt it was safe
>> to use an 8 bit trialsz field... not recognizing the possible
>> alignment problems).
No problem--254 it is. I'm inclined to continue to use 255
as the missing data marker in the CSV summary files, and simply
fix eggtable.c to translate a sample of 254 in the binary file
into a 255 missing data marker in the CSV.
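
In outline, and with hypothetical names, the two halves of that
arrangement might look like this: eggsample fills a new packet's data
fields with 254, and eggtable maps 254 to 255 on the way out to CSV.

```c
#include <string.h>

#define BIN_MISSING 254   /* missing-data marker in the binary packets */
#define CSV_MISSING 255   /* missing-data marker in the CSV summaries */

/* Mark every data field of a newly initialised packet as missing. */
void init_samples(unsigned char *data, size_t n)
{
    memset(data, BIN_MISSING, n);
}

/* Translate a sample from the binary file into the value written to CSV. */
unsigned char csv_value(unsigned char bin)
{
    return bin == BIN_MISSING ? CSV_MISSING : bin;
}
```

Since trial sums are not expected to exceed 200, neither marker can
collide with real data.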
Replies to Roger's comments:
>> If a computer running eggnet goes down, will the information it is
>> keeping in the in-memory table of times and requests be preserved?
No, but the sole function of that table is to reduce
the number of reads of the egg data file needed to find
the first packet requested by the basket. What this means,
then, is that if eggnet or the computer running it
crashes and is restarted, the first request from a
basket will cause a search for the requested packet
in the file, at which point the in-memory table will be
reconstructed. Subsequent requests will be serviced
without any extraneous reads. By comparison, *every*
basket to egg request currently reads through the egg data
file to find the packet. With the proposed change, this
will happen only once (per basket, if the egg is talking
to more than one), when eggnet is restarted, and since the
egg data files will be limited to one day's data, fewer
records will need to be scanned compared to the current
archiving of a month's data all in one file.
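
To make the idea concrete, here is a rough sketch of such a table. The
names, the fixed table size, and the choice to key entries on the
basket name are all assumptions made for illustration. After a
restart the table is empty, so the first lookup for each basket fails
and the caller falls back to scanning the file once.

```c
#include <stdio.h>
#include <string.h>

#define MAX_BASKETS 8    /* hypothetical limit on baskets per egg */

struct basket_pos {
    char basket[32];     /* name of the requesting basket */
    long last_time;      /* time stamp of the last packet requested */
    long next_offset;    /* file offset just past that packet */
};

static struct basket_pos positions[MAX_BASKETS];
static int npositions = 0;

static struct basket_pos *find_basket(const char *basket)
{
    for (int i = 0; i < npositions; i++)
        if (strcmp(positions[i].basket, basket) == 0)
            return &positions[i];
    return NULL;
}

/* If this basket's request follows on from its previous one, return the
   file offset to read from; otherwise return -1, meaning the caller must
   search the egg data file for the packet. */
long cached_offset(const char *basket, long requested_after)
{
    struct basket_pos *bp = find_basket(basket);

    if (bp != NULL && bp->last_time == requested_after)
        return bp->next_offset;
    return -1;
}

/* Remember where the packet just sent to this basket ended.
   Returns 0 on success, -1 if the table is full. */
int remember_position(const char *basket, long pkt_time, long next_offset)
{
    struct basket_pos *bp = find_basket(basket);

    if (bp == NULL) {
        if (npositions == MAX_BASKETS)
            return -1;
        bp = &positions[npositions++];
        snprintf(bp->basket, sizeof bp->basket, "%s", basket);
    }
    bp->last_time = pkt_time;
    bp->next_offset = next_offset;
    return 0;
}
```

Nothing in the table needs to survive a crash: losing it costs one
search per basket and no data.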
>> This makes me think of the possibility of a central monitoring facility
>> for outlying eggs -- an expanded egg.status report, so the user at a
>> basket could see individual eggs' behavior.
This is what I've referred to in the past as a
"basket dashboard". The easiest way to implement this is
as a CGI program which examines the egg.status files and
recent basket data files and produces a report indicating
the status and recent performance of eggs. Once we
get more than a handful of eggs on-line, this will be
essential for detecting problems as they emerge.
For example, the following URL is the master status
dashboard for www.fourmilab.ch (if you try it, please be patient,
it tests a number of things which take a while to respond).
http://www.fourmilab.ch/cgi-bin/Fourmilab-Status
This is a Perl script which creates the HTML report on the
fly as it tests the various resources on the site. A similar
Perl program, adapted in part from the egg daily status report
generator, should do the job. As I implement the changes to
eggsh, I'll think about what additional information the
basket might provide to make such a report informative.
* * *
After the next go-round or two, I'll circulate a second draft
of the design spec. As soon as we "sign off" on the design, I'll
start implementing the changes to the egg software.