Next: 9 Acceptance Suite Up: Part II: Subsystem Designs Previous: 7 Information Management

8 Environmental Monitoring and Logging

8.1 Overview

Our goal for the DEIMOS instrument is to learn from, and improve on, our experiences with error logging, alarm delivery, and condition handling for previous Lick-built instruments. Specifically, we wish to provide a logging and monitoring facility that will:

detect and respond to dangerous conditions, preventing damage to the instrument or physical injury to staff, and leave a reliable and accessible record of these conditions and responses
detect and respond to conditions threatening observing efficiency or success, and alert observers and OAs to these conditions reliably (and promptly), leaving an accessible record of these alerts
respond where possible in pre-emptive ways (preventing failure rather than merely warning of imminent failure)
provide a reliable and accessible log of every KTL command sent to the instrument by the user(s)
(at least during the first year of operation) routinely monitor and log a wide spectrum of instrument and telescope operating parameters, to facilitate post facto investigation of instrument faults and failures

Our efforts fall into three major categories:

Robustness (responses that prevent any interruption of observing, preserve data, avoid error)
Event logging (logging particular discrete events as they occur)
Parametric logging (logging operating parameters at regular intervals for later analysis)

8.1.1 Robustness

We have established a DEIMOS software design principle: all process which log or preserve information, from the image capture facility to the parametric monitoring software, should have a ``fallback chain" of strategies and locations for storing data. In other words, information should not be lost if, for example, database server performance degrades beyond acceptable limits, database server crashes, or disk space fills up.

Every process which logs directly to the database server must switch to a fallback mode if the database server does not respond within a suitable timeout period. The fallback mode (writing to a disk file) itself must have further fallback options (different disk partitions to use if the default partition fills).

Furthermore, no logging or monitoring process may, even in the event of complete failure, interrupt or impact the observing process. Even if no logging is possible, observing should continue as long as the instrument and infrastructure will support it. Observing should only be impacted in the case of instrument or personnel safety hazards.

The fallback rule should apply to image capture and storage as well; the write-to-disk operation should continue smoothly even if the default OUTDIR partition fills. The chain of alternate disk write areas, in all cases (logging and image capture) must be readily visible to the user. The value of OUTDIR must change when a new fallback disk is selected, so that the user can readily see where current images are being stored. Furthermore, a new keyword (DISKFREE) should be supported which displays in megabytes the amount of total disk space remaining in the entire fallback chain for image storage; and two new keywords DIREIMGS and SPECIMGS should display the number of direct or spectral (respectively) DEIMOS images which will fit in DISKFREE megabytes.

These principles and plans are intended to ensure that science data have the highest priority and will not be lost (or even mislaid) in the course of the night.

8.1.2 Event Logging and Response

We must consider two levels of ``event". One is an KTL event (a write to a KTL keyword), the other is the detection of a condition (involving one or more keywords) which we have defined as noteworthy. All KTL writes should be logged, along with their source if it is identifiable. This is fairly simple to do (see below). Conditions, are somewhat more complicated, as they can exist at levels below KTL's consciousness. The hardware control crates can perceive and respond to certain conditions which are not visible at the KTL layer. Therefore, condition detection and response has to take place independently at both levels: in the crates, and at the control computer.

The crate-level condition detection should be restricted to safety hazards, stage collisions, hardware failures, and so on: conditions which may cause damage to the instrument or to staff, or which will disable observing altogether. The crate should log these conditions and its responses in a local (not NFS) log file. This log file should be accessible in some way from the control computer, but the live copy must reside on local disk. The crate condition handling code will have to be hand-crafted. However, since it must detect a limited number of severe conditions related to hardware, limits, interlocks, etc., it is unlikely that the condition handling code will be volatile.

The ``Dashboard" tool already contains a fairly sophisticated KTL keyword condition processor. This facility can evaluate Boolean and arithmetic expressions in which KTL keyword names are terms; the KTL keywords need not be from the same KTL service, so complex conditions can be evaluated involving keywords from both DCS/ACS and the instrument. Any codable action can be taken as a result, e.g. popping up alert windows, altering the visual appearance of the GUI, sending mail, or executing other arbitary Unix commands.

The ``Dashboard" configuration delivered for use as a DEIMOS GUI will doubtless include many ``alarms and alerts" as part of the GUI. However, we feel the need for a watchdog process not tied to the user GUI, one which runs as a daemon (similar in spirit to the watch_hirot) process used now to monitor the HIRES rotator. This process could be effectively a subset of the Dashboard code (just the condition handling); once configured from an authoritative database table of conditions, severities, and responses, it would run continuously, monitoring as many services as necessary (and keeping track of available disk space, updating DISKFREE and the *IMGS keywords).

Another design principle that has emerged from our discussions is that alarm and alert delivery must be multimodal. We have found that it's difficult to get the observer's attention, in practice, when an alarm condition exists. Audio alarms via the /dev/audio port are ineffectual if the speaker is turned off or relocated out of earshot. Console window alarms are almost always ineffectual because the console window is usually hidden or iconified. Our conclusion is that important alarms should be delivered by multiple media (audio, popup alarm boxes, and mail, for example) and that drastic changes in the appearance of the screen are necessary to get the user's attention. We will design our DEIMOS alarms and alerts with these conclusions in mind.

8.1.3 Parametric Logging

Why bother to log parametric information, when a lot of it ends up in image headers anyway?

Particularly in the first year of operation (but arguably, throughout the instrument lifetime) a great deal of utility may be derived from a fairly comprehensive parametric (environmental) log. Such a log would include all raw telemetry values from the instrument (there may be a need for converted values, but raw values are the minimum requirement), and selected values from the dome and telescope (and possibly ambient condition monitoring hardware like thermometers, wind gauges, hygrometers, etc.)

This seems like a lot of data. One may well ask what is wrong with the existing procedure: when engineers want a log of certain parameters, they write a script, put it in their crontab file, and let it run for a few hours or nights; then they look through the log file it generated. The existing procedure has several disadvantages offsetting its one advantage (parsimony of disk space).

When an event takes place (instrument failure or mysterious instrument behaviour) it is often difficult to reproduce. If a comprehensive parametric log were always available, one could investigate and analyze an event the following day, rather than ``set the traps" and wait hopefully for it to happen again.

In specifying a narrow filter of parameters to log, one makes implicit assumptions about the cause of the problem which may not be true. Often, the real question is not ``Does X correlate with this event," but ``Does anything correlate with this event." An accessible and comprehensive parametric log gives the engineer a larger model of the instrument to analyze when searching for the cause of a problem, and the opportunity to make discoveries and connections outside the bounds of his/her initial assumptions.

When individual engineers make individual tailored log files, those files and the data they contain are semi-private; they are not easily cross- correlated, and their location and nomenclature may be somewhat obscure to anyone other than the owner. A centralized online parametric log is available to everyone at all times, from astronomers to OAs to engineers.

Invidual tailored log files created and deleted at whim do not form a continuous historic record of instrument behaviour and performance. It is unlikely that an astronomer will be able to make arbitrary inquiries about instrument performance with regard to a particular image taken at a particular date and time. We feel that the astronomer may find it very useful to make such queries, and that these queries should not pose a burden on the observatory staff by requiring them manually to integrate disjoint private logfiles into a dataset for analysis.

Obviously it is not possible to capture every last operating parameter for the instrument. We can only capture what our telemetry gathers and presents in the form of keywords. However, we can try to ensure that keywords exist for as broad a set of parameters as is practical to implement.

Why should such a log be written directly to the database?

As a text file, it would be fairly unreadable. If written as keyword=value on separate lines, it would be unreadable due to the sheer number of lines and the laboriousness of parsing it into a horizontal format for plotting, etc. If written as records, the association betwen column and value will be very hard to maintain; and awk and grep, while powerful and handy tools, become somewhat unwieldy when used as a programming language for dealing with multi-hundred-column, multi-hundred-Krecord datasets. A relational database is designed to perform retrieval and statistical analysis on exactly this type of data set. To save the log to disk, in a relatively unusable form, then implement a scheduled ingestion of the data into the RDBMS, seems like unnecessary overhead.

Various tools, both commercial and free, are available for browsing, analyzing, and visualizing large RDBMS datasets. Once the information is in this form, there is considerable flexibility of choice in tools and methods for access.

How manageable would the data volume be?

The record length is more of an issue than the record count. The record count, assuming a 5-minute sample interval 24 hours per day, is about 105K records per year. Three years' worth of online record is by no means an excessive record count for rapid retrieval. The field count (number of columns in the table) could be high (between 100 and 200?). If raw values are being preserved rather than multiple value conversions and string representations, we could assume about 4 bytes per field, so a record length of between 400 and 800 bytes. Averaging that to a Kbyte, we could estimate a total dataset size of 55MB per year. This is not considered, by today's standards, an exceptionally large dataset. How much of this information would reside online, and how soon it would be flushed to secondary storage, are open questions.

Is this ambitious?

Actually, it's rather trivial to implement. The more challenging aspect is making intelligent use of the accumulated data. Capturing and storing the data are very simple problems, representing a very tiny fraction of the overall DEIMOS software effort. To make good use of the data, DEIMOS engineers will need to acquire or build tools; our feeling is that many good tools exist already for data visualization and analysis, so that retrieval is the primary issue. We already have in hand several flavours of ``friendly" interfaces to database information retrieval, and by enhancing these, or acquiring similar products from other institutions, users other than database experts should be able to perform exploratory and analytical queries with relative ease.

8.2 Component Modules

8.2.1 EngDataVisu

Function

Tool for retrieving and visualizing database contents.

Language

Tcl/Tk?

Type

User Tool

Author

Clarke

Revision

Source CVS Ref

Essay

This tool (not yet written) would present an easy and attractive interface to engineers and astronomers wanting to make exploratory queries into any database information. The express purpose is to grant analytical access to the instrument's operational history (parametric logs), but statistical analysis of image headers or other datasets could also be performed. The precise nature of the user interface is TBD. There are strong reasons for preferring a web-based approach, but some limitations of Web browser function might encourage a standalone, dedicated GUI instead. Whatever the nature of the GUI, it must offer basic plotting capability, and must handle the specific kinds of time-based statistics needed for understanding log data: statistics by month, week, day, fortnight, hour of the day, etc. Again, there are reasons for preferring to graft onto an existing data visualization suite, and other reasons for preferring a standalone application with built-in graphics.

Some version of this tool, however constructed, should be available at engineering test time; a fairly complete version should ship with, or before, the instrument.

8.2.2 KTLwatch

Function

Watches selected KTL keywords and takes action if conditions are met.

Language

Tcl

Type

Daemon

Author

Clarke

Revision

Source CVS Ref

Essay

A stripped down version of Dashboard (not yet written), running as a daemon; like Dashboard, it can configure itself from the database directly or from a cached local file of information originally retrieved from the database. It relies upon the keyword (Memes) tables and an additional table of Alarms (boolean pseudo-tcl expressions, severities, handlers etc.)

It can take any arbitrary action on fulfilment of a condition, or a default action depending on the severity of the condition.

8.2.3 ElectroLog

Function

Online observer logbook

Language

TclTk

Type

Tool

Author

Deich

Revision

Source CVS Ref

Essay

ElectroLog (currently in early beta) is an X11 GUI observer logbook which accept both arbitrary interactive input from the user, and data input on socket connections from processes. It can easily be configured into any KTL system involving ``Dashboard" in such a way that any KTL event can write a formatted record of KTL keyword values to the GUI logbook. A simple example might be to write, at end of exposure, a user-selected list of KTL keywords as an observing log record of the exposure. The user can then annotate the logbook at will.

ElectroLog has various handy features including log print, log save, mail log to recipient, and so forth. It can retroactively ingest keywords from an entire directory full of FITS images, making a ex post facto observing log.

ElectroLog will be tested on Mount Hamilton with Lick instruments long before DEIMOS commissioning, and if it has won popularity by pre-ship review, will probably be included in the DEIMOS software suite as a ``value added" feature.

8.3 Subsystem Procedures

8.3.1 Data Capture

Data capture for logs should be fully automatic. The only exception is the online observer logbook (ElectroLog), in which the observer can enter arbitrary personal comments; and these comments should probably not be ingested into any public record.

8.3.2 Data Retrieval

As we mentioned above, retrieval and visualization of data are the real challenges here. The data retrieval and review process is a combination of manual (user specifies the ``slice" of data to be retrieved) and automatic (SQL is automatically generated from a GUI interface, or filters are automatically swept across the data on a regular interval). The automatic generation of complex SQL statements involving datetime functions is the only feature which does not already exist in one deployed application or another; a UI for datetime queries, and reliable code generation for such queries, will be the features distinguishing EngDataVisu from (e.g.) Wisql.

8.4 Deliverable Documents

Complete schema documentation should be provided with the delivered instrument (see Chapter 7). A complete user guide and cookbook should be provided with any data analysis or visualization tools that ship with DEIMOS. As mentioned in Chapter 7, similar guides and cookbooks are required for all the other database utilities.

Next: 9 Acceptance Suite Up: Part II: Subsystem Designs Previous: 7 Information Management

DEIMOS Software Team <deimos@ucolick.org>
1997-06-13T00:18:19