***** File FITSCOMP.TXT                                                       
                                                                              
               An Extension of FITS for Data Compression                      
                                                                              
                        Archibald Warnock III                                 
                           Robert S. Hill                                     
                                 and                                          
                          Barbara B. Pfarr                                    
                                                                              
                           ST Systems Corp.                                   
                   NASA/Goddard Space Flight Center                           
                          Greenbelt, MD 20771                                 
                                                                              
                                                                              
ABSTRACT                                                                      
                                                                              
An extension of the FITS format to allow for data compression is              
presented. This extension will allow a variety of algorithms to be            
utilized to reduce the storage requirements for images and ASCII data.        
Simple rules permit the complete reconstruction of the FITS header            
records for the original uncompressed file; yet the header for the            
compressed file is accessible to existing FITS file readers. An               
algorithm for the previous pixel compression scheme is presented as an        
example.                                                                      
                                                                              
                                                                              
                                                                              
                                                                              
INTRODUCTION                                                                  
                                                                              
The Large-Scale Phenomena Network of the International Halley Watch           
(IHW) has the task of digitizing wide-field photographic plates of Comet      
P/Halley. The physical size of the plates and the desire for high             
spatial resolution combine to result in extremely large digital images,       
often as large as 4096 lines by 4096 samples of 16-bit numbers (i.e.,         
individual images may be as large as 32 megabytes).                           
                                                                              
Typical production images are roughly 18 megabytes in size, and the LSPN      
expects to generate 1000 of them. The Near Nucleus Studies Network has a      
similar problem with CCD images, albeit on a smaller scale. The IHW           
Digital Archive could be expected to consist of perhaps 35 to 40 CD-ROM       
disks, most of them containing only images. By contrast, an archive           
containing only compressed images would have on the order of 20 disks,        
with a corresponding decrease in cost. It is essential that there be          
some portable data compression scheme to reduce the physical storage          
requirements of the imagery, and thereby reduce the mastering and             
production costs for a complete archive on CD-ROM. The expense and            
effort associated with distributing the complete archive on any               
alternative medium, such as magnetic tape, would be prohibitive.              
                                                                              
                                                                              
                                                                              
                                                                              
POSSIBLE COMPRESSION SCHEMES                                                  
                                                                              
The IHW has restricted its consideration of compression schemes to those      
which preserve the full dynamic range of the data. Of these, we consider      
only the so-called "instantaneous" methods; that is, those methods in         
which the next output number is computed from the current input number        
and a small set of status variables.                                          
                                                                              
The advantage of instantaneous decompression schemes over others is that      
no scratch space is required to hold the uncompressed image, i.e., the        
decompression takes place "on the fly". A compressed image on FITS tape,      
with appropriate keywords, can be read in virtually the same way that         
uncompressed images are currently read, with the decompression being          
performed as the records are read and unpacked. This will prove to be         
most convenient for users of the IHW CD-ROM archive, given the large          
image sizes involved.                                                         
                                                                              
The IHW believes that the previous-pixel (or first-differences)               
algorithm is the one giving the best compression ratio for the least          
computational effort and will adopt it for use on the CD-ROM archive          
disks.                                                                        
                                                                              
                                                                              
                                                                              
                                                                              
THE FITS HEADER FOR COMPRESSED DATA                                           
                                                                              
A compressed file may be considered as an extension to basic FITS in the      
following sense: the compressed data may be read simply as a byte             
stream. It is fully documented by an extension header written according       
to the FITS perscription. Thus, one retains the capability of examining       
the header of the compressed data with a FITS reader. The end result          
(after decompression) can be a standard FITS file, in the sense that the      
data is restored to its original uncompressed state, complete with a          
valid header.                                                                 
                                                                              
Two types of size information are required. The basic FITS keywords           
(BITPIX, NAXIS, NAXIS1 - NAXISn) must continue to define the attributes       
of the file in the form in which the user receives it. New FITS keywords      
then define the file as it is to be reconstructed.                            
                                                                              
We propose that the keyword XTENSION take the value 'COMPRESS' if the         
data has been compressed, and that a series of keywords COMPRES1,             
COMPRES2, ..., COMPRESn give the names of the compression schemes used        
on the original data, in the order in which the compression was               
performed; i.e., the scheme given by COMPRES1 was the first compression       
done, that given by COMPRES2 was the second one done, and so forth. The       
number of axes in the original uncompressed data is given by the keyword      
LAXIS (for logical axis), and the corresponding dimensions are given by       
the keywords LAXIS1 through LAXISn. The keyword LBITPIX will define the       
precision of the uncompressed data, corresponding to BITPIX after             
decompression.                                                                
                                                                              
The sample FITS header which follows illustrates the proposed compressed      
file extension.                                                               
                                                                              
First record:                                                                 
                                                                              
SIMPLE  =                    T / VALID FITS FORMAT                            
BITPIX  =                    8 / BYTE DATA                                    
NAXIS   =                    0 / NO DATA RECORDS YET                          
EXTEND  =                    T / THERE MAY BE EXTENSION RECORDS               
END                                                                           
                                                                              
                                                                              
Extension record:                                                             
                                                                              
XTENSION= 'COMPRESS'           / THE DATA HAS BEEN COMPRESSED                 
NAXIS   =                    1 / THE DATA IS JUST A BYTE STREAM               
NAXIS1  =                12345 / OF THIS MANY BYTES                           
PCOUNT  =                    0 / NO PARAMETERS PRECEEDING THE DATA            
GCOUNT  =                    1 / ONLY ONE GROUP                               
LAXIS   =                    2 / NUMBER OF AXES, UNCOMPRESSED IMAGE           
LAXIS1  =                  512 / LOGICAL NAXIS1                               
LAXIS2  =                  512 / LOGICAL NAXIS2                               
LBITPIX =                   16 / LOGICAL BITS PER PIXEL                       
COMPRES1= 'PREVPIXEL'          / FIRST COMPRESSION SCHEME USED                
COMPRES2= 'HUFFMAN  '          / SECOND COMPRESSION SCHEME USED               
END                                                                           
                                                                              
                                                                              
Any keywords which must accompany the uncompressed data may be put into       
the extension header, and should be copied to the output header when the      
file is decompressed.                                                         
                                                                              
This scheme identifies the data as having been compressed (XTENSION=          
'COMPRESS') and gives the compression scheme(s) used (COMPRES1=               
'PREVPIXEL', COMPRES2= 'HUFFMAN') in the header, so any FITS reader can       
decide whether or not it knows how to handle the data. It defines the         
real length of the data file (NAXIS1=12345) so the data can be correctly      
skipped, if desired. It also preserves the dimensions of the original         
data file as "logical" attributes (LAXIS, LAXISn), so that a basic FITS       
file may be created once the byte stream is uncompressed.                     
                                                                              
                                                                              
                                                                              
                                                                              
DISCUSSION                                                                    
                                                                              
One of the reasons for suggesting keywords whose values state the             
compression schemes used (rather than something like COMPRESS=T) is           
precisely because no single scheme is best under all circumstances. Our       
approach allows flexibility in selecting any appropriate (or even             
inappropriate) compression method and specifying it unambiguously, so         
long as the algorithm has been published and agreed upon.                     
                                                                              
New compression schemes will be added to the recognized list in the same      
way that FITS extensions have been proposed and adopted in the past           
(Harten, et al. 1988). The IHW intends to inaugurate this procedure with      
the previous-pixel algorithm and will publish source code and                 
pseudo-code for decompression both on the archive CD-ROM and in print.        
                                                                              
No one has utilized compression in astronomical production work yet,          
although it certainly must happen eventually. IHW is already performing       
the necessary prototyping for previous-pixel compression. For IHW,            
compressed images are an absolute necessity (or perhaps a necessary           
evil).                                                                        
                                                                              
It is also possible to interpret compression like blocking; i.e., as          
irrelevant to the logical structure of the data (Wells, private               
communication). Although this is philosophically true, we remain              
convinced of the need to identify and support the most useful practical       
options, as is already being done for coordinate systems. Agreement on a      
flexible set of keywords will allow new algorithms to be evaluated and        
added later, in full accordance with any current agreement. While it is       
possible to write code which analyzes a given input image and selects         
the best compression scheme (similar to the archiving programs in the         
personal computer world) and thus use a syntax like "COMPRESS=T", it is       
impossible to anticipate all schemes which may be desirable in the            
future.                                                                       
                                                                              
One alternative to the current proposal would be to use a standalone          
program to compress both header and data. This would render the header        
records unreadable to existing FITS software until after decompression.       
It would be wasteful of both time (for decompression) and disk space          
(for scratch storage) and, we believe, contrary to the basic FITS             
philosophy of allowing examination of the header before deciding what to      
do with the data.                                                             
                                                                              
However, nothing in this proposal precludes the use of a separate             
decompressor. It is certainly possible to skip over the header to reach       
the data and decompress the image data separately, then concatenate a         
revised header and uncompressed image, although the result would not be       
a valid FITS format. Note also that nothing in this proposal restricts        
compression to images. The scheme specified here would work equally well      
with ASCII data. At present, it's use on tables is precluded by the           
inability to nest FITS extensions under the standard.                         
                                                                              
Not only is our approach independent of the storage medium, but also, it      
avoids reliance on coded filename extensions and the like. Nevertheless,      
data analysis packages can easily recognize compressed data and either        
process or skip, according to their capabilities.                             
                                                                              
Data compression has proven its value in many areas - PC-based                
telecommunications and file transfer, transmission of spacecraft data,        
etc. It is only a matter of time until astronomers appreciate how             
attractive it can be for images in general (consider the prospect of          
2048 by 2048 CCD chips with three bytes per pixel). Now is a good time        
to decide how to incorporate compression into the general data                
interchange standards so that we can avoid revisions in the future.           
                                                                              
                                                                              
                                                                              
                                                                              
BIBLIOGRAPHY                                                                  
                                                                              
Harten, R. H., Grosbol, R., Greisen, E. W., and Wells, D. C., 1988,           
                           Astron. Astrophys. Suppl. Ser., 73, 365.           
                                                                              
                                                                              
                                                                              
                                                                              
APPENDIX                                                                      
                                                                              
The algorithm for the previous-pixel compression scheme to be used by         
the International Halley Watch is given here in pseudocode. It is based       
on the observation that, for many images of 16-bit data, the                  
pixel-to-pixel differences may often be coded within the dynamic range        
available in 8 bits, yielding a substantial savings in file size from a       
small computational investment.                                               
                                                                              
All differences that lie in the range [-127,127] can be coded in a            
single byte. Each difference has the bias value 127 added to it in order      
to avoid problems on machines which require unsigned byte data. This          
yields the range [0,254] in the actual data stream.                           
                                                                              
The value 255 is reserved as a flag to indicate that the difference           
between the two current pixels is too large to be stored in 8 bits. In        
this case, the two bytes that follow the flag byte are to be interpreted      
as a new 16-bit pixel value, which then provides a new zero-point for         
the differences.                                                              
                                                                              
No allowances are made for ends of lines, that is, the successive             
differences are allowed to cross from the right-hand edge of the image        
to the next line at the left hand edge. For images with smooth                
backgrounds, this will often result in another 8-bit difference, and so       
save a few more bytes. Note that the number of samples in a line is           
given by the keyword LAXIS2, so that the compressed image need not flag       
the start of a new line.                                                      
                                                                              
                                                                              
                                                                              
                                                                              
                         PREVIOUS PIXEL ALGORITHM                             
                  WITH DIFFERENCE FLAG AS SEPARATE BYTE                       
                                                                              
                                                                              
COMPRESSION:                                                                  
                                                                              
Load 255 into output record                                                   
READ first record                                                             
Set first value to be PREVPIXEL                                               
Load first value into output record                                           
DO UNTIL no more data on input                                                
    IF input buffer is empty THEN read next record                            
    Compute difference between CURRENTPIXEL and PREVPIXEL                     
    IF this difference is within -127:127 THEN                                
         Add bias of 127 to the difference                                    
         Convert the biased difference to a single byte                       
         Load value of difference byte into output record                     
         IF output record full, WRITE out record                              
    ELSE IF the difference is outside -127:127 THEN                           
         Load flag byte (255) into output record                              
         IF output record full, WRITE out record                              
         Load high byte of CURRENTPIXEL into output record                    
         IF output record full, WRITE out record                              
         Load low byte of CURRENTPIXEL into output record                     
         IF output record full, WRITE out record                              
    ENDIF                                                                     
    Set PREVPIXEL equal to CURRENTPIXEL                                       
ENDDO                                                                         
IF partially filled output buffer remains THEN                                
    Blank to the end of the buffer                                            
    WRITE out final (partial) record                                          
ENDIF                                                                         
                                                                              
                                                                              
                                                                              
DECOMPRESSION:                                                                
                                                                              
READ first record                                                             
Verify that first byte is 255                                                 
Set first value to be PREVPIXEL                                               
Load first value into output record                                           
DO UNTIL no more data on input                                                
    IF input buffer is empty THEN read next record                            
    Get CURRENTBYTE from input buffer                                         
    IF CURRENTBYTE is not equal to 255 (is a difference) THEN                 
         NEWPIXEL = PREVPIXEL + CURRENTBYTE - 127                             
         Load NEWPIXEL into output record                                     
         IF output record full, WRITE out record                              
    ELSE IF CURRENTBYTE = 255 THEN                                            
         IF input buffer has < 2 bytes left THEN read next record             
         Set NEWPIXEL to the next 2 bytes in the input buffer                 
         Load NEWPIXEL into output record                                     
         IF output record full, WRITE out record                              
    ENDIF                                                                     
    Set PREVPIXEL equal to NEWPIXEL                                           
ENDDO                                                                         
IF partially filled output buffer remains THEN                                
    Blank to the end of the buffer                                            
    WRITE out final (partial) record                                          
ENDIF