PDS4 Python Tools - User Manual

1.1 Introduction

PDS4 Python Tools are several modules integrated into a Python library to provide access to PDS4 data.

1.2 Installation

Installation instructions are currently provided at:

https://sbnwiki.asteroiddata.org/Python_PDS4_Tools.html#Installation

1.3 Quick Tutorial

This chapter provides a quick introduction to using PDS4 Python Tools. The goal is to demonstrate basic features without getting into too much detail. If you are a first-time or occasional user, this is where you should start.

After installing PDS4 Python Tools, start Python and load the PDS4 tools library. Note that the module name is all lowercase.

>>> import pds4_tools

Opening a PDS4 file

Once pds4_tools is imported, we can open an existing data file from its describing label,

>>> structures = pds4_tools.read('/path/to/label.xml')       # Local
>>> structures = pds4_tools.read('http://url.com/label.xml') # Remote

For remote URLs, the files will be downloaded to a temporary on-disk cache and deleted upon Python interpreter exit.

The pds4_tools.read() function has several optional arguments that control warnings, scaling, and lazy-loading. It returns a PDS4 Tools object called a StructureList, a list-like object consisting of Structure objects. A Structure typically consists of a data array or table, and the portion of the label that describes it.

>>> structures[0]                   # First Structure
>>> structures[0:2]                 # First two Structures
>>> structures['Integration']       # Structure with LID or name of 'Integration'

The StructureList has a useful method, StructureList.info(), which summarizes the data structure content of the opened PDS4 file:

>>> structures.info()
0 : Array_3D_Spectrum 'Primary' (3 axes, 21 x 10 x 36)
1 : Table_Binary 'Integration' (9 fields x 1000 records)
2 : Table_Binary 'Engineering' (38 fields x 1000 records)
3 : Table_Binary 'Binning' (9 fields x 1000 records)
4 : Table_Binary 'PixelGeometry' (12 fields x 10000 records)
5 : Table_Binary 'SpacecraftGeometry' (36 fields x 21 records)
6 : Table_Binary 'Observation' (22 fields x 1 records)

Note that the LIDs and names used to select a Structure are case-sensitive.

Working with large files

With default arguments, the pds4_tools.read() function immediately reads all data structures into memory. This may be undesirable for labels that describe many large data structures, so the function supports a lazy_load=True argument.

>>> structures = pds4_tools.read('/path/to/label.xml', lazy_load=True)

When enabled, data is read in transparently, and only on the first attempt to access it. Enabling this argument also has minimal impact on smaller files.
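
The idea of reading data only on first access can be sketched in plain Python. This is a conceptual illustration only, not how PDS4 Tools implements lazy loading; the LazyStructure class and its loader argument are hypothetical names invented for this sketch.

```python
from functools import cached_property


class LazyStructure:
    """Hypothetical stand-in for a structure whose data is read on first access."""

    def __init__(self, loader):
        self._loader = loader   # callable that performs the expensive read
        self.read_count = 0     # counts how many times the loader has run

    @cached_property
    def data(self):
        # Runs only on the first access; the result is cached afterwards.
        self.read_count += 1
        return self._loader()


structure = LazyStructure(lambda: list(range(5)))
count_before = structure.read_count   # nothing has been read yet
first = structure.data                # triggers the read
second = structure.data               # served from the cache
count_after = structure.read_count

print(count_before, count_after)
```

Creating the object costs nothing up front; the read happens once, when .data is first touched, and later accesses reuse the cached result.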

Additionally, an interface is available to read only portions of PDS4 arrays into memory at once. For more information, see Working with Image and Array Data.

Working with PDS4 Labels

As mentioned earlier, each element of a StructureList is a Structure object with .label, .meta_data and .data attributes that can be used to access the label and data portions of the Structure. The StructureList itself contains a .label attribute, which can be used to access the entire label.

If you are not familiar with XML, a brief example would be,

<record_length unit="byte">65</record_length>

Here record_length is called an element, unit is called an attribute, byte is the attribute value, and the element’s text is 65.
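
As a minimal sketch, the same snippet can be taken apart with Python's standard xml.etree.ElementTree module (used here directly, rather than through PDS4 Tools) to show which part is which:

```python
import xml.etree.ElementTree as ET

# Parse the one-line XML example from the text above.
element = ET.fromstring('<record_length unit="byte">65</record_length>')

print(element.tag)     # the element name
print(element.attrib)  # the attributes, as a dict
print(element.text)    # the element text, always a string
```

Note that the element text comes back as a string, so it must be converted (for example with int()) before it is used as a number.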

The .label attribute of a StructureList or a Structure is a Label instance, another PDS4 Tools object. It provides access to the XML label content, although some knowledge of XPATH expressions is generally required for search and usage. You may however use the label.to_dict() and label.to_string() methods to obtain more familiar access. Below we provide some examples of using .label and .meta_data.

To search a Label instance, you may use,

>>> structures.label.find('.//record_length').text
'65'
>>> structures.label.find('.//record_length').attrib
{'unit': 'byte'}
>>> structures.label.find('.//start_date_time').text
'2015-06-01T00:36:23.03Z'

This uses XPATH to find the first occurrence of the start_date_time and record_length elements, no matter how deep in the XML tree they are.

If there are multiple occurrences of an element, you may use,

>>> lids = structures.label.findall('.//local_identifier')
>>> lids[0].text
'Primary'
>>> lids[1].text
'Integration'
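
The difference between find and findall can be tried without a data product at hand. The sketch below uses plain ElementTree on a small, made-up XML fragment (not a valid PDS4 label); it assumes only standard-library behavior, not anything specific to the Label object.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up fragment for illustration; not a valid PDS4 label.
xml_text = """
<Product>
  <File_Area>
    <Array><local_identifier>Primary</local_identifier></Array>
    <Table>
      <local_identifier>Integration</local_identifier>
      <record_length unit="byte">65</record_length>
    </Table>
  </File_Area>
</Product>
"""
root = ET.fromstring(xml_text)

# './/' searches all descendants; find() returns only the first match.
first_lid = root.find('.//local_identifier').text

# findall() returns every match, in document order.
all_lids = [element.text for element in root.findall('.//local_identifier')]

# ElementTree's find() returns None when nothing matches,
# so check the result before reading .text from it.
missing = root.find('.//start_date_time')

print(first_lid, all_lids, missing)
```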

To search for elements outside of the core PDS namespace, one may use,

>>> reference_time = structures.label.find('.//geom:geometry_reference_time_utc')
>>> reference_time.text
'2019-05-24T10:30:06.724Z'
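
For comparison, plain ElementTree requires the prefix in the search string to be mapped to a namespace URI explicitly, whereas the Label example above passes no such map. The sketch below is an assumption-laden illustration: the XML fragment is made up, and the namespace URIs are supplied only for the purpose of the example.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the namespace URIs are assumed for this sketch.
xml_text = """
<Product xmlns="http://pds.nasa.gov/pds4/pds/v1"
         xmlns:geom="http://pds.nasa.gov/pds4/geom/v1">
  <geom:geometry_reference_time_utc>2019-05-24T10:30:06.724Z</geom:geometry_reference_time_utc>
</Product>
"""
root = ET.fromstring(xml_text)

# Map the prefix used in the search string to its namespace URI.
namespaces = {'geom': 'http://pds.nasa.gov/pds4/geom/v1'}
reference_time = root.find('.//geom:geometry_reference_time_utc', namespaces)

print(reference_time.text)
```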

For more details, we encourage you to see the Supported XPATH syntax section of the Python manual for ElementTree, which underlies the implementation of the PDS4 Tools’ Label object.

For an individual Structure, we can use the .meta_data attribute to access the associated label information. This attribute may be one of several Meta_Class-derived instances, all of which inherit from Python’s OrderedDict. Below we show some sample meta data for an array described by the label,

>>> array_structure = structures['Primary']

>>> array_structure.type
'Array_3D_Spectrum'

>>> array_structure.meta_data.keys()
['local_identifier', 'offset', 'axes', 'axis_index_order', 'description', 'Element_Array', 'Axis_Array']

>>> array_structure.meta_data['local_identifier']
'Primary'

>>> array_structure.meta_data['Axis_Array']['axis_name']
'Time'

The organization and naming of .meta_data attributes directly follow those in the label, with a few exceptions that are discussed in the notes for each relevant meta data class.

Working with Data

Image and Array Data

If a Structure’s data is an array, the data attribute of the ArrayStructure object will be an object that is for all intents and purposes identical to a NumPy ndarray object, except possessing an additional meta_data attribute. Refer to the NumPy documentation for the complete details on manipulating these numerical arrays.

>>> structures[0].id
'Primary'
>>> data = structures[0].data

Here data contains the data of the first Structure, which corresponds to the Structure with a local identifier of Primary. Alternatively, you can access a Structure by its local identifier or its name,

>>> data = structures['Primary'].data

For very large arrays it may be convenient to read only portions of the array into memory at a time. This may be done with the ArraySection interface if lazy_load=True is set in the initial read call,

>>> data = structures['Large_Array'].section[0:50000, 25000:50000]

For data with Special Constants, such as flag values indicating missing data, you may access a version of the structure where numeric flag values are masked.

>>> data = structures['Primary'].as_masked().data
>>> data = structures['Large_Array'].as_masked().section[0:50000, 25000:50000]

For data access through masked arrays, mathematical functions such as minimum and maximum and many other operations will return correct results instead of counting flag values.
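
The effect of masking can be seen with NumPy's masked arrays alone. In this minimal sketch the values are made up and -9999.0 is an assumed flag value; in PDS4 Tools the flag values come from the Special Constants in the label and as_masked() does the masking for you.

```python
import numpy as np

# Made-up values; -9999.0 stands in for a "missing data" flag value.
raw = np.array([1.5, -9999.0, 3.0, 2.5, -9999.0], dtype='float32')

# Mask every element equal to the flag value.
masked = np.ma.masked_equal(raw, -9999.0)

print(raw.min(), raw.mean())        # statistics skewed by the flag values
print(masked.min(), masked.mean())  # statistics over the valid values only
print(masked.count())               # number of unmasked values
```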

In all cases, the returned data has many useful attributes and methods for a user to get information about the array; e.g.,

>>> data.shape
(21, 10, 36)
>>> data.dtype.name
'float32'

Since image data is a NumPy array, we can slice it, view it, and perform mathematical operations on it. To see the pixel value at i1=5, i2=2, i3=10:

>>> print(data[4, 1, 9])

Note that Python is 0-indexed. Additionally, all PDS4 data is required to be last index fastest, and the read-in array dimensions will follow the sequence_number as provided in the labels.
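
Both points can be checked on a stand-in array. The sketch below builds a NumPy array with the same shape as the 'Primary' example but filled with made-up sequential values, so that each value equals its position in storage order.

```python
import numpy as np

# Stand-in array with the same shape as the 'Primary' example (21 x 10 x 36).
data = np.arange(21 * 10 * 36, dtype='float32').reshape(21, 10, 36)

# The 1-based position (5, 2, 10) corresponds to the 0-based index [4, 1, 9].
value = data[4, 1, 9]

# "Last index fastest": stepping the last index moves to the next stored value.
flat = data.ravel()                   # values in storage order (a view here)
offset = 4 * (10 * 36) + 1 * 36 + 9   # position of [4, 1, 9] in storage order

print(value, flat[offset], flat[offset + 1] == data[4, 1, 10])
```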

The next example shows that NumPy array data can be manipulated in a single command, specifically a multiplication and division of all values,

>>> data = (data * 10) / 5

To access label meta data for an ArrayStructure, we may use its .meta_data attribute. See the Working with PDS4 Labels section for examples, as well as the Meta_ArrayStructure class.

Table Data

If working with a table, the data inside the TableStructure can be accessed in multiple ways. Similar to array data, an individual field’s data will be an object that is for all intents and purposes identical to a NumPy ndarray object, except possessing an additional meta_data attribute. The underlying data object containing all fields is similar to a recarray. Refer to the NumPy documentation for the complete details on manipulating these numerical arrays.

Common ways to access data for individual columns (or fields, in PDS4 parlance) are,

# Access the 'Wavelength' field in the 'Integration' Table
>>> structures['Integration']['Wavelength']
>>> structures['Integration'].field('Wavelength')

# Access the first field
>>> structures['Integration'].field(0)

# Access multiple fields at the same time
>>> structures['Integration'][['Timestamp', 'Wavelength']]

As can be seen in these examples, a field can be obtained by either index or by name.

In many cases it is preferable to access fields by their name, as the field name is entirely independent of its physical order in the table. As with Structure names, field names are case-sensitive.

To access the data record-wise,

# Access the entire first record (all fields) in the 'Integration' Table
>>> structures['Integration'][0]

# Access the first 10 records (all fields)
>>> structures['Integration'][0:10]

The underlying data object, which is essentially a NumPy record array, may be accessed directly via,

>>> structures['Integration'].data

The NumPy arrays returned by the above calls contain the data for the requested selection. We can slice them, view them, and perform mathematical operations as desired.

>>> field = structures['Integration']['Wavelength']

>>> field[0:10]   # The first 10 rows for field 'Wavelength'
>>> field.mean()  # Take the mean of the field
>>> field * 5     # Multiply each value in the field by 5
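
The same access patterns can be tried on a plain NumPy structured array, which is a reasonable mental model for the underlying table data. The table below is made up; only the field names mirror the examples above, and no PDS4 Tools objects are involved.

```python
import numpy as np

# Made-up table with two fields; the names mirror the examples above.
table = np.array(
    [(0.0, 500.0), (1.0, 510.0), (2.0, 520.0)],
    dtype=[('Timestamp', 'float64'), ('Wavelength', 'float64')],
)

wavelength = table['Wavelength']                   # one field, by name
first_record = table[0]                            # one record, all fields
subset = table[['Timestamp', 'Wavelength']][0:2]   # several fields, two records
scaled = wavelength * 5                            # element-wise arithmetic

print(wavelength.mean(), first_record['Wavelength'], subset.shape, scaled)
```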

For data with Special Constants, such as flag values indicating missing data, you may access a version of the structure where numeric flag values are masked.

# Access a view of the table where flag values are masked
>>> masked_table = structures['Integration'].as_masked()

# Data access and operations are unchanged, e.g.:
>>> masked_table['Wavelength']
>>> masked_table.field('Wavelength')
>>> masked_table.field(0)

For data access through masked tables, mathematical functions such as minimum and maximum and many other operations will return correct results instead of counting flag values. This is also often advantageous when plotting data, where common software will exclude masked values. The underlying label must correctly describe Special Constants for them to be masked.

The object returned when accessing individual fields is for all intents and purposes identical to a NumPy ndarray object. However, it also provides a .meta_data attribute that gives the field’s meta data as recorded in the label,

>>> field.meta_data['unit']
'deg'

>>> field.meta_data.keys()
['name', 'location', 'data_type', 'length', 'unit', 'description']

To access label meta data for the entire TableStructure, we may use its .meta_data attribute. See the Working with PDS4 Labels section for examples, as well as the Meta_TableStructure class.

Visualization

PDS4 Tools ship with a GUI that enables basic visualization of PDS4 data. To use this,

>>> import pds4_tools

You may then call the GUI via,

>>> # Call an empty Viewer, allowing you to browse disk for file
>>> pds4_tools.view()
>>>
>>> # Specify path to label describing the data product to visualize
>>> pds4_tools.view('/path/to/label.xml')
>>>
>>> # Specify structures that have already been read-in
>>> structures = pds4_tools.read('/path/to/label.xml')
>>> pds4_tools.view(from_existing_structures=structures)

Note that the basic GUI works via Tkinter, which generally ships with installations of Python. To enable Image View and Plot View, you must also have a recent version of Matplotlib installed.

1.4 API

The full API reference is available here.