Datasets¤
A Dataset is analogous to a file in an operating system's file system, and it is contained within a Group (analogous to a directory).
A Dataset behaves like a numpy ndarray with Metadata attached, and it can be accessed in read-only mode or in read-write mode.
Since a Dataset behaves like a numpy ndarray, the attributes of an ndarray are also valid for a Dataset. For example, suppose my_dataset is a Dataset
>>> my_dataset
<Dataset '/my_dataset' shape=(5,) dtype='|V16' (2 metadata)>
>>> print(my_dataset)
array([(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83),
       (8.73, 0.74)], dtype=[('x', '<f8'), ('y', '<f8')])
You can get the shape using
>>> my_dataset.shape
(5,)
or convert the data in the Dataset to a Python list using tolist
>>> my_dataset.tolist()
[(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83), (8.73, 0.74)]
To access the Metadata of a Dataset, use the metadata attribute
>>> my_dataset.metadata
<Metadata '/my_dataset' {'temperature': 20.13, 'humidity': 45.31}>
You can access values of the Metadata as keys
>>> my_dataset.metadata["humidity"]
45.31
or as attributes
>>> my_dataset.metadata.temperature
20.13
Depending on the dtype that was used to create the ndarray for the Dataset, the field names can also be accessed as class attributes. For example, you can access the fields in my_dataset as keys
>>> my_dataset["x"]
array([0.23, 1.86, 3.44, 5.91, 8.73])
or as attributes
>>> my_dataset.x
array([0.23, 1.86, 3.44, 5.91, 8.73])
See Accessing Keys as Class Attributes for more information.
You can also chain multiple attribute calls together. For example, to get the maximum x value in my_dataset you can use
>>> print(my_dataset.x.max())
8.73
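Field access and method chaining come directly from NumPy's structured-array behaviour. Here is a pure-NumPy sketch of the same chained call, using an array with the same dtype and values as my_dataset:

```python
import numpy as np

# Structured array matching my_dataset
data = np.array(
    [(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83), (8.73, 0.74)],
    dtype=[("x", "<f8"), ("y", "<f8")],
)

# Field access returns a 1-D view of that column,
# so ndarray methods can be chained onto it
x = data["x"]
print(x.max())  # 8.73
```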
Automatic Group Creation¤
If you want to create a new Dataset and its parent Groups do not exist yet, the parent Groups are automatically created for you
>>> voltages = root.create_dataset("a/b/c/voltages", data=[3.2, 3.4, 3.3])
>>> root.a
<Group '/a' (2 groups, 1 dataset, 0 metadata)>
>>> root.a.b
<Group '/a/b' (1 group, 1 dataset, 0 metadata)>
>>> root.a.b.c
<Group '/a/b/c' (0 groups, 1 dataset, 0 metadata)>
>>> voltages
<Dataset '/a/b/c/voltages' shape=(3,) dtype='<f8' (0 metadata)>
Slicing and Indexing¤
Slicing and indexing a Dataset are valid operations, but they return a numpy ndarray, which does not contain Metadata.
Consider my_dataset from above. You can slice it
>>> my_dataset[::2]
array([(0.23, 1.27), (3.44, 2.91), (8.73, 0.74)],
      dtype=[('x', '<f8'), ('y', '<f8')])
or index it
>>> print(my_dataset[2])
(3.44, 2.91)
Since a numpy ndarray is returned, you are responsible for keeping track of the Metadata in slicing and indexing operations. For example, you can create a new Dataset from the subset by calling the create_dataset method
>>> my_subset = root.create_dataset("my_subset", data=my_dataset[::2], **my_dataset.metadata)
>>> my_subset
<Dataset '/my_subset' shape=(3,) dtype='|V16' (2 metadata)>
>>> my_subset.data
array([(0.23, 1.27), (3.44, 2.91), (8.73, 0.74)],
      dtype=[('x', '<f8'), ('y', '<f8')])
>>> my_subset.metadata
<Metadata '/my_subset' {'temperature': 20.13, 'humidity': 45.31}>
Arithmetic Operations¤
Arithmetic operations are valid with a Dataset. The returned object is a Dataset with all Metadata copied and the name attribute updated to represent the operation that was performed.
For example, consider a temperatures Dataset
>>> temperatures
<Dataset '/temperatures' shape=(3,) dtype='<f8' (1 metadata)>
>>> temperatures.data
array([19.8, 21.1, 20.5])
>>> temperatures.metadata.unit
'°C'
and you want to add 1 to each temperature value, you can do the following
>>> plus_1 = temperatures + 1
>>> plus_1
<Dataset 'add(/temperatures)' shape=(3,) dtype='<f8' (1 metadata)>
>>> plus_1.data
array([20.8, 22.1, 21.5])
>>> plus_1.metadata.unit
'°C'
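The element-wise arithmetic itself follows standard NumPy broadcasting rules; only the Metadata handling and name update are specific to a Dataset. A pure-NumPy sketch of the same addition:

```python
import numpy as np

# Adding a scalar broadcasts over every element,
# exactly as in the Dataset example above
temperatures = np.array([19.8, 21.1, 20.5])
plus_1 = temperatures + 1  # each value increased by 1
print(plus_1)
```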
If the arithmetic operation involves multiple Datasets, the Metadata from the Datasets are merged into the resultant Dataset. If the Metadata of the individual Datasets share a key, only the key-value pair from the right-most Dataset in the operation survives the merge.
For example, suppose you have two Datasets that contain the following information
>>> dset1.data
array([1., 2., 3.])
>>> dset1.metadata
<Metadata '/dset1' {'temperature': 20.3}>
>>> dset2.data
array([4., 5., 6.])
>>> dset2.metadata
<Metadata '/dset2' {'temperature': 21.7}>
You can add the Datasets, but the temperature value in dset2 will be merged into the Metadata of dset3 (since dset2 is to the right of dset1 in the addition operation)
>>> dset3 = dset1 + dset2
>>> dset3
<Dataset 'add(/dset1,/dset2)' shape=(3,) dtype='<f8' (1 metadata)>
>>> dset3.metadata
<Metadata 'add(/dset1,/dset2)' {'temperature': 21.7}>
If you want to preserve both temperature values, or change the resultant name, you can do so by explicitly creating a new Dataset
>>> dset3 = root.create_dataset("dset3", data=dset3, t1=dset1.metadata.temperature, t2=dset2.metadata.temperature)
>>> dset3
<Dataset '/dset3' shape=(3,) dtype='<f8' (2 metadata)>
>>> dset3.data
array([5., 7., 9.])
>>> dset3.metadata
<Metadata '/dset3' {'t1': 20.3, 't2': 21.7}>
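The right-most-wins rule follows the same semantics as merging plain Python dicts, where later keys overwrite earlier ones. A minimal sketch of the merge behaviour (plain dicts standing in for the Metadata objects):

```python
# Later keys overwrite earlier ones, so m2's value survives
m1 = {"temperature": 20.3}
m2 = {"temperature": 21.7}

merged = {**m1, **m2}
print(merged)  # {'temperature': 21.7}
```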
Logging Records¤
The DatasetLogging class is a custom Dataset that is also a Handler which automatically appends logging records to the Dataset. See create_dataset_logging for more details.
The following illustrates how to automatically append logging records to a Dataset
>>> import logging
>>> from msl.io import JSONWriter
>>> logger = logging.getLogger("my_logger")
>>> root = JSONWriter()
>>> log_dset = root.create_dataset_logging("log")
>>> logger.info("hi")
>>> logger.error("cannot do that!")
>>> log_dset.data
array([(..., 'INFO', 'my_logger', 'hi'),
       (..., 'ERROR', 'my_logger', 'cannot do that!')],
      dtype=[('asctime', 'O'), ('levelname', 'O'), ('name', 'O'), ('message', 'O')])
Get all ERROR logging records
>>> errors = log_dset[log_dset["levelname"] == "ERROR"]
>>> print(errors)
[(..., 'ERROR', 'my_logger', 'cannot do that!')]
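The filtering above is standard NumPy boolean-mask indexing on a structured array. A pure-NumPy sketch of the same pattern, using hypothetical log records with the same dtype as log_dset:

```python
import numpy as np

# Hypothetical log records (timestamps made up for illustration)
records = np.array(
    [("10:00:00", "INFO", "my_logger", "hi"),
     ("10:00:01", "ERROR", "my_logger", "cannot do that!")],
    dtype=[("asctime", "O"), ("levelname", "O"), ("name", "O"), ("message", "O")],
)

# Comparing a field yields a boolean mask; indexing with it
# keeps only the matching records
errors = records[records["levelname"] == "ERROR"]
print(errors["message"][0])  # cannot do that!
```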
Stop the DatasetLogging instance from receiving logging records
>>> log_dset.remove_handler()
Note
When a file is read, an object that was once a DatasetLogging is loaded as a Dataset (i.e., newly emitted logging records will not be appended to it). To convert the Dataset back into a DatasetLogging, so that emitted logging records are once again appended to it, call the require_dataset_logging method with the name argument set to the value of the Dataset's name attribute.