Building Georeferenced Datasets from HDF5 Files with h5py, xarray, and Rasterio

Introduction

HDF5 (.h5) files are widely used in scientific computing, particularly for storing large datasets from remote sensing missions, climate models, and other geospatial applications. These files are self-describing: they store both the data and its metadata (such as spatial extent, resolution, and units) in a hierarchical structure. However, working with HDF5 files can be challenging because the layout varies from product to product and many files carry no standardized georeferencing information.

This tutorial provides an in-depth guide on how to read, process, and convert HDF5 files into a usable raster format using Python libraries such as h5py, xarray, rasterio, and rioxarray. By the end of this tutorial, you will be able to extract geospatial information, construct affine transformations, and create georeferenced datasets suitable for GIS analysis and visualization.


Prerequisites

Before starting, ensure you have the following Python libraries installed:

pip install rioxarray xarray rasterio h5py numpy matplotlib

h5 Datasets

You can download h5 datasets from https://search.earthdata.nasa.gov/search?q=ecostress&ac=true

Code Repo

The full code is available at https://github.com/geospatialearning/hdf5_to_georeferenced_raster.

Key Libraries

  • h5py: For reading and exploring HDF5 files.
  • xarray: For handling multi-dimensional labeled arrays.
  • rasterio: For geospatial raster operations.
  • rioxarray: An extension of xarray for geospatial data.
  • numpy: For numerical operations.
  • matplotlib: For visualizing data.

Opening HDF5 Files in Python

There are multiple ways to open HDF5 files in Python. Below, we explore two approaches: using rioxarray and h5py.

Using rioxarray

The rioxarray library is designed to handle geospatial raster data. It extends xarray with geospatial capabilities, including CRS (Coordinate Reference System) and affine transformation support.

import rioxarray as riox

ds = riox.open_rasterio("h5/lst_ECOSTRESS_L2_LSTE_28628_015_20230724T203731_0601_02.h5")

However, when opening certain HDF5 files, you may encounter a warning like:

NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.

This indicates that the file lacks georeferencing information, which must be manually extracted and applied.


Using h5py

To address the lack of georeferencing, we use h5py to inspect the file’s structure and extract metadata manually.


Exploring the HDF5 Structure

HDF5 files are hierarchical, similar to a file system. They contain groups and datasets. To understand the contents of the file, list its keys:

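h5_ds = h5py.File("h5/lst_ECOSTRESS_L2_LSTE_28628_015_20230724T203731_0601_02.h5", "r")  # needs `import h5py`
print("Top-level groups:", list(h5_ds.keys()))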

Listing Metadata and Data Sets

We often find three main groups in remote sensing HDF5 files:

  1. Metadata Groups: Contain high-level metadata about the dataset.
  2. Data Groups: Contain the actual raster data.
  3. Standard Metadata: Contains standard fields like bounding coordinates, resolution, etc.
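To see how these groups are laid out in a particular file, a minimal sketch like the following walks the whole hierarchy with h5py's visititems. It builds a tiny stand-in file so that it runs without a download; the group and dataset names (SDS/LST, StandardMetadata/ImageLines) are assumptions modeled on ECOSTRESS-style products and may differ in your file.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file so the sketch runs without a download.
# The group and dataset names are assumptions, not read from a real product.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("SDS/LST", data=np.zeros((4, 5), dtype="uint16"))
    f.create_dataset("StandardMetadata/ImageLines", data=4)

entries = []


def describe(name, obj):
    """Record every group and dataset, with the shape for datasets."""
    if isinstance(obj, h5py.Dataset):
        entries.append((name, "dataset", obj.shape))
    else:
        entries.append((name, "group", None))


with h5py.File(path, "r") as h5_ds:
    print("Top-level groups:", list(h5_ds.keys()))
    h5_ds.visititems(describe)  # calls describe() for every object below the root

for name, kind, shape in entries:
    print(f"{kind:8s} {name} {shape if shape is not None else ''}")
```

Pointing `path` at a real file instead of the stand-in should print that product's full tree, which is usually the quickest way to find where the bounding coordinates and raster layers live.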

Extracting Geospatial Information

Geospatial information is critical for interpreting the data correctly. We extract the bounding box and dimensions of the raster.

Bounding Coordinates

Retrieve the geographic extent (west, east, south, north):
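As a rough sketch, the extent can be read from scalar datasets in the standard metadata group. The group name StandardMetadata and the four field names below are assumptions based on ECOSTRESS-style products, and the coordinate values are made up for the stand-in file; check the names against the structure listing of your own file.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file with made-up extent values (assumed field names).
path = os.path.join(tempfile.mkdtemp(), "demo_bounds.h5")
with h5py.File(path, "w") as f:
    meta = f.create_group("StandardMetadata")
    meta.create_dataset("WestBoundingCoordinate", data=-120.5)
    meta.create_dataset("EastBoundingCoordinate", data=-119.5)
    meta.create_dataset("SouthBoundingCoordinate", data=35.0)
    meta.create_dataset("NorthBoundingCoordinate", data=36.0)


def read_scalar(group, name):
    """Return a scalar or single-element dataset as a Python float."""
    return float(np.asarray(group[name][()]).ravel()[0])


with h5py.File(path, "r") as h5_ds:
    meta = h5_ds["StandardMetadata"]
    west = read_scalar(meta, "WestBoundingCoordinate")
    east = read_scalar(meta, "EastBoundingCoordinate")
    south = read_scalar(meta, "SouthBoundingCoordinate")
    north = read_scalar(meta, "NorthBoundingCoordinate")

print("Bounds (west, south, east, north):", (west, south, east, north))
```

The helper flattens the value before converting it because some products store these fields as one-element arrays rather than true scalars.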

Raster Dimensions

Retrieve the number of rows and columns:
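One way to get the dimensions, sketched below, is to read the shape of a raster layer directly, which does not load the pixel values. The layer path SDS/LST and the metadata fields ImageLines and ImagePixels are assumed names used only for this stand-in file.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file: one 4 x 5 layer plus matching metadata fields (assumed names).
path = os.path.join(tempfile.mkdtemp(), "demo_dims.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("SDS/LST", data=np.zeros((4, 5), dtype="uint16"))
    f.create_dataset("StandardMetadata/ImageLines", data=4)
    f.create_dataset("StandardMetadata/ImagePixels", data=5)

with h5py.File(path, "r") as h5_ds:
    # .shape reads only the dataset header, not the pixel values.
    rows, cols = h5_ds["SDS/LST"].shape

    # Cross-check against the metadata fields, if the product provides them.
    meta = h5_ds["StandardMetadata"]
    meta_rows = int(meta["ImageLines"][()])
    meta_cols = int(meta["ImagePixels"][()])

print(f"rows={rows}, cols={cols} (metadata: {meta_rows} x {meta_cols})")
```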

Pixel Resolution

Calculate the resolution (cell size) in degrees or meters:


Converting HDF5 Data to xarray.DataArray

To work with geospatial raster data, we convert the extracted arrays into xarray.DataArray objects and attach metadata.

Step-by-Step Conversion

  1. Iterate Over Datasets: Loop through all datasets in the SDS group.
  2. Convert to NumPy Array: Extract each dataset as a NumPy array.
  3. Create Coordinates: Define x and y coordinates based on the bounding box and resolution.
  4. Attach Metadata: Add attributes such as _FillValue, scale_factor, and units.
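The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's exact code: the stand-in file, the layer names (LST, Emis1), the attribute values, and the bounding box are all made up, and the coordinates are placed at pixel centers on an assumed regular grid.

```python
import os
import tempfile

import h5py
import numpy as np
import xarray as xr

# Stand-in file with two layers and a few attributes (all values made up).
path = os.path.join(tempfile.mkdtemp(), "demo_sds.h5")
with h5py.File(path, "w") as f:
    lst = f.create_dataset("SDS/LST", data=np.full((4, 5), 15000, dtype="uint16"))
    lst.attrs["_FillValue"] = 0
    lst.attrs["scale_factor"] = 0.02
    lst.attrs["units"] = "K"
    emis = f.create_dataset("SDS/Emis1", data=np.full((4, 5), 240, dtype="uint8"))
    emis.attrs["scale_factor"] = 0.002

# Assumed bounding box; in practice read it from the file's metadata.
west, east, south, north = -120.5, -119.5, 35.0, 36.0

da_arr = {}
with h5py.File(path, "r") as h5_ds:
    for name, h5_data in h5_ds["SDS"].items():        # 1. iterate over datasets
        if not isinstance(h5_data, h5py.Dataset) or h5_data.ndim != 2:
            continue                                   # skip anything that is not a 2-D layer
        values = h5_data[()]                           # 2. read as a NumPy array
        rows, cols = values.shape
        x_res = (east - west) / cols
        y_res = (north - south) / rows
        # 3. pixel-center coordinates; y runs from north to south
        x = west + (np.arange(cols) + 0.5) * x_res
        y = north - (np.arange(rows) + 0.5) * y_res
        # 4. copy selected attributes when the layer has them
        attrs = {
            key: h5_data.attrs[key]
            for key in ("_FillValue", "scale_factor", "units")
            if key in h5_data.attrs
        }
        da_arr[name] = xr.DataArray(
            values, coords={"y": y, "x": x}, dims=("y", "x"), name=name, attrs=attrs
        )

print(da_arr["LST"])
```

Note that the sketch only stores scale_factor and _FillValue as attributes; it does not apply them, so the arrays still hold the raw stored values.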

Creating an xarray.Dataset

Combine all the DataArray objects into a single xarray.Dataset for easier management:

import xarray as xr

# da_arr is a dict mapping each layer name to its xarray.DataArray
ds = xr.Dataset(da_arr)
print(ds)

Saving Data as NetCDF

Save the dataset as a NetCDF file for further analysis or sharing:

ds.to_netcdf('alst_emis.nc')

Visualizing the Data

You can visualize the data with the plot method that xarray objects provide.
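A minimal plotting sketch is shown below. It uses a synthetic array in place of a real layer, and the variable name, colormap, and output file name are arbitrary choices for illustration.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Synthetic 4 x 5 layer standing in for a real one such as land surface temperature.
x = -120.5 + (np.arange(5) + 0.5) * 0.2
y = 36.0 - (np.arange(4) + 0.5) * 0.25
lst = xr.DataArray(
    np.linspace(290.0, 310.0, 20).reshape(4, 5),
    coords={"y": y, "x": x},
    dims=("y", "x"),
    name="LST",
    attrs={"units": "K"},
)

fig, ax = plt.subplots(figsize=(6, 4))
lst.plot(ax=ax, cmap="inferno")  # xarray labels the axes and colorbar from the metadata
ax.set_title("LST (synthetic example)")

out_png = os.path.join(tempfile.mkdtemp(), "lst.png")
fig.savefig(out_png, dpi=100)
plt.close(fig)
print("Saved plot to", out_png)
```

In an interactive session or notebook you can drop the backend line and call plt.show() instead of saving the figure.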


Conclusion

This tutorial demonstrated how to:

  1. Open and explore HDF5 files using h5py.
  2. Extract geospatial metadata and construct affine transformations.
  3. Convert HDF5 datasets into xarray.DataArray objects with georeferencing.
  4. Combine multiple bands into an xarray.Dataset.
  5. Save the dataset as a NetCDF file for further analysis.

This workflow is useful for processing remote sensing data and preparing it for GIS applications.


I hope this tutorial gives you a good foundation. If you would like a tutorial on another GIS topic or have any questions, please send an email to contact@spatial-dev.guru.
