
Introduction
HDF5 (.h5) files are widely used in scientific computing, particularly for storing large-scale datasets from remote sensing missions, climate models, and other geospatial applications. These files are self-describing, meaning they store both the data and metadata (e.g., spatial extent, resolution, units, etc.) in a hierarchical structure. However, working with HDF5 files can be challenging due to their complexity and lack of standardized georeferencing information.
This tutorial provides an in-depth guide on how to read, process, and convert HDF5 files into a usable raster format using Python libraries such as h5py, xarray, rasterio, and rioxarray. By the end of this tutorial, you will be able to extract geospatial information, construct affine transformations, and create georeferenced datasets suitable for GIS analysis and visualization.
Prerequisites
Before starting, ensure you have the following Python libraries installed:
pip install rioxarray xarray rasterio h5py numpy matplotlib
h5 Datasets
You can down h5 datasets from https://search.earthdata.nasa.gov/search?q=ecostress&ac=true
Code Repo
Please find the full code here https://github.com/geospatialearning/hdf5_to_georeferenced_raster.
Key Libraries
h5py: For reading and exploring HDF5 files.xarray: For handling multi-dimensional labeled arrays.rasterio: For geospatial raster operations.rioxarray: An extension ofxarrayfor geospatial data.numpy: For numerical operations.matplotlib: For visualizing data.
Opening HDF5 Files in Python
There are multiple ways to open HDF5 files in Python. Below, we explore two approaches: using rioxarray and h5py.
Using rioxarray
The rioxarray library is designed to handle geospatial raster data. It extends xarray with geospatial capabilities, including CRS (Coordinate Reference System) and affine transformation support.
import rioxarray as riox
ds = riox.open_rasterio("h5/lst_ECOSTRESS_L2_LSTE_28628_015_20230724T203731_0601_02.h5")
However, when opening certain HDF5 files, you may encounter a warning like:
NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
This indicates that the file lacks georeferencing information, which must be manually extracted and applied.
Using h5py
To address the lack of georeferencing, we use h5py to inspect the file’s structure and extract metadata manually.
|
1 2 3 4 5 |
import h5py # Open the HDF5 file file_path = "h5/lst_ECOSTRESS_L2_LSTE_28628_015_20230724T203731_0601_02.h5" h5_ds = h5py.File(file_path, 'r') |
Exploring the HDF5 Structure
HDF5 files are hierarchical, similar to a file system. They contain groups and datasets. To understand the contents of the file, list its keys:
print("Top-level groups:", list(h5_ds.keys()))
Listing Metadata and Data Sets
We often find three main groups in remote sensing HDF5 files:
- Metadata Groups: Contain high-level metadata about the dataset.
- Data Groups: Contain the actual raster data.
- Standard Metadata: Contains standard fields like bounding coordinates, resolution, etc.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# List L2 LSTE Metadata print("\nL2 LSTE Metadata:") for key in h5_ds['L2 LSTE Metadata'].keys(): print(key) # List SDS (Scientific Data Sets) print("\nSDS (Scientific Data Sets):") for key in h5_ds['SDS'].keys(): print(key, ":", h5_ds['SDS'][key]) # List Standard Metadata print("\nStandard Metadata:") for key in h5_ds['StandardMetadata'].keys(): print(key, ":", h5_ds['StandardMetadata'][key]) |
Extracting Geospatial Information
Geospatial information is critical for interpreting the data correctly. We extract the bounding box and dimensions of the raster.
Bounding Coordinates
Retrieve the geographic extent (west, east, south, north):
|
1 2 3 4 5 6 |
east = float(h5_ds['StandardMetadata']['EastBoundingCoordinate'][0]) north = float(h5_ds['StandardMetadata']['NorthBoundingCoordinate'][0]) south = float(h5_ds['StandardMetadata']['SouthBoundingCoordinate'][0]) west = float(h5_ds['StandardMetadata']['WestBoundingCoordinate'][0]) extent = (west, east, south, north) print("Extent:", extent) |
Raster Dimensions
Retrieve the number of rows and columns:
|
1 2 3 |
rows = int(h5_ds['StandardMetadata']['ImageLines'][0]) cols = int(h5_ds['StandardMetadata']['ImagePixels'][0]) print("Rows:", rows, "Columns:", cols) |
Pixel Resolution
Calculate the resolution (cell size) in degrees or meters:
|
1 2 3 4 5 6 |
from rasterio.transform import from_origin x_res = (east - west) / cols # Resolution in x-direction y_res = (north - south) / rows # Resolution in y-direction transform = from_origin(west, north, x_res, y_res) # Affine transform print("Transform:", transform) |
Converting HDF5 Data to xarray.DataArray
To work with geospatial raster data, we convert the extracted arrays into xarray.DataArray objects and attach metadata.
Step-by-Step Conversion
- Iterate Over Datasets: Loop through all datasets in the
SDSgroup. - Convert to NumPy Array: Extract each dataset as a NumPy array.
- Create Coordinates: Define
xandycoordinates based on the bounding box and resolution. - Attach Metadata: Add attributes such as
_FillValue,scale_factor, andunits.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
import xarray as xr import numpy as np da_arr = {} # Dictionary to store DataArrays for dataset_name in h5_ds['SDS']: # Extract the dataset as a NumPy array np_array = np.array(h5_ds['SDS'][dataset_name])[::-1] # Flip vertically if needed # Create the DataArray da = xr.DataArray( np_array, dims=("y", "x"), coords={ "y": np.linspace(north, south, np_array.shape[0]), # Latitude values "x": np.linspace(west, east, np_array.shape[1]), # Longitude values }, name=dataset_name ) # Attach metadata metadata = dict(h5_ds['SDS'][dataset_name].attrs) for attr in ['_FillValue', 'add_offset', 'coordsys', 'format', 'long_name', 'scale_factor', 'units', 'valid_range']: if attr in metadata: value = metadata[attr] da.attrs[attr] = value[0] if isinstance(value, np.ndarray) else value.decode("utf-8") # Write CRS and transform da.rio.write_crs('EPSG:4326', inplace=True) # Assume WGS84 da.rio.write_transform(transform, inplace=True) # Store the DataArray in the dictionary da_arr[dataset_name] = da |
Creating an xarray.Dataset
Combine all the DataArray objects into a single xarray.Dataset for easier management:
ds = xr.Dataset(da_arr)
print(ds)
Saving Data as NetCDF
Save the dataset as a NetCDF file for further analysis or sharing:
ds.to_netcdf('alst_emis.nc')
Visualizing the Data
You can visualize the data using plot function of xarray object.
|
1 |
ds['LST'].plot() |
Conclusion
This tutorial demonstrated how to:
- Open and explore HDF5 files using
h5py. - Extract geospatial metadata and construct affine transformations.
- Convert HDF5 datasets into
xarray.DataArrayobjects with georeferencing. - Combine multiple bands into an
xarray.Dataset. - Save the dataset as a NetCDF file for further analysis.
This workflow is invaluable for processing remote sensing data and preparing it for GIS applications. If you encounter specific challenges or need further clarification, feel free to ask!
This expanded tutorial should provide a comprehensive understanding of working with HDF5 files in Python.
I hope this tutorial will create a good foundation for you. If you want tutorials on another GIS topic or you have any queries, please send an mail at contact@spatial-dev.guru.
