Parallel Processing and Saving Raster Chunks Using Xarray and Dask

In this tutorial, we’ll walk through how to process and save raster chunks in parallel using Xarray and Dask. This technique is particularly useful when working with large raster datasets where chunking and parallel processing can significantly improve efficiency.

Prerequisites

Before we begin, ensure you have the following Python libraries installed:

xarray
dask
rioxarray (for reading and saving rasters)

You can install them using pip:

pip install xarray dask rioxarray

Step 1: Import Libraries

import xarray as xr
import dask
import rioxarray  # registers the .rio accessor used for writing rasters

Xarray is great for working with labeled multi-dimensional arrays, and Dask provides parallel computing capabilities to process large datasets efficiently.

Step 2: Read the Raster Dataset

# Read raster data
ds = xr.open_dataset('cdnh43e_v3r1/study_area.tif')

Here, we load the raster dataset using open_dataset. Xarray reads GeoTIFF files through the rasterio engine that rioxarray provides, and the pixel values are exposed as the band_data variable used later in this tutorial.

Step 3: Define Chunk Sizes

# Add chunk information
ds_chunk = ds.chunk({'x': 100, 'y': 100})

Chunking divides the dataset into smaller, manageable pieces. In this case, we specify a chunk size of 100×100 pixels.
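To make the chunk layout concrete, here is a minimal sketch of the arithmetic behind a 100-pixel chunk size. The raster dimensions (250×320) and the helper name chunk_sizes are assumptions made up for illustration; they do not come from the tutorial's dataset.

```python
def chunk_sizes(length, chunk):
    """Return the sizes of consecutive chunks that cover `length` pixels."""
    full, remainder = divmod(length, chunk)
    sizes = [chunk] * full
    if remainder:
        sizes.append(remainder)  # the trailing edge chunk is smaller
    return sizes


# Hypothetical raster dimensions, chosen only for illustration.
height, width = 250, 320

y_chunks = chunk_sizes(height, 100)
x_chunks = chunk_sizes(width, 100)
n_blocks = len(y_chunks) * len(x_chunks)

print("y chunks:", y_chunks)
print("x chunks:", x_chunks)
print("number of blocks:", n_blocks)
```

When a dimension is not a multiple of the chunk size, the last chunk along that axis is smaller than 100 pixels.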

Step 4: Save Chunks as TIFF Files

Define a function to save each chunk as an individual TIFF file:

# Function to save each chunk as a TIFF file
def save_chunk(chunk, block_id):
    _id = next(block_id)  # Generate a unique block ID
    if _id != 0:
        print(f"ID: {_id}, Size: {chunk.sizes}")
        # Save the chunk as a raster file
        chunk.band_data.rio.to_raster(f"files/{_id}.tiff")
    return chunk

The save_chunk function:

Takes a chunk and a unique block ID generator as inputs.
Skips the block that draws ID 0 and saves every other chunk as a raster file using rio.to_raster.
Prints each saved chunk’s ID and size.

Step 5: Create a Block ID Iterator

# Iterator for creating block IDs
block_id = iter(range(1000000000000))  # Generates unique IDs for chunks

This iterator assigns a unique ID to each chunk. You can adjust the range as needed.
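Because several chunks may ask for an ID at the same time, one possible variation is a counter guarded by a lock. The sketch below is an assumption-laden illustration, not the tutorial's method: the class name BlockIdCounter is hypothetical, and it is exercised here with a plain thread pool rather than with Dask.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class BlockIdCounter:
    """Hand out consecutive integer IDs, one at a time, across threads."""

    def __init__(self, start=0):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one thread at a time may advance the counter.
        with self._lock:
            return next(self._counter)


block_id = BlockIdCounter()

# Simulate many workers requesting IDs concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(lambda _: next(block_id), range(1000)))

print("unique IDs handed out:", len(set(ids)))
```

A counter like this lives in one process only, so it is not shared between workers that run in separate processes.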

Step 6: Process and Save Chunks in Parallel

Use Xarray’s map_blocks, which runs on Dask, to apply the save_chunk function to each chunk:

# Parallel processing with Dask
ds_chunk.map_blocks(save_chunk, args=[block_id], template=ds_chunk).compute()

Here’s how it works:

map_blocks applies the save_chunk function to each chunk.
args=[block_id] passes the ID iterator to save_chunk as its second argument.
template=ds_chunk ensures that the output matches the input dataset structure.
.compute() triggers the Dask computation, processing all chunks in parallel.
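The behaviour described above can be tried without a real GeoTIFF. The following sketch builds a small synthetic dataset (an assumption standing in for study_area.tif), chunks it the same way, and uses a hypothetical record_chunk function that only records each block's size instead of writing a file. It assumes numpy, xarray, and dask are installed.

```python
import numpy as np
import xarray as xr

# Synthetic 250x250 single-band raster standing in for the GeoTIFF (made-up data).
data = np.arange(250 * 250, dtype="float32").reshape(1, 250, 250)
ds = xr.Dataset(
    {"band_data": (("band", "y", "x"), data)},
    coords={"band": [1], "y": np.arange(250), "x": np.arange(250)},
)

# Same chunking as in the tutorial: 100 pixels along x and y.
ds_chunk = ds.chunk({"x": 100, "y": 100})

seen_sizes = []  # one (y, x) entry is appended per processed block


def record_chunk(chunk):
    # Each call receives one block as an in-memory xarray Dataset.
    seen_sizes.append((chunk.sizes["y"], chunk.sizes["x"]))
    return chunk


# Nothing runs until .compute() is called.
result = ds_chunk.map_blocks(record_chunk, template=ds_chunk).compute()

print("blocks processed:", len(seen_sizes))
print("distinct block sizes:", sorted(set(seen_sizes)))
```

Swapping record_chunk for a function that writes each block to disk gives the pattern used in this tutorial.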

Step 7: Run the Script

Run the script, and it will process the raster in chunks, saving each as an individual TIFF file in the files directory. Create that directory before running, since the script does not create it.
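Since the script writes into a files directory, it can help to create that directory up front. Here is a minimal sketch using only the standard library; the helper name ensure_output_dir is hypothetical, and the demonstration runs in a temporary location so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path):
    """Create the output directory if it is missing and return it as a Path."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out


# Demonstrated in a temporary folder; in the tutorial the target would be "files".
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(Path(tmp) / "files")
    created = out_dir.is_dir()
    example_file = out_dir / "1.tiff"  # where the chunk with ID 1 would be written

print("directory created:", created)
print("example file name:", example_file.name)
```

In the tutorial script, calling ensure_output_dir("files") before the map_blocks line would serve the same purpose.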

Full Source Code:

# import libraries
import xarray as xr
import dask
import rioxarray  # registers the .rio accessor used for writing rasters

# Read data
ds = xr.open_dataset('cdnh43e_v3r1/study_area.tif')

# Add chunk info
ds_chunk = ds.chunk({'x': 100, 'y': 100})

# Save chunk as tiff 
def save_chunk(chunk, block_id):
    _id = next(block_id)
    if _id != 0:
        print(f"ID: {_id}, Size: {chunk.sizes}")
        chunk.band_data.rio.to_raster(f"files/{_id}.tiff")
    return chunk

# iterator for creating block id of chunks
block_id = iter(range(1000000000000))

# Parallel processing with Dask
ds_chunk.map_blocks(save_chunk, args=[block_id], template=ds_chunk).compute()

Key Advantages

Efficient Processing: Dask runs the per-chunk work in parallel across CPU cores.
Chunk Management: The chunk size limits how much of the raster each task handles at once.
Scalability: The same code applies to larger rasters; only the number of chunks grows.

Output

After execution, you’ll have TIFF files saved in the files directory, each representing one chunk of the original raster: 100×100 pixels, or smaller along an edge where the raster dimensions are not multiples of 100.

Conclusion

Using Xarray and Dask for parallel processing of raster data is a powerful technique for handling large datasets. By dividing the raster into manageable chunks and processing them in parallel, you can save significant time and computational resources.

Feel free to adapt this workflow to your specific needs, such as modifying chunk sizes or incorporating additional processing steps!

I hope this tutorial gives you a good foundation. If you want tutorials on another GIS topic or have any questions, please send an email to contact@spatial-dev.guru.