In this tutorial, we’ll walk through a geospatial analysis of population shifts using raster data. We’ll be comparing census data from two years, 2006 and 2021, to identify significant changes in population.
Why Rasterize the 1KM Grid Census Data?
When working with geospatial data, especially for analysis over large areas or with many polygons, operations can become computationally expensive and time-consuming. The census data we have is in the form of 1KM grid polygons. While this format is detailed and precise, performing operations like overlays, intersections, and calculations on such polygon data can be slow.
By converting this polygon data into raster format, we can:
- Speed up computations: Raster operations, especially on regular grids, can be much faster than equivalent vector operations.
- Simplify data: Rasters represent data as a matrix of cells, which can be more straightforward to work with for certain analyses.
- Facilitate integration: If we want to combine our census data with other raster datasets (like satellite imagery or elevation data), having everything in raster format can make the process smoother.
However, it’s important to note that raster analysis is not a one-size-fits-all solution. While it offers advantages in certain scenarios, there are cases where vector analysis might be more appropriate, especially when dealing with intricate spatial details or when precise boundaries are essential.
To get full source code with example dataset for this tutorial, click here.
Prerequisites:
- Basic knowledge of Python and geospatial analysis.
- Libraries:
geopandas
,pandas
,geocube
,xarray
, andnumpy
.
Steps:
1. Reading the data
Before diving into the analysis, ensure you’ve have downloaded the data from https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat. For this tutorial, we have already downloaded the data and kept it in github repo.
The data we have is 1KM grid census polygons for 2021 and 2006 of Slovakia region
1 2 3 4 5 |
import geopandas as gpd import pandas as pd proj_crs = 'EPSG:3035' euro_census_within_slovakia_2021 = gpd.read_file('slovakia_census/slovakia_census_2021/slovakia_census_2021.shp') euro_census_within_slovakia_2006 = gpd.read_file('slovakia_census/slovakia_census_2006/slovakia_census_2006.shp') |
2. Rasterizing Vector Data:
We’ll rasterize the vector data using the make_geocube
library. This will convert our vector data into a grid format, making it easier to compare.
1 2 3 |
from geocube.api.core import make_geocube census_raster_2006 = make_geocube( vector_data=euro_census_within_slovakia_2006, measurements=["pops"], resolution=(1000, -1000) ) census_raster_2021 = make_geocube( vector_data=euro_census_within_slovakia_2021, measurements=["pops"], resolution=(1000, -1000) ) |
3. Aligning Rasters:
Since the rasters from 2006 and 2021 might have different dimensions, we need to align them.
1 2 |
import xarray as xr census_raster_2006, census_raster_2021 = xr.align(census_raster_2006, census_raster_2021, join='outer') |
4. Calculating Relative Change:
Subtract the 2006 raster from the 2021 raster to get the relative change in population.
1 2 |
import numpy as np result = np.abs(census_raster_2021 - census_raster_2006) |
5. Exporting data to raster:
The rasters for 2021, 2006, and the relative difference are saved as TIFF files for future use or visualization.
1 2 3 4 |
census_raster_2021.rio.to_raster('census_rasters/census_raster_2021.tiff') census_raster_2006.rio.to_raster('census_rasters/census_raster_2006.tiff') result.rio.to_raster('census_rasters/relative_census_difference.tiff') result = result.fillna(0) |
6. Identifying Top Changes:
Extract the top 5 highest relative population changes. We want to identify the areas with the most significant population changes. This step extracts the top 5 highest relative population changes.
1 2 3 |
top_changes = sorted(result.pops.to_numpy().flatten(), reverse=True)[:5] result = result.where(result.pops >= top_changes[-1]) result = result.fillna(0) |
7. Extracting Centroid Coordinates:
For the top changes, we’ll convert the raster pixels to centroid coordinates. For each of the top changes, we convert the raster pixels to centroid coordinates. This gives us the central point of each grid cell that experienced significant change.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
pops_change = result.pops.to_numpy() x,y = np.meshgrid(result.x, result.y) relative_pops = [] pops_2006 = [] pops_2021 = [] centroids = [] for i in range(pops_change.shape[0]): for j in range(pops_change.shape[1]): if (pops_change[i][j] > 0): relative_pops.append(pops_change[i][j]) centroids.append(Point(x[i][j], y[i][j])) pops_2006.append(census_raster_2006.pops.to_numpy()[i][j]) pops_2021.append(census_raster_2021.pops.to_numpy()[i][j]) |
8. Creating a DataFrame:
Organize the extracted data into a DataFrame.
1 2 3 4 5 6 7 8 9 |
# Create a DataFrame with centroids and pixel values of relative difference top_5_cells = pd.DataFrame({ 'geometry': centroids, 'relative_pops_difference': relative_pops, 'pops_2006': pops_2006, 'pops_2021': pops_2021 }) top_5_cells = top_5_cells.sort_values(by='relative_pops_difference', ascending=False) |
9. Reprojecting Data:
Convert the data to the EPSG:4326 projection. Spatial data can exist in various coordinate reference systems (CRS). To ensure compatibility with other datasets or tools, we convert our data to the commonly used EPSG:4326 projection.
1 2 3 |
top_5_cells = top_5_cells.set_geometry(col='geometry') top_5_cells = top_5_cells.set_crs(proj_crs) top_5_cells = top_5_cells.to_crs('EPSG:4326') |
10. Saving Results:
Save the results to a CSV file.
1 |
top_5_cells.to_csv('top_5_cells.csv', index=False) |
11. Additional Analysis:
Calculate the average and median population per 1 square km grid for 2021.
1 2 |
avg_population = euro_census_within_slovakia_2021['pops'].mean() median_population = euro_census_within_slovakia_2021['pops'].median() |
In summary, this code provides a comprehensive analysis of population changes in Slovakia between 2006 and 2021. By rasterizing the data, aligning the rasters, and calculating relative changes, we can pinpoint the areas with the most significant population shifts. The results are then organized, reprojected, and saved for further study or visualization.