
Introduction
Understanding patterns in accident data is crucial for urban planning, traffic management, and public safety. Clustering is a powerful technique that helps in identifying accident-prone areas by grouping locations with similar characteristics. In this tutorial, we will use Fuzzy C-Means (FCM) clustering to analyze London accident data over 36 months, identifying high-density clusters and potential correlations over time.
Unlike traditional clustering methods such as K-Means, where each point belongs to exactly one cluster, Fuzzy C-Means assigns each point a degree of membership in every cluster, allowing a more flexible representation of accident hotspots. This is particularly useful when accident locations sit between multiple clusters.
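To make the membership idea concrete, here is a minimal pure-NumPy sketch of the FCM membership formula (the centres and points are invented for illustration; the tutorial itself uses `skfuzzy` to fit the centres):

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0):
    """FCM membership of each point to each cluster centre:
    u[k, i] = 1 / sum_j (d(x_k, c_i) / d(x_k, c_j)) ** (2 / (m - 1))
    """
    # Pairwise distances, shape (n_points, n_centers)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, np.finfo(float).eps)  # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Two hypothetical cluster centres and two points
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
points = np.array([[0.5, 0.0],   # equidistant from both centres
                   [0.1, 0.0]])  # close to the first centre
u = fuzzy_memberships(points, centers)
print(u.round(3))  # first row is [0.5, 0.5]; second row is dominated by cluster 0
```

Note how the equidistant point gets a 50/50 membership split rather than being forced into one cluster, which is exactly what K-Means cannot express.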
Step 1: Download UK Accidents Data
We downloaded UK accident data from data.police.uk, converted it into point geometries, and filtered it to retain only accidents within the London administrative boundary. Below is a concise Python script using geopandas that keeps everything in EPSG:4326 and saves the result as a Parquet file.
```python
import pandas as pd
import geopandas as gpd

# Load accident data (assuming a CSV with 'Latitude' and 'Longitude' columns)
df = pd.read_csv("uk_accidents.csv")
df["geometry"] = gpd.points_from_xy(df["Longitude"], df["Latitude"])
gdf_accidents = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")

# Load London boundary (assuming GeoJSON format)
london_boundary = gpd.read_file("london_boundary.geojson").to_crs("EPSG:4326")

# Spatial filter: keep only accidents within London
london_accidents = gdf_accidents[gdf_accidents.intersects(london_boundary.unary_union)]

# Save as Parquet
london_accidents.to_parquet("london_accidents.parquet")
```
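As an aside, the `intersects`/`unary_union` filter can become slow for millions of points; `gpd.sjoin` (geopandas ≥ 0.10, with the `predicate` keyword) achieves the same filtering via a spatial index. A minimal sketch on toy data (the point and polygon coordinates are made up for illustration):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-ins: three "accidents" and a square "boundary"
pts = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0.2, 0.2), Point(0.8, 0.8), Point(2.0, 2.0)],
    crs="EPSG:4326",
)
boundary = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])], crs="EPSG:4326"
)

# Spatial join keeps only points inside the boundary,
# using the spatial index under the hood
inside = gpd.sjoin(pts, boundary, predicate="within")
print(sorted(inside["id"].tolist()))  # the point at (2, 2) is dropped
```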
Step 2: Import Required Libraries
We first need to import the required Python libraries for data handling, geospatial processing, and clustering.
```python
import pandas as pd
import geopandas as gpd
import numpy as np
import skfuzzy as fuzz
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
```
Step 3: Load and Explore the Dataset
We assume the dataset is stored in a Parquet file for efficient handling of large geospatial datasets.
```python
# Load the dataset
gdf = gpd.read_parquet("shp/london_accidents.parquet")

# Display first few records
print(gdf.head())

# Get unique months
unique_months = sorted(gdf['Month'].unique())

# Check for missing values
print(gdf.isnull().sum())
```
If any missing values are found in latitude or longitude columns, we drop those rows to ensure the accuracy of clustering.
```python
gdf = gdf.dropna(subset=['Latitude', 'Longitude'])
```
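The loop below assumes a `Month` column. If your raw police.uk export only has a `Date` column (an assumption; check your file), a monthly key such as "YYYY-MM" can be derived with pandas, e.g.:

```python
import pandas as pd

# Hypothetical raw dates in the dd/mm/yyyy format used by many UK exports
df = pd.DataFrame({"Date": ["15/01/2021", "03/02/2021", "28/02/2021"]})

# Derive the "YYYY-MM" key used for the monthly loop
df["Month"] = pd.to_datetime(df["Date"], format="%d/%m/%Y").dt.strftime("%Y-%m")
print(sorted(df["Month"].unique()))  # → ['2021-01', '2021-02']
```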
Step 4: Define Clustering Parameters
To visualize different clusters effectively, we define a color palette for the clusters.
```python
cluster_colors = [
    "#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#FF00FF",
    "#00FFFF", "#800000", "#008000", "#000080", "#808000"
]
cmap = ListedColormap(cluster_colors)
```
Step 5: Perform Fuzzy C-Means Clustering for Each Month
We iterate through each month, filter the dataset accordingly, and apply Fuzzy C-Means Clustering.
```python
import os

os.makedirs("plots", exist_ok=True)

for i, month in enumerate(unique_months):
    # Filter based on month (copy to avoid SettingWithCopyWarning)
    data = gdf[gdf['Month'] == month].copy()
    print("Number of records:", len(data))

    # Extract latitude and longitude
    coordinates = data[['Latitude', 'Longitude']].values

    # Normalize the coordinates (optional but recommended for clustering)
    scaler = MinMaxScaler()
    normalized_coordinates = scaler.fit_transform(coordinates)

    # Define the number of clusters (adjust based on your analysis)
    num_clusters = 10

    # Perform Fuzzy C-Means clustering (fuzzifier m=2)
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        normalized_coordinates.T, num_clusters, 2,
        error=0.005, maxiter=1000, init=None
    )

    # Assign each point to the cluster with the highest membership value
    cluster_membership = np.argmax(u, axis=0)

    # Add cluster labels to the DataFrame
    data['Cluster'] = cluster_membership

    # Count the number of accidents in each cluster
    cluster_counts = data['Cluster'].value_counts()

    # Calculate accident density per cluster
    cluster_density = cluster_counts / cluster_counts.sum()

    # Identify high-risk clusters (above-average density)
    high_risk_clusters = cluster_density[cluster_density > cluster_density.mean()].index
    print("High-risk clusters:", high_risk_clusters)

    # Plot all clusters
    fig, ax = plt.subplots(figsize=(10, 10))
    data.plot(
        column='Cluster',
        ax=ax,
        cmap=cmap,
        markersize=10,
        legend=False  # Disable default legend
    )

    # Highlight high-density clusters
    for cluster in high_risk_clusters:
        high_density_data = data[data['Cluster'] == cluster]
        high_density_data.plot(
            ax=ax,
            color=cluster_colors[cluster],
            markersize=20,
            label=f"Cluster {cluster}"
        )

    # Add a custom legend for high-density clusters
    handles = [
        plt.Line2D([0], [0], marker='o', color='w',
                   markerfacecolor=cluster_colors[cluster], markersize=10)
        for cluster in high_risk_clusters
    ]
    labels = [f"Cluster {cluster}" for cluster in high_risk_clusters]
    ax.legend(handles, labels, title="High-Density Clusters", loc="upper right")

    # Set plot title and save
    plt.title(f"London Accident Clusters - {month}")
    plt.savefig(f"plots/{i+1}_accident_clusters_{month}.png", dpi=300, bbox_inches='tight')

    # Close the plot to free memory
    plt.close(fig)
```
Step 6: Understanding the Results
- The script processes data for 36 months, generating monthly cluster maps.
- High-risk clusters are those with above-average density of accidents.
- By analyzing these clusters across months, we can identify patterns such as recurring hotspots, seasonal shifts, and newly emerging accident-prone areas.
These insights can inform policymakers and city planners in improving road safety measures.
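One simple way to follow these patterns over time is to record, inside the monthly loop, the share of accidents that fall into high-risk clusters, then inspect the trend. A sketch with made-up monthly counts (the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical per-month results collected from the loop:
# total accidents and the number falling in above-average-density clusters
records = [
    {"Month": "2021-01", "total": 900, "in_high_risk": 540},
    {"Month": "2021-02", "total": 850, "in_high_risk": 527},
    {"Month": "2021-03", "total": 960, "in_high_risk": 624},
]

trend = pd.DataFrame(records).set_index("Month")
trend["high_risk_share"] = trend["in_high_risk"] / trend["total"]
print(trend["high_risk_share"].round(2).to_dict())
```

A rising share would suggest accidents are concentrating into fewer areas, which is useful evidence when prioritising interventions.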
Complete Python Code
```python
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import skfuzzy as fuzz
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Load the dataset
gdf = gpd.read_parquet("shp/london_accidents.parquet")

# Display the first few records
print(gdf.head())

# Get unique months
unique_months = sorted(gdf['Month'].unique())

# Check for missing values
print(gdf.isnull().sum())

# Drop rows with missing latitude/longitude, if any
gdf = gdf.dropna(subset=['Latitude', 'Longitude'])

# Display summary statistics
print(gdf.describe())

# Define 10 distinct colors for clusters
cluster_colors = [
    "#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#FF00FF",
    "#00FFFF", "#800000", "#008000", "#000080", "#808000"
]
cmap = ListedColormap(cluster_colors)

os.makedirs("plots", exist_ok=True)

for i, month in enumerate(unique_months):
    # Filter based on month (copy to avoid SettingWithCopyWarning)
    data = gdf[gdf['Month'] == month].copy()
    print("Number of records:", len(data))

    # Extract latitude and longitude
    coordinates = data[['Latitude', 'Longitude']].values

    # Normalize the coordinates (optional but recommended for clustering)
    scaler = MinMaxScaler()
    normalized_coordinates = scaler.fit_transform(coordinates)

    # Define the number of clusters (adjust based on your analysis)
    num_clusters = 10

    # Perform Fuzzy C-Means clustering (fuzzifier m=2)
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        normalized_coordinates.T, num_clusters, 2,
        error=0.005, maxiter=1000, init=None
    )

    # Assign each point to the cluster with the highest membership value
    cluster_membership = np.argmax(u, axis=0)

    # Add cluster labels to the DataFrame
    data['Cluster'] = cluster_membership

    # Count the number of accidents in each cluster
    cluster_counts = data['Cluster'].value_counts()

    # Calculate accident density per cluster
    cluster_density = cluster_counts / cluster_counts.sum()

    # Identify high-risk clusters (above-average density)
    high_risk_clusters = cluster_density[cluster_density > cluster_density.mean()].index
    print("High-risk clusters:", high_risk_clusters)

    # Plot all clusters
    fig, ax = plt.subplots(figsize=(10, 10))
    data.plot(
        column='Cluster',
        ax=ax,
        cmap=cmap,
        markersize=10,
        legend=False  # Disable default legend
    )

    # Highlight high-density clusters
    for cluster in high_risk_clusters:
        high_density_data = data[data['Cluster'] == cluster]
        high_density_data.plot(
            ax=ax,
            color=cluster_colors[cluster],
            markersize=20,
            label=f"Cluster {cluster}"
        )

    # Add a custom legend for high-density clusters
    handles = [
        plt.Line2D([0], [0], marker='o', color='w',
                   markerfacecolor=cluster_colors[cluster], markersize=10)
        for cluster in high_risk_clusters
    ]
    labels = [f"Cluster {cluster}" for cluster in high_risk_clusters]
    ax.legend(handles, labels, title="High-Density Clusters", loc="upper right")

    # Set plot title and save
    plt.title(f"London Accident Clusters - {month}")
    plt.savefig(f"plots/{i+1}_accident_clusters_{month}.png", dpi=300, bbox_inches='tight')

    # Close the plot to free memory
    plt.close(fig)
```
Conclusion
In this tutorial, we:
- Loaded and preprocessed accident data.
- Used Fuzzy C-Means to cluster accident locations.
- Identified high-risk clusters and visualized them.
- Generated monthly accident cluster maps for 36 months.
- Encouraged further analysis to detect trends and correlations over time.
This approach provides a data-driven way to enhance road safety policies by identifying and addressing accident-prone areas in London.
I hope this tutorial provides a good foundation for you. If you want tutorials on another GIS topic or have any queries, please send an email to contact@spatial-dev.guru.










