
Introduction
Understanding patterns in accident data is crucial for urban planning, traffic management, and public safety. Clustering is a powerful technique that helps in identifying accident-prone areas by grouping locations with similar characteristics. In this tutorial, we will use Fuzzy C-Means (FCM) clustering to analyze London accident data over 36 months, identifying high-density clusters and potential correlations over time.
Unlike traditional clustering methods such as K-Means, where each point belongs to exactly one cluster, Fuzzy C-Means assigns each point a degree of membership in every cluster, allowing a more flexible representation of accident hotspots. This is particularly useful when accident locations sit between multiple clusters.
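To make the membership idea concrete, here is a minimal pure-NumPy sketch of the FCM membership formula (the centres and points are invented for illustration; the tutorial itself uses `skfuzzy` to fit the centres):

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0):
    """FCM membership of each point to each cluster centre:
    u[k, i] = 1 / sum_j (d(x_k, c_i) / d(x_k, c_j)) ** (2 / (m - 1))
    """
    # Pairwise distances, shape (n_points, n_centers)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, np.finfo(float).eps)  # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Two hypothetical cluster centres and two points
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
points = np.array([[0.5, 0.0],   # equidistant from both centres
                   [0.1, 0.0]])  # close to the first centre
u = fuzzy_memberships(points, centers)
print(u.round(3))  # first row is [0.5, 0.5]; second row is dominated by cluster 0
```

Note how the equidistant point gets a 50/50 membership split rather than being forced into one cluster, which is exactly what K-Means cannot express.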
Step 1: Download UK Accidents Data
We downloaded UK accident data from data.police.uk, converted it into point geometries, and filtered it to retain only accidents within the London administrative boundary. Below is a concise Python script using geopandas that keeps everything in EPSG:4326 and saves the result as a Parquet file.
```python
import pandas as pd
import geopandas as gpd

# Load accident data (assuming a CSV with 'Latitude' and 'Longitude' columns)
df = pd.read_csv("uk_accidents.csv")
df["geometry"] = gpd.points_from_xy(df["Longitude"], df["Latitude"])
gdf_accidents = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")

# Load London boundary (assuming GeoJSON format)
london_boundary = gpd.read_file("london_boundary.geojson").to_crs("EPSG:4326")

# Spatial filter: keep only accidents within London
london_accidents = gdf_accidents[gdf_accidents.intersects(london_boundary.unary_union)]

# Save as Parquet
london_accidents.to_parquet("london_accidents.parquet")
```
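As an aside, the `intersects`/`unary_union` filter can become slow for millions of points; `gpd.sjoin` (geopandas ≥ 0.10, with the `predicate` keyword) achieves the same filtering via a spatial index. A minimal sketch on toy data (the point and polygon coordinates are made up for illustration):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-ins: three "accidents" and a square "boundary"
pts = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0.2, 0.2), Point(0.8, 0.8), Point(2.0, 2.0)],
    crs="EPSG:4326",
)
boundary = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])], crs="EPSG:4326"
)

# Spatial join keeps only points inside the boundary,
# using the spatial index under the hood
inside = gpd.sjoin(pts, boundary, predicate="within")
print(sorted(inside["id"].tolist()))  # the point at (2, 2) is dropped
```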
Step 2: Import Required Libraries
We first need to import the required Python libraries for data handling, geospatial processing, and clustering.
```python
import pandas as pd
import geopandas as gpd
import numpy as np
import skfuzzy as fuzz
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
```
Step 3: Load and Explore the Dataset
We assume the dataset is stored in a Parquet file for efficient handling of large geospatial datasets.
```python
# Load the dataset
gdf = gpd.read_parquet("shp/london_accidents.parquet")

# Display first few records
print(gdf.head())

# Get unique months
unique_months = sorted(gdf['Month'].unique())

# Check for missing values
print(gdf.isnull().sum())
```
If any missing values are found in latitude or longitude columns, we drop those rows to ensure the accuracy of clustering.
```python
gdf = gdf.dropna(subset=['Latitude', 'Longitude'])
```
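The loop below assumes a `Month` column. If your raw police.uk export only has a `Date` column (an assumption; check your file), a monthly key such as "YYYY-MM" can be derived with pandas, e.g.:

```python
import pandas as pd

# Hypothetical raw dates in the dd/mm/yyyy format used by many UK exports
df = pd.DataFrame({"Date": ["15/01/2021", "03/02/2021", "28/02/2021"]})

# Derive the "YYYY-MM" key used for the monthly loop
df["Month"] = pd.to_datetime(df["Date"], format="%d/%m/%Y").dt.strftime("%Y-%m")
print(sorted(df["Month"].unique()))  # → ['2021-01', '2021-02']
```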
Step 4: Define Clustering Parameters
To visualize different clusters effectively, we define a color palette for the clusters.
```python
cluster_colors = [
    "#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#FF00FF",
    "#00FFFF", "#800000", "#008000", "#000080", "#808000"
]
cmap = ListedColormap(cluster_colors)
```
Step 5: Perform Fuzzy C-Means Clustering for Each Month
We iterate through each month, filter the dataset accordingly, and apply Fuzzy C-Means Clustering.
```python
import os

os.makedirs("plots", exist_ok=True)

for i, month in enumerate(unique_months):
    # Filter based on month (copy to avoid SettingWithCopyWarning)
    data = gdf[gdf['Month'] == month].copy()
    print("Number of records:", len(data))

    # Extract latitude and longitude
    coordinates = data[['Latitude', 'Longitude']].values

    # Normalize the coordinates (optional but recommended for clustering)
    scaler = MinMaxScaler()
    normalized_coordinates = scaler.fit_transform(coordinates)

    # Define the number of clusters (adjust based on your analysis)
    num_clusters = 10

    # Perform Fuzzy C-Means clustering (fuzzifier m=2)
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        normalized_coordinates.T, num_clusters, 2,
        error=0.005, maxiter=1000, init=None
    )

    # Assign each point to the cluster with the highest membership value
    cluster_membership = np.argmax(u, axis=0)

    # Add cluster labels to the DataFrame
    data['Cluster'] = cluster_membership

    # Count the number of accidents in each cluster
    cluster_counts = data['Cluster'].value_counts()

    # Calculate accident density per cluster
    cluster_density = cluster_counts / cluster_counts.sum()

    # Identify high-risk clusters (above-average density)
    high_risk_clusters = cluster_density[cluster_density > cluster_density.mean()].index
    print("High-risk clusters:", high_risk_clusters)

    # Plot all clusters
    fig, ax = plt.subplots(figsize=(10, 10))
    data.plot(
        column='Cluster',
        ax=ax,
        cmap=cmap,
        markersize=10,
        legend=False  # Disable default legend
    )

    # Highlight high-density clusters
    for cluster in high_risk_clusters:
        high_density_data = data[data['Cluster'] == cluster]
        high_density_data.plot(
            ax=ax,
            color=cluster_colors[cluster],
            markersize=20,
            label=f"Cluster {cluster}"
        )

    # Add a custom legend for high-density clusters
    handles = [
        plt.Line2D([0], [0], marker='o', color='w',
                   markerfacecolor=cluster_colors[cluster], markersize=10)
        for cluster in high_risk_clusters
    ]
    labels = [f"Cluster {cluster}" for cluster in high_risk_clusters]
    ax.legend(handles, labels, title="High-Density Clusters", loc="upper right")

    # Set plot title and save
    plt.title(f"London Accident Clusters - {month}")
    plt.savefig(f"plots/{i+1}_accident_clusters_{month}.png", dpi=300, bbox_inches='tight')

    # Close the plot to free memory
    plt.close(fig)
```
Step 6: Understanding the Results
- The script processes data for 36 months, generating monthly cluster maps.
- High-risk clusters are those with above-average density of accidents.
- By analyzing these clusters across months, we can identify patterns such as recurring hotspots, seasonal shifts, and newly emerging accident-prone areas.
These insights can inform policymakers and city planners in improving road safety measures.
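One simple way to follow these patterns over time is to record, inside the monthly loop, the share of accidents that fall into high-risk clusters, then inspect the trend. A sketch with made-up monthly counts (the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical per-month results collected from the loop:
# total accidents and the number falling in above-average-density clusters
records = [
    {"Month": "2021-01", "total": 900, "in_high_risk": 540},
    {"Month": "2021-02", "total": 850, "in_high_risk": 527},
    {"Month": "2021-03", "total": 960, "in_high_risk": 624},
]

trend = pd.DataFrame(records).set_index("Month")
trend["high_risk_share"] = trend["in_high_risk"] / trend["total"]
print(trend["high_risk_share"].round(2).to_dict())
```

A rising share would suggest accidents are concentrating into fewer areas, which is useful evidence when prioritising interventions.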
Complete Python Code
```python
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import skfuzzy as fuzz
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Load the dataset
gdf = gpd.read_parquet("shp/london_accidents.parquet")

# Display the first few records
print(gdf.head())

# Get unique months
unique_months = sorted(gdf['Month'].unique())

# Check for missing values
print(gdf.isnull().sum())

# Drop rows with missing latitude/longitude, if any
gdf = gdf.dropna(subset=['Latitude', 'Longitude'])

# Display summary statistics
print(gdf.describe())

# Define 10 distinct colors for clusters
cluster_colors = [
    "#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#FF00FF",
    "#00FFFF", "#800000", "#008000", "#000080", "#808000"
]
cmap = ListedColormap(cluster_colors)

os.makedirs("plots", exist_ok=True)

for i, month in enumerate(unique_months):
    # Filter based on month (copy to avoid SettingWithCopyWarning)
    data = gdf[gdf['Month'] == month].copy()
    print("Number of records:", len(data))

    # Extract latitude and longitude
    coordinates = data[['Latitude', 'Longitude']].values

    # Normalize the coordinates (optional but recommended for clustering)
    scaler = MinMaxScaler()
    normalized_coordinates = scaler.fit_transform(coordinates)

    # Define the number of clusters (adjust based on your analysis)
    num_clusters = 10

    # Perform Fuzzy C-Means clustering (fuzzifier m=2)
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        normalized_coordinates.T, num_clusters, 2,
        error=0.005, maxiter=1000, init=None
    )

    # Assign each point to the cluster with the highest membership value
    cluster_membership = np.argmax(u, axis=0)

    # Add cluster labels to the DataFrame
    data['Cluster'] = cluster_membership

    # Count the number of accidents in each cluster
    cluster_counts = data['Cluster'].value_counts()

    # Calculate accident density per cluster
    cluster_density = cluster_counts / cluster_counts.sum()

    # Identify high-risk clusters (above-average density)
    high_risk_clusters = cluster_density[cluster_density > cluster_density.mean()].index
    print("High-risk clusters:", high_risk_clusters)

    # Plot all clusters
    fig, ax = plt.subplots(figsize=(10, 10))
    data.plot(
        column='Cluster',
        ax=ax,
        cmap=cmap,
        markersize=10,
        legend=False  # Disable default legend
    )

    # Highlight high-density clusters
    for cluster in high_risk_clusters:
        high_density_data = data[data['Cluster'] == cluster]
        high_density_data.plot(
            ax=ax,
            color=cluster_colors[cluster],
            markersize=20,
            label=f"Cluster {cluster}"
        )

    # Add a custom legend for high-density clusters
    handles = [
        plt.Line2D([0], [0], marker='o', color='w',
                   markerfacecolor=cluster_colors[cluster], markersize=10)
        for cluster in high_risk_clusters
    ]
    labels = [f"Cluster {cluster}" for cluster in high_risk_clusters]
    ax.legend(handles, labels, title="High-Density Clusters", loc="upper right")

    # Set plot title and save
    plt.title(f"London Accident Clusters - {month}")
    plt.savefig(f"plots/{i+1}_accident_clusters_{month}.png", dpi=300, bbox_inches='tight')

    # Close the plot to free memory
    plt.close(fig)
```
Conclusion
In this tutorial, we:
- Loaded and preprocessed accident data.
- Used Fuzzy C-Means to cluster accident locations.
- Identified high-risk clusters and visualized them.
- Generated monthly accident cluster maps for 36 months.
- Encouraged further analysis to detect trends and correlations over time.
This approach provides a data-driven way to enhance road safety policies by identifying and addressing accident-prone areas in London.
I hope this tutorial provides a good foundation for you. If you want tutorials on another GIS topic or have any queries, please send an email to contact@spatial-dev.guru.










