Step 4: Nearest-Neighbor Spatial Join for Unmatched Parcels
Author
Russell Blessing
Overview
This notebook performs a nearest-neighbor spatial join for parcels that did not receive a string match in Step 3. For each county, the unmatched OneMap parcel geometries are joined to the nearest CoreLogic point. Ties (a parcel equidistant from multiple CL points) are resolved by Levenshtein address similarity. The final output combines the string-matched and distance-matched records into a single parcel–CoreLogic table.
In [1]:
Show / hide code
library(sf)
Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE
Show / hide code
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Show / hide code
library(readr)library(stringr)library(purrr)library(stringdist) # for levenshtein distance calculationlibrary(gt)library(tidyr)
Attaching package: 'tidyr'
The following object is masked from 'package:stringdist':
extract
In [2]:
Show / hide code
out_dir <-"/proj/mhinolab/users/rbless/data/Obstacles_Output"# CoreLogic (all character)cl <-read_csv(file.path(out_dir, "cl.csv"),col_types =cols(.default =col_character()))# Previously string-matched parcel indicesstringmatched <-read_csv(file.path(out_dir, "parcel_cl_stringmatch.csv"),col_types =cols(.default =col_character())) |>mutate(parcel_index =as.character(parcel_index))# Read spatial parcels (keep geometry for the join)nc1_attr <-st_read(file.path(out_dir, "parcels_study_area.gpkg"),query ="SELECT parcel_index, cntyfips FROM parcels_study_area",quiet =TRUE) |>st_drop_geometry() |>mutate(parcel_index =as.character(parcel_index),cntyfips =paste0("37", cntyfips) )unmatched_ids <- nc1_attr |>filter(!parcel_index %in% stringmatched$parcel_index) |>pull(parcel_index)# Read only unmatched rows with geometry using a SQL filterunmatched_nc1 <-st_read(file.path(out_dir, "parcels_study_area.gpkg"),query =paste0("SELECT * FROM parcels_study_area WHERE parcel_index IN (",paste(unmatched_ids, collapse =","),")" ),quiet =TRUE) |>mutate(parcel_index =as.character(parcel_index),cntyfips =paste0("37", cntyfips) )rm(nc1_attr); gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 51288866 2739.2 93051511 4969.5 68666733 3667.2
Vcells 584757134 4461.4 921214206 7028.4 688777479 5255.0
Show / hide code
county_list <-unique(unmatched_nc1$cntyfips)# Build sf object for CL records in counties that have unmatched parcelscl_unmatched <- cl |>filter(`FIPS CODE`%in% county_list) |>rename(fips_code =`FIPS CODE`) |>filter(!is.na(`PARCEL LEVEL LATITUDE`) &!is.na(`PARCEL LEVEL LONGITUDE`)) |>st_as_sf(coords =c("PARCEL LEVEL LONGITUDE", "PARCEL LEVEL LATITUDE"),crs =4326 )
In [3]:
Show / hide code
# Reproject to NC State Plane (meters) for distance calculationsunmatched_nc1 <-st_transform(unmatched_nc1, crs =32119)cl_unmatched <-st_transform(cl_unmatched, crs =st_crs(unmatched_nc1))# Assign geometry IDs (factorize unique geometries)unmatched_nc1 <- unmatched_nc1 |>mutate(geometry_id =as.integer(factor(as.character(st_geometry(unmatched_nc1)))))cl_unmatched <- cl_unmatched |>mutate(geometry_id1 =as.integer(factor(as.character(st_geometry(cl_unmatched)))))
tibble::tibble(Source =c("String-matched (Step 3)", "Distance-matched (no ties)","Distance-matched (ties resolved)", "Total unique parcels in merge"),Records =c(n_string, n_nosort, n_sorted, n_total)) |>gt() |>tab_header(title ="Parcel–CoreLogic Merge Summary") |>fmt_integer(columns = Records) |>cols_label(Source ="", Records ="N")
Parcel–CoreLogic Merge Summary
N
String-matched (Step 3)
3,248,412
Distance-matched (no ties)
1,024,893
Distance-matched (ties resolved)
0
Total unique parcels in merge
4,273,305
1,024,893 parcels entered the spatial join (no string match found in Step 3).
1,024,893 were unambiguously distance-matched.
0 had tied distances and were resolved by Levenshtein address similarity.
The combined file nc1_cl_merge.csv contains 4,273,305 unique parcel records.
One-to-Many CL Match Diagnostic
In a distance-based spatial join, it is expected that a single CoreLogic (CL) point may be the nearest neighbor to multiple OneMap parcels — for example, when a large tract has been subdivided, when many parcels lack site addresses, or when a condominium building appears as one polygon in OneMap but has many individual unit records in CL.
The following tables describe the scope of this issue in the distance-matched output.
In [14]:
Show / hide code
# Count how many OneMap parcels share each CL geometry_id1cl_match_counts <- cl_dist_nosort |>count(geometry_id1, name ="n_parcels")one_to_many_summary <- cl_match_counts |>summarise(`Total CL records matched`=n(),`Matched to exactly 1 parcel`=sum(n_parcels ==1),`Matched to 2–5 parcels`=sum(n_parcels >1& n_parcels <=5),`Matched to 6–20 parcels`=sum(n_parcels >5& n_parcels <=20),`Matched to >20 parcels`=sum(n_parcels >20),`Max parcels sharing one CL record`=max(n_parcels),`Parcels affected (n_parcels > 1)`=sum(n_parcels[n_parcels >1]) ) |>pivot_longer(everything(), names_to ="Metric", values_to ="Value")one_to_many_summary |>gt() |>tab_header(title ="Distance-Matched: One-to-Many CL Record Summary") |>fmt_integer(columns = Value) |>cols_label(Metric ="", Value ="N")
Distance-Matched: One-to-Many CL Record Summary
N
Total CL records matched
787,198
Matched to exactly 1 parcel
723,852
Matched to 2–5 parcels
54,585
Matched to 6–20 parcels
6,627
Matched to >20 parcels
2,134
Max parcels sharing one CL record
968
Parcels affected (n_parcels > 1)
301,041
In [15]:
Show / hide code
# Break down multi-matched parcels by CL property typecl_dist_nosort |>inner_join( cl_match_counts |>filter(n_parcels >1),by ="geometry_id1" ) |>distinct(geometry_id1, n_parcels, `PROPERTY.INDICATOR.CODE`,`LAND.USE.CODE`, `COUNTY.USE.DESCRIPTION`) |>count(`PROPERTY.INDICATOR.CODE`, `LAND.USE.CODE`, `COUNTY.USE.DESCRIPTION`,wt = n_parcels, name ="n_parcels_affected") |>arrange(desc(n_parcels_affected)) |>slice_head(n =20) |>gt() |>tab_header(title ="Top 20 CoreLogic Property Types with Multi-Parcel Matches",subtitle ="Weighted by number of OneMap parcels affected" ) |>fmt_integer(columns = n_parcels_affected) |>cols_label(`PROPERTY.INDICATOR.CODE`="Property Indicator",`LAND.USE.CODE`="Land Use Code",`COUNTY.USE.DESCRIPTION`="County Use Description",n_parcels_affected ="Parcels Affected" )
Top 20 CoreLogic Property Types with Multi-Parcel Matches
Weighted by number of OneMap parcels affected
Property Indicator
Land Use Code
County Use Description
Parcels Affected
80
400
VACANT
30,290
10
163
SINGLE FAMILY RESIDENTIAL
20,933
80
400
VACANT LAND
17,542
10
163
RESIDENTIAL-1 FAMILY
14,832
10
163
RESIDENTIAL SINGLE FAMILY
9,194
10
163
SINGLE FAMILY RES
7,681
NA
NA
NA
7,474
10
160
RURAL HOMESITE
6,765
10
100
RESIDENTIAL
6,541
10
163
SINGLE FAMILY RESIDENTIAL RURA
5,976
10
163
SINGLE FAMILY RES. - RURAL AC
5,367
10
163
RESIDENTIAL
5,355
10
163
SFD DWELLING
4,852
10
163
SINGLE FAMILY
4,545
20
200
COMMERCIAL
4,184
10
163
SIN FAM RES RURAL ACREAGE
4,154
10
163
DWELLING
3,840
10
100
VACANT LAND
3,648
11
112
CONDOMINIUM
3,452
70
500
AGRI I PV
3,002
Key findings:
The majority of CL records (723,852, 92.0%) match exactly one OneMap parcel — these are clean one-to-one matches.
301,041 parcels share a CL record with at least one other parcel.
Multi-match cases fall into three broad categories:
Condominiums (PROPERTY.INDICATOR.CODE = 11, LAND.USE.CODE = 112): One building polygon in OneMap is the nearest neighbor for many individual CL unit records. This is expected behavior.
Single-family residential (PROPERTY.INDICATOR.CODE = 10, LAND.USE.CODE = 163): A CL residential point is the nearest neighbor to multiple surrounding OneMap parcels (e.g., adjacent vacant lots, subdivided tracts with no site addresses). This is the primary risk for downstream application duplication.
Vacant land (PROPERTY.INDICATOR.CODE = 80 or 10, LAND.USE.CODE = 400 or 100): Similar to the residential case — many parcels cluster around a single CL vacant land point.
A n_cl_matches flag column should be added in downstream joins to allow filtering to clean one-to-one matches when needed.