Back to Article
Step 4: Nearest-Neighbor Spatial Join for Unmatched Parcels
Download Source

Step 4: Nearest-Neighbor Spatial Join for Unmatched Parcels

Author

Russell Blessing

Overview

This notebook performs a nearest-neighbor spatial join for parcels that did not receive a string match in Step 3. For each county, the unmatched OneMap parcel geometries are joined to the nearest CoreLogic point. Ties (a parcel equidistant from multiple CL points) are resolved by Levenshtein address similarity. The final output combines the string-matched and distance-matched records into a single parcel–CoreLogic table.

In [1]:
Show / hide code
library(sf)
Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE
Show / hide code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Show / hide code
library(readr)
library(stringr)
library(purrr)
library(stringdist)  # for levenshtein distance calculation
library(gt)
library(tidyr)

Attaching package: 'tidyr'
The following object is masked from 'package:stringdist':

    extract
In [2]:
Show / hide code
out_dir <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

# CoreLogic (all character)
cl <- read_csv(
  file.path(out_dir, "cl.csv"),
  col_types = cols(.default = col_character())
)

# Previously string-matched parcel indices
stringmatched <- read_csv(
  file.path(out_dir, "parcel_cl_stringmatch.csv"),
  col_types = cols(.default = col_character())
) |>
  mutate(parcel_index = as.character(parcel_index))

# Read spatial parcels (keep geometry for the join)
nc1_attr <- st_read(
  file.path(out_dir, "parcels_study_area.gpkg"),
  query = "SELECT parcel_index, cntyfips FROM parcels_study_area",
  quiet = TRUE
) |>
  st_drop_geometry() |>
  mutate(
    parcel_index = as.character(parcel_index),
    cntyfips     = paste0("37", cntyfips)
  )

unmatched_ids <- nc1_attr |>
  filter(!parcel_index %in% stringmatched$parcel_index) |>
  pull(parcel_index)

# Read only unmatched rows with geometry using a SQL filter
unmatched_nc1 <- st_read(
  file.path(out_dir, "parcels_study_area.gpkg"),
  query = paste0(
    "SELECT * FROM parcels_study_area WHERE parcel_index IN (",
    paste(unmatched_ids, collapse = ","),
    ")"
  ),
  quiet = TRUE
) |>
  mutate(
    parcel_index = as.character(parcel_index),
    cntyfips     = paste0("37", cntyfips)
  )

rm(nc1_attr); gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  51288866 2739.2   93051511 4969.5  68666733 3667.2
Vcells 584757134 4461.4  921214206 7028.4 688777479 5255.0
Show / hide code
county_list <- unique(unmatched_nc1$cntyfips)

# Build sf object for CL records in counties that have unmatched parcels
cl_unmatched <- cl |>
  filter(`FIPS CODE` %in% county_list) |>
  rename(fips_code = `FIPS CODE`) |> 
  filter(!is.na(`PARCEL LEVEL LATITUDE`) & !is.na(`PARCEL LEVEL LONGITUDE`)) |>
  st_as_sf(
    coords = c("PARCEL LEVEL LONGITUDE", "PARCEL LEVEL LATITUDE"),
    crs = 4326
  )
In [3]:
Show / hide code
# Reproject to NC State Plane (meters) for distance calculations
unmatched_nc1  <- st_transform(unmatched_nc1,  crs = 32119)
cl_unmatched   <- st_transform(cl_unmatched,   crs = st_crs(unmatched_nc1))

# Assign geometry IDs (factorize unique geometries)
unmatched_nc1 <- unmatched_nc1 |>
  mutate(geometry_id = as.integer(factor(as.character(st_geometry(unmatched_nc1)))))

cl_unmatched <- cl_unmatched |>
  mutate(geometry_id1 = as.integer(factor(as.character(st_geometry(cl_unmatched)))))
In [4]:
Show / hide code
# Save intermediate geopackages
st_write(unmatched_nc1, file.path(out_dir, "nc1geoms_distmatch.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

st_write(cl_unmatched, file.path(out_dir, "cl_wID.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
In [5]:
Show / hide code
out_dir <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

unmatched_nc1 <- st_read(file.path(out_dir, "nc1geoms_distmatch.gpkg"), quiet = TRUE) |>
  mutate(parcel_index = as.character(parcel_index))

cl_unmatched <- st_read(file.path(out_dir, "cl_wID.gpkg"), quiet = TRUE)

county_list <- unique(unmatched_nc1$cntyfips)
In [6]:
Show / hide code
# Drop to unique CL geometries before joining (avoids redundant distance calcs)
cl_unique <- cl_unmatched |>
  distinct(geometry_id1, .keep_all = TRUE)

# County-by-county nearest-neighbor join
nn_results <- map(county_list, function(fips) {
  cl_fips      <- cl_unique      |> filter(`fips_code` == fips)
  parcels_fips <- unmatched_nc1  |> filter(cntyfips    == fips)

  if (nrow(cl_fips) == 0 || nrow(parcels_fips) == 0) return(NULL)

  # Nearest-neighbour join (assigns each parcel its nearest CL row)
  nn <- st_join(parcels_fips, cl_fips, join = st_nearest_feature) |>
    filter(!is.na(geometry_id1))

  if (nrow(nn) == 0) return(NULL)

  # Get CL point geometries for matched rows
  idx           <- match(nn$geometry_id1, cl_fips$geometry_id1)
  matched_geoms <- st_geometry(cl_fips)[idx]

  nn <- nn |>
    mutate(distance = as.numeric(
      st_distance(st_geometry(nn), matched_geoms, by_element = TRUE)
    ))

  # For each parcel, keep only the minimum-distance match(es)
  nn |>
    st_drop_geometry() |>
    group_by(parcel_index) |>
    filter(distance == min(distance)) |>
    ungroup()
}, .progress = TRUE) |>
  list_rbind()
 ■                                  1% |  ETA:  5m
 ■■                                 3% |  ETA:  7m
 ■■                                 4% |  ETA:  5m
 ■■■                                8% |  ETA:  3m
 ■■■■                              12% |  ETA:  4m
 ■■■■■                             13% |  ETA:  5m
 ■■■■■■                            17% |  ETA:  6m
 ■■■■■■■                           19% |  ETA:  5m
 ■■■■■■■■                          22% |  ETA:  5m
 ■■■■■■■■                          23% |  ETA:  5m
 ■■■■■■■■                          24% |  ETA: 10m
 ■■■■■■■■■                         27% |  ETA: 10m
 ■■■■■■■■■                         28% |  ETA: 13m
 ■■■■■■■■■■                        29% |  ETA: 12m
 ■■■■■■■■■■                        31% |  ETA: 13m
 ■■■■■■■■■■■                       32% |  ETA: 13m
 ■■■■■■■■■■■                       35% |  ETA: 12m
 ■■■■■■■■■■■■                      36% |  ETA: 12m
 ■■■■■■■■■■■■■                     38% |  ETA: 10m
 ■■■■■■■■■■■■■                     40% |  ETA: 10m
 ■■■■■■■■■■■■■                     41% |  ETA: 10m
 ■■■■■■■■■■■■■■                    42% |  ETA:  9m
 ■■■■■■■■■■■■■■                    44% |  ETA: 11m
 ■■■■■■■■■■■■■■■                   46% |  ETA: 10m
 ■■■■■■■■■■■■■■■                   47% |  ETA: 10m
 ■■■■■■■■■■■■■■■■                  49% |  ETA:  9m
 ■■■■■■■■■■■■■■■■                  50% |  ETA:  9m
 ■■■■■■■■■■■■■■■■■                 53% |  ETA:  8m
 ■■■■■■■■■■■■■■■■■                 54% |  ETA:  8m
 ■■■■■■■■■■■■■■■■■■                56% |  ETA:  8m
 ■■■■■■■■■■■■■■■■■■■               59% |  ETA:  8m
 ■■■■■■■■■■■■■■■■■■■               60% |  ETA:  7m
 ■■■■■■■■■■■■■■■■■■■               62% |  ETA:  7m
 ■■■■■■■■■■■■■■■■■■■■              64% |  ETA:  6m
 ■■■■■■■■■■■■■■■■■■■■■             65% |  ETA:  6m
 ■■■■■■■■■■■■■■■■■■■■■             67% |  ETA:  6m
 ■■■■■■■■■■■■■■■■■■■■■■            69% |  ETA:  5m
 ■■■■■■■■■■■■■■■■■■■■■■            71% |  ETA:  5m
 ■■■■■■■■■■■■■■■■■■■■■■■           72% |  ETA:  5m
 ■■■■■■■■■■■■■■■■■■■■■■■           73% |  ETA:  5m
 ■■■■■■■■■■■■■■■■■■■■■■■           74% |  ETA:  4m
 ■■■■■■■■■■■■■■■■■■■■■■■■          77% |  ETA:  4m
 ■■■■■■■■■■■■■■■■■■■■■■■■          78% |  ETA:  4m
 ■■■■■■■■■■■■■■■■■■■■■■■■■         79% |  ETA:  3m
 ■■■■■■■■■■■■■■■■■■■■■■■■■         81% |  ETA:  3m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■        83% |  ETA:  3m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■        85% |  ETA:  3m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■       87% |  ETA:  2m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      88% |  ETA:  2m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      90% |  ETA:  2m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      91% |  ETA:  1m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■     92% |  ETA:  1m
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■     95% |  ETA: 50s
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■    96% |  ETA: 37s
 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■    97% |  ETA: 25s
In [7]:
Show / hide code
# Parcels that matched multiple CL geometries at equal minimum distance
tie_counts      <- count(nn_results, parcel_index)
tie_parcelids   <- tie_counts$parcel_index[tie_counts$n > 1]

duplicated_rows <- nn_results |> filter(parcel_index %in%  tie_parcelids)
cl_dist_nosort  <- nn_results |> filter(!parcel_index %in% tie_parcelids)

write_csv(duplicated_rows, file.path(out_dir, "parcel_joins_to_sort.csv"))
write_csv(cl_dist_nosort,  file.path(out_dir, "parcel_cl_dist_joins_nosort.csv"))
In [8]:
Show / hide code

# CL attribute table (no geometry) for address lookup
cl_attr <- cl_unmatched |> st_drop_geometry()

# One row per tied parcel (the nc1 side)
nc1_pcls <- duplicated_rows |> distinct(parcel_index, .keep_all = TRUE)

set.seed(987)

find_best_cl_match <- function(pc_row) {
  pc       <- nc1_pcls[pc_row, ]
  cl_geoms <- duplicated_rows |>
    filter(parcel_index == pc$parcel_index) |>
    pull(geometry_id1)

  cl_pcls <- cl_attr |> filter(geometry_id1 %in% cl_geoms)

  # Choose address field: situs preferred, mailing as fallback
  if (!is.na(pc$siteadd) && pc$siteadd != "") {
    candidate_addresses <- paste(cl_pcls$`SITUS.HOUSE.NUMBER`,
                                 cl_pcls$`SITUS.STREET.NAME`)
    query_address       <- pc$siteadd
  } else {
    candidate_addresses <- paste(cl_pcls$`MAILING.HOUSE.NUMBER`,
                                 cl_pcls$`MAILING.STREET.NAME`)
    query_address       <- pc$mailadd
  }

  scores    <- stringdist(as.character(query_address), candidate_addresses,
                          method = "lv")
  best_idx  <- which(scores == min(scores))

  # Break remaining ties randomly
  if (length(best_idx) > 1) best_idx <- sample(best_idx, 1)

  cl_pcls[best_idx, ]
}

matched_cls  <- map(seq_len(nrow(nc1_pcls)), find_best_cl_match)
cl_dist_sorted <- bind_cols(
  nc1_pcls |> select(-any_of(names(cl_attr))),   # nc1 columns only
  list_rbind(matched_cls)                          # winning CL row
)

write_csv(cl_dist_sorted, file.path(out_dir, "cl_dist_sorted.csv"))
In [9]:
Show / hide code
cl_string <- read_csv(
  file.path(out_dir, "parcel_cl_stringmatch.csv"),
  col_types = cols(.default = col_character())
)

cl_dist_nosort <- read_csv(
  file.path(out_dir, "parcel_cl_dist_joins_nosort.csv"),
  col_types = cols(.default = col_character())
)

cl_dist_sorted <- read_csv(
  file.path(out_dir, "cl_dist_sorted.csv"),
  col_types = cols(.default = col_character())
)
In [10]:
Show / hide code

# Step 3 writes CL attributes with spaced names (`PROPERTY INDICATOR CODE`),
# while Step 4 writes them with dotted names because the CL attributes
# round-tripped through GeoPackage (GDAL replaces spaces with dots). Without
# normalizing, bind_rows() would create paired columns ("X Y" and "X.Y"), each
# half-populated. Force everything to the dotted form before stacking.
to_dotted <- function(df) {
  rename_with(df, ~ gsub(" ", ".", .x, fixed = TRUE))
}

# Stack string-matched, distance-matched (no sort needed), distance-matched (sorted)
pcl_comb <- bind_rows(
    cl_string      |> to_dotted() |> mutate(across(everything(), as.character)),
    cl_dist_nosort |> to_dotted() |> mutate(across(everything(), as.character)),
    cl_dist_sorted |> to_dotted() |> mutate(across(everything(), as.character))
  ) |>
  mutate(geometry_id = ifelse(is.na(geometry_id), "", as.character(geometry_id))) |>
  arrange(parcel_index, geometry_id) |>                # blank geometry_id sorts first
  distinct(parcel_index, .keep_all = TRUE)             # prefer string match

# Sanity check: no spaced/dotted column duplication slipped through
stopifnot(
  length(intersect(
    gsub(" ", ".", names(pcl_comb)),
    names(pcl_comb)
  )) == ncol(pcl_comb)
)

write_csv(pcl_comb, file.path(out_dir, "nc1_cl_merge.csv"))
In [11]:
Show / hide code
pcl_comb <- read_csv(
  file.path(out_dir, "nc1_cl_merge.csv"),
  col_types = cols(.default = col_character())
)

Summary

In [12]:
Show / hide code
n_string   <- nrow(cl_string)
n_nosort   <- nrow(cl_dist_nosort)
n_sorted   <- nrow(cl_dist_sorted)
n_total    <- nrow(pcl_comb)
n_unmatched_input <- nrow(unmatched_nc1)
In [13]:
Show / hide code
tibble::tibble(
  Source  = c("String-matched (Step 3)", "Distance-matched (no ties)",
              "Distance-matched (ties resolved)", "Total unique parcels in merge"),
  Records = c(n_string, n_nosort, n_sorted, n_total)
) |>
  gt() |>
  tab_header(title = "Parcel–CoreLogic Merge Summary") |>
  fmt_integer(columns = Records) |>
  cols_label(Source = "", Records = "N")
Parcel–CoreLogic Merge Summary
N
String-matched (Step 3) 3,248,412
Distance-matched (no ties) 1,024,893
Distance-matched (ties resolved) 0
Total unique parcels in merge 4,273,305
  • 1,024,893 parcels entered the spatial join (no string match found in Step 3).
  • 1,024,893 were unambiguously distance-matched.
  • 0 had tied distances and were resolved by Levenshtein address similarity.
  • The combined file nc1_cl_merge.csv contains 4,273,305 unique parcel records.

One-to-Many CL Match Diagnostic

In a distance-based spatial join, it is expected that a single CoreLogic (CL) point may be the nearest neighbor to multiple OneMap parcels — for example, when a large tract has been subdivided, when many parcels lack site addresses, or when a condominium building appears as one polygon in OneMap but has many individual unit records in CL.

The following tables describe the scope of this issue in the distance-matched output.

In [14]:
Show / hide code
# Count how many OneMap parcels share each CL geometry_id1
cl_match_counts <- cl_dist_nosort |>
  count(geometry_id1, name = "n_parcels")

one_to_many_summary <- cl_match_counts |>
  summarise(
    `Total CL records matched`          = n(),
    `Matched to exactly 1 parcel`       = sum(n_parcels == 1),
    `Matched to 2–5 parcels`            = sum(n_parcels > 1 & n_parcels <= 5),
    `Matched to 6–20 parcels`           = sum(n_parcels > 5 & n_parcels <= 20),
    `Matched to >20 parcels`            = sum(n_parcels > 20),
    `Max parcels sharing one CL record` = max(n_parcels),
    `Parcels affected (n_parcels > 1)`  = sum(n_parcels[n_parcels > 1])
  ) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

one_to_many_summary |>
  gt() |>
  tab_header(title = "Distance-Matched: One-to-Many CL Record Summary") |>
  fmt_integer(columns = Value) |>
  cols_label(Metric = "", Value = "N")
Distance-Matched: One-to-Many CL Record Summary
N
Total CL records matched 787,198
Matched to exactly 1 parcel 723,852
Matched to 2–5 parcels 54,585
Matched to 6–20 parcels 6,627
Matched to >20 parcels 2,134
Max parcels sharing one CL record 968
Parcels affected (n_parcels > 1) 301,041
In [15]:
Show / hide code
# Break down multi-matched parcels by CL property type
cl_dist_nosort |>
  inner_join(
    cl_match_counts |> filter(n_parcels > 1),
    by = "geometry_id1"
  ) |>
  distinct(geometry_id1, n_parcels, `PROPERTY.INDICATOR.CODE`,
           `LAND.USE.CODE`, `COUNTY.USE.DESCRIPTION`) |>
  count(`PROPERTY.INDICATOR.CODE`, `LAND.USE.CODE`, `COUNTY.USE.DESCRIPTION`,
        wt = n_parcels, name = "n_parcels_affected") |>
  arrange(desc(n_parcels_affected)) |>
  slice_head(n = 20) |>
  gt() |>
  tab_header(
    title    = "Top 20 CoreLogic Property Types with Multi-Parcel Matches",
    subtitle = "Weighted by number of OneMap parcels affected"
  ) |>
  fmt_integer(columns = n_parcels_affected) |>
  cols_label(
    `PROPERTY.INDICATOR.CODE` = "Property Indicator",
    `LAND.USE.CODE`            = "Land Use Code",
    `COUNTY.USE.DESCRIPTION`   = "County Use Description",
    n_parcels_affected         = "Parcels Affected"
  )
Top 20 CoreLogic Property Types with Multi-Parcel Matches
Weighted by number of OneMap parcels affected
Property Indicator Land Use Code County Use Description Parcels Affected
80 400 VACANT 30,290
10 163 SINGLE FAMILY RESIDENTIAL 20,933
80 400 VACANT LAND 17,542
10 163 RESIDENTIAL-1 FAMILY 14,832
10 163 RESIDENTIAL SINGLE FAMILY 9,194
10 163 SINGLE FAMILY RES 7,681
NA NA NA 7,474
10 160 RURAL HOMESITE 6,765
10 100 RESIDENTIAL 6,541
10 163 SINGLE FAMILY RESIDENTIAL RURA 5,976
10 163 SINGLE FAMILY RES. - RURAL AC 5,367
10 163 RESIDENTIAL 5,355
10 163 SFD DWELLING 4,852
10 163 SINGLE FAMILY 4,545
20 200 COMMERCIAL 4,184
10 163 SIN FAM RES RURAL ACREAGE 4,154
10 163 DWELLING 3,840
10 100 VACANT LAND 3,648
11 112 CONDOMINIUM 3,452
70 500 AGRI I PV 3,002

Key findings:

  • The majority of CL records (723,852, 92.0%) match exactly one OneMap parcel — these are clean one-to-one matches.
  • 301,041 parcels share a CL record with at least one other parcel.
  • Multi-match cases fall into three broad categories:
    • Condominiums (PROPERTY.INDICATOR.CODE = 11, LAND.USE.CODE = 112): One building polygon in OneMap is the nearest neighbor for many individual CL unit records. This is expected behavior.
    • Single-family residential (PROPERTY.INDICATOR.CODE = 10, LAND.USE.CODE = 163): A CL residential point is the nearest neighbor to multiple surrounding OneMap parcels (e.g., adjacent vacant lots, subdivided tracts with no site addresses). This is the primary risk for downstream application duplication.
    • Vacant land (PROPERTY.INDICATOR.CODE = 80 or 10, LAND.USE.CODE = 400 or 100): Similar to the residential case — many parcels cluster around a single CL vacant land point.
  • A n_cl_matches flag column should be added in downstream joins to allow filtering to clean one-to-one matches when needed.