Back to Article
Step 5.1 (Alt): Direct Intersection — Mits & Apps to Parcels
Download Source

Step 5.1 (Alt): Direct Intersection — Mits & Apps to Parcels

Author

Russell Blessing

Overview

A transparent, no-heuristic spatial join of mitigation and application records to parcels. Every input record is kept (no deduplication). A single point representing, say, six mitigations on a condo building stays as six rows in the output. Shared-footprint parcels fan out naturally: one input record that intersects N parcel polygons produces N output rows.

This file is complementary to 05_spatial_match_app_mit.qmd, which runs the tiered priority cascade and produces the canonical matched output. Nothing here changes that output.

What this file does NOT do:

  • No tier cascade (high → med → low → fallback NN).
  • No distance-based matching or buffering in the join itself.
  • No dedup of mit/app records.
  • No winner selection when a point hits multiple shared-footprint parcels.

Inputs:

  • parcels_pri.gpkg — parcels with luc_tier column (from out_dir)
  • mit_newgeo.gpkg — raw mitigations (CRC_May_2025/Raw Data Inputs/raw_app_ncb/)
  • applications_final.gpkg — raw applications (same directory)

Outputs (written to out_dir):

  • alt_mits_direct.gpkg, alt_apps_direct.gpkg — full fan-out spatial join
  • alt_mits_direct_review.csv, alt_apps_direct_review.csv — same, no geometry

In [1]:
Show / hide code
library(sf)
Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE
Show / hide code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Show / hide code
library(readr)
library(tidyr)
library(purrr)
library(knitr)
library(DT)
library(leaflet)
library(glue)
library(stringdist)

Attaching package: 'stringdist'
The following object is masked from 'package:tidyr':

    extract
Show / hide code
library(stringr)
library(nngeo)
library(RANN)
In [2]:
Show / hide code
out_dir <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

parcels_path <- file.path(out_dir, "parcels_pri.gpkg")
mits_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"
apps_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"
flood_boundary_path <- "/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"

HIGH_TIER_MIN <- 3L

mit_id_col <- "unique_id"
app_id_col <- "unique_id"

1 Load inputs

In [3]:
Show / hide code
parcels <- st_read(parcels_path, quiet = TRUE) |>
  select(
    parcel_index, geometry_id, cntyfips, parno, altparno,
    siteadd, scity, szip, mailadd, ownname, parusedesc,
    `LAND.USE.CODE`, luc_tier,
    `APN.SEQUENCE.NUMBER`,
    `SITUS.HOUSE.NUMBER`, `SITUS.STREET.NAME`,
    `OWNER.1.FULL.NAME`, `YEAR.BUILT`,
    `ASSESSED.TOTAL.VALUE`,
    `IMPROVEMENT.VALUE.CALCULATED`   
  ) |>
  st_transform(32119)

# Flood study area boundary
flood_boundary <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(st_crs(parcels))

# Read raw mits/apps, clip to flood study area via st_within, then assign
# unique_id within the clipped set. Matches the canonical Step 5 SA filter.
# Do NOT dedup — every HMA / IHP record is preserved, coordinate duplicates included.
mits <- st_read(mits_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())

apps <- st_read(apps_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())

2 Direct intersection join

One row per (input record × intersecting parcel). Records that fall outside every parcel produce one row with NA parcel columns (left = TRUE).

In [4]:
Show / hide code
mits_direct <- st_join(mits, parcels, join = st_intersects, left = TRUE)
apps_direct <- st_join(apps, parcels, join = st_intersects, left = TRUE)

3 Tag match status

In [5]:
Show / hide code
tag_status <- function(df, high_tier_min = HIGH_TIER_MIN) {
  df |>
    mutate(
      luc_tier_num = suppressWarnings(as.integer(luc_tier)),
      match_status = case_when(
        is.na(parcel_index)           ~ "no_parcel_hit",
        luc_tier_num >= high_tier_min ~ "matched_high",
        TRUE                          ~ "matched_low"
      )
    )
}

mits_direct <- tag_status(mits_direct)
apps_direct <- tag_status(apps_direct)

4 Proximity to nearest high-priority parcel

For every input point, compute distance to the nearest high-priority parcel. This is for review only — it does not change any match result.

In [6]:
Show / hide code
high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

nearest_high <- function(pts, id_col, high) {
  if (nrow(pts) == 0 || nrow(high) == 0) {
    return(tibble(
      !!id_col := character(),
      nearest_high_parcel_index = integer(),
      dist_to_nearest_high_m    = numeric()
    ))
  }
  nn_idx <- st_nearest_feature(pts, high)
  tibble(
    !!id_col := pts[[id_col]],
    nearest_high_parcel_index = high$parcel_index[nn_idx],
    dist_to_nearest_high_m    = as.numeric(
      st_distance(pts, high[nn_idx, ], by_element = TRUE)
    )
  )
}

mits_nearhigh <- nearest_high(mits, mit_id_col, high_parcels)
apps_nearhigh <- nearest_high(apps, app_id_col, high_parcels)

mits_direct <- mits_direct |> left_join(mits_nearhigh, by = mit_id_col)
apps_direct <- apps_direct |> left_join(apps_nearhigh, by = app_id_col)

5 Write outputs

In [7]:
Show / hide code
st_write(mits_direct, file.path(out_dir, "alt_mits_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_direct, file.path(out_dir, "alt_apps_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_direct_review.csv"))
apps_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_direct_review.csv"))

6 Resolve: one-row-per-record output

The fan-out outputs above preserve every (input record × parcel) combination for inspection. This section produces a deduplicated, one-row-per-input-record version suitable for downstream analysis and supervisor-facing counts.

Assignment method buckets:

assignment_method Meaning
direct_single_parcel Input record intersected exactly one parcel — no judgment needed
direct_multi_resolved Multi-parcel hit (condo, recorder dup) — winner picked by four-level tiebreaker
nearest_high_addr No parcel hit — rescued by address-similarity match within distance cap
nearest_high_dist No parcel hit, no good address match — assigned to nearest high-priority parcel within cap
unassigned No parcel hit, no candidate within cap (out-of-scope records, e.g. western mountain counties)

Methods note — multi-unit buildings: For apartment complexes and condominium buildings where all candidate parcels share a complex-level site address (e.g. 95 units sharing one street address but distinct APNs), address-similarity is non-discriminating across candidates and the winner is determined by the APN sequence / improvement value tiebreakers. The selected parcel is reproducible across runs but should be interpreted as “a representative unit within the matched building” rather than “the specific unit the record corresponds to.” Unit-level disambiguation would require parsing unit numbers from the input record’s address and matching against per-unit parcel records, which CoreLogic’s siteadd field does not consistently provide.

In [8]:
Show / hide code
library(stringdist)

norm_addr <- function(x) {
  coalesce(x, "") |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9 ]", " ") |>
    str_squish()
}

addr_sim <- function(a, b) {
  na <- norm_addr(a); nb <- norm_addr(b)
  ifelse(na == "" | nb == "", 0, stringsim(na, nb, method = "jw"))
}

# --- Tunables ---------------------------------------------------------------
SIM_THRESHOLD <- 0.70   # min Jaro-Winkler to call an address match a "real" hit
DIST_CAP_M    <- 200    # max metres allowed for a distance-only fallback
K_CANDIDATES  <- 10     # nearest high-priority parcels to evaluate per no-hit

# Columns to preserve from input side even if they share a name with a parcel
# column. Populate after running check-column-overlaps if genuine input-side
# fields are being silently dropped by finalize().
MIT_KEEP_FROM_INPUT <- character()
APP_KEEP_FROM_INPUT <- character()

high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

# Centroid version for RANN KNN. st_coordinates() on polygon geometry returns
# edge vertices, producing misaligned coordinate arrays; centroids give one
# point per row, matching the row-index contract for nn$nn.idx.
# high_parcels and high_centroids share the same row order.
high_centroids <- high_parcels |> st_centroid()
Warning: st_centroid assumes attributes are constant over geometries
In [9]:
Show / hide code
# Sanity check: any column names shared between input-side data (mits/apps)
# and parcels will be silently dropped by finalize() unless added to
# MIT_KEEP_FROM_INPUT / APP_KEEP_FROM_INPUT in resolve-setup.
parcel_attr_cols <- names(st_drop_geometry(parcels))

cat("Parcel columns also present in mits_direct (will be dropped from base):\n")
Parcel columns also present in mits_direct (will be dropped from base):
Show / hide code
print(intersect(names(st_drop_geometry(mits_direct)), parcel_attr_cols))
 [1] "parcel_index"                 "geometry_id"                 
 [3] "cntyfips"                     "parno"                       
 [5] "altparno"                     "siteadd"                     
 [7] "scity"                        "szip"                        
 [9] "mailadd"                      "ownname"                     
[11] "parusedesc"                   "LAND.USE.CODE"               
[13] "luc_tier"                     "APN.SEQUENCE.NUMBER"         
[15] "SITUS.HOUSE.NUMBER"           "SITUS.STREET.NAME"           
[17] "OWNER.1.FULL.NAME"            "YEAR.BUILT"                  
[19] "ASSESSED.TOTAL.VALUE"         "IMPROVEMENT.VALUE.CALCULATED"
Show / hide code
cat("\nParcel columns also present in apps_direct (will be dropped from base):\n")

Parcel columns also present in apps_direct (will be dropped from base):
Show / hide code
print(intersect(names(st_drop_geometry(apps_direct)), parcel_attr_cols))
 [1] "parcel_index"                 "geometry_id"                 
 [3] "cntyfips"                     "parno"                       
 [5] "altparno"                     "siteadd"                     
 [7] "scity"                        "szip"                        
 [9] "mailadd"                      "ownname"                     
[11] "parusedesc"                   "LAND.USE.CODE"               
[13] "luc_tier"                     "APN.SEQUENCE.NUMBER"         
[15] "SITUS.HOUSE.NUMBER"           "SITUS.STREET.NAME"           
[17] "OWNER.1.FULL.NAME"            "YEAR.BUILT"                  
[19] "ASSESSED.TOTAL.VALUE"         "IMPROVEMENT.VALUE.CALCULATED"
In [10]:
Show / hide code
# Stage 1: For each input record that intersected one or more parcels,
# pick a single winner. Tiebreaking mirrors canonical Step 5:
#   1. Jaro-Winkler address similarity (decisive when addresses differ)
#   2. APN.SEQUENCE.NUMBER ascending (primary recorder entry)
#   3. luc_tier descending (most specific land use)
#   4. IMPROVEMENT.VALUE.CALCULATED descending (parcel with structure)
pick_winner <- function(direct_df, addr_col, id_col) {
  direct_df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    mutate(
      addr_sim  = addr_sim(.data[[addr_col]], siteadd),
      apn_seq   = suppressWarnings(as.integer(`APN.SEQUENCE.NUMBER`)),
      improvval = suppressWarnings(as.numeric(`IMPROVEMENT.VALUE.CALCULATED`))
    ) |>
    group_by(.data[[id_col]]) |>
    mutate(n_parcels = n()) |>
    arrange(
      desc(addr_sim),
      apn_seq,
      desc(as.integer(luc_tier)),
      desc(improvval)
    ) |>
    slice(1) |>
    ungroup() |>
    select(all_of(id_col),
           winner_parcel_index = parcel_index,
           winner_addr_sim     = addr_sim,
           winner_apn_seq      = apn_seq,
           winner_n_parcels    = n_parcels)
}

mit_winners <- pick_winner(mits_direct, "raw_addr", mit_id_col)
app_winners <- pick_winner(apps_direct, "address",  app_id_col)
In [11]:
Show / hide code
# Stage 2: For each no-parcel-hit record, find K nearest high-priority parcels
# and pick by best address similarity if available, else nearest.
reassign_nohit <- function(direct_df, addr_col, id_col,
                           high       = high_parcels,
                           high_pts   = high_centroids,
                           k          = K_CANDIDATES,
                           sim_thresh = SIM_THRESHOLD,
                           dist_cap   = DIST_CAP_M) {
  nohit <- direct_df |>
    filter(is.na(parcel_index)) |>
    distinct(.data[[id_col]], .keep_all = TRUE)

  if (nrow(nohit) == 0) return(tibble())

  nohit_coords <- st_coordinates(nohit)[, c("X", "Y")]
  high_coords  <- st_coordinates(high_pts)[, c("X", "Y")]   # <-- high_pts, not high

  nn <- RANN::nn2(high_coords, nohit_coords, k = k)

  cand <- tibble(
    !!id_col    := rep(nohit[[id_col]], each = k),
    input_addr  = rep(nohit[[addr_col]], each = k),
    cand_pos    = rep(seq_len(k), nrow(nohit)),
    cand_row    = as.vector(t(nn$nn.idx)),
    cand_dist_m = as.vector(t(nn$nn.dists))
  ) |>
    mutate(
      cand_parcel_index = high$parcel_index[cand_row],
      cand_siteadd      = high$siteadd[cand_row],
      cand_addr_sim     = addr_sim(input_addr, cand_siteadd)
    )

  # Pick winner per input record:
  #   1. best address match if sim >= threshold AND within distance cap
  #   2. else nearest within distance cap
  #   3. else unassigned
  cand |>
    group_by(.data[[id_col]]) |>
    summarise(
      best_addr_sim    = max(cand_addr_sim),
      best_addr_dist   = cand_dist_m[which.max(cand_addr_sim)],
      best_addr_parcel = cand_parcel_index[which.max(cand_addr_sim)],
      nearest_dist     = min(cand_dist_m),
      nearest_parcel   = cand_parcel_index[which.min(cand_dist_m)],
      .groups = "drop"
    ) |>
    mutate(
      assigned_parcel_index = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ best_addr_parcel,
        nearest_dist  <= dist_cap                                ~ nearest_parcel,
        TRUE                                                      ~ NA_integer_
      ),
      assignment_method = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ "nearest_high_addr",
        nearest_dist  <= dist_cap                                ~ "nearest_high_dist",
        TRUE                                                      ~ "unassigned"
      ),
      assignment_dist_m = if_else(assignment_method == "nearest_high_addr",
                                  best_addr_dist, nearest_dist),
      assignment_addr_sim = if_else(assignment_method == "nearest_high_addr",
                                    best_addr_sim, NA_real_)
    )
}

mit_reassign <- reassign_nohit(mits_direct, "raw_addr", mit_id_col)
app_reassign <- reassign_nohit(apps_direct, "address",  app_id_col)
In [12]:
Show / hide code
finalize <- function(direct_df, winners, reassign, id_col,
                     keep_from_input = character()) {
  # All parcel-attribute columns that were folded in by the earlier st_join.
  # Drop them cleanly — the final left_join below re-attaches them via
  # assigned_parcel_index. Any input-side columns the caller wants to
  # preserve despite sharing a name with a parcel column are excluded.
  parcel_cols <- setdiff(names(st_drop_geometry(parcels)), keep_from_input)

  base <- direct_df |>
    distinct(.data[[id_col]], .keep_all = TRUE) |>
    select(-any_of(parcel_cols))

  base |>
    left_join(winners, by = id_col) |>
    left_join(
      reassign |> rename(reassign_method = assignment_method),
      by = id_col
    ) |>
    mutate(
      # left_join (not inner) is load-bearing: records absent from winners
      # (no parcel hit) get winner_parcel_index = NA, then coalesce falls
      # through to the Stage 2 assigned_parcel_index.
      assigned_parcel_index = coalesce(winner_parcel_index, assigned_parcel_index),
      assignment_method = case_when(
        !is.na(winner_parcel_index) & winner_n_parcels == 1 ~ "direct_single_parcel",
        !is.na(winner_parcel_index) & winner_n_parcels >  1 ~ "direct_multi_resolved",
        !is.na(reassign_method)                              ~ reassign_method,
        TRUE                                                  ~ "unassigned"
      )
    ) |>
    select(-reassign_method) |>
    left_join(parcels |> st_drop_geometry(),
              by = c("assigned_parcel_index" = "parcel_index"))
}

mits_final <- finalize(mits_direct, mit_winners, mit_reassign, mit_id_col,
                       keep_from_input = MIT_KEEP_FROM_INPUT)
apps_final <- finalize(apps_direct, app_winners, app_reassign, app_id_col,
                       keep_from_input = APP_KEEP_FROM_INPUT)

# Diagnostics
cat("\n=== Mitigations: assignment method distribution ===\n")

=== Mitigations: assignment method distribution ===
Show / hide code
mits_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()
      assignment_method    n
1  direct_single_parcel 7113
2     nearest_high_addr  935
3     nearest_high_dist  689
4            unassigned   40
5 direct_multi_resolved   24
Show / hide code
cat("\n=== Applications: assignment method distribution ===\n")

=== Applications: assignment method distribution ===
Show / hide code
apps_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()
      assignment_method     n
1  direct_single_parcel 20792
2     nearest_high_dist  1214
3     nearest_high_addr  1015
4            unassigned   150
5 direct_multi_resolved    61
In [13]:
Show / hide code
st_write(mits_final, file.path(out_dir, "alt_mits_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_final, file.path(out_dir, "alt_apps_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_resolved_review.csv"))
apps_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_resolved_review.csv"))

Summary

Match status counts

Row counts reflect the fan-out: a single input record landing on N parcel polygons contributes N rows here.

In [14]:
Show / hide code
bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  count(side, match_status) |>
  pivot_wider(names_from = match_status, values_from = n, values_fill = 0L) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Row counts by side and match status (rows = input record × parcel hit)"
  )
Row counts by side and match status (rows = input record × parcel hit)
side matched_high matched_low no_parcel_hit
applications 19,532 1,949 2,379
mitigations 6,617 699 1,664

Shared-footprint fan-out

How often does a single input record intersect multiple parcel polygons? Condo complexes and stacked parcels show up as n_parcels >= 2.

In [15]:
Show / hide code
multihit <- function(df, id_col, label) {
  df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    count(.data[[id_col]], name = "n_parcels") |>
    count(n_parcels, name = "n_records") |>
    mutate(side = label)
}

bind_rows(
  multihit(mits_direct, mit_id_col, "mitigations"),
  multihit(apps_direct, app_id_col, "applications")
) |>
  pivot_wider(names_from = side, values_from = n_records, values_fill = 0L) |>
  arrange(n_parcels) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Distribution of parcel-hit counts per input record (n_parcels > 1 = shared-footprint)"
  )
Distribution of parcel-hit counts per input record (n_parcels > 1 = shared-footprint)
n_parcels mitigations applications
1 7,113 20,792
2 20 40
3 1 11
4 0 2
6 1 0
11 0 1
37 1 3
95 0 1
117 1 3

Distance to nearest high-priority parcel (low-priority / off-parcel only)

If most low-hit records are within ~30 m of a high-priority parcel the low classification is likely a geocode artefact, not a meaningful difference in parcel type.

In [16]:
Show / hide code
bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  group_by(side, match_status) |>
  summarise(
    n      = n(),
    p10    = round(quantile(dist_to_nearest_high_m, 0.10, na.rm = TRUE), 1),
    median = round(median(dist_to_nearest_high_m,          na.rm = TRUE), 1),
    p90    = round(quantile(dist_to_nearest_high_m, 0.90, na.rm = TRUE), 1),
    max    = round(max(dist_to_nearest_high_m,             na.rm = TRUE), 1),
    .groups = "drop"
  ) |>
  knitr::kable(
    caption = "Distance (m) from low-priority / no-hit records to nearest high-priority parcel"
  )
Distance (m) from low-priority / no-hit records to nearest high-priority parcel
side match_status n p10 median p90 max
applications matched_low 1949 6.9 24.4 92.8 2006.4
applications no_parcel_hit 2379 4.2 8.7 56.6 5175.0
mitigations matched_low 699 5.6 25.8 85.7 750.9
mitigations no_parcel_hit 1664 3.5 8.3 41.8 453.9

Review browsers

Records that landed on a low-priority LUC or off-parcel entirely, sorted ascending by distance to the nearest high-priority parcel. Rows near the top are the most suspicious (close to a high-priority parcel but not on one).

Mitigations

In [17]:
Show / hide code

mits_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Mitigations on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Applications

In [18]:
Show / hide code

apps_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Applications on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Output schema

Fan-out outputs (alt_mits_direct.gpkg / alt_apps_direct.gpkg)

One row per (input record × intersecting parcel). Use for shared-footprint inspection and canonical Step 5 comparison.

Each row of alt_mits_direct.gpkg / alt_apps_direct.gpkg:

Column Source Notes
(all original mit/app columns) input unchanged
parcel_indexIMPROVEMENT.VALUE.CALCULATED parcels NA when no_parcel_hit
luc_tier_num derived as.integer(luc_tier)
match_status derived matched_high / matched_low / no_parcel_hit
nearest_high_parcel_index derived parcel_index of closest high-priority parcel
dist_to_nearest_high_m derived metres; review only, not used for matching

Resolved outputs (alt_mits_resolved.gpkg / alt_apps_resolved.gpkg)

One row per input record. Use for downstream analysis and supervisor-facing counts.

Column Source Notes
(all original mit/app columns) input unchanged
assigned_parcel_index derived winner from Stage 1 or Stage 2; NA if unassigned
assignment_method derived see bucket table in Section 6
winner_parcel_index Stage 1 NA for no-hit records
winner_addr_sim Stage 1 Jaro-Winkler score of winning address match
winner_n_parcels Stage 1 how many parcels the input point intersected
assignment_dist_m Stage 2 metres to assigned parcel; NA for Stage 1 records
assignment_addr_sim Stage 2 Jaro-Winkler score; NA for distance-only assignments
parcel_indexIMPROVEMENT.VALUE.CALCULATED parcels re-joined via assigned_parcel_index; NA if unassigned

Results

The Step 5 alt direct-match pipeline produces a one-to-one assignment of every mit/app record (within the flood study area) to a parcel, where possible. The table below summarizes hit rate by assignment method, and the leaflet visualizes spatial distribution of every record by outcome.

Hit-rate summary

In [19]:
Show / hide code
library(knitr)
library(scales)

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
Show / hide code
build_summary <- function(df, side_label) {
  total <- nrow(df)
  am <- df |> st_drop_geometry() |> count(assignment_method)
  get_n <- function(method) {
    v <- am |> filter(assignment_method == method) |> pull(n)
    if (length(v) == 0) 0L else v
  }

  direct_single <- get_n("direct_single_parcel")
  direct_multi  <- get_n("direct_multi_resolved")
  stage2_addr   <- get_n("nearest_high_addr")
  stage2_dist   <- get_n("nearest_high_dist")
  unassigned    <- get_n("unassigned")
  assigned      <- total - unassigned

  tibble(
    Metric = c(
      "Total records (within HUC-6 study area)",
      "  Direct hit — single parcel",
      "  Direct hit — multi-parcel resolved (Stage 1 tiebreaker)",
      "  Stage 2 rescue — address match",
      "  Stage 2 rescue — nearest centroid (\u2264 200 m)",
      "  Unassigned — no high-priority parcel within 200 m",
      "**Total assigned to a parcel**",
      "**Hit rate**"
    ),
    !!side_label := c(
      comma(total),
      comma(direct_single),
      comma(direct_multi),
      comma(stage2_addr),
      comma(stage2_dist),
      comma(unassigned),
      comma(assigned),
      sprintf("%.2f %%", 100 * assigned / total)
    )
  )
}

mit_summary <- build_summary(mits_final, "Mitigations")
app_summary <- build_summary(apps_final, "Applications")

results_table <- mit_summary |>
  left_join(app_summary, by = "Metric")

knitr::kable(
  results_table,
  align   = c("l", "r", "r"),
  caption = "Hit rate of mit/app-to-parcel join. Indented rows are subcategories of \"Total records\"; totals at bottom are bolded."
)
Hit rate of mit/app-to-parcel join. Indented rows are subcategories of “Total records”; totals at bottom are bolded.
Metric Mitigations Applications
Total records (within HUC-6 study area) 8,801 23,232
Direct hit — single parcel 7,113 20,792
Direct hit — multi-parcel resolved (Stage 1 tiebreaker) 24 61
Stage 2 rescue — address match 935 1,015
Stage 2 rescue — nearest centroid (≤ 200 m) 689 1,214
Unassigned — no high-priority parcel within 200 m 40 150
Total assigned to a parcel 8,761 23,082
Hit rate 99.55 % 99.35 %

The two-stage pipeline assigns nearly every record to a parcel. Direct intersection (Stage 1) handles the bulk of the workload; Stage 2’s nearest-centroid rescue captures records whose geocoded location fell on or near a non-residential parcel (curb, ROW, adjacent commercial) while the intended residential parcel was within 200 m. The small unassigned residual represents genuine data-coverage limitations discussed in the methodology section.

Spatial distribution

The map below plots every mit and app record by assignment outcome. Successful records are clustered at lower zoom levels for performance; unassigned records are shown individually in red so they remain visible at any zoom.

In [20]:
Show / hide code
library(leaflet)

# Combine mits + apps into a single point dataset with outcome label
make_map_df <- function(df, side, addr_col) {
  coords <- df |> st_transform(4326) |> st_coordinates()
  df |>
    st_drop_geometry() |>
    transmute(
      unique_id,
      side,
      label_addr = .data[[addr_col]],
      assignment_method,
      lon = coords[, "X"],
      lat = coords[, "Y"]
    )
}

points_for_map <- bind_rows(
  make_map_df(mits_final, "Mitigation",  "Match_addr"),
  make_map_df(apps_final, "Application", "address")
) |>
  mutate(
    outcome = case_when(
      assignment_method == "direct_single_parcel"  ~ "Direct (single hit)",
      assignment_method == "direct_multi_resolved" ~ "Direct (multi-parcel resolved)",
      assignment_method == "nearest_high_addr"     ~ "Rescued (address match)",
      assignment_method == "nearest_high_dist"     ~ "Rescued (nearest centroid)",
      assignment_method == "unassigned"            ~ "Unassigned"
    )
  )

outcome_pal <- colorFactor(
  palette = c("Direct (single hit)"             = "#1B7837",
              "Direct (multi-parcel resolved)"  = "#5AAE61",
              "Rescued (address match)"         = "#2166AC",
              "Rescued (nearest centroid)"      = "#67A9CF",
              "Unassigned"                      = "#B2182B"),
  domain  = points_for_map$outcome
)

flood_boundary_wgs <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(4326)

leaflet() |>
  addProviderTiles(providers$CartoDB.Positron, group = "Light") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") |>

  # Flood study area
  addPolygons(
    data        = flood_boundary_wgs,
    fillColor   = "lightblue",
    fillOpacity = 0.15,
    color       = "navy",
    weight      = 1.5,
    opacity     = 0.6,
    group       = "Flood study area"
  ) |>

  # Assigned records — clustered for performance
  addCircleMarkers(
    data           = points_for_map |> filter(outcome != "Unassigned"),
    lng            = ~lon, lat = ~lat,
    radius         = 3,
    color          = ~outcome_pal(outcome),
    fillOpacity    = 0.7,
    stroke         = FALSE,
    label          = ~glue::glue("{side} #{unique_id} — {outcome}"),
    popup          = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>outcome:</b> {outcome}<br>",
      "<b>addr:</b> {label_addr}"
    ),
    clusterOptions = markerClusterOptions(maxClusterRadius = 40),
    group          = "Assigned records"
  ) |>

  # Unassigned records — unclustered and prominent
  addCircleMarkers(
    data         = points_for_map |> filter(outcome == "Unassigned"),
    lng          = ~lon, lat = ~lat,
    radius       = 6,
    color        = "#7C0024",
    fillColor    = "#B2182B",
    fillOpacity  = 0.9,
    stroke       = TRUE,
    weight       = 2,
    label        = ~glue::glue("{side} #{unique_id} — UNASSIGNED"),
    popup        = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>UNASSIGNED</b><br>",
      "<b>addr:</b> {label_addr}"
    ),
    group        = "Unassigned records"
  ) |>

  addLayersControl(
    baseGroups    = c("Light", "Satellite"),
    overlayGroups = c("Flood study area", "Assigned records", "Unassigned records"),
    options       = layersControlOptions(collapsed = FALSE)
  ) |>
  addLegend(
    "bottomright",
    pal     = outcome_pal,
    values  = points_for_map$outcome,
    title   = "Assignment outcome",
    opacity = 0.85
  )

Reading the map: Direct-hit records (green) and Stage 2 rescues (blue) make up the vast majority. Unassigned records (red) are concentrated in:

  • Rural eastern NC (Edgecombe County and similar) where rural-route box-number geocodes fell back to road centerlines rather than specific parcels
  • Highway corridors (US-264 near Belhaven, NC-33, US-117) where numbered highway addresses geocoded to ROW parcels
  • Subdivisions where the parcel layer lacks lot-level granularity (e.g., River Bend Drive in Burgaw)
  • Mobile home parks where the parcel data treats the entire mobile home park as one big land parcel rather than multiple unit-level parceles. (Greenbriar MHP in Wilson County)
  • Inland metro areas (Winston-Salem, Concord, Greenville) where the geocoded point landed in industrial or highway-adjacent zones with no high-priority residential parcel within 200 m

These categories represent genuine data limitations (geocoding precision, parcel-layer granularity, and large-parcel centroid distance) rather than matching-pipeline failures.