Step 5.1 (Alt): Direct Intersection — Mits & Apps to Parcels

Overview

A transparent, no-heuristic spatial join of mitigation and application records to parcels. Every input record is kept (no deduplication). A single point representing, say, six mitigations on a condo building stays as six rows in the output. Shared-footprint parcels fan out naturally: one input record that intersects N parcel polygons produces N output rows.

This file is complementary to 05_spatial_match_app_mit.qmd, which runs the tiered priority cascade and produces the canonical matched output. Nothing here changes that output.

What this file does NOT do:

No tier cascade (high → med → low → fallback NN).
No distance-based matching or buffering in the join itself.
No dedup of mit/app records.
No winner selection when a point hits multiple shared-footprint parcels.

Inputs:

parcels_pri.gpkg — parcels with luc_tier column (from out_dir)
mit_newgeo.gpkg — raw mitigations (CRC_May_2025/Raw Data Inputs/raw_app_ncb/)
applications_final.gpkg — raw applications (same directory)

Outputs (written to out_dir):

alt_mits_direct.gpkg, alt_apps_direct.gpkg — full fan-out spatial join
alt_mits_direct_review.csv, alt_apps_direct_review.csv — same, no geometry

In [1]:

Show / hide code

library(sf)

Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE

Show / hide code

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Show / hide code

library(readr)
library(tidyr)
library(purrr)
library(knitr)
library(DT)
library(leaflet)
library(glue)
library(stringdist)


Attaching package: 'stringdist'

The following object is masked from 'package:tidyr':

    extract

Show / hide code

library(stringr)
library(nngeo)
library(RANN)

In [2]:

Show / hide code

out_dir <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

parcels_path <- file.path(out_dir, "parcels_pri.gpkg")
mits_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"
apps_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"
flood_boundary_path <- "/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"

HIGH_TIER_MIN <- 3L

mit_id_col <- "unique_id"
app_id_col <- "unique_id"

1 Load inputs

In [3]:

Show / hide code

parcels <- st_read(parcels_path, quiet = TRUE) |>
  select(
    parcel_index, geometry_id, cntyfips, parno, altparno,
    siteadd, scity, szip, mailadd, ownname, parusedesc,
    `LAND.USE.CODE`, luc_tier,
    `APN.SEQUENCE.NUMBER`,
    `SITUS.HOUSE.NUMBER`, `SITUS.STREET.NAME`,
    `OWNER.1.FULL.NAME`, `YEAR.BUILT`,
    `ASSESSED.TOTAL.VALUE`,
    `IMPROVEMENT.VALUE.CALCULATED`   
  ) |>
  st_transform(32119)

# Flood study area boundary
flood_boundary <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(st_crs(parcels))

# Read raw mits/apps, clip to flood study area via st_within, then assign
# unique_id within the clipped set. Matches the canonical Step 5 SA filter.
# Do NOT dedup — every HMA / IHP record is preserved, coordinate duplicates included.
mits <- st_read(mits_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())

apps <- st_read(apps_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())

2 Direct intersection join

One row per (input record × intersecting parcel). Records that fall outside every parcel produce one row with NA parcel columns (left = TRUE).

In [4]:

Show / hide code

mits_direct <- st_join(mits, parcels, join = st_intersects, left = TRUE)
apps_direct <- st_join(apps, parcels, join = st_intersects, left = TRUE)

3 Tag match status

In [5]:

Show / hide code

tag_status <- function(df, high_tier_min = HIGH_TIER_MIN) {
  df |>
    mutate(
      luc_tier_num = suppressWarnings(as.integer(luc_tier)),
      match_status = case_when(
        is.na(parcel_index)           ~ "no_parcel_hit",
        luc_tier_num >= high_tier_min ~ "matched_high",
        TRUE                          ~ "matched_low"
      )
    )
}

mits_direct <- tag_status(mits_direct)
apps_direct <- tag_status(apps_direct)

4 Proximity to nearest high-priority parcel

For every input point, compute distance to the nearest high-priority parcel. This is for review only — it does not change any match result.

In [6]:

Show / hide code

high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

nearest_high <- function(pts, id_col, high) {
  if (nrow(pts) == 0 || nrow(high) == 0) {
    return(tibble(
      !!id_col := character(),
      nearest_high_parcel_index = integer(),
      dist_to_nearest_high_m    = numeric()
    ))
  }
  nn_idx <- st_nearest_feature(pts, high)
  tibble(
    !!id_col := pts[[id_col]],
    nearest_high_parcel_index = high$parcel_index[nn_idx],
    dist_to_nearest_high_m    = as.numeric(
      st_distance(pts, high[nn_idx, ], by_element = TRUE)
    )
  )
}

mits_nearhigh <- nearest_high(mits, mit_id_col, high_parcels)
apps_nearhigh <- nearest_high(apps, app_id_col, high_parcels)

mits_direct <- mits_direct |> left_join(mits_nearhigh, by = mit_id_col)
apps_direct <- apps_direct |> left_join(apps_nearhigh, by = app_id_col)

5 Write outputs

In [7]:

Show / hide code

st_write(mits_direct, file.path(out_dir, "alt_mits_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_direct, file.path(out_dir, "alt_apps_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_direct_review.csv"))
apps_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_direct_review.csv"))

6 Resolve: one-row-per-record output

The fan-out outputs above preserve every (input record × parcel) combination for inspection. This section produces a deduplicated, one-row-per-input-record version suitable for downstream analysis and supervisor-facing counts.

Assignment method buckets:

`assignment_method`	Meaning
`direct_single_parcel`	Input record intersected exactly one parcel — no judgment needed
`direct_multi_resolved`	Multi-parcel hit (condo, recorder dup) — winner picked by four-level tiebreaker
`nearest_high_addr`	No parcel hit — rescued by address-similarity match within distance cap
`nearest_high_dist`	No parcel hit, no good address match — assigned to nearest high-priority parcel within cap
`unassigned`	No parcel hit, no candidate within cap (out-of-scope records, e.g. western mountain counties)

Methods note — multi-unit buildings: For apartment complexes and condominium buildings where all candidate parcels share a complex-level site address (e.g. 95 units sharing one street address but distinct APNs), address-similarity is non-discriminating across candidates and the winner is determined by the APN sequence / improvement value tiebreakers. The selected parcel is reproducible across runs but should be interpreted as “a representative unit within the matched building” rather than “the specific unit the record corresponds to.” Unit-level disambiguation would require parsing unit numbers from the input record’s address and matching against per-unit parcel records, which CoreLogic’s siteadd field does not consistently provide.

In [8]:

Show / hide code

library(stringdist)

norm_addr <- function(x) {
  coalesce(x, "") |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9 ]", " ") |>
    str_squish()
}

addr_sim <- function(a, b) {
  na <- norm_addr(a); nb <- norm_addr(b)
  ifelse(na == "" | nb == "", 0, stringsim(na, nb, method = "jw"))
}

# --- Tunables ---------------------------------------------------------------
SIM_THRESHOLD <- 0.70   # min Jaro-Winkler to call an address match a "real" hit
DIST_CAP_M    <- 200    # max metres allowed for a distance-only fallback
K_CANDIDATES  <- 10     # nearest high-priority parcels to evaluate per no-hit

# Columns to preserve from input side even if they share a name with a parcel
# column. Populate after running check-column-overlaps if genuine input-side
# fields are being silently dropped by finalize().
MIT_KEEP_FROM_INPUT <- character()
APP_KEEP_FROM_INPUT <- character()

high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

# Centroid version for RANN KNN. st_coordinates() on polygon geometry returns
# edge vertices, producing misaligned coordinate arrays; centroids give one
# point per row, matching the row-index contract for nn$nn.idx.
# high_parcels and high_centroids share the same row order.
high_centroids <- high_parcels |> st_centroid()

Warning: st_centroid assumes attributes are constant over geometries

In [9]:

Show / hide code

# Sanity check: any column names shared between input-side data (mits/apps)
# and parcels will be silently dropped by finalize() unless added to
# MIT_KEEP_FROM_INPUT / APP_KEEP_FROM_INPUT in resolve-setup.
parcel_attr_cols <- names(st_drop_geometry(parcels))

cat("Parcel columns also present in mits_direct (will be dropped from base):\n")

Parcel columns also present in mits_direct (will be dropped from base):

Show / hide code

print(intersect(names(st_drop_geometry(mits_direct)), parcel_attr_cols))

 [1] "parcel_index"                 "geometry_id"                 
 [3] "cntyfips"                     "parno"                       
 [5] "altparno"                     "siteadd"                     
 [7] "scity"                        "szip"                        
 [9] "mailadd"                      "ownname"                     
[11] "parusedesc"                   "LAND.USE.CODE"               
[13] "luc_tier"                     "APN.SEQUENCE.NUMBER"         
[15] "SITUS.HOUSE.NUMBER"           "SITUS.STREET.NAME"           
[17] "OWNER.1.FULL.NAME"            "YEAR.BUILT"                  
[19] "ASSESSED.TOTAL.VALUE"         "IMPROVEMENT.VALUE.CALCULATED"

Show / hide code

cat("\nParcel columns also present in apps_direct (will be dropped from base):\n")


Parcel columns also present in apps_direct (will be dropped from base):

Show / hide code

print(intersect(names(st_drop_geometry(apps_direct)), parcel_attr_cols))

 [1] "parcel_index"                 "geometry_id"                 
 [3] "cntyfips"                     "parno"                       
 [5] "altparno"                     "siteadd"                     
 [7] "scity"                        "szip"                        
 [9] "mailadd"                      "ownname"                     
[11] "parusedesc"                   "LAND.USE.CODE"               
[13] "luc_tier"                     "APN.SEQUENCE.NUMBER"         
[15] "SITUS.HOUSE.NUMBER"           "SITUS.STREET.NAME"           
[17] "OWNER.1.FULL.NAME"            "YEAR.BUILT"                  
[19] "ASSESSED.TOTAL.VALUE"         "IMPROVEMENT.VALUE.CALCULATED"

In [10]:

Show / hide code

# Stage 1: For each input record that intersected one or more parcels,
# pick a single winner. Tiebreaking mirrors canonical Step 5:
#   1. Jaro-Winkler address similarity (decisive when addresses differ)
#   2. APN.SEQUENCE.NUMBER ascending (primary recorder entry)
#   3. luc_tier descending (most specific land use)
#   4. IMPROVEMENT.VALUE.CALCULATED descending (parcel with structure)
pick_winner <- function(direct_df, addr_col, id_col) {
  direct_df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    mutate(
      addr_sim  = addr_sim(.data[[addr_col]], siteadd),
      apn_seq   = suppressWarnings(as.integer(`APN.SEQUENCE.NUMBER`)),
      improvval = suppressWarnings(as.numeric(`IMPROVEMENT.VALUE.CALCULATED`))
    ) |>
    group_by(.data[[id_col]]) |>
    mutate(n_parcels = n()) |>
    arrange(
      desc(addr_sim),
      apn_seq,
      desc(as.integer(luc_tier)),
      desc(improvval)
    ) |>
    slice(1) |>
    ungroup() |>
    select(all_of(id_col),
           winner_parcel_index = parcel_index,
           winner_addr_sim     = addr_sim,
           winner_apn_seq      = apn_seq,
           winner_n_parcels    = n_parcels)
}

mit_winners <- pick_winner(mits_direct, "raw_addr", mit_id_col)
app_winners <- pick_winner(apps_direct, "address",  app_id_col)

In [11]:

Show / hide code

# Stage 2: For each no-parcel-hit record, find K nearest high-priority parcels
# and pick by best address similarity if available, else nearest.
reassign_nohit <- function(direct_df, addr_col, id_col,
                           high       = high_parcels,
                           high_pts   = high_centroids,
                           k          = K_CANDIDATES,
                           sim_thresh = SIM_THRESHOLD,
                           dist_cap   = DIST_CAP_M) {
  nohit <- direct_df |>
    filter(is.na(parcel_index)) |>
    distinct(.data[[id_col]], .keep_all = TRUE)

  if (nrow(nohit) == 0) return(tibble())

  nohit_coords <- st_coordinates(nohit)[, c("X", "Y")]
  high_coords  <- st_coordinates(high_pts)[, c("X", "Y")]   # <-- high_pts, not high

  nn <- RANN::nn2(high_coords, nohit_coords, k = k)

  cand <- tibble(
    !!id_col    := rep(nohit[[id_col]], each = k),
    input_addr  = rep(nohit[[addr_col]], each = k),
    cand_pos    = rep(seq_len(k), nrow(nohit)),
    cand_row    = as.vector(t(nn$nn.idx)),
    cand_dist_m = as.vector(t(nn$nn.dists))
  ) |>
    mutate(
      cand_parcel_index = high$parcel_index[cand_row],
      cand_siteadd      = high$siteadd[cand_row],
      cand_addr_sim     = addr_sim(input_addr, cand_siteadd)
    )

  # Pick winner per input record:
  #   1. best address match if sim >= threshold AND within distance cap
  #   2. else nearest within distance cap
  #   3. else unassigned
  cand |>
    group_by(.data[[id_col]]) |>
    summarise(
      best_addr_sim    = max(cand_addr_sim),
      best_addr_dist   = cand_dist_m[which.max(cand_addr_sim)],
      best_addr_parcel = cand_parcel_index[which.max(cand_addr_sim)],
      nearest_dist     = min(cand_dist_m),
      nearest_parcel   = cand_parcel_index[which.min(cand_dist_m)],
      .groups = "drop"
    ) |>
    mutate(
      assigned_parcel_index = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ best_addr_parcel,
        nearest_dist  <= dist_cap                                ~ nearest_parcel,
        TRUE                                                      ~ NA_integer_
      ),
      assignment_method = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ "nearest_high_addr",
        nearest_dist  <= dist_cap                                ~ "nearest_high_dist",
        TRUE                                                      ~ "unassigned"
      ),
      assignment_dist_m = if_else(assignment_method == "nearest_high_addr",
                                  best_addr_dist, nearest_dist),
      assignment_addr_sim = if_else(assignment_method == "nearest_high_addr",
                                    best_addr_sim, NA_real_)
    )
}

mit_reassign <- reassign_nohit(mits_direct, "raw_addr", mit_id_col)
app_reassign <- reassign_nohit(apps_direct, "address",  app_id_col)

In [12]:

Show / hide code

finalize <- function(direct_df, winners, reassign, id_col,
                     keep_from_input = character()) {
  # All parcel-attribute columns that were folded in by the earlier st_join.
  # Drop them cleanly — the final left_join below re-attaches them via
  # assigned_parcel_index. Any input-side columns the caller wants to
  # preserve despite sharing a name with a parcel column are excluded.
  parcel_cols <- setdiff(names(st_drop_geometry(parcels)), keep_from_input)

  base <- direct_df |>
    distinct(.data[[id_col]], .keep_all = TRUE) |>
    select(-any_of(parcel_cols))

  base |>
    left_join(winners, by = id_col) |>
    left_join(
      reassign |> rename(reassign_method = assignment_method),
      by = id_col
    ) |>
    mutate(
      # left_join (not inner) is load-bearing: records absent from winners
      # (no parcel hit) get winner_parcel_index = NA, then coalesce falls
      # through to the Stage 2 assigned_parcel_index.
      assigned_parcel_index = coalesce(winner_parcel_index, assigned_parcel_index),
      assignment_method = case_when(
        !is.na(winner_parcel_index) & winner_n_parcels == 1 ~ "direct_single_parcel",
        !is.na(winner_parcel_index) & winner_n_parcels >  1 ~ "direct_multi_resolved",
        !is.na(reassign_method)                              ~ reassign_method,
        TRUE                                                  ~ "unassigned"
      )
    ) |>
    select(-reassign_method) |>
    left_join(parcels |> st_drop_geometry(),
              by = c("assigned_parcel_index" = "parcel_index"))
}

mits_final <- finalize(mits_direct, mit_winners, mit_reassign, mit_id_col,
                       keep_from_input = MIT_KEEP_FROM_INPUT)
apps_final <- finalize(apps_direct, app_winners, app_reassign, app_id_col,
                       keep_from_input = APP_KEEP_FROM_INPUT)

# Diagnostics
cat("\n=== Mitigations: assignment method distribution ===\n")


=== Mitigations: assignment method distribution ===

Show / hide code

mits_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()

      assignment_method    n
1  direct_single_parcel 7113
2     nearest_high_addr  935
3     nearest_high_dist  689
4            unassigned   40
5 direct_multi_resolved   24

Show / hide code

cat("\n=== Applications: assignment method distribution ===\n")


=== Applications: assignment method distribution ===

Show / hide code

apps_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()

      assignment_method     n
1  direct_single_parcel 20792
2     nearest_high_dist  1214
3     nearest_high_addr  1015
4            unassigned   150
5 direct_multi_resolved    61

In [13]:

Show / hide code

st_write(mits_final, file.path(out_dir, "alt_mits_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_final, file.path(out_dir, "alt_apps_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_resolved_review.csv"))
apps_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_resolved_review.csv"))

Summary

Match status counts

Row counts reflect the fan-out: a single input record landing on N parcel polygons contributes N rows here.

In [14]:

Show / hide code

bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  count(side, match_status) |>
  pivot_wider(names_from = match_status, values_from = n, values_fill = 0L) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Row counts by side and match status (rows = input record × parcel hit)"
  )

Row counts by side and match status (rows = input record × parcel hit)
side	matched_high	matched_low	no_parcel_hit
applications	19,532	1,949	2,379
mitigations	6,617	699	1,664

Shared-footprint fan-out

How often does a single input record intersect multiple parcel polygons? Condo complexes and stacked parcels show up as n_parcels >= 2.

In [15]:

Show / hide code

multihit <- function(df, id_col, label) {
  df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    count(.data[[id_col]], name = "n_parcels") |>
    count(n_parcels, name = "n_records") |>
    mutate(side = label)
}

bind_rows(
  multihit(mits_direct, mit_id_col, "mitigations"),
  multihit(apps_direct, app_id_col, "applications")
) |>
  pivot_wider(names_from = side, values_from = n_records, values_fill = 0L) |>
  arrange(n_parcels) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Distribution of parcel-hit counts per input record (n_parcels > 1 = shared-footprint)"
  )

Distribution of parcel-hit counts per input record (n_parcels > 1 = shared-footprint)
n_parcels	mitigations	applications
1	7,113	20,792
2	20	40
3	1	11
4	0	2
6	1	0
11	0	1
37	1	3
95	0	1
117	1	3

Distance to nearest high-priority parcel (low-priority / off-parcel only)

If most low-hit records are within ~30 m of a high-priority parcel the low classification is likely a geocode artefact, not a meaningful difference in parcel type.

In [16]:

Show / hide code

bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  group_by(side, match_status) |>
  summarise(
    n      = n(),
    p10    = round(quantile(dist_to_nearest_high_m, 0.10, na.rm = TRUE), 1),
    median = round(median(dist_to_nearest_high_m,          na.rm = TRUE), 1),
    p90    = round(quantile(dist_to_nearest_high_m, 0.90, na.rm = TRUE), 1),
    max    = round(max(dist_to_nearest_high_m,             na.rm = TRUE), 1),
    .groups = "drop"
  ) |>
  knitr::kable(
    caption = "Distance (m) from low-priority / no-hit records to nearest high-priority parcel"
  )

Distance (m) from low-priority / no-hit records to nearest high-priority parcel
side	match_status	n	p10	median	p90	max
applications	matched_low	1949	6.9	24.4	92.8	2006.4
applications	no_parcel_hit	2379	4.2	8.7	56.6	5175.0
mitigations	matched_low	699	5.6	25.8	85.7	750.9
mitigations	no_parcel_hit	1664	3.5	8.3	41.8	453.9

Review browsers

Records that landed on a low-priority LUC or off-parcel entirely, sorted ascending by distance to the nearest high-priority parcel. Rows near the top are the most suspicious (close to a high-priority parcel but not on one).

Mitigations

In [17]:

Show / hide code


mits_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Mitigations on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )

Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Applications

In [18]:

Show / hide code


apps_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Applications on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )

Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Output schema

Fan-out outputs (`alt_mits_direct.gpkg` / `alt_apps_direct.gpkg`)

One row per (input record × intersecting parcel). Use for shared-footprint inspection and canonical Step 5 comparison.

Each row of alt_mits_direct.gpkg / alt_apps_direct.gpkg:

Column	Source	Notes
(all original mit/app columns)	input	unchanged
`parcel_index` … `IMPROVEMENT.VALUE.CALCULATED`	parcels	NA when `no_parcel_hit`
`luc_tier_num`	derived	`as.integer(luc_tier)`
`match_status`	derived	`matched_high` / `matched_low` / `no_parcel_hit`
`nearest_high_parcel_index`	derived	`parcel_index` of closest high-priority parcel
`dist_to_nearest_high_m`	derived	metres; review only, not used for matching

Resolved outputs (`alt_mits_resolved.gpkg` / `alt_apps_resolved.gpkg`)

One row per input record. Use for downstream analysis and supervisor-facing counts.

Column	Source	Notes
(all original mit/app columns)	input	unchanged
`assigned_parcel_index`	derived	winner from Stage 1 or Stage 2; NA if unassigned
`assignment_method`	derived	see bucket table in Section 6
`winner_parcel_index`	Stage 1	NA for no-hit records
`winner_addr_sim`	Stage 1	Jaro-Winkler score of winning address match
`winner_n_parcels`	Stage 1	how many parcels the input point intersected
`assignment_dist_m`	Stage 2	metres to assigned parcel; NA for Stage 1 records
`assignment_addr_sim`	Stage 2	Jaro-Winkler score; NA for distance-only assignments
`parcel_index` … `IMPROVEMENT.VALUE.CALCULATED`	parcels	re-joined via `assigned_parcel_index`; NA if unassigned

Results

The Step 5 alt direct-match pipeline produces a one-to-one assignment of every mit/app record (within the flood study area) to a parcel, where possible. The table below summarizes hit rate by assignment method, and the leaflet visualizes spatial distribution of every record by outcome.

Hit-rate summary

In [19]:

Show / hide code

library(knitr)
library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Show / hide code

build_summary <- function(df, side_label) {
  total <- nrow(df)
  am <- df |> st_drop_geometry() |> count(assignment_method)
  get_n <- function(method) {
    v <- am |> filter(assignment_method == method) |> pull(n)
    if (length(v) == 0) 0L else v
  }

  direct_single <- get_n("direct_single_parcel")
  direct_multi  <- get_n("direct_multi_resolved")
  stage2_addr   <- get_n("nearest_high_addr")
  stage2_dist   <- get_n("nearest_high_dist")
  unassigned    <- get_n("unassigned")
  assigned      <- total - unassigned

  tibble(
    Metric = c(
      "Total records (within HUC-6 study area)",
      "  Direct hit — single parcel",
      "  Direct hit — multi-parcel resolved (Stage 1 tiebreaker)",
      "  Stage 2 rescue — address match",
      "  Stage 2 rescue — nearest centroid (\u2264 200 m)",
      "  Unassigned — no high-priority parcel within 200 m",
      "**Total assigned to a parcel**",
      "**Hit rate**"
    ),
    !!side_label := c(
      comma(total),
      comma(direct_single),
      comma(direct_multi),
      comma(stage2_addr),
      comma(stage2_dist),
      comma(unassigned),
      comma(assigned),
      sprintf("%.2f %%", 100 * assigned / total)
    )
  )
}

mit_summary <- build_summary(mits_final, "Mitigations")
app_summary <- build_summary(apps_final, "Applications")

results_table <- mit_summary |>
  left_join(app_summary, by = "Metric")

knitr::kable(
  results_table,
  align   = c("l", "r", "r"),
  caption = "Hit rate of mit/app-to-parcel join. Indented rows are subcategories of \"Total records\"; totals at bottom are bolded."
)

Hit rate of mit/app-to-parcel join. Indented rows are subcategories of “Total records”; totals at bottom are bolded.
Metric	Mitigations	Applications
Total records (within HUC-6 study area)	8,801	23,232
Direct hit — single parcel	7,113	20,792
Direct hit — multi-parcel resolved (Stage 1 tiebreaker)	24	61
Stage 2 rescue — address match	935	1,015
Stage 2 rescue — nearest centroid (≤ 200 m)	689	1,214
Unassigned — no high-priority parcel within 200 m	40	150
Total assigned to a parcel	8,761	23,082
Hit rate	99.55 %	99.35 %

The two-stage pipeline assigns nearly every record to a parcel. Direct intersection (Stage 1) handles the bulk of the workload; Stage 2’s nearest-centroid rescue captures records whose geocoded location fell on or near a non-residential parcel (curb, ROW, adjacent commercial) while the intended residential parcel was within 200 m. The small unassigned residual represents genuine data-coverage limitations discussed in the methodology section.

Spatial distribution

The map below plots every mit and app record by assignment outcome. Successful records are clustered at lower zoom levels for performance; unassigned records are shown individually in red so they remain visible at any zoom.

In [20]:

Show / hide code

library(leaflet)

# Combine mits + apps into a single point dataset with outcome label
make_map_df <- function(df, side, addr_col) {
  coords <- df |> st_transform(4326) |> st_coordinates()
  df |>
    st_drop_geometry() |>
    transmute(
      unique_id,
      side,
      label_addr = .data[[addr_col]],
      assignment_method,
      lon = coords[, "X"],
      lat = coords[, "Y"]
    )
}

points_for_map <- bind_rows(
  make_map_df(mits_final, "Mitigation",  "Match_addr"),
  make_map_df(apps_final, "Application", "address")
) |>
  mutate(
    outcome = case_when(
      assignment_method == "direct_single_parcel"  ~ "Direct (single hit)",
      assignment_method == "direct_multi_resolved" ~ "Direct (multi-parcel resolved)",
      assignment_method == "nearest_high_addr"     ~ "Rescued (address match)",
      assignment_method == "nearest_high_dist"     ~ "Rescued (nearest centroid)",
      assignment_method == "unassigned"            ~ "Unassigned"
    )
  )

outcome_pal <- colorFactor(
  palette = c("Direct (single hit)"             = "#1B7837",
              "Direct (multi-parcel resolved)"  = "#5AAE61",
              "Rescued (address match)"         = "#2166AC",
              "Rescued (nearest centroid)"      = "#67A9CF",
              "Unassigned"                      = "#B2182B"),
  domain  = points_for_map$outcome
)

flood_boundary_wgs <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(4326)

leaflet() |>
  addProviderTiles(providers$CartoDB.Positron, group = "Light") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") |>

  # Flood study area
  addPolygons(
    data        = flood_boundary_wgs,
    fillColor   = "lightblue",
    fillOpacity = 0.15,
    color       = "navy",
    weight      = 1.5,
    opacity     = 0.6,
    group       = "Flood study area"
  ) |>

  # Assigned records — clustered for performance
  addCircleMarkers(
    data           = points_for_map |> filter(outcome != "Unassigned"),
    lng            = ~lon, lat = ~lat,
    radius         = 3,
    color          = ~outcome_pal(outcome),
    fillOpacity    = 0.7,
    stroke         = FALSE,
    label          = ~glue::glue("{side} #{unique_id} — {outcome}"),
    popup          = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>outcome:</b> {outcome}<br>",
      "<b>addr:</b> {label_addr}"
    ),
    clusterOptions = markerClusterOptions(maxClusterRadius = 40),
    group          = "Assigned records"
  ) |>

  # Unassigned records — unclustered and prominent
  addCircleMarkers(
    data         = points_for_map |> filter(outcome == "Unassigned"),
    lng          = ~lon, lat = ~lat,
    radius       = 6,
    color        = "#7C0024",
    fillColor    = "#B2182B",
    fillOpacity  = 0.9,
    stroke       = TRUE,
    weight       = 2,
    label        = ~glue::glue("{side} #{unique_id} — UNASSIGNED"),
    popup        = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>UNASSIGNED</b><br>",
      "<b>addr:</b> {label_addr}"
    ),
    group        = "Unassigned records"
  ) |>

  addLayersControl(
    baseGroups    = c("Light", "Satellite"),
    overlayGroups = c("Flood study area", "Assigned records", "Unassigned records"),
    options       = layersControlOptions(collapsed = FALSE)
  ) |>
  addLegend(
    "bottomright",
    pal     = outcome_pal,
    values  = points_for_map$outcome,
    title   = "Assignment outcome",
    opacity = 0.85
  )

Reading the map: Direct-hit records (green) and Stage 2 rescues (blue) make up the vast majority. Unassigned records (red) are concentrated in:

Rural eastern NC (Edgecombe County and similar) where rural-route box-number geocodes fell back to road centerlines rather than specific parcels
Highway corridors (US-264 near Belhaven, NC-33, US-117) where numbered highway addresses geocoded to ROW parcels
Subdivisions where the parcel layer lacks lot-level granularity (e.g., River Bend Drive in Burgaw)
Mobile home parks where the parcel data treats the entire mobile home park as one big land parcel rather than multiple unit-level parceles. (Greenbriar MHP in Wilson County)
Inland metro areas (Winston-Salem, Concord, Greenville) where the geocoded point landed in industrial or highway-adjacent zones with no high-priority residential parcel within 200 m

These categories represent genuine data limitations (geocoding precision, parcel-layer granularity, and large-parcel centroid distance) rather than matching-pipeline failures.

Step 5.1 (Alt): Direct Intersection — Mits & Apps to Parcels

Overview

1 Load inputs

2 Direct intersection join

3 Tag match status

4 Proximity to nearest high-priority parcel

5 Write outputs

6 Resolve: one-row-per-record output

Summary

Match status counts

Shared-footprint fan-out

Distance to nearest high-priority parcel (low-priority / off-parcel only)

Review browsers

Mitigations

Applications

Output schema

Fan-out outputs (alt_mits_direct.gpkg / alt_apps_direct.gpkg)

Resolved outputs (alt_mits_resolved.gpkg / alt_apps_resolved.gpkg)

Results

Hit-rate summary

Spatial distribution

Fan-out outputs (`alt_mits_direct.gpkg` / `alt_apps_direct.gpkg`)

Resolved outputs (`alt_mits_resolved.gpkg` / `alt_apps_resolved.gpkg`)