---
title: "Step 5.1 (Alt): Direct Intersection — Mits & Apps to Parcels"
author: "Russell Blessing"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    code-summary: "Show / hide code"
execute:
  echo: true
  warning: false
  message: false
---

## Overview

A transparent, no-heuristic spatial join of mitigation and application records
to parcels. Every input record is kept (no deduplication). A single point
representing, say, six mitigations on a condo building stays as six rows in
the output. Shared-footprint parcels fan out naturally: one input record that
intersects *N* parcel polygons produces *N* output rows.

This file is **complementary** to `05_spatial_match_app_mit.qmd`, which runs
the tiered priority cascade and produces the canonical matched output. Nothing
here changes that output.

**What this file does NOT do:**

- No tier cascade (high → med → low → fallback NN).
- No distance-based matching or buffering in the join itself.
- No dedup of mit/app records.
- No winner selection when a point hits multiple shared-footprint parcels.

**Inputs:**

- `parcels_pri.gpkg` — parcels with `luc_tier` column (from `out_dir`)
- `mit_newgeo.gpkg` — raw mitigations (`CRC_May_2025/Raw Data Inputs/raw_app_ncb/`)
- `applications_final.gpkg` — raw applications (same directory)

**Outputs (written to `out_dir`):**

- `alt_mits_direct.gpkg`, `alt_apps_direct.gpkg` — full fan-out spatial join
- `alt_mits_direct_review.csv`, `alt_apps_direct_review.csv` — same, no geometry

---

```{r libraries}
library(sf)
library(dplyr)
library(readr)
library(tidyr)
library(purrr)
library(knitr)
library(DT)
library(leaflet)
library(glue)
library(stringdist)
library(stringr)
library(nngeo)
library(RANN)
```

```{r setup}
out_dir <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

parcels_path <- file.path(out_dir, "parcels_pri.gpkg")
mits_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"
apps_path    <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"
flood_boundary_path <- "/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"

HIGH_TIER_MIN <- 3L

mit_id_col <- "unique_id"
app_id_col <- "unique_id"
```

---

## 1  Load inputs

```{r load-inputs}
parcels <- st_read(parcels_path, quiet = TRUE) |>
  select(
    parcel_index, geometry_id, cntyfips, parno, altparno,
    siteadd, scity, szip, mailadd, ownname, parusedesc,
    `LAND.USE.CODE`, luc_tier,
    `APN.SEQUENCE.NUMBER`,
    `SITUS.HOUSE.NUMBER`, `SITUS.STREET.NAME`,
    `OWNER.1.FULL.NAME`, `YEAR.BUILT`,
    `ASSESSED.TOTAL.VALUE`,
    `IMPROVEMENT.VALUE.CALCULATED`   
  ) |>
  st_transform(32119)

# Flood study area boundary
flood_boundary <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(st_crs(parcels))

# Read raw mits/apps, clip to flood study area via st_within, then assign
# unique_id within the clipped set. Matches the canonical Step 5 SA filter.
# Do NOT dedup — every HMA / IHP record is preserved, coordinate duplicates included.
mits <- st_read(mits_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())

apps <- st_read(apps_path, quiet = TRUE) |>
  st_transform(st_crs(parcels)) |>
  st_filter(flood_boundary, .predicate = st_within) |>
  mutate(unique_id = row_number())
```

---

## 2  Direct intersection join

One row per (input record × intersecting parcel). Records that fall outside
every parcel produce one row with NA parcel columns (`left = TRUE`).

```{r direct-join}
mits_direct <- st_join(mits, parcels, join = st_intersects, left = TRUE)
apps_direct <- st_join(apps, parcels, join = st_intersects, left = TRUE)
```

---

## 3  Tag match status

```{r tag-status}
tag_status <- function(df, high_tier_min = HIGH_TIER_MIN) {
  df |>
    mutate(
      luc_tier_num = suppressWarnings(as.integer(luc_tier)),
      match_status = case_when(
        is.na(parcel_index)           ~ "no_parcel_hit",
        luc_tier_num >= high_tier_min ~ "matched_high",
        TRUE                          ~ "matched_low"
      )
    )
}

mits_direct <- tag_status(mits_direct)
apps_direct <- tag_status(apps_direct)
```

---

## 4  Proximity to nearest high-priority parcel

For every input point, compute distance to the nearest high-priority parcel.
This is for **review only** — it does not change any match result.

```{r proximity-to-high}
high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

nearest_high <- function(pts, id_col, high) {
  if (nrow(pts) == 0 || nrow(high) == 0) {
    return(tibble(
      !!id_col := character(),
      nearest_high_parcel_index = integer(),
      dist_to_nearest_high_m    = numeric()
    ))
  }
  nn_idx <- st_nearest_feature(pts, high)
  tibble(
    !!id_col := pts[[id_col]],
    nearest_high_parcel_index = high$parcel_index[nn_idx],
    dist_to_nearest_high_m    = as.numeric(
      st_distance(pts, high[nn_idx, ], by_element = TRUE)
    )
  )
}

mits_nearhigh <- nearest_high(mits, mit_id_col, high_parcels)
apps_nearhigh <- nearest_high(apps, app_id_col, high_parcels)

mits_direct <- mits_direct |> left_join(mits_nearhigh, by = mit_id_col)
apps_direct <- apps_direct |> left_join(apps_nearhigh, by = app_id_col)
```

---

## 5  Write outputs

```{r write-outputs}
st_write(mits_direct, file.path(out_dir, "alt_mits_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_direct, file.path(out_dir, "alt_apps_direct.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_direct_review.csv"))
apps_direct |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_direct_review.csv"))
```

---

## 6  Resolve: one-row-per-record output

The fan-out outputs above preserve every (input record × parcel) combination
for inspection. This section produces a deduplicated, one-row-per-input-record
version suitable for downstream analysis and supervisor-facing counts.

**Assignment method buckets:**

| `assignment_method` | Meaning |
|---|---|
| `direct_single_parcel` | Input record intersected exactly one parcel — no judgment needed |
| `direct_multi_resolved` | Multi-parcel hit (condo, recorder dup) — winner picked by four-level tiebreaker |
| `nearest_high_addr` | No parcel hit — rescued by address-similarity match within distance cap |
| `nearest_high_dist` | No parcel hit, no good address match — assigned to nearest high-priority parcel within cap |
| `unassigned` | No parcel hit, no candidate within cap (out-of-scope records, e.g. western mountain counties) |

> **Methods note — multi-unit buildings:** For apartment complexes and
> condominium buildings where all candidate parcels share a complex-level site
> address (e.g. 95 units sharing one street address but distinct APNs),
> address-similarity is non-discriminating across candidates and the winner is
> determined by the APN sequence / improvement value tiebreakers. The selected
> parcel is reproducible across runs but should be interpreted as "a
> representative unit within the matched building" rather than "the specific
> unit the record corresponds to." Unit-level disambiguation would require
> parsing unit numbers from the input record's address and matching against
> per-unit parcel records, which CoreLogic's `siteadd` field does not
> consistently provide.

```{r resolve-setup}
library(stringdist)

norm_addr <- function(x) {
  coalesce(x, "") |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9 ]", " ") |>
    str_squish()
}

addr_sim <- function(a, b) {
  na <- norm_addr(a); nb <- norm_addr(b)
  ifelse(na == "" | nb == "", 0, stringsim(na, nb, method = "jw"))
}

# --- Tunables ---------------------------------------------------------------
SIM_THRESHOLD <- 0.70   # min Jaro-Winkler to call an address match a "real" hit
DIST_CAP_M    <- 200    # max metres allowed for a distance-only fallback
K_CANDIDATES  <- 10     # nearest high-priority parcels to evaluate per no-hit

# Columns to preserve from input side even if they share a name with a parcel
# column. Populate after running check-column-overlaps if genuine input-side
# fields are being silently dropped by finalize().
MIT_KEEP_FROM_INPUT <- character()
APP_KEEP_FROM_INPUT <- character()

high_parcels <- parcels |> filter(as.integer(luc_tier) >= HIGH_TIER_MIN)

# Centroid version for RANN KNN. st_coordinates() on polygon geometry returns
# edge vertices, producing misaligned coordinate arrays; centroids give one
# point per row, matching the row-index contract for nn$nn.idx.
# high_parcels and high_centroids share the same row order.
high_centroids <- high_parcels |> st_centroid()
```

```{r check-column-overlaps}
# Sanity check: any column names shared between input-side data (mits/apps)
# and parcels will be silently dropped by finalize() unless added to
# MIT_KEEP_FROM_INPUT / APP_KEEP_FROM_INPUT in resolve-setup.
parcel_attr_cols <- names(st_drop_geometry(parcels))

cat("Parcel columns also present in mits_direct (will be dropped from base):\n")
print(intersect(names(st_drop_geometry(mits_direct)), parcel_attr_cols))

cat("\nParcel columns also present in apps_direct (will be dropped from base):\n")
print(intersect(names(st_drop_geometry(apps_direct)), parcel_attr_cols))
```

```{r stage1-multiparcel-pick}
# Stage 1: For each input record that intersected one or more parcels,
# pick a single winner. Tiebreaking mirrors canonical Step 5:
#   1. Jaro-Winkler address similarity (decisive when addresses differ)
#   2. APN.SEQUENCE.NUMBER ascending (primary recorder entry)
#   3. luc_tier descending (most specific land use)
#   4. IMPROVEMENT.VALUE.CALCULATED descending (parcel with structure)
pick_winner <- function(direct_df, addr_col, id_col) {
  direct_df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    mutate(
      addr_sim  = addr_sim(.data[[addr_col]], siteadd),
      apn_seq   = suppressWarnings(as.integer(`APN.SEQUENCE.NUMBER`)),
      improvval = suppressWarnings(as.numeric(`IMPROVEMENT.VALUE.CALCULATED`))
    ) |>
    group_by(.data[[id_col]]) |>
    mutate(n_parcels = n()) |>
    arrange(
      desc(addr_sim),
      apn_seq,
      desc(as.integer(luc_tier)),
      desc(improvval)
    ) |>
    slice(1) |>
    ungroup() |>
    select(all_of(id_col),
           winner_parcel_index = parcel_index,
           winner_addr_sim     = addr_sim,
           winner_apn_seq      = apn_seq,
           winner_n_parcels    = n_parcels)
}

mit_winners <- pick_winner(mits_direct, "raw_addr", mit_id_col)
app_winners <- pick_winner(apps_direct, "address",  app_id_col)
```

```{r stage2-nohit-reassign}
# Stage 2: For each no-parcel-hit record, find K nearest high-priority parcels
# and pick by best address similarity if available, else nearest.
reassign_nohit <- function(direct_df, addr_col, id_col,
                           high       = high_parcels,
                           high_pts   = high_centroids,
                           k          = K_CANDIDATES,
                           sim_thresh = SIM_THRESHOLD,
                           dist_cap   = DIST_CAP_M) {
  nohit <- direct_df |>
    filter(is.na(parcel_index)) |>
    distinct(.data[[id_col]], .keep_all = TRUE)

  if (nrow(nohit) == 0) return(tibble())

  nohit_coords <- st_coordinates(nohit)[, c("X", "Y")]
  high_coords  <- st_coordinates(high_pts)[, c("X", "Y")]   # <-- high_pts, not high

  nn <- RANN::nn2(high_coords, nohit_coords, k = k)

  cand <- tibble(
    !!id_col    := rep(nohit[[id_col]], each = k),
    input_addr  = rep(nohit[[addr_col]], each = k),
    cand_pos    = rep(seq_len(k), nrow(nohit)),
    cand_row    = as.vector(t(nn$nn.idx)),
    cand_dist_m = as.vector(t(nn$nn.dists))
  ) |>
    mutate(
      cand_parcel_index = high$parcel_index[cand_row],
      cand_siteadd      = high$siteadd[cand_row],
      cand_addr_sim     = addr_sim(input_addr, cand_siteadd)
    )

  # Pick winner per input record:
  #   1. best address match if sim >= threshold AND within distance cap
  #   2. else nearest within distance cap
  #   3. else unassigned
  cand |>
    group_by(.data[[id_col]]) |>
    summarise(
      best_addr_sim    = max(cand_addr_sim),
      best_addr_dist   = cand_dist_m[which.max(cand_addr_sim)],
      best_addr_parcel = cand_parcel_index[which.max(cand_addr_sim)],
      nearest_dist     = min(cand_dist_m),
      nearest_parcel   = cand_parcel_index[which.min(cand_dist_m)],
      .groups = "drop"
    ) |>
    mutate(
      assigned_parcel_index = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ best_addr_parcel,
        nearest_dist  <= dist_cap                                ~ nearest_parcel,
        TRUE                                                      ~ NA_integer_
      ),
      assignment_method = case_when(
        best_addr_sim >= sim_thresh & best_addr_dist <= dist_cap ~ "nearest_high_addr",
        nearest_dist  <= dist_cap                                ~ "nearest_high_dist",
        TRUE                                                      ~ "unassigned"
      ),
      assignment_dist_m = if_else(assignment_method == "nearest_high_addr",
                                  best_addr_dist, nearest_dist),
      assignment_addr_sim = if_else(assignment_method == "nearest_high_addr",
                                    best_addr_sim, NA_real_)
    )
}

mit_reassign <- reassign_nohit(mits_direct, "raw_addr", mit_id_col)
app_reassign <- reassign_nohit(apps_direct, "address",  app_id_col)
```

```{r final-resolved}
finalize <- function(direct_df, winners, reassign, id_col,
                     keep_from_input = character()) {
  # All parcel-attribute columns that were folded in by the earlier st_join.
  # Drop them cleanly — the final left_join below re-attaches them via
  # assigned_parcel_index. Any input-side columns the caller wants to
  # preserve despite sharing a name with a parcel column are excluded.
  parcel_cols <- setdiff(names(st_drop_geometry(parcels)), keep_from_input)

  base <- direct_df |>
    distinct(.data[[id_col]], .keep_all = TRUE) |>
    select(-any_of(parcel_cols))

  base |>
    left_join(winners, by = id_col) |>
    left_join(
      reassign |> rename(reassign_method = assignment_method),
      by = id_col
    ) |>
    mutate(
      # left_join (not inner) is load-bearing: records absent from winners
      # (no parcel hit) get winner_parcel_index = NA, then coalesce falls
      # through to the Stage 2 assigned_parcel_index.
      assigned_parcel_index = coalesce(winner_parcel_index, assigned_parcel_index),
      assignment_method = case_when(
        !is.na(winner_parcel_index) & winner_n_parcels == 1 ~ "direct_single_parcel",
        !is.na(winner_parcel_index) & winner_n_parcels >  1 ~ "direct_multi_resolved",
        !is.na(reassign_method)                              ~ reassign_method,
        TRUE                                                  ~ "unassigned"
      )
    ) |>
    select(-reassign_method) |>
    left_join(parcels |> st_drop_geometry(),
              by = c("assigned_parcel_index" = "parcel_index"))
}

mits_final <- finalize(mits_direct, mit_winners, mit_reassign, mit_id_col,
                       keep_from_input = MIT_KEEP_FROM_INPUT)
apps_final <- finalize(apps_direct, app_winners, app_reassign, app_id_col,
                       keep_from_input = APP_KEEP_FROM_INPUT)

# Diagnostics
cat("\n=== Mitigations: assignment method distribution ===\n")
mits_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()

cat("\n=== Applications: assignment method distribution ===\n")
apps_final |> st_drop_geometry() |> count(assignment_method, sort = TRUE) |> print()
```

```{r write-resolved}
st_write(mits_final, file.path(out_dir, "alt_mits_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(apps_final, file.path(out_dir, "alt_apps_resolved.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

mits_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_mits_resolved_review.csv"))
apps_final |> st_drop_geometry() |>
  write_csv(file.path(out_dir, "alt_apps_resolved_review.csv"))
```

---

## Summary

### Match status counts

Row counts reflect the fan-out: a single input record landing on *N* parcel
polygons contributes *N* rows here.

```{r summary-counts}
bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  count(side, match_status) |>
  pivot_wider(names_from = match_status, values_from = n, values_fill = 0L) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Row counts by side and match status (rows = input record × parcel hit)"
  )
```

### Shared-footprint fan-out

How often does a single input record intersect multiple parcel polygons?
Condo complexes and stacked parcels show up as `n_parcels >= 2`.

```{r multihit-diagnostic}
multihit <- function(df, id_col, label) {
  df |>
    st_drop_geometry() |>
    filter(!is.na(parcel_index)) |>
    count(.data[[id_col]], name = "n_parcels") |>
    count(n_parcels, name = "n_records") |>
    mutate(side = label)
}

bind_rows(
  multihit(mits_direct, mit_id_col, "mitigations"),
  multihit(apps_direct, app_id_col, "applications")
) |>
  pivot_wider(names_from = side, values_from = n_records, values_fill = 0L) |>
  arrange(n_parcels) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    caption = "Distribution of parcel-hit counts per input record (n_parcels > 1 = shared-footprint)"
  )
```

### Distance to nearest high-priority parcel (low-priority / off-parcel only)

If most low-hit records are within ~30 m of a high-priority parcel the low
classification is likely a geocode artefact, not a meaningful difference in
parcel type.

```{r proximity-distribution}
bind_rows(
  mits_direct |> st_drop_geometry() |> mutate(side = "mitigations"),
  apps_direct |> st_drop_geometry() |> mutate(side = "applications")
) |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  group_by(side, match_status) |>
  summarise(
    n      = n(),
    p10    = round(quantile(dist_to_nearest_high_m, 0.10, na.rm = TRUE), 1),
    median = round(median(dist_to_nearest_high_m,          na.rm = TRUE), 1),
    p90    = round(quantile(dist_to_nearest_high_m, 0.90, na.rm = TRUE), 1),
    max    = round(max(dist_to_nearest_high_m,             na.rm = TRUE), 1),
    .groups = "drop"
  ) |>
  knitr::kable(
    caption = "Distance (m) from low-priority / no-hit records to nearest high-priority parcel"
  )
```

---

## Review browsers

Records that landed on a low-priority LUC or off-parcel entirely, sorted
ascending by distance to the nearest high-priority parcel. Rows near the top
are the most suspicious (close to a high-priority parcel but not on one).

### Mitigations

```{r review-browser-mits}
#| column: page

mits_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Mitigations on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )
```

### Applications

```{r review-browser-apps}
#| column: page

apps_direct |>
  st_drop_geometry() |>
  filter(match_status %in% c("matched_low", "no_parcel_hit")) |>
  arrange(dist_to_nearest_high_m) |>
  DT::datatable(
    filter   = "top",
    rownames = FALSE,
    options  = list(pageLength = 25, scrollX = TRUE),
    caption  = "Applications on low-priority LUCs or off-parcel, sorted by distance to nearest high-priority parcel."
  )
```

---

## Output schema

### Fan-out outputs (`alt_mits_direct.gpkg` / `alt_apps_direct.gpkg`)

One row per (input record × intersecting parcel). Use for shared-footprint
inspection and canonical Step 5 comparison.

Each row of `alt_mits_direct.gpkg` / `alt_apps_direct.gpkg`:

| Column | Source | Notes |
|---|---|---|
| *(all original mit/app columns)* | input | unchanged |
| `parcel_index` … `IMPROVEMENT.VALUE.CALCULATED` | parcels | NA when `no_parcel_hit` |
| `luc_tier_num` | derived | `as.integer(luc_tier)` |
| `match_status` | derived | `matched_high` / `matched_low` / `no_parcel_hit` |
| `nearest_high_parcel_index` | derived | `parcel_index` of closest high-priority parcel |
| `dist_to_nearest_high_m` | derived | metres; review only, not used for matching |

### Resolved outputs (`alt_mits_resolved.gpkg` / `alt_apps_resolved.gpkg`)

One row per input record. Use for downstream analysis and supervisor-facing
counts.

| Column | Source | Notes |
|---|---|---|
| *(all original mit/app columns)* | input | unchanged |
| `assigned_parcel_index` | derived | winner from Stage 1 or Stage 2; NA if unassigned |
| `assignment_method` | derived | see bucket table in Section 6 |
| `winner_parcel_index` | Stage 1 | NA for no-hit records |
| `winner_addr_sim` | Stage 1 | Jaro-Winkler score of winning address match |
| `winner_n_parcels` | Stage 1 | how many parcels the input point intersected |
| `assignment_dist_m` | Stage 2 | metres to assigned parcel; NA for Stage 1 records |
| `assignment_addr_sim` | Stage 2 | Jaro-Winkler score; NA for distance-only assignments |
| `parcel_index` … `IMPROVEMENT.VALUE.CALCULATED` | parcels | re-joined via `assigned_parcel_index`; NA if unassigned |

---

## Results

The Step 5 alt direct-match pipeline produces a one-to-one assignment of every
mit/app record (within the flood study area) to a parcel, where possible. The
table below summarizes hit rate by assignment method, and the leaflet visualizes
spatial distribution of every record by outcome.

### Hit-rate summary

```{r results-summary-table}
library(knitr)
library(scales)

build_summary <- function(df, side_label) {
  total <- nrow(df)
  am <- df |> st_drop_geometry() |> count(assignment_method)
  get_n <- function(method) {
    v <- am |> filter(assignment_method == method) |> pull(n)
    if (length(v) == 0) 0L else v
  }

  direct_single <- get_n("direct_single_parcel")
  direct_multi  <- get_n("direct_multi_resolved")
  stage2_addr   <- get_n("nearest_high_addr")
  stage2_dist   <- get_n("nearest_high_dist")
  unassigned    <- get_n("unassigned")
  assigned      <- total - unassigned

  tibble(
    Metric = c(
      "Total records (within HUC-6 study area)",
      "  Direct hit — single parcel",
      "  Direct hit — multi-parcel resolved (Stage 1 tiebreaker)",
      "  Stage 2 rescue — address match",
      "  Stage 2 rescue — nearest centroid (\u2264 200 m)",
      "  Unassigned — no high-priority parcel within 200 m",
      "**Total assigned to a parcel**",
      "**Hit rate**"
    ),
    !!side_label := c(
      comma(total),
      comma(direct_single),
      comma(direct_multi),
      comma(stage2_addr),
      comma(stage2_dist),
      comma(unassigned),
      comma(assigned),
      sprintf("%.2f %%", 100 * assigned / total)
    )
  )
}

mit_summary <- build_summary(mits_final, "Mitigations")
app_summary <- build_summary(apps_final, "Applications")

results_table <- mit_summary |>
  left_join(app_summary, by = "Metric")

knitr::kable(
  results_table,
  align   = c("l", "r", "r"),
  caption = "Hit rate of mit/app-to-parcel join. Indented rows are subcategories of \"Total records\"; totals at bottom are bolded."
)
```

The two-stage pipeline assigns nearly every record to a parcel. Direct intersection
(Stage 1) handles the bulk of the workload; Stage 2's nearest-centroid rescue
captures records whose geocoded location fell on or near a non-residential parcel
(curb, ROW, adjacent commercial) while the intended residential parcel was within
200 m. The small unassigned residual represents genuine data-coverage limitations
discussed in the methodology section.

### Spatial distribution

The map below plots every mit and app record by assignment outcome. Successful
records are clustered at lower zoom levels for performance; unassigned records
are shown individually in red so they remain visible at any zoom.

```{r results-leaflet}
library(leaflet)

# Combine mits + apps into a single point dataset with outcome label
make_map_df <- function(df, side, addr_col) {
  coords <- df |> st_transform(4326) |> st_coordinates()
  df |>
    st_drop_geometry() |>
    transmute(
      unique_id,
      side,
      label_addr = .data[[addr_col]],
      assignment_method,
      lon = coords[, "X"],
      lat = coords[, "Y"]
    )
}

points_for_map <- bind_rows(
  make_map_df(mits_final, "Mitigation",  "Match_addr"),
  make_map_df(apps_final, "Application", "address")
) |>
  mutate(
    outcome = case_when(
      assignment_method == "direct_single_parcel"  ~ "Direct (single hit)",
      assignment_method == "direct_multi_resolved" ~ "Direct (multi-parcel resolved)",
      assignment_method == "nearest_high_addr"     ~ "Rescued (address match)",
      assignment_method == "nearest_high_dist"     ~ "Rescued (nearest centroid)",
      assignment_method == "unassigned"            ~ "Unassigned"
    )
  )

outcome_pal <- colorFactor(
  palette = c("Direct (single hit)"             = "#1B7837",
              "Direct (multi-parcel resolved)"  = "#5AAE61",
              "Rescued (address match)"         = "#2166AC",
              "Rescued (nearest centroid)"      = "#67A9CF",
              "Unassigned"                      = "#B2182B"),
  domain  = points_for_map$outcome
)

flood_boundary_wgs <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(4326)

leaflet() |>
  addProviderTiles(providers$CartoDB.Positron, group = "Light") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") |>

  # Flood study area
  addPolygons(
    data        = flood_boundary_wgs,
    fillColor   = "lightblue",
    fillOpacity = 0.15,
    color       = "navy",
    weight      = 1.5,
    opacity     = 0.6,
    group       = "Flood study area"
  ) |>

  # Assigned records — clustered for performance
  addCircleMarkers(
    data           = points_for_map |> filter(outcome != "Unassigned"),
    lng            = ~lon, lat = ~lat,
    radius         = 3,
    color          = ~outcome_pal(outcome),
    fillOpacity    = 0.7,
    stroke         = FALSE,
    label          = ~glue::glue("{side} #{unique_id} — {outcome}"),
    popup          = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>outcome:</b> {outcome}<br>",
      "<b>addr:</b> {label_addr}"
    ),
    clusterOptions = markerClusterOptions(maxClusterRadius = 40),
    group          = "Assigned records"
  ) |>

  # Unassigned records — unclustered and prominent
  addCircleMarkers(
    data         = points_for_map |> filter(outcome == "Unassigned"),
    lng          = ~lon, lat = ~lat,
    radius       = 6,
    color        = "#7C0024",
    fillColor    = "#B2182B",
    fillOpacity  = 0.9,
    stroke       = TRUE,
    weight       = 2,
    label        = ~glue::glue("{side} #{unique_id} — UNASSIGNED"),
    popup        = ~glue::glue(
      "<b>{side} #{unique_id}</b><br>",
      "<b>UNASSIGNED</b><br>",
      "<b>addr:</b> {label_addr}"
    ),
    group        = "Unassigned records"
  ) |>

  addLayersControl(
    baseGroups    = c("Light", "Satellite"),
    overlayGroups = c("Flood study area", "Assigned records", "Unassigned records"),
    options       = layersControlOptions(collapsed = FALSE)
  ) |>
  addLegend(
    "bottomright",
    pal     = outcome_pal,
    values  = points_for_map$outcome,
    title   = "Assignment outcome",
    opacity = 0.85
  )
```

**Reading the map:** Direct-hit records (green) and Stage 2 rescues (blue) make
up the vast majority. Unassigned records (red) are concentrated in:

- Rural eastern NC (Edgecombe County and similar) where rural-route box-number
  geocodes fell back to road centerlines rather than specific parcels
- Highway corridors (US-264 near Belhaven, NC-33, US-117) where numbered
  highway addresses geocoded to ROW parcels
- Subdivisions where the parcel layer lacks lot-level granularity (e.g., River
  Bend Drive in Burgaw)
- Mobile home parks where the parcel data treats the entire mobile home park as one big land parcel rather than multiple unit-level   parceles. (Greenbriar MHP in Wilson County)
- Inland metro areas (Winston-Salem, Concord, Greenville) where the geocoded
  point landed in industrial or highway-adjacent zones with no high-priority
  residential parcel within 200 m

These categories represent genuine data limitations (geocoding precision,
parcel-layer granularity, and large-parcel centroid distance) rather than
matching-pipeline failures.