---
title: "Step 5: Mitigations to Parcels"
author: "Russell Blessing"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    code-summary: "Show / hide code"
execute:
  echo: true
  warning: false
  message: false
---

## Overview

This notebook spatially matches mitigation and application records to parcels
using a tiered priority scheme. Parcels are scored and ranked as high, medium,
or low priority for matching based on land use code and address quality.
For each priority tier, mitigation/application geometries are first matched via
direct intersection with parcel polygons, then by nearest-neighbor distance
(≤ 50 m) for those that do not intersect. Unmatched records after all three
tiers fall back to a global nearest-neighbor match across all parcels.

**Input files (Russell's):**

- `nc1_cl_merge.csv` — parcel–CoreLogic merged table (Step 4 output)
- `parcels_study_area.gpkg` — OneMap parcel geometries with `geometry_id`
- `mit_newgeo.gpkg` — mitigations (re-geocoded)
- `mit_elevation_qaqc_filled.csv` — elevation QAQC scores for mitigations
- `applications_final.gpkg` — FEMA applications
- `NC_huc6_data_union_sf.shp` — flood study area boundary

**External files (written to `out_dir`):**

- `parcels_matcheable.gpkg` — filtered parcels before priority scoring
- `parcels_pri.gpkg` — parcels with priority scores
- `mit_highpri.csv`, `mit_medpri.csv`, `mit_lowpri.csv`, `mit_dist.csv`
- `app_highpri.csv`, `app_medpri.csv`, `app_lowpri.csv`, `app_dist.csv`
- `app_sa_full_noqa.gpkg`, `mit_sa_full_noqa.gpkg`
- `apps_pcls.gpkg`, `mits_pcls.gpkg`

```{r libraries}
library(sf)
library(dplyr)
library(readr)
library(stringr)
library(stringdist)
library(purrr)
library(tidyr)
```

```{r setup}
out_dir  <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

# External data paths
flood_boundary_path <- "/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"
mit_path            <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"
mit_elev_path       <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_elevation_qaqc_filled.csv"
app_path            <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"
geoms_path          <- file.path(out_dir, "parcels_study_area.gpkg")
```

---

## 1  Load & prepare parcels

```{r load-parcels}
#| eval: true

parcels_raw <- read_csv(
  file.path(out_dir, "nc1_cl_merge.csv"),
  col_types = cols(.default = col_character())
) |>
  mutate(parcel_index = as.integer(parcel_index))

nc1_geoms <- st_read(geoms_path, quiet = TRUE) |>
  mutate(
    parcel_index = as.integer(parcel_index),
    geometry_id  = as.integer(factor(as.character(geom)))
  ) |>
  select(parcel_index, geometry_id)

parcels <- parcels_raw |>
  select(-any_of(c("geometry_id1", "index_right", "geometry_id", "geometry"))) |>
  left_join(st_drop_geometry(nc1_geoms), by = "parcel_index") |>
  left_join(nc1_geoms |> select(parcel_index, geometry = geom), by = "parcel_index") |>
  st_as_sf(crs = st_crs(nc1_geoms))
```

---

## 2  Filter invalid parcels

```{r filter-parcels}
#| eval: true

# Remove water parcels (flagged via parcel number keywords)
water_keywords <- "bay|creek|sound|water|ocean|river|canal|pond|inlet"

parcels <- parcels |>
  filter(!str_detect(parno, regex(water_keywords, ignore_case = TRUE)))

# Flag and remove invalid parcels
parcels <- parcels |>
  mutate(
    invalidparcels = case_when(
      # No identifying info at all
      is.na(`LAND.USE.CODE`) & parno == "" & is.na(siteadd) & mailadd == "" ~ 1L,
      # Carteret: no land use code, blank parno → likely right-of-way
      is.na(`LAND.USE.CODE`) & parno == "" & cntyfips == "031" ~ 1L,
      # Brunswick: no parno or altparno → right-of-way
      is.na(`LAND.USE.CODE`) & parno == "" & altparno == "" & cntyfips == "019" ~ 1L,
      # Explicit ROW tags (excluding "ROWAN COUNTY")
      (str_detect(parno,    regex("row", ignore_case = TRUE)) |
       str_detect(altparno, regex("row", ignore_case = TRUE))) &
        parno != "ROWAN COUNTY" ~ 1L,
      TRUE ~ 0L
    )
  ) |>
  filter(invalidparcels == 0L)

# Save pre-priority filtered parcels for diagnostic mapping
st_write(parcels |> select(-invalidparcels),
         file.path(out_dir, "parcels_matcheable.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

parcels <- parcels |> select(-invalidparcels)
```

---

## 3  Parcel priority scoring

Land use codes that define each priority bucket:

```{r luc-definitions}
luc_vacant      <- c("465","460","455","454","453","452","450","400")
luc_gov         <- c("604","603","601","600")     # +600 (E-Exempt) — parallels 601/603/604
luc_residential <- c("165","163","160","151","138","137","136","135","134","133",
                      "132","131","117","116","115","113","112","111","109","106",
                      "103","102","100")
luc_fr_1_4      <- c("165","163","160","151","138","137","136","135",
                      "134","133",                # +133, 134 (triplex, quadruplex — definitional fix)
                      "132","115",
                      "113","112","111",          # +111–113 (mobile/manufactured/modular — FEMA priority)
                      "102","100")
luc_buyout      <- c("602")                       # Hindsight — completed historical buyouts (LUC 602)
luc_top         <- c(luc_vacant, luc_gov, luc_residential, luc_buyout)
luc_bottom      <- c("283",                       # +283 (commercial supermarket)
                      "605","606","607","609","610","614","615","618","620","630",
                      "800","818","839","850","860","872","875","877","880","883",
                      "885","886","114")
```

**Scoring change from original Python script**

The original script collapsed vacant, government-owned, and all residential
parcels into a single `qual_luc = 3` bucket, meaning a 1–4 family home and
a municipal lot were treated identically. It also weighted site address and
mailing address equally (1 pt each), and the composite `qual_sum` only
ranged 0–5, making the high/med/low thresholds somewhat arbitrary.

The revised scheme, with hindsight-specific elevation for completed buyouts:

| LUC group | `luc_tier` | `match_priority` | Rationale |
|---|---|---|---|
| Completed buyouts (`luc_buyout`, LUC 602) | 3 | high (always) | Hindsight: confirmed historical buyout — parcel converted to exempt status *because of* the program we're studying |
| 1–4 family residential (`luc_fr_1_4`) | 4 | high if site address present | Primary buyout/mitigation candidate; includes mobile/manufactured/modular housing per FEMA HMA priorities |
| Vacant land (`luc_vacant`) | 3 | med | High candidate, no structure risk |
| Gov-owned (`luc_gov`) | 3 | med | Eligible but less common |
| Other residential (`luc_residential \ luc_fr_1_4`) | 2 | med if qual_sum ≥ 5, else low | Multi-family / condo / apartment — rarely targeted |
| Unknown LUC | 1 | low | Insufficient info |
| Commercial/industrial (`luc_bottom`) | 0 | low | Unlikely candidates |

Site address is weighted **2 pts** (more useful for geocoding/matching than
mailing address), mailing address **1 pt**, giving a `qual_sum` range of
0–7 and more granular priority tiers.

**`luc_fr_1_4` membership reflects FEMA HMA program priorities** rather
than CoreLogic's literal numeric subgroupings:

- LUC 111 ("Mobile home"), 112 ("Manufactured home"), 113 ("Modular home")
  are included to reflect FEMA's documented emphasis on mobile/manufactured/
  modular housing in flood-prone areas.
- LUC 133 ("Triplex") and 134 ("Quadruplex") added for definitional
  consistency with the "1–4 family" label — both are 2–4 unit residential
  and were missing from the original list.

**Code list corrections from the original Python:**

- The `gov` list had a missing comma bug (`['604' '603', '601']` → the
  string `'604603'`), so no parcel ever matched it. Fixed here.
- LUC 600 ("E-Exempt") added to `luc_gov` — parallels the existing
  601/603/604 exempt-government codes.
- LUC 283 (CoreLogic commercial) added to `luc_bottom`.
- LUC 602 ("Exempt Governmental (Flood Buyout)") added as a separate
  `luc_buyout` flag. Treated as tier 3 in `luc_tier` (current state =
  exempt government) and elevated to `match_priority = "high"` for
  the hindsight analysis — a mit/app record landing on a 602 parcel
  is a confirmed historical buyout, the strongest possible match signal.

```{r score-parcels}
#| eval: true

parcels <- parcels |>
  mutate(
    `LAND.USE.CODE` = as.character(`LAND.USE.CODE`),

    # Land use flags (kept for downstream use)
    vacant              = if_else(`LAND.USE.CODE` %in% luc_vacant,      1L, 0L),
    gov_lu              = if_else(`LAND.USE.CODE` %in% luc_gov,         1L, 0L),
    res_lu              = if_else(`LAND.USE.CODE` %in% luc_residential, 1L, 0L),
    fr_1_4              = if_else(`LAND.USE.CODE` %in% luc_fr_1_4,      1L, 0L),
    is_completed_buyout = `LAND.USE.CODE` %in% luc_buyout,

    # Address quality: site address weighted 2x, mail address 1x
    qual_siteadd = if_else(!is.na(siteadd) & siteadd != "", 2L, 0L),
    qual_mailadd = if_else(!is.na(mailadd) & mailadd != "", 1L, 0L),

    # Tiered LUC score (0–4) — reflects current CoreLogic classification
    luc_tier = case_when(
      `LAND.USE.CODE` %in% luc_fr_1_4                                        ~ 4L,
      `LAND.USE.CODE` %in% luc_vacant                                         ~ 3L,
      `LAND.USE.CODE` %in% luc_gov                                            ~ 3L,
      `LAND.USE.CODE` %in% luc_buyout                                         ~ 3L,
      `LAND.USE.CODE` %in% luc_residential & !`LAND.USE.CODE` %in% luc_fr_1_4 ~ 2L,
      `LAND.USE.CODE` %in% luc_bottom                                         ~ 0L,
      TRUE                                                                     ~ 1L
    ),

    # Composite score (0–7) and priority tier
    qual_sum = luc_tier + qual_siteadd + qual_mailadd,
    match_priority = case_when(
      is_completed_buyout                ~ "high",   # Hindsight: confirmed historical buyout
      luc_tier == 4 & qual_siteadd == 2  ~ "high",   # 1–4 family + site address
      luc_tier >= 3 | qual_sum >= 5      ~ "med",
      TRUE                               ~ "low"
    )
  )

# Save prioritised parcel layer
st_write(parcels, file.path(out_dir, "parcels_pri.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
```

```{r priority-split}
#| eval: true

# Split into priority tiers
highpri <- parcels |> filter(match_priority == "high")
medpri  <- parcels |> filter(match_priority == "med")
lowpri  <- parcels |> filter(match_priority == "low")
```

---

## 4  Load & prepare mitigations and applications

```{r load-flood-boundary}
flood_boundary <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(crs = 32119)
```

```{r load-mit}
mit_gdf <- st_read(mit_path, quiet = TRUE) |>
  st_transform(crs = st_crs(flood_boundary))

# Merge elevation QAQC scores
mit_elev <- read_csv(mit_elev_path, col_types = cols(.default = col_character())) |>
  rename_with(str_trim)

mit_gdf <- mit_gdf |>
  left_join(mit_elev |> 
  select(record_ID, elevation_score), by = "record_ID")

# Clip to study area
mit_gdf <- mit_gdf |>
  mutate(Inside_SA = as.integer(lengths(st_within(st_geometry(mit_gdf), flood_boundary)) > 0))

mit_sa <- mit_gdf |> filter(Inside_SA == 1)
```

```{r load-app}
app <- st_read(app_path, quiet = TRUE) |>
  st_transform(crs = st_crs(flood_boundary))

app <- app |>
  mutate(
    app_id    = row_number(),
    Inside_SA = as.integer(lengths(st_within(st_geometry(app), flood_boundary)) > 0)
  )

app_sa <- app |> filter(Inside_SA == 1)
```

```{r unique-geoms-setup}
mit_geom_col <- attr(mit_sa, "sf_column")
app_geom_col <- attr(app_sa, "sf_column")

mit_sa <- mit_sa |>
  mutate(g_indicator = as.integer(factor(as.character(st_geometry(mit_sa)))))

app_sa <- app_sa |>
  mutate(
    g_indicator = as.integer(factor(as.character(st_geometry(app_sa)))),
    app_id      = row_number()
  )

unique_mit <- mit_sa |> distinct(g_indicator, .keep_all = TRUE) |>
  select(g_indicator, all_of(mit_geom_col)) |>
  st_transform(crs = st_crs(flood_boundary))

unique_app <- app_sa |> distinct(g_indicator, .keep_all = TRUE) |>
  select(g_indicator, all_of(app_geom_col)) |>
  st_transform(crs = st_crs(flood_boundary))
```

---

## 5  Build unique-geometry sets

Matching is done on unique geometries then merged back, avoiding redundant
spatial operations where multiple records share a coordinate.

```{r load-parcels-priority}
read_priority <- function(priority, path) {
  st_read(
    path,
    query = paste0("SELECT parcel_index, match_priority, geometry_id, geom FROM \"parcels_pri\" WHERE match_priority = '", priority, "'"),
    quiet = TRUE
  ) |>
    st_transform(crs = 32119)
}

parcels_path <- file.path(out_dir, "parcels_pri.gpkg")

highpri <- read_priority("high", parcels_path)
medpri  <- read_priority("med",  parcels_path)
lowpri  <- read_priority("low",  parcels_path)
gc()
```


```{r unique-geoms-parcels}
# Reproject parcel priority sets to EPSG:32119
to_unique_geoms <- function(df) {
  geom_col <- attr(df, "sf_column")
  df |>
    st_transform(crs = 32119) |>
    distinct(geometry_id, .keep_all = TRUE) |>
    select(geometry_id, all_of(geom_col)) |>
    rename(geometry = all_of(geom_col))
}

ug_high <- to_unique_geoms(highpri)
ug_med  <- to_unique_geoms(medpri)
ug_low  <- to_unique_geoms(lowpri)
```


---

## 6  Tiered spatial matching helper

A single function runs the intersect → distance (≤ 50 m) workflow for one
priority tier, for both mitigations and applications.

```{r match-function}
#' Match unique geometries to a parcel tier via intersect + nearest-neighbour
#'
#' @param candidates sf with columns g_indicator + geometry (mit or app unique geoms)
#' @param parcel_tier sf with columns geometry_id + geometry (unique parcel geoms)
#' @param max_dist   maximum allowable distance in metres for distance matches
#'
#' @return tibble with columns g_indicator, geometry_id, and optionally distance
match_to_tier <- function(candidates, parcel_tier, max_dist = 50) {

  # --- intersect pass ---
  intersect_raw <- st_join(candidates, parcel_tier, join = st_intersects, left = TRUE)

  intersect_matched <- intersect_raw |>
    filter(!is.na(geometry_id)) |>
    st_drop_geometry() |>
    mutate(intersect = 1L) |>
    distinct(g_indicator, .keep_all = TRUE)

  unmatched <- intersect_raw |>
    filter(is.na(geometry_id)) |>
    select(g_indicator)

  # --- distance pass on the remainder ---
  if (nrow(unmatched) == 0 || nrow(parcel_tier) == 0) {
    dist_matched <- tibble(g_indicator = integer(), geometry_id = integer(),
                           distance = numeric())
  } else {
    dist_joined <- st_join(unmatched, parcel_tier, join = st_nearest_feature)

    parcel_geoms <- parcel_tier$geometry[match(dist_joined$geometry_id, parcel_tier$geometry_id)]

    dist_matched <- dist_joined |>
      mutate(distance = as.numeric(
        st_distance(st_geometry(dist_joined), parcel_geoms, by_element = TRUE)
      )) |>
      st_drop_geometry() |>
      distinct(g_indicator, .keep_all = TRUE) |>
      filter(distance <= max_dist)
  }

  bind_rows(
    intersect_matched |> select(g_indicator, geometry_id),
    dist_matched      |> select(g_indicator, geometry_id, any_of("distance"))
  )
}
```

---

## 7  Run tiered matching

```{r match-high}
mit_highpri <- match_to_tier(unique_mit, ug_high)
app_highpri <- match_to_tier(unique_app, ug_high)

# Remove matched records before next tier
mits_rem <- unique_mit |> filter(!g_indicator %in% mit_highpri$g_indicator)
apps_rem <- unique_app |> filter(!g_indicator %in% app_highpri$g_indicator)
```

```{r match-med}
mit_medpri <- match_to_tier(mits_rem, ug_med)
app_medpri <- match_to_tier(apps_rem, ug_med)

mits_rem <- mits_rem |> filter(!g_indicator %in% mit_medpri$g_indicator)
apps_rem <- apps_rem |> filter(!g_indicator %in% app_medpri$g_indicator)
```

```{r match-low}
mit_lowpri <- match_to_tier(mits_rem, ug_low)
app_lowpri <- match_to_tier(apps_rem, ug_low)

mits_rem <- mits_rem |> filter(!g_indicator %in% mit_lowpri$g_indicator)
apps_rem <- apps_rem |> filter(!g_indicator %in% app_lowpri$g_indicator)
```

```{r match-fallback}
# Fallback: global nearest-neighbour across all parcel tiers (no distance cap)
ug_all <- bind_rows(ug_high, ug_med, ug_low)

nn_fallback <- function(remaining, all_geoms) {
  if (nrow(remaining) == 0) {
    return(tibble(g_indicator = integer(), geometry_id = integer(),
                  distance = numeric()))
  }
  joined <- st_join(remaining, all_geoms, join = st_nearest_feature)
  joined |>
    mutate(distance = as.numeric(
      st_distance(
        st_geometry(joined),
        all_geoms$geometry[match(geometry_id, all_geoms$geometry_id)],
        by_element = TRUE
      )
    )) |>
    st_drop_geometry() |>
    distinct(g_indicator, .keep_all = TRUE) |>
    select(g_indicator, geometry_id, distance)
}

mit_dist <- nn_fallback(mits_rem, ug_all)
app_dist <- nn_fallback(apps_rem, ug_all)
```

---

## 8  Save per-tier CSVs

```{r save-tier-csvs}
write_csv(mit_highpri, file.path(out_dir, "mit_highpri.csv"))
write_csv(app_highpri, file.path(out_dir, "app_highpri.csv"))
write_csv(mit_medpri,  file.path(out_dir, "mit_medpri.csv"))
write_csv(app_medpri,  file.path(out_dir, "app_medpri.csv"))
write_csv(mit_lowpri,  file.path(out_dir, "mit_lowpri.csv"))
write_csv(app_lowpri,  file.path(out_dir, "app_lowpri.csv"))
write_csv(mit_dist,    file.path(out_dir, "mit_dist.csv"))
write_csv(app_dist,    file.path(out_dir, "app_dist.csv"))
```

---

## 9  Merge geometry tags back to full mit / app datasets

```{r merge-back}
mit_geom_full <- bind_rows(mit_highpri, mit_medpri, mit_lowpri, mit_dist)
app_geom_full <- bind_rows(app_highpri, app_medpri, app_lowpri, app_dist)

mit_sa_full <- mit_sa |>
  mutate(unique_id = row_number()) |>
  left_join(mit_geom_full, by = "g_indicator")

app_sa_full <- app_sa |>
  mutate(unique_id = row_number()) |>
  left_join(app_geom_full, by = "g_indicator")

# Save geometry-tagged full datasets before QA
st_write(app_sa_full, file.path(out_dir, "app_sa_full_noqa.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(mit_sa_full, file.path(out_dir, "mit_sa_full_noqa.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
```

---

## 10  Join back to parcels

The spatial matching in steps 1–9 assigns each mit/app record a `geometry_id`
— the identifier of the parcel polygon it fell inside or nearest to. The goal
here is to recover the full parcel attribute table (owner, land use, address,
improvement value, etc.) and the parcel geometry for every matched record.

The complication is that `geometry_id` is not a unique key into `parcels_nogeom`.
Two situations produce a many-to-many relationship:

1. **Condo / multi-unit complexes.** Each individual unit (e.g., the 95 units at
   9620–9630 Vinca Circle) is a separate legal parcel with its own APN, but all
   units are registered against the same polygon, so they all share one
   `geometry_id`. The mit/app address (e.g., *"9624 VINCA CIR APT C"*) identifies
   one specific unit, not the whole complex.

2. **Recorder duplicates.** A small number of geometry_ids have 2–6 parcel rows
   that differ only in `APN.SEQUENCE.NUMBER`. The row with `APN.SEQUENCE.NUMBER = 1`
   is the primary (canonical) recorder entry; the others are supplemental filings.

To resolve both cases without inflating the output row count, a two-stage
selection is applied within each duplicate `geometry_id` group:

- **Stage 1 — address similarity.** Jaro-Winkler string similarity is computed
  between the mit/app address field and each candidate parcel's `siteadd`. The
  candidate(s) with the highest similarity score are retained. This is the
  decisive criterion for condo complexes where `APN.SEQUENCE.NUMBER` is 1 for
  every unit.
- **Stage 2 — tiebreaking.** Among remaining ties (equal similarity scores),
  records are ranked by `APN.SEQUENCE.NUMBER` ascending (primary recorder entry
  first), then `luc_tier` descending (most specific land use), then `improvval`
  descending (parcel with a structure). A single winner is taken.

Non-duplicate `geometry_id`s bypass this logic entirely and are joined directly.
The two paths are recombined before the geometry attachment and file write.

```{r free-up-memory}
n_mits_rem <- nrow(mits_rem)
n_apps_rem <- nrow(apps_rem)

rm(ug_high, ug_med, ug_low, ug_all,
   highpri, medpri, lowpri,
   mit_highpri, app_highpri, mit_medpri, app_medpri,
   mit_lowpri, app_lowpri, mit_dist, app_dist,
   unique_mit, unique_app, mits_rem, apps_rem)
gc()
```


```{r join-to-parcels}
parcels_nogeom <- st_read(parcels_path, quiet = TRUE) |>
  st_drop_geometry() |>
  select(-any_of(c("vacant","gov_lu","res_lu","fr_1_4",
                   "qual_siteadd","qual_mailadd","qual_luc",
                   "qual_sum","match_priority")))
                   
# Reconstruct nc1_geoms for final geometry join (load-parcels is eval:false)
nc1_geoms <- st_read(geoms_path, quiet = TRUE,
                     query = 'SELECT parcel_index, geom FROM parcels_study_area') |>
  mutate(parcel_index = as.integer(parcel_index))

# Drop geometry from app/mit before joining
apps_nogeom <- app_sa_full |> st_drop_geometry()
mits_nogeom <- mit_sa_full |> st_drop_geometry()
```

```{r dedup-parcels}
# --- Identify geometry_ids with multiple parcel rows ---
dup_geom_ids <- parcels_nogeom |>
  count(geometry_id) |>
  filter(n > 1) |>
  pull(geometry_id)

# --- Helper: pick the best parcel for one (record, geometry_id) pair ----------
# Priority:
#   1. Highest address similarity (Jaro-Winkler) between record address and siteadd
#   2. Lowest APN.SEQUENCE.NUMBER (primary recorder record)
#   3. Highest luc_tier (most specific land use)
#   4. Highest improvval (parcel has a structure)
pick_best_parcel <- function(records_nogeom, addr_col) {
  dup_pool <- parcels_nogeom |>
    filter(geometry_id %in% dup_geom_ids)

  records_nogeom |>
    filter(geometry_id %in% dup_geom_ids) |>
    select(unique_id, geometry_id, raw_addr = all_of(addr_col)) |>
    left_join(dup_pool, by = "geometry_id", relationship = "many-to-many") |>
    mutate(
      addr_sim = stringsim(
        toupper(coalesce(raw_addr, "")),
        toupper(coalesce(siteadd,  "")),
        method = "jw"
      )
    ) |>
    group_by(unique_id, geometry_id) |>
    slice_max(addr_sim,             n = 1, with_ties = TRUE) |>
    slice_min(APN.SEQUENCE.NUMBER,  n = 1, with_ties = TRUE) |>
    slice_max(luc_tier,             n = 1, with_ties = TRUE) |>
    slice_max(improvval,            n = 1, with_ties = FALSE) |>
    ungroup() |>
    select(unique_id, geometry_id, parcel_index)
}

app_winners <- pick_best_parcel(apps_nogeom |> mutate(unique_id = row_number()), "address")
mit_winners <- pick_best_parcel(mits_nogeom |> mutate(unique_id = row_number()), "raw_addr")

# --- Build deduplicated parcel tables scoped to each record type ---------------
# For non-dup geometry_ids: keep all parcel rows (1-to-1, no change)
# For dup geometry_ids:     keep only the winner row per (unique_id, geometry_id)
parcels_nogeom_app_dedup <- parcels_nogeom |>
  filter(!geometry_id %in% dup_geom_ids)   # non-dup rows

parcels_nogeom_mit_dedup <- parcels_nogeom_app_dedup  # same base

# Winner rows — join winner parcel_index back to full parcel attributes
app_dup_rows <- app_winners |>
  left_join(parcels_nogeom, by = c("geometry_id", "parcel_index"))

mit_dup_rows <- mit_winners |>
  left_join(parcels_nogeom, by = c("geometry_id", "parcel_index"))
```

```{r join-to-parcels-dedup}
# Join mit/app → parcel attributes via geometry_id, then attach parcel geometry.
# Dup geometry_ids use the address-similarity winner; non-dup geometry_ids join
# normally (no many-to-many inflation).

join_dedup <- function(records_nogeom, dup_rows, id_col, nc1_geoms) {
  id_sym <- sym(id_col)

  non_dup <- records_nogeom |>
    mutate(.uid = row_number()) |>
    filter(!geometry_id %in% dup_geom_ids) |>
    left_join(parcels_nogeom_app_dedup, by = "geometry_id")

  dup <- records_nogeom |>
    mutate(.uid = row_number()) |>
    filter(geometry_id %in% dup_geom_ids) |>
    left_join(
      dup_rows |> select(unique_id, geometry_id, parcel_index,
                         any_of(names(parcels_nogeom))),
      by = c(".uid" = "unique_id", "geometry_id")
    )

  bind_rows(non_dup, dup) |>
    distinct() |>
    left_join(nc1_geoms |> select(parcel_index, geometry = geom),
              by = "parcel_index") |>
    select(-.uid) |>
    st_as_sf(crs = st_crs(nc1_geoms))
}

apps_pcls <- join_dedup(
  apps_nogeom |> mutate(unique_id = row_number()),
  app_dup_rows, "unique_id", nc1_geoms
)

mits_pcls <- join_dedup(
  mits_nogeom |> mutate(unique_id = row_number()),
  mit_dup_rows, "unique_id", nc1_geoms
)

st_write(apps_pcls, file.path(out_dir, "apps_pcls.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(mits_pcls, file.path(out_dir, "mits_pcls.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
```

---

## Summary
```{r load-tier-csvs}
pri_cols <- cols(g_indicator = col_integer(), geometry_id = col_integer(), distance = col_double())

mit_highpri <- read_csv(file.path(out_dir, "mit_highpri.csv"), col_types = pri_cols)
app_highpri <- read_csv(file.path(out_dir, "app_highpri.csv"), col_types = pri_cols)
mit_medpri  <- read_csv(file.path(out_dir, "mit_medpri.csv"),  col_types = pri_cols)
app_medpri  <- read_csv(file.path(out_dir, "app_medpri.csv"),  col_types = pri_cols)
mit_lowpri  <- read_csv(file.path(out_dir, "mit_lowpri.csv"),  col_types = pri_cols)
app_lowpri  <- read_csv(file.path(out_dir, "app_lowpri.csv"),  col_types = pri_cols)
mit_dist    <- read_csv(file.path(out_dir, "mit_dist.csv"),    col_types = pri_cols)
app_dist    <- read_csv(file.path(out_dir, "app_dist.csv"),    col_types = pri_cols)
```


```{r summary}
tibble::tibble(
  Type        = c("Mitigations", "Applications"),
  Total       = c(nrow(mit_sa),      nrow(app_sa)),
  High        = c(nrow(mit_highpri), nrow(app_highpri)),
  Med         = c(nrow(mit_medpri),  nrow(app_medpri)),
  Low         = c(nrow(mit_lowpri),  nrow(app_lowpri)),
  Fallback_NN = c(nrow(mit_dist),    nrow(app_dist)),
  Unmatched   = c(n_mits_rem,    n_apps_rem)
) |>
  dplyr::mutate(
    Matched = High + Med + Low + Fallback_NN
  ) |>
  dplyr::select(Type, Total, High, Med, Low, Fallback_NN, Matched, Unmatched) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    col.names   = c("Type", "Total", "High", "Med", "Low", "Fallback NN",
                    "Total Matched", "Unmatched")
  )
```

---

## Comparison with previous analysis

The table below compares final matched parcel counts between this run and the
prior team's Python-based pipeline. Differences stem from five sources:

1. **Input record counts** — the source `.gpkg` files have been updated since
   the prior run (mitigations: 8,801 vs. 8,803; applications: 23,232 vs. 23,256).

2. **Priority scoring revision** — the original `gov` LUC list had a missing
   comma bug (`['604' '603', '601']`), meaning no parcel ever matched that
   category. The revised scheme also separates 1–4 family residential into its
   own tier and weights site address 2× mailing address, producing a cleaner
   high/med/low split.

3. **Geometry deduplication method** — the prior script assigned `g_indicator`
   via `groupby('geometry').ngroup()`, which groups on WKT string
   representations. Two geometries that are visually identical but differ by
   sub-micron floating-point precision will receive different group IDs,
   potentially inflating unique geometry counts. This pipeline uses
   `factor(as.character(st_geometry(.)))`, which is consistent with how R
   represents geometries and avoids this edge case.

4. **Low-priority parcel pool asymmetry** — the prior script correctly excluded
   high-priority geometry IDs from the medium-priority pool, but failed to
   exclude medium-priority IDs from the low-priority pool. This slightly
   inflated the low-priority candidate set, allowing some geometries to be
   matched twice across tiers. This pipeline uses sequential remainder sets
   (`mits_rem` / `apps_rem`) so each geometry can only match in one tier.

5. **Parcel duplicate preservation** — the Python pipeline's final
   `merge(apps, parcels, on='geometry_id')` naturally re-inflates matched
   records across all parcel rows that share a `geometry_id` (i.e. a common
   physical footprint). 

6. **Duplicate-geometry deduplication** — this pipeline resolves the 39
   `geometry_id`s with multiple candidate parcel rows by selecting a single
   winner per record using address similarity (Jaro-Winkler on `siteadd`) with
   `APN.SEQUENCE.NUMBER`, `luc_tier`, and `improvval` as tiebreakers. The prior
   pipeline retained all matching rows, inflating parcel counts.

```{r comparison-table}
n_mit_pcls <- n_distinct(mits_pcls$parcel_index)
n_app_pcls <- n_distinct(apps_pcls$parcel_index)
n_both     <- length(intersect(mits_pcls$parcel_index, apps_pcls$parcel_index))
pct_both   <- round(n_both / n_app_pcls * 100, 1)

tibble::tibble(
  Metric      = c("Parcels with a mitigation",
                  "Parcels with an application",
                  "Parcels with both (mit \u2229 app)",
                  "Share of app parcels with a mit"),
  `Prior run` = c("6,087", "8,782", "4,292", "48.9%"),
  `This run`  = c(
    format(n_mit_pcls, big.mark = ","),
    format(n_app_pcls, big.mark = ","),
    format(n_both,     big.mark = ","),
    paste0(pct_both, "%")
  )
) |>
  knitr::kable(align = c("l", "r", "r"))
```

Despite the count differences, the share of application parcels that also
received a mitigation is nearly identical across both runs (48.9%
vs. `r paste0(pct_both, "%")`), suggesting the matching logic is consistent —
the absolute differences are driven by input data updates and the scoring,
deduplication, and parcel duplicate-preservation fixes above.
