Step 5: Mitigations to Parcels

Overview

This notebook spatially matches mitigation and application records to parcels using a tiered priority scheme. Parcels are scored and ranked as high, medium, or low priority for matching based on land use code and address quality. For each priority tier, mitigation/application geometries are first matched via direct intersection with parcel polygons, then by nearest-neighbor distance (≤ 50 m) for those that do not intersect. Unmatched records after all three tiers fall back to a global nearest-neighbor match across all parcels.

Input files (Russell’s):

nc1_cl_merge.csv — parcel–CoreLogic merged table (Step 4 output)
parcels_study_area.gpkg — OneMap parcel geometries with geometry_id
mit_newgeo.gpkg — mitigations (re-geocoded)
mit_elevation_qaqc_filled.csv — elevation QAQC scores for mitigations
applications_final.gpkg — FEMA applications
NC_huc6_data_union_sf.shp — flood study area boundary

External files (written to out_dir):

parcels_matcheable.gpkg — filtered parcels before priority scoring
parcels_pri.gpkg — parcels with priority scores
mit_highpri.csv, mit_medpri.csv, mit_lowpri.csv, mit_dist.csv
app_highpri.csv, app_medpri.csv, app_lowpri.csv, app_dist.csv
app_sa_full_noqa.gpkg, mit_sa_full_noqa.gpkg
apps_pcls.gpkg, mits_pcls.gpkg

In [1]:

Show / hide code

library(sf)

Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE

Show / hide code

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Show / hide code

library(readr)
library(stringr)
library(stringdist)
library(purrr)
library(tidyr)


Attaching package: 'tidyr'

The following object is masked from 'package:stringdist':

    extract

In [2]:

Show / hide code

out_dir  <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"

# External data paths
flood_boundary_path <- "/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"
mit_path            <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"
mit_elev_path       <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_elevation_qaqc_filled.csv"
app_path            <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"
geoms_path          <- file.path(out_dir, "parcels_study_area.gpkg")

1 Load & prepare parcels

In [3]:

Show / hide code


parcels_raw <- read_csv(
  file.path(out_dir, "nc1_cl_merge.csv"),
  col_types = cols(.default = col_character())
) |>
  mutate(parcel_index = as.integer(parcel_index))

nc1_geoms <- st_read(geoms_path, quiet = TRUE) |>
  mutate(
    parcel_index = as.integer(parcel_index),
    geometry_id  = as.integer(factor(as.character(geom)))
  ) |>
  select(parcel_index, geometry_id)

parcels <- parcels_raw |>
  select(-any_of(c("geometry_id1", "index_right", "geometry_id", "geometry"))) |>
  left_join(st_drop_geometry(nc1_geoms), by = "parcel_index") |>
  left_join(nc1_geoms |> select(parcel_index, geometry = geom), by = "parcel_index") |>
  st_as_sf(crs = st_crs(nc1_geoms))

2 Filter invalid parcels

In [4]:

Show / hide code


# Remove water parcels (flagged via parcel number keywords)
water_keywords <- "bay|creek|sound|water|ocean|river|canal|pond|inlet"

parcels <- parcels |>
  filter(!str_detect(parno, regex(water_keywords, ignore_case = TRUE)))

# Flag and remove invalid parcels
parcels <- parcels |>
  mutate(
    invalidparcels = case_when(
      # No identifying info at all
      is.na(`LAND.USE.CODE`) & parno == "" & is.na(siteadd) & mailadd == "" ~ 1L,
      # Carteret: no land use code, blank parno → likely right-of-way
      is.na(`LAND.USE.CODE`) & parno == "" & cntyfips == "031" ~ 1L,
      # Brunswick: no parno or altparno → right-of-way
      is.na(`LAND.USE.CODE`) & parno == "" & altparno == "" & cntyfips == "019" ~ 1L,
      # Explicit ROW tags (excluding "ROWAN COUNTY")
      (str_detect(parno,    regex("row", ignore_case = TRUE)) |
       str_detect(altparno, regex("row", ignore_case = TRUE))) &
        parno != "ROWAN COUNTY" ~ 1L,
      TRUE ~ 0L
    )
  ) |>
  filter(invalidparcels == 0L)

# Save pre-priority filtered parcels for diagnostic mapping
st_write(parcels |> select(-invalidparcels),
         file.path(out_dir, "parcels_matcheable.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

parcels <- parcels |> select(-invalidparcels)

3 Parcel priority scoring

Land use codes that define each priority bucket:

In [5]:

Show / hide code

luc_vacant      <- c("465","460","455","454","453","452","450","400")
luc_gov         <- c("604","603","601","600")     # +600 (E-Exempt) — parallels 601/603/604
luc_residential <- c("165","163","160","151","138","137","136","135","134","133",
                      "132","131","117","116","115","113","112","111","109","106",
                      "103","102","100")
luc_fr_1_4      <- c("165","163","160","151","138","137","136","135",
                      "134","133",                # +133, 134 (triplex, quadruplex — definitional fix)
                      "132","115",
                      "113","112","111",          # +111–113 (mobile/manufactured/modular — FEMA priority)
                      "102","100")
luc_buyout      <- c("602")                       # Hindsight — completed historical buyouts (LUC 602)
luc_top         <- c(luc_vacant, luc_gov, luc_residential, luc_buyout)
luc_bottom      <- c("283",                       # +283 (commercial supermarket)
                      "605","606","607","609","610","614","615","618","620","630",
                      "800","818","839","850","860","872","875","877","880","883",
                      "885","886","114")

Scoring change from original Python script

The original script collapsed vacant, government-owned, and all residential parcels into a single qual_luc = 3 bucket, meaning a 1–4 family home and a municipal lot were treated identically. It also weighted site address and mailing address equally (1 pt each), and the composite qual_sum only ranged 0–5, making the high/med/low thresholds somewhat arbitrary.

The revised scheme, with hindsight-specific elevation for completed buyouts:

LUC group	`luc_tier`	`match_priority`	Rationale
Completed buyouts (`luc_buyout`, LUC 602)	3	high (always)	Hindsight: confirmed historical buyout — parcel converted to exempt status because of the program we’re studying
1–4 family residential (`luc_fr_1_4`)	4	high if site address present	Primary buyout/mitigation candidate; includes mobile/manufactured/modular housing per FEMA HMA priorities
Vacant land (`luc_vacant`)	3	med	High candidate, no structure risk
Gov-owned (`luc_gov`)	3	med	Eligible but less common
Other residential (`luc_residential \ luc_fr_1_4`)	2	med if qual_sum ≥ 5, else low	Multi-family / condo / apartment — rarely targeted
Unknown LUC	1	low	Insufficient info
Commercial/industrial (`luc_bottom`)	0	low	Unlikely candidates

Site address is weighted 2 pts (more useful for geocoding/matching than mailing address), mailing address 1 pt, giving a qual_sum range of 0–7 and more granular priority tiers.

luc_fr_1_4 membership reflects FEMA HMA program priorities rather than CoreLogic’s literal numeric subgroupings:

LUC 111 (“Mobile home”), 112 (“Manufactured home”), 113 (“Modular home”) are included to reflect FEMA’s documented emphasis on mobile/manufactured/ modular housing in flood-prone areas.
LUC 133 (“Triplex”) and 134 (“Quadruplex”) added for definitional consistency with the “1–4 family” label — both are 2–4 unit residential and were missing from the original list.

Code list corrections from the original Python:

The gov list had a missing comma bug (['604' '603', '601'] → the string '604603'), so no parcel ever matched it. Fixed here.
LUC 600 (“E-Exempt”) added to luc_gov — parallels the existing 601/603/604 exempt-government codes.
LUC 283 (CoreLogic commercial) added to luc_bottom.
LUC 602 (“Exempt Governmental (Flood Buyout)”) added as a separate luc_buyout flag. Treated as tier 3 in luc_tier (current state = exempt government) and elevated to match_priority = "high" for the hindsight analysis — a mit/app record landing on a 602 parcel is a confirmed historical buyout, the strongest possible match signal.

In [6]:

Show / hide code


parcels <- parcels |>
  mutate(
    `LAND.USE.CODE` = as.character(`LAND.USE.CODE`),

    # Land use flags (kept for downstream use)
    vacant              = if_else(`LAND.USE.CODE` %in% luc_vacant,      1L, 0L),
    gov_lu              = if_else(`LAND.USE.CODE` %in% luc_gov,         1L, 0L),
    res_lu              = if_else(`LAND.USE.CODE` %in% luc_residential, 1L, 0L),
    fr_1_4              = if_else(`LAND.USE.CODE` %in% luc_fr_1_4,      1L, 0L),
    is_completed_buyout = `LAND.USE.CODE` %in% luc_buyout,

    # Address quality: site address weighted 2x, mail address 1x
    qual_siteadd = if_else(!is.na(siteadd) & siteadd != "", 2L, 0L),
    qual_mailadd = if_else(!is.na(mailadd) & mailadd != "", 1L, 0L),

    # Tiered LUC score (0–4) — reflects current CoreLogic classification
    luc_tier = case_when(
      `LAND.USE.CODE` %in% luc_fr_1_4                                        ~ 4L,
      `LAND.USE.CODE` %in% luc_vacant                                         ~ 3L,
      `LAND.USE.CODE` %in% luc_gov                                            ~ 3L,
      `LAND.USE.CODE` %in% luc_buyout                                         ~ 3L,
      `LAND.USE.CODE` %in% luc_residential & !`LAND.USE.CODE` %in% luc_fr_1_4 ~ 2L,
      `LAND.USE.CODE` %in% luc_bottom                                         ~ 0L,
      TRUE                                                                     ~ 1L
    ),

    # Composite score (0–7) and priority tier
    qual_sum = luc_tier + qual_siteadd + qual_mailadd,
    match_priority = case_when(
      is_completed_buyout                ~ "high",   # Hindsight: confirmed historical buyout
      luc_tier == 4 & qual_siteadd == 2  ~ "high",   # 1–4 family + site address
      luc_tier >= 3 | qual_sum >= 5      ~ "med",
      TRUE                               ~ "low"
    )
  )

# Save prioritised parcel layer
st_write(parcels, file.path(out_dir, "parcels_pri.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

In [7]:

Show / hide code


# Split into priority tiers
highpri <- parcels |> filter(match_priority == "high")
medpri  <- parcels |> filter(match_priority == "med")
lowpri  <- parcels |> filter(match_priority == "low")

4 Load & prepare mitigations and applications

In [8]:

Show / hide code

flood_boundary <- st_read(flood_boundary_path, quiet = TRUE) |>
  st_transform(crs = 32119)

In [9]:

Show / hide code

mit_gdf <- st_read(mit_path, quiet = TRUE) |>
  st_transform(crs = st_crs(flood_boundary))

# Merge elevation QAQC scores
mit_elev <- read_csv(mit_elev_path, col_types = cols(.default = col_character())) |>
  rename_with(str_trim)

mit_gdf <- mit_gdf |>
  left_join(mit_elev |> 
  select(record_ID, elevation_score), by = "record_ID")

# Clip to study area
mit_gdf <- mit_gdf |>
  mutate(Inside_SA = as.integer(lengths(st_within(st_geometry(mit_gdf), flood_boundary)) > 0))

mit_sa <- mit_gdf |> filter(Inside_SA == 1)

In [10]:

Show / hide code

app <- st_read(app_path, quiet = TRUE) |>
  st_transform(crs = st_crs(flood_boundary))

app <- app |>
  mutate(
    app_id    = row_number(),
    Inside_SA = as.integer(lengths(st_within(st_geometry(app), flood_boundary)) > 0)
  )

app_sa <- app |> filter(Inside_SA == 1)

In [11]:

Show / hide code

mit_geom_col <- attr(mit_sa, "sf_column")
app_geom_col <- attr(app_sa, "sf_column")

mit_sa <- mit_sa |>
  mutate(g_indicator = as.integer(factor(as.character(st_geometry(mit_sa)))))

app_sa <- app_sa |>
  mutate(
    g_indicator = as.integer(factor(as.character(st_geometry(app_sa)))),
    app_id      = row_number()
  )

unique_mit <- mit_sa |> distinct(g_indicator, .keep_all = TRUE) |>
  select(g_indicator, all_of(mit_geom_col)) |>
  st_transform(crs = st_crs(flood_boundary))

unique_app <- app_sa |> distinct(g_indicator, .keep_all = TRUE) |>
  select(g_indicator, all_of(app_geom_col)) |>
  st_transform(crs = st_crs(flood_boundary))

5 Build unique-geometry sets

Matching is done on unique geometries then merged back, avoiding redundant spatial operations where multiple records share a coordinate.

In [12]:

Show / hide code

read_priority <- function(priority, path) {
  st_read(
    path,
    query = paste0("SELECT parcel_index, match_priority, geometry_id, geom FROM \"parcels_pri\" WHERE match_priority = '", priority, "'"),
    quiet = TRUE
  ) |>
    st_transform(crs = 32119)
}

parcels_path <- file.path(out_dir, "parcels_pri.gpkg")

highpri <- read_priority("high", parcels_path)
medpri  <- read_priority("med",  parcels_path)
lowpri  <- read_priority("low",  parcels_path)
gc()

             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  100260937 5354.6  177477272  9478.4  177477272  9478.4
Vcells 1184124831 9034.2 2216893715 16913.6 2023658956 15439.3

In [13]:

Show / hide code

# Reproject parcel priority sets to EPSG:32119
to_unique_geoms <- function(df) {
  geom_col <- attr(df, "sf_column")
  df |>
    st_transform(crs = 32119) |>
    distinct(geometry_id, .keep_all = TRUE) |>
    select(geometry_id, all_of(geom_col)) |>
    rename(geometry = all_of(geom_col))
}

ug_high <- to_unique_geoms(highpri)
ug_med  <- to_unique_geoms(medpri)
ug_low  <- to_unique_geoms(lowpri)

6 Tiered spatial matching helper

A single function runs the intersect → distance (≤ 50 m) workflow for one priority tier, for both mitigations and applications.

In [14]:

Show / hide code

#' Match unique geometries to a parcel tier via intersect + nearest-neighbour
#'
#' @param candidates sf with columns g_indicator + geometry (mit or app unique geoms)
#' @param parcel_tier sf with columns geometry_id + geometry (unique parcel geoms)
#' @param max_dist   maximum allowable distance in metres for distance matches
#'
#' @return tibble with columns g_indicator, geometry_id, and optionally distance
match_to_tier <- function(candidates, parcel_tier, max_dist = 50) {

  # --- intersect pass ---
  intersect_raw <- st_join(candidates, parcel_tier, join = st_intersects, left = TRUE)

  intersect_matched <- intersect_raw |>
    filter(!is.na(geometry_id)) |>
    st_drop_geometry() |>
    mutate(intersect = 1L) |>
    distinct(g_indicator, .keep_all = TRUE)

  unmatched <- intersect_raw |>
    filter(is.na(geometry_id)) |>
    select(g_indicator)

  # --- distance pass on the remainder ---
  if (nrow(unmatched) == 0 || nrow(parcel_tier) == 0) {
    dist_matched <- tibble(g_indicator = integer(), geometry_id = integer(),
                           distance = numeric())
  } else {
    dist_joined <- st_join(unmatched, parcel_tier, join = st_nearest_feature)

    parcel_geoms <- parcel_tier$geometry[match(dist_joined$geometry_id, parcel_tier$geometry_id)]

    dist_matched <- dist_joined |>
      mutate(distance = as.numeric(
        st_distance(st_geometry(dist_joined), parcel_geoms, by_element = TRUE)
      )) |>
      st_drop_geometry() |>
      distinct(g_indicator, .keep_all = TRUE) |>
      filter(distance <= max_dist)
  }

  bind_rows(
    intersect_matched |> select(g_indicator, geometry_id),
    dist_matched      |> select(g_indicator, geometry_id, any_of("distance"))
  )
}

7 Run tiered matching

In [15]:

Show / hide code

mit_highpri <- match_to_tier(unique_mit, ug_high)
app_highpri <- match_to_tier(unique_app, ug_high)

# Remove matched records before next tier
mits_rem <- unique_mit |> filter(!g_indicator %in% mit_highpri$g_indicator)
apps_rem <- unique_app |> filter(!g_indicator %in% app_highpri$g_indicator)

In [16]:

Show / hide code

mit_medpri <- match_to_tier(mits_rem, ug_med)
app_medpri <- match_to_tier(apps_rem, ug_med)

mits_rem <- mits_rem |> filter(!g_indicator %in% mit_medpri$g_indicator)
apps_rem <- apps_rem |> filter(!g_indicator %in% app_medpri$g_indicator)

In [17]:

Show / hide code

mit_lowpri <- match_to_tier(mits_rem, ug_low)
app_lowpri <- match_to_tier(apps_rem, ug_low)

mits_rem <- mits_rem |> filter(!g_indicator %in% mit_lowpri$g_indicator)
apps_rem <- apps_rem |> filter(!g_indicator %in% app_lowpri$g_indicator)

In [18]:

Show / hide code

# Fallback: global nearest-neighbour across all parcel tiers (no distance cap)
ug_all <- bind_rows(ug_high, ug_med, ug_low)

nn_fallback <- function(remaining, all_geoms) {
  if (nrow(remaining) == 0) {
    return(tibble(g_indicator = integer(), geometry_id = integer(),
                  distance = numeric()))
  }
  joined <- st_join(remaining, all_geoms, join = st_nearest_feature)
  joined |>
    mutate(distance = as.numeric(
      st_distance(
        st_geometry(joined),
        all_geoms$geometry[match(geometry_id, all_geoms$geometry_id)],
        by_element = TRUE
      )
    )) |>
    st_drop_geometry() |>
    distinct(g_indicator, .keep_all = TRUE) |>
    select(g_indicator, geometry_id, distance)
}

mit_dist <- nn_fallback(mits_rem, ug_all)
app_dist <- nn_fallback(apps_rem, ug_all)

8 Save per-tier CSVs

In [19]:

Show / hide code

write_csv(mit_highpri, file.path(out_dir, "mit_highpri.csv"))
write_csv(app_highpri, file.path(out_dir, "app_highpri.csv"))
write_csv(mit_medpri,  file.path(out_dir, "mit_medpri.csv"))
write_csv(app_medpri,  file.path(out_dir, "app_medpri.csv"))
write_csv(mit_lowpri,  file.path(out_dir, "mit_lowpri.csv"))
write_csv(app_lowpri,  file.path(out_dir, "app_lowpri.csv"))
write_csv(mit_dist,    file.path(out_dir, "mit_dist.csv"))
write_csv(app_dist,    file.path(out_dir, "app_dist.csv"))

9 Merge geometry tags back to full mit / app datasets

In [20]:

Show / hide code

mit_geom_full <- bind_rows(mit_highpri, mit_medpri, mit_lowpri, mit_dist)
app_geom_full <- bind_rows(app_highpri, app_medpri, app_lowpri, app_dist)

mit_sa_full <- mit_sa |>
  mutate(unique_id = row_number()) |>
  left_join(mit_geom_full, by = "g_indicator")

app_sa_full <- app_sa |>
  mutate(unique_id = row_number()) |>
  left_join(app_geom_full, by = "g_indicator")

# Save geometry-tagged full datasets before QA
st_write(app_sa_full, file.path(out_dir, "app_sa_full_noqa.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(mit_sa_full, file.path(out_dir, "mit_sa_full_noqa.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

10 Join back to parcels

The spatial matching in steps 1–9 assigns each mit/app record a geometry_id — the identifier of the parcel polygon it fell inside or nearest to. The goal here is to recover the full parcel attribute table (owner, land use, address, improvement value, etc.) and the parcel geometry for every matched record.

The complication is that geometry_id is not a unique key into parcels_nogeom. Two situations produce a many-to-many relationship:

Condo / multi-unit complexes. Each individual unit (e.g., the 95 units at 9620–9630 Vinca Circle) is a separate legal parcel with its own APN, but all units are registered against the same polygon, so they all share one geometry_id. The mit/app address (e.g., “9624 VINCA CIR APT C”) identifies one specific unit, not the whole complex.
Recorder duplicates. A small number of geometry_ids have 2–6 parcel rows that differ only in APN.SEQUENCE.NUMBER. The row with APN.SEQUENCE.NUMBER = 1 is the primary (canonical) recorder entry; the others are supplemental filings.

To resolve both cases without inflating the output row count, a two-stage selection is applied within each duplicate geometry_id group:

Stage 1 — address similarity. Jaro-Winkler string similarity is computed between the mit/app address field and each candidate parcel’s siteadd. The candidate(s) with the highest similarity score are retained. This is the decisive criterion for condo complexes where APN.SEQUENCE.NUMBER is 1 for every unit.
Stage 2 — tiebreaking. Among remaining ties (equal similarity scores), records are ranked by APN.SEQUENCE.NUMBER ascending (primary recorder entry first), then luc_tier descending (most specific land use), then improvval descending (parcel with a structure). A single winner is taken.

Non-duplicate geometry_ids bypass this logic entirely and are joined directly. The two paths are recombined before the geometry attachment and file write.

In [21]:

Show / hide code

n_mits_rem <- nrow(mits_rem)
n_apps_rem <- nrow(apps_rem)

rm(ug_high, ug_med, ug_low, ug_all,
   highpri, medpri, lowpri,
   mit_highpri, app_highpri, mit_medpri, app_medpri,
   mit_lowpri, app_lowpri, mit_dist, app_dist,
   unique_mit, unique_app, mits_rem, apps_rem)
gc()

            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  70250857 3751.9  189280591 10108.7  236600738 12635.9
Vcells 936630850 7146.0 2216893715 16913.6 2023658956 15439.3

In [22]:

Show / hide code

parcels_nogeom <- st_read(parcels_path, quiet = TRUE) |>
  st_drop_geometry() |>
  select(-any_of(c("vacant","gov_lu","res_lu","fr_1_4",
                   "qual_siteadd","qual_mailadd","qual_luc",
                   "qual_sum","match_priority")))
                   
# Reconstruct nc1_geoms for final geometry join (load-parcels is eval:false)
nc1_geoms <- st_read(geoms_path, quiet = TRUE,
                     query = 'SELECT parcel_index, geom FROM parcels_study_area') |>
  mutate(parcel_index = as.integer(parcel_index))

# Drop geometry from app/mit before joining
apps_nogeom <- app_sa_full |> st_drop_geometry()
mits_nogeom <- mit_sa_full |> st_drop_geometry()

In [23]:

Show / hide code

# --- Identify geometry_ids with multiple parcel rows ---
dup_geom_ids <- parcels_nogeom |>
  count(geometry_id) |>
  filter(n > 1) |>
  pull(geometry_id)

# --- Helper: pick the best parcel for one (record, geometry_id) pair ----------
# Priority:
#   1. Highest address similarity (Jaro-Winkler) between record address and siteadd
#   2. Lowest APN.SEQUENCE.NUMBER (primary recorder record)
#   3. Highest luc_tier (most specific land use)
#   4. Highest improvval (parcel has a structure)
pick_best_parcel <- function(records_nogeom, addr_col) {
  dup_pool <- parcels_nogeom |>
    filter(geometry_id %in% dup_geom_ids)

  records_nogeom |>
    filter(geometry_id %in% dup_geom_ids) |>
    select(unique_id, geometry_id, raw_addr = all_of(addr_col)) |>
    left_join(dup_pool, by = "geometry_id", relationship = "many-to-many") |>
    mutate(
      addr_sim = stringsim(
        toupper(coalesce(raw_addr, "")),
        toupper(coalesce(siteadd,  "")),
        method = "jw"
      )
    ) |>
    group_by(unique_id, geometry_id) |>
    slice_max(addr_sim,             n = 1, with_ties = TRUE) |>
    slice_min(APN.SEQUENCE.NUMBER,  n = 1, with_ties = TRUE) |>
    slice_max(luc_tier,             n = 1, with_ties = TRUE) |>
    slice_max(improvval,            n = 1, with_ties = FALSE) |>
    ungroup() |>
    select(unique_id, geometry_id, parcel_index)
}

app_winners <- pick_best_parcel(apps_nogeom |> mutate(unique_id = row_number()), "address")
mit_winners <- pick_best_parcel(mits_nogeom |> mutate(unique_id = row_number()), "raw_addr")

# --- Build deduplicated parcel tables scoped to each record type ---------------
# For non-dup geometry_ids: keep all parcel rows (1-to-1, no change)
# For dup geometry_ids:     keep only the winner row per (unique_id, geometry_id)
parcels_nogeom_app_dedup <- parcels_nogeom |>
  filter(!geometry_id %in% dup_geom_ids)   # non-dup rows

parcels_nogeom_mit_dedup <- parcels_nogeom_app_dedup  # same base

# Winner rows — join winner parcel_index back to full parcel attributes
app_dup_rows <- app_winners |>
  left_join(parcels_nogeom, by = c("geometry_id", "parcel_index"))

mit_dup_rows <- mit_winners |>
  left_join(parcels_nogeom, by = c("geometry_id", "parcel_index"))

In [24]:

Show / hide code

# Join mit/app → parcel attributes via geometry_id, then attach parcel geometry.
# Dup geometry_ids use the address-similarity winner; non-dup geometry_ids join
# normally (no many-to-many inflation).

join_dedup <- function(records_nogeom, dup_rows, id_col, nc1_geoms) {
  id_sym <- sym(id_col)

  non_dup <- records_nogeom |>
    mutate(.uid = row_number()) |>
    filter(!geometry_id %in% dup_geom_ids) |>
    left_join(parcels_nogeom_app_dedup, by = "geometry_id")

  dup <- records_nogeom |>
    mutate(.uid = row_number()) |>
    filter(geometry_id %in% dup_geom_ids) |>
    left_join(
      dup_rows |> select(unique_id, geometry_id, parcel_index,
                         any_of(names(parcels_nogeom))),
      by = c(".uid" = "unique_id", "geometry_id")
    )

  bind_rows(non_dup, dup) |>
    distinct() |>
    left_join(nc1_geoms |> select(parcel_index, geometry = geom),
              by = "parcel_index") |>
    select(-.uid) |>
    st_as_sf(crs = st_crs(nc1_geoms))
}

apps_pcls <- join_dedup(
  apps_nogeom |> mutate(unique_id = row_number()),
  app_dup_rows, "unique_id", nc1_geoms
)

mits_pcls <- join_dedup(
  mits_nogeom |> mutate(unique_id = row_number()),
  mit_dup_rows, "unique_id", nc1_geoms
)

st_write(apps_pcls, file.path(out_dir, "apps_pcls.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)
st_write(mits_pcls, file.path(out_dir, "mits_pcls.gpkg"),
         delete_dsn = TRUE, quiet = TRUE)

Summary

In [25]:

Show / hide code

pri_cols <- cols(g_indicator = col_integer(), geometry_id = col_integer(), distance = col_double())

mit_highpri <- read_csv(file.path(out_dir, "mit_highpri.csv"), col_types = pri_cols)
app_highpri <- read_csv(file.path(out_dir, "app_highpri.csv"), col_types = pri_cols)
mit_medpri  <- read_csv(file.path(out_dir, "mit_medpri.csv"),  col_types = pri_cols)
app_medpri  <- read_csv(file.path(out_dir, "app_medpri.csv"),  col_types = pri_cols)
mit_lowpri  <- read_csv(file.path(out_dir, "mit_lowpri.csv"),  col_types = pri_cols)
app_lowpri  <- read_csv(file.path(out_dir, "app_lowpri.csv"),  col_types = pri_cols)
mit_dist    <- read_csv(file.path(out_dir, "mit_dist.csv"),    col_types = pri_cols)
app_dist    <- read_csv(file.path(out_dir, "app_dist.csv"),    col_types = pri_cols)

In [26]:

Show / hide code

tibble::tibble(
  Type        = c("Mitigations", "Applications"),
  Total       = c(nrow(mit_sa),      nrow(app_sa)),
  High        = c(nrow(mit_highpri), nrow(app_highpri)),
  Med         = c(nrow(mit_medpri),  nrow(app_medpri)),
  Low         = c(nrow(mit_lowpri),  nrow(app_lowpri)),
  Fallback_NN = c(nrow(mit_dist),    nrow(app_dist)),
  Unmatched   = c(n_mits_rem,    n_apps_rem)
) |>
  dplyr::mutate(
    Matched = High + Med + Low + Fallback_NN
  ) |>
  dplyr::select(Type, Total, High, Med, Low, Fallback_NN, Matched, Unmatched) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    col.names   = c("Type", "Total", "High", "Med", "Low", "Fallback NN",
                    "Total Matched", "Unmatched")
  )

Type	Total	High	Med	Low	Fallback NN	Total Matched	Unmatched
Mitigations	8,801	6,173	1,031	146	5	7,355	5
Applications	23,232	8,674	1,243	364	15	10,296	15

Comparison with previous analysis

The table below compares final matched parcel counts between this run and the prior team’s Python-based pipeline. Differences stem from five sources:

Input record counts — the source .gpkg files have been updated since the prior run (mitigations: 8,801 vs. 8,803; applications: 23,232 vs. 23,256).
Priority scoring revision — the original gov LUC list had a missing comma bug (['604' '603', '601']), meaning no parcel ever matched that category. The revised scheme also separates 1–4 family residential into its own tier and weights site address 2× mailing address, producing a cleaner high/med/low split.
Geometry deduplication method — the prior script assigned g_indicator via groupby('geometry').ngroup(), which groups on WKT string representations. Two geometries that are visually identical but differ by sub-micron floating-point precision will receive different group IDs, potentially inflating unique geometry counts. This pipeline uses factor(as.character(st_geometry(.))), which is consistent with how R represents geometries and avoids this edge case.
Low-priority parcel pool asymmetry — the prior script correctly excluded high-priority geometry IDs from the medium-priority pool, but failed to exclude medium-priority IDs from the low-priority pool. This slightly inflated the low-priority candidate set, allowing some geometries to be matched twice across tiers. This pipeline uses sequential remainder sets (mits_rem / apps_rem) so each geometry can only match in one tier.
Parcel duplicate preservation — the Python pipeline’s final merge(apps, parcels, on='geometry_id') naturally re-inflates matched records across all parcel rows that share a geometry_id (i.e. a common physical footprint).
Duplicate-geometry deduplication — this pipeline resolves the 39 geometry_ids with multiple candidate parcel rows by selecting a single winner per record using address similarity (Jaro-Winkler on siteadd) with APN.SEQUENCE.NUMBER, luc_tier, and improvval as tiebreakers. The prior pipeline retained all matching rows, inflating parcel counts.

In [27]:

Show / hide code

n_mit_pcls <- n_distinct(mits_pcls$parcel_index)
n_app_pcls <- n_distinct(apps_pcls$parcel_index)
n_both     <- length(intersect(mits_pcls$parcel_index, apps_pcls$parcel_index))
pct_both   <- round(n_both / n_app_pcls * 100, 1)

tibble::tibble(
  Metric      = c("Parcels with a mitigation",
                  "Parcels with an application",
                  "Parcels with both (mit \u2229 app)",
                  "Share of app parcels with a mit"),
  `Prior run` = c("6,087", "8,782", "4,292", "48.9%"),
  `This run`  = c(
    format(n_mit_pcls, big.mark = ","),
    format(n_app_pcls, big.mark = ","),
    format(n_both,     big.mark = ","),
    paste0(pct_both, "%")
  )
) |>
  knitr::kable(align = c("l", "r", "r"))

Metric	Prior run	This run
Parcels with a mitigation	6,087	5,836
Parcels with an application	8,782	8,523
Parcels with both (mit ∩ app)	4,292	4,137
Share of app parcels with a mit	48.9%	48.5%

Despite the count differences, the share of application parcels that also received a mitigation is nearly identical across both runs (48.9% vs. 48.5%), suggesting the matching logic is consistent — the absolute differences are driven by input data updates and the scoring, deduplication, and parcel duplicate-preservation fixes above.