This notebook spatially matches mitigation and application records to parcels using a tiered priority scheme. Parcels are scored and ranked as high, medium, or low priority for matching based on land use code and address quality. For each priority tier, mitigation/application geometries are first matched via direct intersection with parcel polygons, then by nearest-neighbor distance (≤ 50 m) for those that do not intersect. Unmatched records after all three tiers fall back to a global nearest-neighbor match across all parcels.
The following object is masked from 'package:stringdist':
extract
In [2]:
Show / hide code
out_dir <-"/proj/mhinolab/users/rbless/data/Obstacles_Output"# External data pathsflood_boundary_path <-"/proj/mhinolab/projects/obstacles/NC_huc6_data_union_sf/NC_huc6_data_union_sf.shp"mit_path <-"/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_newgeo.gpkg"mit_elev_path <-"/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/mit_elevation_qaqc_filled.csv"app_path <-"/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/raw_app_ncb/applications_final.gpkg"geoms_path <-file.path(out_dir, "parcels_study_area.gpkg")
The original script collapsed vacant, government-owned, and all residential parcels into a single qual_luc = 3 bucket, meaning a 1–4 family home and a municipal lot were treated identically. It also weighted site address and mailing address equally (1 pt each), and the composite qual_sum only ranged 0–5, making the high/med/low thresholds somewhat arbitrary.
The revised scheme, with hindsight-specific elevation for completed buyouts:
LUC group
luc_tier
match_priority
Rationale
Completed buyouts (luc_buyout, LUC 602)
3
high (always)
Hindsight: confirmed historical buyout — parcel converted to exempt status because of the program we’re studying
1–4 family residential (luc_fr_1_4)
4
high if site address present
Primary buyout/mitigation candidate; includes mobile/manufactured/modular housing per FEMA HMA priorities
Site address is weighted 2 pts (more useful for geocoding/matching than mailing address), mailing address 1 pt, giving a qual_sum range of 0–7 and more granular priority tiers.
luc_fr_1_4 membership reflects FEMA HMA program priorities rather than CoreLogic’s literal numeric subgroupings:
LUC 111 (“Mobile home”), 112 (“Manufactured home”), 113 (“Modular home”) are included to reflect FEMA’s documented emphasis on mobile/manufactured/ modular housing in flood-prone areas.
LUC 133 (“Triplex”) and 134 (“Quadruplex”) added for definitional consistency with the “1–4 family” label — both are 2–4 unit residential and were missing from the original list.
Code list corrections from the original Python:
The gov list had a missing comma bug (['604' '603', '601'] → the string '604603'), so no parcel ever matched it. Fixed here.
LUC 600 (“E-Exempt”) added to luc_gov — parallels the existing 601/603/604 exempt-government codes.
LUC 283 (CoreLogic commercial) added to luc_bottom.
LUC 602 (“Exempt Governmental (Flood Buyout)”) added as a separate luc_buyout flag. Treated as tier 3 in luc_tier (current state = exempt government) and elevated to match_priority = "high" for the hindsight analysis — a mit/app record landing on a 602 parcel is a confirmed historical buyout, the strongest possible match signal.
9 Merge geometry tags back to full mit / app datasets
In [20]:
Show / hide code
mit_geom_full <-bind_rows(mit_highpri, mit_medpri, mit_lowpri, mit_dist)app_geom_full <-bind_rows(app_highpri, app_medpri, app_lowpri, app_dist)mit_sa_full <- mit_sa |>mutate(unique_id =row_number()) |>left_join(mit_geom_full, by ="g_indicator")app_sa_full <- app_sa |>mutate(unique_id =row_number()) |>left_join(app_geom_full, by ="g_indicator")# Save geometry-tagged full datasets before QAst_write(app_sa_full, file.path(out_dir, "app_sa_full_noqa.gpkg"),delete_dsn =TRUE, quiet =TRUE)st_write(mit_sa_full, file.path(out_dir, "mit_sa_full_noqa.gpkg"),delete_dsn =TRUE, quiet =TRUE)
10 Join back to parcels
The spatial matching in steps 1–9 assigns each mit/app record a geometry_id — the identifier of the parcel polygon it fell inside or nearest to. The goal here is to recover the full parcel attribute table (owner, land use, address, improvement value, etc.) and the parcel geometry for every matched record.
The complication is that geometry_id is not a unique key into parcels_nogeom. Two situations produce a many-to-many relationship:
Condo / multi-unit complexes. Each individual unit (e.g., the 95 units at 9620–9630 Vinca Circle) is a separate legal parcel with its own APN, but all units are registered against the same polygon, so they all share one geometry_id. The mit/app address (e.g., “9624 VINCA CIR APT C”) identifies one specific unit, not the whole complex.
Recorder duplicates. A small number of geometry_ids have 2–6 parcel rows that differ only in APN.SEQUENCE.NUMBER. The row with APN.SEQUENCE.NUMBER = 1 is the primary (canonical) recorder entry; the others are supplemental filings.
To resolve both cases without inflating the output row count, a two-stage selection is applied within each duplicate geometry_id group:
Stage 1 — address similarity. Jaro-Winkler string similarity is computed between the mit/app address field and each candidate parcel’s siteadd. The candidate(s) with the highest similarity score are retained. This is the decisive criterion for condo complexes where APN.SEQUENCE.NUMBER is 1 for every unit.
Stage 2 — tiebreaking. Among remaining ties (equal similarity scores), records are ranked by APN.SEQUENCE.NUMBER ascending (primary recorder entry first), then luc_tier descending (most specific land use), then improvval descending (parcel with a structure). A single winner is taken.
Non-duplicate geometry_ids bypass this logic entirely and are joined directly. The two paths are recombined before the geometry attachment and file write.
The table below compares final matched parcel counts between this run and the prior team’s Python-based pipeline. Differences stem from five sources:
Input record counts — the source .gpkg files have been updated since the prior run (mitigations: 8,801 vs. 8,803; applications: 23,232 vs. 23,256).
Priority scoring revision — the original gov LUC list had a missing comma bug (['604' '603', '601']), meaning no parcel ever matched that category. The revised scheme also separates 1–4 family residential into its own tier and weights site address 2× mailing address, producing a cleaner high/med/low split.
Geometry deduplication method — the prior script assigned g_indicator via groupby('geometry').ngroup(), which groups on WKT string representations. Two geometries that are visually identical but differ by sub-micron floating-point precision will receive different group IDs, potentially inflating unique geometry counts. This pipeline uses factor(as.character(st_geometry(.))), which is consistent with how R represents geometries and avoids this edge case.
Low-priority parcel pool asymmetry — the prior script correctly excluded high-priority geometry IDs from the medium-priority pool, but failed to exclude medium-priority IDs from the low-priority pool. This slightly inflated the low-priority candidate set, allowing some geometries to be matched twice across tiers. This pipeline uses sequential remainder sets (mits_rem / apps_rem) so each geometry can only match in one tier.
Parcel duplicate preservation — the Python pipeline’s final merge(apps, parcels, on='geometry_id') naturally re-inflates matched records across all parcel rows that share a geometry_id (i.e. a common physical footprint).
Duplicate-geometry deduplication — this pipeline resolves the 39 geometry_ids with multiple candidate parcel rows by selecting a single winner per record using address similarity (Jaro-Winkler on siteadd) with APN.SEQUENCE.NUMBER, luc_tier, and improvval as tiebreakers. The prior pipeline retained all matching rows, inflating parcel counts.
In [27]:
Show / hide code
n_mit_pcls <-n_distinct(mits_pcls$parcel_index)n_app_pcls <-n_distinct(apps_pcls$parcel_index)n_both <-length(intersect(mits_pcls$parcel_index, apps_pcls$parcel_index))pct_both <-round(n_both / n_app_pcls *100, 1)tibble::tibble(Metric =c("Parcels with a mitigation","Parcels with an application","Parcels with both (mit \u2229 app)","Share of app parcels with a mit"),`Prior run`=c("6,087", "8,782", "4,292", "48.9%"),`This run`=c(format(n_mit_pcls, big.mark =","),format(n_app_pcls, big.mark =","),format(n_both, big.mark =","),paste0(pct_both, "%") )) |> knitr::kable(align =c("l", "r", "r"))
Metric
Prior run
This run
Parcels with a mitigation
6,087
5,836
Parcels with an application
8,782
8,523
Parcels with both (mit ∩ app)
4,292
4,137
Share of app parcels with a mit
48.9%
48.5%
Despite the count differences, the share of application parcels that also received a mitigation is nearly identical across both runs (48.9% vs. 48.5%), suggesting the matching logic is consistent — the absolute differences are driven by input data updates and the scoring, deduplication, and parcel duplicate-preservation fixes above.