Step 7: Flood Event & Eligibility Join (from existing Python CSVs)

Overview

This notebook bridges Helena’s pre-existing flood event data onto R’s parcel_index, producing per-parcel flood and eligibility counts that match the Julian’s reported 67.5% eligibility ratio.

Data source decision. We use the pre-aggregated parcels_events.csv produced by Julian and then crosswalked to R’s parcel_index via (cntyfips, parno).

Inputs:

parcels_pri.gpkg — R’s parcels (from Step 5)
/proj/.../Raw Data Inputs/parcels_study_area.csv — Python’s parcel index registry, used to build the (cntyfips, parno) crosswalk
/proj/.../Intermediary Data/parcels_events.csv — pre-aggregated flood and eligibility columns per Python parcel_index, from Julian

Outputs (written to out_dir):

parcel_crosswalk_R_Py.csv — Python parcel_index ↔︎ R parcel_index crosswalk for future reference
parcels_events_R.csv — final per-parcel flood and eligibility table, keyed on R’s parcel_index, ready to join to parcels_pri.gpkg

Methodology

Why a crosswalk is necessary. Python’s parcel_index was created as the row position in pandas’s read of the source NC1 file, after Python-specific filtering and deduplication. R’s parcel_index was created independently during the R Step 1 pipeline. The two sets diverge: R has ~29k more parcels and same numeric indices refer to different parcels.

Crosswalk key choice. (cntyfips, parno) is the most stable real-world parcel identifier across both pipelines. This uses a normalized parno (alphanumeric only, matching Step 3’s strip_alnum) for the primary crosswalk and altparno as a fallback for R parcels that didn’t match on parno.

Flood and eligibility columns. The Original parcels_events.csv already contains pre-computed flooded and eligible counts (and per-event _max columns) following Python Step 12’s logic. This translates these directly to R’s parcel_index space via the crosswalk.

In [1]:

Show / hide code

library(sf)

Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE

Show / hide code

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Show / hide code

library(readr)
library(stringr)
library(tidyr)
library(knitr)

In [2]:

Show / hide code

out_dir            <- "/proj/mhinolab/users/rbless/data/Obstacles_Output"
parcels_path       <- file.path(out_dir, "parcels_pri.gpkg")

py_parcels_csv     <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Raw Data Inputs/parcels_study_area.csv"
aggregated_events_path <- "/proj/mhinolab/projects/obstacles2/CRC_May_2025/Intermediary Data/parcels_events.csv"

xwalk_out_path  <- file.path(out_dir, "parcel_crosswalk_R_Py.csv")
events_out_path <- file.path(out_dir, "parcels_events_R.csv")

# Helper: alphanumeric-only normalization, matches Step 3's strip_alnum
strip_alnum <- function(x) str_replace_all(replace_na(x, ""), "[^a-zA-Z0-9]", "")

1 Build the R ↔︎ Python parcel crosswalk

Python’s aggregated event CSV is keyed on Python’s parcel_index. We need to translate to R’s parcel_index so the flood data can be joined back to parcels_pri.gpkg. The join key is (cntyfips, parno), normalized alphanumerically to handle formatting differences.

In [3]:

Show / hide code

# Load R parcels
r_parcels <- sf::st_read(parcels_path, quiet = TRUE) |>
  sf::st_drop_geometry() |>
  dplyr::select(r_parcel_index = parcel_index, cntyfips, parno, altparno, fr_1_4) |>
  dplyr::mutate(
    cntyfips      = as.character(cntyfips),
    parno_norm    = strip_alnum(parno),
    altparno_norm = strip_alnum(altparno)
  )

# Load Python parcels registry
py_parcels <- read_csv(py_parcels_csv,
                       col_types = cols(.default = col_character())) |>
  dplyr::select(py_parcel_index = parcel_index, cntyfips, parno, altparno) |>
  dplyr::mutate(
    py_parcel_index = as.integer(py_parcel_index),
    cntyfips        = as.character(cntyfips),
    parno_norm      = strip_alnum(parno),
    altparno_norm   = strip_alnum(altparno)
  )

# Normalize cntyfips: Python may store "063", R uses "37063". Add prefix if needed.
if (!any(str_detect(py_parcels$cntyfips, "^37"), na.rm = TRUE)) {
  py_parcels <- py_parcels |> mutate(cntyfips = paste0("37", cntyfips))
}

cat("R parcels:    ", nrow(r_parcels),  "rows\n")

R parcels:     4216239 rows

Show / hide code

cat("Python parcels:", nrow(py_parcels), "rows\n")

Python parcels: 4187506 rows

Show / hide code

cat("Sample R cntyfips: ", head(unique(r_parcels$cntyfips), 5),  "\n")

Sample R cntyfips:  37001 37007 37057 37059 37009

Show / hide code

cat("Sample Py cntyfips:", head(unique(py_parcels$cntyfips), 5), "\n")

Sample Py cntyfips: 37019 37129 37047 37017 37141

In [4]:

Show / hide code

# Primary crosswalk: (cntyfips, parno_norm)
xwalk_primary <- r_parcels |>
  filter(parno_norm != "") |>
  inner_join(
    py_parcels |> filter(parno_norm != "") |>
      select(py_parcel_index, cntyfips, parno_norm),
    by = c("cntyfips", "parno_norm"),
    relationship = "many-to-many"
  ) |>
  # If multiple Python parcels match one R parcel, take the first
  group_by(r_parcel_index) |> slice(1) |> ungroup() |>
  select(r_parcel_index, py_parcel_index, cntyfips) |>
  mutate(match_method = "parno")

cat("Primary crosswalk (parno):", nrow(xwalk_primary), "matches\n")

Primary crosswalk (parno): 3661906 matches

In [5]:

Show / hide code

# Secondary crosswalk: altparno fallback for R parcels not matched via parno
r_unmatched <- r_parcels |>
  anti_join(xwalk_primary |> select(r_parcel_index), by = "r_parcel_index") |>
  filter(altparno_norm != "")

xwalk_secondary <- r_unmatched |>
  inner_join(
    py_parcels |> filter(altparno_norm != "") |>
      select(py_parcel_index, cntyfips, altparno_norm),
    by = c("cntyfips", "altparno_norm"),
    relationship = "many-to-many"
  ) |>
  group_by(r_parcel_index) |> slice(1) |> ungroup() |>
  select(r_parcel_index, py_parcel_index, cntyfips) |>
  mutate(match_method = "altparno")

cat("Secondary crosswalk (altparno):", nrow(xwalk_secondary), "matches\n")

Secondary crosswalk (altparno): 379339 matches

In [6]:

Show / hide code

xwalk <- bind_rows(xwalk_primary, xwalk_secondary)

cat("\nFinal crosswalk:\n")


Final crosswalk:

Show / hide code

cat("  Total R parcels:                ", nrow(r_parcels), "\n")

  Total R parcels:                 4216239

Show / hide code

cat("  R parcels with Py counterpart:  ", nrow(xwalk), "\n")

  R parcels with Py counterpart:   4041245

Show / hide code

cat("  Coverage:                       ",
    sprintf("%.1f%%", 100 * nrow(xwalk) / nrow(r_parcels)), "\n\n")

  Coverage:                        95.8%

Show / hide code

xwalk |> count(match_method) |> print()

# A tibble: 2 × 2
  match_method       n
  <chr>          <int>
1 altparno      379339
2 parno        3661906

Show / hide code

# Write crosswalk for reference / debugging
write_csv(xwalk, xwalk_out_path)
cat("\nCrosswalk saved to:", xwalk_out_path, "\n")


Crosswalk saved to: /proj/mhinolab/users/rbless/data/Obstacles_Output/parcel_crosswalk_R_Py.csv

2 Crosswalk the Original aggregated event data to R parcel space

Read parcels_events.csv and translate its Python parcel_index keys to R’s parcel_index via the crosswalk built above. The flooded and eligible columns in this file are pre-computed and used as-is.

In [7]:

Show / hide code

py_events_aggregated <- read_csv(
  aggregated_events_path,
  col_types = cols(.default = col_character())
) |>
  mutate(parcel_index = as.integer(parcel_index))

post_merge <- xwalk |>
  select(r_parcel_index, py_parcel_index) |>
  inner_join(py_events_aggregated, by = c("py_parcel_index" = "parcel_index")) |>
  select(-py_parcel_index) |>
  left_join(
    r_parcels |> select(r_parcel_index, cntyfips, fr_1_4),
    by = "r_parcel_index"
  )

cat("py_events_aggregated rows: ", nrow(py_events_aggregated), "\n")

py_events_aggregated rows:  4187506

Show / hide code

cat("post_merge rows:           ", nrow(post_merge), "\n")

post_merge rows:            4041245

Show / hide code

cat("post_merge cols:           ", ncol(post_merge), "\n")

post_merge cols:            5

Show / hide code

cat("Crosswalk match rate:      ",
    sprintf("%.1f%%", 100 * nrow(post_merge) / nrow(py_events_aggregated)), "\n")

Crosswalk match rate:       96.5%

3 Sanity check vs paper

In [8]:

Show / hide code

# Coerce flooded / eligible to numeric — they were read as character
post_merge <- post_merge |>
  mutate(
    flooded  = as.integer(flooded_any),
    eligible = as.integer(eligible)
  )

flooded_n  <- sum(post_merge$flooded  > 0, na.rm = TRUE)
eligible_n <- sum(post_merge$eligible > 0, na.rm = TRUE)

# Residential-only subset (fr_1_4 == "1" in the Original CSV)
res_rows   <- post_merge |> filter(as.integer(fr_1_4) == 1)
res_flooded_n  <- sum(res_rows$flooded_any  > 0, na.rm = TRUE)
res_eligible_n <- sum(res_rows$eligible > 0, na.rm = TRUE)

sanity_tbl <- tibble(
  Metric = c(
    "Total parcels (crosswalked)",
    "Parcels flooded ≥ 1 time",
    "Parcels eligible ≥ 1 declared event",
    "Eligible / Flooded — all parcels",
    "Eligible / Flooded — residential only (fr_1_4 == 1)",
    "Paper reports"
  ),
  Value = c(
    scales::comma(nrow(post_merge)),
    scales::comma(flooded_n),
    scales::comma(eligible_n),
    sprintf("%.1f%%", 100 * eligible_n / flooded_n),
    sprintf("%.1f%%", 100 * res_eligible_n / res_flooded_n),
    "67.5%"
  )
)

knitr::kable(sanity_tbl, caption = "Sanity check: eligibility rate vs paper claim")

Sanity check: eligibility rate vs paper claim
Metric	Value
Total parcels (crosswalked)	4,041,245
Parcels flooded ≥ 1 time	372,779
Parcels eligible ≥ 1 declared event	256,376
Eligible / Flooded — all parcels	68.8%
Eligible / Flooded — residential only (fr_1_4 == 1)	69.4%
Paper reports	67.5%

Expected: all-parcel ratio ~69.0%, residential ~69.4%. If either ratio is far from these values, check that aggregated_events_path points to the correct parcel_events file.

4 Write the final parcel-event table

In [9]:

Show / hide code

write_csv(post_merge, events_out_path)

cat("Wrote:", events_out_path, "\n")

Wrote: /proj/mhinolab/users/rbless/data/Obstacles_Output/parcels_events_R.csv

Show / hide code

cat("Final dimensions:", nrow(post_merge), "rows ×",
    ncol(post_merge), "cols\n")

Final dimensions: 4041245 rows × 6 cols

5 Join straight into `parcels_pri.gpkg`

This creates a single file with parcels + flood data combined, ready for downstream analysis:

In [10]:

Show / hide code


# Load R parcels, join flood data, write combined output
parcels_pri <- sf::st_read(parcels_path, quiet = TRUE)

flood_data <- read_csv(events_out_path,
                       col_types = cols(.default = col_character())) |>
  mutate(
    r_parcel_index = as.integer(r_parcel_index),
    flooded        = as.integer(flooded),
    eligible       = as.integer(eligible)
  ) |>
  rename(parcel_index = r_parcel_index) |>
  select(-cntyfips)   # already in parcels_pri

parcels_with_floods <- parcels_pri |>
  left_join(flood_data, by = "parcel_index")

sf::st_write(parcels_with_floods,
             file.path(out_dir, "parcels_pri_with_floods.gpkg"),
             delete_dsn = TRUE, quiet = TRUE)

cat("Wrote: parcels_pri_with_floods.gpkg with",
    sum(!is.na(parcels_with_floods$flooded) & as.integer(parcels_with_floods$flooded) > 0),
    "parcels flagged as flooded\n")

6 Build the master parcel dataset

Single per-parcel table consolidating everything Step 8 needs: flood + eligibility (Section 5 above), interpolation + value (Step 6), applied/funded flags (Step 5 priority outputs), centroid coords + EIF cell key, and per-cell EIF demographics (Step 6’s eif_cells_summary.csv).

Prerequisite. Run the Step 6 per-cell summary chunk first so eif_cells_summary.csv exists in out_dir. Without it, the cell demographics join silently produces NAs.

Inputs:

parcels_pri_with_floods.gpkg — Section 5 above
interpolated_parcels_final.csv — Step 6
apps_pcls.gpkg, mits_pcls.gpkg — Step 5 priority outputs
eif_cells_summary.csv — Step 6 per-cell summary

Output (written to out_dir):

parcels_master.gpkg — one row per parcel, all downstream analytical fields joined and resolved

6.1 Setup

In [11]:

Show / hide code

interp_path    <- file.path(out_dir, "interpolated_parcels_final.csv")
apps_path      <- file.path(out_dir, "apps_pcls.gpkg")
mits_path      <- file.path(out_dir, "mits_pcls.gpkg")
floods_path    <- file.path(out_dir, "parcels_pri_with_floods.gpkg")
eif_cells_path <- file.path(out_dir, "eif_cells_summary.csv")
master_path    <- file.path(out_dir, "parcels_master.gpkg")

6.2 Parcel centroids + EIF cell key

Centroids in WGS84 are the canonical parcel location. CoreLogic’s PARCEL.LEVEL.LATITUDE / LONGITUDE are not used — they have ~34% missingness for late-pipeline parcels (completed buyouts, voided addresses) and known precision issues. The EIF cell key uses Step 6’s convention (cell-center, 3-decimal rounded) so the join to eif_cells_summary.csv works directly.

In [12]:

Show / hide code

parcels_sf <- st_read(floods_path, quiet = TRUE)

centroids <- suppressWarnings(
  parcels_sf |>
    st_transform(4326) |>
    st_centroid() |>
    st_coordinates() |>
    as_tibble() |>
    rename(cent_lng = X, cent_lat = Y)
) |>
  mutate(
    eif_lon = round(floor(cent_lng * 100) / 100 + 0.005, 3),
    eif_lat = round(floor(cent_lat * 100) / 100 + 0.005, 3)
  )

cat("Centroids computed:", nrow(centroids), "\n")

Centroids computed: 4216239

6.3 Resolve duplicate residential flag

Section 5’s join leaves fr_1_4.x (from parcels_pri) and fr_1_4.y (from parcels_events.csv). Collapse to one fr_1_4 column.

In [13]:

Show / hide code

parcels_attr <- parcels_sf |>
  st_drop_geometry() |>
  bind_cols(centroids) |>
  mutate(fr_1_4 = coalesce(
    suppressWarnings(as.integer(fr_1_4.x)),
    suppressWarnings(as.integer(fr_1_4.y))
  )) |>
  select(-any_of(c("fr_1_4.x", "fr_1_4.y")))

6.4 Load Step 6 interpolation + Step 5 applied/funded flags

In [14]:

Show / hide code

# Interpolation flag, prop_value_5, mit/app counts, gov_owned (Step 6)
interp <- read_csv(
  interp_path,
  col_select = c("parcel_index", "interpolate", "prop_value_5",
                 "mit_count", "app_count", "gov_owned"),
  col_types  = cols(.default = col_character()),
  show_col_types = FALSE
) |>
  transmute(
    parcel_index = as.integer(parcel_index),
    interpolate  = suppressWarnings(as.integer(interpolate)),
    prop_value_5 = suppressWarnings(as.numeric(prop_value_5)),
    mit_count    = suppressWarnings(as.integer(mit_count)),
    app_count    = suppressWarnings(as.integer(app_count)),
    gov_owned    = suppressWarnings(as.integer(gov_owned))
  )

# Applied / funded flags from Step 5 priority outputs
applied_set <- st_read(apps_path, quiet = TRUE) |>
  st_drop_geometry() |>
  distinct(parcel_index) |>
  mutate(parcel_index = as.integer(parcel_index), applied = TRUE)

funded_set <- st_read(mits_path, quiet = TRUE) |>
  st_drop_geometry() |>
  distinct(parcel_index) |>
  mutate(parcel_index = as.integer(parcel_index), funded = TRUE)

# EIF cell demographics (Step 6 per-cell summary)
eif_cells <- read_csv(eif_cells_path, show_col_types = FALSE)

6.5 Assemble the master file

In [15]:

Show / hide code

master <- parcels_attr |>
  mutate(parcel_index = as.integer(parcel_index)) |>
  left_join(interp,      by = "parcel_index") |>
  left_join(applied_set, by = "parcel_index") |>
  left_join(funded_set,  by = "parcel_index") |>
  left_join(eif_cells,   by = c("eif_lat", "eif_lon")) |>
  mutate(
    interpolate = coalesce(interpolate, 0L),
    mit_count   = coalesce(mit_count,   0L),
    app_count   = coalesce(app_count,   0L),
    gov_owned   = coalesce(gov_owned,   0L),
    applied     = coalesce(applied,     FALSE),
    funded      = coalesce(funded,      FALSE),
    # Final value: prop_value_5 for buyout candidates, TVC otherwise.
    # prop_value_5 already falls back to TVC for non-recipients, so this
    # collapses to just `value = prop_value_5` — kept conditional for clarity.
    value = if_else(
      interpolate == 1L,
      prop_value_5,
      suppressWarnings(as.numeric(TOTAL.VALUE.CALCULATED))
    )
  ) |>
  # Re-attach geometry from parcels_sf (same row order)
  bind_cols(geometry = st_geometry(parcels_sf)) |>
  st_as_sf(crs = st_crs(parcels_sf))

6.6 Sanity check

In [16]:

Show / hide code

cat("master parcels:", format(nrow(master), big.mark = ","), "\n")

master parcels: 4,216,239

Show / hide code

cat("  fr_1_4 == 1:     ",
    format(sum(master$fr_1_4 == 1L,  na.rm = TRUE), big.mark = ","), "\n")

  fr_1_4 == 1:      3,134,408

Show / hide code

cat("  interpolate==1:  ",
    format(sum(master$interpolate == 1L, na.rm = TRUE), big.mark = ","), "\n")

  interpolate==1:   5,456

Show / hide code

cat("  applied:         ",
    format(sum(master$applied,  na.rm = TRUE), big.mark = ","), "\n")

  applied:          8,523

Show / hide code

cat("  funded:          ",
    format(sum(master$funded,   na.rm = TRUE), big.mark = ","), "\n")

  funded:           5,836

Show / hide code

cat("  centroid coords valid: ",
    format(sum(!is.na(master$cent_lat) & !is.na(master$cent_lng)), big.mark = ","), "\n")

  centroid coords valid:  4,216,239

Show / hide code

cat("  matched to EIF demographics: ",
    format(sum(!is.na(master$pct_white_2020)), big.mark = ","), "of",
    format(nrow(master), big.mark = ","), "\n")

  matched to EIF demographics:  4,128,101 of 4,216,239

6.7 Write

In [17]:

Show / hide code

st_write(master, master_path, delete_dsn = TRUE, quiet = TRUE)
cat("Wrote:", master_path, "—", ncol(master), "cols\n")

Wrote: /proj/mhinolab/users/rbless/data/Obstacles_Output/parcels_master.gpkg — 103 cols

Output schema — `parcels_master.gpkg`

One row per parcel; downstream consumers (Step 8 figures) read only this file.

Column group	Columns	Source
Identifier	`parcel_index`, `cntyfips`, `parno`, `altparno`, `siteadd`, `SITUS.CITY`	parcels_pri (Step 5)
Geometry	parcel polygon + `cent_lat`, `cent_lng`	computed centroid (this step)
EIF cell key	`eif_lat`, `eif_lon`	derived from centroid, Step 6 convention
LUC	`LAND.USE.CODE`, `vacant`, `gov_lu`, `res_lu`, `fr_1_4` (deduped)	parcels_pri
Priority	`luc_tier`, `qual_sum`, `match_priority`	parcels_pri
Flood	`flooded_any`, `eligible`	Section 5 above
Interpolation	`interpolate`, `prop_value_5`, `mit_count`, `app_count`, `gov_owned`	Step 6
Final value	`value` (resolved: prop_value_5 if interpolate else TVC)	derived (this step)
Pipeline flags	`applied`, `funded` (logicals)	Step 5 priority outputs
EIF demographics	`pop_total_{1999,2020}`, `pct_white_{1999,2020}`, `pct_black_{1999,2020}`, `pct_hispanic_{1999,2020}`, `pct_asian_{1999,2020}`, `pct_aian_{1999,2020}`	Step 6 per-cell summary

Step 7: Flood Event & Eligibility Join (from existing Python CSVs)

Overview

Methodology

1 Build the R ↔︎ Python parcel crosswalk

2 Crosswalk the Original aggregated event data to R parcel space

3 Sanity check vs paper

4 Write the final parcel-event table

5 Join straight into parcels_pri.gpkg

6 Build the master parcel dataset

6.1 Setup

6.2 Parcel centroids + EIF cell key

6.3 Resolve duplicate residential flag

6.4 Load Step 6 interpolation + Step 5 applied/funded flags

6.5 Assemble the master file

6.6 Sanity check

6.7 Write

Output schema — parcels_master.gpkg

5 Join straight into `parcels_pri.gpkg`

Output schema — `parcels_master.gpkg`