Step 6: Property Value Interpolation & EIF Census Enrichment
Author
Russell Blessing
Overview
This notebook does two related things:
Property value interpolation. For parcels matched to a mitigation or application record but carrying an unreliable TOTAL.VALUE.CALCULATED (vacant land, government-owned, anomalously low values), estimate a market-comparable value by averaging the recorded values of the nearest 1–4 family residential donor parcels. Three estimates per recipient (k = 5, 10, 50 nearest neighbors).
EIF census enrichment. For each mit/app point, attach Census Bureau Gridded EIF attributes from two release years (1999 and 2020) at the 0.01° grid cell containing that point. Four EIF files are joined:
Population × age × race × sex — 1999 and 2020 (prefixes ars_1999_, ars_2020_)
Population × race × income decile — 1999 and 2020 (prefixes ri_1999_, ri_2020_)
The 1999 and 2020 baselines bracket the project’s main disaster history window, letting downstream analysis pull contextual demographics from a year close to each mit/app’s program record (e.g., a 1999 Hurricane Floyd buyout record can use ars_1999_* columns; later records can use the 2020 baseline).
Both stages consume the alt Step 5 resolved outputs as the single source of mit/app data. Each record carries an assigned_parcel_index (resolved to one parcel per record, no fan-out) and a point geometry. The interpolation uses assigned_parcel_index to count mits/apps per parcel; the EIF enrichment uses the point geometry to snap each record to its grid cell.
Inputs:
parcels_pri.gpkg — parcels with priority scoring (canonical Step 5)
alt_mits_resolved.gpkg — point-geometry mits, one row per record, resolved to one assigned parcel each (alt Step 5)
alt_apps_resolved.gpkg — point-geometry apps, same structure (alt Step 5)
Four Gridded EIF parquet files (downloaded on first run)
Outputs (written to out_dir):
interpolate_to.gpkg, interpolate_from.gpkg — recipient and donor pools
interpolated_vals.gpkg, interpolated_parcels_final.csv — parcels with interpolated values
mits_with_eif.gpkg, apps_with_eif.gpkg — alt resolved outputs with EIF cell attributes
mits_with_eif.csv, apps_with_eif.csv — same, no geometry
Methodology
Recipients (parcels with unreliable recorded property values): parcels that have at least one mit or app assigned to them by alt Step 5 AND whose recorded TOTAL.VALUE.CALCULATED is treated as unreliable. A value is treated as unreliable if the parcel is:
vacant land (vacant == 1) — no structure, recorded value reflects land only
government land use (gov_lu == 1) — typically tax-exempt with $0 assessed
government-owned by owner name (gov_owned == 1) — same
low recorded value (TOTAL.VALUE.CALCULATED < $25,000) — anomalously low, likely indicates undervaluation, recent transfer, or data quality issue
Donors (parcels providing value estimates): 1–4 family residential, NOT government-owned, with non-missing recorded total value above $25,000.
Mit/app per-parcel counts are computed from assigned_parcel_index in the alt resolved outputs — each unique mit/app contributes to exactly one parcel. This avoids the fan-out double-counting that would happen if we counted from canonical mits_pcls.gpkg (where a mit hitting N parcels appears in N rows). Records with assignment_method == "unassigned" are excluded from counts.
Government-ownership detection uses a regex on OWNER.1.FULL.NAME matching common government-entity keywords (CITY, STATE, TOWN, VILLAGE, COUNTY, METROPOLITAN, plus abbreviations) excluding INC and LLC.
EIF cell join uses cell-membership rather than spatial intersection — each mit/app point’s lat/lon is snapped to its 0.01° grid cell centroid via floor(coord * 100) / 100 + 0.005 (cell centers are at .005 offsets per the EIF grid topology), then joined to EIF data by the snapped coordinates as a character key. Equivalent result, much faster than st_join.
In [1]:
Show / hide code
library(sf)
Linking to GEOS 3.12.0, GDAL 3.11.0, PROJ 9.2.1; sf_use_s2() is TRUE
Show / hide code
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
One mit/app contributes to exactly one parcel — counted by assigned_parcel_index from the alt resolved outputs. Records with assignment_method == "unassigned" (no parcel within reach) are excluded. Parcels with no matched mits/apps get a zero count.
In [5]:
Show / hide code
mit_counts <- mits_resolved |>st_drop_geometry() |>filter(!is.na(assigned_parcel_index)) |>count(assigned_parcel_index, name ="mit_count") |>rename(parcel_index = assigned_parcel_index)app_counts <- apps_resolved |>st_drop_geometry() |>filter(!is.na(assigned_parcel_index)) |>count(assigned_parcel_index, name ="app_count") |>rename(parcel_index = assigned_parcel_index)parcels <- parcels |>left_join(mit_counts, by ="parcel_index") |>left_join(app_counts, by ="parcel_index") |>mutate(mit_count =replace_na(mit_count, 0L),app_count =replace_na(app_count, 0L) )cat("Parcels with at least one assigned mit:", sum(parcels$mit_count >0), "\n")
Parcels with at least one assigned mit: 6524
Show / hide code
cat("Parcels with at least one assigned app:", sum(parcels$app_count >0), "\n")
Parcels with at least one assigned app: 9146
Show / hide code
cat("Parcels with both: ",sum(parcels$mit_count >0& parcels$app_count >0), "\n")
For each recipient parcel, find the K nearest donor parcels by centroid distance, then average their TOTAL.VALUE.CALCULATED. Three values of K (5, 10, 50) computed in parallel.
The Census Bureau Gridded EIF provides privacy-protected counts on a fixed 0.01° grid (~1.2 km² per cell). Four files across two years are joined to each mit/app point:
File
Cross-tab
Prefix
Years
gridded_eif_pop_ageracesex_{year}.parquet
population × age × race × sex
ars_{year}_
1999, 2020
gridded_eif_pop_raceincome_{year}.parquet
population × race × income decile
ri_{year}_
1999, 2020
Cell-membership join via floor(coord * 100) / 100 + 0.005 — equivalent to spatial intersection but faster.
9.1 Download EIF files (one-time)
In [12]:
Show / hide code
for (name innames(eif_files)) { f <- eif_files[[name]]if (!file.exists(f$path)) {message("Downloading ", name, " -> ", basename(f$path))tryCatch(download.file(f$url, f$path, mode ="wb"),error =function(e) {stop("EIF download failed for ", name, ": ", conditionMessage(e),"\nIf HPC egress is restricted, download manually via wget ","from a login node and place at ", f$path) } ) } else {message("Already present: ", basename(f$path)) }}
The topology file is the authoritative cell registry — it defines all valid land cells on the 0.01° grid, including cells with zero population (absent from the parquets). Joining mits/apps to the topology lets us distinguish why a record has NA EIF columns:
9.5 Snap unmatched records to nearest populated EIF cell (per year)
Records that land on a valid topology cell but have no population data in a given year’s parquet (zero/DP-suppressed) are snapped to the nearest cell that does have population data for that year. Year-specific keys are stored in separate columns (eif_lat_{year}, eif_lon_{year}) so each parquet join uses the most appropriate cell. The base eif_lat/eif_lon columns (original snapped position) are preserved for reference.
In [16]:
Show / hide code
# Cells actually referenced by any mit or app — typically a small fraction# of all NC cells, so filtering early massively reduces working-set size.needed_cells <-bind_rows( mits_resolved |>st_drop_geometry() |>select(eif_lat, eif_lon), apps_resolved |>st_drop_geometry() |>select(eif_lat, eif_lon)) |>distinct()cat("Unique EIF cells needed:", nrow(needed_cells), "\n")
Unique EIF cells needed: 2866
Show / hide code
# ---------------------------------------------------------------------------# Schema-adaptive helper: identify categorical (key) vs numeric (value) cols.# EIF coord columns and any already-derived eif_* keys are excluded.# ---------------------------------------------------------------------------split_cols <-function(df) { exclude_pat <-"^(eif_lat|eif_lon|grid_lat|grid_lon)" candidate <-names(df)[!grepl(exclude_pat, names(df))] val_pat <-"^n_|count|value|pop|hu_|estimate|moe|_noise|_postprocessed" val_cols <- candidate[grepl(val_pat, candidate, ignore.case =TRUE) &sapply(df[candidate], is.numeric)] key_cols <-setdiff(candidate, val_cols)list(key = key_cols, val = val_cols)}# Pivot long -> one summary row per cell; prefix all value columns.aggregate_eif_wide <-function(d, prefix) { cols <-split_cols(d)# Guard: if no value columns recognized, warn loudly and return keys-only.# This usually means split_cols's regex didn't match the parquet column# naming. Inspect schema and update val_pat in split_cols.if (length(cols$val) ==0) {warning(sprintf("aggregate_eif_wide('%s'): no value columns identified. ", prefix ), "Available columns: ", paste(names(d), collapse =", "))return( d |> dplyr::select(eif_lat, eif_lon) |> dplyr::distinct() ) }if (length(cols$key) ==0) {# No categorical keys to pivot over — already 1 row per cell.# Just prefix the value columns. d |> dplyr::select(eif_lat, eif_lon, dplyr::all_of(cols$val)) |> dplyr::rename_with(~paste0(prefix, "_", .x), dplyr::all_of(cols$val) ) } else { d |> tidyr::pivot_wider(id_cols =c(eif_lat, eif_lon),names_from = dplyr::all_of(cols$key),values_from = dplyr::all_of(cols$val),names_sep ="_",values_fn = sum ) |> dplyr::rename_with(~paste0(prefix, "_", .x),-c(eif_lat, eif_lon) ) }}# ---------------------------------------------------------------------------# Process each EIF file:# Arrow bbox pushdown -> collect NC rows -> filter to needed cells -># aggregate to one row per cell -> release raw data.# Sequential to keep peak memory bounded.# ---------------------------------------------------------------------------process_one_eif <-function(nm, needed_cells) { f <- eif_files[[nm]] prefix <- f$prefixcat("\nProcessing", nm, "(prefix:", prefix, ")... ") raw <- arrow::open_dataset(f$path) |> dplyr::filter(as.numeric(grid_lat) >= NC_BBOX$lat_min,as.numeric(grid_lat) <= NC_BBOX$lat_max,as.numeric(grid_lon) >= NC_BBOX$lon_min,as.numeric(grid_lon) <= NC_BBOX$lon_max ) |> dplyr::collect() |> dplyr::mutate(eif_lat =round(as.numeric(grid_lat), 3),eif_lon =round(as.numeric(grid_lon), 3) ) |> dplyr::select(-grid_lat, -grid_lon)cat("NC rows:", nrow(raw), "| ") raw_needed <- raw |>inner_join(needed_cells, by =c("eif_lat", "eif_lon"))cat("needed-cell rows:", nrow(raw_needed), "| ")# Sanity check: rows/cell BEFORE aggregation — confirms long vs wide format rpc <- raw_needed |>count(eif_lat, eif_lon) |>pull(n) |>median()cat("median rows/cell:", rpc, "| ") summary_tbl <-aggregate_eif_wide(raw_needed, prefix)cat("summary rows:", nrow(summary_tbl), "\n")rm(raw, raw_needed); gc() summary_tbl}# Verify Arrow can push down the numeric-cast bbox filter before running all files.# If this takes > 30s or returns millions of rows, the filter is not pushing down —# in that case switch process_one_eif to collect() first, then filter().local({ test_q <- arrow::open_dataset(eif_files[[1]]$path) |> dplyr::filter(as.numeric(grid_lat) >= NC_BBOX$lat_min,as.numeric(grid_lat) <= NC_BBOX$lat_max,as.numeric(grid_lon) >= NC_BBOX$lon_min,as.numeric(grid_lon) <= NC_BBOX$lon_max ) elapsed <-system.time({ n <-nrow(dplyr::collect(test_q)) })cat(sprintf("Arrow filter test: %d rows in %.1fs\n", n, elapsed["elapsed"]))if (elapsed["elapsed"] >30)warning("Arrow pushdown may not be working — consider collect()-first fallback")})
Apps nearest-neighbor snapping:
ageracesex_1999 -> eif_lat_ars_1999 / eif_lon_ars_1999 : Snapped 940 unmatched records to nearest populated cell
ageracesex_2020 -> eif_lat_ars_2020 / eif_lon_ars_2020 : Snapped 727 unmatched records to nearest populated cell
raceincome_1999 -> eif_lat_ri_1999 / eif_lon_ri_1999 : Snapped 940 unmatched records to nearest populated cell
raceincome_2020 -> eif_lat_ri_2020 / eif_lon_ri_2020 : Snapped 727 unmatched records to nearest populated cell
9.6 Join all EIF datasets onto mits/apps
In [18]:
Show / hide code
# Drop geometry before joining — sf objects with hundreds of columns are# extremely memory-expensive on every left_join. Re-attach at the end.join_eif_all <-function(sf_obj) { geom <- sf::st_geometry(sf_obj) result <- sf::st_drop_geometry(sf_obj)for (nm innames(eif_files)) { yr <- eif_files[[nm]]$prefix lat_col <-paste0("eif_lat_", yr) lon_col <-paste0("eif_lon_", yr)# Columns are already prefixed by aggregate_eif_wide (e.g. ars_1999_*).# Only rename the cell-key columns to match the year-specific join keys. result <- result |> dplyr::left_join( eif_nc_tables[[nm]] |> dplyr::rename(!!lat_col := eif_lat, !!lon_col := eif_lon),by =c(lat_col, lon_col) ) } sf::st_sf(result, geometry = geom)}mits_with_eif <-join_eif_all(mits_resolved)gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 70592203 3770.1 137088171 7321.3 137088171 7321.3
Vcells 1197085623 9133.1 2077778244 15852.2 2048648489 15630.0
Show / hide code
apps_with_eif <-join_eif_all(apps_resolved)gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 70514512 3765.9 137088171 7321.3 137088171 7321.3
Vcells 1209560826 9228.3 2077778244 15852.2 2048648489 15630.0
Median original TOTAL.VALUE.CALCULATED (recipients only)
8,500.0
Median prop_value_5
64,324.6
Median prop_value_10
69,036.3
Median prop_value_50
78,450.0
Mits with EIF cell match — ageracesex_1999
8,801.0
Mits with EIF cell match — ageracesex_2020
8,801.0
Mits with EIF cell match — raceincome_1999
8,801.0
Mits with EIF cell match — raceincome_2020
8,801.0
In [21]:
Show / hide code
to_interpolate |>st_drop_geometry() |>select(parcel_index, original =`TOTAL.VALUE.CALCULATED`, prop_value_5, prop_value_10, prop_value_50) |>pivot_longer(-parcel_index, names_to ="metric", values_to ="value") |>group_by(metric) |>summarise(n =sum(!is.na(value)),p10 =quantile(value, 0.10, na.rm =TRUE),median =median(value, na.rm =TRUE),mean =mean(value, na.rm =TRUE),p90 =quantile(value, 0.90, na.rm =TRUE),.groups ="drop" ) |> knitr::kable(format.args =list(big.mark =",", scientific =FALSE),caption ="Distribution of original vs. interpolated values for recipient parcels" )
Distribution of original vs. interpolated values for recipient parcels
metric
n
p10
median
mean
p90
original
5,446
1,500.00
8,500.0
299,817.30
41,745.0
prop_value_10
5,456
38,967.25
69,036.3
103,999.81
176,157.7
prop_value_5
5,456
38,507.50
64,324.6
97,962.25
168,859.1
prop_value_50
5,456
45,409.30
78,450.0
102,299.25
161,201.5
Output schema
interpolated_vals.gpkg / interpolated_parcels_final.csv — one row per parcel:
Column
Source
Notes
(all original parcel columns)
parcels_pri.gpkg
unchanged
mit_count
derived
per-parcel count of mits assigned via alt Step 5
app_count
derived
per-parcel count of apps assigned via alt Step 5
gov_owned
derived
1 if owner name matches gov regex (excluding INC/LLC)
interpolate
derived
1 if parcel was a recipient (had value estimated)
prop_value_5
derived
mean of 5 nearest donors; original if not interpolated
prop_value_10
derived
mean of 10 nearest donors; original if not interpolated
prop_value_50
derived
mean of 50 nearest donors; original if not interpolated
mits_with_eif.gpkg / apps_with_eif.gpkg — one row per mit/app:
Column
Source
Notes
(all original alt-resolved columns)
alt Step 5 outputs
unchanged
eif_lat, eif_lon
derived
original snapped 0.01° cell-center (base position)
valid_cell
topology join
TRUE = valid land cell per EIF topology; FALSE = water/outside-US or geocoding error
eif_lat_{prefix}, eif_lon_{prefix}
nearest-neighbor fill
year-specific cell keys; equal to base keys for matched cells, snapped to nearest populated cell otherwise (one pair per EIF dataset, e.g. eif_lat_ars_1999)
ars_1999_*, ars_2020_* columns
EIF pop_ageracesex (1999 + 2020)
population × age × race × sex
ri_1999_*, ri_2020_* columns
EIF pop_raceincome (1999 + 2020)
population × race × income decile
Section 10: Analysis-ready summary outputs
In [22]:
Show / hide code
library(stringr)# Verify actual suffix tokens before aggregatingfor (yr inc(1999, 2020)) {cat("\n=== ars_", yr, " sample postprocessed cols ===\n", sep ="")print(head(grep(paste0("^ars_", yr, "_n_noise_postprocessed_"),names(mits_with_eif), value =TRUE), 10))cat("\n=== ri_", yr, " sample postprocessed cols ===\n", sep ="")print(head(grep(paste0("^ri_", yr, "_n_noise_postprocessed_"),names(mits_with_eif), value =TRUE), 10))}
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1059 0.2147 0.2305 0.3367 1.0000
Show / hide code
# Sanity: pop_total should be non-zero for all records (post-snap)cat("\npop_total_1999 zero or NA count:", sum(sf::st_drop_geometry(mits_summary)$pop_total_1999 ==0|is.na(sf::st_drop_geometry(mits_summary)$pop_total_1999)), "\n")
pop_total_1999 zero or NA count: 0
Show / hide code
cat("pop_total_2020 zero or NA count:", sum(sf::st_drop_geometry(mits_summary)$pop_total_2020 ==0|is.na(sf::st_drop_geometry(mits_summary)$pop_total_2020)), "\n")