Codementor Events

Reverse Geocoding

Published Sep 04, 2017

I was recently working on a location based analysis with a client where we had a few records with invalid zipcodes. We did, however, have Latitude and Longitude for each of those records. This got me thinking about a reverse geocoding process so we could update those bad zips.

Two approaches came to mind. One using a KDTree to search a known dataset of coordinates and zipcodes for the closest match. Approach two will be a GeoPandas and Shapely combo to do some spatial operations.

Both approaches will leverage the latest US Census Zip Code Tabulation Areas data

Let’s start with a single point to search for

import numpy as np
omaha_point = np.array((-95.995102, 41.257160))

Using a KDTree approach with a Pandas DataFrame

import pandas as pd
from sklearn.neighbors import KDTree

url = ‘http://www2.census.gov/geo/docs/maps-data/data/gazetteer/2016_Gazetteer/2016_Gaz_zcta_national.zip'df_locations = pd.read_csv(url, dtype={‘GEOID’ : ‘str’},sep=’\t’)
#some column cleanupdf_locations.columns = df_locations.columns.str.strip()
print (len(df_locations))print(df_locations.head())

33144
GEOID INTPTLAT INTPTLONG
00601 18.180555 -66.749961
00602 18.361945 -67.175597 
00603 18.455183 -67.119887 
00606 18.158345 -66.932911 
00610 18.295366 -67.12513

I’m making the assumption that the coordinates represent the centroids of their respective zips in this dataset.

Now we construct our simple KDTree using scikit-learn’s implementation. Taking the defaults here. Look here for more information and examples.

kdt = KDTree(df_locations[[‘INTPTLONG’, ‘INTPTLAT’]])

And that’s it, our tree is built and ready to use. You must query the tree with a two-dimensional array so we’ll address that first with numpy’s expand_dims. We then execute the query with k=1 because I just want the closest zip centroid to my search point of Omaha. I use the query result to index into our DataFrame and retrieve the valid zipcode of 68132.

omaha_point_kdt = np.expand_dims(omaha_point, axis=0)

nearest_point_index = kdt.query(omaha_point_kdt, k=1, return_distance=False)
print(df_locations.loc[nearest_point_index[0], ‘GEOID’])

23609 68132
Name: GEOID, dtype: object

Onto to the GeoPandas and Shapely approach. Here we use a pretty big shp file from the Census as our reference dataset. This file contains the same zips from our first approach, but it has polygons that represent the boundaries of the zips instead of just centroids.

import geopandas as gpd
from shapely.geometry import Point

gdf_locations = gpd.read_file(‘data/tl_2016_us_zcta510/tl_2016_us_zcta510.shp’)print(len(gdf_locations))
print(gdf_locations[[‘GEOID10’, ‘geometry’]].head())

33144

GEOID10 geometry
43451 POLYGON ((-83.674464 41.331119, -83.6744449999...
43452 POLYGON ((-83.067745 41.537718, -83.067729 41....
43456 (POLYGON ((-82.8566 41.681222, -82.856831 41.6...
43457 POLYGON ((-83.467474 41.268186, -83.4676039999...
43458 POLYGON ((-83.222292 41.531025, -83.2222819999...

This shp file took 2.5 minutes to read in on my machine. Quite a few of the zips are Multi-Polygons which I think contributed to the lengthy SHP file-to-GeoDataFrame conversion time.

Next we reverse geocode, but first we need to convert our omaha_point to a Shapely Point. Then we leverage one of my favorite things about GeoPandas which is the ability to combine boolean indexing with Shapely’s spatial operations. So our 68132 zip polygon contains our omaha_point. Pretty cool I’d say.

omaha_point_shp = Point(omaha_point)

filter = gdf_locations[‘geometry’].contains(omaha_point_shp)
print(gdf_locations.loc[filter, ‘GEOID10’])

24842 68132
Name: GEOID10, dtype: object

Two different approaches yielding the same result (68132). Let’s look at some timings.

KDTree approach

%%timeit
nearest_point_index = kdt.query(omaha_point_kdt, k=1, return_distance=False)
df_locations.loc[nearest_point_index[0], ‘GEOID’]

494 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

GeoPandas/Shapely approach

%%timeit
filter = gdf_locations[‘geometry’].contains(omaha_point_shp)
gdf_locations.loc[filter, ‘GEOID10’]

243 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The KDTree solution is considerably faster in this case. I do prefer the GeoPandas and Shapely solution though because it’s more exact. I could see cases where a Lat Long could be closer to a zips centroid, but is within another zip’s boundary. At the end of the day, I think both have some merit.

Discover and read more posts from Bob Haffner
get started
post commentsBe the first to share your opinion
Show more replies