Geocoding: The Basics - TRAIN


Geocoding: The Basics
Karyn Backus
CT Dept of Public Health
November 2012
ArcGIS 10

This guide is NOT an introduction to GIS or to geocoding. To use this guide, you are expected to be familiar with ArcGIS software and general cartography concepts. If you are not familiar with the software, I recommend that you read through ArcGIS’ extensive Web-based Help. Some of the concepts can be initially confusing (e.g., coordinate systems and projections), and it is important that you clearly understand them before progressing with your projects; otherwise your geocode results may be invalid (although you might not even realize it).

As of November 29, 2012:
The custom DPH address locator is available on the U drive within the TA Streets.gdb (geodatabase) folder.
U:\SHAREDOC\A DPH GIS User Group\Geocoding\Address Locator\TA Streets.gdb
This is an ESRI / ArcGIS specific file type, and details about its contents can be viewed through ArcCatalog or ArcMap.
Please do not edit or move the .gdb folder or its component files.

TABLE OF CONTENTS
Introduction
File Preparation
   Raw Address Data
   Address Locators
Geocoding an Address Dataset
Rematching Addresses
Joining Geocoded Points to Towns and Census Levels
Summarizing Joined Data
Adding Latitude and Longitude Coordinates

INTRODUCTION

“Geocoding is the process of assigning a location, usually in the form of coordinate values, to an address by comparing the descriptive location elements in the address to those present in the reference material” (excerpt from ESRI Help). The key to this sentence is that there are three components to geocoding. The first is the set of addresses that you want to locate on a map. In this guide, these addresses are referred to as raw because they have yet to be geoprocessed. The second component is the reference material. Currently, Tele Atlas is the company that supplies the street reference database for Connecticut’s street network. Third, ArcGIS uses an address locator utility to define how the raw address is matched to the street reference database so that a location/coordinate can be assigned. When all three of these elements come together, data can be geocoded.

In the first chapter of the guide, the process of preparing your data for geocoding is laid out. The file preparation stage is the most complicated and time-consuming portion of the geocoding process. This chapter discusses the need to review and (re)format the variables within your three components so they are consistent with one another prior to geocoding. Since the street reference database is supplied by an external source (Tele Atlas), it is preferred that the raw address data be adjusted to the format of the reference data rather than vice versa. After your raw data is properly prepared, this guide explains what the Address Locator is and how it works.

The second chapter is a step-by-step guide to the process of geocoding in ArcMap. At this stage, the geocoding utility is opened and your inputs are set in accordance with your file preparations. Once the inputs are defined, ArcMap processes the raw addresses and displays the resulting geocoded points as a new layer in your map. You may then review and edit the results as needed.

The third chapter covers joining and summarizing geocoded data. After your addresses become x/y coordinate points on the map, it is usually preferred to summarize the data into a meaningful presentation. By joining the individual address points to a specific geography, such as a census tract or a flood plain, the points can be summarized (e.g., counts or averages) and displayed using the new geography.

The last chapter explains how to add latitude and longitude coordinates to your results. The x,y coordinates of your geocoded points are based on a projected coordinate system (a flat, two-dimensional surface, just like a paper map) specific to Connecticut. A geographic coordinate system (GCS) uses a three-dimensional spherical surface (like a globe) to define locations on the earth and is referenced by its longitude and latitude values. When sharing data with others, particularly outside of CT, it is often preferred to provide geocoded data using the more universal system of latitude and longitude. Adding latitude and longitude coordinates to your data table requires re-projecting your geocoded points to a GCS using the Project tool in ArcToolbox and then adding XY coordinates using the Add XY Coordinates tool in ArcToolbox.

FILE PREPARATION

The process of geocoding a dataset requires three major elements:
1) A street reference database that contains street segment information for the region of interest.
   - This is the Tele Atlas street centerline file already prepared for your use.
2) An address file that contains the address information that you want to represent spatially.
   - This is the raw address data you want to convert to points on a map.
3) An address locator that standardizes the information in the address file and compares it to the information in the street reference database using pattern matching algorithms.
   - An address locator has been prepared for use throughout the agency.

Preparing the raw addresses for geocoding

Basically, geocoding is like taking a pin and tacking it to where it belongs on a map. For our purposes, the pin (a single point) is a residential address and the map is a network of all of the streets in Connecticut.

Where the pin is tacked to the map is determined by referencing a database of all of the streets in Connecticut. This database does not contain geographic information for every single house or building in Connecticut. Instead, it contains a series of street segments that have already been mapped to geographic coordinates.

Each segment is defined using an address range (i.e., a range of house numbers). Perhaps Main Street in your town runs from 1 to 99. In the reference database, Main Street may be represented with 3 entries: 1-33 Main St, 34-60 Main St, and 61-99 Main St. Geocoding takes an address (e.g., 14 Main St), matches it to a street and specific segment (1-33 Main St), and then interpolates the position of the address within the range along the segment. Interpolation is an estimate. Since 14 is a little less than halfway between 1 and 33, the address would (probably) be placed a little less than halfway along the street segment. Although the interpolation is an estimate, the address is mapped to that specific geographic coordinate.

Thus, the process of geocoding is based on matching. An address has several elements that must be consistent with a reference address in order for a match to occur.

House Number  Pre-directional  Street Name  Street Type  Post-directional  Town         State  Zipcode
1572                           Rhey         Ave                            Wallingford  CT     06492
259           N                Main         St                             Mystic       CT     06355
17            S                Main         St           Ext               Hartford     CT     06105
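The interpolation of a house number along a street segment, described earlier in this section, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the address locator's actual algorithm: the function names, the straight-line segment, and the segment coordinates are all hypothetical, and a real locator also handles separate left and right ranges and offsets.

```python
def interpolate_fraction(house_number, range_low, range_high):
    """Return how far along a segment (0.0 to 1.0) a house number falls."""
    if not range_low <= house_number <= range_high:
        raise ValueError("house number is outside the segment's address range")
    if range_high == range_low:
        return 0.5  # assumption: a one-number range is placed mid-segment
    # float() keeps the division fractional on older Python versions too
    return float(house_number - range_low) / (range_high - range_low)


def interpolate_point(house_number, range_low, range_high, start_xy, end_xy):
    """Estimate an (x, y) position on a straight segment from start_xy to end_xy."""
    t = interpolate_fraction(house_number, range_low, range_high)
    x = start_xy[0] + t * (end_xy[0] - start_xy[0])
    y = start_xy[1] + t * (end_xy[1] - start_xy[1])
    return (x, y)


# 14 Main St on the 1-33 Main St segment; the segment endpoints are made up.
fraction = interpolate_fraction(14, 1, 33)
point = interpolate_point(14, 1, 33, (0.0, 0.0), (320.0, 0.0))
print("fraction along segment: %.3f" % fraction)
```

The fraction for house number 14 is (14 - 1) / (33 - 1), which is a little less than one half, in line with the description of where the point would probably be placed.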

To achieve consistency, changes to the format or structure of the address data may be required. This is often referred to as cleaning or reformatting the data.

1) Variable Structure: address data fields must be in the proper format
- Street information should be provided as a city-style address
  o House Number, Prefix or Pre-Directional, Street Name, Street Type, Suffix or Post-Directional
  o Secondary address information (apartment, unit, building, etc.) is not a part of the street network, so it is not useful for geocoding. Sometimes, secondary information may interfere with the matching process, so it is best to leave it out or delete it when possible.
  o PO boxes, rural routes, and place names (like apartment complexes or nursing homes) are not city-style addresses and cannot be geocoded as is.
- Zones used for geocoding the address should be in separate fields
  o Zones are determined by what is available in the street reference database.
  o Our street reference database contains Town Name, Town Code and Zip (postal) Code.
  o State is not utilized as a zone because we are geocoding CT resident addresses. Streets that surround the state borders may be included in the reference dataset.
- Street and Zone are the only fields necessary to geocode. It is preferable, however, to keep the identification number as well so that, in the future, the geocode results may be linked with other non-address variables.
  o It is possible to keep all of the dataset’s variables in your address dataset, but it can be more efficient to keep the geocode dataset separate from the primary dataset. The geocoded file contains many fields that will not be needed in the final dataset.
  o Keeping the geocode dataset separate also ensures that the primary dataset is not compromised during the geocoding process, since the dbf format used by ArcGIS can sometimes truncate variable names and field lengths and can change numbers stored as text back to numeric formats.

2) Pattern Matching
- The geocoding process with ArcGIS uses pattern matching to determine if your address is the same as an address in the street reference database. Because pattern matching is used, it is possible to match addresses that are not exactly identical. This is very useful when there are minor differences, such as a transposed letter in the street name.
- The pattern matching system can also interfere with successful geocoding when unexpected values are detected.
  o Numbers outside or possibly between the street ranges
  o Dashes and unusual characters
  o Missing street elements, like no house number or no street type

  o Incorrectly placed street elements: if an apartment number is placed at the beginning of the city-style street address, the pattern matching system may incorrectly interpret the apartment information to be the street information (24 A 23 Main St may become 24 A St).
- PO Boxes and Rural Routes are not city-style addresses. They do not map to a location on the street map and so they cannot be geocoded.
  o Occasionally, the pattern matching software may “think” it has matched a PO Box or a Rural Route, but this is in error. To prevent such instances, it is recommended that non city-style addresses be removed from the address file prior to geocoding.

3) Variable Types/Consistency:
- Confirm that the raw address variable formats are consistent with the street reference file variable formats.
  o Some variables are defined as text (aka character or string) and some as numeric, even when they all appear to be numbers.
  o The ArcGIS geocoding system may not recognize two variables as having the same value if one variable is numeric and the other is character.
- ArcGIS will input a variety of data file formats; however, text and numeric format definitions may not be retained by all data file formats. This is particularly relevant for CT Zipcodes, where the preceding zero may be dropped when read in as a numeric variable.
- It is recommended that all data be saved as .dbf for input into ArcGIS.
  o Dbf is the format that ArcGIS uses for its system files and output files. By using the same file format every time, conversion errors are minimized.
  o You may have to check and recheck that the .dbf file is properly saving and properly loading the variables you have defined. It may take some work to get the variables to save and load in the proper format.
  o Be aware that dbf may truncate variable names or field lengths.
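Two of the cleaning steps above lend themselves to a short script: restoring the preceding zero on CT Zipcodes that were read in as numbers, and setting aside addresses that are not city-style before geocoding. The following Python sketch is one possible way to do this outside ArcGIS; the function names and the list of PO Box and Rural Route spellings are assumptions for illustration and would need to be adapted to the spellings that actually occur in your raw data.

```python
import re

# Hypothetical patterns for addresses that are not city-style; extend as needed.
NON_CITY_STYLE = re.compile(
    r"^\s*(P\.?\s*O\.?\s*BOX|POST\s+OFFICE\s+BOX|BOX\s+\d+|RR\s*\d+|RURAL\s+(ROUTE|RTE)\b)",
    re.IGNORECASE,
)


def clean_zip(value):
    """Return a 5-character ZIP string, restoring zeros lost to numeric storage."""
    text = str(value).strip()
    if text.endswith(".0"):       # e.g. 6492.0 read from a spreadsheet
        text = text[:-2]
    text = text.split("-")[0]     # keep the 5-digit part of a ZIP+4
    if not text.isdigit() or len(text) > 5:
        return ""                 # leave unusable values blank for manual review
    return text.zfill(5)


def is_city_style(street):
    """Return False for PO boxes, rural routes, or text with no leading house number."""
    if NON_CITY_STYLE.match(street):
        return False
    return bool(re.match(r"^\s*\d+", street))


for raw_zip in (6492, "06355", 6105.0):
    print(clean_zip(raw_zip))
for raw_street in ("14 Main St", "PO Box 12", "RR 2 Box 15"):
    print("%s -> %s" % (raw_street, is_city_style(raw_street)))
```

Records that fail `is_city_style` are better reviewed by hand than deleted automatically, since some may only need a house number moved to the front of the field.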

Understanding the Address Locator Structure

A custom address locator has been prepared for DPH by Karyn Backus. This locator is based on a customized dual streets locator style that was prepared for Karyn by ESRI. This address locator is called a "local address locator" because it resides on our agency network. Do not use the default, online locators provided by ArcGIS 10 with confidential addresses or related information.

Since the custom DPH address locator has already been developed, this section will walk through the key parameters that influence the locator but will not demonstrate creating one.

address locator
1. [ESRI software] A dataset in ArcGIS that stores the address attributes, associated indexes, and rules that define the process for translating nonspatial descriptions of places, such as street addresses, into spatial data that can be displayed as features on a map. An address locator contains a snapshot of the reference data used for geocoding, and parameters for standardizing addresses, searching for match locations, and creating output. Address locator files have a .loc file extension.

address locator style
1. [geocoding] A template on which an address locator is built. Each template is designed to accommodate a specific format of address and reference data, and geocoding parameters. The address locator style template file has a .lot file extension.

The US Address—Dual Ranges address locator style is used for the majority of common United States street addresses. This address locator style permits you to provide a range of house number values for both sides of a street segment. With this, the address locator can not only deliver a location along the street segment but also determine the side of the street segment where the address is located.

address range
1. [geocoding] Street numbers running from lowest to highest along a street or street segment. Address ranges are generally stored as fields in the attribute table of a street data layer. They often indicate ranges on the left and right sides of streets.

Each feature in the reference data represents a street segment with two ranges of addresses that fall along that street segment, one for each side of the street.

Figure: Each street segment in the reference dataset is depicted with a separate color for illustration purposes. Green background represents Cheshire town. Grey background represents Wallingford town.

address element
1. [geocoding] One of the components that comprise an address. House numbers, street names, street types, and street directions are examples of address elements.

Each of the US street address locator styles has the same requirements for input address data. Tables of addresses that can be geocoded using these address locators must have an address field containing the street number and street name in addition to the street's prefix direction, prefix type, street type, or suffix direction, if any.

Basic characteristics of address locator styles provided with ArcGIS

US Address—Dual Ranges
  Typical reference dataset geometry: Lines
  Reference dataset representation: Address range for both sides of street segment
  Typical parameters: All address elements in a single field
  Examples: 320 Madison St.; N2W1700 County Rd.; 105-30 Union St.
  Applications: Finding a house on a specific side of the street

US Address—One Range
  Typical reference dataset geometry: Lines
  Reference dataset representation: One range for each street segment
  Typical parameters: All address elements in a single field
  Examples: 2 Summit Rd.; N5200 County Rd PP; 115-19 Post St.
  Applications: Finding a house on a street where side is not needed

US Address—Single House
  Typical reference dataset geometry: Points or polygons
  Reference dataset representation: Each feature represents an address
  Typical parameters: All address elements in a single field
  Examples: 71 Cherry Ln.; W1700 Rock Rd.; 38-76 Carson Rd.
  Applications: Finding parcels, buildings, or address points

US Address—ZIP 5 Digit
  Typical reference dataset geometry: Points or polygons
  Reference dataset representation: ZIP Code region or centroid
  Typical parameters: Five-digit ZIP Code
  Examples: 22066
  Applications: Finding a specific ZIP Code location

General—City State Country
  Typical reference dataset geometry: Points or polygons
  Reference dataset representation: City within a state and country
  Typical parameters: City name, state name or abbreviation
  Examples: Rice, WA, USA
  Applications: Finding a specific city in a State and Country

zone
1. [geocoding] Additional information about a location or address, used to narrow a geocoding search and increase search speed. Address elements and their related locations such as city, postal code, or country all can act as a zone.

Many times, additional fields are found on the reference data that further clarify the location of the attribute, including postal codes, states, or countries. This type of information is referred to as zone information and can be used to increase the likelihood of a correct match. Although zone fields are optional when creating an address locator, including zones such as City, State, and ZIP fields is helpful to facilitate nationwide geocoding.

The custom DPH address locator uses two zones to geocode addresses: town and zip.
- Town zone is defined as the official 169 Connecticut towns.
- Zipcode zone is the postal code value that was provided by Tele Atlas for each segment.

alternate name (See Also: alias)
1. [geocoding] A name for an address element, usually a street name, that is different from the official or most common name. For example, a highway number might be an alternate name for a street name.

Using alternate street names allows you to match an address to a feature using one of many names for the feature. The alternate names are provided in the Tele Atlas street segment dataset. Some segments may have as many as 5 alternate names. Each of these is listed in the alternate names table.

The custom DPH address locator has the alternate names option included.

For example: If Bridge Street is also known as Slash Road, you can find the same location using 266 Bridge Street as you can using 266 Slash Road.

place-name alias
2. [geocoding] The formal or common name of a location, such as the name of a school, hospital, or other landmark. For example, "Memorial Hospital" is the place name for the address "893 Memorial Drive." In geocoding, the address locator can be set to accommodate the use of place-name aliases in place of their addresses for matching.

A place-name alias is a common name of a location, such as the name of a school, hospital, or other landmark. For example, Memorial Hospital is the place name for the address 893 Memorial Drive. Searching for a location can be done either by the address or its place-name alias.

In a place-name alias table, each record represents one place-name and its associated address. When a place-name is entered as an input address, the address locator searches for the location based on the alias name's corresponding address.

Alias field
The place-name alias table must contain a field that stores the place-names. They are the names that will potentially be entered as the input address. For example, if the table contains a list of schools with their associated addresses, the field in the table that contains the actual school name is used as the Alias field. If the same address has multiple place-names, each name with the same corresponding address should be added to the table. If different addresses have the same place-name, additional zone information, such as City, State, or ZIP Code, should be provided in the table. For example, the table can have a record for Public Library with its address in Atlanta, GA, and another record for Public Library in Dallas, TX.

Address fields
Based on the address locator style you choose, the place-name alias table should contain the same set of address input fields used by the address locator. For example, if an address locator specifies Streets, City, State, and ZIP as the input fields for matching, the place-name alias table should have the same set of fields. These fields contain the actual addresses for the alias names.

The custom DPH address locator does NOT have the place name option included. However, the place name table can be added by the user when setting the final parameters for geocoding. See Karyn for more information.
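The way a place-name alias table stands in for an address can be pictured with a small lookup. The Python sketch below is only a conceptual illustration, not how ArcGIS stores or searches alias tables: the field names, the function names, and the two library street addresses are hypothetical, and only the Memorial Hospital record comes from the example above.

```python
# Hypothetical alias table: one record per place-name and its address.
ALIAS_TABLE = [
    {"alias": "Memorial Hospital", "street": "893 Memorial Drive", "city": ""},
    {"alias": "Public Library", "street": "100 Example St", "city": "Atlanta"},  # made-up address
    {"alias": "Public Library", "street": "200 Example Ave", "city": "Dallas"},  # made-up address
]


def build_alias_index(records):
    """Group alias records by upper-cased place-name for case-insensitive lookup."""
    index = {}
    for rec in records:
        index.setdefault(rec["alias"].strip().upper(), []).append(rec)
    return index


def resolve_alias(index, text, city=None):
    """Return the street address for a place-name, or the input unchanged."""
    matches = index.get(text.strip().upper(), [])
    if len(matches) > 1 and city:
        # The same place-name exists at several addresses: use the zone to choose.
        wanted = city.strip().upper()
        matches = [m for m in matches if m["city"].upper() == wanted]
    if len(matches) == 1:
        return matches[0]["street"]
    return text  # not an alias, or still ambiguous without zone information


index = build_alias_index(ALIAS_TABLE)
print(resolve_alias(index, "Memorial Hospital"))
print(resolve_alias(index, "Public Library", city="Dallas"))
```

An input that is already a street address passes through unchanged, and a place-name shared by several addresses resolves only when zone information is supplied, which mirrors the Public Library example.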

Understanding the Address Locator Parameters

address locator property
1. [geocoding] A parameter in an address locator that defines the process of geocoding.

These are the fields that are defined by the user for the current geocoding session. They should be edited for each geocode session by the individual user.

Street or Intersection:
The name of the field in your input dataset that contains the street elements.

City or Placename:
The name of the field in your input dataset that contains the town information. This can be either the Town Code or the Town Name.

ZIP Code:
The name of the field in your input dataset that contains the zip code. This field must have the preceding zeros: the address locator will not match 6450 to 06450.

Output shapefile:
Remember to set the location and name of your geocoded shapefile.

The geocoding options parameters for the custom DPH locator are defaulted to the values that result in the highest quality matches as determined by Karyn. These can be edited for each geocode session by the individual user.

Spelling Sensitivity: 80
- This setting controls how much variation the address locator allows when it searches for likely candidates in the reference data.
  o The spelling sensitivity setting for an address locator is a value between 0 and 100.
  o A low value for spelling sensitivity allows Universty or Universe to be treated as match candidates for University.
  o A higher value restricts candidates to exact matches. The spelling sensitivity does not affect the match score of each candidate; it only controls how many candidates the address locator considers.
  o The geocoding process takes longer when you use a lower setting because the address locator has to process and compute scores for more candidates.

Minimum Candidate Score: 20
- When an address locator searches for likely candidates in the reference data, it uses this threshold to filter the results presented. Locations that yield a score lower than this threshold are not presented.
  o The minimum candidate score for an address locator is a value between 0 and 100.
  o The minimum candidate score determines which candidates are presented in the Interactive Review and Find dialog boxes.

Minimum Match Score: 90
- The minimum match score setting lets you control how closely addresses have to match their most likely candidate in the reference data to be considered a match for the address.
  o The minimum match score for an address locator is a value between 0 and 100.
  o A perfect match yields a score of 100. An address below the minimum match score is considered to have no match.
  o When batch geocoding, the minimum match score must be met or exceeded to be considered a match.
  o If more than one match is found, the candidate with the highest match score is assigned.

Side Offset: 20 feet
- An adjustable value that dictates how far away from either the left or right side of a line feature an address location should be placed.
  o A side offset prevents a point feature from being placed directly over a line feature.
  o A side offset that is too low can make it difficult to accurately join the points to polygons when the point is located on or near a boundary line.
  o A side offset that is too high can result in a point being placed so far from the centerline that it lands in a neighboring polygon or parcel.

End Offset: 3 percent
- An adjustable value that dictates how far away from the end of a line an address location should be placed.
  o Using an end offset prevents the point from being placed directly over the intersection of cross streets if the address happens to fall at the beginning or end of the street.

Match If Candidates Tie: Unchecked
- This option allows the geocoder to automatically match the input address to a candidate address when more than one candidate shares the same highest match score.
  o The match to the potential candidates will be arbitrary.
  o Uncheck this option to prevent arbitrary matches when candidates are tied.
  o Candidates that tie based on address elements but have the same x/y coordinates will be automatically matched even when this option is unchecked.

Output the X and Y Coordinates: Checked
- This option is used to populate the geocoded data table with a field for X and a field for Y.
  o The coordinates that are output will be in the projection that the current map is in at the time of geocoding unless otherwise specified in the Advanced Geometry Options window before geocoding (see next section).

composite address locator
1. [geocoding] A locator that will cycle through several individual locators.

Addresses are matched using the style and settings of each locator. Since it is not possible to re-run the locator on just a portion of the input dataset, a composite locator allows the user to define a hierarchy of locators. A composite can include:
- More than one “style” (e.g., odd/even numbering, mixed numbering)
- More than one reference database (e.g., roof top, centerlines)
- More than one zone field (e.g., town name, town code, postal name)
- Different sets of parameters (e.g., change in spelling sensitivities)

The custom DPH locator is a composite that cycles through two locators based on the dual-streets style with alternate names: once using Town Code and Zipcode and once using Town Name and Zipcode. This gives users the flexibility to supply their input data with either town name or town code.
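To make the interaction of the score thresholds, the tie option, and the composite hierarchy concrete, the Python sketch below walks through the decision logic described in this section. It is a simplified illustration and not ESRI's implementation: the function names, the status labels ("matched", "tied", "unmatched"), the stub locators, and the coordinates are all assumptions, and ArcGIS's own status codes and tie handling may differ in detail.

```python
def classify(candidates, min_candidate_score=20, min_match_score=90,
             match_if_tie=False):
    """Return (status, candidate) for a list of (score, (x, y)) candidates."""
    # Candidates below the minimum candidate score are never presented.
    shown = sorted((c for c in candidates if c[0] >= min_candidate_score),
                   key=lambda c: c[0], reverse=True)
    if not shown or shown[0][0] < min_match_score:
        return ("unmatched", None)
    best_score = shown[0][0]
    tied = [c for c in shown if c[0] == best_score]
    # Ties that fall on the same x/y location are matched regardless of the option.
    distinct_locations = set(c[1] for c in tied)
    if len(distinct_locations) > 1 and not match_if_tie:
        return ("tied", None)
    return ("matched", tied[0])  # with match_if_tie, this choice is arbitrary


def geocode_with_fallback(address, locators, **options):
    """Try each locator in order and keep the first one that yields a match."""
    result = (None, "unmatched", None)
    for name, find_candidates in locators:
        status, candidate = classify(find_candidates(address), **options)
        if status == "matched":
            return (name, status, candidate)
        if status == "tied" and result[1] == "unmatched":
            result = (name, status, None)
    return result


def by_town_code(address):
    # Stub: pretend the town-code locator finds nothing for this record.
    return []


def by_town_name(address):
    # Stub candidates with made-up scores and coordinates.
    return [(100, (1012345.0, 678901.0)), (45, (1012400.0, 678950.0))]


locators = [("TownCode+Zip", by_town_code), ("TownName+Zip", by_town_name)]
result = geocode_with_fallback("14 Main St, Wallingford 06492", locators)
print(result[0], result[1])
```

In this sketch a record that carries a town name falls through the town-code locator and is matched by the second locator, which is the kind of flexibility the composite is meant to provide.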

GEOCODING AN ADDRESS DATASET

Whenever I start a new map, I always load the town layer first to set the projection to CT State Plane. Since we will be geocoding data, I also add the street centerline shapefile layer to the map so that the streets are viewable during interactive match.

Now add to the map the address tables that you want to geocode.
- Sample addresses is a .dbf file.
- Hospital addresses is an .xls file.
- Although ArcMap works with both formats, it prefers to use .dbf files.

- Right click on the address table you want to geocode and select Geocode Addresses.
- This will open the Address Locator selection window.
  o If the address locator you want to use is not listed, you may add one that has been previously created. Click “Add”, navigate to the intended address locator, and add it to the listing.
- Now select your locator of choice and click OK.

- Review the address locator properties as they pertain to this dataset.
  o The Address Table will be the file that you right clicked on to geocode. If the file has more than one table, like this .xls file, you will be able to choose which table (or worksheet) you want to geocode.
  o Review the address fields to be sure any auto-populated values are correct. If they are not auto-populated, use the drop downs to choose the appropriate fields. It is not required that you use all of the input fields. E.g., if you want to geocode on Street and Zip only, you can leave City and State as None.
  o Set the path and filename for your geocoded results, which will be output as a shapefile with the geocoded information included as part of the attribute table.
- Click on “Advanced Geometry Options” to set the spatial reference to our CT standard: NAD 1983 StatePlane Connecticut FIPS 0600 Feet.
- If you want to change any of your predefined address locator properties, you may do so in the “Geocoding Options” window. E.g., should you want to initially geocode your data with a Minimum Match Score of 95%, you can change that here.
- Click OK to geocode the addresses. A status window will pop up with the match results.

- Click on the “Rematch” button to open the interactive window to review your results.
OR
- If you want to review the results of a file that has been previously geocoded, right click on the geocoded file and select Data > Review/Rematch Addresses.

This is the Interactive Rematch window that will serve as your one-stop spot for reviewing and editing your address matches.

From ESRI Help:

Rematching with the Interactive Rematch dialog box—A typical workflow

Rematching a geocoded feature class can be done by using the Interactive Rematch dialog box in either ArcMap or ArcCatalog; however, working in ArcMap allows more options, such as viewing candidates and results on the map, rematching in an edit session, and reverse geocoding with the Pick Address from Map tool. A geocoded feature class is required to rematch an address.

When the Interactive Rematch dialog box is open in ArcMap, you can still interact with the map document. This capability allows you to use other tools to inspect candidates more thoroughly. For example, you can pan, zoom in and out, and use the Identify tool to ensure the address is placed in the correct area. You can also resize, minimize, and maximize the Interactive Rematch dialog box to make it easier to work with the rest of ArcMap.

The Interactive Rematch dialog box is shown below, and it is numbered to demonstrate a typical workflow. The steps are outlined below the graphic.

1. The Statistics panel shows how many of your original addresses are matched, tied, or unmatched.
2. The Geocoding results table shows the records from the geocoded feature class. It contains the original address data and attributes indicating the status, score and matched address. You choose the address you want to interactively rematch by clicking a record in the table or using the record selector on the lower left side of the panel.
3. The Address panel displays the address that serves as input for matching. You can edit the information in the text box to possibly find a better match.
4. The candidates discovered for the address you selected in step 2 and modified in step 3 are displayed on the Candidates panel. You can examine the list of candidates and choose the one that you think matches your original address the best.
5. The Candidate details panel shows you the same attributes as the Candidates list but displays only one record at a time so it is easier to read.
6. Click the Match button to rematch the address you selected in step 2 to the candidate you chose in step 4. The output attributes (Status, Score, Match type, Side, and Match addr) are updated for the selected record in the Geocoding results table.
7. Select another record and repeat steps 3 to 6.

REMATCHING ADDRESSES

Rematching a geocoded feature class in ArcMap allows you to interact with the map while rematching the addresses.
- Click the address in the Geocoding results table that you want to rematch.
- Edit the input address, if necessary, in the Address text box or boxes.
- Click the Search button to search for candidates or refresh the list of candidates. The candidates are highlighted on the map.
- Click Zoom to Candidates to zoom to the set of candidates for the address.
- Click the candidate in the Candidates list that you want to match the address to. The candidate that you choose is highlighted on the map in yellow; the others are in blue.
- Click Match.
- Click the next record in the Geocoding result table or click the left or right arrow of the Record selec