Right on target
Editor's note: Ashley H. Hyon is AVP social science research, and Dennis Dalbey is senior geodemographer, at Marketing Systems Group, a Horsham, Pa., research firm.
The Computerized Delivery Sequence (CDS) File is the raw address file from the U.S. Postal Service containing nearly every deliverable postal address in the U.S. In its raw form it does not lend itself to use as a sampling frame. However, by merging it with other public and commercial data sources to append geographic and demographic data, it can be transformed into a robust and versatile tool for survey research. The sources incorporated include government ones, such as the American Community Survey, Current Population Survey and decennial Census. In addition, a myriad of ancillary data items such as telephone number, name and demographics (e.g., age, gender, income, ethnicity, education, presence of children) can be appended to the addresses from a multitude of commercially available databases. Drawing on multiple data sources yields the highest possible match rates and can improve the accuracy of the appended data items.
Additional frame enhancements include amelioration of some of the known coverage problems associated with the CDS, particularly in rural areas, where more households rely on P.O. boxes and address formats are inconsistent. Another enrichment is the capability to geocode every address with all levels of census geography, including census blocks.
Linear interpolation
Addresses are geocoded using a method known as linear interpolation. Interpolation is a fancy way of saying estimation. The geocoded location of each address is estimated based on the range of numeric values between the starting and ending nodes of each street segment. Generally, every segment is assigned two address ranges, one even and one odd, such as 100-198 and 101-199. The values at each node correspond to the known starting and ending addresses found on both sides of the street.
Geocoded addresses are then plotted based on where the numeric value of each address falls in the range between the starting and ending nodes. For example, if we were to interpolate 150 Main Street, it would fall approximately midway between 100 and 199, so the address would be plotted in the middle of the street segment and offset to the side of the street that carries the even numbers.
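The 150 Main Street example can be worked through in a few lines of Python. This is only a minimal sketch of the general technique, not the geocoder used to build the frame: the `Segment` class, the planar coordinates, the 10-unit offset and the convention that even numbers sit on the left of the segment are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Segment:
    """Hypothetical street segment: two end nodes plus one address range per side."""
    x0: float
    y0: float
    x1: float
    y1: float
    even_range: Tuple[int, int]  # e.g. (100, 198)
    odd_range: Tuple[int, int]   # e.g. (101, 199)


def interpolate_address(seg: Segment, house_number: int, offset: float = 10.0):
    """Estimate an (x, y) location for house_number along seg.

    Assumption: even numbers lie to the left of the direction of travel
    from the start node to the end node, odd numbers to the right.
    """
    is_even = house_number % 2 == 0
    lo, hi = seg.even_range if is_even else seg.odd_range
    if not lo <= house_number <= hi:
        raise ValueError("house number is outside this segment's range")

    # Fraction of the way along the segment (0 = start node, 1 = end node).
    t = 0.5 if hi == lo else (house_number - lo) / (hi - lo)
    x = seg.x0 + t * (seg.x1 - seg.x0)
    y = seg.y0 + t * (seg.y1 - seg.y0)

    # Unit vector perpendicular to the segment, pointing left of travel.
    dx, dy = seg.x1 - seg.x0, seg.y1 - seg.y0
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length

    side = 1.0 if is_even else -1.0
    return x + side * offset * nx, y + side * offset * ny


# A straight east-west block, 98 units long, starting at the origin.
main_street = Segment(0.0, 0.0, 98.0, 0.0, (100, 198), (101, 199))
print(interpolate_address(main_street, 150))
```

With these made-up coordinates, number 150 lands roughly halfway along the block and 10 units to the even side. The house's real position on its lot plays no part in the calculation, which is why the precision of the result depends on how evenly the housing units are spaced.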
This method of geocoding will always be accurate enough to append the correct underlying geography to each address, but it will not always match the precision of rooftop geocoding. Here’s why: Interpolated geocoding is accurate to the street level. That is to say, if the correct address and ZIP code are provided, the geocoded result will be accurate enough to determine the correct underlying geography of the address. How close the geocoded location is to the actual real-world location of the address will vary depending on factors such as urban or rural setting, the distance between housing units, and the presence of parks and open space.
Housing units across the country vary in size, shape, distance from the street and setting (urban, suburban or rural). The main goal of this process is to get the correct census block appended to the address record. Figure 1 illustrates street-level geocoding in different density settings and how the precision of each geocode changes between urban, suburban and rural settings while the accuracy stays the same. For addresses that can’t be geocoded at this level, census geography is assigned based on the ZIP+4 centroid; addresses still unmatched are assigned based on the ZIP+2 centroid and, finally, the standard five-digit ZIP code centroid. Over 90 percent of the ABS frame is geocoded at the street level.
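The fallback order described above (street level, then ZIP+4, then ZIP+2, then five-digit ZIP) can be sketched as a simple cascade of lookups. The record fields, the lookup tables and the block IDs below are hypothetical placeholders; a production geocoder would resolve centroids against real reference files rather than small dictionaries.

```python
def assign_census_block(record, street_blocks, zip4_blocks, zip2_blocks, zip5_blocks):
    """Return (block_id, match_level) using the most precise level available.

    All four lookup tables are hypothetical stand-ins:
      street_blocks: (address, zip5) -> block from street-level geocoding
      zip4_blocks:   nine-digit ZIP+4 -> block containing the ZIP+4 centroid
      zip2_blocks:   seven-digit ZIP+2 -> block containing the ZIP+2 centroid
      zip5_blocks:   five-digit ZIP -> block containing the ZIP centroid
    """
    zip5 = record.get("zip5", "")
    plus4 = record.get("plus4", "")

    # Ordered from most to least precise; the first hit wins.
    candidates = [
        ("street", street_blocks, (record.get("address", ""), zip5)),
        ("zip+4", zip4_blocks, zip5 + plus4),
        ("zip+2", zip2_blocks, zip5 + plus4[:2]),
        ("zip5", zip5_blocks, zip5),
    ]
    for level, table, key in candidates:
        block = table.get(key)
        if block is not None:
            return block, level
    return None, "unmatched"


# Toy reference tables with made-up block IDs.
street_blocks = {("150 MAIN ST", "19044"): "BLOCK-A"}
zip4_blocks = {"190441234": "BLOCK-B"}
zip2_blocks = {"1904412": "BLOCK-C"}
zip5_blocks = {"19044": "BLOCK-D"}

record = {"address": "9 ELM ST", "zip5": "19044", "plus4": "1299"}
print(assign_census_block(record, street_blocks, zip4_blocks, zip2_blocks, zip5_blocks))
```

Returning the match level alongside the block lets an analyst see how each record was geocoded and, if desired, treat the coarser centroid matches differently.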
This array of enhancements enables researchers to develop more efficient sample designs as well as broaden their analytical possibilities through an expanded set of covariates for hypothesis testing and statistical modeling tasks. To distinguish this amplified residential address frame from the original USPS CDS file, it has been branded as the address-based sampling (ABS) frame.
ABS provides the highest coverage possible for an address sampling frame, making it the gold standard for mail surveys. Figure 2 breaks out the ABS frame by address type. The counts are distinct (each address is counted only once) even though an individual address can take on more than one characteristic (e.g., a vacant city-style address). The only exception is that the drop points count is also included in the count of drop units. Simplified addresses are generally rural addresses without a physical street address and are not included in the ABS frame. Before the 911 conversion campaign, which aimed to make every address locatable for emergency response services, the number of simplified addresses was roughly 10 million. With the success of the conversion campaign, as well as other augmentations, the number of simplified addresses is currently 467,357.
Researchers can also take advantage of the ABS frame in designing sampling plans for in-person household surveys, which typically involve a multi-stage sampling methodology of primary sampling units (PSUs) and secondary sampling units (SSUs) based on census blocks. The CDS file does not contain any census geography, so prior to ABS only traditional field listing (physically going into the areas and listing every address in that segment) could be employed to develop the sampling frame of addresses. As you can imagine, this is a very costly and time-consuming process. Researchers can now rely on the ABS frame to obtain the list of addresses to sample from, since every address is geocoded to a census block. Survey researchers sought ways to use the ABS frame to dramatically reduce the time and cost of developing the sampling frame without compromising quality. It should be noted that only locatable addresses can be considered for this purpose. This means P.O. boxes, rural routes, highway contracts and simplified addresses will be excluded.
Whether a field listing or an ABS methodology is used to develop the sampling frame for in-person household surveys, both will have inherent non-coverage issues that need to be addressed. Several different approaches to coverage enhancement have been developed. For more information on these procedures please view the AAPOR ABS Task Force Report available at http://bit.ly/25B7UUE.
Three-stage sampling methodology
The following focuses on the sampling methodology and the linkage procedures used to reduce the non-coverage bias by referencing a paper published by Westat, a long-standing research partner of Marketing Systems Group. Only the tip of the iceberg will be covered but the full published paper detailing all the research findings and citations can be viewed in the Journal of Survey Statistics and Methodology, Vol. 2, No. 3, September 2014 edition in “Handling frame problems when address-based sampling is used for in-person household surveys.” The paper was authored by Graham Kalton, Jennifer Kali and Richard Sigman of Westat. Access it at http://bit.ly/20TvjtW.
For the study described in the aforementioned article, Westat implemented a three-stage sample design for a national household survey of over 65,000 sampled households. The third stage involved selecting a sample of addresses. Although the ABS frame was used as the primary source of addresses for this stage of sampling, as noted in the article, “The main address sample was supplemented by a sample of addresses that were either not on the USPS lists or not locatable from those lists.”
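To make the idea of a three-stage design concrete, here is a minimal sketch that draws PSUs, then SSUs within each selected PSU, then addresses within each selected SSU. It is not a reconstruction of the Westat design: the nested `frame` dictionary, the unit names and the use of simple random sampling at every stage are assumptions for illustration (designs of this kind often select units with probability proportional to size).

```python
import random


def three_stage_sample(frame, n_psu, n_ssu, n_addr, seed=0):
    """Draw a simple random sample at each of three stages.

    frame is a hypothetical nested mapping: {psu: {ssu: [addresses]}}.
    Returns a list of (psu, ssu, address) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []

    # Stage 1: select primary sampling units.
    for psu in rng.sample(sorted(frame), min(n_psu, len(frame))):
        segments = frame[psu]
        # Stage 2: select secondary sampling units within the PSU.
        for ssu in rng.sample(sorted(segments), min(n_ssu, len(segments))):
            addresses = segments[ssu]
            # Stage 3: select addresses within the SSU.
            for addr in rng.sample(addresses, min(n_addr, len(addresses))):
                sample.append((psu, ssu, addr))
    return sample


# Toy frame: 3 PSUs, each with 3 block-based segments of 5 addresses.
toy_frame = {
    f"PSU{p}": {
        f"SEG{p}{s}": [f"{100 + 2 * i} Street {p}{s}" for i in range(5)]
        for s in range(3)
    }
    for p in range(3)
}

selected = three_stage_sample(toy_frame, n_psu=2, n_ssu=2, n_addr=3)
print(len(selected))  # 2 PSUs x 2 SSUs x 3 addresses = 12
```

With an ABS frame, the address lists at the third stage come straight from the geocoded file rather than from field listing, which is where the time and cost savings arise.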
Kalton, Kali and Sigman (2014) also echoed the sentiment that the coverage of locatable addresses in rural areas has greatly improved due to the 911 conversion campaign. Aside from the exclusion of unlocatable addresses, under-coverage can also stem from the accuracy of the geocoding. For example, an address right on the border of two census blocks may be geocoded to one block when it is physically located in the adjacent block. This can result in some addresses being geocoded outside the sampling area and, on the flip side, some addresses that are actually outside the sampling area being geocoded into it.
To identify and correct for any discrepancies in geocoding, Kalton, Kali and Sigman (2014) chose to conduct an address coverage enhancement (ACE) procedure for some of the sampled areas. They had field listers physically go out and record whether the addresses identified on the ABS frame as being located in these areas were indeed in or out. About 90 percent were reported to be correctly geocoded, but there were differences based on the type of area (urban vs. rural). In urban areas the accuracy was about 92 percent but in rural areas it was about 79 percent.
At the end of the study Westat deemed that ABS proved to be an acceptable frame for in-person household surveys, which coincides with the findings of other researchers who specialize in this field. For some small-scale surveys of urban areas, it is possible that the ABS frame alone could be used as the sampling frame. For larger studies or in rural areas you would need to incorporate some sort of coverage enhancement procedure. It was also noted that including the no-stat addresses that are not part of the main ABS file could be considered as an option to decrease some of the non-coverage. No-stat addresses do come from the U.S. Postal Service CDS file but are not part of the main CDS frame. These addresses can’t be mailed to but they can be located for in-person household surveys. Currently there are about 8 million no-stat addresses, about 10 percent of which are occupied.
Provides the best balance
Government-funded surveys are very concerned with coverage because they can’t afford to have any bias in their outcomes, since policy and program decisions are based on these findings. At the same time they also have budgets that need to be met. In the case of in-person household surveys, ABS provides the best balance between coverage and cost. One last note: ABS is not limited to this type of research. On the contrary, it’s a workhorse for probability-based panel recruitment, multimode surveys, non-response follow-ups and for designing stratified samples of hard-to-reach populations, providing the highest efficiency rates without jeopardizing coverage.