Accuracy of LexisNexis-derived retrospective address histories in the Sister Study cohort

Study population

The Sister Study is an ongoing prospective cohort study designed to investigate environmental risk factors for breast cancer [13]. Between 2003 and 2009, a total of 50,884 women across the United States and Puerto Rico were enrolled into the study. All participants provided written informed consent prior to enrollment. The collection of data in the Sister Study and linkage with commercial data sources has been approved by the institutional review board of the National Institute of Environmental Health Sciences. We excluded participants who had withdrawn from the study (n = 9; Data Release 11.1).

Sources of address data

LexisNexis is a commercial vendor of data products that aggregates information from public records including credit reporting data, real estate and tax records, property deed transfers and mortgages, driver’s license records, court filings, and state death registries. We requested up to the 20 most recent addresses for all Sister Study participants and provided LexisNexis with the following information: full name, gender, date of birth, enrollment address, phone number, and date of death (when applicable). We also used the United States Postal Service (USPS) Residential Delivery Indicator product (RDI; https://postalpro.usps.com/address-quality-solutions/residential-delivery-indicator-rdi) to identify business addresses. Self-reported residence history at baseline included the street address and dates of residence for participants’ primary residence at study enrollment and for the residence where they lived longest as an adult.

Cleaning and processing of address data

LexisNexis provided a set of addresses for each participant that included the street address, city, state, zip code, and start and stop dates “seen” for each address. To create continuous residential histories from the set of LexisNexis addresses, we adapted a published algorithm to clean the address data [6]. After excluding addresses with missing information, we used the USPS RDI to identify and exclude business addresses. We also excluded addresses with timeframes not within or overlapping the period from 1980 through the study enrollment year and truncated address stop years at the enrollment year.
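The exclusion and truncation rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published algorithm [6]: the `Address` record, its field names, and the `is_business` flag (standing in for the USPS RDI lookup result) are hypothetical, and timeframes are simplified to calendar years.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Address:
    # Hypothetical record layout; the vendor's actual field names are not given.
    street: Optional[str]
    city: Optional[str]
    state: Optional[str]
    zip_code: Optional[str]
    start_year: Optional[int]
    stop_year: Optional[int]
    is_business: bool = False  # assumed to come from a prior USPS RDI lookup


def clean_addresses(addresses, enrollment_year, window_start=1980):
    """Apply the exclusion and truncation rules to one participant's records."""
    cleaned = []
    for a in addresses:
        fields = (a.street, a.city, a.state, a.zip_code, a.start_year, a.stop_year)
        # Exclude records with any missing address or date information.
        if any(f is None or f == "" for f in fields):
            continue
        # Exclude business addresses.
        if a.is_business:
            continue
        # Exclude timeframes that neither fall within nor overlap the window.
        if a.stop_year < window_start or a.start_year > enrollment_year:
            continue
        # Truncate the stop year at the enrollment year.
        cleaned.append(replace(a, stop_year=min(a.stop_year, enrollment_year)))
    return cleaned
```

Start years before 1980 are left unchanged in this sketch because the text describes truncation only for stop years.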

We performed several steps to reconcile incongruities in timeframes of the cleaned address data. First, to ensure the residential histories reflected time that can be linked to meaningful exposure durations relative to chronic disease outcomes, we excluded short-duration addresses (≤31 days) [6]. Next, when self-reported baseline or longest adult address locations matched the LexisNexis records, we substituted the LexisNexis dates with self-reported dates of residence. Then, we sorted addresses by their start dates. When there were matching street addresses, we combined the time frames (which also reconciled duplicate addresses). When gaps or overlaps existed, we assigned the start date of the following address as the end date of the preceding address. We followed this procedure for resolving gaps and overlaps in address histories given evidence that start dates in LexisNexis are more accurate than end dates [6].
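A minimal Python sketch of the reconciliation steps follows. It makes several assumptions not stated in the text: records are `(address_key, start_date, stop_date)` tuples, the hypothetical `address_key` is an already-normalized street address, and all records sharing a key are combined regardless of position in the sequence. The substitution of self-reported dates is omitted for brevity.

```python
from datetime import date


def build_history(records):
    """Turn (address_key, start_date, stop_date) tuples into a continuous history."""
    # Drop short-duration records (31 days or fewer).
    kept = [r for r in records if (r[2] - r[1]).days > 31]

    # Combine time frames of matching street addresses; this also
    # collapses duplicate records for the same address.
    merged = {}
    for key, start, stop in kept:
        if key in merged:
            s, e = merged[key]
            merged[key] = (min(s, start), max(e, stop))
        else:
            merged[key] = (start, stop)

    # Sort by start date.
    ordered = sorted(((k, s, e) for k, (s, e) in merged.items()),
                     key=lambda r: r[1])

    # Resolve gaps and overlaps: each address ends when the next one starts.
    history = []
    for i, (key, start, stop) in enumerate(ordered):
        if i + 1 < len(ordered):
            stop = ordered[i + 1][1]
        history.append((key, start, stop))
    return history
```

Because only end dates are overwritten, the start dates supplied by the vendor are preserved, which matches the stated rationale that start dates are the more accurate of the two.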

Address validation sample

Of the participants with LexisNexis residential history data, we selected 1000 women to participate in an address validation study. To ensure the sample included participants across sociodemographic groups to facilitate comparisons, we drew a weighted random sample based on baseline age and self-reported race and ethnicity: 25% non-Hispanic White (NHW) and ≤55 years, 25% NHW and >55 years, 25% non-NHW and ≤55 years, and 25% non-NHW and >55 years.
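As an illustration of the sampling design, the sketch below draws an equal number of participants from each of the four strata (250 each for a sample of 1000). The participant dictionaries, their keys, and the fixed seed are assumptions made for the example; the study's actual sampling software is not described.

```python
import random


def stratum(age, race_ethnicity):
    """Assign one of the four sampling strata from baseline age and race/ethnicity."""
    group = "NHW" if race_ethnicity == "NHW" else "non-NHW"
    age_group = "<=55" if age <= 55 else ">55"
    return (group, age_group)


def draw_sample(participants, per_stratum=250, seed=0):
    """Draw a simple random sample of equal size within each stratum."""
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    groups = {}
    for p in participants:
        groups.setdefault(stratum(p["age"], p["race_ethnicity"]), []).append(p)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        # Sample without replacement; take everyone if a stratum is small.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```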

Study participants selected into the validation sample were asked to complete an address validation questionnaire. The form was personalized for each participant and included a list of their LexisNexis addresses (street name and number, city, state, zip code, and corresponding years of residence) up to their year of study enrollment. For each address, participants had the option to select “No updates” if the address was correct, or “Yes, updates” if any of the provided address information was incorrect. If participants selected “Yes, updates,” they were instructed to provide the correct address information. Separately, participants were asked to provide address information for any residences at which they lived between 1980 and their enrollment year that were not included on the form.

Descriptive summaries and analysis

Among all participants with available residential history data, we summarized the distribution of the number of addresses per participant, duration of residence (years) at each address, total years of address history, and age at the start of the earliest address, overall and by sociodemographic characteristics.

To evaluate the accuracy of LexisNexis address locations among the validation sample, we calculated the proportion of addresses confirmed at the detailed street (street name and number), street name, zip code, city, and state level, as well as the proportion of LexisNexis addresses that were assigned to the same census tract as the verified/corrected address. To evaluate the accuracy of timing, we calculated the proportion with confirmed dates of residence (start and/or stop year) and the percent of time (years) correctly covered by the LexisNexis addresses.
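The location and timing metrics can be illustrated with a small Python sketch. The dictionary keys, the whitespace and case normalization, and the year-by-year definition of "time correctly covered" are assumptions for this example; the text does not specify how matches were determined or how coverage was computed.

```python
LEVELS = ("street", "street_name", "zip_code", "city", "state")


def normalize(value):
    """Upper-case and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).upper().split())


def confirmed_levels(lexis, verified):
    """Return, for each level, whether the LexisNexis value matches the verified one."""
    return {lvl: normalize(lexis[lvl]) == normalize(verified[lvl]) for lvl in LEVELS}


def proportion_confirmed(pairs):
    """Proportion of (lexis, verified) address pairs confirmed at each level."""
    counts = {lvl: 0 for lvl in LEVELS}
    for lexis, verified in pairs:
        for lvl, ok in confirmed_levels(lexis, verified).items():
            counts[lvl] += ok
    return {lvl: counts[lvl] / len(pairs) for lvl in LEVELS}


def percent_years_covered(lexis_by_year, verified_by_year):
    """Percent of verified residence years assigned to the same address by LexisNexis."""
    years = sorted(verified_by_year)
    matches = sum(1 for y in years if lexis_by_year.get(y) == verified_by_year[y])
    return 100.0 * matches / len(years)
```

Census tract agreement would be handled the same way once both addresses are geocoded, by comparing tract identifiers instead of address fields.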

Address validation metrics were calculated overall and by self-reported sociodemographic characteristics collected at baseline: age (≤45, 46 to 55, 56 to 65, and >65 years), race and ethnicity [Hispanic, non-Hispanic Black, NHW, additional groups (including American Indian or Alaska Native, Asian, Native Hawaiian or other Pacific Islander, and unknown or not specified)], educational attainment (high school graduate or lower, some college, four-year degree or higher), and household income (<$50,000, $50,000–$99,999, and ≥$100,000). We also calculated proportions by address urbanicity (based on 2003 USDA Rural-Urban Continuum Codes [14]: urban/metro (codes 1 to 3), urban/non-metro (codes 4 to 7), or rural county (codes 8 and 9)) and census region (Midwest, Northeast, South, West, Puerto Rico). To make the results representative of the overall population with address history data, we calculated a weighted mean proportion for the overall sample, age groups, and race and ethnicity groups to account for sampling proportions; for all other groups, we calculated a simple proportion. Weights were calculated as the ratio of each sampling group’s proportion among analyzed respondents to their proportion in the full cohort with address history data. To assess differences between subgroups, we obtained p-values from chi-square tests of independence (α = 0.05; tests for age and race/ethnicity groups were performed on weighted counts).
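To show why weighting matters when strata are sampled in equal numbers, the sketch below computes a post-stratified proportion in which each stratum's observed proportion is weighted by that stratum's share of the cohort. This is one common formulation and may not match the authors' exact weight definition; the dictionary keys and all counts in the usage example are invented for illustration.

```python
def weighted_proportion(strata):
    """Combine stratum-specific proportions using each stratum's cohort share.

    Each stratum is a dict with hypothetical keys:
      cohort_n    - number in the full cohort with address history data
      respondents - number of analyzed respondents in the stratum
      confirmed   - number of respondents with the outcome (e.g., address confirmed)
    """
    cohort_total = sum(s["cohort_n"] for s in strata)
    return sum(
        (s["cohort_n"] / cohort_total) * (s["confirmed"] / s["respondents"])
        for s in strata
    )


def simple_proportion(strata):
    """Unweighted proportion, pooling all respondents."""
    return sum(s["confirmed"] for s in strata) / sum(s["respondents"] for s in strata)


# Invented counts: one stratum is 80% of the cohort but 50% of respondents.
example = [
    {"cohort_n": 8000, "respondents": 100, "confirmed": 90},
    {"cohort_n": 2000, "respondents": 100, "confirmed": 70},
]
```

With these invented counts, the simple proportion is 160/200 = 0.80, while the weighted proportion is 0.8 × 0.90 + 0.2 × 0.70 = 0.86, because the larger stratum is under-represented among respondents.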

For addresses with a correction to location or dates of residence, we summarized the distance (km) and difference in years, respectively, between the geocoded LexisNexis and updated addresses.
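The text does not state how distance was computed. One plausible choice for two geocoded points is the great-circle (haversine) distance, sketched here under that assumption with a mean Earth radius of about 6371 km.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

As a sanity check, one degree of latitude along a meridian comes out at roughly 111.2 km. The difference in years between a LexisNexis date and its correction is a plain subtraction of the two years.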
