BigSurv18 program
Big Data, Little Error? Assessing the Total Error of Survey Estimates in the Era of Big Data
Chair: Dr Ashley Amaya (RTI International)
Time: Friday 26th October, 09:45 - 11:15
Room: 40.109
Total Error Frameworks for Hybrid Estimation and Their Applications
Final candidate for the monograph
Dr Paul Biemer (RTI International) - Presenting Author
Dr Ashley Amaya (RTI International)
Hybrid estimation is a process whereby estimates are derived from multiple datasets, especially survey, administrative and found (or Big Data) datasets. Hybrid estimates can be quite complex: they may involve combining probability and nonprobability samples, linking individual records, pooling nonlinked data, and modeling and statistically adjusting the estimates based upon external data using nontraditional approaches. This paper reviews the current literature on total error frameworks for survey and non-survey estimators and considers their strengths, weaknesses and utility. It then proposes a general total error framework that can be adapted to a range of hybrid estimators, from linked survey and administrative data sets to complex data compilations such as government national accounts. With this total error perspective, data producers can more objectively and comprehensively consider the entire range of quality dimensions and error sources associated with hybrid estimators, so that the source datasets can be combined in ways that leverage their positive quality attributes and diminish their negative ones. The paper will also demonstrate the uses of this framework for (a) designing hybrid estimators that minimize total error, (b) evaluating the quality of a hybrid estimator, and (c) improving the quality of an existing hybrid estimator. For example, one tool that is particularly useful for these purposes is the so-called error risk profile, which assigns "intrinsic" and "residual" risks to each error source that is inherent in the hybrid estimator.
Three applications of this and other tools will be discussed in some detail in the paper: (1) combining administrative and survey data on housing square footage in a national survey of residential energy use; (2) combining found data and survey data to estimate the market share of consumer products; and (3) combining survey, administrative and online data for estimating the gross domestic product (GDP) for a national statistical office.
Understanding the Effects of Record Linkage on Estimation of Total When Combining a Big Data Source With a Probability Sample
Mr Benjamin Williams (Southern Methodist University) - Presenting Author
Dr Lynne Stokes (Southern Methodist University)
Much recent research has focused on methods for combining a probability sample with a non-probability sample to improve estimation by making use of information from both sources. If units exist in both samples, it becomes necessary to link the information from the two samples for these units. Record linkage is a technique to link records from two lists that refer to the same unit but lack a unique identifier across both lists. Record linkage assigns a probability to each potential pair of records from the lists so that principled matching decisions can be made. Because record linkage is a probabilistic endeavor, it introduces randomness into estimators that use the linked data. The effects of this randomness on regression involving the linked datasets have been examined (for example, Lahiri and Larsen, 2005). However, the effect of matching error has not been considered for the case of estimating the total of a population from a capture-recapture model. In this paper, we investigate a motivating example with this structure, described below.
The National Marine Fisheries Service (NMFS) estimates the total number of fish caught by recreational marine anglers. Currently, NMFS arrives at this by estimating from independent surveys the total effort (the number of fishing trips) and the catch per unit effort or CPUE (the number of fish caught per species per trip), and then multiplying them together. Effort data are collected via a mail survey of potential anglers. CPUE data are collected via face-to-face intercepts of anglers completing fishing trips at randomly selected times/docks. The biologist-interviewers identify the catch totals of intercepted anglers by species. The effort survey has a high non-response rate. It is also retrospective, so the entire estimation process takes more than a month, which precludes in-season management.
Due to these limitations, NMFS is experimenting with replacing the effort survey with electronic self-reporting. The anglers report details of their trip via an electronic device and remain eligible to be sampled in the dockside intercept.
Several estimators based on capture-recapture methodology have been proposed to estimate total catch from these self-reports alongside the dockside intercept (Liu et al., 2017). For the estimators to be valid, the records from trips that both self-reported and were sampled in the intercept survey must be linked. The self-reported data are a non-probability sample because they are voluntarily submitted, and they can be considered a big data source, while the dockside intercept is a smaller probability sample. Liu et al. assumed perfect matching; however, this is difficult to achieve in practice because of device and measurement error. Currently, the effect of potential matching errors on the estimates is unknown.
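To illustrate why matching error matters for a capture-recapture total, here is a minimal sketch using the classical Lincoln-Petersen estimator. This is not one of the estimators of Liu et al.; it assumes an equal-probability intercept sample, and every count below is hypothetical.

```python
def lincoln_petersen(n_reported: int, n_intercepted: int, n_matched: int) -> float:
    """Classical Lincoln-Petersen estimate of the total number of trips.

    n_reported:    trips with an electronic self-report (the "capture" list)
    n_intercepted: trips sampled at the dock (the "recapture" sample)
    n_matched:     intercepted trips that were linked to a self-report
    """
    if n_matched <= 0:
        raise ValueError("at least one matched trip is required")
    if n_matched > min(n_reported, n_intercepted):
        raise ValueError("matches cannot exceed either list size")
    return n_reported * n_intercepted / n_matched


# Hypothetical counts, for illustration only.
n_reported, n_intercepted, true_matches = 2_000, 500, 100

perfect = lincoln_petersen(n_reported, n_intercepted, true_matches)

# Suppose the linkage step misses 10% of the true matches (false non-matches).
observed_matches = 90
with_missed = lincoln_petersen(n_reported, n_intercepted, observed_matches)

print(f"perfect matching:      {perfect:.0f} trips")
print(f"10% missed matches:    {with_missed:.0f} trips")
print(f"relative overestimate: {with_missed / perfect - 1:.1%}")
```

Because the matched count sits in the denominator, missed matches inflate the estimated total, while false matches would deflate it.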
In this paper, we propose several estimators in addition to those from Liu et al. (2017). Then, we develop a more sophisticated record linkage algorithm to match trips present in both samples. We examine the effect of errors in record linkage on our estimators when compared to the current assumption of perfect matching.
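As a rough illustration of how probabilistic record linkage scores a candidate pair, the following sketch uses Fellegi-Sunter-style match weights. It is not the authors' algorithm: the field names, the m- and u-probabilities, and the prior match rate are all hypothetical, and fields are assumed to agree or disagree independently of one another.

```python
import math

# Hypothetical m- and u-probabilities per comparison field:
#   m = P(field agrees | the two records are the same trip)
#   u = P(field agrees | the two records are different trips)
FIELD_PARAMS = {
    "trip_date": (0.95, 0.05),
    "landing_site": (0.90, 0.10),
    "party_size": (0.80, 0.25),
}


def match_weight(agreement: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    weight = 0.0
    for field, agrees in agreement.items():
        m, u = FIELD_PARAMS[field]
        ratio = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(ratio)
    return weight


def match_probability(agreement: dict[str, bool], prior: float) -> float:
    """Posterior probability that a pair is a true match, given a prior match rate."""
    odds = prior / (1 - prior) * 2 ** match_weight(agreement)
    return odds / (1 + odds)


all_agree = {"trip_date": True, "landing_site": True, "party_size": True}
site_differs = {"trip_date": True, "landing_site": False, "party_size": True}

for label, pattern in [("all agree", all_agree), ("site differs", site_differs)]:
    p = match_probability(pattern, prior=0.01)
    print(f"{label}: weight={match_weight(pattern):.2f}, P(match)={p:.3f}")
```

A pair is typically declared a match when its probability exceeds a chosen threshold, so a single misreported field can push a true match below the cutoff.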
Income Data Linkage in the Swiss Context: What Can We Learn Regarding Different Error Sources?
Professor Boris Wernli (FORS) - Presenting Author
Mr Nicolas Pekari (FORS)
Professor Georg Lutz (FORS)
We focus on how data linkage can be used to provide an empirical basis to assess some elements of the Total Survey Error (TSE) framework. One key survey question, and one which often suffers from significant nonresponse and measurement error, is income. Using validated income data, we study how data linkage can help us understand these different sources of error. We use data from two surveys: the 2015 post-electoral Swiss electoral study (Selects), conducted as a CATI-CAWI mixed mode survey, and the European Values Study (EVS) 2017, conducted as a CAPI-CAWI-PAPI mixed mode survey. These datasets were subsequently merged with income data from the Swiss social security system. The initial register-based samples were provided by the Swiss Federal Statistical Office (FSO). For all sampled citizens, socio-demographic information (gender, birth year, household size, marital status, country of birth, canton, community number) was available. This information was then enriched with income data from the Swiss social security system, which includes AVS (old age and survivors pension scheme), AI (invalidity insurance), APG (allowance for loss of earnings), and AC (unemployment insurance). Detailed income information was provided for all household members of the sampled persons.
Using the TSE framework and based on the linked datasets, we address the following questions: 1) Do non-respondents present a different distribution of income compared to respondents, and what is the influence of income on total non-response, compared to other known parameters (unit non-response)? 2) Do non-respondents to the survey income question present a specific distribution of income (item non-response)? 3) Which parameters, including social, demographic or political indicators, as well as mode effects, have an impact on the difference between declared and registered values concerning household income (measurement error)? 4) What is the influence of this departure from validated data on the study of some dependent variables (measurement error)? We will present preliminary answers to the above-mentioned questions, using multivariate analysis to disentangle the impact of different families of explanatory factors. Our aim is to be able to generalize our results within the Swiss context, by showing that they remain stable from one area of inquiry to another.
Combining Administrative Data and Survey Samples for the Intelligent User
Dr Phillip Kott (RTI International) - Presenting Author
In her startling WSS President’s Invited Lecture in 2014, Connie Citro called for the slow and careful implementation of a paradigm shift in the way government agencies produce federal statistics. She provided several reasons for a shift away from a primary reliance on survey sampling, chief among them increasing costs, both financial and psychic (e.g., dealing with irate Congressmen complaining about the burden of government surveys on their constituents).
We will review recent literature on using simple linear models together with probability-sampling principles when combining administrative data with survey samples to produce useful estimates at levels of aggregation where using the latter alone would be inadequate. Multiple imputation fails as a method of variance estimation in this context. Jackknife variance estimation can be used in its place, but a jackknife requires the generation of multiple data sets, many more than the five that are standard with multiple imputation.
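To show why a jackknife needs many replicate data sets, here is a minimal sketch of the delete-one jackknife for an unweighted sample. It is a simplification under stated assumptions: a survey jackknife would normally delete sampling units or groups and recompute weights and model fits in each replicate, and the sample values below are hypothetical.

```python
from statistics import mean


def jackknife_variance(values, estimator=mean):
    """Delete-one jackknife variance estimate for `estimator`."""
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # One replicate estimate per deleted observation: n replicate data sets.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = mean(replicates)
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


# Hypothetical sample, for illustration only.
sample = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
print(jackknife_variance(sample))
```

For the sample mean, this reproduces the usual variance estimate (the sample variance divided by n). The same function accepts any estimator that maps a list of values to a number.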
Since the approach described above depends on a model, statistical tests will be proposed to assess the viability of the model and also to inform users of potential biases in the estimates. Issues of Type 1 and Type 2 error, which often separate survey sampling from the rest of statistics, need to be conveyed to the user, as do all the other problems associated with the estimation process. Indeed, that is the paradigm shift I am proposing: government statistical agencies need to stop treating users like they are dumber than dirt and cater more to intelligent users of their statistics. They also need to walk humbly with their data products to convey to the less sophisticated users that there are many reasons not to accept government statistics at face value. Nevertheless, these statistics provide close to the best estimates humanly possible given the constraints of living in our complex modern world.