BigSurv18 program







Ethical Considerations for Using Big Data II: Exploring the Relationship between Ethical Considerations, Reproducibility, and Participation

Chair: Dr Rebecca Powell (RTI International)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.150

Justice Rising - The Growing Ethical Importance of Big Data, Survey Data, Models, and AI

Dr Richard Timpone (Ipsos) - Presenting Author
Dr Charles Taber (Kansas State University)
Dr Yongwei Yang (Google)


In past work, the criteria of Truth, Beauty, and Justice were proposed for evaluating models (Taber and Timpone 1996, Lave and March 1993). Justice, while relevant, was then seen as the least practical of the modeling considerations, but that is no longer the case.

As the nature of data and computing power has opened new possibilities for applying data and algorithms, from public policy decision making to technological advances like self-driving cars, ethical considerations have become far more important in the work researchers are doing, both intentionally and unintentionally.

The number of works arguing that there are fundamental problems in the data, machine learning, and AI applications being developed today continues to increase (O’Neil 2016, Wachter-Boettcher 2017, Eubanks 2017). While risks exist, and will continue to grow with the applications, we believe that understanding the types of issues that can arise will allow those collecting Big Data and survey data, and implementing them in models, from academia to industry to government, to advance the intended goals while avoiding potential risks and pitfalls.

By framing the different sources of risk, we believe more prescriptive approaches can be applied to avoid unintended ethical issues. These fall into several camps: fundamental accuracy (models and data), representativeness (survey and Big Data), considering when the Why matters beyond the What identified in Big Data (where survey data may become a supplement), and the diversity of information used to train the models themselves (model development and machine learning training). While these do not replace the value of a diverse team, they can allow any team, regardless of its composition, to develop systematic processes for thinking broadly about potential pitfalls in the collection and application of the data.
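To make this concrete, the short sketch below illustrates the kind of systematic representativeness check the second camp calls for: comparing a data source's group shares against population benchmarks and flagging large gaps. It is our illustration, not the authors'; all group labels and figures are hypothetical.

```python
# Illustrative representativeness check (hypothetical groups and figures).
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # e.g. census benchmarks
sample_counts = {"18-34": 480, "35-54": 310, "55+": 210}         # e.g. a Big Data source

total = sum(sample_counts.values())
for group, benchmark in population_shares.items():
    observed = sample_counts.get(group, 0) / total
    gap = observed - benchmark
    flag = "  <-- review" if abs(gap) > 0.05 else ""  # arbitrary 5-point tolerance
    print(f"{group}: sample {observed:.2%} vs benchmark {benchmark:.2%} (gap {gap:+.2%}){flag}")
```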

With the increased use of algorithms in every aspect of our lives, a framework for considering the issues of data, algorithms, and AI, and ways to address each in the research process, may help mitigate some risks across academic, industry, governmental, and NGO work.


Reproducibility in the Era of Big Data: Lessons for Developing Robust Data Management and Data Analysis Procedures

Final candidate for the monograph

Dr D. Betsy McCoach (Neag School of Education, University of Connecticut) - Presenting Author
Dr Jennifer Dineen (Department of Public Policy, University of Connecticut)
Dr Sandra Chafouleas (Neag School of Education, University of Connecticut)
Dr Amy Briesch (Bouvé College of Health Sciences, Northeastern University)

The first part of the twenty-first century has ushered in a new era for survey researchers. The ability to collect and use data from a variety of sources has exploded in recent years. For example, when conducting an online survey, in addition to respondents' selected answers, survey researchers can now access information about respondents' response times, whether they changed their responses, and how they navigated through the survey. Publicly available data are easily scraped from the internet, and warehouses of large data are increasingly available. In the age of survey fatigue, as evidenced by declining response rates, researchers increasingly look to big data as a way to ease respondent burden and supplement data collection. Simultaneously, concerns over the reproducibility of research findings have changed expectations about the research process and research products. Increasingly, researchers are expected to provide data sets and robust input files that allow reviewers, editors, and other researchers to reproduce their findings (Jacoby, 2017; National Institutes of Health, 2018), findings that involve not only larger datasets but also data combined from increasingly complex and differing sources. The confluence of these two factors requires that researchers change their approach to data management and analysis.

In this presentation, we will provide an overview of the challenges researchers face in the era of big data and reproducibility and offer recommendations for developing robust data management and data analysis procedures. Our paper will review a case study based on our National Exploration of Emotional/Behavioral Detection in School Screening (NEEDs2) project, an exploratory project funded by the U.S. Department of Education, Institute of Education Sciences. To explore if and how social, emotional, and behavioral screeners are being used in schools, and what factors influence their use, the NEEDs2 project combines five stakeholder group surveys from a nationally representative sample of school districts in the United States with outcome data from (a) the U.S. Department of Education's Common Core of Data, (b) the Civil Rights Data Collection from the U.S. Office for Civil Rights, and (c) state-specific databases. In the presentation, we will discuss lessons learned and the processes involved in (1) documenting the provenance of all variables, (2) creating a data management workflow that allows for reproducibility and transparency, (3) creating a data analysis workflow that is robust and reproducible, (4) using automation to increase efficiency and reproducibility, and (5) automating the creation of data reports, papers, and manuscripts.
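As a minimal illustration of point (1), documenting provenance, the sketch below shows one way to record a checksum and timestamp for each input file before it enters an analysis. It is our sketch, not the authors' tooling; the file names are hypothetical stand-ins for the kinds of sources NEEDs2 combines.

```python
# Minimal provenance-logging sketch (illustrative; not the NEEDs2 pipeline).
# Records a SHA-256 checksum and a UTC timestamp for each input data file,
# so a later reader can verify exactly which files produced a result.
import datetime
import hashlib
import json
import pathlib

def log_provenance(paths, log_file="provenance.json"):
    """Write a JSON provenance record for each input data file."""
    records = []
    for p in map(pathlib.Path, paths):
        records.append({
            "file": str(p),
            "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
    pathlib.Path(log_file).write_text(json.dumps(records, indent=2))
    return records

# Hypothetical usage with the kinds of sources the project combines:
# log_provenance(["survey_responses.csv", "common_core.csv", "civil_rights.csv"])
```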

Works Cited:
Jacoby, William G. (2017). Should Journals Be Responsible for Reproducibility? Inside Higher Ed. Retrieved from www.insidehighered.com/blogs/rethinking-research/should-journals-be-responsible-reproducibility
National Institutes of Health (2018). Rigor and Reproducibility. Retrieved from https://www.nih.gov/research-training/rigor-reproducibility


Big Data's Front-Ended Ethical Considerations Ignore How Results Can Stigmatize Identifiable Groups: Examining Big Wastewater Data in New Zealand

Final candidate for the monograph

Professor Martin Tolich (University of Otago) - Presenting Author

This article reviews the three ethical considerations the UK Data Service commends to Big Data researchers through the lens of a case study in which researchers collected and analysed wastewater to measure the prevalence of illicit drug use in large and small population centres in New Zealand.

First, such data, often administrative data, are not collected for a specific research purpose. Second, and consequently, researchers cannot logistically establish relationships between themselves and any individual subject, annulling individual consent processes. Third, moving beyond potential privacy issues and the reliability of de-identified data, the UK Data Service cautions that the presentation of data based on "group membership" can lead to a denial of a public service for certain groups. Not made explicit in the UK Data Service's considerations is whether these data sets are a random sample of the entire population or are comprised of subgroups of the whole population.

The goal of this article is to move big data's ethical considerations away from their front-end obsession with privacy, informed consent, and de-identification of data and towards research outputs: how results can potentially harm or stigmatize socio-economic or ethnic categories that are overrepresented in some administrative data sets. The wastewater data set investigated in this article features (at times) skewed sampling when investigating ethnic sub-populations. While the researcher's data collection is unproblematic in large conurbations, they self-disclose in radio interview transcripts that they had thought little about ethical considerations when collecting data in small, ethnically homogeneous New Zealand towns. Such research has the potential to malign and label ethnic enclaves as high drug use areas. This finding is important in New Zealand, where cultural research policies for research involving indigenous groups deem consultation with vulnerable groups, especially Maori, essential. Absent from the wastewater researcher's ethical considerations was any intention to consult with town officials or Maori prior to the data collection, even though the results could stigmatise both. The article documents how the wastewater researcher responded to an official complaint made by the author to their IRB and to the New Zealand Health Research Council's Ethics Committee.

The article ends by commending the UK Data Service's balance between front-end ethical considerations and the back-end presentation of data, but draws attention to the problems that arise when data sets drawn from non-randomised groups can harm certain groups.


Enriching an Ongoing Panel Survey With Mobile Phone Measures: The IAB-SMART App

Mr Sebastian Bähr (Institute for Employment Research)
Mr Georg-Christoph Haas (Institute for Employment Research, University of Mannheim) - Presenting Author
Professor Florian Keusch (University of Mannheim)
Professor Frauke Kreuter (Institute for Employment Research, University of Mannheim, University of Maryland)
Professor Mark Trappmann (Institute for Employment Research, University of Bamberg)

The PASS panel survey is a major data source for labor market and poverty research in Germany, with annual interviews since 2007. In January 2018, the supplemental IAB-SMART study was launched, in which selected respondents were asked to install a study app on their smartphones.

The IAB-SMART app combines short questionnaires that can be triggered by geographic location with passive data collection on a variety of measures (e.g. geographic location, app use). The triggering of questions allows us to enrich yearly retrospective data with data collected immediately after a certain event (e.g. a placement officer visit). Passive data collection allows innovative measures, e.g. of social capital, that complement traditional survey measures. Furthermore, the additional smartphone measures create the potential to address new research questions related to the labor market and technology use (digital stress, home office performance). Finally, the study provides new insights into the daily structure and coping behavior of unemployed persons, thus replicating aspects of the classic Marienthal case study with modern means.
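By way of illustration, the sketch below shows the basic logic of a location-triggered questionnaire: fire a short survey when the phone's reported position enters a geofence around a location of interest. This is our sketch, not the IAB-SMART implementation (which the abstract does not detail); the coordinates and radius are hypothetical.

```python
# Geofence-trigger sketch (illustrative; not the IAB-SMART app's code).
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

GEOFENCE = (49.4875, 8.4660)  # hypothetical coordinates of, e.g., an employment agency
RADIUS_M = 150                # hypothetical trigger radius

def maybe_trigger_survey(lat, lon):
    """Push an event-related short questionnaire when inside the geofence."""
    if haversine_m(lat, lon, *GEOFENCE) <= RADIUS_M:
        print("Geofence entered: trigger short questionnaire.")

maybe_trigger_survey(49.4877, 8.4658)  # a position inside the fence -> triggers
```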

In this presentation, we will provide an overview of the study and share our experiences in conducting an app project. We will focus on data protection issues, the implementation of the fieldwork, participation in the study, and participation in the short surveys.


Augmenting Survey Data With Big Data: Is There a Threat to Panel Retention?

Professor Mark Trappmann (IAB, University of Bamberg) - Presenting Author
Dr Sebastian Bähr (IAB)
Mr Georg Haas (IAB)
Professor Florian Keusch (University of Mannheim)
Professor Frauke Kreuter (IAB, University of Mannheim, University of Maryland)

Hardly any researcher doubts that the arrival of so-called Big Data will have a huge impact on research infrastructure and methodology in the social sciences. While some researchers argue that it will one day replace surveys entirely, a majority sees high potential in the integration of data sources and thus predicts that surveys augmented with Big Data will be a major data source in the future of the discipline.

In most cases this linkage, if performed at the individual level, requires some form of informed consent. In a cross-sectional setting, the worst outcome is that the time required for asking the question is wasted, especially if consent is requested at the end of the questionnaire. For longitudinal surveys with repeated measurement, however, requests to collect and link additional data can carry additional costs related to panel retention. Some panel members might disapprove of the extra burden or feel that a line has been crossed by a request for access to potentially highly sensitive data, such as that produced by their online activities.

The German panel study PASS is designed to collect data for research into (un)employment, poverty and social security. PASS is a mixed-mode CATI and CAPI panel survey of the general population that finished its 11th annual wave of data collection in 2017.

In January 2018, we invited randomly selected panelists (n=4,293) who had reported in the previous wave that they owned a smartphone to participate in the IAB-SMART study. The panelists were asked to install an app on their smartphones and to give access to passive smartphone data such as sensor data, GPS position, and app usage data for half a year; they were also asked to answer short questionnaires regularly during that time.

The random selection of invitees allows us to evaluate the unintended effect that the request to participate in a smartphone study collecting potentially sensitive information has on panel retention in the subsequent wave. We will report results on response rates as well as on the composition of respondents with respect to target variables of the survey.
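To make the planned comparison concrete: with invitees selected at random, retention in the subsequent wave can be compared between the invited group and an equivalent non-invited group, for example with a two-proportion z-test. The sketch below is ours, with invented retention counts; only the invited-group size (n=4,293) comes from the abstract, and the control-group size is assumed equal for illustration.

```python
# Hypothetical retention comparison (all counts invented for illustration).
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-proportion z-test: z-statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return p1, p2, (p1 - p2) / se

# invited to IAB-SMART vs. an assumed equal-sized non-invited control group
p_inv, p_ctl, z = two_prop_ztest(3400, 4293, 3550, 4293)
print(f"retention: invited {p_inv:.1%} vs control {p_ctl:.1%}, z = {z:.2f}")
```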