BigSurv20 program



Big brother, big data or both? Protecting privacy in the era of big data

Moderator: Don Jang (
Friday 6th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Ethical issues in the use of big data for social research

Mr Michael Weinhardt (Technische Universität Berlin) - Presenting Author

Questions of privacy, data protection and ethics have been discussed for decades in the area of behavioral research, psychology and epidemiology, and solutions have been put in place to address these issues in practice. This is somewhat different in the area of Big Data (BD) in the social sciences, where the new availability of vast amounts of data has come faster than ethical and legal standards could develop regarding the use of this data. While there is an emerging body of literature on the ethics of BD in general, literature specifically dealing with the research ethics of BD is still scarce. This is worrying, as the proliferation of personal and impersonal data clearly poses new challenges to the ethics of research. Consequently, those who collect data and those who use it face new moral dilemmas and must address the challenges to traditional assumptions about privacy and autonomy that emerge in applications of BD. This presentation therefore asks which ethical and legal aspects need to be considered when collecting and analyzing data on individuals from the web and combining them to gain an enriched picture of human activities. It will outline some of the challenges that come with the use of BD in social research. For example, it will argue that researchers who use BD collected by third parties have a moral obligation to check whether the data has been collected ethically and lawfully in the first place. If this is not the case, or if the proper source of such data cannot be determined, it should not be used for research purposes.
When considering data sources, it is also important to establish whether they are strictly private or not. For data that is already public, the question of informed consent becomes irrelevant. It is therefore essential to be clear about what can be safely assumed to be public information. For example, tweets on Twitter, a microblogging platform, are meant to be public and accessible by anyone and therefore may be used for research purposes. While one needs to have an account to post something, everyone can read what is tweeted even without having an account. However, in general, whenever some information is only accessible after logging on to a certain platform or service, it may no longer be considered public. The matter quickly becomes complex, as the level of publicness of a datum can vary widely and change quickly even within the same social network. As the example makes clear, it becomes essential for researchers to clarify which data is publicly accessible and whether the sources of data may be deemed proper. However, this stance necessitates a clear understanding of what constitutes the ‘public’ realm. While existing concepts such as the ‘public domain’ may provide guidance, they originate in other contexts, such as copyright legislation. Hence, there is a pressing need for further clarification of the question of what constitutes public data and what does not. The presentation will illustrate this issue with real-world examples from social networking sites and other sources.

The paradox of data sharing and data privacy in the social sciences

Dr Edward Freeland (Princeton University) - Presenting Author

Over the past few years, calls for greater transparency and accountability in the natural and social sciences have intensified. This push has been driven in part by controversies among scientists over the integrity of their research methods, conflicts of interest, the inability to replicate the findings of well-known studies, and by the failure to share or publish experiments that do not turn out as expected. But it is also driven by a desire to protect scientists from attacks by non-scientists defending certain economic interests or political agendas. The emergence of the Open Science movement coincides with a vast increase in the volume and accessibility of data on individuals and the growing power of big data analytics. These trends raise significant concerns about how to continue protecting the privacy of people who participate in social science research. This paradox is on full display in the recent decision by the US Environmental Protection Agency to prohibit evidentiary uses of scientific studies without full disclosure of identifying information on all participants to allow for verification of the results. The paper includes an overview of how the data sharing/data privacy paradox emerged, a look at trends in public opinion on science, and a discussion of ways to enhance the legitimacy of scientific research while protecting the confidentiality of people who participate in research studies.

The challenges of legal analysis, between text mining and machine learning

Professor Maria Francesca Romano (Scuola Superiore Sant'Anna) - Presenting Author
Professor Giovanni Comandè (Scuola Superiore Sant'Anna)
Dr Pasquale Pavone (Scuola Superiore Sant'Anna)
Professor Denise Amram (Scuola Superiore Sant'Anna)

Judgements are a potential new source of textual data, and researchers can use them for many different types of analysis. Working in this research field, we realized that a more precise analysis is possible through the interplay of different kinds of expertise: for instance, by developing context-based automatic queries that distinguish between text found in the description of facts and text found in the motivations of judgements. In this dimension, the paper will explore the pros and cons of using statistical techniques for data extraction and annotation, comparing them to the possibilities offered by machine learning tools.
In addition, the significant amount of personal (often sensitive) data contained in rulings creates a series of technical and legal challenges linked to data protection. In this perspective, the ethical-legal profile enriches our interdisciplinary dialogue. Machine learning models must be built following the principles of privacy-by-design and privacy-by-default. In addition, when dealing with research, the proper technical and organizational measures provided for by art. 89 EU Reg. 2016/679 and related national implementations must be complied with. And yet, this is not sufficient; the application of Machine Learning techniques to case-law must be ethically acceptable and in line with shared values. The analysis of judicial decisions to identify patterns of harmonization of practices faces problems when it encounters statistically anomalous results, because it cannot simply classify them as biases. In fact, legal systems evolve thanks to the technical discretion of judges in interpreting legal rules and departing from prevalent interpretations: what normally is a spurious correspondence can be the spark of innovation and change in case law.
To disentangle these issues, our paper will start by analyzing the state of the art in annotation and extraction techniques applied to judicial texts and the corresponding applicable ethical and legal framework in a comparative perspective. In turn, we will illustrate the necessary technical and ethical-legal specifications for developing an approach to Machine Learning techniques for the textual analysis of case-law.

Comandé G. (2017), Regulating Algorithms’ Regulation? First Ethico-Legal Principles, Problems, and Opportunities of Algorithms in (T. Cerquitelli, D. Quercia, F. Pasquale eds) “Towards glass-box data mining for Big and Small Data”, 169 ff
Comandé G. (2019), Multilayered (Accountable) Liability for Artificial Intelligence, in “Liability for Artificial Intelligence and the Internet of Things”, Lohsse S., Schulze R., Staudenmayer D. (eds), 165-187, ISBN: 978-3-8487-5293-5, Hart Publishing Nomos
Romano MF, Rey G, Baldassarini A, Pavone P (2018) Text Mining per l’analisi qualitativa e quantitativa dei dati amministrativi utilizzati dalla Pubblica Amministrazione, in: DF Iezzi, L Celardo, M Misuraca Proceedings of the 14th International Conference on Statistical Analysis of Textual Data (JADT18), Roma, Universitalia ISBN: 978-88-3293-137-2, pp 547-8.
Romano MF, Baldassarini A, Pavone P (2020), Text Mining of Public Administration documents: preliminary results on judgements, in: ADVANCES AND CHALLENGES IN TEXT ANALYTICS, Iezzi D.F., D. Mayaffre, M. Misuraca (eds), Springer.

Measuring privacy and accuracy concerns for 2020 census data dissemination

Mrs Jennifer Childs (U.S. Census Bureau) - Presenting Author
Dr Casey Eggleston (U.S. Census Bureau)
Dr Aleia Clark Fobia (U.S. Census Bureau)

The increasing availability of auxiliary data and advancements in computing power threaten the credibility of claims made by survey organizations that respondent data will not be released in a way that could identify individuals. As several high-profile incidents have demonstrated, hacking and data breaches are not the only threat to the protection of respondent data. Intruders can use sophisticated algorithms to match anonymized datasets with other publicly available information in order to identify individuals. Acknowledging new threats to privacy, the Census Bureau is exploring new approaches to protecting respondent data that go beyond traditional methods of disclosure avoidance. New approaches, such as differential privacy, have the potential to respond with more flexibility to technological advancements or the sensitivity of particular data, but also bring new challenges such as a need to specify the degree of “acceptable” privacy loss. Thus, the challenge arises of deciding the level of trade-off between the protection of individual privacy and the benefit of making accurate data available to the public.
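
The trade-off between privacy loss and accuracy can be made concrete with a small sketch. The Python example below uses the textbook Laplace mechanism for a single count query; it is an illustration under stated assumptions, not the Census Bureau's production method. The function names, the example count of 1,000, and the epsilon values are all hypothetical choices made for this illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon.

    Assumption: adding or removing one person changes the count by at
    most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=5000, seed=0):
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Hypothetical block-level count and illustrative privacy-loss budgets.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(1000, eps):.2f}")
```

A smaller epsilon means a smaller privacy loss but a noisier published count: in this sketch the expected absolute error is 1 / epsilon, so moving from epsilon = 1.0 to epsilon = 0.1 makes the released count roughly ten times less accurate. Choosing epsilon is therefore a policy decision about "acceptable" privacy loss and not a purely technical one.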

This research focuses on understanding how respondents think about their information, its sensitivity, and concerns about privacy loss. In preparation for its planned implementation of differential privacy for 2020 Census data releases, the Census Bureau has spent several years thinking about ways to measure and quantify respondent privacy concerns for different kinds of personal information. Key topics of this research effort have included measurement of privacy concerns for decennial census data items and non-decennial items (to provide a basis for comparison), communicating with respondents about risks to privacy, and assessing respondent preferences with regard to the inevitable trade-offs between information privacy and data accuracy.

This paper reports results from a survey intended to measure individuals’ privacy risk tolerance with the goal of informing the privacy loss budgets allowed in mathematical privacy models for 2020 Census data releases. Responses to the questionnaire items provide information about respondent concerns surrounding each of the different pieces of individual and household-level information collected on the 2020 Census and included in statistical summaries released by the Census Bureau. Survey data was collected from 10,000 respondents drawn from a nationally representative sample of households.