BigSurv18 program







Big Data for Official Statistics I: Big Data Use is Officially a Big Deal for Government Statistics

Chair: Dr Lars Lyberg (Inizio)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.105

Big Data Initiatives in Official Statistics

Final candidate for the monograph

Dr Lilli Japec (Statistics Sweden)
Dr Lars Lyberg (Inizio) - Presenting Author

Statistics production in official statistics and other contexts faces considerable challenges. Many are the result of what we might describe as “a changing survey landscape”. Costs of traditional data collection are increasing, and users want more timely and richer data. Alternative inferential paradigms, such as calibrated Bayes, and more extensive use of nonprobability sampling are increasingly promoted in some research groups. Thus, these challenges and the abundance of big data have made the need to modernize statistics production obvious. Producers of official statistics, such as national statistical institutes, cannot ignore this development if they want to remain relevant, as discussed by, e.g., Pfeffermann (2015), Landefeld (2014), Hulliger et al. (2012), Kitchin (2015), Baker et al. (2013), and Little (2015).

In this paper we review various initiatives from producers of official statistics, including national statistical institutes, both centralized and decentralized, and organizations such as the UN, the OECD, and Eurostat. We discuss the growing number of alternative big data sources and how they can be combined to enhance survey data or used directly for statistics production. There are many issues associated with this development, including:

• the lack of a common strategy to cope with problems associated with a changing survey landscape
• the potential of big data to support statistics production
• actual attempts at using big data in statistics production areas such as prices, traffic, health, agriculture, consumer behavior and confidence, daytime population, geographic knowledge, tourism, travel demands, population density and mobility, opinions, safety, investments, sanitation, housing, life satisfaction, and employment
• the relative importance of various big data sources
• attempts at developing big data quality frameworks and quality assessment models
• big data and the European General Data Protection Regulation (GDPR)
• general issues related to big data privacy and confidentiality.

The paper describes how these and other issues have been handled by the main producers of official statistics, and discusses future landscape scenarios.

Baker, R., et al. (2013). Nonprobability Sampling. Report of the AAPOR Task Force on Non-probability Sampling.
Hulliger, B., et al. (2012). Analysis of the Future Research Needs of Official Statistics.
Kitchin, R. (2015). The Opportunities, Challenges and Risks of Big Data for Official Statistics.
Landefeld, S. (2014). Uses of Big Data for Official Statistics.
Little, R. (2015). Calibrated Bayes, an Inferential Paradigm for Official Statistics in the Era of Big Data.
Pfeffermann, D. (2015). Methodological Issues and Challenges in the Production of Official Statistics.


A Framework for Big Data in Official Statistics

Final candidate for the monograph

Dr Kees Zeelenberg (Statistics Netherlands)
Dr Sofie De Broe (Statistics Netherlands) - Presenting Author

In this paper, we describe and discuss opportunities for big data in official statistics. Big data come in high volume, high velocity, and high variety. Their high volume may lead to better accuracy and more detail, their high velocity may lead to more frequent and more timely statistical estimates, and their high variety may give opportunities for statistics in new areas. But there are also many challenges: uncontrolled changes in sources threaten continuity and comparability, and data may refer only indirectly to phenomena of statistical interest. Furthermore, big data may be highly volatile and selective: the coverage of the population to which they refer may change from day to day, leading to inexplicable jumps in time series. And very often, these big data sets cannot be linked to other datasets or to population frames, which severely limits the possibilities for correcting selectivity and volatility. Also, with the advance of big data and open data, there is much more scope for disclosure of individual data, and this poses new problems for statistical institutes.

The use of such sources in official statistics requires approaches other than the traditional one based on surveys and censuses. In this paper we develop a framework for the use of big data in official statistics.

A first approach is to accept the big data just for what they are: an imperfect, yet very timely, indicator of developments in society. In short, we might argue: these data exist and that’s why they are interesting.

A second approach is to use formal models and extract information from these data. In recent years, many new methods for dealing with big data have been developed by mathematical and applied statisticians. National statistical institutes have always been reluctant to use models, apart from specific cases like small-area estimates. Based on experience at Statistics Netherlands, we argue that NSIs should not be afraid to use models, provided that their use is documented and made transparent to users. On the other hand, in official statistics, models should not be used for all kinds of purposes.
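As an illustration of the kind of model NSIs already accept, small-area estimation typically rests on an area-level composite estimator of the Fay-Herriot type. The formulation below is the standard textbook one, not one taken from this paper:

\[
\hat{\theta}_i = \gamma_i \hat{y}_i + (1 - \gamma_i)\, x_i^\top \hat{\beta},
\qquad
\gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + D_i},
\]

where \(\hat{y}_i\) is the direct survey estimate for area \(i\) with sampling variance \(D_i\), \(x_i^\top \hat{\beta}\) is a regression prediction from auxiliary sources (registers or, potentially, big data), and \(\sigma_v^2\) is the between-area model variance. The estimator shrinks the noisy direct estimate toward the model prediction in proportion to how unreliable the direct estimate is, which is why transparent documentation of the model, as argued above, matters.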


Mining the New Oil for Official Statistics

Final candidate for the monograph

Dr Siu-Ming Tam (Australian Bureau of Statistics) - Presenting Author
Dr Jae-Kwang Kim (Iowa State University)

It has often been said that official statistics are entering a golden age, in which decisions are increasingly based on evidence and there is a plethora of data sources, primarily from opportunities associated with the Internet of Things, for producing official statistics. Data is the new oil, and official statisticians are challenged to find new and more effective ways of mining it.

However, official statisticians have been using administrative sources to produce official statistics for decades; e.g., birth, death, and migration records are used to update census counts to derive the latest estimates of population and population distribution. So what is new? An important characteristic of such sources is that, invariably, their data are used without much need for transformation or statistical adjustment. The use of censuses and surveys, or direct substitution using data from administrative sources, in the production of official statistics can, for the purposes of this paper, be described as Official Statistics 1.0.
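For concreteness, the population update mentioned above follows the standard demographic balancing equation, a textbook identity rather than anything specific to this paper:

\[
P_t = P_{t-1} + B_t - D_t + M_t,
\]

where \(P_{t-1}\) is the most recent census-based population count and \(B_t\), \(D_t\), and \(M_t\) are the births, deaths, and net migration recorded by administrative sources since then. The administrative data enter the estimate essentially as-is, which is the hallmark of Official Statistics 1.0.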

Many of the newer data sources, e.g. Big Data, however, do not generally provide data that are suitable for direct substitution in producing official statistics. For example, satellite imagery data, which are basically wavelengths, whilst carrying information on land use, crop classification and crop yields, cannot be substituted for the same variables typically collected in agricultural censuses and surveys.

Another new data source is information networks, which contain information from self-selected participants, e.g. in social media, and also in what we describe as “cooperatives”. An example of the latter is the Australian DairyBase, which contains dairy farm performance information based on farm management data contributed by the farms themselves. Again, information from networks cannot be used directly for substitution, as it generally suffers from self-selection and coverage bias, and is likely to have measurement definitions that differ from those adopted by official statisticians.

Applying transformations or statistical methodologies to data from these new sources for producing official statistics can be described as Official Statistics 2.0.

In this paper, we will outline methodologies for mining cooperatives' data in such a way that the resultant data are fit for official statistics production.

Assuming that we have (1) access to identified data from cooperatives, and (2) a random sample that can be used to calibrate the cooperatives' dataset, we describe methods to (1) adjust for statistical bias from self-selection and under-coverage, (2) adjust for measurement “errors”, and (3) harness information for official statistics production that exists in the cooperatives' dataset but is not normally collected in censuses or surveys.
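A minimal sketch of how the self-selection and under-coverage adjustment in (1) can work, assuming a propensity-based pseudo-weighting approach in the spirit of the nonprobability-sampling literature rather than the authors' own method; the dataset and variable names are hypothetical:

```python
# Propensity-based pseudo-weighting of a self-selected "cooperative" dataset,
# calibrated against a probability reference sample. A sketch of a standard
# adjustment, not the authors' exact method; data and names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pseudo_weights(coop, ref, covariates, ref_weight_col):
    """Return pseudo-design weights for the non-probability (cooperative) records.

    coop: DataFrame of self-selected records (e.g., dairy farm performance data).
    ref:  DataFrame for a probability reference sample, with design weights.
    covariates: numeric variables measured consistently in both sources.
    """
    stacked = pd.concat([coop[covariates], ref[covariates]], ignore_index=True)
    in_coop = np.r_[np.ones(len(coop)), np.zeros(len(ref))]
    # Weight the reference cases by their design weights so that they stand
    # in for the whole target population in the pooled fit.
    w = np.r_[np.ones(len(coop)), ref[ref_weight_col].to_numpy()]
    model = LogisticRegression(max_iter=1000).fit(stacked, in_coop, sample_weight=w)
    p = model.predict_proba(coop[covariates])[:, 1]
    # Inverse odds of selection into the cooperative data.
    return (1.0 - p) / p

# Hypothetical usage: a bias-adjusted population mean for a variable held
# only in the cooperative data.
# wts = pseudo_weights(dairy, ref_sample, ["herd_size", "milk_volume"], "design_wt")
# est = np.average(dairy["feed_cost"].to_numpy(), weights=wts)
```

The same weights can then carry variables that exist only in the cooperatives' dataset, point (3) above, into population-level estimates.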



Enhancing U.S. Federal Statistics by Combining Multiple Data Sources

Dr Brian Harris-Kojetin (National Academies of Sciences, Engineering, and Medicine) - Presenting Author
Dr Robert Groves (Georgetown University)


Official statistics provide vital indicators of the well-being of the population and the economy. Large-scale probability sample surveys have long been the foundation for producing many national statistics, but the costs of conducting such surveys have been increasing while response rates have been declining, and many surveys are not keeping up with growing demands for more timely and detailed local information. The Committee on National Statistics at the National Academies of Sciences, Engineering, and Medicine convened a committee of experts in social science research, sociology, survey methodology, economics, statistics, privacy, public policy, and computer science to explore a possible shift in federal statistical programs—from the current approach of providing users with the output from a single census, survey, or administrative records source, to an approach combining data sources to give users richer and more reliable datasets. The panel conducted a two-year study and produced two reports with conclusions and recommendations for federal statistical agencies. This presentation will provide an overview of the major conclusions and recommendations from these reports.

The first report, Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy, reviews the current approach to producing federal statistics and examines other sources, such as government administrative data and private-sector data, including Internet and other big data sources, that could also be used for federal statistics. It also discusses the key characteristics of the environment needed for utilizing multiple data sources, including approaches for protecting privacy and preserving confidentiality. The second report, Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps, examines in more depth the infrastructure, methods, and skills necessary to implement a multiple-data-sources paradigm for official statistics.

Together, the reports provide an overview of statistical methods that have been used for combining information and outline the research needed to develop methods for combining data sources. Combining data from multiple sources requires a new or modified quality framework, as administrative and private-sector data have their own unique challenges and errors. The reports review frameworks that go beyond the total survey error framework and discuss important dimensions of quality for nonsurvey data and for estimates from combined sources. Moving to an environment in which multiple datasets are combined can also present new threats to privacy. Federal statistical agencies are subject to a number of confidentiality laws, but additional legal and policy issues may arise when linking records from different data sources. Because linked datasets have the potential to create greater privacy threats than single datasets, the panel recommends that federal statistical agencies develop and implement strategies to safeguard privacy while increasing accessibility to linked datasets for statistical purposes. The panel also recommends the creation of a new entity that would provide a secure environment for analysis of data from multiple sources, coordinate the acquisition of data, identify and facilitate research on challenges common across statistical agencies, and permit data access only for statistical purposes.