BigSurv18 program
Big Data for Official Statistics II: Administrative and Big Data Use for Survey Design and Estimation
Chair: Dr Lilli Japec (Statistics Sweden)
Time: Friday 26th October, 16:00 - 17:30
Room: 40.105
Web Scraping Meets Survey Design: Combining Forces
Dr Olav ten Bosch (Statistics Netherlands) - Presenting Author
Mr Dick Windmeijer (Statistics Netherlands)
Dr Arnout van Delden (Statistics Netherlands)
Dr Guido van den Heuvel (Statistics Netherlands)
Web scraping – the automatic collection of data on the Internet – is as old as the Internet itself. Search engines rely on it to build up an index of the web. Web scraping has been used increasingly by national statistical institutes (NSIs) around the world to improve official statistics. It can be applied to reduce the response burden, speed up the production of statistics, derive new indicators, explore background variables, characterise (sub)populations, and more. Statistics Netherlands started using the web as an additional data source about a decade ago. These days, web scraping is used in the production of price statistics. In other domains, such as the housing market, it has proven to be a valuable tool for studying the dynamics of a phenomenon before designing a new, costly statistical production chain. In yet other cases, such as economic and business statistics, we use web scraping to supplement our administrative sources and metadata systems.
A necessary precondition to the use of web scraping in official statistics is mastering the technical and legal aspects. We have found these to be crucial, but also manageable. With respect to the technology of scraping, we have concluded that if you can see it in your browser, you can grab it, and sometimes there is even more data hidden in a page. As for the legal aspects, these do not pose any problem as long as we, being a government institution, only use the scraped data for statistics and inform data owners as needed.
The main challenge in using web-scraped data for official statistics is methodological. Where survey variables are designed by an NSI and administrative sources are generally well defined and well structured, data extracted from the web are neither under NSI control nor well defined or well structured. Therefore, using web data on their own is often problematic. A promising approach, however, is to combine web data with the huge amount of data we, as a statistical office, already hold in our databases. In this way, we combine high-quality data from surveys or administrative sources with web data that are more volatile, usually unstructured and poorly defined, but in many cases also richer and more frequently updated. The latter two characteristics make web data particularly attractive to invest in. An example is recent work in which we started from the general business register, one of the backbones of any statistical office, and searched the web for relevant pages that allow us to complete and extend our knowledge of enterprises and their activities. This example combines focused web scraping and machine learning to determine the validity of the information we find.
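As a rough, hypothetical illustration of the kind of pipeline described above (not the authors' actual implementation), the Python sketch below starts from an invented business-register record, fetches the candidate website, and applies a simple keyword check in place of the machine-learning validity model mentioned in the abstract. The register fields, URL, activity codes, and keywords are all assumptions made for the example.

```python
# Minimal sketch: enrich a business-register record with focused web scraping.
# Hypothetical example only; the register fields, URL, and keyword check stand
# in for the actual register data and the ML validity model described above.
import requests
from bs4 import BeautifulSoup

register_record = {
    "enterprise_id": "NL0000001",          # hypothetical register key
    "name": "Example Bakery B.V.",
    "website": "https://www.example.com",  # candidate URL found for this unit
    "activity_code": "1071",               # NACE-style code: manufacture of bread
}

ACTIVITY_KEYWORDS = {"1071": ["bakery", "bread", "pastry"]}

def scrape_visible_text(url: str) -> str:
    """Fetch a page and return its visible text ("if you can see it, you can grab it")."""
    response = requests.get(url, timeout=10, headers={"User-Agent": "nsi-research-bot"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ", strip=True).lower()

def looks_consistent(record: dict, page_text: str) -> bool:
    """Crude stand-in for the ML step: does the site mention the registered activity?"""
    keywords = ACTIVITY_KEYWORDS.get(record["activity_code"], [])
    return any(keyword in page_text for keyword in keywords)

if __name__ == "__main__":
    text = scrape_visible_text(register_record["website"])
    print(register_record["enterprise_id"], "consistent:", looks_consistent(register_record, text))
```

In practice the keyword check would be replaced by a trained classifier and the candidate URLs by the output of a focused search, but the overall flow (register record in, validated enrichment out) is the same.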
In this paper, we reflect on the increasing use of web scraping in official statistics globally and report on our own experiences over the past few years and the lessons we have learned. We identify the various approaches, their successes and their challenges, and we philosophise about how to successfully combine proven survey methodology with web scraping.
Making Administrative Records Key to Operational Agility for the American Community Survey
Dr Jennifer Ortman (U.S. Census Bureau) - Presenting Author
Dr Victoria Velkoff (U.S. Census Bureau)
The changing landscape of America’s communities yields new and complex challenges for survey and census takers. This presentation details the impetus behind recent U.S. Census Bureau research initiatives to evaluate the use of administrative records to replace or supplement content on the American Community Survey. The Census Bureau has made significant progress exploring the use of administrative records in household surveys and the census to meet these challenges and continue to meet the data needs of an increasingly complex nation. Incorporating administrative records into our data gathering and analysis efforts will have a palpable impact on respondents by reducing the amount of information we request from them. Administrative records may also increase data reliability and provide cost savings by reducing the need for follow-up visits. While there is great potential for the role of administrative records in the future of data collection and processing, there are also great challenges to using these data (e.g., issues with matching rates, geographic coverage, and leveraging data designed for different uses). We examine these obstacles and the Census Bureau’s efforts to overcome them. To conclude, we discuss public perceptions and attitudes about linking administrative sources with survey and census data.
Linkage of the Australian Census to Three Administrative Databases With an Application to Understanding Income Inequality in Australia
Dr Nicholas Biddle (Australian National University) - Presenting Author
Professor Robert Breunig (Australian National University)
Australia is unusual in how it undertakes and uses its population census. First, the Australian Census occurs every five years rather than every ten. Second, the Australian Census is a compulsory long form for all respondents, with questions on demography, ethnicity, economic circumstances, disability, caring responsibilities, volunteer work, and education outcomes and participation. Third, Australia has a number of hard-to-survey populations of high policy interest that are often missing from sample surveys. Fourth, there has been little use of data linkage in the collection and validation of Census data. In many ways, the Australian Census is used and conceptualised as a very large sample survey.
Recently, the Australian Bureau of Statistics has been leading the Multi-Agency Data Integration Project (MADIP). This project links data from the Census to administrative data from three sources - an individual taxpayer database, an income support database, and a government-funded health services database (http://www.abs.gov.au/websitedbs/D3310114.nsf/home/Statistical+Data+Integration+-+MADIP). This unique database has only just been made available (in a de-identified format) to external academic researchers. This paper provides a first analysis of the linked dataset, with the following substantive and methodological focuses:
- The background to MADIP and the four component datasets (the Census, and the three administrative datasets);
- The linkage process and the quality of the linked data;
- The data quality of the linked data;
- What can the data tell us about the spatial distribution of income inequality in Australia, taking into account taxable income and income-support status;
- Are there differences by race and ethnicity in taxable income, controlling for observable human capital characteristics; and
- The implication for linkage of large survey datasets to administrative databases, including the policy recommendations that might result.
Use of Alternative Data Sources at Statistics Canada: A Case Study With GPS Data
Mr Francois Brisebois (Statistics Canada) - Presenting Author
The proliferation of new data sources of all sizes and kinds has provided new opportunities to gain insights about the economy and society in general. For many national statistical institutes, traditional survey taking has long been a safe avenue to produce those insights, but in today’s rapidly changing reality, the availability of alternative data sources presents golden opportunities to enrich the statistical information available and thereby remain responsive to the needs of users.
Statistics Canada has innovated over the years by incorporating new data sources, though mostly administrative sources whose structure and volume were not as challenging as what we now face with the advent of big data. In more recent years, several statistical programs have explored various alternative data sources in order to improve, supplement, or replace survey data. The presentation will feature some examples of these data integrations; one recent feasibility study involving the use of GPS data to improve statistics for the trucking industry will be covered in more detail, with the purpose of illustrating how new data sources can help solve more traditional methodological challenges, as well as offer new opportunities to enrich the inventory of data offered by a statistical program.
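As a hypothetical illustration of how raw GPS pings could feed a statistic of interest for the trucking industry (not Statistics Canada's actual methodology), the sketch below aggregates GPS points into per-truck distance travelled using the haversine formula. The data layout, field names, and sample values are invented for the example.

```python
# Hypothetical sketch: turn raw GPS pings into per-truck distance estimates.
# The record layout and sample coordinates are illustrative only.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def distance_per_truck(pings):
    """pings: iterable of (truck_id, timestamp, lat, lon), assumed sorted by time."""
    totals = {}
    last_point = {}
    for truck_id, _ts, lat, lon in pings:
        if truck_id in last_point:
            prev_lat, prev_lon = last_point[truck_id]
            totals[truck_id] = totals.get(truck_id, 0.0) + haversine_km(prev_lat, prev_lon, lat, lon)
        last_point[truck_id] = (lat, lon)
    return totals

# Toy example: two pings for one truck, roughly Ottawa to Montreal.
sample = [
    ("truck-1", "2018-10-26T08:00", 45.4215, -75.6972),
    ("truck-1", "2018-10-26T10:30", 45.5017, -73.5673),
]
print(distance_per_truck(sample))  # about 166 km as the crow flies
```

A production pipeline would add cleaning steps (duplicate pings, GPS drift, off-road segments) and link the trucks back to frame units before any estimation, but the core aggregation is of this form.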
The Usability of Government Open Data for Social Research - Estonian Case
Mr Laur Lilleoja (Tallinn University) - Presenting Author
With the early adoption of e-Governance, digital tax declarations, digital ID, Internet voting, and the recent implementation of a unique e-Residency program (https://e-estonia.com), Estonia has been considered one of the most advanced digital societies in the world. It is therefore not surprising that, like several other countries, Estonia introduced its Open Data Portal in 2014. The portal is primarily intended to serve as a platform through which public bodies disseminate data and open data users search for and retrieve such datasets. At a more general level, the project aims to stimulate the economy, increase transparency, facilitate the creation and management of open services for the private and community sectors, and encourage migration to future technologies such as Linked Data, Big Data, and the Internet of Things.
It is clear that this initiative holds significant potential for both business and academic research, but it is not clear how well this potential is currently being realised.
This presentation will give an overview of the current stage of the Estonian Open Data project. The first part covers the general structure and usability of the data portal. The second part evaluates the scientific value of the project by analysing the actual usage of the available datasets and by comparing the quality of data from the Open Data Portal with secondary data sources, such as social surveys.