BigSurv20 program
The integration of machine learning into official statistics
Moderator: Claude Julien (cjulien1234@outlook.com)
Friday 27th November, 11:45 - 13:15 (ET, GMT-5); 8:45 - 10:15 (PT, GMT-8); 17:45 - 19:15 (CET, GMT+1)
The integration of machine learning into official statistics
Dr Wesley Yung (Statistics Canada) - Presenting Author
Professor Hugh Chipman (Acadia University)
Dr Siu-Ming Tam (Australian Bureau of Statistics)
Interest in the use of Machine Learning (ML) for Official Statistics is growing rapidly, for several reasons: the need to process secondary data sources such as Big Data, the need to remain relevant in a landscape where many organizations publish statistics, and the opportunity to improve the timeliness of Official Statistics products. In response to this interest, the High-Level Group for the Modernisation of Official Statistics of the United Nations Economic Commission for Europe has established the Machine Learning Project, which is investigating how best to integrate ML into Official Statistics. The objective of the project is to produce better official statistics through the responsible integration of demonstrated ML solutions. To demonstrate the added value of ML, the project is conducting several pilot studies in coding, edit and imputation, and the use of imagery. To support the sound use of ML, the project is proposing a framework to assure and evaluate the quality of ML solutions in the context of Official Statistics. To integrate ML effectively and efficiently into the process flow, the project is gathering and summarizing good practices across statistical organisations. The knowledge, results, ML code and good practices gained by the project will be shared with the statistical community.
This presentation will give an overview of the project and the progress made to date, with particular attention to the quality framework.
Case studies on machine learning for editing and imputation
Miss Anneleen Goyens (Flemish institute for technological research (VITO)) - Presenting Author
Dr Bart Buelens (Flemish institute for technological research (VITO))
Miss Fabiana Rocci (Italian National Institute of Statistics)
Miss Roberta Varriale (Italian National Institute of Statistics)
Mr Florian Dumpert (Federal Statistical Office of Germany (Destatis))
We report on case studies conducted in the context of the UNECE Machine Learning Project, specifically on the use of machine learning methods for editing and imputation. In this project, editing refers only to the identification of suspicious values, and imputation refers to the filling of missing values or the correction of errors. The case studies were conducted to assess the usability, applicability and quality of machine learning methods for these purposes. They were carried out by six different organizations in six different countries and covered both the editing and imputation themes. In the editing theme, the following topics were covered: living cost and food (UK) and statistical register (ISTAT). In the imputation theme: tourism expenditures (Poland), attained level of education (Italy), energy statistics (Belgium), household and person data (Australia), and expenses for research and development (Germany). We present challenges and results, and discuss the machine learning methods that were considered, including neural networks, random forests, penalized regression methods, support vector machines and Bayesian networks. Two case studies are explored in greater detail: energy statistics and attained level of education. We outline the setup and design of these studies, the data required for training machine learning models, approaches to model selection and validation, and the assessment of the results. In addition to being assessed on measures of predictive accuracy, machine learning methods are compared with current best practices in the field. We present results of all case studies and draw overall conclusions and lessons learned. We conclude with recommendations and outline how the results from the case studies will contribute to further initiatives in the wider Machine Learning Project, including deployment of methods and adoption of machine learning in statistical production processes.
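As a purely illustrative aside, the general idea of validating an imputation method against a baseline can be sketched as follows. This is not the setup of any of the case studies: the synthetic data, the mean-imputation baseline, the nearest-neighbour imputer and all function names are assumptions chosen only to show the pattern of hiding known values, imputing them, and comparing predictive accuracy.

```python
import random
import statistics

def make_records(n, seed=0):
    """Synthetic (x, y) records; y depends on the auxiliary variable x (assumed data)."""
    rng = random.Random(seed)
    return [(x, 3 * x + rng.gauss(0, 0.5))
            for x in (rng.uniform(0, 10) for _ in range(n))]

def mean_impute(observed, x):
    """Baseline: impute the overall mean of the observed y values."""
    return statistics.fmean(y for _, y in observed)

def knn_impute(observed, x, k=5):
    """Impute the mean y of the k observed records closest in x."""
    nearest = sorted(observed, key=lambda r: abs(r[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)

def mean_absolute_error(imputer, observed, held_out):
    """Hide the true y of held-out records, impute it, and average the absolute errors."""
    return statistics.fmean(abs(imputer(observed, x) - y) for x, y in held_out)

records = make_records(200)
held_out, observed = records[:40], records[40:]  # treat 40 known values as "missing"

mae_mean = mean_absolute_error(mean_impute, observed, held_out)
mae_knn = mean_absolute_error(knn_impute, observed, held_out)
print(f"mean imputation MAE: {mae_mean:.2f}")
print(f"k-NN imputation MAE: {mae_knn:.2f}")
```

In practice, predictive accuracy on held-out values is only one criterion; the abstract notes that the methods are also compared with current best practices in the field.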
Algorithmic choices for sentiment coding of Flemish tweets
Dr Michael Reusens (Statistics Flanders) - Presenting Author
Dr Marc Callens (Statistics Flanders)
Dr Ann Carton (Statistics Flanders)
Dr Dries Verlet (Statistics Flanders)
The availability of public social media data and rapid advances in natural language processing algorithms that automatically interpret text have made it possible to analyse the continuous stream of signals that people send out. Analysing these signals, such as Facebook posts, Twitter tweets and Instagram photos, using natural language processing techniques could be a low-cost and high-frequency complement or alternative to survey analysis for measuring perceived quality of life.
In this paper we present the results of a pilot project that combines natural language processing (NLP) and machine learning (ML) techniques to extract general sentiment from Flemish Twitter data. This pilot study is part of UNECE's Modernisation project on Machine Learning (HLG MOS ML project 2019-2020) in the field of official statistics (coding, edit and imputation, imagery).
We will report on the objectives, techniques, results, conclusions and lessons learned of the Flemish tweets coding project. More specifically, we will look at the sensitivity of the results to different NLP and ML choices.
ABS use of machine learning to classify address use on the Address Register
Mr Daniel Merkas (Australian Bureau of Statistics) - Presenting Author
Mr James Farnell (Australian Bureau of Statistics)
Miss Debbie Goodwin (Australian Bureau of Statistics)
The Australian Bureau of Statistics maintains an Address Register containing over 10 million residential addresses to provide a mail-out frame for the Population Census and for its household survey program. Addresses are classified by land use as residential, commercial, under construction or vacant. Regular maintenance of the register involves the review of existing classifications and the quarterly addition of 100,000 new addresses. Approximately 68% of new addresses can be resolved using administrative data, including postal, electoral, construction and sales data. This leaves a large number for manual review using aerial imagery and other online tools. With trained analysts able to resolve only 200 addresses per day, this is a very resource-intensive process. To automate this manual process, we have developed an aerial image classification model called Automated Image Recognition (AIR).
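To give a sense of scale, the figures quoted above can be turned into a rough quarterly workload estimate. This is a back-of-the-envelope sketch only: it assumes the 200 addresses per day applies to a single analyst and that all addresses not resolved by administrative data go to manual review. The function and parameter names are hypothetical.

```python
def manual_review_workload(new_addresses, admin_resolved_percent, per_analyst_per_day):
    """Estimate addresses left for manual review and the analyst-days needed.

    Assumes every address not resolved by administrative data is reviewed
    manually, and that the daily rate applies to one analyst.
    """
    remaining = new_addresses * (100 - admin_resolved_percent) // 100
    analyst_days = remaining / per_analyst_per_day
    return remaining, analyst_days

# Figures from the abstract: 100,000 new addresses per quarter,
# about 68% resolved with administrative data, 200 addresses per day.
remaining, analyst_days = manual_review_workload(100_000, 68, 200)
print(f"addresses for manual review per quarter: {remaining}")
print(f"analyst-days required per quarter: {analyst_days:.0f}")
```

Under these assumptions, roughly 32,000 addresses per quarter would remain for manual review, or about 160 analyst-days.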
In this presentation I will explain the methods used to develop this model, describe which classes of addresses were classified most successfully, and report measures of accuracy.
By implementing this model, the ABS can classify addresses for which no administrative data are available and further strengthen classifications where such data exist. Because the administrative data model is predictive of an address's future use, the AIR model complements it by observing the address at a recent point in time. Using the AIR model, we can detect whether a dwelling is habitable or whether the address is still vacant land or under construction, which has significant benefits for household survey and Population Census operations. Implementation of the AIR model in conjunction with administrative data has led to more efficient use of resources in compiling the Address Register, and to improved quality through greater accuracy and timeliness.