BigSurv18 program







Big Data = Big Applications: From Data Linkage to Education

Chair: Dr Ralph Meijers (Statistics Netherlands)
Time: Saturday 27th October, 16:00 - 17:30
Room: 40.012

5324 Euros Per Hour: Outlier or Football Player? Unsupervised Learning Methods for Anomaly Detection: Application to the Individual Declaration of Social Data

Mrs Julie Djiriguian (INSEE) - Presenting Author
Mrs Marie Cordier-Villoing (INSEE)
Mr Thomas Deroyon (INSEE)

This paper is part of the broader project to overhaul the Annual Declarations of Social Data (DADS), the wage bill information that firms have had to file annually for each of their employees for payroll and tax purposes, and which is now being progressively replaced by new monthly Social Nominative Declarations files. These individual Annual Declarations of Social Data and Social Nominative Declarations are the standard national source for the analysis of wages in France. Rebuilding the usual production line offers the opportunity to rethink the methodology of this process, and in particular the process for anomaly detection. Indeed, having to handle large volumes of data each month pushes us to test machine learning methods that detect anomalies automatically.

This presentation reports on the work carried out in this context. One part will focus on the specificities of the data themselves and the nature of the problem; another will cover the methods tested and the results.

In particular, we will see that the statistical processing and checking of these data, as with DADS, turn out to be complex, because two of the three main variables, the gross and net salaries, are not directly observed but calculated from data reported by companies to meet their social obligations. Anomalies may therefore arise either at the level of the elementary variables or at the level of the aggregated variables. In this paper we choose to detect anomalies directly on the synthetic variables (gross salary, net salary), since the detection stage occurs after a consistency analysis of the elementary variables. Indeed, a study performed at the level of the reported variables does not exempt us from an analysis of the aggregated salary variables. Furthermore, the reported variables available in the individual declarations depend on the standard according to which companies have to fill in their declarations. If this standard changed, the reported variables could change, and outlier detection based on their values could become unusable.

For now, only unsupervised learning, which detects anomalies without labeled data, is feasible with the available data. This makes the work much more complex, because it is extremely hard to distinguish outliers from normal data without ever being able to assess the real success of the process. We plan to use and test several methods, such as association rules, isolation forest, and local outlier factor (LOF). The presentation will describe the algorithms tested for anomaly detection and compare them on observed data in an attempt at performance evaluation.
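
As an illustration of one of the methods named above, the following is a minimal pure-Python sketch of the local outlier factor. It is not the authors' implementation: the wage values are invented (only the 5324 figure echoes the title), the choice of scoring log hourly wages and of k = 3 are assumptions, and the function `lof_scores` is a hypothetical helper that simplifies the standard definition by using exactly k neighbours and assuming no duplicate points.

```python
import math

# Hypothetical gross hourly wages in euros; only the 5324 figure echoes the title.
wages = [11.2, 12.5, 13.1, 14.0, 15.3, 16.8, 18.4, 21.0, 24.5, 5324.0]


def lof_scores(points, k=3):
    """Return a local outlier factor per point (simplified, O(n^2)).

    Simplifications: exactly k neighbours are used (ties are not expanded),
    and the points are assumed to contain no duplicates.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point and the distance to the k-th one.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(k / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Assumption: wages are compared on a log scale, as one-dimensional points.
points = [(math.log(w),) for w in wages]
scores = lof_scores(points, k=3)

# Highest scores first: these are the candidates for manual review.
ranked = sorted(zip(scores, wages), reverse=True)
for score, wage in ranked[:3]:
    print(f"{wage:>8.1f} EUR/h  LOF = {score:.2f}")
```

Scores close to 1 indicate a point whose local density resembles that of its neighbours, while much larger scores flag isolated values. Without labels, a high score only marks a record for review; it cannot say whether the value is an error or a genuine extreme such as a football player's pay.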


The Best of Two Worlds: Combining Longitudinal Health and Learning to Learn Surveys With National Registry Data

Dr Henrik Dobewall (University of Tampere) - Presenting Author
Professor Arja Rimpelä (University of Tampere)
Mr Lasse Pere (University of Tampere)
Dr Pirjo Lindfors (University of Tampere)
Professor Mari-Pauliina Vainikainen (University of Tampere)


In our paper, we discuss methods for combining Big Data with traditional survey data. We aim to find the most common paths from the end of basic education into upper-secondary schools and vocational institutions and to identify the health and learning related factors that explain these trajectories.

Our project “Redefining adolescent learning: A multi-level longitudinal cohort study of adolescent learning, health, and well-being in educational transitions in Finland” follows a cohort of 12 000 children from the beginning of lower secondary education (age 12–13) to the end of upper-secondary education (age 18 and older) in the 14 municipalities of the Helsinki Metropolitan Region. The registry data were generated by the Finnish National Agency for Education during students’ application to upper-secondary education. We demonstrate how to combine students' answers to health and learning to learn questionnaires (2011-2014) with national registry data (2014-2017) using their personal identification number.
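
The linkage step described above can be sketched as a keyed join. This is only an illustration under stated assumptions: the records, the field names (`pid`, `wave`, `health_score`, `year`, `track`, `admitted`), and the helper `link_by_id` are all hypothetical and do not reflect the actual questionnaire or registry layout.

```python
# Hypothetical survey answers, one row per student, keyed by a personal ID.
survey = [
    {"pid": "A001", "wave": 2011, "health_score": 3.8},
    {"pid": "A002", "wave": 2012, "health_score": 2.9},
    {"pid": "A003", "wave": 2013, "health_score": 4.1},
]

# Hypothetical registry rows; a student may have several applications.
registry = [
    {"pid": "A001", "year": 2014, "track": "general", "admitted": True},
    {"pid": "A002", "year": 2014, "track": "general", "admitted": False},
    {"pid": "A002", "year": 2015, "track": "vocational", "admitted": True},
    {"pid": "A004", "year": 2014, "track": "vocational", "admitted": True},
]


def link_by_id(survey_rows, registry_rows, key="pid"):
    """Join survey rows to registry rows on a shared identifier.

    Returns the linked rows (one per survey/registry pair) and the
    survey rows for which no registry record was found.
    """
    registry_by_id = {}
    for row in registry_rows:
        registry_by_id.setdefault(row[key], []).append(row)

    linked, unmatched = [], []
    for s in survey_rows:
        matches = registry_by_id.get(s[key])
        if not matches:
            unmatched.append(s)
            continue
        for r in matches:
            # Merge the two records; the key is kept once, from the survey row.
            linked.append({**s, **{k: v for k, v in r.items() if k != key}})
    return linked, unmatched


linked, unmatched = link_by_id(survey, registry)
print(len(linked), "linked rows;", len(unmatched), "survey rows without a registry match")
```

Keeping the unmatched survey rows separate matters for the trajectories of special interest: a student with no registry record may be one who never applied to any upper-secondary institution.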

In addition to predicting students’ successful educational paths to adulthood, we have a special interest in those trajectories where the student, after completing basic education, has not proceeded directly into any upper-secondary institution (or never has), has changed institution, interrupted studies, or applied again without success. The health and learning related factors identified can be individual/family, class-based, or school-based.

Challenges of analyzing, interpreting, and reporting national registry data, which have been collected for a different purpose, will be highlighted. Ethical considerations and consequences of the new European general data protection regulation will be discussed.


Enriching Education Survey Data With Knowledge Graphs

In review process for the special issue

Dr Jamie Shorey (RTI International) - Presenting Author
Mrs Helen Jang (RTI International)
Mr Peter Baumgartner (RTI International)

A fundamental barrier to enriching survey data with administrative or related survey data is data integration: creating links between disconnected data. This is often accomplished by manual coding of text responses into a common scheme. This process is a time-consuming, expensive, and often error-prone aspect of surveys. Various methods of coding open-ended responses have been explored in the research literature, but they typically rely on trained natural language processing (NLP) techniques. This research investigates the creation and use of structured prior knowledge, in the form of the Minerva Knowledge Graph, to demonstrate improved coding, imputation, and quality control of survey response data in the education domain.

Humans have several advantages over machines in coding, which is essentially a clustering task, because their knowledge of the world allows them to reason about relationships between entities and language. To emulate this reasoning ability, we used previously coded data to construct a knowledge graph relating entities such as universities, courses, and departments in the education domain. Calculating similarity among real-world entities is a fundamental step in many analytics-driven tasks such as ranking, relationship discovery, and data linkage. In a series of empirical tests, we demonstrate enhanced coding of college transcript data using a semantic similarity search within an education knowledge graph.
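
To make the idea of similarity search over a knowledge graph concrete, here is a small sketch. It does not reproduce the Minerva Knowledge Graph or the authors' similarity measure: the triples, relation names, and course codes are invented, and Jaccard overlap of graph neighbours is swapped in as one simple, well-known similarity measure.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples from previously coded data.
TRIPLES = [
    ("Intro to Statistics", "offered_by", "Dept of Mathematics"),
    ("Intro to Statistics", "has_term", "statistics"),
    ("Intro to Statistics", "has_term", "intro"),
    ("Intro to Statistics", "coded_as", "STAT"),
    ("Statistical Methods I", "offered_by", "Dept of Mathematics"),
    ("Statistical Methods I", "has_term", "statistics"),
    ("Statistical Methods I", "has_term", "methods"),
    ("Statistical Methods I", "coded_as", "STAT"),
    ("Intro to Sociology", "offered_by", "Dept of Sociology"),
    ("Intro to Sociology", "has_term", "sociology"),
    ("Intro to Sociology", "has_term", "intro"),
    ("Intro to Sociology", "coded_as", "SOC"),
]


def build_graph(triples):
    """Split triples into neighbour sets per course and known course codes."""
    neighbours = defaultdict(set)
    codes = {}
    for subj, rel, obj in triples:
        if rel == "coded_as":
            codes[subj] = obj
        else:
            neighbours[subj].add((rel, obj))
    return neighbours, codes


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_code(query_edges, neighbours, codes):
    """Return the code of the coded course most similar to the query."""
    ranked = sorted(
        ((jaccard(query_edges, neighbours[c]), c) for c in codes), reverse=True
    )
    best_score, best_course = ranked[0]
    return codes[best_course], best_course, best_score


neighbours, codes = build_graph(TRIPLES)

# An uncoded transcript entry, described by the entities it links to.
query = {
    ("offered_by", "Dept of Mathematics"),
    ("has_term", "statistics"),
    ("has_term", "methods"),
    ("has_term", "applied"),
}
code, match, score = suggest_code(query, neighbours, codes)
print(f"suggested code {code} via '{match}' (similarity {score:.2f})")
```

The point of the sketch is that the suggestion comes from shared graph structure (department, terms) rather than from string matching on the course title alone, which is the kind of relational reasoning the abstract attributes to human coders.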