BigSurv20 program



Improving survey questions using machine learning and AI

Moderator: Katharina Meintinger
Friday 6th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Open question formats: Comparing the suitability of requests for text and voice answers in smartphone surveys

Dr Jan Karem Höhne (University of Mannheim) - Presenting Author
Professor Annelies Blom (University of Mannheim)
Mr Konstantin Gavras (University of Mannheim)
Professor Melanie Revilla (RECSM-Universitat Pompeu Fabra)

While surveys provide important standardized data about the population with large samples, they are limited regarding the depth of the data provided. Although surveys can offer open answer formats, the completeness and detail of the answers given in these formats are often limited, particularly in self-administered web surveys, for several reasons. On the one hand, respondents find it difficult to express their attitudes in open answer formats by keying in the answers, so they keep their answers short or skip such questions altogether. On the other hand, survey designers seldom encourage respondents to elaborate on their open answers, because the ensuing coding and analysis have long been conducted manually. This makes the process time-consuming and expensive, reducing the attractiveness of such formats. However, technological developments for surveys on mobile devices, particularly smartphones, make it possible to collect voice instead of text answers, which may facilitate answering questions with open answer formats and provide richer data. Additionally, new developments in automated speech-to-text transcription and in text coding and analysis allow the proper handling of open answers from large-scale surveys. Given these new research opportunities, we address the following research question: How do requests for voice answers, compared to requests for text answers, affect response behavior and survey evaluations in smartphone surveys?
We conducted an experiment in a smartphone survey (N = 2,400) using the opt-in Omninet Panel (Forsa) in Germany in December 2019 and January 2020. From their panel, Forsa drew a quota sample based on age, education, gender, and region (East and West Germany) to match the German population on these demographic characteristics. To collect respondents’ voice answers, we developed the JavaScript- and PHP-based “SurveyVoice (SVoice)” tool that records voice answers via the microphone of smartphones. We randomly assigned respondents to answer format conditions (i.e., text or voice) and asked them six questions dealing with the perception of the most important problem in Germany as well as attitudes towards the current German Chancellor and several German political parties.
In this study, we compare requests for text and voice answers in smartphone surveys with respect to several aspects. First, we investigate item nonresponse (i.e., item missing data), primarily as an indicator of low data quality. Second, we investigate response times (i.e., the time elapsed between the presentation of the question on the screen and the submission of the survey page) as an indicator of respondent burden. Finally, we investigate respondents’ survey evaluations (i.e., level of interest and level of difficulty stated by respondents) as an indicator of survey satisfaction.
This experiment aims to test the feasibility of collecting voice answers for open-ended questions as an alternative data source in contemporary smartphone surveys. In addition, it explores whether and to what extent voice answers collected through the built-in microphones, compared to open answers entered via the keyboard of smartphones, represent a sound methodological substitute.
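
As a minimal sketch of how two of the indicators named above (item nonresponse and response time) could be summarized per answer format condition, consider the following Python example. The record layout, the `summarize` helper, and all values are hypothetical and made up for illustration; they are not data or code from the study and carry no empirical meaning.

```python
from statistics import median

# Hypothetical toy records (NOT study data):
# (condition, answer or None if the item was skipped, response time in seconds)
records = [
    ("text", "Climate change", 41.0),
    ("text", None, 12.5),
    ("text", "Housing costs", 38.0),
    ("voice", "I think the biggest problem is housing", 29.0),
    ("voice", None, 9.0),
    ("voice", "Mostly the economy", 33.0),
]


def summarize(rows, condition):
    """Item nonresponse rate and median response time for one condition."""
    subset = [r for r in rows if r[0] == condition]
    if not subset:
        raise ValueError(f"no records for condition {condition!r}")
    # An item counts as missing if no answer or only whitespace was given.
    missing = sum(1 for _, answer, _ in subset if answer is None or not answer.strip())
    return {
        "n": len(subset),
        "item_nonresponse_rate": missing / len(subset),
        "median_response_time": median(t for _, _, t in subset),
    }


text_summary = summarize(records, "text")
voice_summary = summarize(records, "voice")
```

The median is used here only because response times tend to be right-skewed; the abstract does not say which summary statistic the authors use.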

The sound of respondents: How do emotional states affect the quality of voice answers in smartphone surveys?

Mr Christoph Kern (University of Mannheim) - Presenting Author
Mr Jan Karem Höhne (University of Mannheim)
Mr Stephan Schlosser (University of Göttingen)


Technological developments for surveys on mobile devices, particularly smartphones, offer a variety of new forms and formats for collecting data for social science research. A recent example is the collection of voice answers via the built-in microphone of smartphones instead of text answers via the keyboard when asking open questions. Voice answer formats, compared to text answers, may decrease respondent burden by being less time-consuming and may provide richer data. Furthermore, advances in natural language processing and voice recognition not only make it easier to process and analyze voice data, but also make it possible to use (meta)information, such as voice pitch, to study new research questions. Specifically, pre-trained emotion recognition models can be used to predict the emotional states of respondents based on voice answers. In this study, we use voice data to compare respondents with different predicted emotional states and study the effect of emotions on response quality (e.g., item nonresponse, recording lengths, and number of words). We thereby assume that responses are context-dependent and affected by the emotional conditions of respondents (Heide and Gronhaug 1991). We address the following research question: How do emotional states affect the quality of voice answers to open questions in smartphone surveys?

We conducted a smartphone survey (N = 1,200) using the Omninet Panel (Forsa) in Germany in December 2019 and January 2020. To collect respondents’ voice answers, we developed the JavaScript- and PHP-based “SurveyVoice (SVoice)” tool that records voice answers via the microphone of smartphones. Voice answers were collected for six open questions dealing with the perception of the most important problem in Germany as well as attitudes towards the current German Chancellor and several German political parties. We use openEAR (Eyben et al. 2009), which allows us to derive features from voice data and to predict respondents’ emotional states using pre-trained models for emotion recognition. In addition, for validation purposes, we measured respondents’ emotional states with the Discrete Emotions Questionnaire (Harmon-Jones et al. 2016).

This study presents a novel approach to explore factors that affect the quality of voice answers and adds to the research on the merits and limits of collecting voice answers in smartphone surveys.

Eyben, F., Wöllmer, M., and Schuller, B. (2009). OpenEAR -- Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit. Paper presented at the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, Amsterdam, 2009, 1-6.
Harmon-Jones, C., Bastian, B., and Harmon-Jones, E. (2016). The Discrete Emotions Questionnaire: A New Tool for Measuring State Self-Reported Emotions. PLoS ONE, 11, 1-25.
Heide, M., and Gronhaug, K. (1991). Respondents’ Moods As a Biasing Factor in Surveys: An Experimental Study. Advances in Consumer Research, 18, 566-575.

Automated double-barreled question classification using machine learning

Dr King Chung Ho (SurveyMonkey) - Presenting Author
Mr Shubhi Jain (SurveyMonkey)
Mr Fernando Espino Casas (SurveyMonkey)
Mr Zewei Zong (SurveyMonkey)
Mrs Jing Huang (SurveyMonkey)

Writing neutral survey questions is important in survey creation to avoid response bias. These questions are designed to uncover a person’s unbiased opinion toward a subject. The double-barreled question (DBQ) is one of the major causes of biased questions in survey creation. A question is a DBQ when it touches upon more than one issue, yet allows only one answer. A DBQ can mislead a survey researcher or survey taker because the question has multiple parts and there is no way to tell which part a response refers to.

Past research has suggested that many DBQs can be detected by the presence of a grammatical conjunction (e.g., “and”). While this rule-based approach is simple and straightforward, it may be neither accurate nor generalizable, because conjunctions also occur in properly constructed questions. There is limited work on classifying DBQs, even though the concept is widely known and well understood in survey research.

Recently, machine learning techniques have been applied widely and can achieve good classification performance for problems in different domains. Their ability to learn and utilize patterns from data is the major factor that leads to state-of-the-art performance in various problems. Particularly, deep learning is an area of machine learning that receives significant research interest due to its capability in learning predictive features automatically. Using deep embedding methods, sentences can be transformed into feature vectors that capture the semantic meaning of sentences, which traditional bag-of-words methods cannot achieve. These feature vectors are powerful representations that can be used in training classification models for different tasks.

In this work, we present an end-to-end machine learning approach for DBQ classification. Using deep learning, we transformed text data into vector representations that were used to train classification models (e.g., random forest) for DBQ classification. We validated the models using SurveyMonkey data and the results showed that the machine learning approach outperformed the simple rule-based model (i.e., detection of grammatical conjunction). This work lays a foundation to detect biased questions in surveys using machine learning techniques, advancing artificial intelligence techniques one step closer to a full product feature for survey creation guidance.
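
The rule-based baseline described above, and the reason it can misfire, can be sketched in a few lines of Python. This is an illustrative assumption of what such a rule might look like, not the authors' implementation: the conjunction list, the function name `flag_by_conjunction`, and the example questions are all hypothetical.

```python
import re

# Hypothetical conjunction list; the abstract names only "and" as an example.
CONJUNCTIONS = ("and", "or")
_PATTERN = re.compile(r"\b(?:" + "|".join(CONJUNCTIONS) + r")\b", re.IGNORECASE)


def flag_by_conjunction(question: str) -> bool:
    """Return True if the question contains a listed conjunction as a whole word."""
    return _PATTERN.search(question) is not None


# Hypothetical example questions (not SurveyMonkey data).
questions = [
    "How satisfied are you with your salary and your working hours?",  # two issues
    "How satisfied are you with your salary?",                         # one issue
    "How much do you enjoy rock and roll music?",                      # one issue, contains "and"
]
flags = [flag_by_conjunction(q) for q in questions]
```

The third question is flagged even though it asks about a single issue, which is the kind of false positive that motivates a learned classifier over a fixed rule.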

Using generative adversarial active learning to identify poor closed-ended survey responses

Mr Krishna Sumanth Muppalla (SurveyMonkey) - Presenting Author
Mrs Jin Yang (SurveyMonkey)
Mrs Jing Huang (SurveyMonkey)
Mr Johan Lieu (SurveyMonkey)
Mr Manohar Angani (SurveyMonkey)
Mrs Megha Rastogi (SurveyMonkey)

The validity and reliability of survey data often rely on respondents giving mindful and appropriate responses. Acquiring high quality feedback is one of the biggest challenges in market research because there can be poor responses due to satisficing/inattentive respondents, or fake responses from bots. These response data are less reliable and introduce noise to the response summary for survey creators. This also results in major efforts around filtering out or re-fielding poor quality responses by the survey platform company.

Based on an in-house SurveyMonkey study of several of the most commonly used market research panels in the industry, low quality responses can account for up to 15% of total responses, depending on the respondent source. To better understand and detect poor quality closed-ended responses, we propose a machine learning approach using deep learning. The approach leverages Multiple Objective Generative Adversarial Active Learning (MO-GAAL) to automatically detect poor closed-ended responses before the survey responses are sent to the survey creators.

We framed the problem as an unsupervised outlier detection problem. Using actual survey data that were aggregated and anonymized, we identified multiple features related to the questions and answer options in the survey, along with respondents’ completion times and the answer options they chose. The computed features are then passed through the MO-GAAL architecture, which plays a mini-max game between a generator and a discriminator to identify a decision boundary that separates the outlier data.

Using this approach, we predict for each response the probability that it is a poor response. Poor responses can then be filtered out based on a score cutoff before they are served to the survey creators, thereby providing more reliable data for generating meaningful insights.
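
The final filtering step can be illustrated with a short Python sketch. It assumes that an outlier model has already produced a score between 0 and 1 for each response; the scores, the response ids, the cutoff value, and the helper name `split_by_cutoff` are hypothetical and not taken from the study.

```python
from typing import Dict, List, Tuple


def split_by_cutoff(
    scores: Dict[str, float], cutoff: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split response ids into (kept, filtered) using a poor-response score cutoff.

    Responses whose score is at or above the cutoff are treated as poor.
    """
    kept = [rid for rid, score in scores.items() if score < cutoff]
    filtered = [rid for rid, score in scores.items() if score >= cutoff]
    return kept, filtered


# Hypothetical scores, as if produced by an outlier detection model.
scores = {"r1": 0.05, "r2": 0.92, "r3": 0.40, "r4": 0.71}
kept, filtered = split_by_cutoff(scores, cutoff=0.7)
```

Where to place the cutoff is a trade-off: a lower value removes more poor responses but also discards more valid ones. The abstract does not state which value is used.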

Improving SHARE translation verification

Miss Yi-Chen Liu (CIS, LMU Munich)
Dr Yuri Pettinicchi (MEA-MPISOC) - Presenting Author
Professor Alexander Fraser (CIS, LMU Munich)


The Survey of Health, Ageing and Retirement in Europe (SHARE) project is a cross-national survey that aims to improve social policies by studying the living conditions of the aging population in Europe. Its questionnaire is multilingual, with 39 country-specific languages. To maintain high quality standards, the translated survey questions are verified before the interviewers are sent into the field. During translation verification, common translation mistakes such as misspelled terms, omissions, and other mistranslations are corrected. Third-party machine translation systems and human verifiers currently perform the verification task.

Translation verification in SHARE takes a third of the time allocated for translation procedures and costs one person-month for each of the 28 SHARE country teams. Our aim is to preserve high standards while reducing operational costs. We are developing in-house solutions to make the process more efficient. Unfortunately, creating machine translation systems requires large parallel corpora, which are expensive to obtain or are not available for some languages and specific domains. Therefore, we train a word-to-word translation system in an unsupervised manner, which allows us to solve the problem without parallel corpora.

We use the unsupervised model introduced by Artetxe et al. (2018), training bilingual word embeddings and then finding word translations through dictionary induction. To train the model, only appropriate monolingual corpora in the source language and the target language are needed, which allows us to collect and combine larger and more relevant datasets for training our translation system. For instance, Europarl, News Commentary, and SHARE survey questions are combined to train a high quality word translation system.

After the bilingual word embeddings are trained, English and foreign language survey questions are imported into the model, which then generates the 10 best foreign language translations of each English word. If one of these 10 candidates appears in the human foreign language translation, the word pair is marked as matched. By counting the matched word pairs, we can estimate the translation quality. Our new unsupervised approach has the potential not only to make translation verification more efficient, but also to lessen the heavy workload for human verifiers in SHARE.
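
The matching step can be sketched in Python as follows. The sketch assumes that dictionary induction has already produced a ranked candidate list for each English word; the `candidates` dictionary, the function name `match_rate`, and the example sentence are hypothetical, and whitespace tokenization stands in for whatever preprocessing the real system uses.

```python
from typing import Dict, List


def match_rate(
    english_words: List[str],
    human_translation_words: List[str],
    candidates: Dict[str, List[str]],
    k: int = 10,
) -> float:
    """Share of English words with a top-k candidate found in the human translation.

    English words without any candidate list are skipped (an assumption of this sketch).
    """
    target = {w.lower() for w in human_translation_words}
    scored = [w.lower() for w in english_words if w.lower() in candidates]
    if not scored:
        return 0.0
    matched = sum(
        1 for w in scored if any(c.lower() in target for c in candidates[w][:k])
    )
    return matched / len(scored)


# Hypothetical top-k candidates, as if induced from bilingual word embeddings.
candidates = {
    "how": ["wie", "wieso"],
    "old": ["alt", "älter"],
    "are": ["ist", "bist"],
    "you": ["sie", "du"],
}
english = "How old are you".split()
german = "Wie alt sind Sie".split()
rate = match_rate(english, german, candidates, k=10)
```

In this toy case three of the four English words have a candidate that appears in the human translation, so the match rate is 0.75. A low rate for a question would mark it for closer human review rather than prove that the translation is wrong.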