BigSurv20 program




The Bling of Bing...and other words: Exploring text and search data for classification and estimation

Moderator: Ana Lucia Cordovar (alcordova@usfq.edu.ec)

Friday 20th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

A text mining and machine learning platform to classify businesses into NAICS codes

Professor Sudip Bhattacharjee (University of Connecticut, US Census Bureau) - Presenting Author
Dr Ugochukwu Etudo (University of Connecticut)
Dr Justin C Smith (US Census Bureau)

Classification of business establishments into their correct NAICS (North American Industry Classification System) codes is a fundamental building block for sampling, measuring and monitoring the $19 trillion US economy. NAICS codes form the basis of several economic reports. The US Census Bureau is the custodian of NAICS for the US and receives NAICS codes for businesses from internal surveys and from other sources such as the IRS, SSA and BLS. In many cases these codes do not agree. Further, the process relies heavily on survey responses as well as hours of analyst effort, resulting in expense and errors for both the establishments and the statistical agencies.

In this research, we develop (1) a suite of tools that systematically gather public textual information on US establishments and (2) a natural language processing and machine learning methodology to predict full 6-digit NAICS codes. We rely on a novel mix of publicly available, commercial, and official data. Specifically, we use publicly available textual information (business names and company website text), commercial information (Google reviews and Place Types), and official NAICS codes at the establishment level. Our sample consists of approximately 130,000 establishments across all 20 NAICS sectors (2-digit).

We implement an ensemble machine learning framework that relies on four constituent trained machine learning classifiers. We show that publicly available, firm-sourced data is typically most discriminative when used to train our models to detect the correct NAICS code at the 2-, 4- and 6-digit levels. We find that data sourced from commercial entities provides additional discriminative information. Model accuracies range from 70% to 95%, depending on the level of NAICS specificity, and increase further with additional feature engineering. We evaluate model stability and other performance criteria. Our research can reduce both respondent and analyst burden while improving the quality of business classifications.

Our model can be used by various statistical agencies and private entities to streamline and standardize a core task of business classifications. Our text mining based methodology can be used in multiple other scenarios, from statistical classification to operational readiness research.
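
A minimal illustrative sketch, not the authors' system: the abstract does not name the four constituent classifiers, so the scikit-learn pipeline below uses three stand-in models (logistic regression, naive Bayes, random forest) with invented example texts and invented NAICS labels, to show how free text about an establishment could be mapped to a 6-digit code by a hard-voting ensemble.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Hypothetical training data: free text gathered for each establishment
# (business name, website text, review snippets) and its 6-digit NAICS code.
texts = [
    "artisan sourdough bakery and espresso cafe",
    "residential plumbing repair and drain cleaning services",
]
naics = ["311811", "238220"]

# Hard-voting ensemble of three illustrative text classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    voting="hard",
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("ensemble", ensemble),
])
model.fit(texts, naics)
print(model.predict(["24 hour emergency pipe and drain repair"]))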



Predicting leading economic indicators using big data

Dr Zeynep Suzer-Gurtekin (University of Michigan ISR) - Presenting Author
Miss Yingjia Fu (University of Michigan ISR)
Dr Richard Curtin (University of Michigan ISR)

While other general population surveys have limited their use of open-ended questions (Schuman and Presser, 1981), the University of Michigan's Surveys of Consumers (SCA) have used open-ended questions to measure consumer expectations since the 1950s. Spontaneous mentions are often reported as highly correlated with economic indicators, which serves as a measure of external validity. For example, the proportion of spontaneous mentions of jobs in recent business news is highly correlated with the annual change in non-farm employment in a time series analysis (Curtin, 2018). While the sample design, questionnaire design and data collection procedures specifically target this level of external validity, the surveys also provide a set of rules for coding the polarity of sentiment (favorable, unfavorable or neutral) with respect to a topic in responses to open-ended questions. Previous research on big data has largely limited its focus to keyword labeling; as an advancement, we will apply the survey coding rules to Twitter data and a newspaper corpus using a semi-supervised learning algorithm and explore the external validity of the resulting measures against the nation's leading economic indicators.

Schuman, Howard, and Stanley Presser. 1981. Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. New York: Academic Press.
Curtin, Richard. October 26, 2018. October 2018 Survey Results. Technical Report. https://data.sca.isr.umich.edu/fetchdoc.php?docid=61457
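
A minimal sketch of the semi-supervised idea described in the abstract, with invented seed examples: a small set of responses hand-coded with the SCA polarity rules is combined with unlabeled tweet or newspaper sentences, and scikit-learn's SelfTrainingClassifier (one possible semi-supervised algorithm, not necessarily the one the authors use) propagates the polarity labels.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = [
    "heard that jobs are plentiful right now",       # coded favorable
    "news says layoffs are coming everywhere",       # coded unfavorable
    "prices going up again this month",              # unlabeled
    "hiring freeze announced at the plant",          # unlabeled
]
# 1 = favorable, 0 = unfavorable, -1 = unlabeled (scikit-learn's convention)
labels = np.array([1, 0, -1, -1])

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.6)),
])
model.fit(texts, labels)
print(model.predict(["factory jobs picking up in the region"]))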

Google Searches in Cross-National Analyses

Dr Anna Turner (Polish Academy of Sciences) - Presenting Author
Dr Marcin Zielinski (Polish Academy of Sciences)
Professor Kazimierz M. Slomczynski (Polish Academy of Sciences, Ohio State University)

The development of global technologies has created a fast-changing environment in which monitoring changes in society is based not only on national surveys but also on various new forms of Internet data. This includes Google search queries, which can be used as an indicator of the public agenda. Google is the world's largest search engine and processes over 4.7 billion search queries per day; even in countries with lower Internet penetration, Google's reach is very high (in Europe, with the exception of Russia). Access to these data holds tremendous potential and enables scholars to study citizens' interests from a different perspective. The aim of the project we present is to investigate public attitudes towards information privacy and online data protection in the context of revelations about government agencies' surveillance practices, including the large-scale collection of personal data. Our study examines these topics by adopting a cross-national longitudinal approach (28 EU countries), employing survey data from two editions of the Eurobarometer study (Special Eurobarometer on Data Protection, 2010 and 2015) and over 4,000 Google search queries in multiple languages, spanning a period of two years (at monthly intervals, from April 2013 to March 2015). We will answer two research questions: (1) To what extent do survey data predict aggregate measures of Google searches? (2) To what extent do aggregate measures of Google searches explain individual responses to survey items? The 'mixed strategy' of analysing survey data at the individual level and Google data at the country level is relatively new in research methodology, perhaps because of low awareness of and familiarity with Google tools, but also because of the lack of a clear methodology for such projects. Our research attempts to address this gap.
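
A minimal sketch of the 'mixed strategy' described above, with hypothetical column names and toy numbers: a country-level Google search index is merged onto individual-level Eurobarometer records, and a simple logistic model (statsmodels) relates search intensity, alongside an individual covariate, to an individual privacy-concern item.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical country-level Google Trends index (search volume for privacy terms)
trends = pd.DataFrame({
    "country": ["DE", "PL", "FR"],
    "privacy_search_index": [62.0, 41.0, 55.0],
})

# Hypothetical individual-level survey records (1 = concerned about data protection)
survey = pd.DataFrame({
    "country": ["DE", "DE", "PL", "PL", "FR", "FR"],
    "concerned": [1, 1, 0, 1, 0, 0],
    "age": [34, 52, 29, 61, 38, 45],
})

# Country-level big data merged onto individual-level survey data
merged = survey.merge(trends, on="country", how="left")

model = smf.logit("concerned ~ privacy_search_index + age", data=merged).fit()
print(model.summary())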

Popular with what? Market basket analysis of Google trend topics related to female employment

Dr Anil Boz Semerci (Hacettepe University) - Presenting Author
Dr Duygu Icen (Hacettepe University)
Dr Ayse Abbasoglu Ozgoren (Hacettepe University Institute of Population Studies)

Traditional research on people's intentions, preferences and opinions regarding women's employment uses survey data to monitor change over time and across countries. The emergence of big data, on the other hand, provides new opportunities for monitoring and modeling attitudes towards social and economic issues. Specifically, trend analysis of online search data offers rewarding input for evaluating and assessing changes in public opinion and perception, which can be thought of as a proxy for the level of public knowledge and awareness of specific terms. This study examines concepts related to female employment by applying market basket analysis to Google Trends data. We analyze the popularity and awareness of keywords related to the employment of women in a global and cross-regional setting for the period January 2009 to November 2019. The study also analyzes the connectivity of keywords and their inter-associations as consecutive or descendant terms using market basket analysis. We aim to reveal association rules among a priori selected keywords, which will disclose contextual information specific to regions; the regions selected are East and South Asia, and Europe and North America. Our data source, Google Trends, has become very popular and is widely used to assess public opinion on a wide range of subjects. It is a digital data platform that provides a time series index of the volume of queries users enter into Google in a given geographic area and thus offers compilations of big data. Market basket analysis is one of the methodological approaches used for working with big data: it indicates which items appear or are used together and how frequently. The technique is well suited to finding non-obvious and hidden associations between items, which is also crucial when assessing individuals' thoughts on a specific topic. In this study, the association rules formed by the Apriori algorithm are listed and visualized with the help of RStudio. The findings will be interpreted through the lens of gender-responsive strategies, equality, efficiency and social justice within different country and regional contexts. This research contributes to the literature in several ways. First, to the best of the authors' knowledge, this study is the first to apply market basket analysis to Google Trends data in the context of female employment. Second, analysing the popularity and awareness of keywords related to the employment of women in a global and cross-regional setting provides comprehensive empirical ground for practical suggestions.


This project is funded by Hacettepe University Scientific Research Projects Coordination Unit, project ID SÇP-2020-18179.
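
A minimal sketch of the association-rule idea behind the study: the authors form rules with the Apriori algorithm in R/RStudio, so the Python code below, which uses the mlxtend package and invented keyword "baskets", only illustrates how frequent itemsets and rules are derived, not the authors' pipeline.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each "transaction" is a hypothetical set of search keywords that trend together
# in one region-month.
transactions = [
    ["female employment", "childcare", "part-time work"],
    ["female employment", "childcare", "maternity leave"],
    ["female employment", "wage gap"],
    ["childcare", "maternity leave"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

# Frequent itemsets above a minimum support, then rules above a minimum confidence
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])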

Prediction of author’s educational background using text mining

Mr Rense Corten (Department of Sociology/ICS, Faculty of Social Sciences, Utrecht University, Utrecht, The Netherlands) - Presenting Author
Mrs Shiva Nadi (Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, The Netherlands)

In the emerging peer-to-peer platform economy, participants crucially and increasingly rely on personal information, such as profile pictures but also written text, to form trust. To some extent, a user's assessment of the trustworthiness of interaction partners may be based on the partner's social background, and in the absence of clear markers of social background, users may try to infer it from written text. In this paper, we examine to what extent educational background can be inferred from written text, based on the assumption that educational level is associated with writing style, including a signature way of using a certain vocabulary that makes a person's writing unique and recognizable. Using a large public dataset of almost 60,000 dating profiles, we model author style as a predictor of educational background. To prepare the data for machine learning algorithms, the level of education of each profile is encoded in the International Standard Classification of Education (ISCED) format, and all of a user's profile texts are merged into a single piece of text. We consider lexical features such as bag of words as well as stylistic features, including word length, the average number of unique words, the number of punctuation marks, and the number of misspelled words. Using the extracted features, we explore the level of education with two approaches: (i) classifying the level of education as elementary or higher using lexical features as the source for author identification; (ii) adding stylistic features to the lexical features in the classification model. We used naive Bayes, logistic regression, and a long short-term memory (LSTM) neural network for the classification task in our experimental study. The LSTM, using both stylistic and lexical features, predicted the education level more accurately than the other classification models. Our results may be useful not only in the context of the platform economy and online markets, but also more generally to researchers who need to rely on written text as an indicator of educational background.
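
A minimal sketch of combining lexical and stylistic features for the classification task described above, with invented example profiles and labels: bag-of-words counts are concatenated with simple style features (average word length, punctuation count, word count) and fed to a logistic regression; the naive Bayes and LSTM models mentioned in the abstract are not shown.

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

class StylisticFeatures(BaseEstimator, TransformerMixin):
    """Average word length, punctuation count and word count of a profile text."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for text in X:
            words = re.findall(r"\w+", text)
            avg_len = float(np.mean([len(w) for w in words])) if words else 0.0
            n_punct = len(re.findall(r"[^\w\s]", text))
            rows.append([avg_len, n_punct, len(words)])
        return np.array(rows)

# Invented example profiles with hypothetical ISCED-derived binary labels
profiles = [
    "i like movies and hangin out lol!!",
    "I enjoy nineteenth-century literature and long hikes on quiet weekends.",
]
isced = ["elementary", "higher"]

model = Pipeline([
    ("features", FeatureUnion([
        ("bow", CountVectorizer()),          # lexical: bag of words
        ("style", StylisticFeatures()),      # stylistic features
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(profiles, isced)
print(model.predict(["love going to the gym and chillin with friends"]))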