BigSurv18 program
Filing the Claim for Social Media Data: Who's Covered and Who Isn't
Chair: Professor Jonathan Nagler (NYU, Social Media and Political Participation Lab)
Time: Friday 26th October, 16:00 - 17:30
Room: 40.109
Seeking the "Ground Truth": Assessing Methods Used for Demographic Inference From Twitter
Ms Colleen McClain (University of Michigan) - Presenting Author
Dr Zeina Mneimneh (University of Michigan)
Dr Lisa Singh (Georgetown University)
Dr Trivellore Raghunathan (University of Michigan)
Given the relative ease of accessing and collecting Twitter data, social science researchers are increasingly using Twitter to understand social trends on a diverse set of topics -- including political communication, dynamics of financial markets, forced migration, transportation patterns, and attitudes toward health practices and behaviors. These research efforts have raised questions on whether and how "big" textual data sources can supplement or replace traditional survey data collection. Any analysis of social media data seeking to draw generalizable conclusions about human attitudes, behaviors, or beliefs, however, raises important questions of 1) whether and how social media measurement diverges from more traditional survey data, and 2) whether representation of the target population can be sufficiently achieved. The latter question specifically requires identifying the socio-demographic characteristics of the individuals analyzed and adjusting for known mismatches between the social media sample and the target population on those measures. Even if the target population is social media users (as opposed to the general population), investigating any socio-demographic differences on behaviors, attitudes or beliefs requires the availability of such information.
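As a rough illustration of the adjustment step described above, the following Python sketch computes post-stratification weights for a single demographic variable. The age groups, counts, and population shares are made-up values, and `poststratification_weights` is a hypothetical helper rather than a method taken from the paper.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per group: population share / sample share.

    sample_groups: list of group labels, one per sampled user.
    population_shares: dict mapping group label -> share in the target population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, pop_share in population_shares.items():
        if counts[group] == 0:
            # A group absent from the sample cannot be weighted up.
            raise ValueError(f"no sampled users in group {group!r}")
        sample_share = counts[group] / n
        weights[group] = pop_share / sample_share
    return weights


# Hypothetical inferred age groups for 10 Twitter users (illustrative only).
sample = ["18-34"] * 6 + ["35-54"] * 3 + ["55+"] * 1
# Hypothetical target-population shares (illustrative only).
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

weights = poststratification_weights(sample, population)
for group in population:
    print(group, round(weights[group], 2))
```

Weights of this kind correct only for the variables they include, and they inherit any error in the inferred group labels, which is why the accuracy of the demographic predictions matters.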
Since the rate of Twitter users who self-disclose their socio-demographic information in profiles is very low, a number of researchers in computer science and, increasingly, across the social sciences have sought to build models to predict the characteristics of the users in the Twitter samples they analyze. Yet, assessing the accuracy of those models is challenging given the need for "ground truth" measures against which the predictions can be compared. Though a variety of "ground truthing" methods exist, they have key limitations, including: 1) relying on resource-heavy techniques that are prone to human error and that do not scale easily (such as manual annotation); and 2) restricting the validity of inferences to a subset of users who disclose optional personal information, or who can be linked to other data sources, potentially jeopardizing the generalizability of the predictive models. The literature, however, is largely silent about the properties, costs, and scalability of different ground truth measures.
In this light, we systematically review existing literature on predicting socio-demographic variables (including age, gender, race/ethnicity, geographic location, political orientation, and income) from Twitter data. We summarize the predictive methods and the ground truth procedures used and discuss strengths, weaknesses, and potential errors associated with each ground truth method -- including those stemming from definitional differences in the constructs of interest, complexities of the unit of analysis (e.g., tweet/post versus user), and issues of sub-sampling and representation. We focus on six different ground truth methods: manual annotation; user-provided information; linked data from another social network source; linked data from a non-social network source (including a survey); distribution-based validation; and geographic assignment. We also discuss logistical and cost implications and conclude with a set of recommendations related to predictions of socio-demographic characteristics of Twitter users.
Coverage Bias in Election Research Using Data From Social Media
Ms Marie Kühn (GESIS Leibniz Institute for the Social Sciences) - Presenting Author
Ms Hannah Bucher (GESIS Leibniz Institute for the Social Sciences)
Dr Joss Roßmann (GESIS Leibniz Institute for the Social Sciences)
Relevance and Research Questions
While a large part of election research is based on survey data, a growing number of studies use alternative data sources like social media to analyze and predict political attitudes and voting behaviors (e.g. Stier, Schünemann, and Steiger 2017; Beauchamp 2017; Freelon and Karpf 2015; Colleoni, Rozza, and Arvidsson 2014; Ceron, Curini, Iacus, and Porro 2014). Research based on the latter has advantages like large sample sizes and lack of respondent-based error sources. However, there is also the risk of systematically excluding parts of the population from analyses since not all adults eligible to vote are also part of the internet population and membership in social networks might be even more selective. In our contribution, we aim to answer two questions: (1) Who is underrepresented in analyses with social media data? and (2) To what extent does the misrepresentation of certain parts of the population bias common estimates in election research?
Methods and Data
To answer our questions, we used data from the cross-sectional face-to-face survey of the German Longitudinal Election Study 2017, in which a random sample of German adults eligible to vote were asked about their voting behaviors and political attitudes. This dataset contains information about how frequently respondents used the internet and which social media platforms they used (Facebook, WhatsApp, YouTube, Twitter, Google+, others, and none). To understand online coverage and coverage within these platforms, we fitted logistic regression models for internet usage and, within the online population, for social media use overall as well as for use of specific platforms. Predictors included standard demographics as well as respondents’ political attitudes and (intended) voting behaviors. We will analyze potential biases by contrasting common estimates for political attitudes and behaviors from the full sample with estimates for online users and users of the different social media platforms.
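The contrast between full-sample and subgroup estimates described above can be sketched as follows. This is a minimal Python illustration, not the authors' analysis: the records are made up (they are not GLES data), the estimates are unweighted, and `share` is a hypothetical helper.

```python
def share(respondents, key, subset=None):
    """Share of respondents with a true value for `key`.

    If `subset` is given, only respondents with a true value for that
    field (for example, "uses_twitter") are included.
    """
    rows = [r for r in respondents if subset is None or r[subset]]
    if not rows:
        raise ValueError("no respondents in the requested subset")
    return sum(1 for r in rows if r[key]) / len(rows)


# Hypothetical survey records (illustrative only).
respondents = [
    {"voted": True, "uses_internet": True, "uses_twitter": True},
    {"voted": True, "uses_internet": True, "uses_twitter": True},
    {"voted": True, "uses_internet": True, "uses_twitter": False},
    {"voted": True, "uses_internet": True, "uses_twitter": False},
    {"voted": False, "uses_internet": True, "uses_twitter": False},
    {"voted": False, "uses_internet": True, "uses_twitter": False},
    {"voted": True, "uses_internet": False, "uses_twitter": False},
    {"voted": False, "uses_internet": False, "uses_twitter": False},
]

full_sample = share(respondents, "voted")
# Coverage bias: estimate among covered respondents minus the full-sample estimate.
bias = {
    group: share(respondents, "voted", subset=group) - full_sample
    for group in ("uses_internet", "uses_twitter")
}
print("full-sample turnout:", full_sample)
for group, value in bias.items():
    print(group, "bias:", round(value, 3))
```

A real analysis would apply the survey's design weights and report uncertainty for each difference; both are omitted here to keep the sketch short.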
Results
Preliminary results show that internet users are more likely than respondents who do not use the internet to be younger, male, more highly educated, and voters. Among respondents who use the internet, social media users are more likely to be younger and female than those who do not report any social media usage. The composition of users varies across the social media platforms. For some platforms, users differ significantly from the respective non-users on age, gender, level of education, political left-right self-classification, and election behavior (having voted and party voted for in the last general election).
Added Value
Our study contributes to understanding possible coverage issues in big data election research with social media data by using comprehensive information from a "traditional" probability-based face-to-face survey. Our findings show that the results of analyses which are exclusively based on social media data might not be fully generalizable to the population of interest in election research.
Who's Tweeting About the President? What Big Survey Data Can Tell Us About Digital Traces
Dr Josh Pasek (University of Michigan) - Presenting Author
Ms Colleen McClain (University of Michigan)
Dr Frank Newport (Gallup)
Ms Stephanie Marken (Gallup)
One principal concern about drawing conclusions about social phenomena from social media data stems from the self-selected nature of service users. Individuals who post on social media sites represent a subset of a group that is already unrepresentative of the public—even before considering the content of their posts, how often they post, or any number of other relevant characteristics. Hence, social researchers have long worried that trace data describe nonprobability samples. As data scientists attempt to leverage these data to generate novel understandings of society, a key overarching concern has been the need to consider the demographic attributes that distinguish social media posters from the populations researchers hope to describe. To this end, data scientists have begun to infer demographic characteristics for social media posters, with one end goal being to construct weights that allow the collection of posts to reflect public opinion.
To test whether demographic differences can help improve correspondence between tweets and survey responses, we examine the correspondence between the daily sentiment of tweets about President Obama and aggregate assessments of presidential approval from nearly one million respondents to the Gallup daily survey from 2009 into 2014. Words from more than 120 million tweets containing the keyword “Obama” were gathered using the Topsy service. Tweet text was aggregated on a daily basis and analyzed for sentiment using Lexicoder 3.0 to generate a net sentiment score for each of the 1,960 days for which both data streams were available. Sentiment scores were then used to predict variations in approval among the nationally representative sample of Americans interviewed by the Gallup daily survey as well as among various demographic subgroups of Americans.
In general, Twitter sentiment scores tracked changes in Americans’ presidential approval. Both data streams displayed downward trends over the course of President Obama’s term and were positively correlated (Pearson’s r = .44). These findings track nicely with typical patterns of presidential approval. They also appeared robust to concerns about nonstationarity in time-series models, which suggests that they capture similar variations that occur in both data streams on both micro and macro levels.
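The daily aggregation and correlation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the word lists are tiny hypothetical stand-ins for the Lexicoder dictionary, the scoring rule (positive minus negative words, divided by total words) is an assumed definition of net sentiment, and all tweets and approval figures are made up.

```python
import math

# Tiny hypothetical word lists for illustration; the study used Lexicoder 3.0.
POSITIVE = {"good", "great", "hope"}
NEGATIVE = {"bad", "fail", "angry"}


def net_sentiment(words):
    """Assumed scoring rule: (positive - negative) / total words."""
    if not words:
        raise ValueError("no words for this day")
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return (pos - neg) / len(words)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up pooled tweet words and approval shares for four days.
daily_words = {
    "day1": ["great", "speech", "hope", "obama"],
    "day2": ["obama", "good", "bad", "plan"],
    "day3": ["angry", "obama", "fail", "vote"],
    "day4": ["bad", "obama", "news", "today"],
}
daily_approval = {"day1": 0.52, "day2": 0.49, "day3": 0.44, "day4": 0.46}

days = sorted(daily_words)
sentiment = [net_sentiment(daily_words[d]) for d in days]
approval = [daily_approval[d] for d in days]
print("r =", round(pearson_r(sentiment, approval), 2))
```

A plain correlation like this ignores the shared downward trend in both series; the nonstationarity checks mentioned in the abstract would need time-series methods that this sketch does not cover.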
Notably, however, attempts to isolate the demographic groups that drive the relations between these measures proved uniformly unsuccessful. When we calculated daily approval of the president within sex, race, education, income, partisanship, and employment categories (as well as interactions of these categories), very few demographic subgroups produced correlations greater than .44, and none of these were significantly stronger. Attempts to reverse engineer the demographic composition of individuals posting about “Obama” by estimating approval for each demographic group and regressing these onto sentiment to generate survey weights were fruitless in improving correspondence.
These findings suggest that demographics are not the principal distinction underlying correspondence between social media and survey data. A demographic adjustment approach is therefore likely doomed to fail. We believe the results suggest that the two data streams, although responsive to similar overarching societal patterns, differ principally because they are measuring substantively different things.
Augmenting Public Opinion Research With Social Media Data: A Case Study of Brexit
Ms Celeste Stone (American Institutes for Research) - Presenting Author
Ms Claire Kelley (American Institutes for Research)
Ms Sarah Kelley (American Institutes for Research)
Ms Caitlin Deal (American Institutes for Research)
Mr Luke Natzke (American Institutes for Research)
What do a tweet about the “failure of the so-called multiculturalism. #Brexit” and the questions in module B of the European Social Survey have in common? How about the number of mentions of “#MSM” and aggregate responses to ESS questions on trust in the media? Both of these examples show how we can track public opinion through both social media and traditional public opinion surveys.
In this paper we examine ways in which social media data can augment surveys to provide insight into shifts in public opinion. Survey measurement of public opinion is widely trusted by researchers, probability-based, and thoroughly validated. However, while public opinion surveys are the gold standard of opinion measurement, they suffer from decreasing response rates and increasing cost of collection (Groves, 2011; Czajka & Beyler, 2016; Brick & Williams, 2012). In contrast, the evaluation of public opinion from social media data is relatively new and has not been thoroughly validated by research. Social media data may be freely available, plentiful, and available in near real time, but they are unvalidated, and because they rely on user-volunteered content they may be heavily biased. In addition, those commenting on social media are not likely to be representative of the general population. In many ways the strengths of survey research and social media data are complementary, and taken together these sources of data may give a more complete and usable portrait of public opinion.
Using the United Kingdom European Union membership referendum as a case study, we evaluate different methods for measuring populism using historical Twitter data. The methods will be evaluated in comparison with data from the European Social Survey, which will serve as the gold standard for our study. We will specifically seek to identify keyword combinations, data sampling and weighting methods, and computational measures for Twitter data that can be used to create accurate barometers of public opinion and early indicators of new social phenomena. We employ natural language processing techniques to filter large volumes of Twitter data for relevant content as well as to detect the authors' sentiments, and match these to broader public opinion trends.
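The keyword filtering step can be pictured with a small Python sketch. It is only a simplified stand-in for the natural language processing described above: the tweets are made up, the keyword list is hypothetical, and `matches_keywords` and `daily_counts` are illustrative helpers rather than the study's tools.

```python
import re
from collections import Counter


def matches_keywords(text, keywords):
    """True if any keyword appears as a whole token (case-insensitive)."""
    # Tokens are runs of word characters, optionally prefixed by '#'.
    tokens = set(re.findall(r"#?\w+", text.lower()))
    return any(k.lower() in tokens for k in keywords)


def daily_counts(tweets, keywords):
    """Count matching tweets per day; `tweets` is a list of (day, text) pairs."""
    counts = Counter()
    for day, text in tweets:
        if matches_keywords(text, keywords):
            counts[day] += 1
    return counts


# Made-up tweets for illustration only.
tweets = [
    ("2016-06-22", "Big vote tomorrow #Brexit"),
    ("2016-06-23", "Polls are open #brexit"),
    ("2016-06-23", "Do not trust the #MSM on this"),
    ("2016-06-23", "Lovely weather in London today"),
]
keywords = ["#brexit", "#msm"]  # hypothetical keyword list

counts = daily_counts(tweets, keywords)
for day in sorted(counts):
    print(day, counts[day])
```

Daily counts of this kind could then be compared with survey aggregates, although which keyword combinations work well is exactly what the study sets out to evaluate.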
In addition to discussing methods for combining Big Data with traditional data sources, our study will provide meaningful contributions to pressing topics in the statistical and data science fields, including methods for selecting relevant data streams, methods for dealing with variable specification inconsistencies across datasets, and the assessment of new data processing and analytical tools for handling big data.