BigSurv18 program
The Fourth Paradigm: Moving From Computational Science to Data-Intensive Scientific Discovery?
Chair: Dr Craig Hill (RTI International)
Time: Friday 26th October, 16:00 - 17:30
Room: 40.248
Moving Social Science Into the Fourth Paradigm: Opportunity Abounds
Final candidate for the monograph
Dr Craig Hill (RTI International) - Presenting Author
A thousand years ago, all science, such as it was, was experimental. Science amounted to observing natural phenomena and conducting quite small-scale experiments, after which scientists merely described what they saw. Beginning in the 19th century and continuing even today, we have evolved to a theoretical state of science, as exemplified by Newton’s laws or Maxwell’s equations. Scientists postulated a theory, devised hypotheses to test that theory, and conducted experiments designed to confirm or refute those hypotheses. Once enough hypotheses passed the test, a theory was accepted. In the social sciences, we developed tools, processes, and approaches to apply this “second paradigm” to understanding human behavior. We wrote textbooks and computer software (e.g., SAS and SPSS). Using these tools, we were able to calculate estimates, draw inferences, and produce official statistics.
In the last few decades, science has become “computational” (the Third Paradigm). Harnessing the abilities of ever more powerful computers, fields such as physics, chemistry, and biology have increasingly turned toward simulation and modelling as a way of doing “more” science faster. In the last ten years, high-performance computing and software tools such as Python and Hadoop have allowed scientists to take a computational approach to many scientific problems—although, I would posit, the social sciences have not kept pace.
We have now embarked on the Fourth Paradigm, variously referred to as e-Science, data-enabled science, or data-intensive scientific discovery. These four paradigms were described, presciently, almost a decade ago (Hey, Tansley, and Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery) in a book that expanded on the vision of pioneering Microsoft engineer Jim Gray. In full bloom, Fourth Paradigm scientists will leverage advanced computing capabilities to manipulate and explore massive datasets and to collaborate with one another, and with technologists, to reach scientific breakthroughs.
In this essay, we examine the opportunities awaiting Fourth Paradigm social scientists and the barriers that must be hurdled, and we discuss some of the research challenges that we should be undertaking even now. We will create a new typology of the “Data Life Cycle” for social scientists and show how subject matter experts, statisticians, data scientists, and computer scientists can come together to take full advantage of a Fourth Paradigm approach to social science.
Doing Social Science With Big Data Sets - A Framework of Approaches
Dr Peter Dahlin (Mälardalen University) - Presenting Author
A large variety of datasets are joined under the term ‘Big Data’. They differ in volume, velocity, and variety, but also in the contexts they come from and the phenomena to which they relate. Generating or collecting such data poses some challenges; making use of it poses others.
This paper outlines a framework of approaches to using big datasets in social science. The examples come from business research studying companies and managers. The datasets discussed contain, for example, standardized data on company finances, formal relationships such as board appointments, and unstructured text describing events.
The suggested framework consists of two dimensions. The first concerns how central the dataset is to the study and analysis, ranging from being the only data used at one end of the scale to being a minor complement to other data at the other. The second dimension concerns the refinement of the data: at one extreme the data are analyzed directly, while at the other, data need to be calculated or created to represent the sought variables. Simplified to dichotomies, these dimensions form a four-field matrix. The paper uses four example studies to illustrate the four different approaches:
1) In a dataset on companies’ involvement in acquisitions, an interesting case was identified and formed the basis for a thorough qualitative study. This approach has little refinement, and a minimal role in the main study.
2) In a study of auditors’ roles, variables representing the company’s network capital were added. Through Social Network Analysis of board appointments from a big dataset, each company’s embeddedness was computed. This is an example of high refinement, complementing other data.
3) In a study of company turnaround, a big dataset on 600,000 companies’ development over ten years was used to test different measures of turnaround. Limited refinement, but as the main dataset.
4) Studying the survival of start-ups, network analysis of interlocking directorates provided the main explanatory variables. Extensive refinement, through calculations over more than one million board appointments, and this was the main dataset.
In the approaches requiring much refinement, calculation, and data generation, the analysis layer is very central and, in some cases, also very extensive. While this offers great opportunities, it also complicates the understanding of what the underlying dataset is. From the perspective of quantitative survey research, the dataset can appear ready for statistical analysis. Instead, the big dataset should be considered an “empirical model”, a representation of the studied phenomenon. Just as questions can be asked of a sample of respondents, questions can be asked of such a dataset. For example, the dataset might not know the network centrality of an actor, but it can provide the data necessary for calculating it. Calculated or generated variables then form the data in the analysis layer, which can be used to answer the research question.
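To make the idea of asking questions of a dataset concrete, here is a minimal sketch, not the author’s actual pipeline, assuming hypothetical board-appointment records and using the networkx library to derive a network variable that the raw data do not contain directly.

```python
# Minimal illustrative sketch (hypothetical records, not the author's code):
# deriving a network variable from raw (director, company) board appointments.
from itertools import combinations
import networkx as nx

# Hypothetical board-appointment records: (director, company)
appointments = [
    ("d1", "AlphaAB"), ("d1", "BetaAB"),
    ("d2", "BetaAB"), ("d2", "GammaAB"),
    ("d3", "AlphaAB"), ("d3", "GammaAB"),
]

# Two companies are linked when they share at least one board member
# (an interlocking directorate).
boards = {}
for director, company in appointments:
    boards.setdefault(director, set()).add(company)

G = nx.Graph()
for companies in boards.values():
    G.add_edges_from(combinations(sorted(companies), 2))

# The big dataset does not "know" a company's embeddedness, but it holds what
# is needed to compute it; the derived variable then enters the analysis layer.
print(nx.degree_centrality(G))
```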
In summary, Big Data approaches can take many different forms, somewhat differing from “standard” research methods. The framework could aid in planning as well as explaining how big datasets are used in social science.
A Paradigm Shift From Surveys to Big Data in Financial Market Research
Dr Amos Chinomona (Rhodes University) - Presenting Author
Recent developments have seen a paradigm shift in data-based research approaches, from traditional survey data analysis to “Big Data” analytics. The advent of the technology age, enhanced by the emergence of fast and sophisticated computers, has made it easy to collect and store the enormous and complex data sets termed “Big Data”. Often these data sets are updated frequently and continuously, with observation and variable counts of substantially large magnitudes. Because of their complexity and size, these “Big Data” are often stored in relational databases on multiple 'rented' servers and accessed with structured query language (SQL). This has given researchers and data stakeholders a new stream of data sources. Analysis of such data requires powerful data manipulation and analysis tools, including machine learning techniques that allow exploration of more flexible relationships between variables. These data manipulation tools are aimed at bringing out underlying trends in the variables. Survey data, on the other hand, are flat and rectangular and are often analysed using conventional statistical techniques and software, with a main focus on summarisation, statistical inference, and prediction.
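As a hedged illustration of this contrast, the sketch below (with a hypothetical table, columns, and values, not the study’s actual data) pulls records from a small relational store via SQL and fits a flexible tree-ensemble model of the kind alluded to above, rather than a conventional flat-file analysis.

```python
# Hedged sketch with hypothetical data: SQL access to a relational store,
# followed by a flexible (tree-ensemble) model of length of stay.
import sqlite3
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# An in-memory table stands in for the hospital's operational databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient_events "
             "(admission_hour INT, n_prescriptions INT, length_of_stay REAL)")
conn.executemany("INSERT INTO patient_events VALUES (?, ?, ?)",
                 [(8, 2, 3.5), (14, 5, 6.0), (22, 1, 2.0), (9, 4, 5.5)])

# SQL does the selection server-side; only the needed columns come back.
df = pd.read_sql_query(
    "SELECT admission_hour, n_prescriptions, length_of_stay FROM patient_events",
    conn)

# A flexible learner allows a non-linear relationship with length of stay
# without the analyst fixing its functional form in advance.
X, y = df[["admission_hour", "n_prescriptions"]], df["length_of_stay"]
model = GradientBoostingRegressor().fit(X, y)
print(model.predict(X))
```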
“Big Data” are usually gathered without a specific aim of collecting variables relevant to particular research questions, whereas surveys are often explicitly designed and carried out to achieve specific goals and are in most cases supported by statistical theory. “Big Data” analytics are vital in the healthcare sector, where the entirety of data related to patients' healthcare and well-being is collected and stored. There has been a substantial increase in data liquidity; that is, medical data have become both ubiquitous and easily accessible.
The aim of the current study is to use both “Big Data” and survey data to assess operational efficiency at Settlers Hospital in the Eastern Cape, South Africa. The results from the two approaches are compared. In particular, operational efficiency is assessed via monitoring patient servicing. The “Big Data” are obtained from patients’ data that are generated from computerized physician order entry, prescriptions, medical imaging, laboratory, pharmacy and insurance, and electronic patients’ records. The survey data are obtained from a two-stage cluster sampling design of the patients. The variables of interest include personal attention, privacy, query handling, general hygiene, average length of stay and waiting time for admission. In addition, non-clinical issues such as pre-authorisation process, timely medication and discharge process were measured.
The results show similar trends in average length of stay, query handling, and waiting time for admission. “Big Data” analytics have the advantage that data are collected in real time, unlike survey data analysis, which involves historic data.
Addressing the Variety and Changeability of Big Data
Dr Peter Dahlin (Mälardalen University) - Presenting Author
The “Big” in “Big Data” puts more focus on Volume and Velocity than on Variety. Variety implies richness of the data: multiple types of data, most likely created for other reasons, that are unstructured or semi-structured. When working with data that are not strictly structured, changes and variations are natural. For example, an announced change of data format from an API only requires adaptation to the new specification, whereas unannounced, sudden, and inconsistent changes to a scraped website require extensive redesign and reprogramming.
This paper describes a system and infrastructure focused on handling the variety and changeability of data. The system in question collects data on companies, managers, and business activities and has been under development and in operation since 2008. It deals with much larger data volumes than are commonly handled in business studies based on quantitative surveys or qualitative interviews. However, the current inflow of around 100-200 GB per month can be handled without high-performance storage solutions. Instead, adaptability and flexibility have been the primary goals, in order to cope with the (all too frequent) changes in the data sources and formats.
The first of two highlighted principles is modularity. By separating different parts of the system, modules are created for different functions (such as data collection, distribution, parsing, processing, and storing) as well as for different instances (mainly distinct data sources). This resembles a matrix organization and means that errors can be trapped in a confined part of the system, and that optimization can be applied to individual parts as well as to the system as a whole. A failing function or instance does not cause a system-wide halt. For example, the data retrieval function can easily be expanded by running more instances of that module, and since it is separate from the more demanding processing, it can run on small resources. In this system, data retrieval runs on Raspberry Pi, C.H.I.P., and Pine64 boards, old laptops, and minimal virtual machines, whereas the data processing and storing run on high-end workstations.
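The sketch below illustrates the modularity principle in minimal form; it is not the author’s system, and the source names and payloads are hypothetical. Each data source runs as its own retrieval instance, so a failure in one instance is trapped locally rather than halting the pipeline.

```python
# Minimal sketch of modular retrieval (hypothetical sources, not the real system):
# each source is a separate instance, and its failure is confined to itself.
import logging

def retrieve_source_a():
    # e.g., call an API; a placeholder payload stands in here
    return '{"company": "AlphaAB", "event": "acquisition"}'

def retrieve_source_b():
    # e.g., scrape a page whose layout changed without announcement
    raise RuntimeError("unexpected page structure")

RETRIEVERS = {"source_a": retrieve_source_a, "source_b": retrieve_source_b}

def run_retrieval(queue):
    """Run every retrieval instance, trapping errors per instance."""
    for name, retrieve in RETRIEVERS.items():
        try:
            queue.append((name, retrieve()))
        except Exception as err:
            logging.warning("instance %s failed: %s", name, err)  # confined failure

raw_queue = []      # hand-off point to the separate parsing/storing modules
run_retrieval(raw_queue)
print(raw_queue)    # source_a delivered its data; source_b's failure did not stop it
```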
The other principle is stepwise, hierarchical parsing of retrieved data. The raw input (e.g., from a web scrape) goes through several increasingly specific and fine-tuned parsing steps before arriving at a state where the data can be organized. Traceability is maintained by keeping the hierarchical relations between steps. In the event of an error, the prior state can be revisited and used to adjust the parser. After the data have been successfully organized, the hierarchical branch is deleted.
Instead of extracting only specific data, a data-greedy approach is used, in which most of the data is saved even if its use is not currently apparent. This results in varied and complex datasets, where even small deviations can cause large problems. The stepwise, hierarchical approach to parsing isolates errors, keeps the original data, and maintains traceability until the data are fully organized.
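A minimal sketch of the stepwise, hierarchical parsing idea follows, with hypothetical parser steps and input rather than the author’s code: each step keeps a reference to its parent state for traceability, and the branch is only discarded once the data have been organized.

```python
# Illustrative sketch (hypothetical steps and input): stepwise, hierarchical
# parsing where every step keeps its parent state for traceability.
import re

def step_split_records(raw):
    # coarse step: split the scraped page into candidate records
    return {"parent": raw, "records": raw.strip().split("\n")}

def step_extract_fields(split):
    # finer step: pull named fields out of each record
    pattern = re.compile(r"(?P<company>\w+);(?P<event>\w+)")
    rows = [m.groupdict() for r in split["records"] if (m := pattern.match(r))]
    return {"parent": split, "rows": rows}

raw_page = "AlphaAB;acquisition\nBetaAB;bankruptcy\n"   # stand-in for a web scrape
branch = step_extract_fields(step_split_records(raw_page))

if branch["rows"]:                  # organization succeeded
    organized = branch["rows"]
    branch["parent"] = None         # the hierarchical branch can now be deleted
    print(organized)
else:
    # on error, the parent states remain available for inspection and for
    # adjusting the parser before re-running this branch
    print(branch["parent"]["parent"])
```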
In summary, the principles of modularity and stepwise, hierarchical parsing are suggested for data collection setups, to address the challenges posed by variety and changeability.
From Data to Big Analytics - Automated Analytic Platforms for Data Exploration
Dr Richard Timpone (Ipsos) - Presenting Author
Dr Yongwei Yang (Google)
Mr Jonathan Kroening (Ipsos)
As Big Data has altered the face of research, the same factors of Volume, Velocity, and Variety that the Meta Group (now Gartner) called out to define it are also changing the opportunities for analytic data exploration. Improvements in algorithms and computing power provide the foundations for automated platforms that identify patterns in analytic model results, beyond simply looking at the patterns in the data.
We have developed an Automated Analysis Inductive Exploration Platform that runs tens of thousands of statistical models and explores them for systematic changes, identifying shifting patterns that would often be missed otherwise. This exploratory platform does not replace hypothesis-driven research, but it quickly shines light, in an exploratory manner, on patterns that go beyond the examination of the data sources themselves.
These techniques are designed to extract more value out of traditional survey data as well as Big Data and other behavioral data, and they are relevant for academic, industry, governmental, and NGO exploration of new insights into changing patterns of attitudes and behaviors.
The architecture of our Ipsos Research Insight Scout (IRIS) has two parts: an automated analytic platform and a framework to extract insights from the analyses. Such an automated platform is made possible by advances in computational modeling and computing power, with the extraction of insights ranging from smart summarization to rule-based expert systems to machine and deep learning.
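The sketch below conveys the general idea in miniature; it is not the IRIS platform, and the segments, waves, data, and flagging rule are simulated assumptions. Many small models are fitted automatically and their results are scanned for systematic shifts that an analyst would rarely inspect one by one.

```python
# Toy sketch of automated analytic exploration (simulated data, not IRIS):
# fit one small model per segment and wave, then scan the results for shifts.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
segments = [f"seg{i}" for i in range(20)]
waves = [1, 2, 3, 4]

coefs = {}
for seg in segments:
    trend = 0.1 if seg in ("seg3", "seg7") else 0.0   # two segments truly shift
    for wave in waves:
        x = rng.normal(size=(200, 1))
        y = (0.5 + trend * wave) * x[:, 0] + rng.normal(scale=0.3, size=200)
        coefs[(seg, wave)] = LinearRegression().fit(x, y).coef_[0]

# Insight-extraction layer in its simplest, rule-based form: flag segments
# whose estimated effect moves monotonically across waves.
for seg in segments:
    series = [coefs[(seg, w)] for w in waves]
    if all(b > a for a, b in zip(series, series[1:])):
        print(f"{seg}: effect rising across waves {np.round(series, 2)}")
```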
Beyond theory and the technical discussion of the potential of automated analytic exploration platforms, we will discuss the case of a major company that is using this approach to identify insights in patterns of data that might otherwise have been missed. Its experience with the practical value demonstrates how exploratory analysis over vast volumes, velocities, and varieties of analytic models can go beyond what was previously possible, and shows how such platforms can become important in our decision processes in the future.