“Saying ‘Big Data’ isn’t enough. You gotta be about doing Big Data right.” Dr. David Lazer
According to the March 28th article in the Chronicle of Higher Education, Big Data has encountered a few setbacks in the research realm. Examples include:
- Google Flu Trends, which is considered a failure because it overestimated the prevalence of flu in the country.
- The debunking of the claim that President Obama’s 2012 win was a result of his staff’s “technological wizardry.”
- Studies that correlate social-media behavior with demographic traits.
What strikes me about these studies and their flaws is not the use (or abuse?) of Big Data, but the research methodologies and the limitations built into the research questions. Here are a few reminders from our DH course:
1. What is your research question? Asking the wrong question, or formulating it in the wrong way, will skew variable selection, population selection and other factors, and the result will be an invalid study. In short, “garbage in, garbage out” pertains to the research model too.
2. What is your data set and is it representative of the larger population? One of the research anecdotes in the article discussed data mining a person’s partisan Twitter posts to determine his political affiliation. While this sounds like a good research topic, the trouble lies in taking the results and generalizing them to the larger population. People use Twitter for more than politics, so the data set was not representative of Twitter users as a whole.
3. What tools have you selected to answer your question and how reliable are they? Creating an algorithm is one skill; knowing how to use it is another; evaluating and interpreting the results is another skill altogether. As the article states: “Most big data that have received popular attention,” Mr. Lazer wrote in Science, “are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”
4. Are you the best person to attempt to answer the question or do you need to collaborate with others? The one thing that we have confronted time and time again in using the tools in the course has been the wall of ignorance of statistics in general and statistical modeling in particular. As a researcher, you will have to decide whether to learn this component of DH research yourself or to collaborate with experienced statistical researchers.
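To make the second reminder concrete, here is a minimal sketch of how a self-selected data set can mislead. Everything in it is invented for illustration: the population is synthetic, the 50% split is made up, and the assumption that one group posts about politics more often than the other is mine, not the article’s. It only shows the general mechanism, not what happened in any actual study.

```python
import random


def simulate_population(n_users, seed=0):
    """Build a synthetic population. All rates below are invented for illustration."""
    rng = random.Random(seed)
    users = []
    for _ in range(n_users):
        # Hypothetical: half of all users lean toward "party A".
        leans_a = rng.random() < 0.50
        # Hypothetical: party-A leaners are more likely to post about politics.
        posts_politics = rng.random() < (0.30 if leans_a else 0.10)
        users.append((leans_a, posts_politics))
    return users


def share_leaning_a(users):
    """Fraction of the given users who lean toward party A."""
    return sum(1 for leans_a, _ in users if leans_a) / len(users)


population = simulate_population(100_000)

# The "data set" a researcher would actually see: only users who post about politics.
political_posters = [user for user in population if user[1]]

true_share = share_leaning_a(population)
sample_share = share_leaning_a(political_posters)

print(f"Share leaning A, whole population:  {true_share:.2f}")
print(f"Share leaning A, political posters: {sample_share:.2f}")
```

Under these made-up rates, the political posters lean much more heavily toward one side than the population they were drawn from, so any conclusion generalized from them to all users would be skewed.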
So the question is not only “How do you do Big Data right?” but also how to avoid neglecting the research processes and methodologies that get us to the point of using Big Data.
A researcher can do everything right but still receive unexpected results. This is what science is all about and what budding digital humanities scholars need to learn. An unexpected result is often an invitation to explore further. For example, compare the Sochi Olympics results with the predictions from the article “Using Data Mining to Predict the Winter Olympics Medal Counts in Sochi.” I hope the authors explain what they got right and what went wrong.