The increasing availability of digitized text presents enormous opportunities for social scientists.
Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible.
Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category.
Unfortunately, even a method that classifies a high percentage of individual documents correctly can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly.
We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
Existing supervised methods of analyzing textual data come primarily from the tremendously productive computer science literature. This literature has focused on optimizing the goals of computer science, which for the most part involve maximizing the percentage of documents correctly classified into a given set of categories.
We do not offer a way to improve on the computer scientists’ goals. Instead of seeking to classify any individual document, most social science literature that has hand- (or computer-) coded text is primarily interested in broad characterizations about the whole set of documents, such as unbiased estimates of the proportion of documents in given categories. Unfortunately, since they are optimized for a different purpose, computer science methods often produce biased estimates of these category proportions.
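To make the source of this bias concrete, the following minimal sketch works through the arithmetic for a two-category problem. It is an illustration under stated assumptions, not the method developed in this article: it simply tallies the labels of a hypothetical classifier, and the numbers (a true category share of 0.10, with sensitivity and specificity both 0.90) and the function names are invented for the example.

```python
# Illustrative arithmetic only: all numbers and names here are assumptions,
# not results or code from the article.

def tallied_share(true_share, sensitivity, specificity):
    """Expected share of documents labeled positive when labels are counted.

    True positives contribute sensitivity * true_share; false positives
    contribute (1 - specificity) * (1 - true_share).
    """
    return sensitivity * true_share + (1.0 - specificity) * (1.0 - true_share)


def accuracy(true_share, sensitivity, specificity):
    """Expected proportion of individual documents classified correctly."""
    return sensitivity * true_share + specificity * (1.0 - true_share)


true_share = 0.10   # assumed true proportion of documents in the category
sensitivity = 0.90  # assumed P(labeled positive | truly positive)
specificity = 0.90  # assumed P(labeled negative | truly negative)

overall_accuracy = accuracy(true_share, sensitivity, specificity)
estimated_share = tallied_share(true_share, sensitivity, specificity)
bias = estimated_share - true_share

print(f"accuracy={overall_accuracy:.2f} "
      f"estimated={estimated_share:.2f} true={true_share:.2f}")
# prints: accuracy=0.90 estimated=0.18 true=0.10
```

In this hypothetical case, 90% of individual documents are classified correctly, yet the tallied proportion is nearly double the true one, because false positives drawn from the large negative group outnumber the false negatives drawn from the small positive group. The two kinds of error cancel only when the category shares and error rates happen to balance.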
Although we have included only a few applications in this article, the methods offered here would seem applicable to many analyses that may not have been feasible previously. With the explosion of numerous types and huge quantities of text available to researchers on the web and elsewhere, we hope social scientists will begin to use these methods, and develop others, to harvest this new information and to improve our knowledge of the political, social, cultural, and economic worlds.