With the rise of social network sites that allow consumers to make comments and exchange ideas, we have seen an explosion of consumer-generated text data in the form of reviews, blogs, messages, etc. In this context, automated text analysis, which utilizes computers to detect and analyze language patterns underlying the messy swamp of user-produced text, offers marketers as well as researchers a deeper insight into consumer thoughts, behavior, psychology, decision-making processes, and culture.
There are six main stages in designing and executing an automated text analysis:
Stage 1: Developing a Research Question
Before conducting an automated text analysis, we should first make sure that the research question is suitable for it. Three research contexts are poorly suited to text analysis: 1) research that needs precise control to compare groups, introduce manipulations, or rule out alternative hypotheses (Cook and Campbell 1979); 2) research that concerns behavioral or unarticulated data (e.g., response time); 3) research that investigates subtle and indirect expressions or complex arguments.
Outside these contexts, automated text analysis is especially useful for discovering systematic relationships in text that human researchers can easily overlook, such as patterns of correlation or notable absences. It can also track changes in language over time and support comparisons between groups.
Stage 2: Construct Identification
After making sure that text analysis is suitable for the research question, we can then identify the construct. Linguistically, language consists of three elements: semantic, pragmatic, and syntactic. Examining each element can provide marketers with unique information about consumer thoughts, interaction, and culture. Attention can be measured through semantics (e.g., how often a word appears in a text indicates how much attention the topic receives). Processing can be examined through syntax (e.g., the frequency of conjunctions like “and” can indicate the depth of processing in reviews). Interpersonal dynamics can be studied through pragmatics (e.g., analyzing pronouns and demonstratives can reveal the degree of intimacy, authority, or self-consciousness). Group-level characteristics can also be analyzed through semantics (e.g., group attention, differences among groups, and collective structure can all be measured).
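As a rough illustration of measuring constructs through word frequency, the sketch below computes the share of words in a text that fall into a given category. The category word lists, the function names, and the sample review are all hypothetical and deliberately tiny; a real study would rely on a validated dictionary.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only, not a validated dictionary.
CATEGORIES = {
    "conjunction": {"and", "but", "or", "because", "so"},
    "first_person": {"i", "me", "my", "we", "us", "our"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_rates(text, categories=CATEGORIES):
    """Return the share of tokens that belong to each word category."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in categories.items()
    }


# Made-up example review.
review = "I loved the camera and the battery, but my lens arrived late."
rates = category_rates(review)
print(rates)
```

Dividing by the total word count keeps long and short texts comparable, which matters when reviews vary widely in length.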
Stage 3: Data Collection
After identifying the research question and constructs, the next step is data collection. There are three steps: 1) identify data sources, including designing a sampling strategy that avoids selection bias and provides an adequate sample size; 2) prepare the data, including spell-checking, data cleaning, and any necessary processing of languages other than English; 3) unitize and store the data, including creating an organized file structure through coding, or using a program and database management system.
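The preparation and unitizing steps might look something like the following minimal sketch, which strips leftover markup, splits a text into sentence-level units, and stores them in a small database. The table layout, the document ID, the source label, and the naive sentence-splitting rule are assumptions made for illustration.

```python
import re
import sqlite3


def clean(text):
    """Replace leftover HTML tags with spaces and collapse whitespace runs."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def unitize(text):
    """Split cleaned text into sentences using a naive punctuation rule."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


def store(conn, doc_id, source, units):
    """Save each unit with its document ID, source, and position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS units "
        "(doc_id TEXT, source TEXT, position INTEGER, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO units VALUES (?, ?, ?, ?)",
        [(doc_id, source, i, unit) for i, unit in enumerate(units)],
    )
    conn.commit()


# Hypothetical raw review and identifiers.
conn = sqlite3.connect(":memory:")
raw = "Great phone!<br>  Battery lasts   two days. Would buy again."
units = unitize(clean(raw))
store(conn, "review-001", "hypothetical_site", units)
print(units)
```

Keeping the document ID and position alongside each unit makes it possible to trace any later result back to the original text.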
Stage 4: Choose an Operationalization Approach and Execute Operationalization
Once data has been collected, prepared, and stored, we then need to choose a research approach and execute the operationalization.

Stage 5: Interpretation and Analysis
The next step is to analyze and interpret the results. There are three ways to apply text analysis findings to research design: 1) comparison between groups; 2) correlation between textual elements; and 3) prediction of variables outside the text.
Comparing between groups can reveal statistically meaningful differences among texts (e.g., differences in pronoun use between high-power and low-power individuals).
Comparisons over space and time can reveal how a construct varies in magnitude in response to outside changes (e.g., how discourse changes as casino gambling becomes legitimate). Message type can also serve as a comparison variable (e.g., the difference between public and private Facebook messages).
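One common way to test a difference between two groups of texts is Welch's t statistic, sketched below. The choice of test is an assumption rather than something the original paper prescribes, and the pronoun rates are made-up numbers whose direction carries no empirical meaning.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up first-person pronoun rates, one value per message.
high_power = [0.02, 0.03, 0.01, 0.02]
low_power = [0.06, 0.05, 0.07, 0.06]

t = welch_t(high_power, low_power)
print(round(t, 3))
```

A large absolute value of t suggests the two groups differ on the measured feature; a full analysis would also report degrees of freedom and a p-value.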
Correlation enables researchers to see how textual elements relate to one another and to non-textual elements (e.g., survey responses). Beyond correlational analysis, prediction uses text analysis to forecast non-textual variables (e.g., using email text and linguistic matching to predict deception).
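To make the correlational use concrete, the sketch below computes a Pearson correlation between a text feature and a survey response. Both variables and all values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up data: share of positive words per review and a 1-5 satisfaction rating.
positive_word_rate = [0.01, 0.03, 0.05, 0.07]
satisfaction = [2, 3, 5, 4]

r = pearson_r(positive_word_rate, satisfaction)
print(round(r, 2))
```

A value near 1 or -1 points to a strong linear relationship between the text feature and the outside variable, while a value near 0 points to little or none.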
Stage 6: Validation
Because the data for automated text analysis is text originally produced by consumers, external validity and ecological validity tend to be strong (Mogilner et al. 2011). Beyond that, construct validity, concurrent validity, discriminant validity, convergent validity, and predictive validity should each be addressed using different techniques.
This research was adapted and summarized from an original paper by Ashlee Humphreys and Jen-Hui Rebecca Wang, published in the Journal of Consumer Research (2017).
Written by Ashlee Humphreys, Associate Professor at Northwestern Medill IMC
Edited by Sherry Xie, Medill IMC Class Of 2018
Sources:
Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Houghton Mifflin.
Mogilner, C., Aaker, J. and Kamvar, S. (2011). How Happiness Affects Choice. Journal of Consumer Research, 39(2), pp. 429-443.