Do first encounters make or break new users?

Using text features in the first comment to predict new user return on Reddit



Keywords: new user attrition, churn, Reddit, text analysis, sentiment analysis, online comments, social media, social feedback


Many new users quit a site after only one interaction. Existing studies of user return consider user characteristics and simple feedback such as upvotes, but leave potentially useful text data unstudied. Here, we analyze 700,000 first post/sole comment pairs on Reddit to determine whether the content of the comment is related to the probability that the new user returns. Using two complementary text analysis techniques, text regression (CCS) and Linguistic Inquiry and Word Count (LIWC), we demonstrate that information from the first comment a new user receives improves predictions of new user return. Our work serves as an example of extracting useful predictive features from very short text comments, and also illustrates the importance of social feedback to the experiences of new users.
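As a rough illustration of how a word-count approach in the spirit of LIWC can turn a very short comment into numeric features, the sketch below computes the share of a comment's words that fall into each of a few categories. It is not the study's pipeline: the lexicon, the category names, and the function name are hypothetical stand-ins (the actual LIWC dictionary is proprietary and much larger), and the tokenizer is a simple assumption.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; not the real LIWC dictionary.
TOY_LEXICON = {
    "posemo": {"thanks", "great", "love", "good", "welcome"},
    "negemo": {"stupid", "hate", "wrong", "bad"},
    "you": {"you", "your"},
}


def category_proportions(comment, lexicon=TOY_LEXICON):
    """Return, for each category, the fraction of tokens in `comment` that match it."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", comment.lower())
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok in words:
                counts[category] += 1
    n = len(tokens)
    # An empty comment yields 0.0 for every category rather than dividing by zero.
    return {category: (counts[category] / n if n else 0.0) for category in lexicon}


features = category_proportions("Thanks, great post! Welcome to the sub.")
```

Proportions like these could then be used as predictors of new user return alongside non-text features such as upvotes, in whatever classifier is being evaluated.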

Author Biography

Emma Mary Klugman, Harvard University

Emma Klugman is a doctoral student in Education and Data Science at Harvard University.

