Automated Extraction of Typical Expressions Describing Product Features from Customer Reviews

  • Karel Barák, Mendelu University in Brno
  • František Dařena, Mendelu University in Brno
  • Jan Žižka, Mendelu University in Brno
Keywords: product aspects identification, text mining, cluster analysis, feature selection


The paper presents a procedure that helps reveal topics hidden in large collections of textual documents (such as customer reviews) related to a certain group of products or services. Along with identifying the groups that contain the topics, the procedure produces lists of important expressions that help in understanding what most typically characterizes each aspect from a semantic point of view. The procedure includes determining an appropriate number of groups representing the prevailing topics, partitioning the documents into the desired number of groups using clustering, extracting significant typical features of the documents in each group by applying feature selection methods, and evaluating the outcomes with the assistance of a human expert. The results show that the presented approach, consisting mostly of automated steps, is able to separate and characterize the aspects of a certain product as discussed by the customers, and that its outcomes can later be useful, e.g., for handling customer complaints, designing promotional campaigns, or improving the products.
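
The feature extraction step can be illustrated with a minimal sketch. It assumes the reviews have already been partitioned into groups and scores each word against each group with the chi-square statistic, one common feature selection measure; the abstract does not state which measures the authors used. The function names, the toy reviews, and the cluster labels below are hypothetical and serve only as illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/cluster contingency table.

    n11: documents in the cluster that contain the term
    n10: documents outside the cluster that contain the term
    n01: documents in the cluster without the term
    n00: documents outside the cluster without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def typical_terms(docs, labels, top_n=3):
    """Return the top_n highest-scoring terms for every cluster label."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocab = set().union(*tokenized)
    result = {}
    for cluster in sorted(set(labels)):
        scores = {}
        for term in vocab:
            n11 = n10 = n01 = n00 = 0
            for tokens, label in zip(tokenized, labels):
                has_term = term in tokens
                in_cluster = label == cluster
                if has_term and in_cluster:
                    n11 += 1
                elif has_term:
                    n10 += 1
                elif in_cluster:
                    n01 += 1
                else:
                    n00 += 1
            # Keep only terms that are positively associated with the cluster.
            if n11 * n00 > n10 * n01:
                scores[term] = chi_square(n11, n10, n01, n00)
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cluster] = ranked[:top_n]
    return result


# Hypothetical toy reviews with cluster labels standing in for clustering output.
docs = [
    "battery life is short",
    "battery drains fast",
    "battery died quickly",
    "screen is bright",
    "screen cracked easily",
    "screen looks sharp",
]
labels = [0, 0, 0, 1, 1, 1]

print(typical_terms(docs, labels, top_n=1))
# prints {0: ['battery'], 1: ['screen']}
```

In this toy setting the word "is" appears equally often in both groups and is therefore dropped, while "battery" and "screen" emerge as the expressions typical of their groups. With real review collections, the resulting lists would still be passed to a human expert for interpretation, as the procedure describes.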

