Conceptualizing Mining of Firm's Web Log Files

Ruangsak TRAKUNPHUTTHIRAK, Yen CHEUNG, Vincent C. S. LEE

Journal of Systems Science and Information, 2017, Vol. 5, Issue 6: 489-510. DOI: 10.21078/JSSI-2017-489-22

Abstract

In this era of a data-driven society, useful data (Big Data) is often unintentionally ignored because convenient tools are lacking and software is expensive. For example, web log files can be used to identify explicit information about users' browsing patterns when they access web sites. Some hidden information, however, cannot be derived directly from the log files, and external resources may be needed to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. The outcome of this study sets further directions for investigating what implicit information is embedded in log files and how it can be extracted efficiently and effectively. Further work involves incorporating social media data to improve the quality of business decisions.
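
As a rough illustration of the explicit information that web log files expose, the following minimal sketch parses entries in the widely used Common Log Format and tallies page requests and per-host page sequences. It is not the authors' method: the log format, the function names `parse_line` and `browsing_patterns`, and the sample entries are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed input: Common Log Format lines, i.e.
# host ident authuser [timestamp] "METHOD path protocol" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


def browsing_patterns(lines):
    """Count successful page requests and record the pages each host visited, in order."""
    page_hits = Counter()
    paths_by_host = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        # Skip malformed lines and any request that did not return a 2xx status.
        if entry is None or not entry["status"].startswith("2"):
            continue
        page_hits[entry["path"]] += 1
        paths_by_host[entry["host"]].append(entry["path"])
    return page_hits, dict(paths_by_host)


# Hypothetical sample entries, invented for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2017:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.0.2.7 - - [10/Oct/2017:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2017:13:57:40 +0000] "GET /missing.html HTTP/1.1" 404 512',
]

page_hits, paths_by_host = browsing_patterns(SAMPLE_LOG)
```

Counts and page sequences of this kind are the explicit part of the picture. The hidden information the abstract refers to, such as who the visitors are or why they browse as they do, is not recoverable from these fields alone.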

Key words

web usage mining / web log files / Big Data / machine learning / business intelligence

Cite this article

Ruangsak TRAKUNPHUTTHIRAK, Yen CHEUNG, Vincent C. S. LEE. Conceptualizing Mining of Firm's Web Log Files. Journal of Systems Science and Information, 2017, 5(6): 489-510. https://doi.org/10.21078/JSSI-2017-489-22

Funding

Supported by the Royal Thai Government Scholarship and by Resources Support from the Faculty of IT, Monash University
