Web Surfer Tracking using Big Data Technologies

B. B. Jayasingh

Department of IT, CVR College of Engineering, Ibrahimpatan, Hyderabad, India

Abstract

Web server log files are semi-structured: the server writes them as very large flat text files. The Hadoop file system is therefore well suited to storing them, and Hadoop's MapReduce model works well on distributed clusters because it processes the input one line at a time. In this paper, a NASA web server log is used as input to a system that performs session identification on a distributed Hadoop cluster. The analyst loads the log file, analyzes it with MapReduce, and generates statistical reports based on the total count of visits per hour, per day, and per date. The R language is used to identify the user sessions and to analyze them. The system analyzes the traffic recorded in the log file by parameters such as hour of the day, day of the week, and day of the month. Using the IP address field of the web server log, it also reports detailed location information for each user: country, state, city, zip code, and coordinates (latitude and longitude).
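The line-at-a-time counting and session identification described above can be illustrated with a minimal sketch. It is not the system reported in this paper: it runs in a single Python process rather than on Hadoop or in R, it assumes the log lines follow the Common Log Format, and the 30-minute inactivity timeout used to split sessions is an assumed heuristic. The function names (`parse_line`, `map_hits`, `reduce_counts`, `count_sessions`) and the sample lines with their host names are hypothetical and exist only for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$')
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_line(line):
    """Return (host, timestamp) for one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, stamp = match.group(1), match.group(2)
    # %b expects English month abbreviations (default C locale assumed).
    return host, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


def map_hits(line):
    """Map step: emit (key, 1) pairs for the hour, weekday, and date of a hit."""
    parsed = parse_line(line)
    if parsed is None:
        return []
    _, when = parsed
    return [
        (("hour", when.hour), 1),
        (("weekday", WEEKDAYS[when.weekday()]), 1),
        (("date", when.date().isoformat()), 1),
    ]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def count_sessions(lines, timeout=timedelta(minutes=30)):
    """Count sessions per host; a gap longer than `timeout` starts a new one.

    The 30-minute default is an assumption, not a value taken from the paper.
    """
    hits_by_host = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            hits_by_host[parsed[0]].append(parsed[1])
    sessions = {}
    for host, times in hits_by_host.items():
        times.sort()
        gaps = sum(1 for prev, cur in zip(times, times[1:])
                   if cur - prev > timeout)
        sessions[host] = gaps + 1
    return sessions


# Made-up sample lines in Common Log Format (not taken from the NASA log).
SAMPLE = [
    'host-a.example - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 6245',
    'host-a.example - - [01/Jul/1995:00:10:00 -0400] "GET /logo.gif HTTP/1.0" 200 1204',
    'host-a.example - - [01/Jul/1995:01:00:00 -0400] "GET /news.html HTTP/1.0" 200 3985',
    'host-b.example - - [01/Jul/1995:00:00:06 -0400] "GET /index.html HTTP/1.0" 200 6245',
]

if __name__ == "__main__":
    totals = reduce_counts(pair for line in SAMPLE for pair in map_hits(line))
    for key, count in sorted(totals.items(), key=str):
        print(key, count)
    print(count_sessions(SAMPLE))
```

Because each log line is mapped independently and the reduce step only sums values per key, the same two functions could in principle be split across cluster nodes, which is the property that makes this kind of log a natural fit for MapReduce. Session counting, by contrast, needs all hits of one host grouped together, so on a cluster the host would serve as the grouping key.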