1 Csci322-Assignment3
Session 3 2011
Aim:
In this assignment you will be parsing log files from a web proxy. This is an example of a simple task a system administrator may have to embark upon. It is designed to get you warmed up for what comes next in this subject.
Task
Imagine you work for an Internet Service Provider (ISP) like Exetel or Internode with a number of customers. Each customer has a unique IP address which acts as their identifier. In this world the provider only allows users to surf the web using the Hypertext Transfer Protocol (HTTP). When a user logs on to the ISP to get access to the Internet, they would typically use a browser such as Safari (preferably) or IE to view web pages. When the user wishes to visit the site http://www.apple.com.au/index.html, the ISP logs this request and its result for billing purposes via a proxy.
Your job is to write some billing software for this Internet Service Provider using any scripting language you feel comfortable with, i.e. Python, Perl or Shell (if you want you can use C++). All user requests are stored in a log file at the ISP called proxy.log, with one user request to a line. Considering the above request, the resulting line in the log file would be
1184813281.056 0.170 137.157.60.46 200 1830 GET http://www.apple.com.au/index.html "text/html"
For this task the most important columns are:
a) Column 3 represents the IP address of the user. Each user has a unique IP address. For this assignment users will be represented by IP address.
b) Column 4 is the HTTP response code. A value of 200 means the request for the resource was successful; anything else, for the purpose of this assignment, means it failed.
c) Column 5 represents the size of the file. If the HTTP response code was not 200 then it can be assumed the size is 0, i.e. an unsuccessful request.
d) Column 6 is the HTTP query issued by the user. When a user makes a request to a web server for a web page this is known as a GET request. There are others, which for the purpose of this assignment can be ignored. A GET request is successful (in this assignment) if the HTTP response code is 200.
e) Column 7 represents the URL (resource) the user is trying to get. Generally speaking we are only concerned with the host name, i.e. www.apple.com.au. Be warned: not all hosts are qualified, i.e. sometimes they are just names. You should simply ignore these.
f) Column 8 represents the MIME type of the object. You should note that in the log file the MIME type is always enclosed in double quotes. If the proxy cannot work out the MIME type it represents it as -
Your program should parse log files of the above form. You can assume all entries in the log are in the above form (there shouldn't be any dud lines in the files). To invoke the program you would enter the following:
logwhacker logfile-name
The logwhacker command takes a filename to parse. If the arguments are invalid, an error message is printed and the program terminates.
The report you are to produce should have the following:
1. Total number of log lines read.
2. Total number of bytes downloaded for all users; this is all SUCCESSFUL HTTP GET requests.
3. A summary of each individual user's consumption in bytes; that is, the user's SUCCESSFUL HTTP GET requests.
4. The top 5 most frequently visited hosts, e.g. www.apple.com
5. The top 5 most downloaded MIME types.
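To make the column layout and the five report items concrete, here is a minimal Python sketch of one way the parsing could be approached. It is not a model solution, and several choices in it are assumptions where the task is open to interpretation: a host counts as "qualified" if it contains a dot, host visits are counted over every request rather than only successful ones, MIME types are ranked by number of successful GET requests rather than by bytes, and blank lines are not counted as log lines. The names summarize and main are hypothetical.

```python
import sys
from collections import Counter
from urllib.parse import urlsplit


def summarize(lines):
    """Collect the five report items from an iterable of log lines."""
    total_lines = 0
    total_bytes = 0
    bytes_per_user = Counter()
    host_hits = Counter()
    mime_hits = Counter()
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are not log entries
        total_lines += 1
        # Split into at most 8 columns so the MIME type stays in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        _col1, _col2, ip, status, size, method, url, mime = fields
        host = urlsplit(url).hostname
        # Assumption: a "qualified" host contains at least one dot.
        if host and "." in host:
            host_hits[host] += 1
        # Only successful GET requests contribute bytes and MIME counts.
        if method == "GET" and status == "200":
            nbytes = int(size)
            total_bytes += nbytes
            bytes_per_user[ip] += nbytes
            mime_hits[mime.strip().strip('"')] += 1
    return {
        "total_lines": total_lines,
        "total_bytes": total_bytes,
        "bytes_per_user": dict(bytes_per_user),
        "top_hosts": host_hits.most_common(5),
        "top_mime_types": mime_hits.most_common(5),
    }


def main(argv):
    """Validate the arguments, read the log file and print the report."""
    if len(argv) != 2:
        print("usage: logwhacker logfile-name", file=sys.stderr)
        return 1
    try:
        with open(argv[1], encoding="utf-8", errors="replace") as handle:
            report = summarize(handle)
    except OSError as exc:
        print(f"logwhacker: cannot read {argv[1]}: {exc}", file=sys.stderr)
        return 1
    print("Log lines read:", report["total_lines"])
    print("Total bytes downloaded:", report["total_bytes"])
    for ip, nbytes in sorted(report["bytes_per_user"].items()):
        print(f"  {ip}: {nbytes} bytes")
    print("Top 5 hosts:", report["top_hosts"])
    print("Top 5 MIME types:", report["top_mime_types"])
    return 0
```

To run this as the logwhacker command, the script would end with a call such as sys.exit(main(sys.argv)). Keeping the counting in summarize, separate from file handling and printing, means the logic can be checked against a handful of hand-written lines before it is pointed at a real proxy.log.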
Submit:
Submit your solutions to these questions to your tutor during your designated class. An extension of time for the completion of the assignment may be granted in certain circumstances. A request for an extension must be made to the Subject Coordinator before the due date. Late assignments without a granted extension will be marked, but the mark awarded will be reduced by 1 mark for each day late. Assignments will not be accepted more than three days late.