Server logs in a REST API show user actions rather than just page views. I’ll show a code pattern to extract actions from server logs and to analyze them in Hive. The key lesson is “Track behavior, not logs.”
0) 2.5 sentences on logs
Servers like Apache Tomcat or gunicorn write a special file called an access log that records information about each request the server receives. Key fields include the URL that was requested, the IP address of the requester, a timestamp, a response status code, and potentially several others. Example:
192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395
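To make the fields concrete, here is a minimal sketch in plain Python that pulls the pieces out of the example line above. The group names are my own labels, not part of any standard:

```python
import re

# A minimal sketch: split the example access-log line above into its
# fields. The group names are illustrative, not a standard.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d+) (?P<bytes>\d+)'
)

line = ('192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] '
        '"GET /cgi-bin/try/ HTTP/1.0" 200 3395')
fields = LOG_RE.match(line).groupdict()
# fields['ip'] == '192.168.2.20', fields['status'] == '200'
```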
1) Log analysis platform: Hive
On even a moderately trafficked website, server logs grow by several hundred thousand rows a day. Standard “big data” platforms like Hive make the most sense for storing and analyzing them. Hive provides a query and analysis language that compiles our requests into MapReduce jobs scheduled on a Hadoop cluster.
We’ll use Hive on Amazon’s Elastic MapReduce platform. Amazon’s Hive additions let us attribute columnar structure to flat file stores. Put another way, we can store our logs directly in S3 and use a Hive table definition to extract columns from each row in the file. The secret sauce in our mapping is a serde (serializer-deserializer), a .jar that defines how to translate rows in flat files into rows in a Hive table. One option is to use the regular expression serde and define a regular expression that parses each access log line into its fields. This Stack Overflow post has an example external table definition for an access log with the given format:
-- Log line:
-- 220.127.116.11 - - [14/Jan/2012:06:25:03 -0800] "GET /users/2345/edit HTTP/1.1" 200 708 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
-- One column per capture group in the regex; the column names are illustrative.
CREATE EXTERNAL TABLE access_log (
  ip STRING,
  time_stamp STRING,
  method STRING,
  url STRING,
  protocol STRING,
  status STRING,
  bytes_returned STRING,
  referer STRING,
  user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+) "([^"]+)" "([^"]+)".*'
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket/access-logs/';  -- point at the S3 prefix that holds the raw logs
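Once the table exists, ordinary HiveQL runs directly against the raw files in S3. A sketch, assuming the table defines a status column for the response code:

```sql
-- Count requests per status code straight off the raw logs.
SELECT status, COUNT(*) AS requests
FROM access_log
GROUP BY status
ORDER BY requests DESC;
```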
It is relatively straightforward to parse raw access logs with a regular expression. There’s a problem, though: the behavioral information we want to track lives in the URL. A regular-expression parse of the log line will produce the contents of the URL, but it isn’t a sustainable way to extract the semantic information embedded in the URL.
2) Extract and store behavior
URL patterns define actions, and identifiers within the URL tell the server what data to use to perform the action. For instance, in django, a URL description looks like this:
url(r'^user/(?P<user_id>\d+)/add/$', UserAddView.as_view(), name='add_user')
The URL is specified by a regular expression with a named group, user_id. Another important element of the description is the name, a string we can use in the reverse-resolution engine to identify URLs within a django app. The reverse engine will also give us all named URLs and their patterns. This gist returns all of the regular expressions for named URLs:
We can use this list of patterns to parse the URL portion of a log entry into the action requested and the data provided:
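The original class was embedded as a gist; here is a stripped-down sketch of the idea, with my own names. It matches a request path against a list of (name, regex) pairs, such as the ones harvested from django’s reverse-resolution engine:

```python
import re

class URLParser:
    """Sketch of the parser described above (names are illustrative).

    Takes (name, regex) pairs -- e.g. harvested from django's
    reverse-resolution engine -- and maps a request path to the
    action (slug) plus the identifiers captured by named groups.
    """

    def __init__(self, named_patterns):
        self.patterns = [(name, re.compile(rx)) for name, rx in named_patterns]

    def parse(self, path):
        path = path.lstrip('/')  # django patterns omit the leading slash
        for name, rx in self.patterns:
            match = rx.match(path)
            if match:
                return {'slug': name, 'params': match.groupdict()}
        return {'slug': None, 'params': {}}

parser = URLParser([('add_user', r'^user/(?P<user_id>\d+)/add/$')])
result = parser.parse('/user/2345/add/')
# result == {'slug': 'add_user', 'params': {'user_id': '2345'}}
```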
Two things deserve mention about this class.
First, it is not efficient: run time is O(n * m * k) for n log lines and m URL patterns of maximum length k. We could speed the algorithm up in several ways, including a trie search structure or ordering the regular-expression list so that common operations match first. Keep in mind, though, that we process our logs only once, before we put them on our Hive cluster.
Second, what are those nasty regexes I’ve added to the initializer:
# If there is a non-capturing wrapper, it will be eliminated here.
self.re_replace_non_capture = re.compile(r'\(\?\:#/?\)\?')
self.re_replace_wrapped_capture = re.compile(r'\(#\?\/\?\)?\??')
These were added to handle optional fields in my company’s URL set. They identify optional and non-capturing statements in the python regular expression syntax. Don’t worry too much about them.
3) Use the JSON serde to store behavior.
In my production log tracker, I use the apachelog package to parse server logs into dictionaries, append the slug and parameters from the URLParser, and store gzipped JSON in S3. Then I use the JSON serde to deserialize the user behavior logs into a Hive table for further analysis.
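As a rough sketch of the storage step (the function name and record layout are my own; the real pipeline parses with apachelog rather than writing records by hand):

```python
import gzip
import json

def write_behavior_log(records, path):
    # One JSON object per line, gzipped: the layout the JSON serde
    # reads back on the Hive side. `records` are parsed log dicts
    # already enriched with the slug/params from the URL parser.
    with gzip.open(path, 'wt') as out:
        for rec in records:
            out.write(json.dumps(rec) + '\n')

records = [{'ip': '192.168.2.20', 'status': '200',
            'slug': 'add_user', 'params': {'user_id': '2345'}}]
write_behavior_log(records, 'behavior.json.gz')
```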
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

create external table user_behavior_log (
  … your fields …
)
row format serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='… comma separated list of fields in your json array …' )
location 's3://…';  -- the S3 prefix holding the gzipped JSON behavior logs