Image denoising with an Ising Model

This will follow Kevin Murphy’s example in chapter 21 of Machine Learning: A Probabilistic Perspective, but we’ll write the code in python with numpy and scipy. I wanted to understand the model better, so these are mostly notes for myself. Hope they are useful!

I’m not sure I love wordpress for heavy math and programming, so I did this one in an ipython notebook.

Data Scientists: learn to say no!

I’ve left Wave to join the Data Mining team at Microsoft Bing in Bellevue, WA. That means that the last few weeks have been a time to reflect on what my time at Wave has meant and how I can contribute in a larger organization. One role I’ve embraced that I hadn’t anticipated is saying no. Obviously, my career and my way of thinking have both benefited from an experimental, data-driven approach to product development. More tests are often, but not always, helpful to an organization operating under uncertainty. But a lot of my energy lately has been devoted to showing that some things don’t need to be tested formally.

Wait, what?

Here are two examples from the last few months.

The first was a discussion about re-formatting our company’s weekly emails from a heavily styled, image-heavy email to something with only plain text. There were good motivations behind changing the format. Among other things, it did not display well on mobile. However, the first iteration was a plain text email that went too far from our established brand and design, and I fought against testing it. It turns out, the email did go out in an A/B test and it had a slightly better click-through rate than our styled email.

Success, right?

Not really. The plain email also had higher spam rates. It did look a little spammy: bullet points and blue links in an email from a reputable company just don’t feel right. We even had one longtime customer ask whether we had been hacked. There were two lessons behind this story: bad design doesn’t need to be tested, and a test can only be as successful as the definition of success it uses. In our case, we should have tested two well-designed emails with the hypothesis that simpler emails generate higher click-through. Starting with two great designs means we would have been one iteration closer to the final email when we started. Also, we should have explicitly included spam and unsubscribe rates as part of our success metric, and we should have accounted for email platform.

In this case, we learned something from the test (maybe we are right to try a simpler email) but the cost wasn’t worth the new knowledge.
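To make the point about success metrics concrete, here’s a minimal sketch of judging a variant on click-through and spam complaints together. Everything in it is an assumption for illustration: the two-proportion z-test is just one reasonable choice, the function names are made up, and the counts are hypothetical rather than our real numbers.

```python
import math


def two_proportion_z_test(count_a, n_a, count_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Assumes large samples and at least one event overall.
    Returns (z, p_value); z > 0 means the rate in B is higher than in A.
    """
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (count_b / n_b - count_a / n_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def verdict(control, variant, alpha=0.05):
    """Ship only if clicks improve and spam complaints do not get worse."""
    click_z, click_p = two_proportion_z_test(
        control["clicks"], control["sent"], variant["clicks"], variant["sent"])
    spam_z, spam_p = two_proportion_z_test(
        control["spam"], control["sent"], variant["spam"], variant["sent"])
    if spam_z > 0 and spam_p < alpha:
        return "do not ship"  # the guardrail metric got significantly worse
    if click_z > 0 and click_p < alpha:
        return "ship"
    return "no clear winner"


# Hypothetical counts, not measurements from the real test.
styled = {"sent": 10000, "clicks": 500, "spam": 10}
plain = {"sent": 10000, "clicks": 600, "spam": 60}
print(verdict(styled, plain))
```

With counts shaped like these, the click-through win alone would say "ship", but the spam guardrail overrides it.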

Another example came from fixing a usability problem in our application. We removed a verify transaction feature from the transactions page at Wave. The feature let users separate transactions they know are correct from transactions they should review. When we heard from a lot of our users that the feature was something they had come to rely on, we decided to work on putting it back in. Again, the question was whether we should test the “new” feature in an A/B release. I said no, because the time required to engineer a test wasn’t worth the knowledge gain. In this case, the feature had already been tested in production and its removal is what caused a problem. User feedback, properly handled, is a great form of testing. In this case, we had loads of support tickets and direct customer calls to validate that we needed the feature. We will obviously track our overall application metrics before and after the fix, and we’ll track use of transaction validation to see whether it is experiencing the uptake we anticipate.

So when is a test appropriate?

A statistical test is useful when you don’t already know the answer. We already knew that our emails should be on-brand. We already knew that transaction verification was a feature our users missed.

What we don’t know, yet, is the best way to present transaction verification. Given two good designs, we may not know which email performs better (or whether either does).

A test only makes sense if the benefit of knowledge gained is worth the cost of running the test. Usually, the cost comes from two sources. First, it requires creating two working, complete features. A new feature that enters an A/B test must be production ready even if its long term deployment is up in the air. It must be QAed and considerations like performance must be met. The cost of a shoddy test feature is that it nullifies the experiment: it introduces covariates like bug rate and performance differences that mask any true differences. Second, it potentially forces some users into a sub-optimal experience. Is the cost in churn or lost brand loyalty from users interacting with a poorly performing experiment worth the knowledge gained?

Hedge your bets: test a theory not a feature

The easiest way to alter the cost/benefit relationship in an experiment is to increase the benefit by testing theories not just features.
Design your tests so that the result, in either direction, will influence your theory about how your company should work. Try to test a theory about how your users interact with your product, like our hypothesis about serving mobile email users.

Sqoop via an ssh tunnel on Amazon EMR

I’m working on a data warehouse in Hive that incorporates 35+ mysql databases, our server logs, ad logs, and several other sources into one distributed location. This article will focus on a key portion of the task: importing mysql tables into Hive via an ssh tunnel.

Sqoop: move relational tables to hadoop (and vice versa).

Sqoop is a simple tool to copy relational tables to Hive and vice versa. We’re using it for our import process only, but you could easily use it to copy cleaned results out of Hive and back into a smaller warehouse for quick analysis. Our warehouse is on Amazon’s EMR, so our first problem is (1) installing sqoop on EMR. Our security procedures also mean that our mysql databases are behind a firewall, which makes our second problem (2) constructing a proper ssh tunnel that allows us to connect to our database.

Installing sqoop on EMR

It turns out other people have solved this problem, but here’s our solution for posterity. Note that it’s essentially the same as the sources referenced.

This installer assumes that you have sqoop 1.4.2 in a tarball and that you are using mysql. Sqoop requires database drivers for the databases that you will be using. They are available with a bit of google-fu. Once you’ve sshed into your cluster, run the installer script from the master node and you will have a working sqoop instance. Now, we just need to access our data.

SSH Tunneling for sqoop

A simple ssh tunnel looks like this:

ssh -L local_port:tunnel_target:target_port user@gateway_host

This isn’t too difficult:

  • ssh is the command (and the protocol)
  • -L is a flag that requests local port forwarding, which is what makes this a tunnel
  • local_port:tunnel_target:target_port is our tunnel specification. It means that “local_port” is a port on the local machine that forwards to “target_port” on “tunnel_target”. See example below.
  • user@gateway_host specifies the user negotiating the tunnel (guyrt, in my case) and the machine that is making the connection. Usually, this is the same as tunnel_target.

My tunnel looks like this:

ssh -L 3307:

Note that, locally, port 3307 listens for traffic on the tunnel, not the default port 3306 for mysql. This is because there is a mysql instance running locally that would capture that traffic.

The hitch: this only works on the master!

Make an ssh tunnel like the one above and use sqoop to list the tables on your db:

> sqoop-1.4.3.bin__hadoop-1.0.0/bin/sqoop eval --connect jdbc:mysql://<your master ip address>:3307/my_database --username <mysql_username> -P --query "show tables"

With any luck, this will print the result of “show tables”. (Hint: your master IP address is in the default command line prompt on EMR.) Since “eval” is a simple command, it runs on the master node, which is listening for traffic on port 3307. Now, let’s try to import our data:

> sqoop-1.4.3.bin__hadoop-1.0.0/bin/sqoop import --connect jdbc:mysql://<your master ip address>:3307/my_database --username <mysql_username> -P --direct --hive-import --table some_table --where "modified < '2013-03-15'"

This will almost certainly fail. Why? When sqoop imports, it creates multiple map jobs that each make a connection to the database and select a subset of the table (splitting the table into ranges of the primary key). However, by default, an ssh tunnel only accepts connections from the machine that opened it. Your worker nodes can’t access the tunnel!
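To picture why every worker needs its own database connection, here’s a rough sketch of how a table can be split into primary key ranges, one per map job. This is an illustration only, not sqoop’s actual implementation: the function names, the column name “id”, and the choice of four mappers are all assumptions.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, inclusive, non-overlapping ranges."""
    total = max_id - min_id + 1
    size, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # Spread any remainder over the first few ranges.
        length = size + (1 if i < extra else 0)
        if length == 0:
            continue  # more mappers than rows
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def where_clauses(ranges, column="id"):
    """One WHERE clause per map job; each job opens its own connection."""
    return ["{0} >= {1} AND {0} <= {2}".format(column, lo, hi) for lo, hi in ranges]


for clause in where_clauses(split_ranges(1, 100, 4)):
    print(clause)
```

Each of those clauses is run by a different map job, and those jobs run on the worker nodes, so every worker has to be able to reach port 3307 on the master.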

The fix: let your tunnel listen for inbound traffic.

The key insight here is your worker nodes have to access the tunnel, too. To allow them access, add a “-g” flag to your tunnel:
ssh -L 3307: -g

Then all of your worker nodes can also access the tunnel and your import process will complete.
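If you want to confirm that a worker can actually reach the tunnel before kicking off a long import, a small connectivity check helps. This is a minimal sketch using only the Python standard library; the function name is made up, and the host and port you pass in are whatever your own cluster uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from a worker node, something like can_connect("<your master ip address>", 3307) should fail for a tunnel opened without “-g” and succeed once the flag is added.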

Warehousing semi-relational data

Your non-relational data has a lot of relationships.

Unless you’ve got a time series in a vacuum, all that data in your nosql database relates to other data sources in your general universe. Take your log data, which encodes information about your individual users’ interaction with their personal data. What happens when you want to segment your user activity by country, by gender, by all of those other variables in your core relational data?

Your relational data isn’t always relational.

The most common operation in a user-facing relational database is usually to retrieve dozens of rows joined across several tables. Indexes and highly evolved query engines make these operations relatively simple to do in real time even with thousands of connections. But that doesn’t mean your relational data is always relational.

Data scientists typically aggregate millions of rows (i.e. table scans) in large tables. Relative table sizes in many applications follow a log-linear relationship, with the largest tables containing millions of rows. Moreover, the questions we are asking as data scientists take a more column-oriented flavor than the product team requires (what is the trend, not what are the specific values for a single user). We are often analyzing relational data using column oriented methods and nonrelational data by combining it with our other relational data stores.


Server Log Mining: store information, not logs

Server logs in a REST API show user actions rather than just page views. I’ll show a code pattern to extract actions from server logs and to analyze them in Hive. The key lesson is “Track behavior, not logs.”

0) Two.five sentences on logs

Servers like Apache Tomcat or gunicorn create a special file called an access log that specifies information about each request the server receives. Key information includes the url that was requested, the ip address of the requester, a timestamp, a response status code, and potentially several other fields. Example: - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395

1) Log analysis platform: Hive

On an even moderately trafficked website, server logs grow by several hundred thousand rows a day. Standard “big data” platforms like Hive make the most sense to store and analyze server logs. Hive provides a query and analysis language that maps our requests to Map-Reduce jobs that are scheduled on a Hadoop cluster.

We’ll use Hive on Amazon’s Elastic MapReduce platform. Amazon’s Hive additions allow us to attribute columnar structure to flat file stores. Put another way, we can store our logs directly in S3 and use a Hive table definition to extract columns from each row in the file. The secret sauce in our mapping is a serde (serializer-deserializer), which is a .jar that defines a way to translate rows in flat files to rows in a Hive table. One option is to use the regular expression serde and define a regular expression to parse each access log line into its fields. This S.O. post has an example external table definition for an access log with the given format:

-- Log line:
-- - - [14/Jan/2012:06:25:03 -0800] "GET /users/2345/edit HTTP/1.1" 200 708 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"

-- The table name access_log is a placeholder.
CREATE EXTERNAL TABLE access_log (
`ip` STRING,
`time_local` STRING,
`method` STRING,
`uri` STRING,
`protocol` STRING,
`status` STRING,
`bytes_sent` STRING,
`referer` STRING,
`useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+) "([^"]+)" "([^"]+)".*'
)
LOCATION 's3://your-s3-buckets/';

It is relatively straightforward to parse raw access logs with a regular expression. There’s a problem though: the behavioral information we are tracking is in the URL. Parsing the log line with a regular expression will produce the contents of the URL, but it isn’t a sustainable way to parse the semantic information embedded in the URL.
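For a feel of what that regular expression extracts, here’s the same pattern applied in python. It’s a sketch: the function name is mine, and the sample line uses a made-up IP address and user agent. Note that python raw strings need single backslashes where the Hive table definition needs double ones.

```python
import re

# Same pattern as the Hive input.regex, with single backslashes.
ACCESS_LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\[]+)\] "(\w+) (\S+) (\S+)" (\d+) (\d+) "([^"]+)" "([^"]+)".*'
)

# Column names match the Hive table definition.
FIELDS = ("ip", "time_local", "method", "uri", "protocol",
          "status", "bytes_sent", "referer", "useragent")


def parse_access_log_line(line):
    """Return a dict of fields for one access log line, or None if it doesn't match."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# Made-up sample line: the IP address and user agent are placeholders.
SAMPLE_LINE = ('192.0.2.1 - - [14/Jan/2012:06:25:03 -0800] '
               '"GET /users/2345/edit HTTP/1.1" 200 708 "-" "ExampleBot/1.0"')

print(parse_access_log_line(SAMPLE_LINE)["uri"])
```

The uri field comes back as one opaque string, which is exactly the limitation: the action and the user id inside it still need their own parser.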

2) Extract and store behavior

URL patterns define actions, and identifiers within the URL tell the server what data to use to perform the action. For instance, in django, a URL description looks like this:

url(r'^user/(?P<user_id>\d+)/add/$', UserAddView.as_view(), name='add_user')

The url is specified by a regular expression with a named group user_id. Another important element in the description is the name, which is a string we can use in the reverse resolution engine to identify urls within a django app. The reverse engine will also give us all named URLs and their patterns. This gist returns all of the regular expressions for named URLs:

We can use this list of patterns to parse the URL portion of a log entry into the action requested and the data provided:
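The idea can be sketched roughly as follows. This is a simplified stand-in, not my production class: the class name SimpleURLParser is made up, the pattern list is hard-coded rather than pulled from django, and the second pattern (“edit_user”) is hypothetical.

```python
import re

# (name, pattern) pairs in the shape a django URL configuration would give us.
# The first comes from the example above; the second is hypothetical.
URL_PATTERNS = [
    ("add_user", r"^user/(?P<user_id>\d+)/add/$"),
    ("edit_user", r"^user/(?P<user_id>\d+)/edit/$"),
]


class SimpleURLParser(object):
    """Map a request path to (action name, parameters) by trying each pattern."""

    def __init__(self, patterns):
        self.compiled = [(name, re.compile(pattern)) for name, pattern in patterns]

    def parse(self, uri):
        # Drop the query string and the leading slash, which django patterns omit.
        path = uri.split("?", 1)[0].lstrip("/")
        for name, regex in self.compiled:
            match = regex.match(path)
            if match:
                return name, match.groupdict()
        return None, {}


parser = SimpleURLParser(URL_PATTERNS)
print(parser.parse("/user/42/add/"))
```

This version tries every pattern in order and leaves out the handling for optional groups.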

Two things deserve mention about this class.

First, it is not efficient. Run time is O(n * m * k) for n log lines and m url patterns of max length k. We could speed the algorithm up in multiple ways including using a trie search structure or smartly ordering the regular expression list to catch common operations first. Keep in mind, we are processing our logs once before we put them on our Hive cluster.

Second, what are those nasty regexes I’ve added to the initializer:

# If there is a non-capturing wrapper, it will be eliminated here.
self.re_replace_non_capture = re.compile(r'\(\?\:#/?\)\?')
self.re_replace_wrapped_capture = re.compile(r'\(#\?\/\?\)?\??')

These were added to handle optional fields in my company’s URL set. They identify optional and non-capturing groups in the python regular expression syntax. Don’t worry too much about them.

3) Use the JSON serde to store behavior.

In my production log tracker, I use the apachelog package to parse server logs into dictionaries, append the slug and parameters from the URLParser, and store gzipped JSON in S3. Then I use the json serde to deserialize user behavior logs into a Hive table for further analysis.
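The storage step can be sketched with the standard library alone. This is a minimal illustration under a few assumptions: one JSON object per line, made-up function names, hypothetical field names (“slug” and “params”), and no upload to S3, which is left out entirely.

```python
import gzip
import json


def write_behavior_log(records, path):
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_behavior_log(path):
    """Read the records back, mainly to check what was written."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record: parsed log fields plus the action and its parameters.
example_record = {
    "status": "200",
    "method": "GET",
    "slug": "add_user",
    "params": {"user_id": "42"},
}
```

One object per line keeps each row independent, which suits a line-oriented serde.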

add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

create external table user_behavior_log (
… your fields …
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='… comma separated list of fields in your json array …')
LOCATION 's3://your-s3-buckets/';