Sunday, June 16, 2013

Data-Mining The State Department Cables

With all the recent talk about data-mining and the NSA's surveillance programs, I thought it might be interesting to experiment with some data-mining visualization techniques. So I spent a few hours data-mining the State Department cables that were leaked by Bradley Manning and distributed by WikiLeaks.

Data-mining, broadly construed, just refers to any process of taking large amounts of data and searching for useful patterns. The data could be nearly anything -- emails, phone records, medical records, Facebook status updates, and so on. What most people don't think about when there's a discussion of data-mining is that it's usually very difficult to transform that data into a form that can be analyzed. For example, if we're confronted with a lot of text (as with the State Department cables), we need to bring out the features of that text that we're interested in. If we're particularly interested in people's names, for example, we have to find a way to grab the names from the text. This isn't a problem if we already know what names we're interested in analyzing. But the data could be a lot more valuable if we could ask the computer to find the names.

This is a really hard problem. And as a result, data-mining is usually performed on data that's already in a usable form. For instance, a vast amount of research is being done on data collected from Twitter. There's a good reason for this -- tweets are already parsed and put into a format that's very easy for programmers to manipulate and search. On the other hand, much less data-mining is done on raw text like we find in the cables, or in Google's scans of books, to take two prominent examples.

But there are ways to extract information from text, and we can use those techniques to get data we need in order to data-mine. Yesterday, I spent a few hours leveraging those techniques to create this visualization of the leaked cables:

I'll explain what this visualization represents, and then I'll explain how I generated it. Finally, I'll zoom in on just a couple of clusters in the graph. You can also download the visualization as a PDF (it's zoomable).

The graph shows peoples' names, and the names of organizations mentioned in the cables. Each name is a "node" represented by a small circle. When two different names are mentioned in the same cable, we draw a line connecting the two nodes. Each node is labeled, and the size of the label is proportional to how "central" that node is in the graph. I'll explain what I mean by "central" below. Think of the lines connecting the nodes as springs -- and the more closely related the two names are, the tighter the spring. This has the effect of "pulling together" clusters of nodes that have the strongest relationship. Of course, we're inferring that there's a relationship simply from the fact that two names occur in the same cable; the computer has no way to "understand" anything more than that.

Like I said, this took only a few hours to put together, and I used only my laptop. There was no fancy hardware or commercial software. The hard part of the process is extracting the names from the cables. This was a two-step process. First, I used an open-source program that performs so-called "named entity recognition" on text. This is software that analyzes the grammar of a sentence and extracts the noun phrases from it. Then it uses some additional heuristics to determine which noun phrases are most likely to be names. This software isn't magic, however. Identifying names -- and especially determining whether it's a name of a person, location, or organization -- is really hard, and there are lots and lots of errors. So I wrote a very quick-and-dirty script that goes through each of the alleged named, and gives me the opportunity to make corrections. The program stores a list of all my corrections, and uses a fuzzy pattern-matching algorithm to apply the corrections automatically throughout the text. That way, I don't need to make corrections more than once. There are about 250,000 cables; I went through about five hundred, and the computer applied those corrections to the entire set. But the data represented in this graph is still very, very incomplete -- it represents just a small fraction of the information that we could extract from the leaked cables.

As I said above, I record whenever two names occur in the same cable. Of course, two names might happen to occur in the same cable even though they don't have anything to do with each other. So there's an additional processing step to try to address that possibility. I apply a quick-and-dirty algorithm that uses all the available data to measure how closely related two nodes are. For those who might be interested, I'm using the "mean commute time" between every pair of nodes. The total amount of code I had to write for the entire project is perhaps fifty lines in Python. I mention this in order to emphasize that it's not that hard to do something like this.

The size of each node represents the importance of that name. In order to calculate the importance of each node, I used the PageRank algorithm, which is the same algorithm that Google uses to rank web pages. It's not complicated. In order to calculate the PageRank of a node, you simply imagine that a person starts at a random node and follows links randomly. Nodes that are very important or influential will tend to be visited more often, and those get the highest PageRank. The visualization was done using an open-source program called "Gephi". In order to make the graph more readable, I deleted a couple of nodes that were too well-connected. For example, the White House is mentioned about a zillion times, and so I deleted it. It's mentioned so often that it doesn't convey any information, and just clutters up the graph.

Does this sort of technique work? Does the visualization show real relationships? The answer is yes. Take a very simple example. Here, I've zoomed in on a very small cluster at the periphery of the graph:
Clearly, this cluster is about Bosnia and war criminal Radovan Karadzic. Personally, I didn't know who Clifton M. Johnson is, and I didn't know if Del Ponte was person or something else. A two-second search on Google shows that Clifton M. Jones is a legal adviser in the State Department. Carla Del Ponte is a United Nations prosecutor for international tribunals. So this little cluster, which I chose at random, is coherent and shows relationships between items that are, in fact, closely related.

Here's another cluster of nodes from the graph:

What's interesting about this cluster is that the private businesses mentioned in the cables have all clustered to the same location in the graph. We've got IBM, Vodaphone, Motorola, Packard-Bell, Merrill Lynch, and others. The most important nearby node is the Mutilateral Investment Guarantee Agency. It's a part of the World Bank that is involved in investments, and so it's no surprise that businesses and government agencies concerned with economic development would cluster around it.

I don't know what insights might be gleaned from further analysis of this data. And as I've said already, this is a small fraction of the information that could be extracted from those cables. I'm planning on expanding this analysis over the next week or so, time allowing.