Saturday, February 16, 2013

Content Creation vs. Data Mining

Many of the most interesting and complicated systems have a central core containing a lot of information, surrounded by machinery that reads and manipulates that core. Cells are like this -- a genome at the center, surrounded by molecular machinery that reads it and switches genes on and off. Our economy is increasingly like this -- a real economy that produces value, surrounded by an enormous banking and financial services industry. We should think of the internet this way, too -- a central core of information content, surrounded by layers of machinery. Increasingly, we interact with that machinery more than we interact directly with the content.

I used to be active on the bulletin board systems of the 80s, and I first came across the internet when I installed NCSA Mosaic, the great-granddaddy of modern browsers. The internet had quite an impressive amount of content back then, but it was like a gigantic library with all the books thrown onto the floor. When someone created a new page, it was as if they had walked into the library and thrown a new book somewhere at random. There was no good way to find anything, so a lot of pages were "link farms" -- collections of links to pages on various subjects. There were link farms containing links to other link farms. The precursor to Yahoo would occasionally put out a list of internet sites, which could be printed out in about ten or twenty pages.

Not surprisingly, the first layer of machinery surrounding this content comprised the search engines, most of which were hand-curated directories. It may be difficult now to recall how much value this added to the internet, which used to be a disorganized, chaotic mess populated entirely by bleeding-edge early adopters who used it mainly because it was fascinating and new, not because it was useful.

Obviously, when Google implemented the PageRank algorithm on a large scale, it was a game-changer. The link farms went away, and the hand-curated directories started to die out, too. Google was -- and in this respect still is -- a gigantic card catalog for the library: a very efficient, large-scale index that points you to the content.
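The core idea behind PageRank is simple enough to sketch in a few lines: a page's rank is the probability that a random surfer ends up there, computed by repeatedly redistributing rank along links. Here's a minimal toy version -- the four-page link graph is made up for illustration, and this is the textbook power-iteration idea, not Google's actual production system:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank via power iteration.

    links maps each page to the list of pages it links to.
    damping=0.85 is the value used in the original PageRank paper.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page gets a small base rank, plus a share of the rank
        # of each page that links to it.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical four-page web: page -> pages it links to.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(web)
```

In this toy graph, page "c" ends up ranked highest because three of the four pages link to it -- which is exactly the property that made link farms useless and real card-cataloging possible.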

That was the first layer of machinery around the content-filled core: the organization and taming of the chaotic mass of web pages. Think of that layer as an organizing force that makes information accessible, but doesn't create any new information itself.

But an equally important aspect of Google, beyond its power to organize, is that it's automated rather than hand-curated like the earlier search engines. An automated system can run at enormous scale, and the behavior of its users can be mined for information. The result is everything from targeted online advertising to early detection of flu outbreaks. In my opinion, it's an often overlooked fact that this data-mining of user behavior is parasitic on the internet's central core of content. There may be a very large number of interesting and useful ways to mine user behavior, but that number is finite. Given only a finite amount of online content, it will become increasingly difficult to find new ways to exploit it as the number of data-mining projects grows.

Nonetheless, the number of ways to exploit content is much larger than the number of distinct kinds of content. So whenever there is a new mode of content creation, a comparatively huge number of data-mining initiatives becomes possible. For example, Wikipedia is clearly a novel way to create content. And that content gives rise to a large number of ways to read those articles, analyze the internal links, extract data from the text, and so on. Twitter supports an entire ecosystem of data-mining projects. Wikipedia and Twitter are similar in that respect.
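To make the Wikipedia example concrete: one of the simplest mining tasks mentioned above is analyzing internal links, which wiki markup writes as [[Target]] or [[Target|display text]]. A toy sketch -- the article snippet is made up, and real pipelines work over full database dumps rather than one string:

```python
import re
from collections import Counter

# Wiki markup writes internal links as [[Target]] or [[Target|display text]].
# The capture group grabs the target; the optional "|..." part is discarded.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def internal_links(wikitext):
    """Return the internal link targets found in a snippet of wiki markup."""
    return WIKILINK.findall(wikitext)

# Made-up article snippet for illustration.
article = (
    "The [[PageRank]] algorithm, developed at [[Google]], ranks pages by "
    "analyzing links, much as citations rank [[Academic publishing|papers]]."
)
counts = Counter(internal_links(article))
```

Counting link targets across many articles like this is one crude way to turn Wikipedia's content into a dataset -- which is the point: the content comes first, and a whole ecosystem of analyses like this one becomes possible afterward.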

However, Wikipedia and Twitter are different in an important respect. Wikipedia's main economic value is in the content itself. It was conceived as a reference resource, and that's still its primary use. Twitter, on the other hand, has some value as a way of sending out short bursts of information to one's followers. But that's not its main value anymore. It makes money primarily by selling its data to businesses that mine it for useful information. Twitter makes money by harnessing the unforeseen consequences of its system, whereas Wikipedia makes money (through donations) by doing a good job of what it was originally intended to do.

This is why Twitter and Facebook are more interesting than Google and Wikipedia. Twitter and Facebook represent new ways to create content online, and they combine that content creation with ways to exploit that content for data-mining purposes that were unforeseeable at the time. Google and Wikipedia each have only half of that equation. Google excels at data-mining but not at content creation -- its notorious weakness at social media is just one symptom of that more general gap. Wikipedia is great at content creation, but not at data-mining.

Data-mining opportunities are vast, but finite. Content creation opportunities are unlimited, but more difficult to execute. The predictable implication is that organizations that successfully data-mine a particular core of content will eventually run out of opportunities for growth. They then have to either stagnate or go into the content-creation business. Examples are everywhere. Netflix is now producing its own shows. Google is creating an augmented reality ecosystem around its "Google Glass" project that will undoubtedly spur a vast amount of new content creation (they did this on a smaller scale with Google Earth, which allowed people to create new layers of information). Conversely, groups that have done very well with content creation must find ways to exploit that content. The obvious example is Facebook. But there are plenty of others -- one of my favorites is StockTwits, which gives investors a way to share opinions about the stock market, and which is also being mined to extract opinions about stocks in an automated fashion.
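To illustrate what "extracting opinions in an automated fashion" can mean at its simplest, here is a toy keyword-based sentiment score. The word lists and messages are made up for illustration, and real systems built on feeds like StockTwits are far more sophisticated than counting keywords:

```python
# Toy sentiment scoring for stock-related messages.
# Word lists are illustrative, not derived from any real system.
BULLISH = {"buy", "long", "bullish", "upside"}
BEARISH = {"sell", "short", "bearish", "downside"}

def sentiment(message):
    """Crude score: bullish keyword count minus bearish keyword count."""
    words = set(message.lower().split())
    return len(words & BULLISH) - len(words & BEARISH)

# Made-up messages of the kind investors might post.
messages = [
    "Going long $AAPL, big upside ahead",
    "Time to sell $AAPL before earnings",
    "Bullish on $AAPL after the keynote",
]
total = sum(sentiment(m) for m in messages)
```

Crude as it is, this is the basic shape of the opportunity: the content (investors' messages) was created for humans to read, and the mining layer comes afterward, harvesting a signal the authors never set out to produce.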

In my opinion, there are two important questions we must answer when we try to predict the future of an internet-related business. The first is, "Are they primarily a data-mining business or a content-creation business?" The second is, "Can they effectively move to the other mode after they've started to grow?" Those are tough questions, but I think they're the right ones.