When a movement in this sector gets a name, it gets momentum. The classic is Web 2.0; until Tim O’Reilly popularized the term, no one knew what the trends they had been following for years were called. Similarly Big Data: now that we can see it in the room, we know what it is for and can approach it purposefully. And we know it is an elephant in this room, for no better reason than the fact that Doug Cutting called his management and sorting system for large, various, distributed, structured and unstructured data Hadoop – after his small boy’s stuffed elephant. And this open source environment, which Yahoo developed over the previous five years on the model of Google’s published MapReduce work, is now being commercialized through a Yahoo spin-off officially named Hortonworks, in tribute to the elephant from Dr Seuss who Hadoop really was. With me so far? Ten years of development since the early years of Google, resulting in lots of ways to agglomerate, cross-search and analyse very large collections of data of various types. Two elephants in the room (only really one), and it is Big Search that is leading the charge on Big Data.

So what is Big Data? Apparently, data at such a scale that its very size is the first problem encountered in handling it. And why has it become an issue? Mostly because we want to distill real intelligence from searching vast tracts of stuff, whatever its configuration, but we do not necessarily want to go to the massive expense of putting it all together in one place with common structures and metadata – or ownership prevents us from doing this even if we could afford it. We have spent a decade refining and acquiring intelligent data mining tools (the purchase of ClearForest by Reuters, as it was then, first alerted me to the implications of this trend 5 years ago). Now we have to mine and extract, using our tools and inference rules and advanced taxonomic structures to find meaning where it could not be seen before. So in one sense Big Data is like reprocessing spoil heaps from primary mining operations: we had an original purpose in assembling discrete collections of data and using them for specific purposes. Now we are going to reprocess everything together to discover fresh relationships between data elements.

What is very new about that? Nothing really. Even in this blog we discussed (https://www.davidworlock.com/2010/11/here-be-giants/) the Lexis insurance solution for risk management. We did it in workflow terms, but clearly what this is about is cross-searching and analysing data connections, where the US government files, insurers’ own client records, Experian’s data sets, and whatever Lexis has in ChoicePoint and its own huge media archive, are elements in conjecturing new intelligence from well-used data content. And it should be no surprise to see Lexis launching its own open source competitor to all those elephants, blandly named Lexis HPCC.

And they are right so to do. For the pace is quickening and all around us people who can count to five are beginning to realize what the dramatic effects of adding three data sources together might be. July began with WPP launching a new company called Xaxis (http://www.xaxis.com/uk/). This operation will pool social networking content, mobile phone and interactive TV data with purchasing and financial services content and with geolocational and demographic content. Most of this is readily available without breaking even European data regulations (though it will force a number of players to reinforce their opt-in provisos). Coverage will be widespread in Europe, North America and Australasia. The initial target is 500 million individuals, including the entire population of the UK. The objective is better ad targeting; “Xaxis streamlines and improves advertisers ability to directly target specific audiences, at scale and at lower cost than any other audience-buying solution” says its CEO. By the end of this month 13 British MPs had signed a motion opposing the venture on privacy grounds (maybe they thought of it as the poor man’s phone hacking!).

And by the end of the month Google had announced a new collaboration with SAP (http://www.sap.com/about-sap/newsroom/press-releases/press.epx?pressid=17358) to accomplish “the intuitive overlay of Enterprise data onto maps to Fuel Better Business Decisions”. SAP is enhancing its analytics packages to deal with the content needed to populate locational displays: the imagined scenarios here are hardly revolutionary but the impact is immense. SAP envisage telco players analysing dropped calls to locate a faulty tower, or doing risk management for mortgagers, or overlaying census data. DMGT’s revolutionary environmental risk search engine Landmark was doing this to historical land use data 15 years ago. What has changed is speed to solution, scale of operation, and availability of data filing engines, data discovery schema, and advanced analytics leading to quicker and cheaper solutions.

In one sense these moves link to this blog’s perennial concern for workflow and the way content is used within it and within corporate and commercial life. In another they push forward the debate on Linked Data and the world of semantic analysis that we are now approaching. But my conclusion in the meanwhile is that while Big Data is a typically faddish information market concern, it should be very clearly within the ambit of each of us who looks to understand the way in which information services and their user relevance are beginning to unfold. As we go along, we shall rediscover that data has many forms, and mostly we are only dealing at present with “people and places” information. Evidential data, as in science research, poses other challenges. Workflow concentrations, as Thomson Reuters are currently building into their GRC environments, raise still more issues about relationships. At the moment we should say welcome to Big Data as a concept that needed to surface, while not losing sight of its antecedents and the lessons they teach.

In a world of remarkable events (I am trying to write this against a background of Rebekah Wade/Brooks getting arrested, the likelihood of News Corp selling its UK newspapers being discussed as a serious option, and the suggestion that now is a good time for Rupert to start sacrificing some children, while Fox News suggests that we should put the phone tapping issues aside and maturely move on (http://www.theatlantic.com/national/archive/2011/07/the-most-incredible-thing-fox-news-has-ever-done/242037/)) it is hard to concentrate. Part of me rejoices at the acceleration of change in media markets, part is saddened by the loss of jobs belonging to people with no share in the wrong-doing. Part of me stares in wonderment: there really is nothing to match the British in one of their periodic outbreaks of public morality; however hypocritical they may be, the political and the chattering classes devour each other in the media with an energy unmatched since Herod’s slaughter of the innocents!

So let’s discuss, in the spirit of Fox News, something that merits some consideration. Between the time Rupert bought MySpace and the time he closed it, huge changes took place. These were partly to do with the emergence of the Facebook hegemony, partly because of emerging valuations for that service, LinkedIn and Twitter, partly because the succession in fashion terms seems slow to hit its stride (though I am still betting on FourSquare). But they were more to do with the emergence of a new culture around networked relations with other people, which has driven us all into discussions of social marketing, exploiting natural communities, building loyalty through networking customers, and finding out much more about user behaviours. In the information industry we have seen these issues as extensions of our CRM, with the apparent aspiration that in the Salesforce world of tomorrow we shall be able to assemble everything we need to know about the user in some Cloud-based solution platform and feed our relationships with customers in a wholly personalised way.

But what if this is not so? Since the dawn of the Web users have been stronger in marketing relationships than vendors, despite the belief of vendors that they can use real world techniques to establish virtual world advantages. We pay lip service to the idea that advertising may be affected, even replaced, by user recommendation, then spend longer periods of time arguing why it will never happen. Because, viscerally, we do not want it to happen.

And yet it may be the least of what is likely to happen, and if we seek evolution rather than revolution then we need to put our heads into some emerging user positions. An important one of these is VRM (Vendor Relationship Management), in which individual users decide how to hold and store critical information about themselves (not their descriptors – age, sex etc – but their performance as buyers and sellers, readers and browsers, etc). What will happen when statistically significant groups of people get far enough down the road to Data Literacy (probably the most important untaught subject in our education systems) to practise what one leading practitioner and media influencer in this sector, Adriana Lukas (http://www.mediainfluencer.net/), calls “Self-Hacking” and others term QS (Quantified Self)? We are told on the Web that “markets are conversations”. Well, they are also relationships and transactions, and if users are able to hold and use aggregate knowledge of their web footprint then they have a considerable weapon in the battle to persuade vendors that free users are better than captive ones, and that each of us is likely to be the best advertiser of what we ourselves want.

What are the signs of progress towards this new world of “ambient intimacy”? Have a look first at the joint Harvard – Berkman Center programme around Doc Searls’ work on Project VRM (http://cyber.law.harvard.edu/projectvrm) and the EmanciPay work programme. This has deep roots, and recalls Searls’ pronouncement in the Cluetrain Manifesto:

[Image: excerpt from the Cluetrain Manifesto]

And if you think this is just a one-off research-funded effort, have a look at Diaspora’s alpha (http://blog.joindiaspora.com/what-is-diaspora.html) or at TrustFabric (www.trustfabric.org). As Facebook begins slowly to lose growth and starts a marginal decline, there may be space for a new/old view of networked relationships. Of course this is an issue intimately related to privacy (see what Mozilla propose in their Drumbeat environment with privacy icons: https://drumbeat.org/en-US/projects/privacy-icons/). And then look at MyCube (www.mycube.com), and, if you think that personal data management does not relate to what the real world does, see what the Guardian does in its datastore environment (http://www.guardian.co.uk/data) to sort and re-aggregate diverse datastreams.

Still too distant to grasp? Buyosphere (http://buyosphere.com) paints a picture of semantic-web-based shopping in beta, and Zaarly (www.zaarly.com) is a first attempt at doing community cross-selling in geolocational contexts. This is the beginning of a new, post-Facebook world, and must be grasped now if we are to migrate towards it. Happy travels!
