It reminds one superficially of mineral extraction. Who owns the seam of diamonds – the miner or the landowner? And what happens when rights are not clear, or landownership is in dispute? But this business of text or data mining is not really like that at all, and I was reminded this week, by blogging contributions from two old friends, that the ownership of the results of data extraction remains at issue: extraction from thousands or millions of unstructured files, where the data retrieved from any individual dataset may be tiny (well within most fair usage provisions) but the contribution to the whole value may be huge. Play this in the context of Big Data and real questions emerge.

Let's go back to the beginning. Here are a couple of top-of-the-head examples of life on the planet that give a clue to what is worrying me:

* According to research quoted by the UK’s National Centre for Text Mining, “fewer than 7.84% of scientific claims made in a full text article are reported in the abstract for that article”. This, they point out, makes cross-searching of articles using data mining and extraction techniques very important to science research. Fortunately JISC, the organization which licenses all journal article content from publishers on behalf of UK universities, permits researchers to data mine these files, and no doubt this was agreed with the publishers within the licence(?). But the question in my mind is this: who owns the product created by the data mining, and is this a new value which can be resold to someone else?

* Lexis Risk Management use many hundreds of public and private US data resources in their Big Data environment to profile people and companies. Both private and public data is researched, and, of course, it will often be the case that unique connections will be thrown up which encourage or discourage users from doing business with the data subject. Clearly Lexis own the result of the custom sweep of the data, and clearly it needs to be updated and amended over time as a result of fresh data becoming available, or more data being licensed into the mine. But do Lexis, or any other data extractor, own the result of the extraction process? They are able to sell a value derived from it, and that value emerges directly from the search activity and the weighting of the answers that they have accomplished. But do they own or need to own the content (which may be different in ten minutes time when another search is done on the same subject)? And can the insurance company who buys that result as part of their risk management model resell the data content itself to a third party?

I have put up two examples because I do not wish to polarize the argument into publishers v government. The issue arises in the UK, as the media lawyer’s lawyer, Laurie Kaye, has pointed out, because the Hargreaves Review of copyright law recommends the retention of rights with the data miner – so you can make new products by recombining other people’s data. The UK government has adopted this recommendation with its usual emphatic “maybe”. Elsewhere in the August world which I deserted to take a holiday, the UK government has come out with a storming approval of Open Data, and, as Shane O’Neill has repeatedly pointed out in his blogs, this contrasts sharply with the content retention policies pursued by UK civil servants, who are even now creating a Public Data Corporation in order to frustrate the political drive of their masters (how easily a licensing authority becomes a restricting body!).

There are two really troubling aspects of this for me. In the first instance, we are not going to get the data revolution, the Berners-Lee dream of linked data, the creation of hybrid workflow content modelling, or the Big Data promise of new product and service development unless there is a primary assumption in our society that all Open Web content, and all government or taxpayer-funded content, is available for data cross-searching, barring national security considerations. And unless it is a standard expectation of data leasing that discovery from multiple files creates new services for the person putting the intellectual effort into that discovery, and hopefully new wealth and employment in our society. If we simply continue to debate copyright as if it connotes the transfer of real-world rights into the digital network, then we shall constrain the major hope of intellectual property development this century.

And the second thing? Well, I am realist enough to know, after 20 years of lobbying this point, that it is unreasonable to expect the UK government to change its attitude to an information society in my lifetime. So maybe we can undermine these guardians of “my information is my power” by saying that we do not want their content – just the right to search it. After all, if it is good enough for the universities and the progress of science, it should be good enough for Ordnance Survey and the Land Registry!

References

Making Open Data Real (www.data.gov.uk/opendataconsultation)

The Public Data Corporation (http://discuss.bis.gov.uk/pdc/)

Response to the Hargreaves Report (http://www.bis.gov.uk/assets/biscore/innovation/docs/g/11-1199-government-response-to-hargreaves-review)

National Centre for Text Mining (http://www.nactem.ac.uk/)

Laurence Kaye (http://laurencekaye.typepad.com/)

Shane O’Neill (http://www.shaneoneill.co.uk/)

When a movement in this sector gets a name, then it gets momentum. The classic is Web 2.0; until Tim O’Reilly invented the term, no one knew what the trends they had been following for years were called. Similarly Big Data: now that we can see it in the room, we know what it is for and can approach it purposefully. And we know it is an elephant in this room, for no better reason than the fact that Doug Cutting called his management and sorting system for large, various, distributed, structured and unstructured data Hadoop – after his small boy’s stuffed elephant. This open source environment, which Yahoo developed over the previous five years on the model of Google’s MapReduce, has now been spun out commercially as Hortonworks – a tribute to Horton, the Dr Seuss elephant that Hadoop really was. With me so far? Ten years of development since the early years of Google, resulting in lots of ways to agglomerate, cross-search and analyse very large collections of data of various types. Two elephants in the room (only really one), and it is Big Search that is leading the charge on Big Data.
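For anyone who has not met it, the MapReduce idea at the bottom of all this is simple enough to sketch in a few lines. What follows is a toy word count in plain Python – the conventional illustration of the pattern, not anything taken from Hadoop or Hortonworks themselves: a “map” step emits key/value pairs from each record independently (which is what lets the work spread across many machines), and a “reduce” step aggregates everything that shares a key.

```python
# Toy illustration of the MapReduce pattern behind Hadoop, not Hadoop itself.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) for every word in every document, record by record."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Group the emitted pairs by key and sum the values."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    docs = [
        "big data is big",
        "search is leading the charge on big data",
    ]
    print(reduce_phase(map_phase(docs)))
    # {'big': 3, 'data': 2, 'is': 2, ...}
```

The point of the pattern is not the counting but the shape: because the map step looks at one record at a time and the reduce step only needs records grouped by key, both halves can be scattered across as many machines as the data demands.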

So what is Big Data? Apparently, data at such a scale that its very size is the first problem encountered in handling it. And why has it become an issue? Mostly because we want to distill real intelligence from searching vast tracts of stuff, regardless of its configuration, but we do not necessarily want to go to the massive expense of putting it all together in one place with common structures and metadata – or ownership prevents us from doing this even if we could afford it. We have spent a decade refining and acquiring intelligent data mining tools (the purchase of ClearForest by Reuters, as it was then, first alerted me to the implications of this trend five years ago). Now we have to mine and extract, using our tools and inference rules and advanced taxonomic structures to find meaning where it could not be seen before. So in one sense Big Data is like reprocessing spoil heaps from primary mining operations: we had an original purpose in assembling discrete collections of data and using them for specific purposes. Now we are going to reprocess everything together to discover fresh relationships between data elements.
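To make the spoil-heap point concrete, here is a minimal sketch – the datasets, field names and company names are all invented for illustration – of cross-searching two collections that were assembled for quite different purposes, and surfacing a connection that neither holds on its own.

```python
# Two collections built for unrelated purposes, cross-searched on a shared key.
planning_applications = [
    {"company": "Acme Ltd", "site": "Riverside Works", "status": "granted"},
    {"company": "Bravo plc", "site": "Old Mill", "status": "refused"},
]
insolvency_notices = [
    {"company": "Acme Ltd", "notice": "winding-up petition", "year": 2011},
]

def cross_reference(left, right, key):
    """Yield merged records wherever the two collections share a key value."""
    index = {row[key]: row for row in right}
    for row in left:
        match = index.get(row[key])
        if match:
            yield {**row, **match}

for hit in cross_reference(planning_applications, insolvency_notices, "company"):
    print(hit)
# A planning consent and a winding-up petition against the same company:
# a relationship neither dataset was built to show.
```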

What is very new about that? Nothing really. Even in this blog we discussed (https://www.davidworlock.com/2010/11/here-be-giants/) the Lexis insurance solution for risk management. We did it in workflow terms, but clearly what this is about is cross-searching and analysing data connections, where US government files, insurers’ own client records, Experian’s data sets, and whatever Lexis has in ChoicePoint and its own huge media archive, are elements in conjecturing new intelligence from well-used data content. And it should be no surprise to see Lexis launching its own open source competitor to all those elephants, blandly named Lexis HPCC.

And they are right so to do. For the pace is quickening, and all around us people who can count to five are beginning to realize what the dramatic effects of adding three data sources together might be. July began with WPP launching a new company called Xaxis (http://www.xaxis.com/uk/). This operation will pool social networking content, mobile phone and interactive TV data with purchasing and financial services content, and with geolocational and demographic content. Most of this is readily available without breaking even European data regulations (though it will force a number of players to reinforce their opt-in provisos). Coverage will be widespread in Europe, North America and Australasia. The initial target is 500 million individuals, including the entire population of the UK. The objective is better ad targeting: “Xaxis streamlines and improves advertisers’ ability to directly target specific audiences, at scale and at lower cost than any other audience-buying solution”, says its CEO. By the end of the month 13 British MPs had signed a motion opposing the venture on privacy grounds (maybe they thought of it as the poor man’s phone hacking!).

And by the end of the month Google had announced a new collaboration with SAP (http://www.sap.com/about-sap/newsroom/press-releases/press.epx?pressid=17358) to accomplish “the intuitive overlay of Enterprise data onto maps to Fuel Better Business Decisions”. SAP is enhancing its analytics packages to deal with the content needed to populate locational display: the imagined scenarios here are hardly revolutionary, but the impact is immense. SAP envisages telco players analysing dropped calls to locate a faulty tower, doing risk management for mortgage lenders, or overlaying census data. DMGT’s revolutionary environmental risk search engine Landmark was doing this to historical land use data 15 years ago. What has changed is speed to solution, scale of operation, and the availability of data filing engines, data discovery schema, and advanced analytics, leading to quicker and cheaper solutions.
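The dropped-call scenario is equally easy to sketch. The few lines below are an assumption-laden illustration, not SAP’s or anyone else’s actual pipeline: invented call records are grouped by tower, a drop rate is computed, and any tower over a notional threshold is flagged with its coordinates, ready to be pushed onto a map overlay.

```python
# Hedged sketch of the dropped-call scenario; records, fields and the
# threshold are invented for illustration only.
from collections import defaultdict

calls = [
    {"tower": "T17", "lat": 51.50, "lon": -0.12, "dropped": True},
    {"tower": "T17", "lat": 51.50, "lon": -0.12, "dropped": True},
    {"tower": "T17", "lat": 51.50, "lon": -0.12, "dropped": False},
    {"tower": "T42", "lat": 53.48, "lon": -2.24, "dropped": False},
]

def drop_rates(records):
    """Return {tower: (drop_rate, lat, lon)} aggregated from raw call records."""
    totals = defaultdict(lambda: [0, 0, None, None])  # dropped, total, lat, lon
    for call in records:
        t = totals[call["tower"]]
        t[0] += call["dropped"]
        t[1] += 1
        t[2], t[3] = call["lat"], call["lon"]
    return {k: (d / n, lat, lon) for k, (d, n, lat, lon) in totals.items()}

THRESHOLD = 0.5  # flag towers where more than half of calls drop
for tower, (rate, lat, lon) in drop_rates(calls).items():
    if rate > THRESHOLD:
        print(f"Investigate {tower} at ({lat}, {lon}): {rate:.0%} of calls dropped")
```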

In one sense these moves link to this blog’s perennial concern with workflow and the way content is used within it, and within corporate and commercial life. In another they push forward the debate on Linked Data and the world of semantic analysis that we are now approaching. But my conclusion in the meanwhile is that while Big Data is a typically faddish information market concern, it should be very clearly within the ambit of each of us who looks to understand the way in which information services and their user relevance are beginning to unfold. As we go along, we shall rediscover that data has many forms, and that mostly we are only dealing at present with “people and places” information. Evidential data, as in science research, poses other challenges. Workflow concentrations, such as Thomson Reuters are currently building into their GRC environments, raise still more issues about relationships. At the moment we should say welcome to Big Data as a concept that needed to surface, while not losing sight of its antecedents and the lessons they teach.
