Somewhere in the UK’s Palace of Westminster, where Members of Parliament and their lordships, the Peers of the Upper House, will be recalled to discuss Britain’s urban conflagration, a tiny group of MPs has until recently been locked in solemn conclave in a Committee Room to finalize a report on… (no, not the economic disasters, or the future of communication with the Murdochs, but…) …the future of peer review in scholarly science publishing. To the profound relief of traditional science journal publishers everywhere, the report concludes that nothing much needs to be done by anyone to anything at any great rate, especially if change requires investment (http://www.publications.parliament.uk/pa/cm201012/cmselect/cmsctech/856/85602.htm). Yet some far-seeing and radical publishers are beginning to change traditional rules – so why do they see the need for action while others crave only the status quo?

It may well be that the largest long-term impact of Open Access on scientific research reporting is not on business models or usage rights but, as this report goes to great lengths to deny, on peer review. We now have a situation where growing volumes of articles, waiting at the gate for time-consuming single- or double-blind reviewing by unpaid armies of academics, are increasingly able to move past the “does it have an impact on the study of this subject?” test and face only the essential but simpler technical scrutiny of the question “is this good scientific method and is it reproducible?”. This is the revolution wrought by PLoS ONE, and it is now being followed widely (Nature Communications and elsewhere). The Select Committee enquiry, while paying lip service to this, never quite brought itself to the point of grasping what has happened, and the evidence of publishers allowed it to luxuriate happily in that innocence.

The title of this piece is a quote from an American commentator given in evidence. It is one of the few peripheral signs of a widespread fear that the game is up for peer review where that means journals that clique-ishly track one school of thought and exclude others, where originality and innovation may fail at the barrier of even double-blind reviewing (in niche sub-disciplines of some life sciences, everyone knows everyone’s research areas, or can look them up, anyway). No more than passing reference was made to gender prejudice, or indeed to the craving of some journals to deter BRIC-based research – and of others to artificially encourage it.

Having taken its evidence from those on both sides anxious for the status quo to be preserved, it is not surprising that the Parliamentary Committee drew a blank. While it noted that the BMJ followed an Open Peer Review methodology, it did not appear to feel it necessary to recommend this to others. Yet an increasing body of opinion seems to be saying that while a simple set of technical tests may be all that is needed to get into the database, all of these processes must be completely open. Peers should be brave enough to stand behind their views, review notes and correspondence should be published, evidential data supporting the experiments should be available, and, post-publication, notes and correspondence relating to the reproducibility of the experiment should appear alongside the article. At the same time, bibliometrics relating to citation and to actual usage must also be maintained. It seems to some observers that publishers might have to dismantle the cozy editorial relationships that surround current practice in favour of appointing paid full-time investigators to give a thorough and documented public technical report on whether the paper applies recognizable scientific method aligned with accepted good practice, and then track and publish reactions to the work within the community. In other words, a different way of spending the £1.9 bn said in RIN’s 2008 report to be the cost of peer review.

And then you see publishers with a real sense of community ownership beginning to build the tools that will allow them to do this. This week the American Institute of Physics (www.aip.org) launched its iPeerReview tool, allowing authors and reviewers to download an app to their iPhone/iPad enabling them to review the status of, and work on, submitted articles in progress. This extends AIP’s existing workflow environments, Scitation and Peer X-Press. The day is not far off when this type of workflow tool will be not only omnipresent but also transparent, and while some competitive issues, especially around patents applied for, will need careful handling, so much of this research is pre-competitive that this may be less of a problem than it first appears. Publishing evidential data may be more of an issue. Publishers and academic administrators currently chorus that the cost would be excessive, but surely they cannot be talking, as they did to the Parliamentary Committee, about the costs of storage, since those are a fraction of what they were a decade ago, and anyway the evidence is already stored by the research project – it just needs to be linked and accessible, and transparently available for other researchers, with permission, to search with their own tools alongside other experimental data, using the range of mining and extraction tools now open to them. Publishers should perhaps be in the forefront of extending this service base to their communities of users: those giving evidence to the committee seemed more anxious to defend the journal, as if it were a craft skill like dry stone walling, hedge laying or wattle hurdle making.

And then I came across GSE Research.com, a new project in beta which will launch in the fall. It aims to provide an effective Open Access platform for research into Governance, Environmental Science and Sustainability, importantly relating research to practice and allowing users full community participation alongside researchers and professionals. But it was not the built-digital features (how much easier without a print legacy!), or the social investment fund, or even the Research Exchange that first attracted me. It was the emphasis on putting in, alongside the option for a traditional review model, a fast-publication Open Peer Review system, in which the Editor makes the first decision and the community is able to comment, build and improve the result. “We need to learn to include, not exclude, and give the peer community a chance to decide what is relevant, not just a handful of individuals.” This is a project to watch, but also a trend to be noted. (www.gseresearch.com)

So should we be surprised or disappointed with the result of the Commons deliberations? As a UK taxpayer, I feel like asking for my money back, but as an observer of Parliamentary Committees, noting the number of times the Murdochs and their executives appeared before them before the Great Hacking Scandal broke, surprise would hardly be in the range of available emotional responses.

When a movement in this sector gets a name, then it gets momentum. The classic is Web 2.0; until Tim O’Reilly coined it, no one knew what the trends they had been following for years were called. Similarly Big Data: now that we can see it in the room, we know what it is for and can approach it purposefully. And we know it is an elephant in this room, for no better reason than the fact that Doug Cutting named his management and sorting system for large, various, distributed, structured and unstructured data Hadoop – after his small boy’s stuffed elephant. This open source environment, developed at Yahoo over the previous five years as an implementation of the MapReduce model described by Google, is now being commercialized through a Yahoo spin-off named Hortonworks, in tribute to the Dr Seuss elephant that Hadoop really was. With me so far? Ten years of development since the early years of Google, resulting in lots of ways to agglomerate, cross-search and analyse very large collections of data of various types. Two elephants in the room (really only one), and it is Big Search that is leading the charge on Big Data.
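For readers who have not met it, the MapReduce model that Hadoop implements can be sketched in a few lines. This is a toy illustration in plain Python of the classic word-count example – not Hadoop’s actual API – showing the three phases a framework like Hadoop runs at scale across thousands of machines:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key -- the step the
    # framework performs between mappers and reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: combine all values for each key into a final result.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big search", "search leads big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # e.g. {'big': 3, 'data': 2, 'search': 2, 'leads': 1}
```

The point of the pattern is that mappers and reducers are independent and stateless, so the same code runs unchanged whether the input is two sentences or two petabytes spread across a cluster.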

So what is Big Data? Apparently, data at such a scale that its very size is the first problem encountered in handling it. And why has it become an issue? Mostly because we want to distil real intelligence from searching vast tracts of stuff, regardless of its configuration, but we do not necessarily want to go to the massive expense of putting it all together in one place with common structures and metadata – or ownership prevents us from doing this even if we could afford it. We have spent a decade refining and acquiring intelligent data mining tools (the purchase of ClearForest by Reuters, as it then was, first alerted me to the implications of this trend five years ago). Now we have to mine and extract, using our tools and inference rules and advanced taxonomic structures to find meaning where it could not be seen before. So in one sense Big Data is like reprocessing spoil heaps from primary mining operations: we had an original purpose in assembling discrete collections of data and using them for specific purposes. Now we are going to reprocess everything together to discover fresh relationships between data elements.
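The “spoil heaps” point – datasets assembled for separate purposes yielding new relationships when recombined – can be shown with a deliberately trivial sketch. The datasets and keys here are invented for illustration; the idea is simply that joining two independently collected collections on a shared key surfaces a relationship neither shows alone:

```python
# Two datasets gathered for unrelated original purposes.
purchases = {"alice": "umbrella", "bob": "sunscreen"}
locations = {"alice": "Manchester", "bob": "Brisbane"}

# Reprocessing them together: join on the shared key to conjecture
# a new relationship (what people buy, where they are).
combined = sorted(
    (person, purchases[person], locations[person])
    for person in purchases.keys() & locations.keys()
)
print(combined)
```

At Big Data scale the join key is rarely this clean, which is why the mining, inference and taxonomy tools the paragraph above describes matter: they manufacture the linkable keys that let disparate collections be cross-searched at all.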

What is very new about that? Nothing really. Even in this blog we discussed (https://www.davidworlock.com/2010/11/here-be-giants/) the Lexis insurance solution for risk management. We did it in workflow terms, but clearly what this is about is cross-searching and analysing data connections, where US government files, insurers’ own client records, Experian’s data sets, and whatever Lexis has in ChoicePoint and its own huge media archive are all elements in conjecturing new intelligence from well-used data content. And it should be no surprise to see Lexis launching its own open source competitor to all those elephants, blandly named Lexis HPCC.

And they are right so to do. For the pace is quickening, and all around us people who can count to five are beginning to realize what the dramatic effects of adding three data sources together might be. July began with WPP launching a new company called Xaxis (http://www.xaxis.com/uk/). This operation will pool social networking content, mobile phone and interactive TV data with purchasing and financial services content, and with geolocational and demographic content. Most of this is readily available without breaking even European data regulations (though it will force a number of players to reinforce their opt-in provisos). Coverage will be widespread in Europe, North America and Australasia. The initial target is 500 million individuals, including the entire population of the UK. The objective is better ad targeting: “Xaxis streamlines and improves advertisers’ ability to directly target specific audiences, at scale and at lower cost than any other audience-buying solution,” says its CEO. By the end of the month 13 British MPs had signed a motion opposing the venture on privacy grounds (maybe they thought of it as the poor man’s phone hacking!).

And by the end of the month Google had announced a new collaboration with SAP (http://www.sap.com/about-sap/newsroom/press-releases/press.epx?pressid=17358) to accomplish “the intuitive overlay of Enterprise data onto maps to Fuel Better Business Decisions”. SAP is enhancing its analytics packages to deal with the content needed to populate locational displays: the imagined scenarios here are hardly revolutionary, but the impact is immense. SAP envisages telco players analysing dropped calls to locate a faulty tower, doing risk management for mortgage lenders, or overlaying census data. DMGT’s revolutionary environmental risk search engine Landmark was doing this to historical land use data 15 years ago. What has changed is speed to solution, scale of operation, and the availability of data filing engines, data discovery schema, and advanced analytics leading to quicker and cheaper solutions.

In one sense these moves link to this blog’s perennial concern for workflow and the way content is used within it and within corporate and commercial life. In another, they push forward the debate on Linked Data and the world of semantic analysis that we are now approaching. But my conclusion in the meanwhile is that while Big Data is a typically faddish information market concern, it should be very clearly within the ambit of each of us who looks to understand the way in which information services and their relevance to users are beginning to unfold. As we go along, we shall rediscover that data has many forms, and that mostly we are at present dealing only with “people and places” information. Evidential data, as in science research, poses other challenges. Workflow concentrations, such as Thomson Reuters are currently building into their GRC environments, raise still more issues about relationships. At the moment we should say welcome to Big Data as a concept that needed to surface, while not losing sight of its antecedents and the lessons they teach.
