A few weeks ago, in “Scraps and Jottings”, I tried to reflect, while talking about the newly launched journal Cureus, an increasing feeling that traditional publishers and the mujahideen of the Open Access world (yes, that good Mullah Harnad and his ilk) are both being overtaken by events. The real democratization which will change this world is popular peer review. Since the mujahideen got in first and named the routes to Open Access Paradise as Green and Gold, and publishers seem quite happy to work within these definitions, especially if they are Gold, I have no choice but to name the post-publication peer review process the Black Route to Open Access. You read it here first.

This thought is underlined by the announcement, since I wrote my previous piece, that the Faculty of 1000 (F1000Research) service has emerged from its six-month beta and can now be considered fully launched. Here we have a fully developed service, dedicated to immediate “publication”, inclusive of all data, totally open and unrestricted in access, and enabling thorough and innovative refereeing as soon as the article is available. And the refereeing is open – no secrets of the editorial board here, since all of the reports and commentaries are published in full with the names and affiliations of referees. The F1000Research team report that in the last six months they have covered major research work from very prominent funders – Wellcome, NIH, etc. – and that they now have 200 leading medical and biological science researchers on their International Advisory panel and more than 1,000 experts on the Editorial Board (see http://f1000research.com). And since they have a strategic alliance with figshare, the Macmillan Digital Science company, “publishing” in this instance could be as simple as placing the article in the researcher’s own repository and opening it up within F1000Research. And since other partners include Dryad and biosharing, the data can also be co-located within specialized data availability services. Saves all those long waits – as soon as it is there, with its data as well, the article is ready to be referenced alongside the academic’s next grant application. The fact that all current publishing has been accompanied by the relevant data release (for which read genomes, spreadsheets, videos, images, software, questionnaires etc) indicates that this too is not the barrier that conventional article publishing made it out to be.
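For those who like to see the shape of a thing in code, here is a minimal sketch in Python of what an open post-publication record of this kind might look like: the article public from day one, data deposits linked rather than buried, and every referee report signed and published alongside it. The class names, fields and the two-approval rule are my own illustrative assumptions, not F1000Research’s actual data model.

```python
# A minimal sketch (not F1000Research's actual model) of an open
# post-publication peer review record.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class DataDeposit:
    repository: str          # e.g. "figshare" or "Dryad" (partners named above)
    doi: str
    description: str

@dataclass
class RefereeReport:
    referee_name: str        # open review: names and affiliations are published
    affiliation: str
    verdict: str             # e.g. "approved", "approved-with-reservations", "not-approved"
    report_text: str
    published_on: date

@dataclass
class Article:
    title: str
    authors: List[str]
    published_on: date       # publication happens immediately, before refereeing
    data: List[DataDeposit] = field(default_factory=list)
    reviews: List[RefereeReport] = field(default_factory=list)

    def is_citable(self) -> bool:
        # Citable (say, alongside a grant application) as soon as it is public with its data.
        return bool(self.data)

    def has_passed_open_review(self) -> bool:
        # Purely illustrative rule: at least two signed "approved" reports.
        return sum(r.verdict == "approved" for r in self.reviews) >= 2
```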

Ah, you will say, the problem here is that the article will not get properly into the referencing system, and without a “journal” brand attached to it there will be a tendency to lose it. Well, some months ago Elsevier agreed that Scopus and Embase would carry abstracts of these articles, and, as I write, PubMed has agreed to inclusion once post-publication review has taken place. But then, you will say, these articles will not have the editorial benefits of orthodox journal publishing, or appear in enhanced article formats. Well, nothing prevents a research project or a library licensing Utopia Docs, and nothing inhibits a freelance market of sub-editors selling in services if F1000Research cannot provide them – this is one labour market which is dismally well staffed at present.

Now that F1000Research has reached this point it is hard to see it not moving on and beginning to influence the stake which conventional publishing has already established in conventional Open Access publishing. And F1000 obviously has interesting development plans of its own: its F1000Trials service is already in place to cover this critical part of bio-medical scholarly communication, and, to my great joy, it has launched F1000Posters, covering a hugely neglected area for those trying to navigate and annotate change and track developments. Alongside Mendeley and the trackability of usage, post-publication review seems to me a further vital step towards deep, long-term change in the pattern of making research available. My new year recommendation to heads of STM publishing houses is thus simple: dust off those credit cards, book a table at Pied à Terre, and invite Vitek round for lunch. He has not sold an STM company since BMC, but it looks as if he has done the magic once again.

But now I must end on a sad note. The suicide this week of Aaron Swartz, at the age of 26, is a tragic loss. He will be remembered as one of the inventors of RSS – and of Reddit – and he had been inventing and hacking since he was 13. His PACER/RECAP work controversially “liberated” US common law to common use. He was known to suffer from severe depression, and it appears that he ended his life in a very depressed state. But here is what Cory Doctorow (http://boingboing.net/2013/01/12/rip-aaron-swartz.html) had to say about what might have been a contributory factor:

“Somewhere in there, Aaron’s recklessness put him right in harm’s way. Aaron snuck into MIT and planted a laptop in a utility closet, used it to download a lot of journal articles (many in the public domain), and then snuck in and retrieved it. This sort of thing is pretty par for the course around MIT, and though Aaron wasn’t an MIT student, he was a fixture in the Cambridge hacker scene, and associated with Harvard, and generally part of that gang, and Aaron hadn’t done anything with the articles (yet), so it seemed likely that it would just fizzle out.

Instead, they threw the book at him. Even though MIT and JSTOR (the journal publisher) backed down, the prosecution kept on. I heard lots of theories: the feds who’d tried unsuccessfully to nail him for the PACER/RECAP stunt had a serious hate-on for him; the feds were chasing down all the Cambridge hackers who had any connection to Bradley Manning in the hopes of turning one of them, and other, less credible theories. A couple of lawyers close to the case told me that they thought Aaron would go to jail.”

Well, one thing we can be quite certain about. Protecting intellectual property or liberating it cannot ever be worth a single human life.

So have we all got it now? When our industry (the information services and content provision businesses, sometimes erroneously known as the data industry) started talking about something called Big Data, it was self-consciously re-inventing something that Big Science and Big Government had known about and practised for years. Known about and practised (especially in Big Secret Service; for SIGINT see the foot of this article), but worked upon in a “finding a needle in a haystack” context. The importance of this only revealed itself to me when I found myself at a UK Government Science and Technology Facilities Council event at the Daresbury Laboratory in the north of England earlier this month. I went because my friends at MarkLogic were one of the sponsors, and spending a day with 70 or so research scientists gives more insight into customer behaviour than going to any great STM conference you may care to name. I went because you cannot see the centre until you get to the edge, and sitting amongst perfectly regular, normal folk who spoke of computing in yottaflops (processing speeds of 10 to the power of 24 operations per second) as if they were sitting in a laundromat watching the wash go round is fairly edgy for me.

We (they) spoke of data in terms of Volume, Velocity and Variety, sourced from the full gamut of output from sensor to social. And we (I) learnt a lot about problems of storage which go well beyond those of a Google or a Facebook. The first speaker, from the University of Illinois, at least came from my world: Kalev Leetaru is an expert in text analytics and a member of the Heartbeat of the World Project team. The Great Twitter Heartbeat ingests Twitter traffic, then sorts and codes it so that US citizens going to vote, or Hurricane Sandy respondents, can appear as geographical heatmaps trending in seconds across the geography of the USA. The SGI UV which did this work (it can ingest the printed resources of the Library of Congress in 3 seconds) linked him to the last speaker, the luminous Dr Eng Lim Goh, SVP and CTO at SGI, who gave a magnificent tour d’horizon of current computing science. His YouTube videos are as wonderful as the man himself (a good example is his 70th birthday address to Stephen Hawking, his teacher, but also look at http://www.youtube.com/watch?v=zs6Add_-BKY). And he focussed us all on a topic not publicly addressed by the information industry as a whole: the immense distance we have travelled from “needle in a haystack” searching to our current pre-occupation with analysing the differences between two pieces of hay – and mapping the rest of the haystack in terms of those differences. For Dr Goh this resolves to the difference between arranging stored data as a cluster of nodes and working in shared memory (he spoke of 16 terabyte supernodes). As the man with the very big machine, his problems lie in energy consumption as much as anything else. In a process that seems to create a workflow that goes Ingest > Store and Organize > Analytics > Visualize (in text and graphics – like the heatmaps), the information service players seem to me to be involved at every point, not just the front end.
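To make that Ingest > Store and Organize > Analytics > Visualize workflow concrete, here is a toy Python sketch applied to geotagged messages. The field names and the crude keyword “coding” step are my own assumptions for illustration, not the Great Twitter Heartbeat’s actual method, and a real system would of course replace the text printout with a proper map.

```python
# Toy sketch of the Ingest > Store & Organize > Analytics > Visualize pipeline.
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def ingest(stream: Iterable[dict]) -> Iterable[dict]:
    """Ingest: keep only messages that carry a usable location."""
    for msg in stream:
        if msg.get("lat") is not None and msg.get("lon") is not None:
            yield msg

def organize(messages: Iterable[dict], cell_size: float = 1.0) -> Dict[Tuple[int, int], list]:
    """Store & organize: bucket messages into grid cells (a stand-in for real storage)."""
    grid = defaultdict(list)
    for msg in messages:
        cell = (int(msg["lat"] // cell_size), int(msg["lon"] // cell_size))
        grid[cell].append(msg["text"])
    return grid

def analyze(grid: Dict[Tuple[int, int], list], keyword: str) -> Counter:
    """Analytics: count how often a topic (a storm, an election) appears per cell."""
    heat = Counter()
    for cell, texts in grid.items():
        heat[cell] = sum(keyword.lower() in t.lower() for t in texts)
    return heat

def visualize(heat: Counter, top: int = 5) -> None:
    """Visualize: a text 'heatmap' of the hottest cells, standing in for a real map."""
    for cell, count in heat.most_common(top):
        print(f"cell {cell}: {'#' * count} ({count})")

# Usage (tweet_stream is a hypothetical iterable of message dicts):
# visualize(analyze(organize(ingest(tweet_stream)), "sandy"))
```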

The largest data sourcing project on the planet was represented in the room (the SKA, or Square Kilometre Array, a remote-sensing telemetry experiment with major sites in Australia and South Africa). Of course, NASA is up there with the big players, and so are the major participants in cancer research and human genomics. But I was surprised by how big the Big Data held by WETA Data (look at all the revolutionary special effects research at http://www.wetafx.co.nz/research) in New Zealand was, until I realised that this is a major film archive (and NBA Entertainment is up there too on the data A-list). This reflects the intensity of data stored from film frame images and their associated metadata, now multiplied many times over in computer-graphics-driven production. But maybe it is time now to stop talking about Big Data, the term which has enabled us to open up this discussion, and begin to reflect that everyone is a potential Big Data player. However small our core data holding may be compared to these mighty ingestors, if we put proprietary data alongside publicly sourced Open Data and customer-supplied third-party data, then even very small players can experience the problems that induced the Big Data fad. Credit Benchmark, which I mentioned two weeks ago, has little data of its own: everything will be built from third-party data. The great news aggregators face similar data concentration issues as their data has to be matched with third-party data.
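Here, as a small illustration of how even a modest player ends up doing “Big Data” work, is a minimal Python sketch of merging a proprietary dataset with open and third-party data onto a common key. The normalisation rule and the field names are my own illustrative assumptions, not how Credit Benchmark or any aggregator actually does it.

```python
# Minimal sketch of linking proprietary, open and third-party records by name.
import re
from typing import Dict, Iterable

def canonical_name(name: str) -> str:
    """Crude entity key: lowercase, strip punctuation and common company suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\b(inc|ltd|plc|corp)\b", "", name)
    return " ".join(name.split())

def link(proprietary: Iterable[dict], open_data: Iterable[dict],
         third_party: Iterable[dict]) -> Dict[str, dict]:
    """Merge three sources onto one key; earlier sources take precedence per field."""
    merged: Dict[str, dict] = {}
    for source in (proprietary, open_data, third_party):
        for record in source:
            entry = merged.setdefault(canonical_name(record["name"]), {})
            for field_name, value in record.items():
                entry.setdefault(field_name, value)
    return merged
```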

And I was still thinking this through when news came of an agreement signed by MarkLogic (www.marklogic.com) with Dow Jones on behalf of News International this week. The story was covered in interesting depth at http://semanticweb.com/with-marklogic-search-technology-factiva-enables-standardized-search-and-improved-experiences-across-dow-jones-digital-network_b33988 but the element that interested me, and which highlights the theme of this note, concerns the requirement not just to find the right article, but to compare articles and demonstrate relevance in a way which only a few years ago would have left us gasping. Improved taxonomic control, better ontologies and more effective search across structured and unstructured data lie at the root of this, of course, but do not forget that good results at Factiva now depend on effective Twitter and blog retrieval, and effective ways of pulling back more and more video content, starting with YouTube. The variety of forms takes us well beyond the good old days of newsprint, and underlines the fact that we are all Big Data players now.
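As a back-of-envelope illustration of comparing articles rather than merely finding them, here is a toy Python sketch that ranks a mixed collection (newsprint, tweets, blog posts, video transcripts) against a query and then compares two items directly. It is emphatically not MarkLogic’s or Factiva’s actual technology, which rests on taxonomies, ontologies, entity extraction and far more sophisticated relevance models; the bag-of-words scoring below is simply the crudest version of the idea.

```python
# Toy illustration: rank a mixed collection, then compare two documents directly.
import math
from collections import Counter
from typing import List, Tuple

def term_vector(text: str) -> Counter:
    """Very crude bag-of-words; real systems use taxonomies, ontologies and entities."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query: str, docs: List[dict]) -> List[Tuple[float, dict]]:
    """Score every document (article, tweet, transcript) against the query, best first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d["text"])), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def compare(doc_a: dict, doc_b: dict) -> float:
    """Compare two retrieved items to each other, not just to the query."""
    return cosine_similarity(term_vector(doc_a["text"]), term_vector(doc_b["text"]))
```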

Note: Alfred Rolington, formerly CEO at Janes, will publish a long-awaited book with OUP, “Strategic Intelligence in the Twenty-First Century”, in January; it can be pre-ordered on Amazon at http://www.amazon.co.uk/Strategic-Intelligence-21st-Century-Mosaic/dp/0199654328/ref=sr_1_1?s=books&ie=UTF8&qid=1355519331&sr=1-1. And I should declare, as usual, that I do work from time to time with the MarkLogic team, and thank them for all they have done to try to educate me.
