Oct 11
The Road to Dogpatch Labs
Filed Under Blog, eBook, Industry Analysis, internet, mobile content, Publishing, semantic web, STM, Uncategorized, Workflow
This week is Frankfurt week, and with it comes the pleasure of interviewing Annette Thomas, CEO of Macmillan, on the agenda of the STM conference, the traditional forerunner of the Frankfurt Book Fair. And I find a hint of nostalgia in the conference programme which precedes our event. It has a traditional flavour. Whenever STM publishers sit down to discuss the twin evils of Open Access and Peer Review (or those who slight it), they do so with a lip-smacking relish more akin to tucking into Christmas turkey than to a logical discussion of the issues facing scholarly communication. Indeed, I sometimes wonder if “science publishing” has gone off on its own, leaving “scholarly communication” to the scholars.
Let me try to illustrate what I mean. The looming crisis in STM, in my warped view, is the data crisis. In every other sector it is rapidly becoming clear that increasingly sophisticated data mining and extraction techniques will come into play as users seek to extract new meaning from existing files, and further discovery as they cross-search those files with currently unstructured content held elsewhere. STM, it seems to me, is peculiarly susceptible to this Big Data syndrome, for behind the proprietary content stores of perfectly preserved published research articles “owned” by publishers lies the terra incognita of research data and findings held in labs and on research networks. Future scholars will want to search everything together, and will be impatient with barriers which prevent this. Once the tools and utilities which comprise research workflow become generally available, and the techniques and value of semantic searching lock into this, the urge becomes irresistible, and scholarly article data gets versioned, commoditized, “outed”. It does not really matter whether it is located on the open web, the closed web, in the cloud, or in a university repository.
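To make the cross-search point concrete, here is a minimal sketch in Python of a single inverted index built over both published articles and unstructured lab files, so that one query spans the publisher’s store and the terra incognita alike. The sample documents and identifiers are invented for illustration; a real pipeline would add entity extraction and stemming:

```python
# A minimal sketch of cross-searching published and unpublished stores
# through one shared index. All names and sample data are hypothetical.
from collections import defaultdict

def tokenize(text):
    """Crude tokenizer; real systems would stem and extract entities."""
    return [w.strip(".,;()").lower() for w in text.split()]

def build_index(documents):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

# A published article and a raw lab note, indexed together.
corpus = {
    "article:1001": "Kinase inhibition assay results in published form.",
    "labnote:07-3": "Raw kinase assay readings, uncalibrated, from bench work.",
}
index = build_index(corpus)

# One query now crosses both stores, regardless of where each file lives.
hits = index["kinase"] & index["assay"]
print(sorted(hits))  # ['article:1001', 'labnote:07-3']
```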
The implications of this are vast. Scholars want to be published by prestigious branded journals as a way of being noted: they also want to be searched in the bloodstream of science. They will make sure they are everywhere, and that their data is where it needs to be as well. The metadata may note that this article was Gold OA and that one was published by Science, but this may be of most interest to the filtering interface in the workflow environment, which uses the information to rank or value results. And a finding from 25 years ago continues to haunt me in STM: it alleges that most searches are performed not to find claims or results, but to discover, check and compare experimental methodologies and techniques. In a world where regulation and compliance grow ever more powerful, this is unlikely to diminish.
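The ranking point is worth making concrete. Here is a hedged sketch of how a workflow filter might use provenance metadata to weight results without gating discovery; the venue weights, field names and sample records are my own invented illustrations, not any real system’s values:

```python
# A sketch of metadata-weighted ranking: provenance (journal brand, OA
# status) scales relevance but never decides findability. All weights
# and records below are assumptions for illustration.
JOURNAL_WEIGHT = {"Science": 2.0, "mid-tier": 1.2, "repository": 1.0}

def rank(results):
    """Sort hits by relevance scaled by a provenance weight."""
    def score(hit):
        weight = JOURNAL_WEIGHT.get(hit["venue"], 1.0)
        bonus = 0.1 if hit.get("gold_oa") else 0.0
        return hit["relevance"] * weight + bonus
    return sorted(results, key=score, reverse=True)

hits = [
    {"title": "A", "venue": "repository", "relevance": 0.9, "gold_oa": True},
    {"title": "B", "venue": "Science", "relevance": 0.6, "gold_oa": False},
]
for h in rank(hits):
    print(h["title"])  # B (0.6 * 2.0 = 1.2) outranks A (0.9 * 1.0 + 0.1 = 1.0)
```

The design point is that everything stays in the result set: brand and OA status become weighting signals for the interface, exactly the secondary role suggested above.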
So I have come to feel that Open Access (one participant asked me what market share it would eventually have, and was appalled when I said 15% – before it becomes wholly irrelevant) and Peer Review (increasingly, all research validation exercises will be multi-metric, so even the traditional argument collapses) are more about the preservation of publishers than the future of scholarly communication. Not that I object to that preservation, but I really did sit up as Annette Thomas, in her interview, began to describe some of the game-changing activity of Digital Science, child of Nature, as an investor in a variety of workflow-enhancing technologies built by bench researchers for themselves (http://digital-science.com/products).
And in particular the announcement, made during the session, that Labtiva, a Digital Science investment at Harvard (sited in Dogpatch Labs), was launching ReadCube as an App (http://www.readCube.com). If anything bespeaks workflow, it is the App. And what does this one do? It allows researchers to order their current world of articles as a personal content library, free and Cloud-based, with features like a filing system for PDFs, fast download from a university or institutional login, the ability to save and re-read annotations, cite and create references, and a personalised recommendation service. In other words, a smart App, worthy of the world of the iPad, which solves the distressing everyday issues of finding what you once downloaded, recalling what you once thought about it, and finding more of the same. What could be more simple? But in simplicity like this there is a form of beauty. An App is definable as a workflow tool which takes the clumsy pieces of multi-stage routine out of daily interactions with work – and makes sure you do not have to remember, next time, the cumbersome process you had to perform to do it.
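For the curious, the pattern behind such an App can be sketched in a few lines. This is emphatically not ReadCube’s API – the class and field names are my own hypothetical shorthand for a personal library that files PDFs, keeps annotations re-readable, and recommends “more of the same” by shared keywords:

```python
# A hypothetical sketch of the personal-library pattern described above;
# not ReadCube's implementation.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    pdf_path: str
    keywords: set
    annotations: list = field(default_factory=list)

class Library:
    def __init__(self):
        self.papers = []

    def add(self, paper):
        """File a downloaded paper so it can be found again."""
        self.papers.append(paper)

    def annotate(self, title, note):
        """Save a re-readable annotation against a filed paper."""
        for p in self.papers:
            if p.title == title:
                p.annotations.append(note)

    def recommend(self, paper, catalogue):
        """Naive 'more like this': rank candidates by shared keywords."""
        return sorted(catalogue,
                      key=lambda c: len(c.keywords & paper.keywords),
                      reverse=True)

lib = Library()
p1 = Paper("Kinase assay methods", "~/pdfs/kinase.pdf", {"kinase", "assay"})
lib.add(p1)
lib.annotate("Kinase assay methods", "Compare with our 2010 protocol.")
```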
So, whatever the introspective mood in the room, here is one publisher setting off on the migration to new values, determinedly seeking the pain points in researchers’ working lives and setting out to solve them. And indeed other publishers (including Elsevier, with their SciVerse and SciVal developments) are heading in the same direction. Yet the contrast between this and the generality of players in the sector is profound. At one point in the meeting I found myself in a discussion about what was going right with STM in a difficult marketplace dependent on government finance. Well, said one very knowledgeable source, we are doing a great deal with eBooks, selling them into places we never thought we would reach. Enhanced with video or audio? No, just reversioning of text. And library subscriptions are holding up really quite well, said another, and the market seems to have been able to absorb some limited price increases. So I took away a picture of a sector holding its breath and hoping that things would revert to normal, and that traditional business models would prevail. But we all knew in our hearts that when “normal” came back it would be different. Postponing the trek down the road to Dogpatch Labs only loses first-mover advantage and the experience born of iteration, and ensures that successful change will be more difficult in the long term.
Sep 22
Dog Days in the Data Mine
Filed Under B2B, Blog, Financial services, healthcare, Industry Analysis, internet, Publishing, Reed Elsevier, Search, semantic web, STM, Thomson, Uncategorized, Workflow
It reminds one superficially of mineral extraction. Who owns the seam of diamonds – the miner or the landowner? And what happens when rights are not clear, or land ownership is in dispute? But this business of text or data mining is not really like that at all, and I was reminded this week, by blogging contributions from two old friends, that the question of who owns the results of data extraction remains at issue: extraction from thousands or millions of unstructured files, where the data retrieved from any individual dataset may be tiny (well within most fair usage provisions) but the contribution to the whole value may be huge. Play this in the context of Big Data and real questions emerge.
Let’s go back to the beginning. Here are a couple of top-of-head examples of life on the planet that give a clue to what is worrying me:
* According to research quoted by the UK’s National Centre for Text Mining, “fewer than 7.84% of scientific claims made in a full text article are reported in the abstract for that article”. This, they point out, makes cross-searching of articles using data mining and extraction techniques very important to science research. Fortunately JISC, the organization which licenses all journal article content from publishers on behalf of UK universities, permits researchers to data mine these files, and no doubt this was agreed with the publishers within the licence(?). But the question in my mind is this: who owns the product created by the data mining, and is this a new value which can be resold to someone else?
* Lexis Risk Management use many hundreds of public and private US data resources in their Big Data environment to profile people and companies. Both private and public data are researched, and, of course, it will often be the case that unique connections are thrown up which encourage or discourage users from doing business with the data subject. Clearly Lexis own the result of the custom sweep of the data, and clearly it needs to be updated and amended over time as fresh data becomes available, or as more data is licensed into the mine. But do Lexis, or any other data extractor, own the result of the extraction process? They are able to sell a value derived from it, and that value emerges directly from the search activity and the weighting of the answers that they have accomplished. But do they own, or need to own, the content (which may be different in ten minutes’ time, when another search is done on the same subject)? And can the insurance company who buys that result as part of their risk management model resell the data content itself to a third party? (A sketch of this derived-value question follows below.)
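Here is a minimal sketch of the derived-value question raised in both examples: each source yields a tiny, fair-use-sized finding, yet the weighted combination is the saleable product – and re-running it later, against refreshed sources, yields a different answer. The sources, weights and sample subject below are invented for illustration only:

```python
# A hedged sketch: tiny per-source findings combined into one derived
# score. The extractor sells the score; no source record is copied out.
# All sources, weights and data here are hypothetical.
def risk_signal(subject, sources, weights):
    """Combine small per-source findings into one weighted score."""
    score = 0.0
    for name, lookup in sources.items():
        finding = lookup(subject)          # a single fact per source
        score += weights.get(name, 1.0) * finding
    return score

sources = {
    "court_records":  lambda s: 1.0 if s == "ACME Ltd" else 0.0,
    "credit_file":    lambda s: 0.4,
    "press_mentions": lambda s: 0.2,
}
weights = {"court_records": 3.0, "credit_file": 1.5, "press_mentions": 0.5}

print(risk_signal("ACME Ltd", sources, weights))  # 3.0 + 0.6 + 0.1 = 3.7
```

Note that the value lives in the weighting and the sweep, not in any one record – which is precisely why ownership of the output is so hard to pin to ownership of the inputs.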
I have put up two examples because I do not wish to polarize the argument into publishers v government. The issue arises in the UK, as the media lawyer’s lawyer, Laurie Kaye, has pointed out, because the Hargreaves Review of copyright law recommends the retention of rights with the data miner – so you can make new products by recombining other people’s data. The UK government has adopted this recommendation with its usual emphatic “maybe”. Elsewhere, in the world of August which I deserted to take a holiday, the government has come out with a storming approval of Open Data, and, as Shane O’Neill has repeatedly pointed out in his blogs, this contrasts sharply with the content retention policies pursued by UK civil servants, who are even now creating a Public Data Corporation in order to frustrate the political drive of their masters (how easily a licensing authority becomes a restricting body!).
There are two really troubling aspects of this for me. In the first instance, we are not going to get the data revolution, the Berners-Lee dream of linked data, the creation of hybrid workflow content modelling, or the Big Data promise of new product and service development unless there is a primary assumption in our society that all Open Web content, and all government or taxpayer-funded content, is available for data cross-searching, barring national security considerations. Nor will we get it without a standard expectation in data leasing that discovery from multiple files creates new services for the person putting the intellectual effort into that discovery – and, hopefully, new wealth and employment in our society. If we simply continue to debate copyright as if it connotes the transfer of real-world rights into the digital network, then we shall constrain the major hope of intellectual property development this century.
And the second thing? Well, I am realist enough to know, after 20 years of lobbying this point, that it is unreasonable to expect the UK government to change its attitude to an information society in my lifetime. So maybe we can undermine these guardians of “my information is my power” by saying that we do not want their content – just the right to search it. After all, if it is good enough for the universities and the progress of science, it should be good enough for Ordnance Survey and the Land Registry!
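What might “the right to search, not the content” look like in practice? A hedged sketch: a query interface which answers questions over a closed corpus but returns only derived results – counts and short snippets – never the bulk records. The interface, the snippet limit and the sample registry entry are all my own assumptions, not any real agency’s API:

```python
# A hypothetical search-only interface: the holder keeps the records,
# callers get derived answers, never a bulk download.
FAIR_USE_SNIPPET = 80  # characters returned per hit; an assumed limit

def search_only(query, closed_corpus):
    """Answer a query over records the caller may never download."""
    hits = [(doc_id, text) for doc_id, text in closed_corpus.items()
            if query.lower() in text.lower()]
    return {
        "query": query,
        "match_count": len(hits),
        "snippets": [text[:FAIR_USE_SNIPPET] for _, text in hits],
    }

registry = {
    "title:552": "Freehold land parcel registered 1994, boundary amended 2003.",
}
print(search_only("freehold", registry)["match_count"])  # 1
```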
References
Making Open Data Real (http://www.data.gov.uk/opendataconsultation)
The Public Data Corporation (http://discuss.bis.gov.uk/pdc/)
Response to the Hargreaves Report (http://www.bis.gov.uk/assets/biscore/innovation/docs/g/11-1199-government-response-to-hargreaves-review)
National Centre for Text Mining (http://www.nactem.ac.uk/)
Laurence Kaye (http://laurencekaye.typepad.com/)
Shane O’Neill (http://www.shaneoneill.co.uk/)