Trends and trending analysis are one thing; making an impact on the way people work is often quite another. So while I respectfully write up the huge progress being made to provide large scale tools for analytical discovery in unimaginable quantities of data, a small portion of me remains skeptical about the impact of these developments in the short term on the working lives of professionals. Look at researchers in science and technology: you can readily imagine the impact of Big Data on Big Pharma, but can you so easily imagine what this will mean in materials science? Or can you see how the workbench performance of the individual researcher in neuroscience might be impacted? It's tough, and because it is tough we go back to saying that the traditional knowledge components will last the course. So if you have a good library, access to a reasonable collection of journals and the ability to network with colleagues then that is enough. Or Good Enough, as we keep saying.

So when I read the words “This is important not only for the supplementary data accompanying one’s experiment, but even negative results” I came alive immediately and read consciously what I had hitherto skipped. You see, in all the years that I have spoken with and interviewed researchers, when we get off the formal ground of OA or conventionally published articles, or the iniquities of publishers and the inadequacy of librarians, we get back to some stubborn issues that cling to the bottom of the bucket. One is what to do with the remaining content derived from the research process which did not get into the article, where it was summarized and where conclusions were drawn from it. I mean the statistical findings, the raw computations, the observations and logs, the audio and video diaries, the discarded hypotheses etc. Vital stuff, if anyone is going to walk that way again. Even more vital is the detritus of failure: the experiment which never made a paper since it demonstrated what we already know, or where the model proved inadequate to demonstrate what we sought to show. Researchers going back to find why a generation of research went astray from a finding that proved fallible often need this content: in terms of detective fiction it is the cold case evidence. Yet more often than not it is not available.

So here is what I found in the nearly discarded press release. Nature Publishing’s Digital Science company (yes, them again!) have refinanced figshare (http://figshare.com) and yesterday they relaunched it. What does it do? It archives all the stuff I have been talking about, providing a Cloud environment with unlimited public storage. They call it “a community-based open data platform for scientific research”. I call it a wonderful way of embedding research workflow into a researchable storage environment that eventually becomes a search magnet for researchers wanting to check the past for surprising correlations. At the moment it is just a utility, a safe place to put things. But if I just add a copy of the article itself then it becomes a record of a research process. Put hundreds of thousands of those together and then you have a Big Data playground. Use intelligent analytics and new insights can be derived, and science moves forward on the tessellation of previous experimentation, only quicker, with less effort and more productivity for the researcher. And much less is lost, including the evidence from the wrong turnings that turned out to be right turnings. (http://digital-science.com/press-releases/)

So will there be 20 of these? Well, there may be two, but if figshare gets an early lead perhaps there will only be one. After all, the reason researchers would come to value this storage would be having their content in close proximity to others in their field. And while early progress is likely to be quick in Life Sciences, this application has relevance in every field of study. And it also calls into question ideas of what “publishing” actually is. By storing and making available these data, are figshare “publishing” them? They are certainly not editing or curating them. Network access alters many things and here, once again, it catches publishing on the hop. If traditional publishers confine themselves to making margins solely from the first appearance of an article then traditional publishing in this sector is in severe difficulty, whatever happens to the Open Access debate. Elsevier and Nature clearly get it: go upstream in value terms or drown in commoditized content where you are. But does anyone else see it? And why not?

 

 

 

 

It's Big Data week, yet again. In the last two months we have seen all of the dramas and confusions attendant upon emerging markets, yet none of the emerging clarity which one might expect when a total sea change is taking place in the way in which we extract value from data content. Then this week, with all the aplomb of an elephant determined not to be left behind in a world which has apparently decided that the hula hoop is the only route to sanity, Oracle announced its enterprise Big Data solution. Again. Only now it is called the Big Data Appliance. It started shipping on Tuesday. And the world will never be the same again.

At the heart of the Oracle launch is a Hadoop license. This baby elephant lies at the heart of almost everything. The two Hadoop-based commercializations have both raised finance in the lead-up to 2012: Cloudera ($40m) and Hortonworks ($20m), while other sector players like MapR, who also exploit Hadoop, found 2011 a really good time to raise money. And this had a radiating effect on the whole data handling sector. Neo4j, a database technology for graph storage and resolution (Neo Technology, based in Malmö and Menlo Park), raised $10m in a round led by Fidelity. Meanwhile, Microsoft signed a deal with Hortonworks, IBM said it would launch Hadoop in the Cloud, EMC (Greenplum) went for MapR, Dell announced a Hadoop-based initiative, and the world waits and wonders what Hewlett Packard will do, now that it has Autonomy for analytics.

So now we have plenty of initiatives, and, as usual, not much idea of who the next generation of users will be. The first generation speak for themselves. We can see the benefits that Facebook derive from being able to use Hadoop-based tools to find connections and meanings in their content that would have been impossible to reveal cost-effectively in a prior age. And the same would be true of such unlikely bedfellows as the Department of Homeland Security, or Walmart, or Sony (think PlayStation Network), or the Israeli Defence Force, or the US insurance industry (via Lexis Risk), or LexisNexis (who announced a Big Data integration with MarkLogic), let alone the two players who effectively started all this: Yahoo! (Hadoop) and Google (MapReduce). So asking where it goes next is a legitimate question, but one which can only be answered if we accept that the next group of users are never going to recreate the Google server farms in order to break into these advantageous processing environments. The next group of intensive users will have their XML content on MarkLogic, or their graph data on Neo4j. They will want to use the US census data remotely (so will contract with Amazon for process time on the Amazon web presence), and will use a large variety of third party content held in similar ways. Some of them will still hold part of their own content locally on MySQL databases, like Facebook, while others will be working in part or fully in the Cloud, and combining that with their own NoSQL applications. But the essential point here is that no one will be building huge data warehousing operations governed by rigid and mechanistic filing structures. Literally, we are increasingly leaving the data where it is, and bringing the analytical software to it, in order to produce results that are independent of any single data source.

And this too produces another sort of revolution. The front door to working in this way is now the organizational software itself. When Lexis Risk announced at the end of last year that they were going to take HPCC open source, a number of critics saw that as turning their back on an exploitation opportunity. Yet it makes very real sense in the context of Oracle, Microsoft and IBM seeking to build their own “solutions”. Some businesses will want to run their own solutions, and will make a choice between open source Hadoop and open source HPCC. Others in systems integration will seek out open source environments to create unique propositions. But since it was always unlikely that Lexis Risk was going to challenge the enterprise software players in their own bailiwick, open source is a way of getting a following, harvesting vital feedback, and earning not insignificant returns in servicing and upgrading users.

I am also delighted to see that another likely winner is MarkLogic, since I have been proud of working with them and speaking at their meetings for a number of years. For publishers and information providers, it is now clear that XML remains the route forward. But MarkLogic 5 is clearly being positioned as the information service provider's socket for plugging into the Big Data environment. Anyone who believes that scientists will NOT want to analyse all data in a segment, or engineers source all relevant briefs with their ancillary information, or lawyers cross examine all documentation regardless of location, or pharma companies examine research files in the context of contra-indications should stop reading now and take up fishing. My observation is that Big Data is like Due Diligence: once someone does it, even if the first results are not impressive, all competitors have to do it. The risk of not trying to find the indicative answer by the most advanced methods is too great to take.

 

 

 
