So have we all got it now? When our industry (the information services and content provision businesses, sometimes erroneously known as the data industry) started talking about something called Big Data, it was self-consciously re-inventing something that Big Science and Big Government had known about and practised for years. Known about and practised (especially in Big Secret Service; for SIGINT see the foot of this article) but worked upon in a “finding a needle in a haystack” context. The importance of this only revealed itself when I found myself at a UK Government Science and Technology Facilities Council event at the Daresbury Laboratory in the north of England earlier this month. I went because my friends at MarkLogic were one of the sponsors, and spending a day with 70 or so research scientists gives more insight into customer behaviour than going to any great STM conference you may care to name. I went because you cannot see the centre until you get to the edge, and sitting amongst perfectly regular normal folk who spoke of computing in yottaflops (processing speeds of 10 to the power of 24 floating-point operations per second) as if they were sitting in a laundromat watching the wash go round is fairly edgy for me.

We (they) spoke of data in terms of Volume, Velocity and Variety, sourced from the full gamut of output from sensor to social. And we (I) learnt a lot about the problems of storage which went well beyond the problems of a Google and a Facebook. The first speaker, from the University of Illinois, at least came from my world: Kalev Leetaru is an expert in text analytics and a member of the Heartbeat of the World Project team. The Great Twitter Heartbeat ingests Twitter traffic, then sorts and codes it so that US citizens going to vote, or Hurricane Sandy respondents, can appear as geographical heatmaps trending in seconds across the geography of the USA. The SGI UV which did this work (it can ingest the printed resources of the Library of Congress in 3 seconds) linked him to the last speaker, the luminous Dr Eng Lim Goh, SVP and CTO at SGI, who gave a magnificent tour d’horizon of current computing science. His YouTube videos are as wonderful as the man himself (a good example is his 70th birthday address to Stephen Hawking, his teacher, but also look at http://www.youtube.com/watch?v=zs6Add_-BKY). And he focussed us all on a topic not publicly addressed by the information industry as a whole: the immense distance we have travelled from “needle in a haystack” searching to our current pre-occupation with analysing the differences between two pieces of hay – and mapping the rest of the haystack in terms of those differences. For Dr Goh this resolves to the difference between arranging stored data as a cluster of nodes and working in shared memory (he spoke of 16 terabyte supernodes). As the man with the very big machine, his problems lie in energy consumption as much as anything else. In a process that seems to create a workflow that goes Ingest > Store and Organize > Analytics > Visualize (in text and graphics – like the heatmaps) the information service players seem to me to be involved at every point, not just the front end.

The largest data sourcing project on the planet was represented in the room (the SKA, or Square Kilometre Array, is a remote sensing telemetry experiment with major sites in Australia and South Africa). Of course, NASA is up there with the big players, and so are the major participants in cancer research and human genomics. But I was surprised by how Big the Big Data held by WETA Data (look at all the revolutionary special effects research at http://www.wetafx.co.nz/research) in New Zealand was, until I realised that this is a major film archive (and NBA Entertainment is up there too on the data A List). This reflects the intensity of data stored from film frame images and their associated metadata, now multiplied many times over in computer graphics-driven production. But maybe it is time now to stop talking about Big Data, the term which has enabled us to open up this discussion, and begin to reflect that everyone is a potential Big Data player. However small our core data holding may be compared to these mighty ingestors, if we put proprietary data alongside publicly sourced Open Data and customer-supplied third party data, then even very small players can experience the problems that induced the Big Data fad. Credit Benchmark, which I mentioned two weeks ago, has little data of its own: everything will be built from third party data. The great news aggregators face similar data concentration issues as their data has to be matched with third party data.

And I was still thinking this through when news came of an agreement signed by MarkLogic (www.marklogic.com) with Dow Jones on behalf of News International this week. The story was covered in interesting depth at http://semanticweb.com/with-marklogic-search-technology-factiva-enables-standardized-search-and-improved-experiences-across-dow-jones-digital-network_b33988 but the element that interested me, and which highlights the theme of this note, concerns the requirement not just to find the right article, but to compare articles and demonstrate relevance in a way which only a few years ago would have left us gasping. Improved taxonomic control, better ontologies and more effective search across structured and unstructured data lie at the root of this, of course, but do not forget that good results at Factiva now depend on effective Twitter and blog retrieval, and effective ways of pulling back more and more video content, starting with YouTube. The variety of forms takes us well beyond the good old days of newsprint, and underlines the fact that we are all Big Data players now.

Note: Alfred Rolington, formerly CEO at Janes, will publish a long-awaited book with OUP on “Strategic Intelligence in the Twenty First Century” in January, which can be pre-ordered on Amazon at http://www.amazon.co.uk/Strategic-Intelligence-21st-Century-Mosaic/dp/0199654328/ref=sr_1_1?s=books&ie=UTF8&qid=1355519331&sr=1-1. And I should declare, as usual, that I do work from time to time with the MarkLogic team, and thank them for all they have done to try to educate me.

Personally, I blame Marjorie Scardino. When she announced her retirement, this statement, included amongst her comments, might have been intended to encourage the Pearson troops and point them towards the challenges of the future and the Golden City on the Hill. Unfortunately, her comments also reached the wider publishing community, and encouraged that sort of complacency and fired up the sort of debate that the British publisher appears to love, since it enables him to conclude that it is all too complex and no one knows a guaranteed route to success, so it may be wiser not to try until matters have clarified a little more. To those, like me, who have spent over 40 years declaiming that experiment followed by re-iteration is the only way to go, and that you go nowhere in the digital world until you have failed at something digital, this is, at its least, a little frustrating. You see, I know that we have arrived and that we left the foothills behind in 1999. Not in education, I agree: Dame Marjorie’s brilliant step was to see past the publishers and address the real problems of education markets – administration, assessment, marking, communication with parents, teacher skills etc. The textbook was a small market which could be left until the infrastructure could be digitalized and then Pearson would have a head start in plugging their content into that infrastructure.

Dame Marjorie (aided and abetted by Anthony Forbes Watson) bought Dorling Kindersley for Pearson. Embedded into that company was real digital publishing. In 1996 DK was producing CD-ROM-based encyclopaedias and reference works that were a delight, for their day, in terms of interactivity and multimedia development. They were on CD-ROM only because online did not have the bandwidth, and it is noteworthy now that only with ePub3 has the eBook caught up with the mid ’90s CD-ROM. Yet, as a non-executive director of DK at the time, I was sure that we were doing real digital publishing for very large numbers of real users. So when I saw a report by Linda Bennett in BookBrunch of a seminar by Cognizant entitled “Digital Publishing: Still in the foothills?” (27 November) I frayed slightly round the edges. Really good speakers, but in a meeting where a questioner asks “whether publishers who engaged with such innovative ventures as digital development could still truly call themselves publishers” one wonders whether Publishing will not always be in the Foothills, wandering around, lost and resentful and playing a game of its own with ever dwindling audiences of paper-lovers.

This is not to say that valuable points were not made. Mark Majurey’s comment that “content won’t cut it much longer” is important, if it reminds us that it is not content per se that matters, but the context in which we deliver it that will drive our future developments. When someone asked “Is it true that social networking doesn’t sell books?” they were reminded that it is word of mouth that sells books (and presumably as effectively on Facebook as on Amazon). When someone said that the rentals model, the disappearing book that dissolves as you read it, “sounds bonkers”, they were at least reminded that this is a very valid model which may eventually prevail. But the skepticism expressed about the digital illustrated book may be totally misplaced. It all depends what experience you want. We have plenty of examples of text files co-located with audio, video and image where the user is invited to chart his own course through the material. But why are we so hung up on trying to replicate the narrative experience of the illustrated book online?

About an hour later on the Web I did encounter a digital publisher. One who publishes for consumers yet does not use paper at all. Its Vice President was writing a Christmas message to the staff on November 28th. He reported that 13 of the Top 100 Kindle bestsellers were published by their organization, and he recognized 11 authors whose new titles on his list had sold 100,000 copies in the past few months. He pointed to the success of the company language translation scheme, with 12 titles translated by this operation getting into the German Kindle Top 100, and the German into English programme beginning in the New Year with a prize-winning German novel. He spoke of serialization and reminded his internal readers that the programme they launched in September was now serializing seven never-before-published Kurt Vonnegut stories over the next seven weeks. And he spoke of global outreach and of his plans to open a European operation in Luxembourg early next year.

The writer was of course Jeff Belle of Amazon Publishing. And his words make one realize how late in the day all this foothills talk is. He does, however, quote Jeff Bezos as saying “It’s still Day One”. Yes, but late in the day on Day One!
