Jul 29
Intelligent Life in Big Data
Filed Under B2B, Blog, data protection, Financial services, healthcare, Industry Analysis, internet, privacy, Search, semantic web, Uncategorized, Workflow | 1 Comment
When a movement in this sector gets a name, it gets momentum. The classic is Web 2.0; until Tim O’Reilly coined the term, no one knew what to call the trends they had been following for years. Similarly with Big Data: now that we can see it in the room, we know what it is for and can approach it purposefully. And we know it is an elephant in this room, for no better reason than the fact that Doug Cutting called his management and sorting system for large, various, distributed, structured and unstructured data Hadoop – after his small son’s stuffed elephant. This open source environment, developed over the previous five years largely at Yahoo on the model of Google’s MapReduce, has now been spun out for commercialization as Hortonworks, named in tribute to the Dr Seuss elephant that Hadoop’s toy namesake really was. With me so far? Ten years of development since the early years of Google, resulting in lots of ways to agglomerate, cross-search and analyse very large collections of data of various types. Two elephants in the room (only really one), and it is Big Search that is leading the charge on Big Data.
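For readers who have not met MapReduce before, a toy sketch of the pattern may help. This is illustrative Python only, not Hadoop’s actual API (Hadoop runs this kind of job in Java across clusters of machines): mappers emit key and value pairs, a shuffle groups them by key, and reducers aggregate each group.

```python
# A minimal, illustrative sketch of the map/reduce pattern that Hadoop
# implements at scale. Not Hadoop's API: just the idea that mappers emit
# (key, value) pairs, a shuffle groups them by key, and reducers aggregate.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (word, 1) for every word in one document.
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Aggregate all values emitted for one key.
    return word, sum(counts)

def map_reduce(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)   # the "shuffle" step
    return dict(reduce_phase(k, v) for k, v in groups.items())

if __name__ == "__main__":
    docs = {"a": "big data big search", "b": "big elephants"}
    print(map_reduce(docs))
    # {'big': 3, 'data': 1, 'search': 1, 'elephants': 1}
```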
So what is Big Data? Apparently, data at such a scale that its very size is the first problem encountered in handling it. And why has it become an issue? Mostly because we want to distill real intelligence from searching vast tracts of stuff, regardless of how it is configured, but we do not necessarily want to go to the massive expense of putting it all together in one place with common structures and metadata – or ownership prevents us from doing this even if we could afford it. We have spent a decade refining and acquiring intelligent data mining tools (the purchase of ClearForest by Reuters, as it was then, first alerted me to the implications of this trend five years ago). Now we have to mine and extract, using our tools and inference rules and advanced taxonomic structures to find meaning where it could not be seen before. So in one sense Big Data is like reprocessing spoil heaps from primary mining operations: we had an original purpose in assembling discrete collections of data and using them for specific purposes. Now we are going to reprocess everything together to discover fresh relationships between data elements.
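As a very rough illustration of what mining with taxonomic structures means in practice, here is a deliberately naive Python sketch; the taxonomy and documents are invented, and real text-mining tools add linguistic analysis, disambiguation and inference rules far beyond simple term matching.

```python
# Illustrative only: a crude sketch of taxonomy-driven extraction.
# The taxonomy and the documents below are invented for the example.
TAXONOMY = {
    "aspirin": "drug", "ibuprofen": "drug",
    "london": "place", "berlin": "place",
    "acme corp": "company",
}

def extract_entities(text):
    """Return (term, concept) pairs found in free text."""
    lowered = text.lower()
    return [(term, concept) for term, concept in TAXONOMY.items()
            if term in lowered]

docs = [
    "Acme Corp trial of aspirin reported in London.",
    "Berlin regulators review ibuprofen labelling.",
]
for d in docs:
    print(extract_entities(d))
```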
What is very new about that? Nothing really. Even in this blog we discussed (https://www.davidworlock.com/2010/11/here-be-giants/) the Lexis insurance solution for risk management. We did it in workflow terms, but clearly what this is about is cross-searching and analysing data connections, where the US government files, insurers’ own client records, Experian’s data sets, and whatever Lexis holds in ChoicePoint and its own huge media archive, are the elements from which new intelligence is conjectured out of well-used data content. And it should be no surprise to see Lexis launching its own open source competitor to all those elephants, blandly named Lexis HPCC.
And they are right so to do. For the pace is quickening, and all around us people who can count to five are beginning to realize what the dramatic effects of adding three data sources together might be. July began with WPP launching a new company called Xaxis (http://www.xaxis.com/uk/). This operation will pool social networking content and mobile phone and interactive TV data with purchasing and financial services content, and with geolocational and demographic data. Most of this is readily available without breaking even European data regulations (though it will force a number of players to reinforce their opt-in provisos). Coverage will be widespread in Europe, North America and Australasia. The initial target is 500 million individuals, including the entire population of the UK. The objective is better ad targeting: “Xaxis streamlines and improves advertisers’ ability to directly target specific audiences, at scale and at lower cost than any other audience-buying solution”, says its CEO. By the end of the month 13 British MPs had signed a motion opposing the venture on privacy grounds (maybe they thought of it as the poor man’s phone hacking!).
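To see why adding data sources together is so powerful, and so sensitive, here is a schematic Python sketch with entirely invented data of the kind of join an audience-buying platform performs; Xaxis’s actual systems are of course far more elaborate and not public.

```python
# Schematic only: joining invented data sources on a shared identifier
# to build an advertising audience segment.
social = {"u1": {"interests": ["cycling"]}, "u2": {"interests": ["golf"]}}
purchases = {"u1": {"recent": ["bike lights"]}, "u3": {"recent": ["clubs"]}}
location = {"u1": {"city": "Leeds"}, "u2": {"city": "Bath"}}

def build_segment(interest, city=None):
    """Collect users who match an interest (and optionally a city)."""
    segment = []
    for uid, profile in social.items():
        if interest not in profile["interests"]:
            continue
        if city and location.get(uid, {}).get("city") != city:
            continue
        segment.append({
            "user": uid,
            "purchases": purchases.get(uid, {}).get("recent", []),
        })
    return segment

print(build_segment("cycling", city="Leeds"))
# [{'user': 'u1', 'purchases': ['bike lights']}]
```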
And by the end of the month Google had announced a new collaboration with SAP (http://www.sap.com/about-sap/newsroom/press-releases/press.epx?pressid=17358) to accomplish “the intuitive overlay of enterprise data onto maps to fuel better business decisions”. SAP is enhancing its analytics packages to handle the content needed to populate locational displays: the imagined scenarios here are hardly revolutionary, but the impact is immense. SAP envisages telco players analysing dropped calls to locate a faulty tower, risk management for mortgage lenders, or the overlaying of census data. DMGT’s revolutionary environmental risk search engine, Landmark, was doing this to historical land use data 15 years ago. What has changed is speed to solution, scale of operation, and the availability of data filing engines, data discovery schema and advanced analytics, leading to quicker and cheaper solutions.
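The dropped-call scenario is easy to sketch. Below is a minimal Python illustration, with invented data and nothing of SAP’s or Google’s actual APIs, of aggregating call records by tower so that the worst performer can be flagged and plotted on a map overlay.

```python
# Minimal illustration with invented data: find the cell tower with the
# highest dropped-call rate so it can be flagged on a map overlay.
from collections import Counter

calls = [  # (tower_id, lat, lon, dropped)
    ("T1", 51.50, -0.12, True), ("T1", 51.50, -0.12, False),
    ("T2", 53.48, -2.24, True), ("T2", 53.48, -2.24, True),
    ("T2", 53.48, -2.24, False),
]

totals, drops, coords = Counter(), Counter(), {}
for tower, lat, lon, dropped in calls:
    totals[tower] += 1
    drops[tower] += int(dropped)
    coords[tower] = (lat, lon)

rates = {t: drops[t] / totals[t] for t in totals}
worst = max(rates, key=rates.get)
print(f"Suspect tower {worst} at {coords[worst]}: {rates[worst]:.0%} dropped")
```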
In one sense these moves link to this blog’s perennial concern with workflow and the way content is used within it, and within corporate and commercial life. In another they push forward the debate on Linked Data and the world of semantic analysis that we are now approaching. But my conclusion in the meanwhile is that while Big Data is a typically faddish information market concern, it should be very clearly within the ambit of each of us who looks to understand the way in which information services and their relevance to users are beginning to unfold. As we go along, we shall rediscover that data has many forms, and that at present we are mostly dealing only with “people and places” information. Evidential data, as in science research, poses other challenges. Workflow concentrations, such as Thomson Reuters are currently building into their GRC environments, raise still more issues about relationships. For the moment we should welcome Big Data as a concept that needed to surface, while not losing sight of its antecedents and the lessons they teach.
May 13
Stroking the Winter Palace
Filed Under Blog, eBook, Industry Analysis, internet, Publishing, Reed Elsevier, Search, semantic web, STM, Workflow | 1 Comment
Which is to say, we haven’t exactly been storming it in St Petersburg this week. There is little that is revolutionary about any conference of librarians, publishers and academics, and the 13th meeting of the Fiesole Collection Development Retreat does not sound like a caucus of anarchists. But then, pause for a moment. We have been 60 or so people of mixed disciplines, an offshoot of the famous Charleston librarians’ meetings, using a few days in a brilliant city to cogitate on the future needs of research and the roles that librarians and publishers may play in meeting them. Anyone afraid of the argument would not come here, but would go instead to specialist meetings of their own ilk.
Youngsuk Chi, surely the most persuasive and diplomatic representative of the publishing sector, keynoted in his role as CEO of Elsevier Science and Technology. So we were able to begin with the schizophrenia of our times: on one side the majestic power of the largest player in the sector, dedicated both to the highest standards of journal quality and to the maintenance of the peer-reviewed standards of science; on the other, huge and justified investment in building solutions for scientists (ScienceDirect, Scopus, Scirus, SciVerse, SciVal…) without the ability to make those solutions universal (by including all of the science needed to produce “answers”).
I was taught on entry into publishing that STM was only sustainable through the support of the twin pillars of copyright and peer review. This week those two rocked a little in response to the earth tremors shaking scholarship and science. We reviewed Open Access all right, but it now seems a tabby cat rather than the lion whose roar was going to reassert the power of researchers (and librarians) over scholarly collections. The real force changing copyright is the emergence of licensing and contract systems in the network which embed ownership but defuse the questions surrounding situational usage. And the real force changing peer review is the anxiety in all quarters to discover more and better metrics which demonstrate not just the judgement of peers but the actual usage of scholars, the durability of scholarship, and the impact of an article rather than of the journal in which it appeared.
And it is clearly game over for the Open Access debate. The repetitive arguments of a decade have lost their freshness, and the wise heads see that a proportion of publishing in most sectors will be Open Access, much of it controlled by existing publishers like Springer, who showed the intensity of their thinking here. But does it matter if this is 15% of output in History rising to 30% in Physics? It is a mixed economy, and my guess is that the norm will be around 15% across the board, which makes me personally feel very comfortable when I review the EPS prognosis of 2002! A few other ideas are going out with the junk as well – why, for example, did we ever get so excited about the institutional repository?
So where are the Big Ideas now? Two recurrent themes from speakers resonated with me throughout the event: we now move forward to the days of Big Data and Complete Solutions. As I listened to speakers referring to the need to put experimental data findings in places where they are available and searchable, I recalled Timo Hannay, now running Digital Science, and his early work on Signalling Gateway. What if the article is, in some disciplines, not the ultimate record? What if the findings, the analytical tools and the underlying data, with citations added for “referenceability”, form the corpus of knowledge in a particular sector? What if the requirement is to cross-search all of this content, regardless of format or mark-up, in conjunction with other unstructured data? And to use other software tools to test earlier findings? And what if, in these sectors, no one can pause long enough to write a 10,000-word article with seven pages of text, three photos and a graph?
And where does this data come from? Well, it is already there. It is experimental, of course, but it is also observational. It is derived from surveillance and monitoring. It arises in sequencing, in scanning and in imaging. It can be qualitative as well as quantitative, it derives from texts as well as multimedia, and it is held as ontologies and taxonomies as well as in the complex metadata which describe and relate data items. Go and take a look at the earth sciences platform, www.pangea.de, or at the consortium work at www.datacite.org, to see the semantic web come into its own. And this raises other questions, like who will organize all of this Cloud-related content – librarians, or publishers, or both, or new classes of researchers dedicated to data curation and integration? We learnt that 45% of libraries say that they provide primary data curation, and 90% of publishers say that they provide it, but the anecdotal evidence is that few do it well and most do no more than pay lip service to the requirement.
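To make the curation point concrete, a citable data set needs at minimum the kind of descriptive record sketched below. This is a simplified, hedged illustration in Python, not the actual DataCite schema, though the core fields (identifier, creators, title, publisher, year) are of the same kind that consortia such as DataCite register against a DOI; all names and values here are invented.

```python
# A hedged, simplified sketch of the descriptive record a curated,
# citable data set needs. Field names are illustrative, not the exact
# DataCite schema, and the DOIs and values are invented.
dataset_record = {
    "identifier": "10.1234/example-doi",              # invented DOI
    "creators": ["Example, A.", "Example, B."],
    "title": "Sea surface temperature observations, 2001-2010",
    "publisher": "Example Data Centre",
    "publication_year": 2011,
    "resource_type": "Dataset",
    "subjects": ["oceanography", "temperature"],      # taxonomy terms
    "related_articles": ["10.5678/example-article"],  # invented citation link
}

def format_citation(rec):
    """Render a simple human-readable data citation from the record."""
    authors = "; ".join(rec["creators"])
    return (f"{authors} ({rec['publication_year']}). {rec['title']}. "
            f"{rec['publisher']}. doi:{rec['identifier']}")

print(format_citation(dataset_record))
```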
Of course, some people are doing genuinely new things (John Dove of Credo, with his interlinking reference tools – www.credoreference.com – for undergraduate learning, would be a good example: he also taught us how to do a Pecha Kucha in 6 minutes and 20 slides!). But it is at least observable that the content handlers and the curators are still obsessed by content, while workflow solutions integrate content but are not of themselves content vehicles. My example would be regulatory and ethical compliance in research programmes. The content reference will be considerable, but the “solution” which creates the productivity, improves lab decision-making and reduces the cost of the regulatory burden will not be expressed in terms of articles discovered. Long years ago I was told that most article searching (as much as 70%, it was alleged) was undertaken to “prove” experimental methodology, to validate research procedures and to ensure that methods now being implemented aligned with ones already demonstrated to have passed health and safety strictures. Yet in our funny misshapen world no specialist research environment seems to exist to search and compare this facet, though services like www.BioRAFT.com are addressing the specific health and safety needs.
Summing up the meeting, we were pointed back to the role of the Web as change agent. “It’s the Web, Stupid!” Quite so. Or rather, it’s not really the Web, is it? It’s the internet. We are now beyond the role of the Web as the reference and searching environment, and back down into the basement of the Internet as the communications world between researchers, supported by the ancillary industries derived from library and publishing skills, moves into a new phase of its networked existence. It takes meetings with equal numbers of academics, librarians and publishers to provide space to think these thoughts. Becky Lenzini and her tireless Charleston colleagues have now delivered a further, 13th, episode in this exercise in the recalibration of expectations, and deserve everyone’s gratitude for doing so. And the sun shone in St Petersburg in the month of the White Nights, which would have made any storming of the Winter Palace a bit obvious anyway.