So have we all got it now? When our industry (the information services and content provision businesses, sometimes erroneously known as the data industry) started talking about something called Big Data, it was self-consciously re-inventing something that Big Science and Big Government had known about and practised for years. Known about and practised (especially in Big Secret Service; for SIGINT see the note at the foot of this article), but worked upon in a "finding a needle in a haystack" context. The importance of this only revealed itself when I found myself at a UK Government Science and Technology Facilities Council event at the Daresbury Laboratory in the north of England earlier this month. I went because my friends at MarkLogic were one of the sponsors, and spending a day with 70 or so research scientists gives more insight into customer behaviour than going to any great STM conference you may care to name. I went because you cannot see the centre until you get to the edge, and sitting amongst perfectly normal folk who spoke of computing in yottaflops (10 to the power of 24 floating-point operations per second) as if they were sitting in a laundromat watching the wash go round is fairly edgy for me.

We (they) spoke of data in terms of Volume, Velocity and Variety, sourced from the full gamut of output from sensor to social. And we (I) learnt a lot about problems of storage which went well beyond those of a Google or a Facebook. The first speaker, from the University of Illinois, at least came from my world: Kalev Leetaru is an expert in text analytics and a member of the Heartbeat of the World Project team. The Great Twitter Heartbeat ingests Twitter traffic, sorts and codes it so that US citizens going to vote, or Hurricane Sandy respondents, can appear as geographical heatmaps trending in seconds across the geography of the USA. The SGI UV which did this work (it can ingest the printed resources of the Library of Congress in 3 seconds) linked him to the last speaker, the luminous Dr Eng Lim Goh, SVP and CTO at SGI, who gave a magnificent tour d'horizon of current computing science. His YouTube videos are as wonderful as the man himself (a good example is his 70th birthday address to Stephen Hawking, his teacher, but also look at http://www.youtube.com/watch?v=zs6Add_-BKY). And he focussed us all on a topic not publicly addressed by the information industry as a whole: the immense distance we have travelled from "needle in a haystack" searching to our current preoccupation with analysing the differences between two pieces of hay, and mapping the rest of the haystack in terms of those differences. For Dr Goh this resolves to the difference between arranging stored data as a cluster of nodes and working in shared memory (he spoke of 16-terabyte supernodes). As the man with the very big machine, his problems lie in energy consumption as much as anything else. In a process whose workflow runs Ingest > Store and Organize > Analytics > Visualize (in text and graphics, like the heatmaps), the information service players seem to me to be involved at every point, not just the front end.
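To make that workflow concrete, here is a minimal, purely illustrative Python sketch of the Ingest > Store and Organize > Analytics > Visualize chain, using a tiny hard-coded sample in place of a live Twitter feed; the field names and sentiment scores are my own assumptions, not the Heartbeat project's or SGI's actual pipeline.

```python
from collections import defaultdict

# Ingest: each record stands in for a geotagged, sentiment-scored tweet (hypothetical fields).
sample_tweets = [
    {"state": "NY", "sentiment": 1},
    {"state": "NY", "sentiment": -1},
    {"state": "IL", "sentiment": 1},
    {"state": "CA", "sentiment": 1},
    {"state": "CA", "sentiment": 1},
]

# Store and organize: group the incoming stream by state.
by_state = defaultdict(list)
for tweet in sample_tweets:
    by_state[tweet["state"]].append(tweet["sentiment"])

# Analytics: compute a simple per-state average sentiment.
heat = {state: sum(scores) / len(scores) for state, scores in by_state.items()}

# Visualize: print a crude text "heatmap", most positive states first.
for state, score in sorted(heat.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {'#' * int((score + 1) * 5):<10} {score:+.2f}")
```

The real systems do this at yottaflop-class scale and render proper geographic heatmaps, but the shape of the pipeline is the same.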

The largest data sourcing project on the planet was represented in the room (the SKA, or Square Kilometre Array, is a remote-sensing telemetry experiment with major sites in Australia and South Africa). Of course, NASA is up there with the big players, and so are the major participants in cancer research and human genomics. But I was surprised by how Big the Big Data held by WETA Data (look at all the revolutionary special effects research at http://www.wetafx.co.nz/research) in New Zealand was, until I realised that this is a major film archive (and NBA Entertainment is up there too on the data A-list). This reflects the intensity of data stored from film frame images and their associated metadata, now multiplied many times over in computer graphics-driven production. But maybe it is time now to stop talking about Big Data, the term which has enabled us to open up this discussion, and begin to reflect that everyone is a potential Big Data player. However small our core data holding may be compared to these mighty ingestors, if we put proprietary data alongside publicly sourced Open Data and customer-supplied third party data, then even very small players can experience the problems that induced the Big Data fad. Credit Benchmark, which I mentioned two weeks ago, has little data of its own: everything will be built from third party data. The great news aggregators face similar data concentration issues, since their data has to be matched with third party data.

And I was still thinking this through when news came of an agreement signed by MarkLogic (www.marklogic.com) with Dow Jones on behalf of News International this week. The story was covered in interesting depth at http://semanticweb.com/with-marklogic-search-technology-factiva-enables-standardized-search-and-improved-experiences-across-dow-jones-digital-network_b33988 but the element that interested me, and which highlights the theme of this note, concerns the requirement not just to find the right article but to compare articles and demonstrate relevance in a way which only a few years ago would have left us gasping. Improved taxonomic control, better ontologies and more effective search across structured and unstructured data lie at the root of this, of course, but do not forget that good results at Factiva now depend on effective Twitter and blog retrieval, and effective ways of pulling back more and more video content, starting with YouTube. The variety of forms takes us well beyond the good old days of newsprint, and underlines the fact that we are all Big Data players now.
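As a rough illustration of what "comparing articles" rather than merely retrieving them involves, here is a minimal bag-of-words cosine similarity sketch in Python. It is a toy under my own assumptions, not the MarkLogic or Factiva technology, which rests on the taxonomies, ontologies and mixed structured/unstructured search described above.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 score for how similar two documents are, word for word."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two hypothetical article snippets: a high score suggests they cover the same story.
print(cosine_similarity(
    "hurricane sandy relief efforts across new york",
    "new york relief efforts after hurricane sandy",
))
```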

Note: Alfred Rolington, formerly CEO at Jane's, will publish a long-awaited book with OUP on "Strategic Intelligence in the Twenty-First Century" in January, which can be pre-ordered on Amazon at http://www.amazon.co.uk/Strategic-Intelligence-21st-Century-Mosaic/dp/0199654328/ref=sr_1_1?s=books&ie=UTF8&qid=1355519331&sr=1-1. And I should declare, as usual, that I do work from time to time with the MarkLogic team, and thank them for all they have done to try to educate me.

The best network marketplace ideas are simple. And inexpensive in terms of user adoption. And productivity enhancing. And regulator pleasing. And very, very clever. So we need to give Credit Benchmark, the next business created by Mark Faulkner and Donal Smith, who successfully sold DataExplorers to Markit earlier this year, a double-starred AAA for ticking all these boxes from the start. And doing so in the white-hot heat of critical market and regulatory attention currently being focused on the three great ratings businesses: S&P, Moody's and Fitch. Here is a sample from the US (taken from BIIA News, the best source of industry summary these days, at www.biia.com):

“Without specifying names, the U.S. regulator said on Nov. 15 that ratings agencies in the country experienced problems such as the failure to follow policies, keep records, and disclose conflicts of interest. Moody’s and Standard & Poor’s Corp. accounted for around 83% of all credit ratings, the SEC said. Each of the larger agencies did not appear to follow their policies in determining certain credit ratings, the SEC found, among other things. The regulator also said all the agencies could strengthen their internal supervisory controls.

The SEC noted that Moody's has 128 credit analyst supervisors and 1,124 credit analysts, in contrast with S&P's 244 supervisors and 1,172 credit analysts. The regulator also examined the function of board supervision at ratings agencies, and implied in its report that directors should be "generally involved" in oversight, make records of their recommendations to managers, and follow corporate codes of conduct." (Source: Seeking Alpha)

Well, in a global financial crisis, someone had to be to blame. It was the credit rating agencies who let us all down! The French government and the EU have them in their sights. They have a business worth some $5 billion with excellent margins (up to 50% in some instances). They are still growing by some 20% per annum because they are a regulatory necessity. They have become a natural target for disruptive innovation, and small wonder, because this combination of success and embedded market positioning attracts anger and envy in equal parts. Yet no one, least of all the critical regulators, wants disruptive change. It is easy enough to point to the problems of the current system, illustrate the conflicts inherent in the issuer-pays model, bemoan the diminished credibility of the ratings, or criticize the way in which multiple-notch revisions can suddenly bring crisis recognition where steady alerting over a period of time would have been more useful, but at present no one has a better mousetrap.

At this point look to Credit Benchmark (http://creditbenchmark.org/about-us). Having successfully persuaded the marketplace, and especially the hedge funds, to contribute data on equity loans to a common market information service at DataExplorers (a prime example of UGC, user-generated content, more normally seen in less fevered and more prosaic market contexts), the team there have a prize quality to bring to the marketplace. They have been once, and can be again, a trusted intermediary for handling hugely sensitive content in a common framework which allows value to be released to the contributors, which gives regulators and users better market information, and which does not disadvantage any of the contributors in their trading activities. So what happens when we apply the DataExplorers principle to credit rating? All of a sudden there is the possibility of investment banks and other financial services firms sharing their own ratings and research via a neutral third party. At present the combined weight of the banks' own research, in manpower terms, dwarfs the publicly available services: there are perhaps as many as 8,000 credit analysts at work in the banks in this sector globally, covering some 74% of the risks. If all members of the data-sharing group were able to chart their own position on risks in relation to the way in which their colleagues elsewhere across a very competitive industry rate the same risk using the same data, in other words show the consensus, show their own position and indicate the outliers, then the misinformation risk is reduced while the emphasis on judgement in investment is increased.
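A minimal sketch, assuming a simple numeric rating scale and a purely hypothetical set of contributors, of how a neutral intermediary might return consensus-and-outlier feedback to each bank without exposing anyone else's submissions; this is my own illustration, not Credit Benchmark's actual methodology.

```python
from statistics import median, pstdev

# Internal ratings for one obligor, mapped to a numeric scale (e.g. AAA=1 ... CCC=17),
# keyed by an anonymised contributor id. All values here are invented.
contributed_ratings = {"bank_a": 7, "bank_b": 8, "bank_c": 7, "bank_d": 12}

consensus = median(contributed_ratings.values())
spread = pstdev(contributed_ratings.values())

def feedback_for(contributor: str) -> dict:
    """What one contributor gets back: the consensus, its own gap from it, an outlier flag."""
    gap = contributed_ratings[contributor] - consensus
    return {
        "consensus": consensus,
        "dispersion": round(spread, 2),
        "own_gap": gap,
        "outlier": abs(gap) > 2 * spread if spread else False,
    }

print(feedback_for("bank_d"))  # bank_d sits well above consensus and is flagged
```

In practice the aggregation, anonymisation and outlier rules would be far more sophisticated, but the principle is the one described above: each contributor learns where it stands against the consensus without seeing anyone else's book.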

And of course the Big Three credit agencies would still be there, and would still retain their "external" value, though maybe their growth might be dented and their ability to force up prices diminished if there were a greater plurality of information in the marketplace, and if banks and investors were not so wholly reliant upon them. The direction in which Credit Benchmark seem to be heading is also markedly aligned with the networked world of financial services: user-generated content; data analytics in a "Big Data" context; the intermediary owning the analysis and the service value, but not the underlying data; the users perpetually refreshing the environment with new information at near real-time update. And these are not just internet business characteristics: they also reflect values that regulators want to see in systems that produce better-informed results. A good outcome of Credit Benchmark's contributory data model would be better visibility into thematic trends for investment instrument issuers and their advisors, as well as sharper perception and ongoing monitoring of their own, their clients' and their peers' ratings. In market risk management terms, regulators will be better satisfied if players in the market are seen to be benchmarking effectively, and analysts and researchers who want to track the direction and volatility of ratings at issuer, instrument, sector or regional level will have a hugely improved resource. And something else will become clear as well: the spread of risk, and where consensus and disagreement lie. Both issuers and owners get a major capital injection of that magic ingredient: risk-reducing information.

None of this will happen overnight. Credit Benchmark are currently working on proof of concept with a group of major investment banks, and the data analytics demand (in a marketplace which is not short of innovative analytical software at present) has yet to be fully analysed. Yet money markets are the purest exemplars of information theory and practice, and it would be satisfying to be able to report that one outcome of the global recession had been vast improvements in the efficacy of risk management and the credit rating of investments. Indeed, in this blog in this year alone we have reported on crowd-sourcing and behavioural analysis for small personal loans (Kreditech), open data modelling for corporate credit (Duedil) and now, with Credit Benchmark, UGC and Big Data for investment rating. These are indicators, should we need them, of an industrial revolution in information as a source of certainty and risk reduction. Markets may never (hopefully) be the same again.
