This may be the age of data, but the questions worth asking about the market viability of information service providers are no longer about content. They are about what you do to content-as-data as you seek to add value to it and turn it into some form of solution. So, in terms of Pope’s epigram, we could say that the proper study of Information Man is software. Data has never been more completely available. Admittedly, we have changed tack on the idea that we could collect all that we need, put it into a silo and search it; instead, in the age of big data, we prefer to take the programme to the data. Structured and unstructured. Larger collectively than anything tackled before the emergence of Google and Yahoo!, and then Facebook, and inspired by the data volumes thrown off by those services. And now we have Thomson Reuters and Reed Elsevier knee-deep in the data businesses, throwing up new ways of servicing data appropriate to the professional and business information user. So shall we in future judge the strategic leadership of B2B, STM, financial services or professional information companies by how well they understand which generation of which software to implement, and to what strategic effect in their marketplaces? I hope not, since I fear that, like me, they may be found wanting.

And clearly, having a CTO is not sufficient either if you do not know the right questions to ask him, or what the answers mean. In order to get more firmly into this area myself, I wrote a blog last month called “Big Data: Six of the Best”, in which I talked about a variety of approaches to Big Data issues. In media and information markets my first stop has always been MarkLogic, since working with them has taught me a great deal about how important the platform is, and how pulling together existing disparate services onto a common platform is often a critical first step. Anyone watching the London Olympics next month and using BBC Sport to navigate results and entries and schedules, with data, text and video, is looking at a classic MarkLogic 5 job (www.marklogic.com). But this is about scale internally, and about XML. In my six, I wanted to put alongside MarkLogic’s heavy-lifting capacities someone with a strong metadata management tradition, and a new entrant with exactly those characteristics is Pingar (www.pingar.com). Arguably, we tend to forget all the wonderful things we said about metadata a decade ago. From being the answer to every question, it became a very expensive pursuit, with changing user expectations and great difficulty in maintaining quality control, especially where authors created it themselves, and many information companies ended up fudging the issue.

So Pingar, who started in New Zealand before going global, appropriately started their tools environment somewhere else. Using the progress made in recent years in entity extraction and pattern matching, they have created tools to manage the automatic extraction of metadata at scale and speed. Working with large groups of documents (we are talking about up to 6 terabytes – not the “biggest” data, but large enough for very many of us), metadata development becomes a batch-processing function. The Pingar API effectively unlocks a toolbox of metadata management solutions, from tagging and organization at the levels of consistency that we all now need, to integration of the results with enterprise content management, with communications and with collaboration platforms. SharePoint connectivity will be important for many users, as will the ability to output into CRM tools. Users can import their own taxonomies effectively, though over time Pingar will build facilities to allow taxonomy development from scratch.
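To make this a little more concrete, here is a minimal sketch of what driving such a batch extraction service might look like from the developer’s side. It is illustrative only: the endpoint, parameter and response field names are hypothetical placeholders, not Pingar’s documented API, which anyone building against it would need to consult directly.

```python
import json
import urllib.request

API_BASE = "https://api.example-metadata-service.com/v1"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def extract_metadata(document_text, taxonomy_id=None):
    """Send one document for entity extraction and tagging (illustrative)."""
    payload = {"text": document_text}
    if taxonomy_id:
        # Tag against a taxonomy the customer has imported, as described above
        payload["taxonomy"] = taxonomy_id
    request = urllib.request.Request(
        API_BASE + "/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # e.g. {"entities": [...], "tags": [...]}

# Batch processing is then a loop over the document store, pushing the
# returned tags into SharePoint, a CRM system, or a retained metadata database.
```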

As members of the Pingar team talked me through this, two thoughts persisted. The first was the critical importance of metadata. Alongside Big Data, we will surely find that the fastest way to anything is searching metadata databases. They are not either/or; they are both/and. I am still stuck with the idea that however effective we make Big Data file searching, we will also need retained databases of metadata at every stage. And every time we need to move into some sort of ontology-based environment, the metadata and our taxonomy become critical elements in building out the system. Big Data as a fashion term must not blind us to the fact that we shall be building and extending and developing knowledge-based systems from now until infirmity (or whatever is the correct term for the condition that sparks the next great wave of software services development in 2018!).

And my other notion? If you are in New Zealand, you see global markets so much more clearly. Pingar went quickly into Japanese and Chinese, in order to service major clients there, and then into Spanish, French and Italian. Cross-linguistic effort is thus critical. Marc Andreessen is credited with the saying “software is eating the world” (which always reminds me of an early hero, William Cobbett, saying in the 1820s of rural depopulation through enclosures and grazing around the great heathland that now houses London’s greatest and slowest airport: “Here sheep do eat men”). I am coming to believe that Andreessen is right, and that Pingar is very representative of the best of what we should expect in our future diet.

Now that we are entering the post-competitive world (with a few exceptions!), it is worth pausing for a moment to consider how we are going to get all of the content together and create the sources of linked data which we shall need to fuel the service demand for data mining and data extraction. Of course, this is less of a problem if you are Thomson Reuters or Reed Elsevier. Many of the sources are relationships that you have had for a long time. Others can be acquired: reflect on the work put in by Complinet to source the regulatory framework for financial services prior to its acquisition by Thomson Reuters, and reflect that relatively little of this data is “owned” by the service provider. Then you can create expertise and scale in content sourcing, negotiating with government and agency sources, and forming third-party partnerships (as Lexis Risk Management did with Experian in the US). But what if you lack these resources, find that source development and licensing would create unacceptable costs, yet still feel under pressure to create solutions in your niche which reflect a very much wider data trawl than could be accomplished using your own proprietary content?

The answer to this will, perhaps, reflect developments already happening in the education sector. Services like Global Grid for Learning, or the TES Connect Resources which I have described in previous blogs, give users and third-party service developers (typically teachers’ centres or other “new publishers”) the ability to find quality content and re-use it, while collaborations like Safari and CourseSmart allow customization of existing textbook products. So what sort of collaborations would we expect to find in B2B or professional publishing which would provide the quarries from which solutions could be mined? They are few and far between, but, with real appreciation for the knowledge of Bastiaan Deblieck at TenForce in Belgium, I can tell you that they are coming.

Let’s first of all consider Factual Inc (www.factual.com). Here are impeccable credentials (Gil Elbaz, the founder, started Applied Semantics and worked at Google) and a VC-backed attempt to corner big datasets, apply linkage and develop APIs for individual applications. The target is the legion of mash-up developers and the technical departments of small and medium-sized players. Here is what they say about their data:

“Our data includes comprehensive Global Places data, with over 60MM entities in 50 countries, as well as deep dives in verticals such as U.S. Restaurants and U.S. Healthcare Providers. We are continually improving and adding to our data; feel free to explore and sign up to get started!

Factual aggregates data from many sources including partners, user community, and the web, and applies a sophisticated machine-learning technology stack to:

  1. Extract both unstructured and structured data from millions of sources
  2. Clean, standardize, and canonicalize the data
  3. Merge, de-dupe, and map entities across multiple sources.

We encourage our partners to provide edits and contributions back to the data ecosystem as a form of currency to reduce the overall transaction costs via exchange.”
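It is worth dwelling on the middle of that pipeline, because cleaning, canonicalising and de-duplicating entities drawn from many sources is where most of the value (and the cost) sits. A toy sketch of the idea – emphatically not Factual’s machine-learning stack – might normalise a match key and merge records on it:

```python
import re
from collections import defaultdict

def canonicalize(record):
    """Normalise the fields used for matching (lower-case, strip punctuation)."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = re.sub(r"\s+", " ", name).strip()
    postcode = record.get("postcode", "").replace(" ", "").upper()
    return {**record, "match_key": (name, postcode)}

def dedupe(records):
    """Merge records sharing a match key, keeping the fullest set of fields."""
    merged = defaultdict(dict)
    for rec in map(canonicalize, records):
        slot = merged[rec["match_key"]]
        for field, value in rec.items():
            if value and not slot.get(field):
                slot[field] = value
    return list(merged.values())

sources = [
    {"name": "Joe's Cafe Ltd.", "postcode": "ec1a 1bb", "phone": ""},
    {"name": "JOES CAFE LTD", "postcode": "EC1A 1BB", "phone": "020 7946 0000"},
]
print(dedupe(sources))  # one merged entity, with both a postcode and a phone
```

The hard part, of course, is doing this across millions of sources, where exact match keys give way to the probabilistic, machine-learned matching Factual describes.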

As mobile devices proliferate, this quarry is for the App trade, and here, in the opinion of Forbes (19 April 2012), is potentially another Google in the field of business intelligence (http://www.forbes.com/sites/danwoods/2012/04/19/how-factual-is-building-an-data-stack-for-business/2/).

But Los Angeles is not the only place where this thinking is maturing. Over in Iceland, now that the banking has gone, they are getting serious about data. DataMarket (http://datamarket.com), led by Hjalmar Gislason from a background of startups and developing new media for the telco in Iceland, offers a very competitive deal, also replete with API services and revenue sharing with re-users. Here is what they say about their data:

“DataMarket’s unique data portal – DataMarket.com – provides access to thousands of data sets holding hundreds of millions of facts and figures from a wide range of public and private data providers including the United Nations, the World Bank, Eurostat and the Economist Intelligence Unit. The portal allows all this data to be searched, visualized, compared and downloaded in a single place in a standard, unified manner.

DataMarket’s data publishing solutions allow data providers to easily publish their data on DataMarket.com and on their existing websites through embedded content and branded versions of DataMarket’s systems, enabling all the functionality of DataMarket.com on top of their own data collections.”
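The developer-facing proposition is, once again, a keyed API call returning published data as JSON, which can then be re-visualised, compared or embedded. The sketch below illustrates the pattern only; the endpoint and dataset identifier are hypothetical stand-ins, not DataMarket’s documented interface.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and identifiers, used only to illustrate the pattern.
API_BASE = "https://datamarket.example/api/v1/series.json"
params = {
    "ds": "example-dataset-id",  # a dataset a provider has published
    "apikey": "YOUR_API_KEY",    # keyed access underpins the revenue sharing
}

url = API_BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    series = json.load(response)

# The returned JSON can then be charted, compared with other series,
# or embedded back into a data provider's own website.
print(len(series.get("values", [])), "data points retrieved")
```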

And finally, in Europe we seem to take a more public-interest view of the issues. Anyway, a certain amount of impetus seems to have come from the Open Data Foundation, a not-for-profit with connections here which has helped to stimulate sites like OpenCharities, OpenSpending (how does your government spend your money?), and OpenlyLocal, designed to illuminate the dark corners of UK local and regional government. All of these sites have free data, available under a Creative Commons-style licence, but perhaps the most interesting, still in beta, is OpenCorporates. Claiming to have data on 42,165,863 companies (as of today) from 52 different jurisdictions, it is owned by Chrinon Ltd and run by Chris Taggart and Rob McKinnon, both of whom have long records in the open data field. This will be another site where the API service (as well as a Google Refine service) will earn the value-add revenues (http://api.opencorporates.com/). Much of the data is in XML, and this could form a vital source for some user- and publisher-generated value-add services. The site bears a recommendation from the EC Information Society Commissioner, Neelie Kroes, so we should also record that TenForce (http://www.tenforce.com/) are themselves leading players in the creation of the Commission’s major Open Data Portal, which will progressively turn all that “grey literature”, the dandruff of bureaucracy, back into applicable information held as data.
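For anyone wanting to try that API, a company search is the natural first call. The sketch below follows the search pattern publicised at api.opencorporates.com, but treat the exact path, parameters and response fields as assumptions to be checked against the current documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Simple company-name search against the OpenCorporates API referenced above.
query = urllib.parse.urlencode({"q": "Chrinon"})
url = "https://api.opencorporates.com/companies/search?" + query

with urllib.request.urlopen(url) as response:
    results = json.load(response)

# Assumed response shape: results -> companies -> company records.
for item in results.get("results", {}).get("companies", []):
    company = item.get("company", {})
    print(company.get("name"), company.get("jurisdiction_code"))
```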

We seem here to be at the start of a new movement, with a new range of intermediaries coming into existence to broker our content to third parties, and to enable us to get the licences and services we need to complete our own service developments. Of course, today we are describing start-ups: tomorrow we shall be wondering how we provided services and solutions without them.
