Dec 31
Simple Rules for New Years Blogging
Filed Under Artificial intelligence, Big Data, Blog, data analytics, Education, Industry Analysis, internet, machine learning, mobile content, news media, Uncategorized, Workflow
Apologies to those kind readers who expected an earlier interjection in December. Truth to tell, I was speechless. Caught somewhere between astonishment at my fellow countrymen’s mania for national self-harming, my own complete self-identification, historically, culturally and psychologically, as a “European”, and impatience with all the wise and honest Americans I know, who cannot collectively somehow re-enact the Emperor’s New Clothes nursery tale, there suddenly seemed nothing left to say worth saying, least of all around the topic of electronic information and digital society.
But then I returned to Nova Scotia again for the holidays, and in its clear, cold, sunny air it seems a dereliction of a blogger’s duty not to have a message at New Year. And by dint of looking over everyone’s shoulders, I see that Rule One of the New Year message is to make a recommendation, preferably to nominate something as the something of the Year. And as it happens I do have a Book of the Year for this information industry. Please read The Catalogue of Shipwrecked Books, by Edward Wilson-Lee. The inevitable pesky publisher’s subtitle in the US purports to sell it as a book about Christopher Columbus and his son, but the UK edition hits the point: it is about the attempt by Columbus’s son to build a universal library in Seville, getting royal patronage and setting up buying agents in the great early cities of print to create an early Internet Archive, making available a stream of knowledge as rich as the gold and silver of Peru and Mexico just then flowing into the royal coffers.
The attempt fails of course, but it does set off arguments about the nature of Knowledge which we need to keep having as we dimly perceive the arrival of the leading edge of knowledge products and solutions. And here comes Rule Two: Issue a Warning. And here is mine: refrain in 2019 from labelling everything you see as AI-sourced, AI-related or AI-derived. We are still at the Hernando Colón stage in building the universal knowledge base. Let’s save AI as a term for when AI arrives. Many people are doing really clever things, but these are at best embryonic knowledge products. We are really quite far away from new knowledge created in a machine-driven context without human intervention. Indeed, we are still a long way from getting enough information as metadata into a machine-understandable form, and when we do, we usually do not understand what we have done.
So here comes Rule Three: declare a News Story of the Year. And here is mine: the gracious acknowledgement by Google that their automated recruitment system, which analyses thousands of CVs to produce the best candidates, had a male bias built into it. And of course it did! Feed the past into an expert system and it replicates the flaws of the past. And it’s not that the systems doing the analytics are not clever; it’s just that the dumb data and the dumb documents are not as dumb as we think, and in fact they are larded with all of the mistakes we have ever made. And we need to know that before we evaluate the outcomes as Intelligent, or even believable.
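For the technically minded, here is a toy sketch of how this happens. It has nothing to do with Google’s actual system; every variable and number is invented for illustration. A classifier trained on historically biased hiring decisions simply learns to reproduce the bias:

```python
# Toy illustration only -- not Google's system. A classifier trained on
# historically biased hiring decisions learns to reproduce that bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
skill = rng.normal(size=n)               # genuine qualification signal
gender = rng.integers(0, 2, size=n)      # hypothetical encoding: 1 = male, 0 = female

# The "past": decisions that rewarded skill but also quietly favoured men.
hired = skill + 0.8 * gender + rng.normal(scale=0.5, size=n) > 0.5

model = LogisticRegression().fit(np.column_stack([skill, gender]), hired)

# Two otherwise identical CVs now get different scores:
print(model.predict_proba([[0.0, 1]])[0, 1])   # "male" candidate
print(model.predict_proba([[0.0, 0]])[0, 1])   # "female" candidate
```

The model never saw a rule saying “prefer men”; it inferred one, because the historical labels it learned from already contained it.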
And if we need to be careful about the nature of the information we are using, we need to deal in known quantities. Rule Four: try to make an insight. Mine concerns differentiating between data and documents. The other night, as one does on a cold and isolated coastline, we fell to discussing derivations. My wife produced her weightlifter’s copy of Merriam-Webster, and we got into derivations old-style. Datum, a neuter noun, always relates to single objects of an incontrovertible nature. Documentum carries the idea of learning throughout its history. When we talk about content-as-data, what do we really mean? And when we talk about AI, do we speak of Intelligence created by machines deriving knowledge from pure data, or of machines learning from knowledge available, fallacies and all, in order to postulate new knowledge? We do need to be clear about our outcomes, as derived from our inputs, or we will surely be disappointed by what happens next. We need to start listening very carefully to conversations about concept analysis, concept-based searching and conceptual analysis.
Which logically brings me to Rule Five: end with a prediction. Mine concerns a question I asked in several sessions at Frankfurt this year, and have had little but confusion as a result. My question was: “What proportion of your readership is machines, and what economic benefits does that readership bring to you?” I think machine readership will become much more important in 2019, as we seek to monetise it and as we seek to evaluate what content-in-context means in the context of analytical systems. So just as none of us knew how many machines were reading us this year, next year I think most of us will be aware, and will know whether those readers were just browsers, or bots, or knowledge harvesters.
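For publishers who want to start answering the question, a rough sketch of a first estimate, assuming only ordinary web-server logs in the common Apache/Nginx “combined” format (the filename and bot signatures here are my own inventions):

```python
# A rough, hypothetical sketch of the measurement problem: estimating the
# machine share of "readership" from web-server logs. Assumes the common
# "combined" log format, where the user-agent is the last quoted field.
# Disguised crawlers will be undercounted, so treat the result as a floor.
import re

BOT_PATTERN = re.compile(r"bot|crawler|spider|harvest|scrape", re.IGNORECASE)

def machine_share(log_lines):
    """Fraction of requests whose user-agent string looks machine-like."""
    total = machines = 0
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        user_agent = quoted[-1] if quoted else ""
        total += 1
        if BOT_PATTERN.search(user_agent):
            machines += 1
    return machines / total if total else 0.0

# Usage: with open("access.log") as f: print(machine_share(f))
```

Crude, certainly; but even a floor estimate would tell most publishers something they do not currently know about who, or what, is reading them.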
And then I notice there is a Rule Six. You end by wishing every kind reader who reaches this point a happy, healthy and prosperous New Year, which I do for all in 2019. After all, using my rule-based system this column could be written by a machine next year – and read by one too!
Nov 29
Data: Re-Use, Waste and Neglect
Filed Under B2B, Big Data, Blog, data analytics, Industry Analysis, internet, machine learning, Publishing, semantic web, social media, STM, Uncategorized
We live in fevered times. What happens at the top cascades. This must be the explanation for why revered colleagues like Richard Poynder and Kent Anderson are conducting Mueller-style enquiries into OA (Open Access). And they do make a splendidly contrasting pair of prosecutors, like some Alice in Wonderland trial where off-with-his-head is a paragraph summary, not a judgement. Richard (https://poynder.blogspot.com/2018/11/the-oa-interviews-frances-pinter.html) wants to get for-profit out of OA, presumably not planning to be around when the foundation money dries up and new technology investment is needed. Kent vigorously defends the right of academic authors to make money from their work for people other than themselves, and is busy, in the wonderful Geyser (thegeyser@substack.com), sniffing the dustbins of Zurich to find “collusion” between the Swiss Frontiers and the EU. Take a dash of Brexit, add some Trumpian bitters and the zest of rumour, shake well and pour into a scholarly-communications-sized glass. Perfect cocktail for the long winter nights. We should be grateful to them both.
But perhaps we should not be too distracted. For me, the month since I last blogged on Plan S, and got a full postbag of polite dissension, has been one of penitent reflection on the state of our new data-driven information marketplace as a whole. In the midst of this, Wellcome announced its Data Re-Use prize, which seems to me to exemplify much of the problem (https://wellcome.ac.uk/news/new-wellcome-data-re-use-prizes-help-unlock-value-research?utm_source=linkedin&utm_medium=o-wellcome&utm_campaign=). Our recognition of data has not properly moved on from our content years. The opportunities to merge, overlap, drill down through and mine together related data sets are huge. The ability to create new knowledge as a result has profound implications. But we are still on the nursery slopes when it comes to making real inroads into the issues, and while data and text mining techniques are evolving at speed, the licensing of access and the ownership of outcomes still pose real problems. We will not be a data-driven society until sector data sources have agreed protocols on these issues. Too much data behind paywalls creates ongoing issues for owners as well as users. Unexploited data is valueless.
It’s not as if we have collected all the data in the marketplace anyway. At this year’s NOAH conference in London at the beginning of the month I watched a trio of start-ups in the HR space present, and then realised that they were all using the same data, collected differently. There has to be an easier way of pooling data in our society, ensuring privacy protection but also aligning clean resources for re-use using different analytics and market targets to create different service entities. Let’s hope the Wellcome thinking is pervasive. But then my NOAH attention went elsewhere, as I found myself in a fascinating conversation about a project which is re-utilising a line of content-as-data that has been gratuitously ignored, and in scholarly communication, one of the best-ploughed fields on the data farm, at that.
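On that pooling point, a minimal sketch of one well-known approach, with every name and the salt invented for illustration: each party pseudonymises the shared identifier with the same pre-agreed salted hash, so records can be joined across datasets without the raw identities ever being pooled.

```python
# A minimal sketch of one pooling technique, under assumptions the post does
# not specify: parties hash the shared identifier with the same agreed
# secret, so records join on the pseudonym, never on the raw identity.
import hashlib

SHARED_SALT = b"agreed-between-parties"   # hypothetical pre-agreed secret

def pseudonymise(identifier: str) -> str:
    return hashlib.sha256(SHARED_SALT + identifier.encode("utf-8")).hexdigest()

# Two hypothetical data holders, each contributing records keyed by pseudonym:
hr_records = {pseudonymise("jane@example.com"): {"role": "engineer"}}
survey_records = {pseudonymise("jane@example.com"): {"satisfaction": 4}}

# Joinable on the pseudonym, with no raw email exchanged:
for key in hr_records.keys() & survey_records.keys():
    print({**hr_records[key], **survey_records[key]})
```

Not a complete privacy solution, of course, but it illustrates that the pooling problem is tractable once the parties agree a protocol.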
Morressier, co-founded in Berlin by Sami Benchekroun, with whom I had the conversation, is a startling example of the cross-over utility of neglected data. With Justus Weweler, Sami has concerned himself with the indicative data you would need to give evaluated progress reporting on early-stage science. Posters, conference agendas, seminar announcements, links to slide sets: Morressier is exploring the hinterland of emerging science, enabling researchers and funders to gauge how advanced work programmes are and how they can map the emerging terrain in which they work. Just when we imagined that every centimetre of the scholarly communication workflow had been fully covered, here comes a further chapter, full of real promise, whose angels include four of the smartest minds in scholarly information. Morressier.com is clearly one to watch.
And one to give us heart. There really are no sectors where data has been so thoroughly eked out that no further possibilities remain, especially of adding value through recombination with other data; in fact, in my daily rounds, I usually find that the opposite is true. Marketing feedback data is still often held aloof from service data, and few can get an object-based view of how data is being consumed. And if this is true at the micro level in terms of feedback, events companies have been particularly profligate with data collection, assessment and re-use. And while this is changing, it still does not have the priority it needs. Calling user data “exhaust” does not help: we need a catalytic converter to make it effective when used with other data in a different context.
When we have all the data and we are re-combining it effectively, we shall begin to see the real problems emerge. And they will not be the access and re-use issues of today, but the quality, disambiguation and “fake” data problems we are all beginning to experience now, and which will not go away. Industry co-operation will be even more needed, and some players will have to build a business model around quality control. The arrival of the data-driven marketplace is not a press release, but a complex and difficult birth process.