Mar
9
Show me your provenance!
When I was forced to cease blogging temporarily a few years ago (see the personal note below), AI was already a fact of life. Every year we saw improvements in the use of increasingly sophisticated algorithms. We noted the rise and rise of robotic process automation. Those of us with two decades of industry memories recalled expert systems and neural networks. Those of us with four decades remembered hearing Marvin Minsky at MIT, telling us that he wanted the books in our libraries to speak to each other, to exchange and update knowledge and to build new knowledge out of that exchange. Yet nothing here prepared us for 2023.
When historians get last year into some perspective, they will probably conclude that what happened owed as much to the content creation requirements of online advertising, or to the financial services sector’s requirement for a new wave of Silicon Valley investment frenzy, as it did to a breakthrough in AI capabilities. Yet what actually happened last year, even without such a perspective, is truly amazing. The year installed AI as a key component of any strategic planning exercise in almost any commercial activity. Hyper-investment, and the hyperactivity resulting from it, produced tools in generative AI which, a mere year later, had become immensely more powerful. Compare ChatGPT 3 to the current iteration of Gemini: a context window of 122,000 tokens to one of 1 million. Then look at the public recognition factor, and you find a world in which there is now a normal expectation that machine intelligence and machine interoperability will be a part of everyone’s everyday life. It is as if a switch has been flicked on, illuminating a new room into which we have walked for the first time. All of us know that we can never now go back through that door or switch off that light. Pandora’s Law.
And we should not want to go back, either. What has happened should simply remind us that change does not happen evenly, and that the realisation of change sometimes takes longer to happen than we anticipate. But in 2023 I detected something else as well: a fear of change that was a little beyond normal anxiety. In the world in which I have worked for over 50 years, the idea that content creation through the exercise of machine intelligence could be more threatening than beneficial gained a powerful currency and soon turned into dystopian editorials in both trade and consumer media. As a result, we have come out of 2023, the year of AI megahype, with both an enhanced view of the speed and power with which machine intelligence will help, support, and change our society, and a hysterical fear of evils unknown which may result from quantum computers secretly plotting our downfall on the network. Since the invention of the wheel, mankind has been learning to accommodate and live with the machine, and we shall surely do so in the world of AI as well. Yet, in the clan to which I belong, the data, services and solutions vendors who called themselves content companies and information providers a few years back (and before that described themselves, in Gutenberg terms, as publishers), there has been fear of a different sort. Whether it meant anything or not, they have always embraced the consolation of copyright, the belief that intellectual property can be described and identified and protected, as one of the bulwarks of their commercial viability. The idea that individual creativity could be mirrored by machine intelligence, or that the machine might regurgitate, in whole or in part, content acquired as part of training data, or that the value of content or data once described as “proprietary” could be lost in the machine intelligence age: these ideas are the very stuff of panic.
Then add to them the knowledge that machine intelligence can produce “hallucinations”, that some related answers may not always be accurate and correct, and the long-held belief that machines loaded with garbage do indeed produce rubbish, and we find integrity fears added alongside fears of theft and diminishing valuations.
One of my mentors of many years ago, recommending me to a potential client, commented that “while generally sound on strategy he can be unreliable on copyright”. I have over the years tried to be better behaved, but it is difficult because it takes so long to bring the heavy guns of copyright law to bear on problems that have usually departed long before adequate legislation is available to control them. Early regulation on AI, like the EU AI Act, seems, in any case, more bent on risk control than anything else.
While the copyright lawyers are anxiously seeking re-regulation for a machine age, I for one would take the arguments much more seriously if copyright holders paid real attention to marking their works with appropriate metadata and PIDs that indicated ownership and provenance. It is hard to imagine machine-interoperable checking of the copyright status of works if those same works are not identified in ways that machines can recognise and understand. Once they are, it becomes more possible to put pressure on AI developers to ensure that they license the genuine article, recognise the credentials of the real thing publicly, and increase the integrity of their solutions by showing users that only the real thing was used in the construction of the outcomes desired. This is beginning to happen in some encouraging ways: the fact that both Google and OpenAI now accept C2PA, the coding system developed for images and videos, shows what can be done by persuading people that being licit and responsible is good for business. Rather than have “fake” hung round their necks, it is better to say that you will check and code every image that you use, especially in an American election year. In text and data there are similar emerging conventions. The ISCC (International Standard Content Code) is now a draft ISO standard. The long-established GO FAIR provisions of the FAIR Foundation create metadata standards that render data “findable, accessible, interoperable and reusable”. Data and content owners who make it clear to interested parties and machines what the scope and ownership of their asset entails have a much better chance of working successfully with it in this new world. And in particular, they have a better chance of entering into proper and satisfactory licensing agreements around it.
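To make the point about machine-checkable ownership concrete, here is a minimal sketch in Python of the kind of test a machine could run before content is licensed or ingested. Everything in it is an assumption made for illustration: the `WorkRecord` fields, the `REQUIRED_FIELDS` list and the function names are hypothetical, and this is not the C2PA, ISCC or FAIR data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkRecord:
    """Hypothetical metadata record for a single work (illustration only)."""
    title: str
    pid: Optional[str] = None            # persistent identifier, e.g. a DOI
    rights_holder: Optional[str] = None  # who owns the work
    licence: Optional[str] = None        # terms under which it may be used
    source: Optional[str] = None         # where the work came from


# Fields a machine would need before it could verify ownership and provenance.
# This list is an assumption, not taken from any standard.
REQUIRED_FIELDS = ("pid", "rights_holder", "licence", "source")


def missing_provenance(record: WorkRecord) -> List[str]:
    """Return the names of the required fields that are empty or absent."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def is_licensable(record: WorkRecord) -> bool:
    """A work can be checked by machine only if nothing required is missing."""
    return not missing_provenance(record)
```

A work that arrives with only a title fails on all four fields, which is the situation described above: the machine has nothing to check against.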
If we are able to persuade the machine intelligence world that integrity is vital to business success, then we have a far better chance of creating the sort of licensing environments that pioneers like the Copyright Clearance Center have advocated and piloted for years. Businesses in the network have to make for themselves the business conditions that work in the network.
So who will police and patrol all of this until law and regulation finally catch up, if they ever do? The publisher and copyright lawyer Charles Clark, my fellow delegate to the European Commission Legal Information Observatory, invented the maxim “the answer to the machine lies in the machine”. It was never better applied than at this point. If you want to find bias in machine intelligence, then the simplest way to do so is programmatically. If you wish to know whether training data has been derived from legitimate known sources that will vouch for accuracy and currency, ask the machine to interrogate the machine. For the AI companies, the price of reputation may be breaking open the black box and demonstrating good practice in creating answers from the very best inputs.
PERSONAL NOTE: I maintained this blog continuously from 2009 to 2021. I suffered eyesight problems which have left me with some 40% of my vision. My road back to this form of communication has taken three years, during which I’ve had the huge pleasure of writing two books, drafting a third and eventually returning to blogging. Writing in the world of text-to-speech and speech-to-text software is different. As I say at the end of all of my communications at work: “If you find errors of syntax, grammar or spelling in what I’ve written, please remember that it is much harder for me to edit than ever before, so try to smile indulgently. On the other hand, if you think that I have written utter gibberish, please contact me immediately!”
Sep
21
CCC – FAIR Foundation Forum
“The evolving role of DATA in the AI era”
18 September 2023 Leiden
“If we regulate AI and get it wrong, then the future of AI belongs to the Chinese”. When you hear a really challenging statement within five minutes of getting through the door, then you know that, in terms of conferences and seminars, you are in the right place at the right time. The seminar leaders, supported by the remarkable range of expertise displayed by the speakers, provided a small group with wide data experience with exactly the antidote needed to the last nine months of generative AI hype: a cold, clean, refreshing glass of reality. It was time to stop reading press releases and start thinking for ourselves.
FAIR’s leadership, committed to a world where DATA is findable, accessible, interoperable, and reusable, began the debate at the requisite point. While it is satisfying that 40% of scientists know about FAIR and what it stands for, why is it that when we communicate the findings of science and the claims and assertions which result from experimentation, we produce old-style narratives for human consumption rather than, as a priority, creating data in formats and structures which machines can use, communicate and interact with? After all, we are long past the point where human beings could master the daily flows of new information in most research domains: only in a machine intelligence world can we hope to deploy what we know is known in order to create new levels of insight and value.
So do we need to reinvent publishing? The mood in the room was much more in favour of enabling publishers and researchers to live and work in a world where the vital elements of the data that they handled were machine actionable. Discussion of the FAIR enabling resources and of FAIR Digital Objects gave substance to this. The emphasis was on accountability and consistency in a world where the data stays where it is, and we use it by visiting it. Consistency and standardisation therefore become important if we are not to find a silo with the door locked when we arrive. It was important then to think about DATA being FAIR “by design” and to think of FAIRification as a normal workflow process.
If we imagine that by enabling better machine-to-machine communication with more consistency we will improve AI accuracy and derive benefits in cost and time terms, then we are probably right. If we think that we are going to reduce mistakes and errors, or eliminate “hallucinations”, then we need to be careful. Some hallucinations at least might well be machine-to-machine communications that we, as humans, do not understand very well! By this time, we were in the midst of discussion on augmenting our knowledge transfer communication processes, not by a new style of publishing, but by what the FAIR team termed “nanopublishing”. Isolating claims and assertions, and enabling them to be uniquely identified and coded as triples, offered huge advantages. These did not end with the ability of knowledge graphs to collate and compare claims. This form of communication had built-in indicators of provenance which could be readily machine assessed. And there was the potential to add indicators which could be used by researchers to demonstrate their confidence in individual findings. The room was plainly fascinated by the way in which the early work of Tobias Kuhn and his colleagues had been developed by Erik Schultes, who effectively outlined it here, and the GO FAIR team. Some of us even speculated that we were looking at the future of peer review!
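For readers who want to picture what a claim coded as a triple with provenance might look like, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the nanopublication specification or any GO FAIR tooling: the `NanoPub` class, its field names, the `ex:` identifiers used in the usage example and the two helper functions are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# A claim expressed as (subject, predicate, object).
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class NanoPub:
    """Hypothetical stand-in for a nanopublication (illustration only)."""
    uri: str           # unique identifier for this single claim
    assertion: Triple  # the claim itself
    derived_from: str  # provenance: the source the claim came from
    author: str        # who asserts it
    confidence: float  # author's stated confidence, 0.0 to 1.0 (assumed scale)


def collate(pubs: Iterable[NanoPub]) -> Dict[Tuple[str, str], Dict[str, List[str]]]:
    """Group claims by (subject, predicate), then by object.

    The innermost list holds the URIs of the nanopublications making that claim.
    """
    grouped: Dict[Tuple[str, str], Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for pub in pubs:
        subject, predicate, obj = pub.assertion
        grouped[(subject, predicate)][obj].append(pub.uri)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {key: dict(objects) for key, objects in grouped.items()}


def divergent_claims(pubs: Iterable[NanoPub]) -> List[Tuple[str, str]]:
    """Return the (subject, predicate) pairs asserted with more than one object."""
    return [key for key, objects in collate(pubs).items() if len(objects) > 1]
```

A short usage example, with made-up identifiers:

```python
pubs = [
    NanoPub("ex:np1", ("ex:geneA", "ex:associatedWith", "ex:diseaseX"), "ex:paper1", "ex:alice", 0.9),
    NanoPub("ex:np2", ("ex:geneA", "ex:associatedWith", "ex:diseaseX"), "ex:paper2", "ex:bob", 0.6),
    NanoPub("ex:np3", ("ex:geneA", "ex:associatedWith", "ex:diseaseY"), "ex:paper3", "ex:carol", 0.4),
]
print(collate(pubs))
print(divergent_claims(pubs))
```

Because each claim carries its own identifier, source and confidence, a machine can compare claims and trace each one back without reading the surrounding narrative. Note that two different objects for the same subject and predicate need not be a contradiction; the helper only flags where claims diverge.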
Despite the speculative temptations, the thinking in the room remained very practical. How did you ensure that machine interoperability was built in from the beginning of communication processing? FAIR were experimenting with Editorial Manager, seeking to implant nanopublishing within existing manuscript processing workflows. Others believed we needed to go down another layer. Persuade SAP to incorporate it (not huge optimism there)? Incorporate it into the electronic lab notebook? FAIR should not be an overt process, but a protocol as embedded and unconsidered and invisible as TCP/IP. The debate in the room about how best to embed change was intense, although agreement on the necessity of doing so was unanimous.
The last section of the day looked closely at the value, and the ROI, of FAIR. Martin Romacker (Roche) and Jane Lomax (SciBite) clearly had little difficulty here, pointing to benefits in cost and time, as well as in a wide range of other factors. In a world where the meaning as well as the acceptance of scientific findings can change over time, certainty in terms of identity, provenance, versioning and relationships became foundation requirements for everyone working with DATA in science. Calling machines intelligent and then not talking to them intelligently in a language that they could understand was not acceptable, and the resolve in the room at the end of the day was palpable. If the AI era is to deliver its benefits, then improving the human-to-machine interface in order to enable the machine-to-machine interface was the vital priority. And did we resolve the AI regulatory issues as well? Perhaps not: maybe we need another forum to do that!
The forum benefited hugely from the quality of its leadership, provided by Tracey Armstrong (CCC) and Barend Mons (GO FAIR). Apart from the speakers mentioned above, valuable contributions were made by Babis Marmanis (CCC), Lars Jensen (NNF Center for Protein Research) and Lauren Tulloch (CCC).