In the history of software, as I have suffered it over the past 45 years, the most time-wasting difficulty has been the false dawn syndrome. My first CTO, Norman Nunn Price, was a grizzled Welshman with an unquenchable enthusiasm for the ability of software to solve all problems. As a young man, he had worked on radar in submarines in the Second World War. When, as his more youthful CEO, I sometimes questioned his predictions, the reply often included “look, we won the bloody war using this stuff, didn’t we?”. But as the years passed by in our development of a start-up in legal information retrieval, we began to notice that when Norman and his team announced that the job was done, or the fix was in place, or the application was ready and the assignment completed, we were actually at the beginning of another work phase, and not at a point of implementation. Once, in frustration, I pointed out forcibly to Norman that, despite his optimistic announcement that he had once more brought us successfully to a moon landing, it appeared that I was still 50 feet above the surface with no available mechanism to get me down there. It became a company saying.

I find myself using it regularly as I listen to the way in which data and analytics companies are learning to live with AI. I cannot fault the ambition. It is clear that many service providers are framing solutions that promise dramatic advances in value across the widest possible range of societal requirements. But once the service design and the value added have been determined, we come to that familiar place which the software engineers describe as ETL: the whole business of extracting, transforming and loading data. It is here that we discover that our data is not like other data. It is too structured, or not structured at all. It has been marked up in a way that makes it difficult to transform, or it has not been marked up at all, which makes it difficult to transform. It either lacks the metadata to guide the process, or has too much metadata, or nobody can understand and use the metadata. So we must pause and create a solution.
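
To make the pain concrete, here is a minimal sketch of the “T” in ETL. Every field name and record shape below is invented for illustration, and real pipelines are, of course, far messier:

```python
# A minimal sketch of the transform step in ETL, assuming two hypothetical
# sources: one richly marked up with metadata, one barely marked up at all.
# All field names here are invented for illustration.

def transform(record: dict) -> dict:
    """Map a source record, whatever its shape, onto one target schema."""
    if "metadata" in record:                      # richly marked-up source
        meta = record["metadata"]
        return {
            "title": meta.get("dc:title", ""),
            "body": record.get("text", ""),
            "source_id": meta.get("identifier", "unknown"),
        }
    # flat, barely marked-up source: guess the fields positionally
    return {
        "title": record.get("col1", ""),
        "body": record.get("col2", ""),
        "source_id": "unknown",                   # no provenance to carry over
    }

rows = [
    {"metadata": {"dc:title": "Case 42", "identifier": "doc-42"}, "text": "..."},
    {"col1": "Case 43", "col2": "..."},
]
print([transform(r) for r in rows])
```

Even in this toy case, notice how the poorly marked-up record loses its provenance on the way through. Multiply that by millions of records and the pause for thought becomes a project in its own right.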

This is a well-trodden track. Others have gone before us. The problems of integrating data into cloud data services like Databricks and Snowflake have slowed progress and added to costs for the past five years. It is interesting to see that a small industry has grown up to ease the problem, with companies like prophecy.com emerging with effective solutions. One might imagine that the same will happen with AI. Data transformation will cease in time to be an issue, since a raft of services will have emerged to deal with common problems, and the data creators will have reacted and adapted to the issues that arise when data is ingested into AI environments of all sorts.

But of course, this will not stop the press releases, which will continue to claim that something has happened some time before it might possibly happen. Yet it should moderate our expectations a little. Many feel that we have not yet hit the problems of getting first-generation AI services fully operational, even if we are talking as if we were rolling out second-generation services, tried and tested by legions of users. Fifty feet above the moon can be a good place to be if it provides an opportunity to pause for thought, and realign our thinking before we make the slow eventual descent to the lunar surface.


When I was forced to temporarily cease blogging a few years ago (see personal note below), AI was a fact of life. Every year we saw improvements in the use of increasingly sophisticated algorithms. We noted the rise and rise of robotic process automation. Those of us with two decades of industrial memories recalled expert systems and neural networks. Those of us with four decades remembered hearing Marvin Minsky at MIT, telling us that he wanted the books in our libraries to speak to each other, to exchange and update knowledge and to build new knowledge out of that exchange. Yet nothing here prepared us for 2023.

When historians get last year into some perspective, they will probably conclude that what happened owed as much to the content creation requirements of online advertising, or to the financial services requirement for a new wave of Silicon Valley investment frenzy, as it did to a breakthrough in AI capabilities. Yet what actually happened last year, even without such a perspective, is truly amazing. The year installed AI as a key component of any strategic planning exercise in almost any commercial activity. Hyper-investment, and the hyperactivity resulting from it, produced tools in generative AI which, a mere year later, had become immensely more powerful. Compare GPT-3 to the current iteration of Gemini: a context window of some 2,000 tokens against one of a million. Then look at the public recognition factor, and you find a world in which there is now a normal expectation that machine intelligence and machine interoperability will be a part of everyone’s everyday life. It is as if a switch has been flicked on, illuminating a new room into which we have walked for the first time. We all of us know that we can never now go back through that door or switch off that light. Pandora’s Law.

And we should not want to go back, either. What has happened should simply remind us that change does not happen evenly, and that the realisation of change sometimes takes longer than we anticipate. But in 2023 I detected something else as well: a fear of change that was a little beyond normal anxiety. In the world in which I have worked for over 50 years, the idea that content creation through the exercise of machine intelligence could be more threatening than beneficial gained a powerful currency, and soon turned into dystopian editorials in both trade and consumer media. As a result we have come out of 2023, the year of AI megahype, with both an enhanced view of the speed and power with which machine intelligence will help, support and change our society, and a hysterical fear of evils unknown which may result from quantum computers secretly plotting our downfall on the network. Since the invention of the wheel mankind has been learning to accommodate and live with the machine, and we shall surely do so in the world of AI as well.

Yet, in the clan to which I belong, the data, services and solutions vendors who called themselves content companies and information providers a few years back (and before that used to describe themselves in Gutenberg terms as publishers), there has been fear of a different sort. Whether it meant anything or not, they have always embraced the consolation of copyright, the belief that intellectual property can be described and identified and protected, as one of the bulwarks of their commercial viability. The idea that individual creativity could be mirrored by machine intelligence, or that the machine might regurgitate, in whole or in part, content acquired as training data, or that the value of content or data once described as “proprietary” could be lost in the machine intelligence age: these ideas are the very stuff of panic. Then add to them the knowledge that machine intelligence can produce “hallucinations”, that its answers may not always be accurate and correct, and that the long-held belief that machines loaded with garbage do indeed produce rubbish still holds, and we find integrity fears added alongside fear of theft and diminishing valuations.

One of my mentors of many years ago, recommending me to a potential client, commented that “while generally sound on strategy he can be unreliable on copyright”. I have over the years tried to be better behaved, but it is difficult, because it takes so long to bring the heavy guns of copyright law to bear on problems that have usually departed long before adequate legislation is available to control them. Early regulation of AI, like the EU AI Act, seems, in any case, more bent on risk control than anything else.

While the copyright lawyers are anxiously seeking re-regulation for a machine age, I for one would take the arguments much more seriously if copyright holders paid real attention to marking their works with appropriate metadata and PIDs that indicate ownership and provenance. It is hard to imagine machine-interoperable checking of the copyright status of works if those same works are not identified in ways that machines can recognise and understand. Then it becomes more possible to put pressure on AI developers to ensure that they license the genuine article, recognise the credentials of the real thing publicly, and increase the integrity of their solutions by showing users that only the real thing was used in the construction of the outcomes desired. This is beginning to happen in some encouraging ways: the fact that both Google and OpenAI now accept C2PA, the provenance coding system developed for images and video, shows what can be done by persuading people that being licit and responsible is good for business. Rather than have “fake” hung round their necks, it is better to say that you will check and code every image that you use, especially in an American election year. In text and data there are similar emerging conventions. The ISCC (International Standard Content Code) is now a draft ISO standard. The long-established FAIR principles, promoted by the GO FAIR Foundation, create metadata standards that render data “findable, accessible, interoperable and reusable”. Data and content owners who make it clear to interested parties and machines what the scope and ownership of their asset entails have a much better chance of working successfully with it in this new world. And in particular, they have a better chance of entering into proper and satisfactory licensing agreements around it. If we are able to persuade the machine intelligence world that integrity is vital to business success, then we have a far better chance of creating the sort of licensing environments that pioneers like the Copyright Clearance Center have advocated and piloted for years. Businesses in the network have to make for themselves the business conditions that work in the network.
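
What that marking might look like in practice: a minimal sketch, assuming a hypothetical sidecar-JSON convention rather than the real C2PA or ISCC schemas; the identifier and licence URL below are placeholders, not genuine codes or terms:

```python
# A minimal sketch of machine-readable ownership and provenance, assuming a
# hypothetical sidecar-JSON convention; field names are illustrative only.
import json

work_metadata = {
    "identifier": "ISCC:PLACEHOLDER",            # stand-in, not a genuine code
    "rights_holder": "Example Publishing Ltd",   # hypothetical owner
    "licence": "https://example.org/licences/tdm-2024",  # hypothetical terms
    "provenance": "first published 2024-01-15",
}

def may_ingest(meta: dict) -> bool:
    """Crude machine check: refuse works whose ownership is not declared."""
    return bool(meta.get("identifier")) and bool(meta.get("rights_holder"))

# the sidecar file travels alongside the work itself
with open("work.json", "w") as f:
    json.dump(work_metadata, f, indent=2)

print(may_ingest(work_metadata))                 # True: ownership is declared
```

The point is not this particular format but the habit: if ownership travels with the work in a machine-readable form, a crawler can be asked, and expected, to check before it ingests.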

So who will police and patrol all of this until law and regulation finally catch up, if they ever do? The publisher and copyright lawyer Charles Clark, my fellow delegate to the European Commission Legal Information Observatory, invented the maxim “the answer to the machine lies in the machine”. It was never better applied than at this point. If you want to find bias in machine intelligence, the simplest way to do so is programmatically. If you wish to know whether training data has been derived from legitimate, known sources that will vouch for accuracy and currency, ask the machine to interrogate the machine. For the AI companies, the price of reputation may be breaking open the black box and demonstrating good practice in creating answers from the very best inputs.
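
In that spirit, a minimal sketch of what asking the machine to interrogate the machine might mean for training data. The registry and corpus records here are hypothetical, and a real audit would query a rights database rather than a hard-coded set:

```python
# A toy provenance audit: which training records come from sources we hold
# licences for? Identifiers and records below are hypothetical.
LICENSED_SOURCES = {"doc-42", "doc-43"}          # licences actually held

corpus = [
    {"source_id": "doc-42", "text": "..."},      # licensed, provenance declared
    {"source_id": "unknown", "text": "..."},     # no provenance at all
]

unvouched = [r for r in corpus if r["source_id"] not in LICENSED_SOURCES]
print(f"{len(unvouched)} of {len(corpus)} training records cannot be vouched for")
```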

PERSONAL NOTE: I maintained this blog continuously from 2009 to 2021. Then I suffered eyesight problems which have left me with some 40% of my vision. My road back to this form of communication has taken three years, during which I’ve had the huge pleasure of writing two books, drafting a third and eventually returning to blogging. Writing in the world of text-to-speech and speech-to-text software is different. As I say at the end of all of my communications at work: “If you find errors of syntax, grammar or spelling in what I’ve written, please remember that it is much harder for me to edit than ever before, so try to smile indulgently. On the other hand, if you think that I have written utter gibberish, please contact me immediately!”

