Nov 19
KEYNOTE: AFTER CONTENT: THE EMERGING WORLD OF INFORMATION AND INTELLIGENCE
David R Worlock, Chief Research Fellow, Outsell Inc.
Although we are wearing life-jackets, we struggle in the water. The turbulence surrounding climate change and Covid-19 is so great that we are tossed in the wake of these vessels. In just 24 months, our ideas about the Future, the sunlit uplands of our visions of technology-enhanced work and leisure, the improvement of the human condition, the notions of incremental progress and exponential growth, have been shaken. Suddenly the Future is something endangered, something to be preserved, something to be secured – and sometimes something to be feared. This moment calls for courage and decisiveness. We have to clarify our objectives and build towards our desired outcomes regardless of tradition or orthodoxy. It is too late to say that the water is rising: we are in the water already and floundering. What we did before as researchers, librarians, publishers and intermediaries is less than relevant to what we do next. What we do next will include things we have never tried before, so we will have to learn quickly and move flexibly.
Metaphors can only be stretched so far. In reality a good taste of the Future is already with us, though, as William Gibson so accurately forecast, it is not evenly distributed. While preparing for this keynote through the autumn, I learnt at the CERN/University of Geneva discussions on innovation in scholarly communications that some scholars already envisage publishing on low-cost open platforms managed and run by researchers and their institutions. Yet at the Frankfurt Book Fair, it was easy to sink back into the atmosphere of a scholarly journals world, post-print, but still adhering to the practices and principles of Henry Oldenburg and the Philosophical Transactions of the 1660s. Yet everyone was aware that something had happened. 450,000 Covid and Covid-related articles had been published in the previous 24 months. Everyone had seen submission growth during lockdown. Everyone paid lip service to the idea that something impressively vague – usually “AI” – would get us out of the hole. Everyone, as always in the fragmented workflow model of scholarly communications, wanted only to concentrate narrowly on their own piece of the action, regardless of what was happening elsewhere.
If any of the participants were able to take a holistic view of what is happening to researchers, then I think that these conclusions, amongst others, would offer themselves:
- KNOWLEDGE IS TOO BIG TO BE CONTAINED. It has been for many years. And it is certainly too big for convenient interrogation and access if we pretend that we are still using physical formats in digital form. And what about the evidential data? In some disciplines the urgency of searching everything at once meets the wall of paywalled science. Researchers looking to test existing methodologies or find new ones want to find the description, not the article. Machine intelligence can only address the issues if the data is organised and accessible.
- ARTICLES ARE NOT FOR READING. It is now hard to ask questions like “Are you up to date?” Without machine intelligence few people in research can be current in a traditional sense. The accessibility of global knowledge, and the huge increases in knowledge-discovery output in China and to a lesser extent India, made this difficult a decade ago. Now, without intelligent alerting and the increasing use of intelligent machine summarisation, research roles would be submerged by the struggle to keep abreast.
- RESEARCHERS DO NOT WRITE ARTICLES. Indeed, in one sense they never did. In some disciplines articles reporting research results have always had pre-formatted sections, and compiling literature reviews or citations has often been semi-automated. Other sections are often drawn from grant applications, or compiled by machine intelligence from results held in a laboratory log or a Jupyter notebook. The data-narrative intelligence that now writes so many of our sports reports and business news analyses can equally well support the workbench productivity of researchers.
These three facets of our future, all available now within plain sight, argue a certain view of change. Yet the change will not be dictated by publishers or indeed librarians. Their roles will develop and alter as a result of the decisions made in the research community about the future. Indeed, change has already taken place as a result of Open Access. Recent reports indicate that some 33% of articles are now published this way, and the STM report forecasts that in research-intensive countries – the UK specifically – that figure will reach 90% by the end of 2022. Many, including myself, think that the average may be higher globally, and close to 50%. OA has taken 20 years to arrive, but it has come because of increasing researcher and research institution approval. Yet for many researchers, asking questions about the way science was communicated was the smallest issue at hand, even if it was the most easily addressed. OA was simply the preface to the book entitled “Open Science”, in which scientists question and debate every aspect of the process of research and discovery. We should be very glad of this. If science is indeed our only hope of rescue from these storm-tossed waters in which we bob helplessly, then the very least that we want is for it to be accurate, ethical, constructively competitive where that helps, completely collaborative where that helps, and squarely based on evidence available to all. That evidence will be largely data. As Barend Mons, Professor of BioSemantics at Leiden and Director of GO FAIR, says, “Data is more important than articles”. And this is where the future begins.
“As most article writing is increasingly done using machine intelligence … the article can be fragmented and each element published when ready.”
The picture painted so far shows machine intelligence intervening to ameliorate human issues with handling content. The future is about building the structures which will accomplish that. As so often, it is not about inventing something wholly new to do the job. Artificial intelligence has been with us in principle for 60 years, and from the expert systems and neural networks of the past 20 years we can draw a mass of practical experience. This is not to say that there are no problems: what we can do depends critically on the quality of our inputs. In many sectors of working life, data bias remains a real problem, and algorithms can as a result inaccurately represent the intentions of their creators. The positive fact is that we can plan to use a whole range of AI-based tools to address our issues. Deep learning, machine learning, the now widespread use of NLP, the increasing effectiveness of semantic concept searching and comparison, as well as other forms of intelligent analytics, have all been deployed effectively over the past five years, and intensively in Covid research. And yet we have not entered the Age of Data in scholarly communications, despite the daily practices of many researchers.
“Our sense of priorities is upside down. Data is always more important than articles, and will continue to be in the age of machine intelligence.”
We cannot seem to break away from the notion that communication is narrative. The journal article, a report on an experiment or a research programme, is in itself a narrative. It is a story told by humans to transfer knowledge and information. This means nothing to machine intelligence. The metadata that guides machine-to-human communication will be far less effective in promoting machine-to-machine interoperability. If we want to use this interoperability to, for example, rerun an experiment in simulation or in reality, or find every place where similar results have been recorded using similar methodologies but described in different words or languages, then we need an augmented set of signposts to shorthand the way that two machines speak to each other. And we need protocols and permissions that license two machines to negotiate data across the fragmented universe of science data, and across its innumerable paywalls.
“Considering that most of the readers of scholarly articles are now machines, we should prepare those articles so that machines can interact with them even more effectively.”
FAIR and GO FAIR have made great strides in making this new world possible. There is a role for publishers and librarians in helping to ensure that data is linked to articles, saved in a safe repository and fully accessible, with efforts being made to develop the business models that improve metadata and thus machine-to-machine exchange. There is an even bigger role in ensuring that all parts of the article are fully discoverable as separate research elements, through added metadata to support the full interactivity of machine intelligence. It is predictable that in time most readership of articles will be by machine intelligence, and that much of what researchers know about an article will come from the short synopsis or the infograms provided by alerting services or impact and dissemination players, who will have an important role in signposting change and adding metadata (Cactus Global’s R Discovery and Impact Science, or Grow Kudos, are good examples). Researchers will predictably become adept users of annotation systems (hypothes.is), writing their thoughts directly onto the data and content-as-data to create collaborative routes to discovery. Wider fields of data will become more routinely available, as DeepSearch9 have demonstrated with the deep web and with medical drug trials. Some researchers will desert long-form article writing altogether, preferring to attach results summaries directly onto the data and distinguish them with a DOI, as they have done for many years in cell signalling, and as members of the Alliance of Genome Resources do in their Model Organism databases. Here again DOIs and metadata connect short reports (MicroPublication) to the data. And, if we follow the excellent development work of Tobias Kuhn, we shall be publishing explicitly for machine understanding (“nanopublication”).
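By way of illustration only, here is a minimal sketch, in the spirit of the micropublication and nanopublication approaches mentioned above, of how a short results report might be expressed as a machine-readable record rather than as narrative prose. Every identifier and field name here is a hypothetical example, not a real scheme.

```python
# Illustrative sketch only: a minimal, machine-readable "research object"
# record of the kind described above -- a short result statement carrying
# its own DOI, linked to its evidential dataset and parent article by
# identifier rather than by narrative prose.
# All identifiers and field names are hypothetical examples.
import json

micro_report = {
    "@type": "ResearchObject",
    "doi": "10.1234/example.micro.2021.001",          # hypothetical DOI
    "objectType": "results-summary",
    "assertion": "Gene X knockdown reduces signalling activity in pathway Y",
    "linkedDataset": "10.5061/dryad.example123",      # hypothetical dataset DOI
    "partOfArticle": "10.1234/example.article.2021.17",
    "methodsRef": "10.1234/example.protocol.2020.42",
    "license": "CC-BY-4.0",
}

# Serialised as JSON, the record can be exchanged, indexed and compared
# machine-to-machine without any need to parse an article's narrative text.
print(json.dumps(micro_report, indent=2))
```

The point of the sketch is simply that every element of the report is addressable by a machine: nothing has to be inferred from prose.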
Of course, we are still a long way away from the prevalence of this very different world of scholarly communications, at least for the generality of researchers. And if this is really the way we are going, then we should expect to see some stress points on the way, some indicators that the main structures of the content world of article publishing are beginning to bend and buckle. We should also expect to see the main concerns of Open Science beginning to have an impact. Every observer of these developments will have their own litmus list of indicative changes: here are mine:
1. Article Fragmentation. Over the last thirty years I have several times acted as a judge in contests to create The Article of the Future. Some of these, notably the one created by Elsevier, showed huge technical ability. We are now used to the article that contains video, oral interviews and statements, embedded code, graphs that can be re-run with changed axes, and, in healthcare (OpenClinical), embedded mandates that can be carried over into clinical systems. Some of these artefacts need to be searched in a data-driven environment if we are to find, for example, exactly the moment in a video where certain statements were made. Articles stored in traditional databases are not normally susceptible to this type of enquiry. I expect to see articles appearing in parts across time, linked to the early-stage research activity (morressier.com and Cassyni) which makes seminars, conference speeches and other material created prior to the completion of a research project available and accessible as indicators of early-stage research results.
The influence of Open Science on the redevelopment of the article will be acute. Pre-registration, the process by which research teams publish their hypothesis and methodology at the very beginning of the research process, is designed to prevent any subtle recalibration of expectations to fit results in the process of formulating the published report. PLoS has implemented a service that trials this idea. At the same time Open Science demands that the searchable record should give much better coverage of successful reproduction of previously published findings, as well as coverage of failed experimentation and of failure to reproduce previous results. All of this has obvious value in the scientific argument: little of it is in tune with the practices of most journal publishers. I expect to see journal publishing becoming much more like an academic notice board, with linked DOIs and metadata helping researchers to navigate the inception-to-maturity track of a research programme, as well as all of the third-party commentary associated with it.
2. Preprints and Peer Review. Critics of what is happening currently, as scholarly communications gradually eases itself into a born-digital framework for the first time, point to the over-production of research, and in particular to the rise of the preprint, as proof of too much uncategorised, lightly peer-reviewed material in the system of scholarly communication. There are always voices that want to go back the way we came. Others point out that if we can successfully search the deep web – 90% greater than Google – then searching a few preprint servers should not be too much of a challenge, especially if we get DOIs and metadata right first time. And in thinking about this we should factor in the idea that developing the sophistication of our identifiers, increasing the range and quality of metadata applied throughout the workflow of scholarly communication, and extending the reach of semantic enquiry remain bedrock needs if scholarly communication is going to function, let alone become more effective. By the time that these processes reach maturity, we will have long ceased to refer to any of this material as “articles”. We will simply refer to “research objects”, in a context where such an object might be a methodology, a peer review, a literature review, a conference speech, a hypothesis, an evidential dataset or any other discrete item. Progress in this direction will be the way in which we measure the real “digital transformation” of scholarly communications.
3. When Do We Do Peer Review? In 2021, two of the physicists who won the Nobel Prize, both over 80, were distinguished for work accomplished in the 1970s and 1980s. Open Science points out that our current peer review system does not account for changes in the appreciation of scholarly results over time. In addition, the current system can shelter orthodoxy from criticism and, in the narrow confines of a small sub-discipline, is open to being ‘gamed’, if not corrupted. Many subscription publishers cling to peer review, along with the VOR (Version of Record), like a comfort blanket, sensing that this may give them the ‘stickiness’ to remain important in an age of rapid change. It helps that for many publishers peer review is something they organise but do not pay for, leaving an uneasy feeling that it may not survive a reluctance amongst researchers to volunteer (a shortage is being felt in some disciplines already) where neither pay nor public recognition is available.
Two factors complicate this issue. One is timing in the publishing process. Do we really need an intensive review at this point? Funders have reviewed the research programme and the appointed team, and will be able to do due diligence against those expectations. The availability of much more information around reproducibility, or the lack of it, amongst the flow of research objects is important here, but takes time post-publication. The ability of critics and supporters to add commentary within this workflow will become important, providing the critical input of named individuals who are prepared to stand behind their views. The introduction of scoring systems that are able to assess the regard in which a body of work is held, and to index changes to that over time, will be a critical developmental need. And then the second factor contributes: AI-based analysis has already proved successful in reducing the element of checking and verifying which is part of each peer review. The UNSILO engine, part of Cactus Global, executes some 25 checks and is widely used to reduce time and effort in publication workflows. As work like this becomes even more sophisticated and intelligent, it will not simply improve the quality of research objects, but will create its own evidential audit trail, reassuring researchers that key elements have been checked and verified.
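To illustrate the general shape of such automated checking, here is a minimal sketch of a chain of pre-review checks producing an audit trail. It is not a description of the UNSILO product or any other real tool; every check, field name and threshold is an invented example.

```python
# A minimal sketch, assuming an invented manuscript structure, of how
# automated pre-review checks can be chained: each check inspects the
# submission and records a finding, building an audit trail for editors
# and reviewers. The checks themselves are deliberately simplistic.

def check_data_availability(manuscript: dict) -> dict:
    present = "data availability" in manuscript["text"].lower()
    return {"check": "data_availability_statement", "passed": present}

def check_ethics_statement(manuscript: dict) -> dict:
    present = "ethics" in manuscript["text"].lower()
    return {"check": "ethics_statement", "passed": present}

def check_reference_count(manuscript: dict, minimum: int = 10) -> dict:
    return {"check": "reference_count",
            "passed": len(manuscript["references"]) >= minimum}

def run_checks(manuscript: dict) -> list[dict]:
    # The audit trail: every automated verification is recorded alongside
    # the manuscript, so later readers can see what was checked.
    return [
        check_data_availability(manuscript),
        check_ethics_statement(manuscript),
        check_reference_count(manuscript),
    ]

if __name__ == "__main__":
    example = {"text": "Methods... Data availability: deposited in a repository. Ethics approval obtained.",
               "references": ["ref"] * 12}
    for finding in run_checks(example):
        print(finding)
```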
4. Open Access/Open Platform. The rush to embrace change is so prominent in certain parts of our society that we tend to turn the changed element into the New Orthodoxy well before its maturity date. This is certainly the case with Open Access, when perhaps the question we should be asking is “How long will Open Access survive?” OA is a volume-based business model. This is important to recall when there is pressure for APCs to fall, and when Diamond OA becomes a topic of real interest and concern. Diamond OA often relies on the voluntary efforts of researchers and small scholarly societies, and these efforts can prove to be sporadic. Predictably, Open Access will lead to an even greater concentration of resources in a very fragmented industry. While Springer Nature and Elsevier are described as behemoths within scholarly communications, they are far from the size of major media, information or technology players. OA will drive more Hindawis into more Wileys.
Alongside this we must note changes in publishing workflows. As APCs stabilise and tend to decrease, margins will be maintained by the increasing application of RPA, Robotic Process Automation. The technology that today can write a legal contract proves equally adept at reading and correcting a proof, resolving issues in a literature search or creating a citation listing. Yet publishers who today look at process cost reduction as a way of staying in business must also factor in the elimination of barriers to entry that this involves. We shall reach a point where mass self-publishing of research objects, whether still in articles or not, becomes very feasible. The successors of the Writefulls, the Grammarlys and the WeAreFutureProofs of today will become the desktop tools of the working researcher. And then the F1000s of today, or their ORC derivatives, or the Octopus project recently funded by UKRI, will assume the status of Open Platforms, the on-ramps that move articles, and then research objects, into the bitstream of scholarly discussion and evaluation. This too will give an opportunity to address the most glaring omission in today’s scheme of process: the lack of a cohesive dissemination element. The irony here is that, for many participants, getting published means ‘everyone knows about it’. Clearly they do not. Some publishers offer large volumes of searchable content behind paywalls, and the whole sector talks learnedly about “discoverability”. Why, in the age of knowledge graphs and low/no-cost communications, a publisher would not feel able to alert every researcher in a given sector to the appearance of fresh material linked to their research interests is a mystery. The gap has been partially filled by social media players like ResearchGate, but as long as social media remains advertising-based some researchers will reject this. Players like Research Square, Cactus Global’s R Discovery and Impact Science, and Grow Kudos all address these issues in various ways, but gaining impact from meaningful dissemination remains a blind spot for many publishers.
5. Metrics. It is obvious enough that new systems of metrics will grow out of the evaluated bitstream of scholarly communication. While citation indexation fades for some, it does not go away. Using altmetrics to create new measures, as Dimensions does, provides a welcome variation, but is still far from being a standard. If it looked at one point as if Clarivate was going to revive ISI to recreate the Impact Factor, it has also looked in recent years as if Open Science advocates have set their faces against the Impact Factor as an index that can be so easily and obviously gamed. There is then a vacuum at the heart of the digital transformation of scholarly communications: we still do not know how to rank and index contributions so that searchers can see at a glance how colleagues rate and value each other’s work. When we do – and I have jumped the gun by naming it “Triple I” already, for the Index of Intervention and Influence – it will capture and evaluate every network participation, from grant application to pre-registration of intent, to early-stage poster, seminar and conference contributions, to blogs and reviews, and on to the researcher’s own results and their reception and evaluation. Here at last the distortion of the pressure to “publish or perish” will be laid to rest.
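Purely to illustrate the shape such a measure might take, here is a speculative sketch of a “Triple I” calculation as a weighted aggregate over contribution events. The event types, weights and reception scores are invented for the example; no such index exists yet.

```python
# A speculative sketch only: one way an "Index of Intervention and Influence"
# of the kind imagined above could be computed -- a weighted aggregate over
# every recorded contribution event in a researcher's network activity.
# The event types, weights and scoring rule are entirely hypothetical.

TRIPLE_I_WEIGHTS = {
    "grant_application": 1.0,
    "pre_registration": 1.5,
    "poster_or_seminar": 0.5,
    "conference_talk": 1.0,
    "peer_review": 2.0,
    "blog_or_commentary": 0.5,
    "published_result": 3.0,
    "positive_citation_or_reuse": 2.5,
}

def triple_i(events: list[dict]) -> float:
    """Sum weighted contribution events, each scaled by community reception (0-1)."""
    return sum(
        TRIPLE_I_WEIGHTS.get(e["type"], 0.0) * e.get("reception", 1.0)
        for e in events
    )

if __name__ == "__main__":
    sample_events = [
        {"type": "pre_registration", "reception": 1.0},
        {"type": "conference_talk", "reception": 0.8},
        {"type": "published_result", "reception": 0.9},
        {"type": "positive_citation_or_reuse", "reception": 0.7},
    ]
    print(round(triple_i(sample_events), 2))  # prints 6.75 for this sample
```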
CONCLUSION. I have tried to describe here a world of scholarly communications in motion. We need to watch very carefully in the next few years if we are to validate the direction and judge the speed. As we move into 2022, the way in which so-called “transformative agreements” are renewed or replaced will offer up plenty of clues. We need to validate experimentation in forms of communication, both long and short term. While many publishers assert that authors will not accept that the data underpinning reproducibility should be made available, PLoS have maintained one service in which data is linked by reference to the article after being placed in a safe data repository like Dryad or Figshare. They report no resistance to these requests. Unless we all experiment we will never know.
The approaches made by Open Science as a generalised movement for change and reform will be critical, as will the speed and completeness with which these ideas are accepted and implemented, especially by funders. The issues here will be both big and small. Retractions, the way they are notified and the way in which the discovery of retracted material is flagged to users, constitute a finite area that has required reform for many years. On the other hand, the moves in several countries and many institutions to decouple article and book production from promotion or preferment in academic institutions have wide implications. Remove “publish or perish” and one of the main supports of the publishing status quo goes with it. It will not stop researchers being measured on the quality or impact of their contributions to scholarly communications, but it may well be that those contributions can be just one element of a multi-faceted rating.
Data and AI will continue to be central to the possible directions of change. Just as SciHub challenged the paywalls of the industry half a generation ago, so the announcement of the launch of The General Index in October marks a critical moment for researchers. There are alternative means of knowledge access and evaluation. There is nothing illegal about Carl Malamud’s enterprise, but using text and data mining techniques to create an index of terms and five-word expressions of concepts in 107 million scholarly articles – just the beginning, says the team – and making it free to use and downloadable is a huge achievement. It means that the age of going to the source document, the version of record, recedes even further from the researcher’s priorities, except as a last resort or where the wording is of critical importance. For those who have long held the view that most research in the literature would eventually be done only in the metadata, this is an early dawn.
Some will read what I have said and conclude that this is just another “end of publishing” talk. That would be wrong. I want to reach out to the hundreds and thousands of data scientists, software engineers and architects who have joined what were once traditional publishing houses in the last decade. You have a key role and a huge opportunity as the digital transformation of scholarly communications at last gets underway. The data analytics, the RPA systems, the dissemination environments, the new services summoned up by the Open Science vision – all of these and many more provide opportunities to reboot a service industry and create the support services that researchers need and value.
USEFUL REFERENCES
AI-enabled scholarly workflow tools and other support services:
Scholarcy: scholarcy.com
Scite: https://scite.ai
Cassyni: cassyni.com
UNSILO: https://unsilo.ai/about-us/
Barend Mons, Seven Sins of Open Science (slide set): https://d1rkab7tlqy5f1.cloudfront.net/Library/Themaportalen/Open.tudelft.nl/News%20%26%20Stories/2018/Open%20Science%20symposium/Spreker%204%20open-science-Barend%20Mons_web.pdf
The Eight Pillars of Open Science, UCL London: https://www.ucl.ac.uk/library/research-support/open-science/8-pillars-open-science
Oct 27
The General Index poses a publisher question
Science advances by virtue of standing on the shoulders of giants, but sometimes you need a stepladder. Longtime public access activist Carl Malamud believes he is providing one in his newly launched (7 October) General Index, a way of filleting scientific knowledge and spitting out the essential bones, which may yet rival SciHub, the Kazakhstan-founded pirate site of full-text science articles, as the no-cost way to search scientific literature without paying publishers for the privilege. In a world of pinched science budgets this may be appealing. Even more appealing may be the thought of getting to the essence without full-text searching, and of eliminating false leads and extraneous content.
It used to be a joke that one day the metadata around science research articles would be so good that you could pursue most searches through the metadata without troubling yourself with the text of the article. Indeed, in some fields, like legal information, the full text of cases could be a nuisance, and concordances, citation indexes and other analytical tools could be used to get quickly to the nub of the question. Today these are built into the search mechanism and the search strategy. Mr Malamud has a long history in public and legal information (see Public.Resource.Org, his not-for-profit foundation and publishing platform). At one point he challenged Federal law reporting on cost and campaigned to become US Public Printer. But he is a very serious computer scientist, and his target now is the siloed, paywalled world of non-Open Access science publishing. And the point of attack is both shrewd and powerful.
The weakness of the publishers is that their paywalled information cannot be searched externally in aggregate in a single, comprehensive sweep. Just like SciHub, Mr Malamud enables “global” searching to take place. He has built an index. Currently he covers 107 million articles in all of the major paywalled journals. He has indexed n-grams – single words and words in groups of 2, 3, 4 and 5. He has built metadata, IDs and references to the journals. And, he claims, he has done this without breaching anyone’s copyright. He points out that facts and ideas are not copyright, and that his index entries do not attract copyright since they are too short to be anything other than fair dealing. Publishers will no doubt try to test this legally, probably in the US or UK, since common law jurisdictions look more favourably on economic rights. In the meanwhile it is worth pondering part of his publication statement:
“The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.”
It is very clear from this that science publishing, if it attacks the General Index, is going to do so on very tricky ground. Looking like monopolists is nothing new, but actually persuading researchers that publishers are instrumental in building reputation and career advancement weakens as an argument when the publisher is being pilloried for restricting access to knowledge. Building a new business in data solutions and analytics is a road that several have taken, but only the largest are very far advanced. Might this be a time for the very largest to get together to discuss grouping services for researchers, for free and without anti-trust implications? Old-style subscription journal publishing is getting boxed into a corner, with Open Platform publishing advancing quickly now, with applications like Aperture Neuro (https://www.ohbmbrainmappingblog.com/blog/aperture-neuro-celebrates-one-year-anniversary-with-new-publishing-platform-and-first-published-research-object) and work like the Octopus research project at UKRI that I have mentioned previously.
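Returning to the mechanics of the index itself: to make the n-gram approach described above concrete, here is a minimal sketch, in no way the General Index’s actual pipeline, of how 1- to 5-word n-grams can be extracted from article text and mapped back to article identifiers. The corpus, identifiers and text are invented for the example.

```python
# A minimal sketch, not the General Index itself, of n-gram indexing:
# every run of 1 to 5 consecutive words in an article's text becomes an
# index entry pointing back to the article's identifier.
from collections import defaultdict

def ngrams(words: list[str], max_n: int = 5):
    """Yield every 1- to max_n-word sequence, in order."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map each n-gram to the set of article IDs it appears in."""
    index: dict[str, set[str]] = defaultdict(set)
    for article_id, text in corpus.items():
        words = text.lower().split()
        for gram in ngrams(words):
            index[gram].add(article_id)
    return index

if __name__ == "__main__":
    corpus = {  # hypothetical article identifiers and abstracts
        "10.1234/a1": "gene expression in stressed plant tissue",
        "10.1234/a2": "stressed plant tissue shows altered gene expression",
    }
    index = build_index(corpus)
    print(sorted(index["stressed plant tissue"]))  # ['10.1234/a1', '10.1234/a2']
```

Scaled to 107 million articles, the interesting problems are storage and compression rather than the indexing rule itself, which is why the released files run to terabytes.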
In all of this, data, the vital evidential output from research and experimentation, remains neglected. Finding a business model for making data available, marked up, richly enhanced with metadata and fully machine-to-machine interoperable remains a key challenge for everyone in scholarly communications. Even when Mr Malamud’s 5 terabytes of data (compressed from 38) are installed, the index will only be a front-end steering device to guide researchers more quickly to their core concerns – and those will eventually result in looking at the underlying data rather than the article.
The references below include a Nature article with the only comment from a major publisher that I have seen so far. I wonder if they are talking about it in Frankfurt!
https://www.nature.com/articles/d41586-021-02895-8