The Evaluation Task Force has published guidance for departments setting out best practice for evaluating the impact of artificial-intelligence tools and technologies.
The unit, which is a joint Cabinet Office and Treasury initiative created to help departments make sure that spending decisions are backed by the best evidence, made the document available as an annex to the Treasury's existing Magenta Book.
It aims to "enhance the safety and confidence with which government departments and agencies can adopt AI technologies, ensuring that public sector innovation keeps pace with the private sector".
The Magenta Book is written for the policy, delivery and analytical professions, with the aim of ensuring that good evidence is built into the design and delivery of projects from the earliest stages, maximising understanding of the impacts that new developments can have.
However, the new 54-page annex, Guidance on the Impact Evaluation of AI Interventions, acknowledges that departments' use of AI poses "unique challenges" that require "tailored approaches".
Last week, technology secretary Peter Kyle set out plans for a step change in the public sector's use of technology that targets £45bn in productivity improvements. He also previewed the civil service's new suite of AI tools for officials – dubbed "Humphrey" in honour of Yes, Minister character Sir Humphrey Appleby.
The new Evaluation Task Force guidance, which was co-created by the Department for Transport and consultancy Frontier Economics, says many AI projects are likely to require "more substantial" evaluation than other interventions because of their risk and uncertainty.
It cites the "untested nature" of AI interventions, "possible negative consequences" and "high learning potential" among the reasons they meet the Magenta Book's criteria for priority interventions.
"This means many AI interventions require more substantial evaluation than similarly sized business-as-usual interventions," the guidance says.
Among the "negative consequences" referenced by the guidance is the potential for AI systems to make predictions or recommendations that are discriminatory towards certain groups because the sytems have been trained on data that "reflects existing biases" in society.
"These factors mean it may be more challenging to identify all risks and potential unintended effects when developing the theory of change," it says.
The guidance supports the use of randomised controlled trials for testing new AI products to produce high-quality evidence on the intended and unintended impacts of introducing the technologies.
But it acknowledges that the "fast-moving nature" of AI interventions represents a challenge to implementing robust approaches in practice.
"These challenges can be overcome by building the evaluation into the intervention’s design, aligning the evaluation’s delivery with the delivery of the intervention and building flexibility into the approach," the guidance says.
It concludes by urging officials to consider impact evaluation as early as possible and to develop a comprehensive theory of change in partnership with key stakeholders.
"When selecting evaluation approaches, consider experimental methods for quantitative impacts, exploring feasible options and alternative designs, like quasi-experimental methods," the guidance says.
"Theory-based approaches are useful for complex systems and complement experimental or quasi-experimental methods. AI interventions require iterative evaluation throughout the development, deployment and evolution stages, using rapid methods for immediate effects and comprehensive evaluations for long-term outcomes."
It adds: "Evaluation plans should be flexible to accommodate changes and consider variations in impact across different groups. Public attitudes and perceptions should be factored into the evaluation, and establishing a clear baseline early on is crucial for effective assessment."
The guidance also incorporates four "hypothetical" case studies of evaluations for AI projects. They include a system to help assess applications for grant funding; a large language model-based application to help civil servants analyse large quantities of information; and the use of a chatbot to provide user support to members of the public.