The Testing and Evaluation Systems for Trusted Artificial Intelligence Act of 2024 (TEST AI Act) is one of several AI-related bills making its way to the Senate floor. The TEST AI Act establishes testbeds for red-teaming and blue-teaming, which are techniques to identify security weaknesses in technologies. Red-teaming, or the simulation of adversarial attacks, gained attention as a technical solution for AI harms following the 2023 Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence (E.O. 14110). The Biden administration directed federal agencies to develop guidelines and testbeds for red-teaming. The TEST AI Act operationalizes these high-level directives while including the often overlooked blue-teaming research area. Bills like the TEST AI Act that promote trustworthy AI research help lawmakers to create more effective future standards for AI development. Ultimately, the TEST AI Act may lessen the cyber, data, and misuse vulnerabilities of AI systems through improved standards and security tools. 

The TEST AI Act was introduced by a bipartisan group of Senators in April 2024. Senator Ben Lujan (D-NM) is its sponsor, with Senators Richard Durbin (D-IL), John Thune (R-SD), Marsha Blackburn (R-TN), and James Risch (R-ID) joining as co-sponsors. Senator Peter Welch has since joined as a co-sponsor. In the Committee on Commerce, Science, and Transportation, the bill was substituted via amendment to add more detail to its text. After being reported favorably by the Committee, it is now awaiting consideration by the full Senate.


Background

The TEST AI Act instructs the Secretary of the Department of Energy (DOE) and the director of the National Institute of Standards and Technology (NIST) to pilot a 7-year testbed program in consultation with academia, industry, and the interagency committee established by the National Artificial Intelligence Initiative Act of 2020. The program will be housed within the DOE’s National Laboratories, a system of seventeen federally funded, privately managed labs that pursue wide-ranging science and technology goals. 

The goal of the program is to establish testbeds, or platforms that facilitate the evaluation of a technology or tool, for the assessment of government AI systems. The composition of testbeds vary, but can include hardware, software, and networked components. Hardware offers the computing power needed for testing, while software and networked components can simulate an environment or interact with the technology being tested. 

Some of these testbeds will be designed to improve the red-teaming of AI systems. Red-teaming simulates adversarial attacks to assess the system’s flaws and vulnerabilities. It can be performed by groups of humans or AI models trained to perform red-teaming. Early-stage attacks can include model tampering, data poisoning, or exfiltrating models and data. At the user level, a red team might try prompt injection or jailbreaking. 

Similarly, the TEST AI Act will establish testbeds for blue-teaming, which simulate the defense of a system. Like red-teaming, blue-teaming can be performed by human practitioners or AI systems, who together can create an especially potent security force. A blue team may analyze network traffic, user behavior, system logs, and other information flows to respond to attackers.

The proposed testbeds are focused on evaluating AI systems that are currently used or will be used by the federal government. Some testbeds will likely be classified to protect sensitive information associated with government AI systems. However, several agencies also release testbeds to the public and/or private industry. Several can be found on GitHub like the ARMORY Adversarial Robustness Evaluation Test Bed or the National Reactor Innovation Center Virtual Test Bed. Others require credentials or registration to use testbeds actively hosted on federal resources, such as the Argonne Leadership Computing Facility AI Testbed.


Red-teaming

The Biden Executive Order also requires companies to regularly share the results of their foundation models’ red-teaming with the government based on NIST guidance. While NIST has released an initial public draft of its guidelines for Managing Misuse Risk of Dual-Use Foundation Models, the final version mandated under the EO has yet to be released. Similarly, the NSF is funding research to improve red-teaming, but has not yet released findings. In the meantime, E.O. 14110 mandates that companies share the results of any red-teaming they conduct on several critical issues, including biological weapons development, software vulnerabilities, and the possibility of self-replication.

In contrast, blue teaming is not mentioned in E.O. 14110 and is much less discussed in policy and research circles.  For example, Google Scholar returns 4,080 results for “red-teaming AI” and only 140 for “blue-teaming AI”. The TEST AI Act is unique in its inclusion of blue-teaming on its research and policy agenda.

Excitement comes with its own downsides, though. The hype around red-teaming can obscure that actual practices vary widely in effectiveness, actionability, and transparency. A best practice or consistent standard for red-teaming does not exist, so the actual objectives, setting, duration, environment, team composition, access level, and the changes that are made based on the red-teaming results will vary from company to company.  For example, a company may conduct multiple rounds of red-teaming with a diverse group of experts with unfettered model access, clear goals, and unlimited time. Another red-teaming exercise may be time-bound, crowdsourced, API access, and single-round. Both approaches are considered red-teaming, but their usefulness differs significantly. 

Design choices for red-teaming exercises are largely made without disclosure, and exercise results are not public. There is no way to know whether companies make their product safer based on the results (MIT Technology Review). Accordingly, some researchers view red-teaming as a “catch-all response to quiet all regulatory concerns about model safety that verges on security theater” (Feffer et al, preprint). These concerns are echoed in the public comments submitted to NIST regarding their assignments in E.O. 14110. Similarly, Anthropic, a safety-focused AI developer, has called for standardizing red-teaming and blue-teaming procedures.


Federal Infrastructure

The TEST AI Act modifies NIST’s role under Executive Order 14110 to allow for interagency cooperation. The Act leverages the extensive federal infrastructure already in place for AI testing and testbeds. Congressional sponsors, including Senators Lujan (D-NM) and Risch (R-ID), identify the DOE as the only agency with the necessary computing power, data, and technical expertise to develop testbeds for frontier AI systems. 

Several trustworthy AI testbeds across federal agencies could serve as resources for the TEST AI testbeds. The Defense Advanced Research Projects Agency’s Guaranteeing AI Robustness Against Deception (GARD) project develops defense capabilities (like blue-teaming) to prevent and defeat adversarial attacks. They have produced a publicly available virtual testbed, toolbox, benchmarking dataset, and training materials for evaluating and defending machine learning models. Similarly, NIST’s Dioptra testing platform, which predates E.O. 14110, evaluates the trustworthiness, security, and reliability of machine learning models. Dioptra aims to “research and develop metrics and best practices to assess vulnerabilities of AI models” i.e., improve red-teaming. NSF also funds several testbeds (Chameleon, CloudLab) that provide computing power for AI/ML experimentation.


Conclusion

The TEST AI Act could usher in an era of increased robustness and accountability for federal use AI systems. Unlike GARD or Dioptra, which narrowly focus on defensive capabilities and trustworthiness, respectively, the TEST AI Act creates wide-ranging testbeds that are applicable across use cases and contexts. 

The Act also increases activity in the under-researched area of blue-teaming. Improving blue-teaming strengthens defensive capabilities, and can also help to solve the problem of “red-teaming hype”. It makes red-teaming results more actionable, and forces red teams to meet higher standards when testing defenses. This deliberate focus on both offensive and defensive techniques improves the current state of AI security while offering a framework for developing future AI standards and testing across the federal system. 

The TEST AI Act also addresses the limitations of current ad-hoc testing environments by formalizing and expanding testbed creation. In doing so, it redefines how government AI systems will be secured, bringing consistency and transparency to previously varied practices. This supports the broader goals of the Executive Order in improving risk assessment for biosecurity, cybersecurity, national security, and critical infrastructure. Crucially, it could stop the government’s systems from contributing to these harms from AI.

The Act’s integration with established entities like NIST and the DOE is critical, leveraging their unique infrastructure and technical expertise. It adopts the Executive Order’s position that collaboration on AI across government agencies is crucial for effectively harnessing vast resources and disparate expertise to make AI as beneficial as possible. By turning testbed creation and production into an interagency effort, the TEST AI Act establishes a testbed program on a previously unreplicated scale.