What is the Treasury FS AI RMF and does it apply to non-financial companies?

The Treasury Financial Services AI Risk Management Framework was published in February 2026 by the US Department of the Treasury. It was written for financial services institutions (banks, insurers, credit unions, investment advisers), but its reach extends beyond that sector in two practical ways.

What is the difference between a vendor questionnaire and an independent AI risk assessment?

A vendor questionnaire asks a vendor to describe its own controls, policies, and commitments. The vendor answers questions about what it says it does.

How do you measure hallucination risk in a GenAI vendor?

Hallucination measurement requires a domain-specific benchmark, not just general-purpose benchmarks like MMLU or TruthfulQA. General benchmarks measure broad language model capability; domain benchmarks measure accuracy in the specific context where you will deploy the tool.

What SOC 2 controls apply to AI systems?

SOC 2 Type II reports are organized around the Trust Services Criteria.

How often should AI vendors be reassessed?

The Treasury FS AI RMF establishes a tiering framework. Tier 1 vendors (those with AI systems that are customer-facing, used in regulated decisions, or critical to operations) should be reassessed quarterly.

GenAI Vendor Risk Assessment: A Framework f…

TL;DR: GenAI Vendor Risk Assessment: A Framework for 2026 (Treasury FS AI RMF + SOC 2), a practical compliance guide for enterprise and HR teams in 2026.

Most procurement teams still treat AI vendor risk the same way they treat SaaS vendor risk: send a 40-question security questionnaire, collect a SOC 2 certificate, review the DPA, and file everything in a shared drive. That approach satisfied auditors in 2022. It does not satisfy them now.

The US Treasury published its Financial Services AI Risk Management Framework in February 2026. The document is explicit: questionnaire responses are not sufficient evidence of risk management for AI systems. What regulators want is independent testing results, documented bias audits, measurable hallucination rates, and continuous monitoring programs. Not policy documents that describe what a vendor intends to do.

This creates a practical problem for any team evaluating ChatGPT Enterprise, Claude, Azure OpenAI, Gemini for Workspace, or any other GenAI tool. The vendors have policies. They have SOC 2 reports. They have privacy documentation that runs to dozens of pages. None of that documentation tells you what you actually need to know before you sign a contract and start sending company data through the system.

This guide gives you a six-dimension assessment model that maps to the Treasury FS AI RMF requirements, a comparison table across major vendors, and a 30-minute shortcut for teams that cannot run a full assessment on every tool.

The questionnaire trap

Security questionnaires were designed for a different problem. When your vendor is a payroll processor or a CRM, the key questions are about access controls, encryption, backup procedures, and incident response. Those are meaningful controls for systems that store and transmit data. They are insufficient for systems that generate content, make predictions, or support decisions.

A GenAI system introduces risks that a traditional security questionnaire does not address. The model may produce confident, incorrect outputs at a rate that is acceptable for low-stakes use cases but dangerous for regulated ones. The model may have systematic biases in outputs affecting particular demographic groups in ways that create liability under employment discrimination law or fair lending requirements. The model may be vulnerable to prompt injection attacks that allow a malicious input to override system instructions. A model update released without notice may change the system's behavior in ways that violate your AI acceptable use policy.

None of these failure modes appear in a questionnaire that asks "Do you encrypt data at rest?" or "What is your vulnerability disclosure policy?" The controls are orthogonal to the risk.

The Treasury FS AI RMF names this problem directly. Section 4.2 of the February 2026 framework distinguishes between "intent documentation" (what a vendor says it does) and "evidence testing" (what independent evaluation shows the system actually does). Financial institutions subject to the framework are required to obtain evidence, not just documentation.

What the Treasury FS AI RMF actually requires

The Treasury framework was written for financial services institutions, but its reach is extending. NIST has referenced it as a sector-specific implementation of the broader NIST AI RMF 1.0, giving it influence over federal procurement guidance. More immediately, banks and insurers are pushing its requirements into their vendor contracts and third-party risk programs. Any company that sells to financial services buyers will start seeing these requirements reflected in customer due diligence requests within the next 12 to 18 months.

The framework establishes four core requirements for AI vendor oversight that go beyond standard vendor risk management.

Independent testing. Self-attestation by the vendor is not acceptable for Tier 1 AI systems. The framework requires testing by a party that is independent of the vendor: either an internal team that did not select or implement the system, or a third-party assessor. Testing must cover the system's performance in the specific use case and data context where it will be deployed, not just on the vendor's published benchmarks.

Bias audits. For AI systems used in decisions that affect individuals (credit, insurance, employment, customer service prioritization), the framework requires documented analysis of disparate impact across protected demographic groups. The audit must be conducted on outputs, not just on the training data or model architecture. A bias audit that covers only the training dataset does not satisfy this requirement.

Hallucination measurement. The framework requires institutions to document the acceptable hallucination rate for each AI use case and obtain vendor data showing that the system meets that threshold. For regulated use cases such as producing investment advice summaries, generating compliance documentation, or supporting credit underwriting, the acceptable rate may be very low. The framework does not set a universal threshold; it requires institutions to determine and document their own based on the use case and the consequences of errors.

Security testing. Beyond standard penetration testing, the framework requires AI-specific security evaluation covering data poisoning (can training data be manipulated to alter model behavior?), prompt injection (can a malicious input override system instructions or extract training data?), and model inversion (can outputs be used to reconstruct training data?). Standard SOC 2 audits do not cover these attack vectors.

The framework tiers vendors by risk level. Tier 1 covers AI systems that are customer-facing, used in regulated decisions, or operationally critical. These require quarterly reassessment: reviewing a SOC 2 bridge letter, checking for model updates, and re-running spot tests. Tier 2 covers lower-risk internal tools and requires annual reassessment.

The 6-dimension assessment model

Running a compliant vendor assessment does not require a dedicated AI risk team. It requires asking the right questions in the right order and knowing which answers are acceptable. The following six dimensions map to the Treasury FS AI RMF requirements and cover the risks that standard security questionnaires miss.

Dimension 1: Model transparency

The first question is whether the vendor can tell you what the system is. Not a marketing description. A technical account of what the model does, what data it was trained on, and what its known limitations are.

Ask whether the vendor publishes a model card. A model card is a standardized document that describes a model's intended use cases, performance on evaluation benchmarks, known limitations, and demographic analysis of training data. Google, Anthropic, and Meta publish model cards. Some vendors do not.

Ask what the vendor discloses about training data. You do not need to know every dataset (that is legitimately proprietary), but you should know the general categories: web crawl data, licensed datasets, synthetic data, domain-specific corpora. If the vendor responds to this question with "proprietary architecture" or "we cannot disclose that," treat it as a red flag. The absence of training data disclosure makes bias auditing impossible and makes it impossible to assess whether your data category is similar to what the model has seen before.

Ask whether known limitations are documented. Every model has them. A vendor that cannot tell you what its system gets wrong has either not tested it or is not willing to share the results.

Dimension 2: Data handling and privacy

For most companies, data handling is the highest-stakes dimension because it determines whether using the tool creates GDPR, CCPA, or HIPAA liability.

The DPA must specify the legal basis for any processing under GDPR Article 6. If the DPA does not name a legal basis, if it just says "we process data to provide the service," it is insufficient. Ask specifically whether the vendor relies on contractual necessity (Article 6(1)(b)), legitimate interest (Article 6(1)(f)), or another basis, and whether that basis is documented in the DPA itself.

Training data use is the question that most teams ask too late. Find out whether your data is used for model training by default and what the opt-out process is. Several major vendors default to using customer data for training improvement unless the customer takes explicit action, such as signing an enterprise agreement or toggling an API setting. This is not always disclosed prominently. The privacy-first AI API no training GDPR CCPA guide covers which major APIs offer training-off by default versus by request.

EU data residency is a growing compliance requirement, particularly for companies subject to Schrems II or national data localization requirements. Check whether EU data residency is the default configuration for your account or an additional paid tier. For several major vendors, EU residency requires an enterprise contract at a significant price premium over the standard API.

Subprocessor transparency matters because data does not stay within the vendor's own infrastructure. Ask for the current subprocessor list and, more importantly, ask how and how quickly the vendor notifies customers of subprocessor changes. The GDPR requires vendors to give customers sufficient notice of new subprocessors to allow them to object before the change takes effect. "Sufficient notice" is not defined by the regulation, but 30 days is a reasonable minimum and 5 days is a compliance-grade requirement in most enterprise DPAs.

Dimension 3: Security posture

SOC 2 Type II certification tells you that an independent auditor examined the vendor's controls over a defined period and found them to be operating effectively. It does not tell you that the controls cover AI-specific risks.

The Trust Services Criteria most relevant to AI are CC6 (Logical and Physical Access Controls), CC7 (System Operations: monitoring and incident response), and CC9 (Risk Mitigation: vendor and third-party management). Ask the vendor specifically whether its SOC 2 audit scope includes AI system controls such as model change management, output monitoring, and adversarial testing procedures. Many vendors' SOC 2 reports cover standard SaaS infrastructure but do not extend to the AI pipeline.

Prompt injection testing is a specific security evaluation that most enterprise software vendors do not conduct because traditional software does not have this attack surface. A prompt injection attack uses a malicious input to override a system's instructions. For example, embedding instructions in a document that a user asks the AI to summarize can cause the AI to ignore its system prompt and follow the injected instructions instead. Ask whether the vendor conducts red-team adversarial testing specifically for prompt injection and whether the results are available for review.

API key management and incident response SLAs round out the security picture. What is the vendor's detection-to-notification timeline for security incidents? What counts as an incident for AI-specific failures such as hallucination at scale, bias detection in production, or a confirmed prompt injection breach? These definitions matter because a vendor whose incident response plan does not include AI-specific failure modes will not notify you when those failures occur.

Two professionals reviewing a document together at a conference table

Dimension 4: Hallucination and accuracy measurement

Hallucination is the technical term for a model producing confident, incorrect output. It is the most common failure mode for production GenAI deployments and the one most likely to create liability in regulated use cases.

Ask what benchmarks the vendor uses to evaluate accuracy. General-purpose benchmarks like MMLU (Massive Multitask Language Understanding) and HellaSwag measure broad capability. TruthfulQA specifically measures how often a model gives truthful answers to questions where common misconceptions are widespread. These are useful starting points but insufficient for domain-specific assessment.

For regulated use cases such as producing legal analysis, generating financial summaries, or supporting medical documentation, you need to know the vendor's hallucination rate on domain-specific test sets that resemble your actual use case. This data is rarely published in marketing materials. Ask for it directly in the vendor questionnaire. If the vendor cannot provide domain-specific accuracy data, ask what benchmarks they do use and what their published scores are.

Determine the acceptable hallucination rate for your use case before you send the question. A tool used for internal brainstorming can tolerate a higher error rate than a tool used to generate client-facing compliance summaries. Document your threshold. If the vendor cannot demonstrate that its system meets that threshold for your domain, the appropriate response is either not to deploy the tool or to implement human review gates that reduce the impact of errors. The AI vendor evaluation checklist includes a section for documenting these thresholds.

A vendor who cannot answer this question, who has not measured hallucination rates and cannot tell you what floor to expect, cannot manage the risk they are creating for your organization. Treat the absence of any measurement data as a disqualifying gap for Tier 1 use cases.

Dimension 5: Bias and fairness

Bias in GenAI systems is not a hypothetical concern. Multiple peer-reviewed studies published between 2023 and 2026 have documented systematic differences in outputs based on names, demographic descriptions, and linguistic patterns associated with particular groups. These differences can create liability under employment discrimination law, fair lending requirements, and the EU AI Act's prohibited practices provisions.

Ask whether the vendor has conducted disparate impact analysis on model outputs, not on training data, but on actual outputs across demographic groups. Training data analysis is a precursor; output analysis is the evidence. If the vendor's bias documentation covers only training data composition, the audit has not addressed the question that matters for liability.

For use cases involving hiring decisions, lending, healthcare, or insurance, a third-party bias audit is not just best practice. It is increasingly a legal requirement. New York City Local Law 144 requires employers using automated employment decision tools to obtain annual independent bias audits and publish summaries of results. For vendors positioning their tools for these use cases, ask for the third-party audit results directly.

Ask what demographic groups were included in the bias evaluation. A bias audit that only covers race and gender and excludes age, disability status, and national origin provides incomplete coverage for employment discrimination purposes.

Dimension 6: Contractual remedies

Contracts for GenAI tools often have liability caps that bear no relationship to the harm potential of the system. A tool used to generate customer communications, support underwriting decisions, or produce regulatory filings can generate errors that cause harm well in excess of a standard SaaS liability cap of 12 months' fees.

Before the contract is signed, establish what remedies are available when the system fails. What is the vendor's SLA for model degradation, meaning a measurable decline in output quality between model versions? Several major vendors have changed model behavior significantly between versions without formally announcing a model change.

What happens if a model update changes the system's behavior in ways that violate your AI acceptable use policy or produce outputs that are inconsistent with your compliance requirements? Ask specifically whether the vendor offers a rollback mechanism (the ability to revert to a prior model version) or whether updates are applied universally without customer control. For Tier 1 use cases, the absence of rollback capability is a contractual gap that should be addressed before deployment.

Review the liability cap relative to harm potential. Seek a carve-out from the standard liability cap for AI output failures that result in regulatory penalties, third-party claims, or discrimination liability. Most vendors will resist this, but documenting the request and the vendor's response is itself valuable for your internal risk record.

The agentic AI vendor contract clauses 2026 and AI vendor contract redline template 2026 guides cover specific contract language for these provisions.

Vendor comparison: major GenAI platforms

The following table reflects publicly available information as of June 2026. "Partial" indicates that a feature exists but with meaningful limitations, such as EU residency available only under enterprise contracts, or training opt-out requiring explicit configuration rather than being off by default.

Vendor	Model Card Published	Training Opt-Out Default	EU Data Residency Default	SOC 2 Type II	Hallucination Rate Published	Subprocessor Transparency
ChatGPT Enterprise	Partial	Yes (enterprise)	Partial (paid add-on)	Yes	Partial (benchmark scores, not domain)	Yes (list published)
Claude API (Anthropic)	Yes	Yes (API default)	Partial (EU available, not default)	Yes	Partial (eval results on select benchmarks)	Yes
Azure OpenAI	Partial	Yes (no training by default)	Yes (region-selectable)	Yes	Partial (Microsoft benchmark data)	Yes
Gemini for Workspace	Partial	Yes (admin control)	Partial (EU available, enterprise)	Yes	Partial (benchmark scores published)	Yes (via Google Workspace DPA)
Mistral API	Partial	Yes (API default)	Yes (EU-hosted by default)	Partial (in progress as of Q1 2026)	Limited (limited public benchmark data)	Partial (basic list)

A few observations worth noting. Azure OpenAI scores well on data residency because Azure's existing regional infrastructure allows customers to select EU regions for all processing, including model inference. Mistral is EU-headquartered and hosts data in Europe by default, which is a meaningful differentiation for companies with EU data residency requirements, but its SOC 2 program is newer than the other vendors on this list. Anthropic's Claude API does not use customer data for training by default on the API (confirmed in its usage policy), but EU data residency requires a specific contract configuration rather than being the default for API customers.

The 5-question assessment that takes 30 minutes

Running a full six-dimension assessment is the right approach for Tier 1 vendors. For initial screening, deciding which vendors go to full assessment and which are excluded early, these five questions surface the most critical gaps in under 30 minutes.

Q1: Does your SOC 2 Type II include AI-specific controls, or does the audit scope cover only standard SaaS infrastructure?

A vendor with a SOC 2 that excludes the AI pipeline has audited its data center, not its model. This does not disqualify the vendor, but it means you need to obtain evidence about model-specific controls through a different path.

Q2: What is your documented opt-out process for training data use, and is training opt-out the default configuration for API customers or enterprise customers?

If the answer is that opt-out requires a specific contract amendment and is not the default, ask how many of their current customers are actually opted out. The answer tells you whether the opt-out is a real operational control or a paper policy.

Q3: Can you provide a list of all current subprocessors with a contractual commitment to notify us within five business days of any addition or change?

The list itself matters. The notification commitment matters more. Many vendors bury subprocessor change notification in a 30-day email list subscription. Five days is the threshold that gives your legal and security teams time to review before the change takes effect.

Q4: What is your incident detection-to-notification SLA, and what constitutes an incident for AI-specific failure modes, including hallucination at scale, bias detection in production outputs, and confirmed prompt injection breaches?

Most vendors have a standard incident response SLA covering data breaches and system outages. Fewer have documented incident definitions that include AI-specific failures. If the vendor's incident response policy does not define hallucination at scale as a reportable event, you will not be notified when it happens.

Q5: If your model is updated and the update changes behavior in ways that affect my use case, what is my recourse?

This question distinguishes vendors with real model governance from vendors who update models as they choose and treat customer complaints after the fact. The answer you want is some combination of advance notice, staging environment access, and a rollback mechanism. The answer that should concern you is "updates are applied automatically and we cannot revert individual customer environments."

Reassessment cadence

Vendor assessment is not a one-time activity. GenAI models update more frequently than traditional software, and behavior changes between versions can affect accuracy, bias properties, and safety in ways that are not always announced. A model that passed your initial assessment in January may behave differently in June after a silent update.

The Treasury FS AI RMF sets the cadence. Tier 1 vendors (customer-facing AI, regulated decision support, operationally critical systems) require quarterly review. That does not mean a full assessment every quarter. It means requesting a SOC 2 bridge letter covering the period since the last full report, reviewing any model update announcements from the vendor, verifying that incident notification SLAs were met in the prior period, and re-running a spot test of key outputs on a defined test set you maintain internally.

Tier 2 vendors (lower-risk internal tools, productivity use cases, limited data exposure) can be reassessed annually. Annual reassessment should include a full SOC 2 review, a check on subprocessor changes, and a review of any regulatory or legal developments involving the vendor.

Build the reassessment calendar into your vendor contracts. Include a right to audit clause that allows you to request updated SOC 2 documentation and testing results at each reassessment interval. Most enterprise vendors will accept this language; resistance to a right to audit clause is itself a signal about the vendor's confidence in its controls.

For teams building out their vendor management program, the AI vendor due diligence checklist 2026 and embedded AI governance third-party tools guides provide supporting documentation templates. For companies that have already signed contracts and need to address gaps after the fact, the AI vendor contract redline template 2026 covers how to negotiate amendments.

The GDPR AI enforcement picture is also shaping vendor risk. The GDPR AI fines enforcement cases article documents the cases from 2025 and early 2026 where companies faced penalties for AI systems that lacked adequate vendor oversight, including cases where the company's own governance was adequate but the vendor's was not, and the company bore the liability anyway.

The regulatory direction is toward more accountability, not less. The Treasury FS AI RMF is the current leading edge of that direction in the US. Treating it as a financial services-only concern misreads where procurement standards are heading.

Vendor

Model Card Published

Training Opt-Out Default

EU Data Residency Default

SOC 2 Type II

Hallucination Rate Published

Subprocessor Transparency

ChatGPT Enterprise

Partial

Yes (enterprise)

Partial (paid add-on)

Yes

Partial (benchmark scores, not domain)

Yes (list published)

Claude API (Anthropic)

Yes

Yes (API default)

Partial (EU available, not default)

Yes

Partial (eval results on select benchmarks)

Yes

Azure OpenAI

Partial

Yes (no training by default)

Yes (region-selectable)

Yes

Partial (Microsoft benchmark data)

Yes

Gemini for Workspace

Partial

Yes (admin control)

Partial (EU available, enterprise)

Yes

Partial (benchmark scores published)

Yes (via Google Workspace DPA)

Mistral API

Partial

Yes (API default)

Yes (EU-hosted by default)

Partial (in progress as of Q1 2026)

Limited (limited public benchmark data)

Partial (basic list)