Overview
Open Ratings provides independent, cryptographically anchored trust assessments for the agentic economy — evaluating the outputs, transactions, and deliverables produced by AI agents, human–AI partnerships, and autonomous software systems. Our methodology applies the risk proportionality principles used by established credit rating agencies to a new class of counterparty: the AI agent.
In the emerging agentic economy, AI agents negotiate contracts, produce code, generate content, execute transactions, and deliver services — often with minimal human oversight. Every such interaction carries risk. Open Ratings evaluates these agentic outputs at a specific point in time — not a continuous stream, but a deliberate snapshot, analogous to a credit rating agency assessing a bond issuance. Each rating reflects the trustworthiness, correctness, and reliability of that output as assessed by our multi-dimensional analytical framework.
Our rating scale mirrors the letter-grade conventions used globally in credit markets (AAA through D), and our target grade distribution is calibrated to the observed distribution of corporate credit ratings. This means a BBB from Open Ratings carries the same relative trust signal as a BBB in the bond market: adequate quality with room for improvement, but fundamentally sound. When an agent delivers a BBB-rated output, counterparties know what level of risk they are accepting.
Every rating is permanently recorded via Merkle-tree hashing and SHA-256 attestation, creating an immutable, independently verifiable record of when the assessment was performed, what grade was assigned, and what computational evidence supports the finding. This cryptographic anchoring enables trust without requiring trust — any party can independently verify an agent’s track record.
Disclosures
Grades represent an assessment of agent-produced outputs at a specific point in time. They are not a guarantee of future performance, a warranty against defects, or an endorsement of an agent’s general capability. Agent behavior may vary across tasks, parameter configurations, domains, and interaction contexts not represented in the assessed output.
Rating Scale
| Grade | Score Range | Assessment | Definition | Target Dist. |
|---|---|---|---|---|
| Investment Grade — Production Recommended | | | | |
| AAA | 9.70 – 10.00 | Human + Auto | Exceptionally trustworthy agent output. Near-zero defect probability. Verified through comprehensive automated analysis, formal methods, and expert human due diligence. Extremely rare. | < 0.05% |
| AA | 9.30 – 9.69 | Human + Auto | Very strong output quality, differing from the highest-rated outputs only to a small degree. Requires enhanced validation depth and expert review. | ~2% |
| A | 7.50 – 9.29 | Automated | Strong output quality. Sound architecture, adequate verification coverage, no critical vulnerabilities detected. Somewhat more susceptible to quality degradation under high-throughput agentic workflows than higher-rated outputs. | ~8% |
| BBB | 6.00 – 7.49 | Automated | Adequate output quality. Exhibits acceptable quality parameters, but adverse conditions — such as prompt drift, tool-chain changes, or model updates — are more likely to weaken quality over time. | ~15% |
| Speculative Grade — Elevated Risk | | | | |
| BB | 4.50 – 5.99 | Automated | Below-average quality. Faces ongoing uncertainties including incomplete verification, known vulnerabilities, hallucinated dependencies, or architectural weaknesses in the agent’s output. | ~20% |
| B | 3.00 – 4.49 | Automated | Weak output quality. Currently functional, but adverse conditions — model regression, tool deprecation, or environmental changes — will likely impair reliability. | ~25% |
| CCC | 1.50 – 2.99 | Automated | Substantial quality risk. Dependent on favorable conditions to avoid failure. Critical vulnerabilities, hallucinated dependencies, or fabricated API endpoints detected in agent output. | ~18% |
| D | 0.00 – 1.49 | Automated | Non-functional or critically defective. The agent’s output fails on basic execution, contains irreconcilable errors, or exhibits fundamental flaws rendering it unsuitable for any production use or downstream consumption. | ~12% |
Agent outputs rated BBB or above are considered investment grade — suitable for production use, downstream integration, and organizational reliance. Grades below BBB carry speculative-grade designation, indicating elevated counterparty risk. This threshold mirrors the BBB/BB boundary used globally in credit markets. In the agentic economy, this distinction determines whether an agent’s deliverables can be trusted without additional human verification.
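To make the threshold mechanics concrete, the following is a minimal Python sketch that maps a composite quality score onto the published score ranges and applies the BBB investment-grade floor. The thresholds are taken from the table above; the function names and structure are illustrative and are not the production scoring engine.

```python
# Minimal sketch: mapping a composite quality score (0.00 to 10.00) to an
# Open Ratings letter grade using the published thresholds. Illustrative only;
# the production scoring engine is proprietary. Note that per the Rating Tiers
# section, AA and AAA additionally require human due diligence review and
# cannot be reached from the automated score alone.

GRADE_THRESHOLDS = [   # (minimum score, grade), highest first
    (9.70, "AAA"),
    (9.30, "AA"),
    (7.50, "A"),
    (6.00, "BBB"),
    (4.50, "BB"),
    (3.00, "B"),
    (1.50, "CCC"),
    (0.00, "D"),
]

INVESTMENT_GRADE_FLOOR = 6.00  # BBB/BB boundary


def score_to_grade(score: float) -> str:
    """Return the letter grade for a composite score per the published ranges."""
    for floor, grade in GRADE_THRESHOLDS:
        if score >= floor:
            return grade
    return "D"


def is_investment_grade(score: float) -> bool:
    """BBB (6.00) and above are investment grade; everything below is speculative."""
    return score >= INVESTMENT_GRADE_FLOOR


if __name__ == "__main__":
    print(score_to_grade(7.8), is_investment_grade(7.8))   # A True
    print(score_to_grade(5.2), is_investment_grade(5.2))   # BB False
```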
Target Distribution
Our rating distribution is calibrated to mirror the observed distribution of corporate credit ratings. This ensures that an Open Ratings grade carries proportional risk weight equivalent to its credit market counterpart. A AAA is as rare in the agentic economy as it is in corporate bonds — today, only two U.S. companies maintain a AAA corporate credit rating. Most agent outputs, like most corporate issuers, cluster in the middle of the scale.
These proportions are targets, not quotas. Distribution targets serve as calibration anchors — if observed distributions diverge significantly from targets, we investigate whether the divergence reflects the true state of agent output quality or a calibration error in our model.
Assessment Dimensions
Each agent output is evaluated across six trust dimensions. Dimensions are weighted to reflect their relative contribution to overall counterparty risk in agentic transactions, calibrated through empirical analysis of production incident correlation and agent failure modes.
| Dimension | Priority | What We Assess |
|---|---|---|
| Security | Primary | Vulnerability exposure, secret management, input validation, authentication patterns, dependency vulnerability propagation, detection of hallucinated packages or API endpoints, and agent tool-use safety boundaries. |
| Logic Correctness | Primary | Semantic accuracy of implementation relative to stated intent, edge case handling, error propagation paths, type safety, and divergence between specifications and actual behavior. |
| Codebase Fit | High | Architectural coherence, consistency with project conventions, appropriate abstraction levels, integration quality with existing systems, and alignment with the consuming agent’s or human’s operational context. |
| Maintainability | Moderate | Cyclomatic and cognitive complexity, nesting depth, function length, naming clarity, documentation coverage, and long-term modifiability. |
| Test Quality | Supporting | Test effectiveness measured through mutation analysis: deliberate code changes are introduced that should cause tests to fail; mutants that survive indicate a weak test suite. |
| Dependency Health | Supporting | Dependency freshness, known vulnerability exposure (CVE), license compatibility, maintainer activity, and whether declared dependencies actually exist. |
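The following sketch illustrates the aggregation mechanics only. The priorities in the table above establish an ordering, but the actual dimension weights are proprietary; the numbers below are placeholders chosen solely so the example runs.

```python
# Minimal sketch of weighted dimension aggregation. The weights are placeholders
# ordered by the stated priorities (Primary > High > Moderate > Supporting);
# the actual weights are proprietary.

DIMENSION_WEIGHTS = {
    "security":          0.30,  # Primary (placeholder)
    "logic_correctness": 0.25,  # Primary (placeholder)
    "codebase_fit":      0.20,  # High (placeholder)
    "maintainability":   0.10,  # Moderate (placeholder)
    "test_quality":      0.08,  # Supporting (placeholder)
    "dependency_health": 0.07,  # Supporting (placeholder)
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores on a 0 to 10 scale."""
    total_weight = sum(DIMENSION_WEIGHTS.values())
    weighted = sum(
        DIMENSION_WEIGHTS[name] * dimension_scores[name]
        for name in DIMENSION_WEIGHTS
    )
    return weighted / total_weight


if __name__ == "__main__":
    print(composite_score({
        "security": 7.9, "logic_correctness": 8.1, "codebase_fit": 7.0,
        "maintainability": 6.5, "test_quality": 6.0, "dependency_health": 7.2,
    }))
```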
Assessment Process
Ratings are produced through a multi-phase analytical pipeline that generates Verification Event Envelopes (VEEs) at each stage. Each VEE is an independently verifiable intermediate output, cryptographically combined into a proof-of-work attesting to the depth and integrity of the assessment.
1. Input Ingestion: The agent’s output is captured at the specified revision or transaction point. Dependency manifests, language composition, framework indicators, and provenance metadata are extracted. The producing agent is identified through stylometric analysis, tool-use signatures, and declared agent identity where available.
2. Static Analysis: Abstract syntax tree parsing measures structural complexity across every function and module. Pattern-matching rules detect security anti-patterns, code duplication, and known vulnerability signatures. Declared dependencies are cross-referenced against authoritative registries.
3. Sandboxed Execution: The agent’s output is built and executed within an isolated containerized environment. Dependency installation is attempted and failures recorded. Test suites are executed and results captured. Mutation analysis introduces controlled defects to measure verification effectiveness.
4. Quality Scoring: Dimension scores are computed from the combined static and dynamic analysis results, weighted per the dimension table, and aggregated into a composite quality score. The composite score is mapped to a letter grade per the rating scale thresholds.
5. Proof-of-Work Generation: Intermediate outputs from each phase are organized into a Merkle tree. A difficulty-adjusted cryptographic proof is generated, with difficulty proportional to the achieved grade (see the sketch after this list).
6. Cryptographic Anchoring: The rating grade, composite score, proof-of-work hash, Merkle root, and timestamp are sealed as a Verification Event Envelope via cryptographic attestation, creating an immutable, timestamped record that any counterparty in the agentic economy can independently verify.
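The following is a minimal sketch of the proof-of-work and anchoring steps, assuming a standard pairwise Merkle construction over phase outputs and a leading-zero-bits proof-of-work search. The difficulty schedule, canonical encodings, and field handling are assumptions for illustration, not the production protocol.

```python
import hashlib
import json
import time


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[bytes]) -> str:
    """Pairwise-hash phase outputs up to a single root, duplicating the last
    node on odd-length levels, as in a standard Merkle tree."""
    level = [sha256(leaf) for leaf in leaves]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]


def proof_of_work(root: str, difficulty_bits: int) -> tuple[int, str]:
    """Search for a nonce whose hash has `difficulty_bits` leading zero bits.
    Scaling difficulty with the achieved grade is assumed for illustration."""
    nonce = 0
    while True:
        digest = sha256(f"{root}:{nonce}".encode())
        if int(digest, 16) >> (256 - difficulty_bits) == 0:
            return nonce, digest
        nonce += 1


if __name__ == "__main__":
    # Illustrative envelope assembly; field names follow the attestation
    # schema shown in the Cryptographic Verification section.
    phase_outputs = [b"ingestion-report", b"static-analysis-report",
                     b"execution-report", b"scoring-report"]
    root = merkle_root(phase_outputs)
    nonce, pow_hash = proof_of_work(root, difficulty_bits=16)  # placeholder difficulty
    envelope = {"grade": "A", "score": 812, "merkle_root": root,
                "pow_hash": pow_hash, "timestamp": int(time.time())}
    print(json.dumps(envelope, indent=2))
```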
Rating Tiers
Automated Assessment (D through A)
Ratings from D through A are produced entirely through the automated pipeline. No human judgment is involved in the scoring. A grade of A represents the highest trust level achievable through automated analysis of an agent’s output alone.
The automated ceiling exists by design. Trust beyond the A threshold involves factors that no automated system can reliably assess: architectural intent that spans years of technical strategy, organizational commitment to governance, the subtle distinction between output that is merely correct and output that is wisely constructed, and whether the human–agent partnership demonstrates sustained judgment under adversarial conditions.
Enhanced Assessment (AA and AAA)
Grades of AA and above cannot be achieved through automated analysis alone. They require a structured human due diligence review conducted by senior analysts with extensive experience in agentic systems, software risk assessment, and AI governance.
For an agent output to be considered for AA or AAA, it must first achieve a minimum automated score of A (7.50+). The automated score establishes the quantitative floor; the human review determines whether the qualitative factors — including the agent’s governance posture, the human–agent partnership dynamics, and long-term reliability trajectory — support elevation above that floor.
Agent Identity & Attribution
In the agentic economy, knowing which agent produced an output — or is directly or indirectly involved in an exchange of goods or services for any consideration — is as important as knowing which firm issued a bond. Open Ratings identifies the AI agent that generated or assisted in generating the assessed output through stylometric analysis, tool-use fingerprinting, and declared agent identity. This attribution is reported as part of the rating when detected with sufficient confidence.
Different AI agents exhibit distinct failure mode profiles — hallucination rates, tool-use patterns, reasoning depth, and domain-specific weaknesses — that affect the probability of undiscovered defects. Agent attribution enables agent-specific calibration: if an agent is known to have elevated hallucination rates for a particular package ecosystem, the dependency health dimension is scrutinized more heavily. Over time, agents accumulate rating histories that function as credit histories in the agentic economy.
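As an illustration of agent-specific calibration, the sketch below raises scrutiny of the dependency health dimension when an attributed agent’s profile shows an elevated hallucination rate for the package ecosystem in use. The profile fields, thresholds, and multipliers are hypothetical; the actual calibration model is proprietary.

```python
# Illustrative only: increase scrutiny of the dependency-health dimension when
# the attributed agent has an elevated hallucination rate for the package
# ecosystem in use. Profile fields, thresholds, and multipliers are hypothetical.

AGENT_PROFILES = {
    "example-agent-v1": {"hallucination_rate": {"npm": 0.06, "pypi": 0.02}},
}


def scrutiny_multiplier(agent_id: str, ecosystem: str,
                        threshold: float = 0.04, boost: float = 1.5) -> float:
    """Return a scrutiny multiplier for dependency health; values above 1.0
    mean the dimension is checked more aggressively for this agent/ecosystem."""
    profile = AGENT_PROFILES.get(agent_id, {})
    rate = profile.get("hallucination_rate", {}).get(ecosystem, 0.0)
    return boost if rate > threshold else 1.0


if __name__ == "__main__":
    print(scrutiny_multiplier("example-agent-v1", "npm"))   # 1.5
    print(scrutiny_multiplier("example-agent-v1", "pypi"))  # 1.0
```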
Open Ratings also evaluates human–AI partnerships: the composite quality produced when a specific developer works with a specific agent. Partnership ratings weight developer skill, agent capability, agent wrapper quality, and collaboration patterns to produce a composite trust score for the partnership as a counterparty.
Cryptographic Verification
Every Open Rating is sealed as a Verification Event Envelope (VEE) via cryptographic attestation at the time of issuance. The VEE taxonomy supports multiple event types — CODE_QUALITY, AGENT_GOVERNANCE, COMPLIANCE_CHECK, MODEL_ATTESTATION — enabling a unified trust infrastructure for the agentic economy. The on-chain record is a compact attestation containing the essential elements needed to verify the rating’s existence and integrity.
```
event_type: CODE_QUALITY | AGENT_GOVERNANCE | COMPLIANCE_CHECK | MODEL_ATTESTATION
subject_hash: SHA-256(agent_id + output_ref)
revision_hash: first 20 bytes of assessed output SHA
grade: letter grade enum (D=0 ... AAA=7)
score: composite quality score × 100
pow_hash: proof-of-work hash
merkle_root: Merkle root of analysis phase outputs
timestamp: Unix timestamp of rating issuance
agent_id: identified producing agent (when attributed)
prior_rating: address of previous rating for same subject (if any)
```
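A counterparty could recompute the subject hash and check a retrieved attestation against it roughly as follows. This is a minimal sketch; the canonical encoding of agent_id and output_ref, and the full verification of the Merkle root and proof-of-work, are not specified here.

```python
import hashlib


def subject_hash(agent_id: str, output_ref: str) -> str:
    """SHA-256 over agent_id + output_ref, per the schema above. The exact
    canonical encoding (separators, normalization) is an assumption here."""
    return hashlib.sha256((agent_id + output_ref).encode("utf-8")).hexdigest()


def verify_attestation(record: dict, agent_id: str, output_ref: str) -> bool:
    """Recompute the subject hash and compare it to the retrieved record.
    A full check would also verify the Merkle root and proof-of-work hash."""
    return record["subject_hash"] == subject_hash(agent_id, output_ref)


if __name__ == "__main__":
    record = {"subject_hash": subject_hash("example-agent-v1", "repo@abc123")}
    print(verify_attestation(record, "example-agent-v1", "repo@abc123"))  # True
```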
The attestation layer serves as a filing cabinet, not an analytical tool. It provides three guarantees: the rating existed at the claimed time, the rating has not been altered after issuance, and the full history of an agent’s or output’s ratings can be traversed by following prior_rating links — building a verifiable trust history for any participant in the agentic economy.
Agent Trust History
Every rating recorded on-chain becomes part of an agent’s permanent trust history. Because each Verification Event Envelope includes a prior_rating link, the complete chain of assessments for any agent or output can be traversed backward to its first rated interaction — creating an immutable behavioral record analogous to a credit bureau file.
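A minimal sketch of traversing that history, assuming a generic fetch function that retrieves an attestation by address (for example via a chain RPC client or indexer, not specified here):

```python
from typing import Callable, Optional


def rating_history(latest_address: str,
                   fetch: Callable[[str], dict]) -> list[dict]:
    """Walk prior_rating links from the newest attestation back to the first.
    `fetch` is a placeholder for whatever client retrieves an attestation by
    its address."""
    history = []
    address: Optional[str] = latest_address
    while address:
        record = fetch(address)
        history.append(record)
        address = record.get("prior_rating")  # None at the first rating
    return history


if __name__ == "__main__":
    # In-memory stand-in for an attestation store, for illustration only.
    STORE = {
        "0x03": {"grade": "A",   "prior_rating": "0x02"},
        "0x02": {"grade": "BBB", "prior_rating": "0x01"},
        "0x01": {"grade": "BB",  "prior_rating": None},
    }
    print([r["grade"] for r in rating_history("0x03", STORE.__getitem__)])
    # ['A', 'BBB', 'BB'] (newest to oldest)
```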
This history is not merely archival. Past agentic behavior, cryptographically recorded and independently verifiable, informs future assessments. An agent with a consistent record of investment-grade outputs carries that reputation forward. An agent whose outputs have degraded over successive assessments will reflect that trajectory in its composite trust profile. The blockchain does not forget, and neither does the rating framework.
Trust history enables several capabilities that static, point-in-time assessment cannot:
- Trend detection: Identifying agents whose output quality is improving or deteriorating across successive interactions.
- Domain-specific trust: An agent may be highly reliable in one domain and unreliable in another. Historical segmentation by task type surfaces these patterns.
- Counterparty due diligence: Before entering an agentic transaction, any party can query the on-chain record to verify an agent’s demonstrated track record — not its claimed capability, but its observed reliability.
- Regression alerts: When an agent that historically produced A-rated outputs begins delivering B-rated work, the trust history flags the divergence for human review.
Agent trust histories record what has happened, not what will happen. Past performance does not guarantee future output quality. Trust histories are an input to risk assessment, not a substitute for it. The rating framework uses historical data as one signal among many — the specific weighting and integration methodology is proprietary.
Rating Types
Open Rating
Issued for publicly accessible agent outputs — open-source repositories, public API responses, or published deliverables. Any party can independently verify the output against the assessment findings.
Verified Rating
Issued for private agent outputs or proprietary agentic workflows using the identical methodology. The underlying output is not publicly accessible, but the assessment was conducted under the same standards.
Shadow Rating
An unsolicited Open Rating issued without the request or involvement of the agent operator or output owner. Produced at our discretion for agent outputs of public interest — particularly when agents are deployed at scale in critical infrastructure or high-stakes domains.
Agent Rating
A composite trust score for an AI agent derived from the aggregate history of its individual output ratings. Agent Ratings function as credit scores for the agentic economy: they reflect the agent’s demonstrated reliability across multiple assessments, domains, and time periods.
Reproducibility
Each assessment task is versioned and deterministic. Task definitions include structured metadata, evaluation prompts, source artifacts, and oracle checks. Runs log agent identifiers, model versions, decoding parameters, timestamps, and per-task outputs. Suite hashes and prompt hashes are recorded for auditability.
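As a sketch of how a suite hash might be derived, the example below computes a deterministic SHA-256 digest over canonically serialized task definitions. The canonical serialization used in production is an assumption.

```python
import hashlib
import json


def suite_hash(task_definitions: list[dict]) -> str:
    """Deterministic SHA-256 over a task suite: tasks are serialized with
    sorted keys so the same suite always yields the same digest. The exact
    canonical form used in production is not specified here."""
    canonical = json.dumps(task_definitions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tasks = [
        {"task_id": "t-001", "prompt": "Implement a rate limiter", "oracle": "unit-tests"},
        {"task_id": "t-002", "prompt": "Fix the off-by-one bug", "oracle": "regression-suite"},
    ]
    print(suite_hash(tasks))
```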
The assessment pipeline produces independently verifiable Verification Event Envelopes at each phase. These VEEs are organized into a Merkle tree whose root is anchored on-chain, enabling third-party verification that the claimed analysis was actually performed to the claimed depth. In the agentic economy, reproducibility is not optional — it is the mechanism by which trust scales beyond bilateral relationships.
Limitations & Disclaimers
An Open Rating is an assessment of agent output quality at a specific point in time for a specific deliverable or revision. It is not a guarantee of future performance, a warranty against defects, or an endorsement of the agent’s general capability or the operator’s business viability.
Ratings from D through A reflect automated analysis only. No automated system can guarantee the absence of defects. The absence of detected issues is not proof that no issues exist. Agent behavior is inherently non-deterministic; an agent that produces an A-rated output today may produce a B-rated output tomorrow under different conditions.
Agent attribution is probabilistic. Confidence levels below 80% are not published. Attribution reflects our best assessment based on stylometric analysis and tool-use fingerprinting, and is reported as an analytical finding, not a definitive determination. Agent Ratings (composite trust scores) require a minimum number of individual assessments before publication.
Intellectual Property
The Open Ratings framework is built on a portfolio of issued and pending U.S. patents invented by Luis M. Sanchez. The portfolio follows a hub-and-spoke architecture: a core cryptographic verification infrastructure supports multiple application-specific patents covering distinct aspects of the AI trust stack.
| Domain | Scope | Status |
|---|---|---|
| Cryptographic Verification Infrastructure | Blockchain-anchored verification protocol with configurable transparency, identity-bound key lifecycle, and selective disclosure for AI agent governance. The trust layer underlying all Open Ratings attestations. | Patent Pending |
| AI Code Quality Certification | Computational proof-of-work system for certifying agent output quality using credit-risk-calibrated grading, agent-specific calibration, and multi-signature consensus validation. | Patent Pending |
| Human–AI Partnership Rating | System for rating human–AI coding partnerships through composite scoring of developer skill, agent capability, agent wrapper quality, and collaboration patterns. | Patent Pending |
| Digital Asset Compatibility & Agent Licensing | Asset-capability compatibility scoring across heterogeneous registries with Bayesian quality-backed collateral and agent-native licensing protocols for the agentic economy. | Patent Pending |
| Agent-Governed Messaging | Self-governing payload architecture for agent-to-agent and agent-to-human messaging with built-in governance, policy enforcement, and audit provenance. | Patent Pending |
| Legacy Code Modernization | Dialect-adaptive deployment system for enterprise legacy code modernization, including dialect fingerprinting and constraint-injected translation pipelines. | Patent Pending |
Additional patents in the portfolio cover AI model routing and inference optimization, hardware-efficient distributed inference, and knowledge distillation systems. The specific methodologies, algorithms, and weighting systems used in the Open Ratings assessment engine are protected as trade secrets and by the patent portfolio described above.
Legal Basis for Shadow Ratings
Open Ratings publishes unsolicited assessments of public agent outputs under protections established by over a century of credit rating agency precedent and First Amendment jurisprudence. Our shadow rating practice is modeled on the long-standing practice of unsolicited credit ratings.
Methodology Disclosure
This document describes our assessment framework at a level of detail sufficient for users of our ratings to understand what a grade means, how it was derived, and what factors influence it. Certain implementation details, including dimension weighting, scoring algorithms, and calibration parameters, are trade secrets and are not disclosed. The assessment methodology is protected by multiple U.S. patent applications invented by Luis M. Sanchez.
Independence
Open Ratings operates independently of the agents, operators, and organizations whose outputs it rates. We do not accept compensation from rated entities for the issuance of ratings, and the payment of fees for solicited ratings does not influence the grade assigned.