Framework · April 14, 2026

Why most maturity scores are wrong (and what it takes to fix them)

Maturity models are everywhere, but most are dangerously simplistic. What rigorous cross-framework scoring actually requires.

Why methodology transparency matters

Scoring systems are only trustworthy when they can be interrogated. A score that cannot be explained is a black box, and black boxes produce the wrong behavior: teams either dismiss the score entirely or optimize for the surface measure rather than the underlying capability it represents. Dacard publishes its methodology because the score is only useful if the team understands what it is measuring and why the evidence standard is set where it is.

This article walks through the full scoring process: the three frameworks, how evidence is gathered, how dimensions are evaluated, how stages are assigned, and how the system is structured to resist gaming.

The three frameworks

Dacard scores products against three proprietary frameworks covering 88 dimensions total:

  • F1 (People): 27 dimensions across team maturity and organizational capability
  • F2 (Process): 34 dimensions across operational practices and lifecycle management
  • F3 (Product): 27 dimensions across AI-native product architecture and design

Each framework is a structured lens on a different aspect of organizational capability. F1 (People) measures whether the team has the skills, structures, and culture required to build and operate AI-native products. F2 (Process) measures whether the operational practices and lifecycle management are mature enough to support consistent delivery. F3 (Product) measures whether the product itself embodies AI-native principles in its architecture, data design, and user experience.

The frameworks interact. A high F3 score with a low F1 score produces the Translation Gap, which is one of the most predictive signals in the diagnostic. A high F1 with a low F3 suggests an experienced team building a product that does not yet reflect their capability, often a timing or resourcing constraint rather than a strategic deficit. The cross-framework view is where the most actionable intelligence lives.
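
To make the cross-framework arithmetic concrete, here is a minimal sketch of the gap calculation. The function name and signed-gap convention are illustrative assumptions, not Dacard's published formula:

```python
def translation_gap(f1_composite: float, f3_composite: float) -> float:
    """Signed gap between product AI-nativeness (F3) and team maturity (F1).

    Positive: the product is ahead of the team. Negative: the team is
    ahead of the product. The sign convention is an assumption here.
    """
    return f3_composite - f1_composite

# A product scoring 68 on F3 built by a team scoring 41 on F1 -- the
# high-F3/low-F1 pattern the article flags as most predictive:
print(translation_gap(41.0, 68.0))  # 27.0
```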

The scoring process, step by step

  1. URL and identity intake: The scoring run begins with a product URL and company identifier. The system verifies reachability, resolves redirects, and captures the current state of public-facing product signals. No login required. No internal data accessed. All evidence is drawn from signals observable without privileged access.
  2. Signal extraction: The evidence engine crawls the product's public surface: marketing pages, documentation, changelog, job postings, press releases, integration listings, and any public API or developer documentation. Each source contributes to a structured signal set organized by framework and dimension category.
  3. Dimension evidence mapping: Extracted signals are mapped to dimensions across the active frameworks. Each dimension has a defined evidence schema: what observable signals are relevant, what they indicate when present or absent, and how they are weighted against each other (a sketch follows this list). A dimension with no observable evidence is scored differently from a dimension with contradictory evidence.
  4. AI-assisted dimension scoring: The Claude-powered scoring engine evaluates each dimension against its evidence set. The prompt structure is deterministic: same framework, same evidence schema, same evaluation rubric for every scoring run. The AI's role is to interpret evidence against the rubric, not to generate opinions about the company. Outputs are dimension scores on a 0-100 scale with supporting rationale.
  5. Stage classification: Dimension scores are aggregated by framework into a composite score, and the composite is classified into one of five stages: Foundation, Building, Scaling, Leading, or Compounding. Stage boundaries are fixed at defined score thresholds and do not adjust based on industry or company size.
  6. Anti-gaming validation: The validation layer checks for signals that indicate surface-level optimization rather than genuine capability. A company that has added AI-related language to its marketing copy without corresponding evidence in product behavior, changelog history, or technical documentation triggers a consistency flag. Flagged dimensions are reviewed against cross-signal patterns before scores are finalized.
  7. DAC-intelligence generation: The final step synthesizes dimension scores, cross-framework tensions, and delta data (if a prior baseline exists) into the DAC-intelligence report. This includes signal bar visualizations, the Translation Gap calculation, prioritized prescriptions from DAC-coach, and the evidence citations that support each finding.
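
The evidence schema in step 3 is the heart of the pipeline. Here is a minimal sketch of what such a schema could look like; the field names, weights, and scoring rule are illustrative assumptions, not Dacard's internal implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    source: str    # e.g. "changelog", "job_posting", "marketing_page"
    present: bool  # was the expected signal observed?
    weight: float  # hard-to-fake sources carry more weight

@dataclass
class DimensionEvidence:
    dimension: str
    signals: list[Signal] = field(default_factory=list)

    def evidence_score(self) -> float | None:
        """Weighted share of expected signals that were observed.

        Returns None when nothing was observable at all -- "no evidence"
        is scored differently from contradictory evidence.
        """
        total = sum(s.weight for s in self.signals)
        if total == 0:
            return None
        observed = sum(s.weight for s in self.signals if s.present)
        return 100 * observed / total

evidence = DimensionEvidence(
    dimension="AI evaluation practice",
    signals=[
        Signal("changelog", present=True, weight=3.0),
        Signal("technical_docs", present=True, weight=3.0),
        Signal("job_posting", present=False, weight=2.0),
        Signal("marketing_page", present=True, weight=1.0),
    ],
)
print(round(evidence.evidence_score(), 1))  # 77.8: strong but incomplete
```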

The five stages

Stage classification is the highest-level output of the scoring engine. Each stage represents a qualitatively different relationship to AI-native product practice, not just a different score band.

Foundation (0-20): AI is absent or strictly experimental. The product has no systematic AI capability, and the team has no structured practice for evaluating or integrating it. Decisions are made on intuition and lagging indicators.

Building (21-40): First AI features are present, typically bolt-on. The team is experimenting but has not yet established the data architecture or evaluation practices required for genuine AI-native capability. Evidence is often present in marketing language before it appears in product behavior.

Scaling (41-60): AI capability is real and in production, but coverage is uneven. Some dimensions are strong; others remain at Foundation or Building levels. The Translation Gap often becomes visible here as product AI-nativeness outruns team maturity or vice versa.

Leading (61-80): AI is a structural component of the product's core value proposition. The data architecture is purpose-built for intelligence, and the team has the practices to evaluate, iterate, and improve AI capability systematically. The moat is beginning to compound.

Compounding (81-100): AI-native at the architectural level. The product generates proprietary signal, improves with usage, and creates switching costs through accumulated intelligence. The team has the organizational maturity to govern and extend AI capability at scale. Very few products score here today.
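
The fixed thresholds behind these five stages make classification a simple lookup. A minimal sketch, assuming the boundary values fall exactly as the ranges above are written:

```python
# The five stage bands as published above; classifying a composite score
# is a threshold lookup over fixed cut-points.
STAGES = [
    (20, "Foundation"),
    (40, "Building"),
    (60, "Scaling"),
    (80, "Leading"),
    (100, "Compounding"),
]

def classify_stage(composite: float) -> str:
    for upper, stage in STAGES:
        if composite <= upper:
            return stage
    raise ValueError("composite must be on a 0-100 scale")

print(classify_stage(57))  # Scaling
print(classify_stage(81))  # Compounding
```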

Evidence-based vs. self-report: why it matters

Self-report scoring

  • Evidence source: Team members answer structured questions about their own practices and capabilities
  • Bias exposure: High. Social desirability bias, inconsistent interpretation of questions, optimistic anchoring
  • Comparability: Low. Different respondents interpret the same question differently; no consistent external standard
  • Gaming resistance: None. Respondents can select higher answers without any corresponding evidence requirement
  • Actionability: Moderate. Tells you what the team believes; does not validate whether the belief is accurate

Evidence-based scoring (Dacard)

  • Evidence source: Public signals extracted from product surfaces, documentation, changelog, job postings, and technical artifacts
  • Bias exposure: Low. Evidence is observable and the same regardless of who is doing the scoring
  • Comparability: High. Same evidence schema applied to every company; scores are directly comparable across time and across organizations
  • Gaming resistance: High. Improving the score requires changing observable evidence, which requires changing the underlying capability
  • Actionability: High. Tells you what exists in the product and organization; prescriptions target specific evidence gaps

> "Self-report diagnostics measure confidence. Evidence-based diagnostics measure capability. These are not the same thing, and in AI-native product development, the gap between them is often where the most important problems are hiding."

Why gaming is structurally difficult

A common question from teams encountering evidence-based scoring for the first time is whether the score can be gamed by optimizing public-facing signals without changing underlying practice. The short answer is: technically yes, practically no, and the cost of gaming is higher than the cost of just improving.

The evidence engine does not score based on any single signal source. A dimension like "AI evaluation practice" is scored against a constellation of signals: changelog entries describing model evaluation runs, job descriptions requiring evaluation experience, documentation explaining how the team assesses AI output quality, and product behavior consistent with continuous improvement. Gaming one of these signals while the others remain absent produces a low consistency score that flags the dimension for manual review.

More importantly, the evidence schema is designed so that signals that are easy to fake (marketing copy, feature names, press release language) are weighted below signals that are hard to fake (changelog cadence, technical documentation depth, hiring patterns over time). A company can announce an "AI-first strategy" in thirty minutes. It takes months to accumulate the changelog evidence, documentation depth, and hiring history that the engine weights most heavily.
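
A minimal sketch of the consistency flag this describes. The source names and the flagging rule are illustrative assumptions:

```python
# Easy-to-fake signals present without any hard-to-fake signal behind
# them triggers review. Source groupings here are assumptions.
EASY_TO_FAKE = {"marketing_page", "press_release", "feature_name"}
HARD_TO_FAKE = {"changelog", "technical_docs", "hiring_history"}

def consistency_flag(observed_sources: set[str]) -> bool:
    """Flag a dimension for manual review when its evidence comes only
    from easy-to-fake sources."""
    easy = observed_sources & EASY_TO_FAKE
    hard = observed_sources & HARD_TO_FAKE
    return bool(easy) and not hard

# An "AI-first strategy" press release with no supporting changelog,
# documentation, or hiring evidence gets flagged:
print(consistency_flag({"press_release", "marketing_page"}))  # True
print(consistency_flag({"press_release", "changelog"}))       # False
```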

What a score represents, and what it does not

A Dacard score represents the current state of observable evidence for AI-native maturity across the scored frameworks. It is not a judgment of the team's intelligence, ambition, or potential. It is not a prediction of commercial success. A company can be at the Building stage and be growing fast; a company can be at the Leading stage and be struggling with product-market fit.

The score is most useful as a relative and longitudinal measure. Relative: how does this company compare to others at a similar stage, in a similar category, with a similar team size? Longitudinal: how has this company's score changed over time, in which dimensions, and at what rate?

The dimensions that are below the company's composite average are the highest-leverage targets for improvement, because they are dragging the composite below where the team's strongest dimensions suggest it could be. These are not necessarily the most urgent business problems; they are the specific capability gaps where a focused prescription has the highest probability of producing a measurable score improvement at the next cycle.

Interpreting your score

When you receive a Dacard score, the most important number is not the composite. It is the spread: the distance between your highest-scoring dimensions and your lowest. A tight spread with a mid-range composite suggests an organization with consistent but undifferentiated capability. A wide spread suggests an uneven investment pattern, with some capabilities significantly ahead of others.
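
A minimal sketch of both readings: the spread described here, plus the below-composite-average targeting from the previous section. The dimension names and scores are invented for illustration:

```python
from statistics import mean

dimension_scores = {
    "data_architecture": 72,
    "evaluation_practice": 38,
    "ai_ux_patterns": 65,
    "model_governance": 41,
}

composite = mean(dimension_scores.values())
# Spread: distance between the strongest and weakest dimensions.
spread = max(dimension_scores.values()) - min(dimension_scores.values())
# Highest-leverage targets: dimensions below the composite average,
# weakest first.
targets = sorted(
    (name for name, s in dimension_scores.items() if s < composite),
    key=dimension_scores.get,
)

print(f"composite={composite:.0f}, spread={spread}")  # composite=54, spread=34
print(targets)  # ['evaluation_practice', 'model_governance']
```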

The Translation Gap is the second number to examine. If F1 (your team's maturity) significantly leads F3 (your product's AI-nativeness), you have a team capable of more than the product currently reflects, often a roadmap or resourcing constraint. If F3 leads F1, you have a product making architectural commitments the team is not yet fully equipped to maintain, a technical risk that typically does not surface until a key person leaves or a significant scaling event occurs.

The DAC-coach prescriptions are ordered by the expected impact-per-effort ratio for your specific score profile. They are not generic recommendations. They are targeted at the specific evidence gaps in your lowest-scoring dimensions, with enough specificity to translate directly into planning decisions, hiring criteria, or architectural choices. The prescription is only as good as the action it generates. The score is only as good as the next cycle that measures whether that action worked.
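
A minimal sketch of that ordering rule. The prescription fields and example values are illustrative assumptions, not DAC-coach's internal model:

```python
# Rank prescriptions by expected impact per unit of effort, highest first.
prescriptions = [
    {"action": "Stand up an AI output evaluation harness", "impact": 9, "effort": 3},
    {"action": "Publish model-change entries in the changelog", "impact": 5, "effort": 1},
    {"action": "Rework the data layer for feedback capture", "impact": 8, "effort": 8},
]

for p in sorted(prescriptions, key=lambda p: p["impact"] / p["effort"], reverse=True):
    print(f'{p["impact"] / p["effort"]:.1f}  {p["action"]}')
# 5.0  Publish model-change entries in the changelog
# 3.0  Stand up an AI output evaluation harness
# 1.0  Rework the data layer for feedback capture
```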


Darren Card

Founder, Dacard.ai

See your diagnostic

Free. No sign-up required. Results in 2 minutes.