AI-assisted RevOps automation is past the demo stage. The conversational layer (Claude with HubSpot MCP, the equivalent integrations into Salesforce, the in-app copilots) is now drafting net-new workflows from a natural-language brief, labelling cross-object associations, and reporting back "done." The easy class of work has been solved well enough to be boring. The harder class has not: it fails in small, accumulating ways that do not break a demo and do break a quarter-end report.
Why this matters now
Sixty-two percent of organisations are at least experimenting with AI agents, per McKinsey's 2025 state-of-AI survey, and a meaningful share of that experimentation is happening inside the operating layer of GTM: workflow construction, association labelling, lifecycle automation. The trust question is not whether the assistant is generally capable. The trust question is whether the work it ships is the work you would have shipped, and whether you can tell the difference. The same report notes that most organisations have not yet scaled AI across the enterprise; the gap between pilot and production is exactly where the failure modes below live.
What ships well, and what this article is about
On the easy class of work (bulk property rename, property-group reorganisation, workflow pause), the conversational layer is excellent. Those operations are declarative, idempotent, and trivial to diff. We covered that pattern in an earlier piece on MCP plus Claude inside RevOps; the conclusion held.
This article is the failure catalog for the harder class: net-new workflow construction from a natural-language brief, and cross-object association labelling. This is the work where the assistant makes small judgement calls that the prompt under-specifies, and where a small wrong answer compounds into a quarter of bad data.
Failure mode one, the silent invariant violation
Ask the assistant to build a renewal workflow that fires sixty days before the contract end date. It builds the workflow. It enrols on a date-based trigger. The brief says "renewal date." Your instance has two properties with that human-readable name: the live one, whose internal name is renewal_date with an underscore, and a deprecated one that is empty for ninety percent of records. The assistant picked the one that matched the natural-language phrase in your brief, and that was the deprecated one. The workflow ships. It does not fire for ninety percent of contracts. The chat log says "workflow created, enrolled on Renewal Date." The system says the same thing. Both are technically true. Neither tells you the workflow is dead on arrival.
The assistant matched the surface form of your prompt against the surface form of property names. It did not check which property is actually populated. The verification protocol has to.
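A minimal sketch of that populated-property check, assuming you can export a sample of records as plain dictionaries. The internal name renewal_date_legacy for the deprecated property, the sample data, and the fifty percent threshold are all hypothetical; substitute whatever your instance actually uses.

```python
from typing import Dict, Iterable, List, Mapping, Optional


def fill_rate(records: Iterable[Mapping[str, Optional[str]]], prop: str) -> float:
    """Fraction of records in which `prop` holds a non-empty value."""
    rows = list(records)
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(prop) not in (None, ""))
    return populated / len(rows)


def pick_populated(
    records: Iterable[Mapping[str, Optional[str]]],
    candidates: List[str],
    min_fill: float = 0.5,  # hypothetical threshold, tune per instance
) -> Dict[str, float]:
    """Return only the candidate properties whose fill rate meets the threshold."""
    rows = list(records)
    rates = {name: fill_rate(rows, name) for name in candidates}
    return {name: rate for name, rate in rates.items() if rate >= min_fill}


# Hypothetical export: ten records, the deprecated namesake filled on only one.
sample = [
    {
        "renewal_date": "2026-01-31",
        "renewal_date_legacy": "2026-01-31" if i == 0 else "",
    }
    for i in range(10)
]

safe_to_use = pick_populated(sample, ["renewal_date", "renewal_date_legacy"])
```

On this sample, only renewal_date survives the filter. The point is the comparison, not the threshold: two properties with the same label and very different fill rates are the signal to stop and read the system.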
Failure mode two, the partial completion that reports success
Net-new workflows in HubSpot frequently chain three to seven actions: branch on a property value, set a field, send an internal notification, create a task on the associated company, update a custom property on the associated subscription object. The assistant builds the first four actions cleanly. On the fifth (a custom action type it has not seen recently, or an association it has to label across two objects), it produces a half-built action with a missing required field, or it skips the action entirely and reports the workflow as complete.
This is the failure mode that shows up most often in the work we audit. The chat says the workflow has five actions. The system shows four valid actions and one with an unconfigured association. The workflow runs. The first four actions fire. The fifth silently fails on every record. Nobody notices until the quarterly review pulls the report that depended on the fifth action having fired.
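One way to make that audit mechanical, sketched under assumptions: the workflow's actions have been copied out of the system into a list of dictionaries, and you maintain your own table of which fields each action type needs. The action type names, the config layout, and REQUIRED_FIELDS are hypothetical and are not HubSpot's schema.

```python
from typing import Dict, List

# Hypothetical table: the fields each action type needs before it can run.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "branch": ["property", "value"],
    "set_property": ["property", "value"],
    "notify": ["recipient"],
    "create_task": ["association_label", "title"],
    "update_associated": ["object_type", "association_label", "property", "value"],
}


def audit_actions(actions: List[dict], claimed_count: int) -> List[str]:
    """Compare what the chat claimed against what the system holds."""
    problems = []
    if len(actions) != claimed_count:
        problems.append(
            f"chat claimed {claimed_count} actions, system shows {len(actions)}"
        )
    for position, action in enumerate(actions, start=1):
        required = REQUIRED_FIELDS.get(action["type"])
        if required is None:
            problems.append(f"action {position}: unknown type {action['type']!r}")
            continue
        config = action.get("config", {})
        missing = [name for name in required if config.get(name) in (None, "")]
        if missing:
            problems.append(
                f"action {position} ({action['type']}): missing {', '.join(missing)}"
            )
    return problems


# Hypothetical five-action workflow; the fifth lacks its association label.
actions = [
    {"type": "branch", "config": {"property": "deal_type", "value": "renewal"}},
    {"type": "set_property", "config": {"property": "renewal_stage", "value": "open"}},
    {"type": "notify", "config": {"recipient": "cs-team"}},
    {"type": "create_task", "config": {"association_label": "primary", "title": "Renewal call"}},
    {"type": "update_associated", "config": {"object_type": "subscription", "property": "renewal_flag", "value": "true"}},
]

problems = audit_actions(actions, claimed_count=5)
```

Here the count matches the chat's claim, and the audit still flags action five. That is the case a count-only check would miss.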
Failure mode three, the cross-object association labelled three different ways
Cross-object association labelling is the single failure mode that most reliably embarrasses an AI-built workflow at scale. A deal-to-company association can carry a label (primary, partner, parent, signing entity), and that label is what downstream reporting filters on. When a workflow is built across three branches (new logo, expansion, renewal), the assistant will sometimes label the association primary in branch one, leave it unlabelled in branch two, and use the system default in branch three. The branches all work. The reporting that filters on association label sees only the deals from one branch out of three.
The assistant has no global view of how association labels are used downstream. It treats each branch as an independent construction problem. The verification has to be cross-branch.
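The cross-branch comparison can be sketched in a few lines, assuming you have noted the association label each branch applies while reading the workflow in the system. The branch names and the "system_default" marker below are hypothetical placeholders; None stands for an unlabelled association.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_label(branch_labels: Dict[str, Optional[str]]) -> Dict[Optional[str], List[str]]:
    """Group branch names by the association label each branch applies."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for branch, label in branch_labels.items():
        groups[label].append(branch)
    return dict(groups)


def labels_consistent(branch_labels: Dict[str, Optional[str]]) -> bool:
    """True only when every branch applies the same, non-empty label."""
    groups = group_by_label(branch_labels)
    return len(groups) == 1 and None not in groups


# Hypothetical reading of a three-branch workflow.
observed = {
    "new_logo": "primary",
    "expansion": None,            # left unlabelled
    "renewal": "system_default",  # placeholder for whatever the default is
}

groups = group_by_label(observed)
ok = labels_consistent(observed)
```

Grouping first, rather than returning a bare true or false, shows which branches disagree and therefore which ones to open in the editor.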
Failure mode four, the enrolment criteria that work on demo data and break on live traffic
Enrolment criteria are the single point where AI-built workflows most reliably look correct in a sandbox and fail in production. The brief asks for "all open opportunities owned by the AE team in DACH." The assistant writes the criteria against a team property that exists in the test sandbox but not in production, or against an owner-team association that has been migrated to a HubSpot team in production but not in sandbox. The criteria validate. The workflow runs. The enrolment count is wrong by an order of magnitude.
Demo data is clean, small, and recent. Live traffic is dirty, large, and includes records created before the current schema. Every enrolment criterion has to be re-tested against a sample of production records before it is allowed to fire.
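A sketch of that re-test, assuming two exports from production: the record IDs the workflow's enrolment preview returns, and a sample of records you can filter by hand. The field names (is_open, owner_team, region) and the IDs are hypothetical, chosen to mirror the DACH brief above.

```python
from typing import Dict, Iterable, List


def hand_built_filter(record: dict) -> bool:
    """The brief, restated by hand: open, owned by the AE team, in DACH."""
    return (
        record["is_open"]
        and record["owner_team"] == "AE"
        and record["region"] == "DACH"
    )


def enrolment_diff(previewed_ids: Iterable[int], expected_ids: Iterable[int]) -> Dict[str, List[int]]:
    """Which records the workflow would miss, and which it would wrongly enrol."""
    previewed, expected = set(previewed_ids), set(expected_ids)
    return {
        "missing": sorted(expected - previewed),
        "unexpected": sorted(previewed - expected),
    }


# Hypothetical production sample.
records = [
    {"id": 1, "is_open": True, "owner_team": "AE", "region": "DACH"},
    {"id": 2, "is_open": True, "owner_team": "AE", "region": "DACH"},
    {"id": 3, "is_open": False, "owner_team": "AE", "region": "DACH"},
    {"id": 4, "is_open": True, "owner_team": "SDR", "region": "DACH"},
    {"id": 5, "is_open": True, "owner_team": "AE", "region": "Nordics"},
]

expected_ids = [r["id"] for r in records if hand_built_filter(r)]
previewed_ids = [1]  # hypothetical: what the workflow's preview reported

diff = enrolment_diff(previewed_ids, expected_ids)
```

Comparing IDs rather than bare counts is a deliberate choice: two wrong criteria can produce the same count while enrolling different records.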
Pattern from the field
A B2B SaaS revenue team in EMEA used Claude with HubSpot MCP to ship a renewal workflow across roughly four hundred active subscription records. The brief was clear; the chat conversation was clean; the workflow was reported complete. Two weeks in, the customer success lead noticed the workflow was firing on about a quarter of the records it should have. Three small failures had compounded: the property-name mismatch from failure mode one, an unlabelled deal-to-company association on the expansion branch from failure mode three, and an enrolment criterion that referenced a deprecated team object. None of the three would have been caught by reading the chat log. All three were obvious within ten minutes of reading the system state: opening the workflow editor, opening the enrolment-criteria preview, opening the association configuration on a sample record. The fix was an hour of work. The two weeks of bad enrolment was the cost of trusting the chat instead of the system.
Resolution, the verification and recovery playbook
For any team using AI-assisted workflow construction in production:
- Verify in the system, not in the chat. The chat log says what the assistant intended. The system shows what it actually did. After every AI-built workflow, open the workflow editor, the enrolment-criteria preview, and the association configuration on at least one sample record. Read the system state. Trust the system.
- Pin property names to populated properties. Before any AI-built workflow ships, confirm every property reference is the populated version, not a deprecated namesake. The diagnostic question: who has read this property in the last ninety days, and what decision did they make from it?
- Walk every action, end to end. Open every action in the workflow. Confirm each one is fully configured, not partially. Treat "workflow created" in the chat as an unverified claim until every action is read in the system.
- Cross-branch the association labels. If a workflow has multiple branches, confirm the association label is identical across every branch. Mismatched labels are the most reliable source of silent under-reporting.
- Re-test enrolment against production. Run the enrolment-criteria preview against live records, not sandbox records. Compare the count against a hand-built filter on the same criteria. If the counts disagree, the criteria are wrong.
- Roll back at the workflow level. When a failure is caught, do not roll back the database. Pause the workflow, fix the configuration, re-enrol the affected records. Workflow-level rollback is recoverable; database rollback is a separate, larger project.
- Log the failure mode. Every silent failure caught becomes a verification check that runs automatically next time.
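The last item, turning each caught failure into a check that runs next time, can be sketched as a small registry. Everything here is an assumption for illustration: the snapshot dictionary is a hypothetical hand-assembled summary of one workflow's system state, and the three checks simply restate the playbook items above.

```python
from typing import Callable, Dict, List

Check = Callable[[dict], bool]
CHECKS: Dict[str, Check] = {}


def register(name: str) -> Callable[[Check], Check]:
    """Add a check to the registry under a human-readable name."""
    def decorator(fn: Check) -> Check:
        CHECKS[name] = fn
        return fn
    return decorator


@register("every referenced property is populated")
def properties_populated(snapshot: dict) -> bool:
    # 0.5 is a hypothetical threshold.
    return all(rate >= 0.5 for rate in snapshot["property_fill_rates"].values())


@register("every action is fully configured")
def actions_complete(snapshot: dict) -> bool:
    return not snapshot["incomplete_actions"]


@register("association label identical across branches")
def labels_identical(snapshot: dict) -> bool:
    labels = set(snapshot["branch_labels"].values())
    return len(labels) == 1 and None not in labels


def run_checks(snapshot: dict) -> List[str]:
    """Return the names of the checks that failed."""
    return [name for name, fn in CHECKS.items() if not fn(snapshot)]


# Hypothetical snapshot of one AI-built workflow, read from the system.
snapshot = {
    "property_fill_rates": {"renewal_date": 1.0},
    "incomplete_actions": [5],
    "branch_labels": {"new_logo": "primary", "expansion": "primary", "renewal": "primary"},
}

failed = run_checks(snapshot)
```

When a new silent failure turns up in an audit, it becomes one more registered function, so the catalog grows with the team's experience rather than living in someone's memory.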
The conversational layer ships most of a workflow correctly. The remainder is the failure surface, and it is the part that costs a quarter of clean reporting if it goes unread. The verification protocol is the work.
Where Checkpoint comes in
We run AI-assisted workflow construction inside live HubSpot instances every week, and the verification protocol above is what we use on our own engagements before we hand work back to a client team. If you are scaling AI inside the operating layer of your GTM motion, the work is not the prompts; it is the catalog of failure modes you have already seen and the protocol that catches them before they ship. Talk to us about AI and automation, AI GTM consulting, the broader revenue operations work that makes AI-built automation safe to deploy, or the HubSpot implementation foundation that any of this rests on.
Sources
- "The state of AI in 2025: Agents, innovation, and transformation." McKinsey & Company, November 2025. mckinsey.com
- "Can AI Agents Be Trusted?" Harvard Business Review, May 26, 2025. hbr.org
