AI Training Week 3: Tool Deep-Dive (Copilot/ChatGPT/Claude Lab)

5/8/2026 · 9 min read

TL;DR

  • Week 3 is a head-to-head lab — same prompts, three tools (Copilot, ChatGPT, Claude), real outputs scored.
  • The deliverable is a per-role recommendation, not a company-wide standard.
  • The point isn't choosing a winner; it's teaching your team to choose between tools by themselves.

The single biggest mistake I see SMB owners make in week 3 is letting IT pick the AI tool. By the time IT picks, the use cases the team committed to in week 2 have already cooled, and the tool ends up shaped to procurement preferences instead of work.

Why a tool deep-dive belongs in week 3 specifically

Earlier than week 3, the team has no real use cases to benchmark against — you'd be testing tools on toy prompts. Later than week 3, the team has built habits with whatever tool they grabbed first, and switching cost rises sharply.

Stanford's "77% rule" is the finding that most AI work inside organizations is invisible: shadow, unofficial, running on personal accounts. By week 3 your employees are very likely already using one tool unofficially. A structured lab is the lowest-friction way to surface that reality and convert it into informed standardization. The alternative — IT issuing a memo — produces compliance theater and continued shadow use. About 46% of employees in recent surveys admit to having pasted confidential data into public AI tools. Week 3 is your chance to fix that with consent, not threats.

Definition: Tool deep-dive — a structured side-by-side test of 2-4 AI tools against the same prompts, scored on output quality, latency, integration fit, and risk for your specific workflows.

What week 3 should actually contain

The structure that works:

  1. Monday — 60-minute lab kickoff. Recap week 2 backlog. Hand out the lab packet (prompts + scoring rubric).
  2. Tuesday/Wednesday — async lab time (90 min total per person). Run the same five prompts from the lab packet in Copilot, ChatGPT, and Claude. Capture outputs.
  3. Thursday — 90-minute role debrief. Each role-track compares notes and picks a primary tool plus a secondary fallback.
  4. Friday — 45-minute company readout. Champions present per-role recommendations. Founder approves or vetoes.

Three tools is the right number. Two is too narrow; four turns the week into a tournament instead of a deep-dive.

Definition: Scoring rubric — a fixed set of axes (e.g., output quality, ease of revision, integration with existing systems, data-handling fit) used to compare tools without arguing about feel.

The lab packet (copy/paste)

This is the rubric I hand to founders running week 3 themselves.

For each prompt, run it in Copilot, ChatGPT, and Claude.
Score 1-5 on each axis. Note one sentence on the "why."

Axis 1 — Output quality (1=unusable, 5=ship-ready first try)
Axis 2 — Ease of revision (1=fight every edit, 5=accepts pushback well)
Axis 3 — Integration fit (1=copy-paste only, 5=lives inside my workflow)
Axis 4 — Data-handling fit (1=can't use this for my work, 5=fully sanctioned)
Axis 5 — Speed (1=over 30s, 5=under 5s)

Total = sum of axes. Higher = better fit FOR THIS PROMPT, FOR YOUR ROLE.
Different role = different winner. That's the point.

Five prompts per role-track. Champions write the prompts ahead of time, drawn from the week-2 committed use cases. Don't let employees write the prompts in the moment — that introduces too much noise into the comparison.
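
If your champions capture the scores in a shared sheet or CSV, a few lines of Python are enough to tally the packet per role-track. This is a minimal sketch, assuming the five-axis 1-5 rubric above; the tool names match the lab, but the prompt keys and sample scores are hypothetical placeholders, not real lab results.

# Tally week-3 rubric scores for one role-track.
# Assumes the five-axis, 1-5 rubric from the lab packet above.
# Sample scores below are hypothetical placeholders, not real results.
from collections import defaultdict

AXES = ["output_quality", "ease_of_revision", "integration_fit",
        "data_handling_fit", "speed"]

# One entry per (prompt, tool): the five axis scores, in AXES order.
scores = {
    ("variance_summary", "Copilot"): [4, 3, 5, 5, 4],
    ("variance_summary", "ChatGPT"): [4, 4, 2, 3, 5],
    ("variance_summary", "Claude"):  [5, 5, 2, 3, 4],
    # ...one entry per prompt x tool, 5 prompts x 3 tools in total
}

totals = defaultdict(int)
for (prompt, tool), axis_scores in scores.items():
    assert len(axis_scores) == len(AXES)  # every row scores all five axes
    totals[tool] += sum(axis_scores)

for tool, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{tool}: {total}")
# Highest total = best fit for this role-track's prompts, not a company-wide verdict.

Tally per role-track, never across the company — pooling the scores is exactly how you end up back at a single-tool standard.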

Tool tip (Course for Business): The reason this works is that "Augment, don't replace" lives at the prompt level, not the tool level. A "good" tool for marketing copy may be a terrible tool for variance analysis. The point of week 3 isn't to crown one tool — it's to give each role-track the judgment to pick the right tool per task. The 6-week program at https://course.aiadvisoryboard.me/business runs week 3 as five role-prompts × three tools per role, exactly because no single answer works company-wide. (Course for Business)

What the deep-dive usually surfaces

Patterns I see repeat across cohorts of 30-500-employee companies:

  • Copilot wins when the use case is anchored in Office files (Excel variance, Word redlines, Outlook reply drafts) — the integration depth beats raw output quality.
  • ChatGPT wins for general drafting, brainstorming, and one-off research — broad reach, big tool ecosystem, fast iteration.
  • Claude wins for long-document synthesis, careful reasoning, and customer-facing copy where tone matters — its outputs typically need fewer edit passes.

These patterns are not laws. They're starting hypotheses that your week-3 lab will either confirm or break.

Good vs bad week-3 outcomes

Bad outcome: "We picked ChatGPT Enterprise because IT had a contract." Good outcome: "Sales uses ChatGPT for outbound, Finance uses Copilot for variance, CS uses Claude for first-response drafts. Champions own the rationale per role."

Bad outcome: "60-page tooling assessment with no decisions." Good outcome: "Two-page memo per role-track: primary tool, secondary fallback, three example prompts, named owner."

The good versions ship a decision per role, not a standard for the company.

Team scan (what AI champions report after week 3)

  • Most cohorts find no single tool wins all five role-tracks; typically three or four tracks split between two tools, with one outlier track picking a third.
  • The biggest surprise is usually how well Copilot performs on Office-anchored workflows that ChatGPT does worse on without plugins.
  • Privacy/data-handling kills more candidate tools in week 3 than output quality — if a tool isn't sanctioned for your data, it's out.
  • Long-document tasks (50+ pages) are where Claude pulls visibly ahead in most cohorts.
  • Latency matters less than people predicted before the lab — only one role-track typically picks tool A over tool B because of speed.
  • Integration fit (does it live inside the existing workflow?) ends up being the single highest-weighted axis after the lab, regardless of how you weight it on Monday.
  • Champions report that the lab format converts shadow-AI users into champions of a sanctioned tool faster than any policy memo.
  • About 1 in 4 employees switch their preferred tool after the lab — usually because they discover their workflow is Office-anchored and they'd been using the wrong one.
  • The lab also surfaces 1-2 tools no one had considered (Perplexity for research, Gemini for in-Workspace tasks) — note them but don't add to scope.
  • Cost per seat almost never decides; output quality and integration win.

Micro-case (what changes after 7-14 days)

A 240-person professional-services firm I advised ran week 3 across five role-tracks. They came in assuming "we'll standardize on ChatGPT Enterprise." The lab broke that assumption: Finance and HR scored Copilot ahead by a wide margin for their workflows; Marketing and CS picked Claude for tone-sensitive drafts; Sales kept ChatGPT for outbound speed. Total contracts ended up being three (ChatGPT + Copilot + Claude) instead of one — but each role-track had a clear champion-owned rationale, and shadow tool use dropped within a fortnight as employees moved onto sanctioned tools. By day 14, the head of IT — who had pushed for the single-tool standard — admitted the per-role split was the right call. Compare that to a peer firm that imposed a single-tool standard top-down: 7 months later, internal surveys showed about 40% of employees still pasting work into a different tool unofficially.

Note on this case: This example is illustrative — based on typical patterns we observe with companies of 30-500 employees, not a single named client. Specific numbers are rounded approximations of common ranges, not guarantees.

Tool tip (Course for Business): The most underused part of week 3 is the Shoulder-to-Shoulder debrief — pairing one employee who scored Tool A highest with one who scored Tool B highest, and watching them defend their choice. Five minutes of that debate transfers more judgment than any tooling matrix. The 6-week program at https://course.aiadvisoryboard.me/business builds this debate into Thursday's role-track session, by design. (Course for Business)

FAQ

Should we add Gemini, Perplexity, or specialized tools? Note them, don't add them. Three tools is what fits in a week without diluting the lab. Specialized tools (Perplexity for research, Harvey for legal) belong to a focused side-track in week 5 or 6, not week 3.

What if our IT has already signed a contract with one provider? Run the lab anyway. The output is per-role recommendations — if 4 of 5 role-tracks pick the contracted tool, the contract is validated. If 4 of 5 pick something else, you have leverage and evidence to renegotiate.

Do we need formal MSAs/DPAs before letting employees test? For data-handling-axis testing, yes — only sanctioned tools should see real data. For non-sensitive prompts, sandboxed accounts are fine. Champions enforce this in the lab packet.

What if a role-track can't agree on a winner? If the score gap is under 10% on the rubric, pick by integration fit (axis 3). That's the axis that compounds across weeks 4-6; a sketch of the rule follows below. (We separately have an advisory product for day-to-day management of the rollout, but we'll address that elsewhere.)
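
A minimal sketch of that tie-break, reusing the per-tool totals from the tally sketch earlier; the 10% threshold and the integration-fit fallback come straight from this answer, while the function name and sample numbers are hypothetical.

# Tie-break sketch: totals within 10% fall back to integration fit (axis 3).
def pick_primary(totals, integration_fit):
    # totals: {tool: rubric total}; integration_fit: {tool: summed axis-3 score}
    ranked = sorted(totals, key=totals.get, reverse=True)
    first, second = ranked[0], ranked[1]
    gap = (totals[first] - totals[second]) / totals[first]
    if gap < 0.10:  # under a 10% gap, the rubric total is effectively a tie
        return max((first, second), key=integration_fit.get)
    return first

print(pick_primary({"Copilot": 92, "ChatGPT": 88, "Claude": 80},
                   {"Copilot": 22, "ChatGPT": 14, "Claude": 10}))
# 92 vs 88 is a ~4% gap, so integration fit decides: prints "Copilot"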

What's the failure mode of week 3? Letting the lab become a tooling debate instead of a use-case decision. Champions must redirect: "what did your prompt produce, in your tool, on your workflow?" — not "which tool is best in general."

Conclusion

Week 3 is the calibration week. Your team learns that the right AI tool is a function of the role, the workflow, and the data — not a company-wide standard. Five prompts, three tools, scored honestly. Champions own the per-role memo. The output is judgment, not procurement.

Next step: lock the lab packet (5 prompts × 5 axes per role-track) before Monday's kickoff.

If you want every employee to ship their first AI automation in five days — book a 30-min call and we'll map your team's first week: https://course.aiadvisoryboard.me/business
