⚡ Promptolis Original · Productivity & Systems

🧠 Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use

Prompt engineering grounded in Wei 2022 CoT, Brown 2020 few-shot learning, and 2024-2026 production LLM patterns from Anthropic, OpenAI, and Google docs.

⏱️ 7 min to try 🤖 ~60 seconds per prompt design 🗓️ Updated 2026-04-23

Why this is epic

Most online advice on productivity & systems is generic, 2021-era, or AI-generated slop recycled from 5-year-old blog posts. This pack is built on Wei 2022 CoT, Brown 2020 few-shot learning, and 2024-2026 production LLM patterns from Anthropic, OpenAI, and Google docs.

6 categories × 5 prompts, each calibrated to real practitioner workflows — not cheat-sheet templates. Research-backed citations throughout. Tool-agnostic (Claude / ChatGPT / Gemini).

AI-Guided Session Mode: tell the AI your specific situation (role + stage + challenge) and it selects 1-3 prompts matched to you, then guides a full working session with your actual inputs.

The prompt

Promptolis Original · Copy-ready
<role> You are a prompt engineering specialist trained on what actually works in 2026 production LLM usage: Anthropic's Claude prompt engineering guide (2024-2026), OpenAI's GPT-5 best practices, Google's Gemini system instruction patterns, the research papers that established the field (Wei et al. 2022 on Chain-of-Thought, Brown et al. 2020 on few-shot learning, Kojima et al. 2022 on zero-shot CoT), the XML-structuring pattern from DSPy + LangChain production systems, and the operator-mode practices from large-scale LLM deployments (Cursor, Replit, GitHub Copilot prompt engineering patterns). You distinguish 'prompt engineering theater' (throwing adjectives, role-play, 'act as if you are the world's best expert') from 'prompt engineering that works' (structured input schemas, explicit reasoning chains, few-shot examples, output-format constraints, evaluation rigor). You know the failure modes. Prompts fail because: (1) they're too vague — the LLM pattern-matches to generic responses, (2) they mix roles, context, constraints, and examples into one soup, (3) they omit the output format — LLM guesses what format you want, (4) they don't include negative examples — LLM assumes anything unrejected is fine, (5) they're tested once and declared 'working' when they worked by luck. You're platform-aware. Claude responds better to XML-structured prompts than GPT-5. GPT-5 responds better to explicit reasoning requests than Claude's more autonomous thinking. Gemini has different context window and tool-use patterns. Generic 'prompt engineering' advice that ignores platform differences is 2022-era. </role> <principles> 1. Structure beats adjectives. An XML-tagged prompt with clear input/output sections outperforms 'Act as the world's best expert on X' by orders of magnitude on measurable tasks. 2. Chain-of-Thought (Wei et al. 2022) improves performance on reasoning tasks. For multi-step problems, explicitly ask for step-by-step reasoning before the answer. 
For simple extractive tasks, it just adds tokens. 3. Few-shot examples (Brown et al. 2020) are the single most reliable improvement for complex tasks. 2-5 examples of the desired input→output pattern beats 500 words of abstract description. 4. Output format must be specified. If you want JSON, say JSON and give the schema. If you want a specific structure, give the template. LLMs default to prose; structured output requires explicit request. 5. Negative examples matter. 'Don't do X' examples teach boundaries that positive examples alone don't. Especially for safety/compliance use cases. 6. Test with adversarial inputs. Your prompt 'works' when it handles: empty input, off-topic input, malicious input (prompt injection), edge cases in the domain. Testing with only good inputs is shipping code you haven't debugged. 7. Evaluation rigor separates prompt engineers from prompt dabblers. Build an eval set of 20-50 representative cases. Run prompt variants against the eval set. Pick based on accuracy, not aesthetic preference. 8. Role prompting has limits. 'You are an expert lawyer' does NOT make the LLM a lawyer. It mildly shifts style toward legal register. Don't mistake role prompting for capability change. 9. Platform matters. Claude loves structured XML. GPT-5 responds well to explicit task decomposition. Gemini has unique tool-use patterns. Generic prompts that work across all platforms are usually the LEAST optimized for any of them. 10. Production prompt engineering is iterative. V1 → eval → V2 → eval → V3. Ten iterations beats ten-minute one-shot 'crafting.' 
</principles> <input> <experience-level>{beginner / intermediate / advanced (shipped production LLM features)}</experience-level> <use-case>{what the prompt is for — classification / generation / extraction / reasoning / creative / code}</use-case> <platform>{Claude / GPT-5 / Gemini / multiple / unsure}</platform> <specific-challenge>{current prompt not working / building from scratch / debugging failure / scaling to production}</specific-challenge> <current-prompt>{paste current prompt if you have one}</current-prompt> <example-inputs-outputs>{if available, 1-3 examples of input and desired output}</example-inputs-outputs> </input> <output-format> # Your Prompt Engineering Session — Structure-First Analysis ## Diagnosis [What's wrong with the current approach. What category of failure this is (vague / unstructured / missing examples / missing format / testing-by-luck).] ## The Structural Fix [Explicit prompt structure using XML tagging. Role, principles, input schema, output format, examples, safety/boundaries.] ## Few-Shot Examples [2-3 concrete input → output examples to embed in the prompt.] ## Platform Calibration [What to adjust for Claude vs GPT-5 vs Gemini specifically.] ## Evaluation Plan [20-30 test cases to build. How to run A/B prompt variants.] ## Iteration Path [What to test next after V1 deploys. What failure modes to expect.] ## Red Flags [What to stop doing. 'Act as expert' adjective stacking. Skipping evals. Shipping on first 3 successes.] </output-format> <auto-intake> If missing: (1) Use-case — classification vs generation vs reasoning require different prompt patterns. (2) Platform — Claude vs GPT-5 vs Gemini. (3) Current prompt OR specific challenge — 'improve my prompt' is not enough. Then analyze. </auto-intake> Analyze.

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
Experience level: Intermediate — shipped 2 internal tools using OpenAI, never built a formal eval
Use-case: Extraction — pulling structured data from customer support emails (customer name, product mentioned, issue type, sentiment, urgency 1-5, requested resolution)
Platform: GPT-5 (using gpt-5-mini for cost)
Specific challenge: The prompt works ~80% of the time but fails unpredictably. Sometimes it returns wrong urgency ratings, sometimes misses the product mentioned, sometimes returns plain text instead of JSON. Customer-facing feature — needs 95%+ reliability.
Current prompt: 'You are a helpful customer support analyst. Extract the following from this email: customer name, product mentioned, issue type, sentiment, urgency rating 1-5, and what resolution they want. Return as JSON.'
Example inputs/outputs: Available on request

How do I get this to 95%+ reliability?
🤖 Output

Your Prompt Engineering Session — Structure-First Analysis

Diagnosis

Your current prompt fails predictably in five specific ways:

1. No input schema — the prompt doesn't tell the LLM which 'issue type' categories are valid. 'Billing / technical support / return request' is one categorization; 'urgent / complaint / inquiry' is another. The LLM picks one arbitrarily depending on the email.

2. No output schema — 'Return as JSON' is not a schema. The LLM guesses field names, types, and structure. Some responses return `urgency_rating` as a number (correct), some as string ('3'), some as ordinal ('high'), some as scale ('3/5').

3. No urgency definition — 'urgency 1-5' means different things to different LLMs without calibration. A 'wait 3 weeks for response' issue might be rated 1 or 3 depending on the run.

4. No examples — zero-shot extraction on a 6-field structured output is notoriously unreliable. Few-shot examples (just 2-3) typically move accuracy from ~80% to ~95% on extraction tasks.

5. No validation step — the LLM isn't asked to verify it extracted all fields. Partial extractions slip through as '80% correct.'

This isn't a 'GPT-5 is inconsistent' problem — it's a prompt design problem. The fix is structural.

The Structural Fix

Replace the current prompt with this XML-structured version:

<role>
You are a customer support email extraction system. You extract structured data from customer emails and return it in strict JSON format matching the provided schema. You do not generate prose, commentary, or explanations — only the JSON object.
</role>

<task>
Extract six fields from the customer email below. Return a JSON object exactly matching the schema provided. If a field is unclear or missing from the email, use 'unknown' (for strings) or null (for numbers) — do not guess.
</task>

<output_schema>
{
  "customer_name": "string — as found in email signature or 'unknown'",
  "product_mentioned": "string — specific product name or SKU, or 'unknown' if none mentioned",
  "issue_type": "one of: 'billing_issue' | 'technical_support' | 'return_request' | 'product_inquiry' | 'account_access' | 'other'",
  "sentiment": "one of: 'positive' | 'neutral' | 'negative' | 'very_negative'",
  "urgency": "integer 1-5 based on calibration below",
  "requested_resolution": "string — concise description of what customer wants, or 'unspecified'"
}
</output_schema>

<urgency_calibration>
1 = Informational inquiry, no urgency. Example: 'Wondering about feature X.'
2 = Standard request, within normal SLA. Example: 'Please update my shipping address.'
3 = Time-sensitive but not blocking. Example: 'Order confirmation would help before weekend.'
4 = Urgent — active impact on customer. Example: 'Your service is down, my team can't work.'
5 = Critical — escalation or compliance concern. Example: 'I'm reporting this to my bank/BBB/attorney.'
</urgency_calibration>

<examples>
<example_1>
<email>
Hi team, this is John Martinez (order #A4821). I bought the Pro subscription last week and was charged $99 but the features aren't showing up in my dashboard. Can someone fix this by Friday? I have a client demo Monday.
- John
</email>
<output>
{
  "customer_name": "John Martinez",
  "product_mentioned": "Pro subscription",
  "issue_type": "billing_issue",
  "sentiment": "negative",
  "urgency": 3,
  "requested_resolution": "Activate Pro subscription features by Friday for Monday client demo"
}
</output>
</example_1>

<example_2>
<email>
Your product is absolutely broken and has been for 3 days. My entire sales team can't access their accounts. We pay $2000/month for this and I'm about to cancel and report this to my credit card. Respond immediately.
</email>
<output>
{
  "customer_name": "unknown",
  "product_mentioned": "unknown",
  "issue_type": "account_access",
  "sentiment": "very_negative",
  "urgency": 5,
  "requested_resolution": "Immediate restoration of team account access; retention at risk"
}
</output>
</example_2>
</examples>

<email>
[insert customer email here]
</email>

<output>

Key structural changes:

  • Explicit role as an extraction system (not a 'helpful analyst')
  • Output schema with enum constraints on categorical fields
  • Urgency calibration with examples (anchors rating to meaning)
  • Two few-shot examples covering easy and hard cases
  • Explicit 'unknown/null for missing fields' rule (prevents hallucination)
  • Asks for JSON object only (no prose)
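The 'unknown/null for missing fields' rule and the enum constraints can also be enforced client-side, so malformed extractions never reach your downstream code. A minimal validator sketch — field names match the schema above; the function name and error-string format are illustrative, not part of any library:

```python
# Client-side guard for the extraction schema defined in the prompt above.
ISSUE_TYPES = {"billing_issue", "technical_support", "return_request",
               "product_inquiry", "account_access", "other"}
SENTIMENTS = {"positive", "neutral", "negative", "very_negative"}

def validate_extraction(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    # Free-text fields must be strings ('unknown' / 'unspecified' are valid strings).
    for field in ("customer_name", "product_mentioned", "requested_resolution"):
        if not isinstance(record.get(field), str):
            errors.append(f"{field}: expected string")
    # Categorical fields must come from the enum — this catches invented categories.
    if record.get("issue_type") not in ISSUE_TYPES:
        errors.append("issue_type: not in allowed enum")
    if record.get("sentiment") not in SENTIMENTS:
        errors.append("sentiment: not in allowed enum")
    # Urgency is an integer 1-5, or null when the email gives no signal.
    urgency = record.get("urgency")
    if urgency is not None and not (isinstance(urgency, int) and 1 <= urgency <= 5):
        errors.append("urgency: expected integer 1-5 or null")
    return errors
```

A record that fails validation gets retried (or routed to a human) instead of silently corrupting your support queue.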

Few-Shot Examples (why these specifically)

The two examples I included handle different failure modes:

  • Example 1 (John Martinez): standard case with name + product + moderate urgency. Teaches the model the 'happy path' extraction.
  • Example 2 (anonymous rage email): no name, vague product, extreme sentiment, max urgency. Teaches the model to handle missing-field cases and calibrate urgency 5 correctly (mentions of 'reporting,' 'cancel,' 'immediate' are the signals).

For production, add 2-3 more examples covering: (a) neutral inquiry with no product mentioned, (b) billing error with dollar amount specified, (c) positive feedback with feature request (edge case: 'positive' sentiment doesn't mean 'issue_type = other' automatically).

Platform Calibration

You're on GPT-5-mini. Specific adjustments:

  • GPT-5 models generally respect JSON Mode (set response_format: {type: 'json_object'} in the API call). Use this in addition to prompt instruction — it enforces valid JSON output at the API level.
  • For structured extraction, GPT-5 Thinking > GPT-5 base > GPT-5-mini on accuracy. GPT-5-mini is 4x cheaper but may require more few-shot examples (5-7 instead of 2-3) to hit 95%.
  • Temperature 0 for extraction — you want deterministic outputs, not creative ones. Current GPT-5 API default is usually 1.0; explicitly set to 0.
  • If you need to migrate to Claude: switch the XML tags to Claude's preferred <...> style (you're already using them — good). Claude 3.5 Sonnet and Opus 4.7 both excel at XML-structured extraction. Often Claude + XML outperforms GPT-5 on extraction tasks for equivalent cost.
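The JSON Mode + temperature 0 combination looks like this in practice. A sketch assuming the OpenAI Python SDK's Chat Completions interface; the model name follows this article and should be swapped for whatever you actually deploy:

```python
# Sketch: enforcing JSON output and determinism at the API level,
# rather than relying on the prompt instruction alone.
def build_extraction_request(system_prompt: str, email_body: str) -> dict:
    """Assemble request kwargs for a deterministic JSON-only extraction call."""
    return {
        "model": "gpt-5-mini",                       # per the article; illustrative
        "temperature": 0,                            # deterministic, not creative
        "response_format": {"type": "json_object"},  # JSON Mode: valid JSON guaranteed
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": email_body},
        ],
    }
```

The kwargs would then be passed as `client.chat.completions.create(**build_extraction_request(prompt, email))`. JSON Mode guarantees syntactically valid JSON; it does not guarantee your schema, which is why the prompt-level schema and few-shot examples still matter.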

Evaluation Plan

Before shipping this to production, build an eval set:

1. Collect 30-50 real customer emails from your support queue covering:

- 5 easy cases (clear product, clear issue, normal urgency)

- 15 medium cases (some ambiguity in one or two fields)

- 10 hard cases (missing info, ambiguous sentiment, borderline urgency)

- 5 adversarial cases (trying to trick the system — prompt injection attempts, emails in multiple languages, unusual formatting)

2. Hand-label the correct output for each. This is your ground truth. 2-3 hours of work.

3. Run your prompt V1 against every case in the eval set. Measure:

- Exact match accuracy per field (customer_name exact match, issue_type exact match, etc.)

- Overall full-record accuracy (all 6 fields correct)

- Your target: 95%+ full-record accuracy

4. Make V2 based on failure patterns. Is sentiment consistently wrong? Add sentiment examples. Is urgency over-inflating? Tighten calibration rules.

5. A/B test V1 vs V2 on the eval set. Pick V2 only if it improves accuracy without degrading on any dimension.
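The two metrics in step 3 are a few lines of code once you have hand-labeled ground truth. A minimal scoring sketch — field names match the schema in this session; the function name is illustrative:

```python
# Scores a prompt variant against a hand-labeled eval set.
FIELDS = ["customer_name", "product_mentioned", "issue_type",
          "sentiment", "urgency", "requested_resolution"]

def score_eval_set(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """Exact-match accuracy per field, plus full-record accuracy (all fields correct)."""
    per_field = {f: 0 for f in FIELDS}
    full_record = 0
    for pred, truth in zip(predictions, ground_truth):
        record_ok = True
        for f in FIELDS:
            if pred.get(f) == truth.get(f):
                per_field[f] += 1
            else:
                record_ok = False
        full_record += record_ok
    n = len(ground_truth)
    return {
        "per_field": {f: per_field[f] / n for f in FIELDS},
        "full_record": full_record / n,
    }
```

Run V1 and V2 through the same function and compare: the per-field breakdown tells you *which* field to fix next, while full-record accuracy is the number you hold against the 95% target.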

Iteration Path

V1 (above): ship with JSON Mode + temperature 0. Expect ~90-93% full-record accuracy on the eval set.

V2 (add): chain-of-thought reasoning for urgency + sentiment.

<reasoning>
Before outputting the JSON, think through:
1. What is the customer's tone?
2. What signals of urgency are present (time pressure, escalation language, consequences)?
3. What's the specific product/feature mentioned?
Then output the JSON.
</reasoning>

Trades ~200 extra output tokens for typically 3-5% accuracy lift on reasoning-heavy fields.

V3 (if still needed): split into two-call architecture. Call 1 extracts name/product/issue_type (factual). Call 2 assesses sentiment/urgency (judgment) with call 1's output as context. 2x cost for usually 2-3% accuracy lift. Only if V2 isn't hitting 95%.

Red Flags

1. 'It works 80% of the time' without an eval set really means 'I've looked at 10 cases and they seemed fine.' Build the eval set FIRST — you'll be shocked at how wrong your intuition is.

2. The 'helpful customer support analyst' role framing in your original prompt. Role-adjective stacking of this kind doesn't improve accuracy; it just burns tokens.

3. Missing enum constraints on issue_type. 'Issue type' as free text means the LLM invents categories ('Customer Feedback', 'Urgent Issue', 'Support Request'). Enum constrains it.

4. Missing urgency calibration. '1-5' is ambiguous. Anchoring to specific example scenarios makes it consistent.

5. Shipping on subjective impression. 'Looks good to me' is not an eval. Structured accuracy measurement is.

Key Takeaways

  • Your 80% → 95% gap is a prompt design problem, not a model problem. Specifically: no schema, no calibration, no examples, no validation.
  • XML-structured prompts with explicit input/output schemas outperform adjective-heavy prose. Switch the structure.
  • Few-shot examples (2-3 for start, 5-7 for GPT-5-mini) are the biggest single improvement for extraction tasks. Add them.
  • Enum constraints on categorical fields prevent the LLM from inventing categories. issue_type as 'billing_issue | technical_support | ...' beats free-text 'issue_type'.
  • JSON Mode + temperature 0 for GPT-5 extraction. Enforces structured output at the API level.
  • Build an eval set of 30-50 real cases with hand-labeled ground truth. Measure full-record accuracy, target 95%+.
  • Iterate V1 → V2 based on failure patterns, not intuition. Usually 2-3 iteration cycles hit 95%+ from a solid V1.
  • Platform-calibrate: if GPT-5-mini can't hit 95% after 3 iterations, upgrade to GPT-5 base or try Claude 3.5 Sonnet. Sometimes the model is the limit.

Common use cases

  • Professionals who need structured thinking on this topic, not vague advice
  • Practitioners making specific decisions with real stakes
  • Anyone tired of generic AI responses to domain-specific questions
  • Users wanting depth over breadth — one thing done well, not 10 things done poorly
  • Teams adopting AI tooling for a specific workflow area
  • Consultants or coaches building repeatable processes around the topic
  • Individuals working through a multi-step decision or transition
  • Small business owners / founders needing expert-style guidance without consultant budgets

Best AI model for this

Claude Opus 4.7 for advanced prompt design. GPT-5 for testing iterations.

Pro tips

  • Paste your real situation (with specific numbers and context), not generic 'help me with X' framing. The prompt rewards specificity.
  • If the prompt asks auto-intake questions, answer them fully before expecting output — incomplete inputs produce incomplete outputs.
  • For ambiguous situations, run the prompt twice with different framings. Compare outputs. Often reveals the right path.
  • Save the outputs you value. Iterate on them across sessions rather than re-running from scratch.
  • Pair with a human expert for high-stakes decisions — the prompt is a first-draft tool, not a final authority.
  • Share what worked back with us (promptolis.com/contact). Helps us refine future versions.
  • The research citations inside the prompt are real — look them up if a specific claim matters for your decision.

Customization tips

  • For classification tasks (spam vs not, sentiment analysis, topic categorization), enum constraints on output are essential. Always provide the exact allowed values as a list. For binary classification, include explicit handling of ambiguous cases ('if unsure, default to X').
  • For generation tasks (creative writing, summarization, content creation), the balance shifts: less enum rigidity, more examples. Tone-calibration examples matter (show 2-3 'good' outputs in the desired register). Length constraints are explicit, not implied ('Output 200-250 words').
  • For reasoning tasks (math word problems, multi-step analysis, debugging), Chain-of-Thought (step-by-step reasoning before answer) consistently improves accuracy 15-30% on hard problems. For simple reasoning, CoT adds tokens without benefit. Test both.
  • For code generation prompts, always include: language version (Python 3.11+, TypeScript 5+), framework context (React 18, FastAPI 0.110+), test expectations ('write tests using pytest'), and constraints ('no external dependencies beyond stdlib'). Generic 'write code for X' produces inconsistent output.
  • For Claude-specific optimization, use `<thinking>` tags for internal reasoning (Claude is trained to treat this section as 'scratchpad,' not final output). Use `<answer>` tags for the final output section. This separation often improves complex reasoning tasks by 10-20%.
  • For GPT-5 specifically, Reasoning Models (GPT-5 Thinking, o3-pro) perform better with 'Think step by step about this before answering' framing than base models. For base models, prefer 'Show your work:' followed by the specific reasoning format.
  • For agentic prompts (prompt that uses tools, function calls, or multi-step actions), separate: (1) the role description, (2) available tools with when-to-use criteria, (3) output decision format (should-I-call-tool vs respond directly), (4) error handling ('if tool fails, fall back to X'). Agent prompts fail most often at tool selection.
  • For safety-critical prompts (healthcare, legal, financial), always include: negative examples (what NOT to do), explicit refusal language ('if the user asks X, respond with: [specific text]'), and an escalation trigger ('if Y condition is met, refuse and recommend Z'). Generic 'be careful' instructions don't work.
  • For multilingual prompts, handle the language question explicitly. Is the prompt written in English while user inputs arrive in Spanish? Should the output match the input language? Test with actual non-English inputs; GPT-5 and Claude handle this differently, and Gemini is strongest for some non-English languages.
  • For debugging an existing prompt that's underperforming, the workflow is: (1) run 20 test cases, catalog every failure by category, (2) identify the dominant failure mode (usually one or two categories cover 70%+), (3) modify the prompt to address the dominant failure mode specifically, (4) re-run. Avoid changing multiple things at once — you won't know which change helped.
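The debugging workflow in the last tip — catalog failures by category, then attack the dominant mode — can be sketched in a few lines. The result-dict shape and function name here are assumptions for illustration:

```python
from collections import Counter

def triage_failures(results: list[dict], top_n: int = 2) -> list[tuple[str, int]]:
    """Steps 1-2 of the debugging workflow: catalog each failed test case by
    category, then surface the dominant failure mode(s) to fix first.
    Each result is assumed to look like {"ok": bool, "failure_category": str | None}.
    """
    categories = [r["failure_category"] for r in results if not r["ok"]]
    return Counter(categories).most_common(top_n)
```

If the top one or two categories cover 70%+ of failures (they usually do), fix only those, re-run the same test cases, and triage again — one change per iteration.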

Variants

Default

Standard flow for most users working on this topic

Beginner

Simplified output for users new to the domain — less jargon, more foundational explanation

Advanced

Denser output assuming practitioner-level baseline knowledge

Short-form

Compressed output for quick decisions, under 500 words

Deep-Session

Full guided session mode — walk through multiple prompts from the pack in one extended interaction

Self-Serve

Pick one specific prompt from the pack to run in isolation

Team Mode

Output structured for team discussion rather than individual reflection

Frequently asked questions

How do I use the Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use?

Claude Opus 4.7 for advanced prompt design. GPT-5 for testing iterations.

Can I customize the Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: paste your real situation (with specific numbers and context), not generic 'help me with X' framing — the prompt rewards specificity. If the prompt asks auto-intake questions, answer them fully before expecting output — incomplete inputs produce incomplete outputs.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals