⚡ Promptolis Original · Productivity & Systems
🧠 Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use
Prompt engineering grounded in Wei 2022 CoT, Brown 2020 few-shot learning, and 2024-2026 production LLM patterns from Anthropic, OpenAI, and Google docs.
Why this is epic
Most online advice on productivity & systems is generic, 2021-era, or AI-generated slop recycled from 5-year-old blog posts. This pack is built on Wei 2022 CoT, Brown 2020 few-shot learning, and 2024-2026 production LLM patterns from Anthropic, OpenAI, and Google docs.
6 categories × 5 prompts, each calibrated to real practitioner workflows — not cheat-sheet templates. Research-backed citations throughout. Tool-agnostic (Claude / ChatGPT / Gemini).
AI-Guided Session Mode: tell the AI your specific situation (role + stage + challenge) and it selects 1-3 prompts matched to you, then guides a full working session with your actual inputs.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
Experience level: Intermediate — shipped 2 internal tools using OpenAI, never built a formal eval
Use-case: Extraction — pulling structured data from customer support emails (customer name, product mentioned, issue type, sentiment, urgency 1-5, requested resolution)
Platform: GPT-5 (using gpt-5-mini for cost)
Specific challenge: The prompt works ~80% of the time but fails unpredictably. Sometimes it returns wrong urgency ratings, sometimes misses the product mentioned, sometimes returns plain text instead of JSON. Customer-facing feature — needs 95%+ reliability.
Current prompt: 'You are a helpful customer support analyst. Extract the following from this email: customer name, product mentioned, issue type, sentiment, urgency rating 1-5, and what resolution they want. Return as JSON.'
Example inputs/outputs: Available on request
How do I get this to 95%+ reliability?
Your Prompt Engineering Session — Structure-First Analysis
Diagnosis
Your current prompt fails predictably in five specific ways:
1. No input schema — the prompt doesn't tell the LLM what 'issue type' categories are valid. 'Billing,' 'technical support,' 'return request' is one categorization; 'urgent/complaint/inquiry' is another. The LLM picks a different one depending on the email.
2. No output schema — 'Return as JSON' is not a schema. The LLM guesses field names, types, and structure. Some responses return `urgency_rating` as a number (correct), some as string ('3'), some as ordinal ('high'), some as scale ('3/5').
3. No urgency definition — 'urgency 1-5' means different things to different LLMs without calibration. A 'wait 3 weeks for response' issue might be rated 1 or 3 depending on run.
4. No examples — zero-shot extraction on a 6-field structured output is notoriously unreliable. Few-shot examples (just 2-3) typically move accuracy from ~80% to ~95% on extraction tasks.
5. No validation step — the LLM isn't asked to verify it extracted all fields. Partial extractions slip through as '80% correct.'
This isn't a 'GPT-5 is inconsistent' problem — it's a prompt design problem. The fix is structural.
The Structural Fix
Replace the current prompt with this XML-structured version:
<role>
You are a customer support email extraction system. You extract structured data from customer emails and return it in strict JSON format matching the provided schema. You do not generate prose, commentary, or explanations — only the JSON object.
</role>
<task>
Extract six fields from the customer email below. Return a JSON object exactly matching the schema provided. If a field is unclear or missing from the email, use 'unknown' (for strings) or null (for numbers) — do not guess.
</task>
<output_schema>
{
"customer_name": "string — as found in email signature or 'unknown'",
"product_mentioned": "string — specific product name or SKU, or 'unknown' if none mentioned",
"issue_type": "one of: 'billing_issue' | 'technical_support' | 'return_request' | 'product_inquiry' | 'account_access' | 'other'",
"sentiment": "one of: 'positive' | 'neutral' | 'negative' | 'very_negative'",
"urgency": "integer 1-5 based on calibration below",
"requested_resolution": "string — concise description of what customer wants, or 'unspecified'"
}
</output_schema>
<urgency_calibration>
1 = Informational inquiry, no urgency. Example: 'Wondering about feature X.'
2 = Standard request, within normal SLA. Example: 'Please update my shipping address.'
3 = Time-sensitive but not blocking. Example: 'Order confirmation would help before weekend.'
4 = Urgent — active impact on customer. Example: 'Your service is down, my team can't work.'
5 = Critical — escalation or compliance concern. Example: 'I'm reporting this to my bank/BBB/attorney.'
</urgency_calibration>
<examples>
<example_1>
<email>
Hi team, this is John Martinez (order #A4821). I bought the Pro subscription last week and was charged $99 but the features aren't showing up in my dashboard. Can someone fix this by Friday? I have a client demo Monday.
- John
</email>
<output>
{
"customer_name": "John Martinez",
"product_mentioned": "Pro subscription",
"issue_type": "billing_issue",
"sentiment": "negative",
"urgency": 3,
"requested_resolution": "Activate Pro subscription features by Friday for Monday client demo"
}
</output>
</example_1>
<example_2>
<email>
Your product is absolutely broken and has been for 3 days. My entire sales team can't access their accounts. We pay $2000/month for this and I'm about to cancel and report this to my credit card. Respond immediately.
</email>
<output>
{
"customer_name": "unknown",
"product_mentioned": "unknown",
"issue_type": "account_access",
"sentiment": "very_negative",
"urgency": 5,
"requested_resolution": "Immediate restoration of team account access; retention at risk"
}
</output>
</example_2>
</examples>
<email>
[insert customer email here]
</email>
<output>
Key structural changes:
- Explicit role as an extraction system (not a 'helpful analyst')
- Output schema with enum constraints on categorical fields
- Urgency calibration with examples (anchors rating to meaning)
- Two few-shot examples covering easy and hard cases
- Explicit 'unknown/null for missing fields' rule (prevents hallucination)
- Asks for JSON object only (no prose)
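The 'enum constraints' and 'unknown/null' rules can also be enforced outside the prompt, in code, so malformed records never slip through. A minimal Python validator sketch following the schema above (the `validate_record` helper and its error messages are ours, not part of any SDK):

```python
import json

ISSUE_TYPES = {"billing_issue", "technical_support", "return_request",
               "product_inquiry", "account_access", "other"}
SENTIMENTS = {"positive", "neutral", "negative", "very_negative"}

def validate_record(raw: str) -> dict:
    """Parse the model's response and enforce the output schema.

    Raises ValueError on any schema violation so a bad record
    never slips through as '80% correct'."""
    record = json.loads(raw)  # raises if the model returned prose, not JSON
    required = {"customer_name", "product_mentioned", "issue_type",
                "sentiment", "urgency", "requested_resolution"}
    missing = required - record.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["issue_type"] not in ISSUE_TYPES:
        raise ValueError(f"invalid issue_type: {record['issue_type']!r}")
    if record["sentiment"] not in SENTIMENTS:
        raise ValueError(f"invalid sentiment: {record['sentiment']!r}")
    urgency = record["urgency"]
    if urgency is not None and not (isinstance(urgency, int) and 1 <= urgency <= 5):
        raise ValueError(f"urgency must be int 1-5 or null, got {urgency!r}")
    return record
```

A reject-and-retry on ValueError is far cheaper than silently shipping a malformed record to a customer-facing feature.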
Few-Shot Examples (why these specifically)
The two examples I included handle different failure modes:
- Example 1 (John Martinez): standard case with name + product + moderate urgency. Teaches the model the 'happy path' extraction.
- Example 2 (anonymous rage email): no name, vague product, extreme sentiment, max urgency. Teaches the model to handle missing-field cases and calibrate urgency 5 correctly (mentions of 'reporting,' 'cancel,' 'immediate' are the signals).
For production, add 2-3 more examples covering: (a) neutral inquiry with no product mentioned, (b) billing error with dollar amount specified, (c) positive feedback with feature request (edge case: 'positive' sentiment doesn't mean 'issue_type = other' automatically).
Platform Calibration
You're on GPT-5-mini. Specific adjustments:
- GPT-5 models generally respect JSON Mode (set `response_format: {"type": "json_object"}` in the API call). Use this in addition to the prompt instruction — it enforces valid JSON output at the API level.
- For structured extraction, GPT-5 Thinking > GPT-5 base > GPT-5-mini on accuracy. GPT-5-mini is 4x cheaper but may require more few-shot examples (5-7 instead of 2-3) to hit 95%.
- Temperature 0 for extraction — you want deterministic outputs, not creative ones. Current GPT-5 API default is usually 1.0; explicitly set to 0.
- If you need to migrate to Claude: keep the XML tags — `<...>`-style structure is Claude's preferred format, and you're already using it. Claude 3.5 Sonnet and Opus 4.7 both excel at XML-structured extraction. Often Claude + XML outperforms GPT-5 on extraction tasks for equivalent cost.
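Concretely, the JSON Mode and temperature settings above look like this as request kwargs. This is a sketch only: no network call is made, the model name follows the text above (substitute whatever you actually run), and the prompt content is elided:

```python
# Request settings for deterministic JSON extraction. Pass these kwargs to
# your OpenAI client's chat.completions.create; nothing is called here.
request_kwargs = {
    "model": "gpt-5-mini",
    "temperature": 0,  # deterministic extraction, not creative generation
    "response_format": {"type": "json_object"},  # JSON Mode: valid JSON enforced API-side
    "messages": [
        {"role": "system",
         "content": "You are a customer support email extraction system."},
        # The full structured prompt (role/task/schema/calibration/examples) goes here.
        {"role": "user", "content": "<task>...</task>\n<email>...</email>"},
    ],
}
```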
Evaluation Plan
Before shipping this to production, build an eval set:
1. Collect 30-50 real customer emails from your support queue covering:
- 5 easy cases (clear product, clear issue, normal urgency)
- 15 medium cases (some ambiguity in one or two fields)
- 10 hard cases (missing info, ambiguous sentiment, borderline urgency)
- 5 adversarial cases (trying to trick the system — prompt injection attempts, emails in multiple languages, unusual formatting)
2. Hand-label the correct output for each. This is your ground truth. 2-3 hours of work.
3. Run your prompt V1 against the full eval set. Measure:
- Exact match accuracy per field (customer_name exact match, issue_type exact match, etc.)
- Overall full-record accuracy (all 6 fields correct)
- Your target: 95%+ full-record accuracy
4. Make V2 based on failure patterns. Is sentiment consistently wrong? Add sentiment examples. Is urgency over-inflating? Tighten calibration rules.
5. A/B test V1 vs V2 on the eval set. Pick V2 only if it improves accuracy without degrading on any dimension.
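Step 3's two metrics are a few lines of Python once predictions and hand-labels are in hand. A sketch, not a full harness (the `score` function name is ours; the field list follows the schema earlier on this page):

```python
FIELDS = ["customer_name", "product_mentioned", "issue_type",
          "sentiment", "urgency", "requested_resolution"]

def score(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """Per-field exact-match accuracy plus full-record accuracy."""
    n = len(ground_truth)
    per_field = {f: 0 for f in FIELDS}
    full_record = 0
    for pred, truth in zip(predictions, ground_truth):
        all_correct = True
        for f in FIELDS:
            if pred.get(f) == truth[f]:
                per_field[f] += 1
            else:
                all_correct = False
        full_record += all_correct
    return {
        "per_field": {f: per_field[f] / n for f in FIELDS},
        "full_record": full_record / n,  # target: >= 0.95
    }
```

The per-field breakdown is what drives V2: it tells you whether sentiment, urgency, or extraction fields are the dominant failure mode.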
Iteration Path
V1 (above): ship with JSON Mode + temperature 0. Expect ~90-93% full-record accuracy on the eval set.
V2 (add): chain-of-thought reasoning for urgency + sentiment.
<reasoning>
Before outputting the JSON, think through:
1. What is the customer's tone?
2. What signals of urgency are present (time pressure, escalation language, consequences)?
3. What's the specific product/feature mentioned?
Then output the JSON.
</reasoning>
Trades ~200 extra output tokens for typically 3-5% accuracy lift on reasoning-heavy fields.
V3 (if still needed): split into two-call architecture. Call 1 extracts name/product/issue_type (factual). Call 2 assesses sentiment/urgency (judgment) with call 1's output as context. 2x cost for usually 2-3% accuracy lift. Only if V2 isn't hitting 95%.
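The two-call architecture takes only a few lines to orchestrate. In this sketch, `call_llm` is a stub standing in for your actual API client (JSON Mode, temperature 0); both function names are ours:

```python
import json

def call_llm(prompt: str) -> str:
    """Stub for your actual API call (JSON Mode, temperature 0)."""
    raise NotImplementedError

def extract_two_call(email: str, llm=call_llm) -> dict:
    # Call 1: factual fields only — low ambiguity, a cheap model is fine.
    facts = json.loads(llm(
        "Extract customer_name, product_mentioned, issue_type as JSON.\n"
        f"<email>{email}</email>"))
    # Call 2: judgment fields, grounded in call 1's output as context.
    judgment = json.loads(llm(
        "Given these extracted facts, rate sentiment and urgency (1-5) as JSON.\n"
        f"<facts>{json.dumps(facts)}</facts>\n<email>{email}</email>"))
    return {**facts, **judgment}
```

Injecting the client as a parameter also lets you unit-test the orchestration with a canned fake before paying for real calls.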
Red Flags
1. 'It works 80% of the time' without an eval set usually means 'I've looked at 10 cases and they seemed fine.' Build the eval set FIRST — you'll be shocked at how wrong your intuition is.
2. The role framing 'a helpful customer support analyst' in your original prompt. This kind of role-adjective padding doesn't improve accuracy; it just burns tokens.
3. Missing enum constraints on issue_type. 'Issue type' as free text means the LLM invents categories ('Customer Feedback', 'Urgent Issue', 'Support Request'). Enum constrains it.
4. Missing urgency calibration. '1-5' is ambiguous. Anchoring to specific example scenarios makes it consistent.
5. Shipping on subjective impression. 'Looks good to me' is not an eval. Structured accuracy measurement is.
Key Takeaways
- Your 80% → 95% gap is a prompt design problem, not a model problem. Specifically: no schema, no calibration, no examples, no validation.
- XML-structured prompts with explicit input/output schemas outperform adjective-heavy prose. Switch the structure.
- Few-shot examples (2-3 for start, 5-7 for GPT-5-mini) are the biggest single improvement for extraction tasks. Add them.
- Enum constraints on categorical fields prevent the LLM from inventing categories. issue_type as 'billing_issue | technical_support | ...' beats free-text 'issue_type'.
- JSON Mode + temperature 0 for GPT-5 extraction. Enforces structured output at the API level.
- Build an eval set of 30-50 real cases with hand-labeled ground truth. Measure full-record accuracy, target 95%+.
- Iterate V1 → V2 based on failure patterns, not intuition. Usually 2-3 iteration cycles hit 95%+ from a solid V1.
- Platform-calibrate: if GPT-5-mini can't hit 95% after 3 iterations, upgrade to GPT-5 base or try Claude 3.5 Sonnet. Sometimes the model is the limit.
Common use cases
- Professionals who need structured thinking on this topic, not vague advice
- Practitioners making specific decisions with real stakes
- Anyone tired of generic AI responses to domain-specific questions
- Users wanting depth over breadth — one thing done well, not 10 things done poorly
- Teams adopting AI tooling for a specific workflow area
- Consultants or coaches building repeatable processes around the topic
- Individuals working through a multi-step decision or transition
- Small business owners / founders needing expert-style guidance without consultant budgets
Best AI model for this
Claude Opus 4.7 for advanced prompt design. GPT-5 for testing iterations.
Pro tips
- Paste your real situation (with specific numbers and context), not generic 'help me with X' framing. The prompt rewards specificity.
- If the prompt asks auto-intake questions, answer them fully before expecting output — incomplete inputs produce incomplete outputs.
- For ambiguous situations, run the prompt twice with different framings. Compare outputs. Often reveals the right path.
- Save the outputs you value. Iterate on them across sessions rather than re-running from scratch.
- Pair with a human expert for high-stakes decisions — the prompt is a first-draft tool, not a final authority.
- Share what worked back with us (promptolis.com/contact). Helps us refine future versions.
- The research citations inside the prompt are real — look them up if a specific claim matters for your decision.
Customization tips
- For classification tasks (spam vs not, sentiment analysis, topic categorization), enum constraints on output are essential. Always provide the exact allowed values as a list. For binary classification, include explicit handling of ambiguous cases ('if unsure, default to X').
- For generation tasks (creative writing, summarization, content creation), the balance shifts: less enum rigidity, more examples. Tone-calibration examples matter (show 2-3 'good' outputs in the desired register). Length constraints are explicit, not implied ('Output 200-250 words').
- For reasoning tasks (math word problems, multi-step analysis, debugging), Chain-of-Thought (step-by-step reasoning before answer) consistently improves accuracy 15-30% on hard problems. For simple reasoning, CoT adds tokens without benefit. Test both.
- For code generation prompts, always include: language version (Python 3.11+, TypeScript 5+), framework context (React 18, FastAPI 0.110+), test expectations ('write tests using pytest'), and constraints ('no external dependencies beyond stdlib'). Generic 'write code for X' produces inconsistent output.
- For Claude-specific optimization, use `<thinking>` tags for internal reasoning (Claude is trained to treat this section as 'scratchpad,' not final output). Use `<answer>` tags for the final output section. This separation often improves complex reasoning tasks by 10-20%.
- For GPT-5 specifically, Reasoning Models (GPT-5 Thinking, o3-pro) perform better with 'Think step by step about this before answering' framing than base models. For base models, prefer 'Show your work:' followed by the specific reasoning format.
- For agentic prompts (prompt that uses tools, function calls, or multi-step actions), separate: (1) the role description, (2) available tools with when-to-use criteria, (3) output decision format (should-I-call-tool vs respond directly), (4) error handling ('if tool fails, fall back to X'). Agent prompts fail most often at tool selection.
- For safety-critical prompts (healthcare, legal, financial), always include: negative examples (what NOT to do), explicit refusal language ('if the user asks X, respond with: [specific text]'), and an escalation trigger ('if Y condition is met, refuse and recommend Z'). Generic 'be careful' instructions don't work.
- For multilingual prompts, handle the language question explicitly. What happens when the prompt is in English but the user's input is in Spanish? Should the output match the input language? Test with actual non-English inputs; GPT-5 and Claude handle this differently, and Gemini is strongest for some non-English languages.
- For debugging an existing prompt that's underperforming, the workflow is: (1) run 20 test cases, catalog every failure by category, (2) identify the dominant failure mode (usually one or two categories cover 70%+), (3) modify the prompt to address the dominant failure mode specifically, (4) re-run. Avoid changing multiple things at once — you won't know which change helped.
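The catalog-and-rank step in that debugging workflow is a one-liner with `collections.Counter` (the failure labels here are illustrative):

```python
from collections import Counter

# One label per failed test case, e.g. from hand-reviewing 20 runs.
failures = ["wrong_urgency", "missing_product", "wrong_urgency",
            "plain_text_output", "wrong_urgency", "missing_product"]

counts = Counter(failures)
dominant = counts.most_common(2)  # fix these first, one change at a time
```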
Variants
Default
Standard flow for most users working on this topic
Beginner
Simplified output for users new to the domain — less jargon, more foundational explanation
Advanced
Denser output assuming practitioner-level baseline knowledge
Short-form
Compressed output for quick decisions, under 500 words
Deep-Session
Full guided session mode — walk through multiple prompts from the pack in one extended interaction
Self-Serve
Pick one specific prompt from the pack to run in isolation
Team Mode
Output structured for team discussion rather than individual reflection
Frequently asked questions
How do I use the Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use?
Claude Opus 4.7 for advanced prompt design. GPT-5 for testing iterations.
Can I customize the Prompt Engineering Mastery Pack — 30 Structure-First Prompts for Production LLM Use prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: paste your real situation (with specific numbers and context) rather than generic 'help me with X' framing, since the prompt rewards specificity; and if the prompt asks auto-intake questions, answer them fully before expecting output, since incomplete inputs produce incomplete outputs.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals