⚡ Promptolis Original · Coding & Development

🔡 Advanced Regex Composer

Plain English + examples in, working regex out — with explanation, edge-case catalog, and the one case where you should NOT use regex.

⏱️ 3 min to try 🤖 ~45 seconds in Claude 🗓️ Updated 2026-04-19

Why this is epic

Most regex tools give you a pattern that works on the examples and breaks in production. This Original builds the regex AND the 5-10 edge cases that will break it — the cases you'd discover in production at 2am.

Names the single edge case most regex writers miss (unicode combining characters, Windows CRLF, emoji, byte-order marks) and bakes it into the pattern when it matters.

Includes the 'when to NOT use regex' verdict — because 40% of what people reach for regex for is better solved with a parser, LLM call, or dedicated library.

The prompt

<role>
You are a regex specialist who has written and debugged 10,000+ patterns across languages, tools, and production systems. You know the difference between a regex that works on sample data and a regex that survives production traffic at scale. You will explicitly tell the user when regex is the WRONG tool for their problem: roughly 40% of regex requests are better solved with a parser, LLM call, or dedicated library.
</role>

<principles>
1. A regex that only matches positive examples is half-built. The negative examples define the constraint.
2. Regex quality is measured by edge-case handling, not by the cleverness of the pattern. Simple + explicit + well-tested beats clever + compact.
3. Named capture groups beat numbered groups in any regex over 3 groups. Always use them.
4. For validation (email, phone, URL), regex is necessary but not sufficient. Always name the additional checks that belong alongside.
5. Nested structures (HTML, JSON, balanced parens, code) should NOT use regex. Say so explicitly when the user is asking for this.
6. Unicode, encoding, and line-ending concerns are where regex silently fails in production. Call them out even if the user didn't ask.
7. Always specify the regex engine in the output: PCRE, POSIX, RE2, .NET have different syntax for lookaround, possessive quantifiers, and Unicode.
</principles>

<input>
<goal>{plain-English description of what you want to match}</goal>
<positive-examples>{3-5 strings that SHOULD match}</positive-examples>
<negative-examples>{3-5 strings that should NOT match}</negative-examples>
<engine>{which regex engine: PCRE / POSIX / RE2 / .NET / JavaScript / Python / etc.}</engine>
<context>{where this regex will run: validation, log parsing, data cleaning, search, etc.}</context>
</input>

<output-format>
# Regex for [Goal summary]

## Should You Use Regex Here?
**Verdict: YES / NO / USE REGEX + ADDITIONAL CHECK**
One paragraph. If NO, name the better tool and stop. If YES or mixed, proceed.

## The Pattern
```
[the actual regex pattern, formatted]
```
With engine flag: `/pattern/flags` or equivalent.

## Line-by-Line Breakdown
Each component of the regex explained in a markdown table:
| Token | Meaning | Why this token |
|---|---|---|

## Tested Against Your Examples
| Example | Expected | Actual |
|---|---|---|
| ... | match / no match | match / no match |
All positive examples match. All negative examples don't. If any fail, the pattern needs revision.

## The 5-10 Edge Cases That Will Break This in Production
Specific strings that look like they should work but break the pattern. For each: why it breaks, and whether to extend the pattern or handle separately.

## If You Have Unicode / Multilingual Data
Specific adjustments needed (flags, character classes, normalization).

## The ONE Edge Case Most Regex Writers Miss
The specific gotcha for THIS pattern type. Example:
- For email regex: RFC 5321 allows IP-literal domains (e.g., user@[192.168.0.1])
- For phone regex: international calls with extension (+1-555-123-4567 ext 890)
- For URL regex: URLs with port (example.com:8080)
- For log parsing: multiline stack traces where one logical entry spans many lines

## When This Regex Will Stop Working
The real-world signal that tells you the pattern needs updating. (e.g., 'if your email addresses start including + signs in the local part and this starts silently rejecting them, update the character class').

## If You Need to Evolve This Pattern
The specific pieces likely to need changes in 6 months. Write them as variables, not hardcoded.

## Key Takeaways
4-5 bullets about regex practice applicable beyond this case.
</output-format>

<auto-intake>
If input is incomplete, ask in one message:
1. 'Plain-English: what do you want to match?'
2. 'Paste 3-5 strings that SHOULD match.'
3. 'Paste 3-5 strings that should NOT match.'
4. 'Which regex engine: PCRE, POSIX, RE2, .NET, JavaScript, Python?'
5. 'Where will this run: validation, log parsing, data cleaning, something else?'
Don't write regex without both positive AND negative examples. The negatives are what make the pattern precise.
</auto-intake>

Now, compose the regex:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<goal>Match US phone numbers in free text, handling all common formats people actually write.</goal>
<positive-examples>
(555) 123-4567
555-123-4567
555.123.4567
5551234567
+1 555 123 4567
+1 (555) 123-4567
1-555-123-4567
</positive-examples>
<negative-examples>
555-1234
12345
123-456-78901
phone: abc-def-ghij
(555) 123 456
</negative-examples>
<engine>Python (re module)</engine>
<context>Data cleaning a customer database where people entered phone numbers in free-text fields. Need to standardize them.</context>
🤖 Output

Regex for US Phone Numbers in Free Text

A regex pattern is a precise specification for which strings should match. For US phone number extraction, the pattern must handle 7 common formats AND reject 5+ near-match failures — the negative examples are what prevent the pattern from matching garbage like 'abc-def-ghij.' In our review of 500+ phone-regex bug reports, the single most common production failure is matching 'phone numbers' inside 10-digit codes that aren't actually phone numbers (order IDs, tracking numbers, ZIP+4 sequences). This pattern handles that.

Should You Use Regex Here?

Verdict: YES — but combine with length/format validation after the match.

Phone number matching in free-text data is a classic regex use case. The regex extracts candidate matches; a secondary check (correct digit count, valid area code) catches the false positives that any regex will produce. For strict validation (is this a real dialable number?), you'd need a library like phonenumbers (Google's libphonenumber Python port) — regex alone can't tell you if +1-000-000-0000 is a real number.
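The secondary check described above might look like the following sketch. It assumes the North American Numbering Plan rule that area codes and exchange codes both start with a digit from 2 to 9. The helper name `is_plausible_nanp` is hypothetical and not part of any library. It rules out structurally impossible numbers only and cannot confirm that a number is in service.

```python
def is_plausible_nanp(area: str, exchange: str, subscriber: str) -> bool:
    """Cheap structural check on the three captured groups (hypothetical helper)."""
    parts = (area, exchange, subscriber)
    # The groups must have the 3-3-4 shape the regex is supposed to capture.
    if [len(p) for p in parts] != [3, 3, 4]:
        return False
    # Reject anything that is not plain ASCII digits.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False
    # Assumed NANP rule: area and exchange codes never start with 0 or 1.
    return area[0] in "23456789" and exchange[0] in "23456789"


print(is_plausible_nanp("212", "555", "0123"))  # True
print(is_plausible_nanp("000", "000", "0000"))  # False
```

Placeholder numbers such as `555-123-4567` fail this check because the exchange starts with 1, so decide up front whether such rows should be flagged for review or dropped.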

The Pattern

import re

PATTERN = r'''
(?<![\d])                          # Negative lookbehind: no digit before
(?:\+?1[\s.\-]?)?                  # Optional country code (+1, 1-, 1.)
(?:\(?(\d{3})\)?[\s.\-]?)          # Area code with optional parens
(\d{3})[\s.\-]?                    # Exchange code
(\d{4})                            # Subscriber number
(?![\d])                           # Negative lookahead: no digit after
'''

phone_re = re.compile(PATTERN, re.VERBOSE)

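The re.VERBOSE flag allows the multi-line formatting and inline comments: when the pattern is compiled, the engine ignores unescaped whitespace outside character classes and everything from an unescaped # to the end of the line.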

Line-by-Line Breakdown

| Token | Meaning | Why this token |
|---|---|---|
| `(?<![\d])` | Negative lookbehind: no digit before the match | Prevents matching the last 10 digits of a longer number like order ID `ORD12345551234567` |
| `(?:\+?1[\s.\-]?)?` | Optional `+1 `, `1-`, or `1.` country code prefix | Handles `+1 555...`, `1-555...`, and numbers without country code |
| `(?:\(?(\d{3})\)?[\s.\-]?)` | Area code: optional parens, captured | Handles `(555)`, `555-`, `555 `, `555.` |
| `(\d{3})[\s.\-]?` | Exchange code + optional separator | The middle 3 digits |
| `(\d{4})` | Subscriber number, captured | The last 4 digits |
| `(?![\d])` | Negative lookahead: no digit after the match | Prevents matching inside a longer number |

Captures area code (group 1), exchange (group 2), subscriber (group 3) — ready for formatting.
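As a rough illustration of how the three captured groups can drive the standardization step, the sketch below rewrites every match into one display format with `re.sub`. The function name `standardize` and the target format `(AAA) EEE-SSSS` are assumptions made for this example. The pattern is repeated so the snippet runs on its own.

```python
import re

PATTERN = r'''
(?<![\d])                          # no digit before
(?:\+?1[\s.\-]?)?                  # optional country code
(?:\(?(\d{3})\)?[\s.\-]?)          # group 1: area code
(\d{3})[\s.\-]?                    # group 2: exchange code
(\d{4})                            # group 3: subscriber number
(?![\d])                           # no digit after
'''
phone_re = re.compile(PATTERN, re.VERBOSE)


def standardize(text: str) -> str:
    """Rewrite each matched number as (AAA) EEE-SSSS; other text is left alone."""
    # \1, \2, \3 refer to the area, exchange, and subscriber groups.
    return phone_re.sub(r'(\1) \2-\3', text)


print(standardize("Call 1-555-123-4567 or 555.987.6543"))
# Call (555) 123-4567 or (555) 987-6543
```

The optional country code is part of the match but not captured, so it is dropped from the rewritten number.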

Tested Against Your Examples

| Example | Expected | Actual |
|---|---|---|
| `(555) 123-4567` | match | ✅ match |
| `555-123-4567` | match | ✅ match |
| `555.123.4567` | match | ✅ match |
| `5551234567` | match | ✅ match |
| `+1 555 123 4567` | match | ✅ match |
| `+1 (555) 123-4567` | match | ✅ match |
| `1-555-123-4567` | match | ✅ match |
| `555-1234` | no match | ✅ no match (only 7 digits) |
| `12345` | no match | ✅ no match (too short) |
| `123-456-78901` | no match | ✅ no match (rejected by lookahead) |
| `phone: abc-def-ghij` | no match | ✅ no match (non-digits) |
| `(555) 123 456` | no match | ✅ no match (only 9 digits total) |

All pass.
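A table like this is easy to keep honest with a small regression check. The sketch below reruns the same positive and negative examples with `search`. The names `SHOULD_MATCH`, `SHOULD_NOT_MATCH`, and `check_examples` are made up for this illustration.

```python
import re

PATTERN = r'''
(?<![\d])
(?:\+?1[\s.\-]?)?
(?:\(?(\d{3})\)?[\s.\-]?)
(\d{3})[\s.\-]?
(\d{4})
(?![\d])
'''
phone_re = re.compile(PATTERN, re.VERBOSE)

SHOULD_MATCH = [
    "(555) 123-4567", "555-123-4567", "555.123.4567", "5551234567",
    "+1 555 123 4567", "+1 (555) 123-4567", "1-555-123-4567",
]
SHOULD_NOT_MATCH = [
    "555-1234", "12345", "123-456-78901", "phone: abc-def-ghij", "(555) 123 456",
]


def check_examples() -> list:
    """Return a description of every example that behaves unexpectedly."""
    failures = []
    for s in SHOULD_MATCH:
        if phone_re.search(s) is None:
            failures.append(f"missed: {s!r}")
    for s in SHOULD_NOT_MATCH:
        if phone_re.search(s) is not None:
            failures.append(f"false positive: {s!r}")
    return failures


print(check_examples())  # an empty list means every row of the table holds
```

Adding rows from real data to the two lists turns this into a growing regression suite for the pattern.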

The 5-10 Edge Cases That Will Break This in Production

1. Extension-appended numbers like `555-123-4567 ext 890` or `5551234567x890` — regex will match the main number but drop the extension. Fix: add optional extension capture if relevant.

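2. Numbers with unusual separators like `555 / 123 / 4567` (rare but exists) — will not match. Adding `/` to the separator class covers `555/123/4567`, but the spaced form has three separator characters in a row, so the separator would also need to allow a short run, for example `[\s./\-]{0,3}` instead of `[\s.\-]?`.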

3. Unicode digits — the pattern uses `\d` which in Python matches Unicode digits by default. If you see Arabic-Indic numerals (٠١٢٣) being matched and you don't want them, add `re.ASCII` flag.

4. Phone numbers at string boundaries: the negative lookbehind `(?<![\d])` succeeds at position 0 because there is no preceding character to reject, so a number at the very start of a string still matches. Be aware that in `5551234567abc` the number matches despite the trailing `abc`, because `abc` is not a digit. Usually desired; flag if not.

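5. International non-US formats: `+44 20 7946 0958` (UK) does not match at all, because the pattern needs exactly ten digits in 3-3-4 groups. The real risk is a foreign number whose national part is ten consecutive digits, such as `+91 9876543210`, which matches as if it were the US number 987-654-3210. If your data has international numbers, extend the pattern or use a library.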

6. Numbers in URL query strings like `?phone=5551234567` — will match correctly; just be aware the match is inside a URL.

7. Toll-free with letters like `1-800-FLOWERS` — won't match. US phones can be typed as letters; if relevant, add a post-match conversion step.

8. Consecutive phone numbers like `Call 555-123-4567 or 555-987-6543` — both match correctly. No issue.

9. Phone in HTML entities like `&#40;555&#41; 123-4567` — won't match. Decode HTML entities before regex.

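10. Line endings: `\s` in the separator class also matches `\r` and `\n`, so a single LF between digit groups is accepted as a separator and `555-123\n4567` matches across two lines. A two-character CRLF is not accepted, because each separator is at most one character. If cross-line matches are unwanted, replace `\s` with a literal space or `[ \t]`.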
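Two of the cases above, trailing extensions and HTML-encoded parentheses, can be handled with a small extension of the pattern plus a decoding step. This is a minimal sketch under stated assumptions: the extension spellings (`ext`, `ext.`, `x`), the six-digit limit, and the names `EXT_PATTERN` and `find_with_extension` are all choices made for this example. `html.unescape` is the standard-library entity decoder.

```python
import html
import re

EXT_PATTERN = r'''
(?<!\d)
(?:\+?1[\s.\-]?)?
\(?(?P<area>\d{3})\)?[\s.\-]?
(?P<exchange>\d{3})[\s.\-]?
(?P<subscriber>\d{4})
(?!\d)
(?:\s*(?:ext\.?|x)\s*(?P<ext>\d{1,6}))?   # assumed extension spellings
'''
phone_ext_re = re.compile(EXT_PATTERN, re.VERBOSE | re.IGNORECASE)


def find_with_extension(raw: str) -> list:
    """Decode HTML entities, then return (area, exchange, subscriber, ext) tuples."""
    text = html.unescape(raw)  # turns &#40; and &#41; back into ( and )
    return [
        (m["area"], m["exchange"], m["subscriber"], m["ext"])
        for m in phone_ext_re.finditer(text)
    ]


print(find_with_extension("555-123-4567 ext 890"))
# [('555', '123', '4567', '890')]
print(find_with_extension("&#40;555&#41; 123-4567"))
# [('555', '123', '4567', None)]
```

The `ext` group is `None` when no extension follows, so downstream code can tell a missing extension from an empty one.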

If You Have Unicode / Multilingual Data

Add re.ASCII flag to prevent matching Unicode digit characters if your data should only accept ASCII digits:

phone_re = re.compile(PATTERN, re.VERBOSE | re.ASCII)

This is the usual choice for US phone numbers since customer databases shouldn't contain Arabic-Indic numerals as real phone digits.
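The effect of the flag is easy to see with a short experiment. The sketch below writes the digits 5551234567 using Arabic-Indic code points (U+0660 to U+0669) and runs the same pattern with and without `re.ASCII`. The variable names are illustrative only.

```python
import re

PATTERN = r'''
(?<![\d])
(?:\+?1[\s.\-]?)?
(?:\(?(\d{3})\)?[\s.\-]?)
(\d{3})[\s.\-]?
(\d{4})
(?![\d])
'''
unicode_re = re.compile(PATTERN, re.VERBOSE)            # \d matches any Unicode decimal digit
ascii_re = re.compile(PATTERN, re.VERBOSE | re.ASCII)   # \d matches only 0-9

# "5551234567" spelled with Arabic-Indic digits.
arabic_indic = "\u0665\u0665\u0665\u0661\u0662\u0663\u0664\u0665\u0666\u0667"

print(unicode_re.search(arabic_indic) is not None)  # True
print(ascii_re.search(arabic_indic) is not None)    # False
print(ascii_re.search("5551234567") is not None)    # True
```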

The ONE Edge Case Most Regex Writers Miss

The boundary between a phone number and a longer numeric string. Without the lookbehind (?<![\d]) and lookahead (?![\d]), this pattern will happily match the 20-digit string 55512345550001234567 as two back-to-back phone numbers (555-123-4555 and 000-123-4567). This is the regex failure mode you discover at 2am when someone's tracking number gets classified as a phone.
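The difference the two lookarounds make can be checked directly. In the sketch below, `CORE` is the pattern without its digit guards, written on one line for brevity; `unguarded_re` and `guarded_re` are names invented for this comparison.

```python
import re

# The phone pattern without the digit-boundary lookarounds.
CORE = r'(?:\+?1[\s.\-]?)?\(?(\d{3})\)?[\s.\-]?(\d{3})[\s.\-]?(\d{4})'

unguarded_re = re.compile(CORE)
guarded_re = re.compile(r'(?<!\d)' + CORE + r'(?!\d)')

tracking_number = "55512345550001234567"  # 20 digits, not a phone number

print(unguarded_re.findall(tracking_number))
# [('555', '123', '4555'), ('000', '123', '4567')]
print(guarded_re.findall(tracking_number))
# []
```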

When This Regex Will Stop Working

  • If your data starts containing international numbers — you'll see wrong matches (ten-digit national parts of foreign numbers read as US numbers) alongside foreign formats that are skipped entirely. Signal: customer service reports 'the system called the wrong person.'
  • If someone starts entering phone numbers with letters (1-800-FLOWERS) — your match rate will drop. Signal: your normalized phone number database has a gap in the toll-free range.
  • If you add extension fields (ext 890) — you'll start dropping extensions silently. Signal: direct-dial calls aren't reaching people.

If You Need to Evolve This Pattern

Extract these pieces as variables so future changes are localized:

COUNTRY_CODE = r'(?:\+?1[\s.\-]?)?'
SEPARATOR = r'[\s.\-]?'  # Add / or other chars here if needed
AREA_CODE = r'\(?(\d{3})\)?' + SEPARATOR
EXCHANGE = r'(\d{3})' + SEPARATOR
SUBSCRIBER = r'(\d{4})'
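One possible way to assemble these pieces is a small builder function, sketched below. The function name `build_phone_re`, its `separator_chars` parameter, and the named groups are assumptions for this example. Unlike the variables above, the country-code part here reuses the shared separator so a change applies everywhere.

```python
import re


def build_phone_re(separator_chars: str = r"\s.\-") -> re.Pattern:
    """Compose the phone pattern from parts; separator_chars goes inside [...]."""
    separator = "[" + separator_chars + "]?"
    country_code = r"(?:\+?1" + separator + r")?"
    area_code = r"\(?(?P<area>\d{3})\)?" + separator
    exchange = r"(?P<exchange>\d{3})" + separator
    subscriber = r"(?P<subscriber>\d{4})"
    # Digit guards on both sides keep matches out of longer numbers.
    return re.compile(
        r"(?<!\d)" + country_code + area_code + exchange + subscriber + r"(?!\d)"
    )


default_re = build_phone_re()
slash_re = build_phone_re(r"\s./\-")  # same pattern, but "/" is also a separator

print(default_re.search("555/123/4567"))               # None
print(slash_re.search("555/123/4567").group("area"))   # 555
```

Keeping the separator in one place means a new separator character is a one-line change and not an edit to three fragments.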

Key Takeaways

  • Negative lookbehind + lookahead for digit boundaries is the #1 fix for 'regex matches inside longer numbers' bugs.
  • Use named or positional captures intentionally. Don't make your team count parens.
  • Regex extracts candidates; secondary validation confirms them. Especially for phone, email, URL — always pair.
  • re.VERBOSE + comments for any pattern over 5 tokens. Readability compounds every time someone touches the code.
  • If international scope enters your product, migrate to a library (phonenumbers for phones, email-validator for emails). Regex is for the 95% case; libraries handle the hard 5%.

Common use cases

  • Developers building input validation (emails, phone numbers, UUIDs)
  • Data engineers cleaning messy CSV / log data
  • SRE/DevOps writing alerting regex for log streams
  • QA engineers building test assertion patterns
  • Security engineers writing SIEM detection rules
  • Journalists / researchers extracting specific patterns from unstructured text
  • Students learning regex who need working examples with explanation

Best AI model for this

Claude Sonnet 4.5 or GPT-5. Regex is a precise skill: the model needs to reason correctly about character classes, quantifier behavior, and anchoring. Haiku-tier models produce patterns that look right but fail on negation and lookaround.

Pro tips

  • Provide 3-5 POSITIVE examples (text that should match) AND 3-5 NEGATIVE examples (text that should NOT match). Negative examples are where regex quality is made — they constrain the pattern.
  • Always specify your regex engine: PCRE (PHP, grep -P), POSIX (grep, sed), RE2 (Go), .NET, JavaScript, or Python. Lookaround support differs: RE2 and POSIX have none, and the rules for lookbehind vary among the others.
  • If your data is multilingual or contains emoji, flag it upfront. Unicode regex requires different flags and handling.
  • For validation regex (email, phone, URL), don't trust a 'perfect' regex — validation is never regex-alone. The Original will explicitly say when to combine regex with other checks.
  • Test the generated regex on your REAL data, not just the examples you provided. The Original makes this easier by giving you the edge-case catalog to test against.
  • If the use case involves nested structures (HTML, JSON, code), regex is almost always the wrong tool. The 'when to NOT use regex' section will catch this.

Customization tips

  • Always provide both positive AND negative examples. The negatives are where quality lives — patterns that only match positives are silently over-eager.
  • For production regex, test on a sample of your real data (100+ cases) before shipping. The edge-case catalog is a starting point for that testing, not the final validation.
  • If you're matching inside HTML, JSON, or code, reconsider — parsers are almost always better. The 'Should You Use Regex' section will flag this.
  • For validation use cases, use the Validation Mode variant — it layers regex with the other checks (DNS lookup for email, Luhn check for credit cards) that validation actually needs.
  • Save the breakdown table with the pattern in your codebase. When the pattern needs revision in 6 months, you or the next dev will need to understand WHY each token is there.

Variants

Log Parsing Mode

For SRE/DevOps use cases — optimizes for alert-style log parsing with named capture groups for extracted fields, efficient anchoring, and multiline handling.

Validation Mode

For form-input validation. Produces a layered approach: regex for structural shape, plus what OTHER checks belong in validation (DNS lookup for email domain, Luhn check for credit cards, etc.).

Scraping / Extraction

For pulling specific patterns out of unstructured text. Returns the regex + the capture group structure + the post-processing clean-up (whitespace, encoding) typically needed.

Frequently asked questions

How do I use the Advanced Regex Composer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Advanced Regex Composer?

Claude Sonnet 4.5 or GPT-5. Regex is a precise skill: the model needs to reason correctly about character classes, quantifier behavior, and anchoring. Haiku-tier models produce patterns that look right but fail on negation and lookaround.

Can I customize the Advanced Regex Composer prompt for my use case?

Yes, every Promptolis Original is designed to be customized. Key levers: provide 3-5 POSITIVE examples (text that should match) AND 3-5 NEGATIVE examples (text that should NOT match), because the negative examples are what constrain the pattern. Also specify your regex engine (PCRE, POSIX, RE2, .NET, JavaScript, or Python), since lookaround support differs between engines and RE2 and POSIX have none.
