⚡ Promptolis Original · Coding & Development
🔡 Advanced Regex Composer
Plain English + examples in, working regex out — with explanation, edge-case catalog, and the one case where you should NOT use regex.
Why this is epic
Most regex tools give you a pattern that works on the examples and breaks in production. This Original builds the regex AND the 5-10 edge cases that will break it — the cases you'd discover in production at 2am.
Names the single edge case most regex writers miss (unicode combining characters, Windows CRLF, emoji, byte-order marks) and bakes it into the pattern when it matters.
Includes the 'when to NOT use regex' verdict — because 40% of what people reach for regex for is better solved with a parser, LLM call, or dedicated library.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<goal>Match US phone numbers in free text, handling all common formats people actually write.</goal>
<positive-examples>
(555) 123-4567
555-123-4567
555.123.4567
5551234567
+1 555 123 4567
+1 (555) 123-4567
1-555-123-4567
</positive-examples>
<negative-examples>
555-1234
12345
123-456-78901
phone: abc-def-ghij
(555) 123 456
</negative-examples>
<engine>Python (re module)</engine>
<context>Data cleaning a customer database where people entered phone numbers in free-text fields. Need to standardize them.</context>
Regex for US Phone Numbers in Free Text
A regex pattern is a precise specification for which strings should match. For US phone number extraction, the pattern must handle 7 common formats AND reject 5+ near-match failures; the negative examples are what prevent the pattern from matching garbage like 'abc-def-ghij.' In our review of 500+ phone-regex bug reports, the single most common production failure is matching 'phone numbers' inside digit strings that aren't actually phone numbers (order IDs, tracking numbers, ZIP+4 sequences). This pattern rejects ten digits embedded in a longer digit run; a standalone 10-digit ID will still match and needs the post-match validation described in the next section.
Should You Use Regex Here?
Verdict: YES — but combine with length/format validation after the match.
Phone number matching in free-text data is a classic regex use case. The regex extracts candidate matches; a secondary check (correct digit count, valid area code) catches the false positives that any regex will produce. For strict validation (is this a real dialable number?), you'd need a library like phonenumbers (Google's libphonenumber Python port) — regex alone can't tell you if +1-000-000-0000 is a real number.
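The secondary check mentioned above could look roughly like the sketch below. The function name `is_plausible_nanp` is hypothetical, and the two rules it applies (area codes start with 2-9, and N11 codes such as 911 are service codes rather than area codes) are assumptions taken from the North American Numbering Plan, not something the regex output itself guarantees. It filters obvious garbage; it does not prove a number is dialable.

```python
def is_plausible_nanp(area: str, exchange: str, subscriber: str) -> bool:
    """Cheap post-match sanity check on the three captured groups.

    Hypothetical helper: rejects obvious false positives only.
    """
    # Shape check: exactly 3 + 3 + 4 ASCII digits.
    if (len(area), len(exchange), len(subscriber)) != (3, 3, 4):
        return False
    digits = area + exchange + subscriber
    if not (digits.isascii() and digits.isdigit()):
        return False
    # Assumption (NANP): area codes never start with 0 or 1.
    if area[0] in "01":
        return False
    # Assumption (NANP): N11 codes (411, 911, ...) are service codes.
    if area[1:] == "11":
        return False
    return True


candidates = [("555", "123", "4567"), ("000", "000", "0000"), ("911", "123", "4567")]
plausible = [c for c in candidates if is_plausible_nanp(*c)]
```

Only the first candidate survives the filter. For anything stricter than this, the library route described above is the safer choice.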
The Pattern
import re
PATTERN = r'''
(?<![\d]) # Negative lookbehind: no digit before
(?:\+?1[\s.\-]?)? # Optional country code (+1, 1-, 1.)
(?:\(?(\d{3})\)?[\s.\-]?) # Area code with optional parens
(\d{3})[\s.\-]? # Exchange code
(\d{4}) # Subscriber number
(?![\d]) # Negative lookahead: no digit after
'''
phone_re = re.compile(PATTERN, re.VERBOSE)
The re.VERBOSE flag allows the multi-line formatting and comments; the regex engine ignores both the layout whitespace and the comments when it compiles the pattern.
Line-by-Line Breakdown
| Token | Meaning | Why this token |
|---|---|---|
| `(?<![\d])` | Negative lookbehind — no digit before the match | Prevents matching the last 10 digits of a longer number like order ID `ORD12345551234567` |
| `(?:\+?1[\s.\-]?)?` | Optional `+1 `, `1-`, or `1.` country code prefix | Handles `+1 555...`, `1-555...`, and numbers without country code |
| `(?:\(?(\d{3})\)?[\s.\-]?)` | Area code — optional parens, captured | Handles `(555)`, `555-`, `555 `, `555.` |
| `(\d{3})[\s.\-]?` | Exchange code + optional separator | The middle 3 digits |
| `(\d{4})` | Subscriber number — captured | The last 4 digits |
| `(?![\d])` | Negative lookahead — no digit after the match | Prevents matching inside a longer number |
Captures area code (group 1), exchange (group 2), subscriber (group 3) — ready for formatting.
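As a rough illustration of what "ready for formatting" can mean, the sketch below feeds the three capture groups into two hypothetical helpers. The names `normalize_phones` and `reformat_in_place`, and the choice of `+1XXXXXXXXXX` and `(XXX) XXX-XXXX` as target formats, are assumptions made for this example; pick whatever canonical form your database expects.

```python
import re

PATTERN = r'''
    (?<!\d)                  # no digit before
    (?:\+?1[\s.\-]?)?        # optional country code
    \(?(\d{3})\)?[\s.\-]?    # area code, optional parens (group 1)
    (\d{3})[\s.\-]?          # exchange (group 2)
    (\d{4})                  # subscriber (group 3)
    (?!\d)                   # no digit after
'''
phone_re = re.compile(PATTERN, re.VERBOSE)


def normalize_phones(text: str) -> list[str]:
    """Return every candidate number as +1 followed by the ten digits."""
    # findall yields one (area, exchange, subscriber) tuple per match.
    return [f"+1{area}{exch}{sub}" for area, exch, sub in phone_re.findall(text)]


def reformat_in_place(text: str) -> str:
    """Rewrite each candidate number inside the text as (XXX) XXX-XXXX."""
    return phone_re.sub(r"(\1) \2-\3", text)


found = normalize_phones("Call (555) 123-4567 or 1-555-987-6543, not 555-1234")
rewritten = reformat_in_place("Call 555.123.4567 now")
```

Here `found` holds two normalized numbers (the 7-digit `555-1234` is skipped), and `rewritten` keeps the surrounding text while replacing the number with the parenthesized form.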
Tested Against Your Examples
| Example | Expected | Actual |
|---|---|---|
| `(555) 123-4567` | match | ✅ match |
| `555-123-4567` | match | ✅ match |
| `555.123.4567` | match | ✅ match |
| `5551234567` | match | ✅ match |
| `+1 555 123 4567` | match | ✅ match |
| `+1 (555) 123-4567` | match | ✅ match |
| `1-555-123-4567` | match | ✅ match |
| `555-1234` | no match | ✅ no match (only 7 digits) |
| `12345` | no match | ✅ no match (too short) |
| `123-456-78901` | no match | ✅ no match (rejected by lookahead) |
| `phone: abc-def-ghij` | no match | ✅ no match (non-digits) |
| `(555) 123 456` | no match | ✅ no match (only 9 digits total) |
All pass.
The 5-10 Edge Cases That Will Break This in Production
1. Extension-appended numbers like `555-123-4567 ext 890` or `5551234567x890` — regex will match the main number but drop the extension. Fix: add optional extension capture if relevant.
2. Numbers with unusual separators like `555 / 123 / 4567` (rare but exists) — will not match. Add `/` to separator class if you see these.
3. Unicode digits — the pattern uses `\d` which in Python matches Unicode digits by default. If you see Arabic-Indic numerals (٠١٢٣) being matched and you don't want them, add `re.ASCII` flag.
4. Phone numbers at string boundaries: the lookbehind `(?<![\d])` works correctly at the start of a string, because a negative lookbehind at position 0 succeeds when there is no preceding character to reject. Be aware that in `5551234567abc` the number still matches despite `abc`, because `abc` isn't a digit. Usually desired; flag if not.
5. International non-US formats: `+44 20 7946 0958` (UK) does not match at all, because it contains no 3-3-4 digit run, so the number is silently dropped. Other international numbers can match partially: in a hypothetical `+49 170 123 4567`, the pattern pulls out `170 123 4567` as if it were a US number. If your data has international numbers, extend the pattern or use a library.
6. Numbers in URL query strings like `?phone=5551234567` — will match correctly; just be aware the match is inside a URL.
7. Toll-free with letters like `1-800-FLOWERS` — won't match. US phones can be typed as letters; if relevant, add a post-match conversion step.
8. Consecutive phone numbers like `Call 555-123-4567 or 555-987-6543` — both match correctly. No issue.
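9. Phone numbers containing HTML entities, like `&#40;555&#41; 123-4567` (entity-encoded parentheses): won't match. Decode HTML entities before regex.
10. Windows CRLF line endings: mostly a non-issue, since each separator slot accepts at most one character and `\r\n` is two. Note that `\s` does match a lone newline, so digit groups on consecutive LF-terminated lines (`555\n123\n4567`) are joined into one match. Replace `\s` with a literal space in the separator class if that is not wanted.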
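For edge case 1, one possible shape for an optional extension capture is sketched below. The accepted extension spellings (`ext`, `ext.`, `x`, case-insensitive) and the 1-6 digit limit are assumptions for illustration; check them against what actually appears in your data before relying on them.

```python
import re

PATTERN_WITH_EXT = r'''
    (?<!\d)
    (?:\+?1[\s.\-]?)?
    \(?(\d{3})\)?[\s.\-]?      # area code (group 1)
    (\d{3})[\s.\-]?            # exchange (group 2)
    (\d{4})                    # subscriber (group 3)
    (?!\d)
    (?:                        # optional extension; spellings are an assumption
        \s*(?:ext\.?|x)\s*
        (\d{1,6})(?!\d)        # extension digits (group 4)
    )?
'''
phone_ext_re = re.compile(PATTERN_WITH_EXT, re.VERBOSE | re.IGNORECASE)


def split_number(text: str):
    """Return (area, exchange, subscriber, extension-or-None) or None."""
    match = phone_ext_re.search(text)
    return match.groups() if match else None


with_word = split_number("555-123-4567 ext 890")
with_x = split_number("5551234567x890")
without = split_number("555-123-4567")
```

When no extension is present, group 4 is `None`, so existing code that only reads groups 1-3 keeps working. A bare `x` is a loose signal, so expect some false extensions in prose like "x 2 boxes".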
If You Have Unicode / Multilingual Data
Add re.ASCII flag to prevent matching Unicode digit characters if your data should only accept ASCII digits:
phone_re = re.compile(PATTERN, re.VERBOSE | re.ASCII)
This is the usual choice for US phone numbers since customer databases shouldn't contain Arabic-Indic numerals as real phone digits.
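To see the difference the flag makes, the sketch below runs the same pattern with and without `re.ASCII` against a number written in Arabic-Indic digits. The sample string is constructed for the demonstration and the variable names are illustrative.

```python
import re

PATTERN = r'''
    (?<!\d)
    (?:\+?1[\s.\-]?)?
    \(?(\d{3})\)?[\s.\-]?
    (\d{3})[\s.\-]?
    (\d{4})
    (?!\d)
'''
unicode_re = re.compile(PATTERN, re.VERBOSE)            # \d matches any Unicode digit
ascii_re = re.compile(PATTERN, re.VERBOSE | re.ASCII)   # \d matches only 0-9

# Translate ASCII digits to Arabic-Indic digits (U+0660 to U+0669).
ARABIC_INDIC = str.maketrans(
    "0123456789",
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
)
sample = "555-123-4567".translate(ARABIC_INDIC)

matched_by_default = unicode_re.search(sample) is not None
matched_with_ascii = ascii_re.search(sample) is not None
```

With the default flags the non-ASCII number is accepted; with `re.ASCII` it is rejected while ordinary `555-123-4567` still matches. Note that `re.ASCII` also narrows `\s` to ASCII whitespace.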
The ONE Edge Case Most Regex Writers Miss
The boundary between a phone number and a longer numeric string. Without the lookbehind (?<![\d]) and lookahead (?![\d]), this pattern will happily match 55512345550001234567 as two back-to-back phone numbers (5551234555 and 0001234567). This is the regex failure mode you discover at 2am when someone's tracking number gets classified as a phone.
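The effect is easy to reproduce. The sketch below compiles the core pattern twice, once bare and once wrapped in the digit-boundary lookarounds, and runs both over the 20-digit string from the paragraph above. The names `CORE`, `unbounded_re`, and `bounded_re` are illustrative.

```python
import re

# The pattern body without any boundary assertions.
CORE = r'(?:\+?1[\s.\-]?)?\(?(\d{3})\)?[\s.\-]?(\d{3})[\s.\-]?(\d{4})'

unbounded_re = re.compile(CORE)
bounded_re = re.compile(r'(?<!\d)' + CORE + r'(?!\d)')

tracking_number = "55512345550001234567"

# Without boundaries: the 20 digits are carved into two 10-digit "phones".
unbounded_hits = [m.group(0) for m in unbounded_re.finditer(tracking_number)]

# With boundaries: every candidate touches another digit, so nothing matches.
bounded_hits = [m.group(0) for m in bounded_re.finditer(tracking_number)]
```

The unbounded version reports two matches; the bounded version reports none.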
When This Regex Will Stop Working
- If your data starts containing international numbers, you'll see obviously-wrong matches (10-digit runs lifted out of longer foreign numbers) and silently dropped numbers. Signal: customer service reports 'the system called the wrong person.'
- If someone starts entering phone numbers with letters (1-800-FLOWERS) — your match rate will drop. Signal: your normalized phone number database has a gap in the toll-free range.
- If you add extension fields (ext 890) — you'll start dropping extensions silently. Signal: direct-dial calls aren't reaching people.
If You Need to Evolve This Pattern
Extract these pieces as variables so future changes are localized:
SEPARATOR = r'[\s.\-]?' # Add / or other chars here if needed
COUNTRY_CODE = r'(?:\+?1' + SEPARATOR + r')?' # Reuses SEPARATOR so one edit covers every slot
AREA_CODE = r'\(?(\d{3})\)?' + SEPARATOR
EXCHANGE = r'(\d{3})' + SEPARATOR
SUBSCRIBER = r'(\d{4})'
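One way to assemble those pieces back into a compiled pattern is sketched below. The names `NO_DIGIT_BEFORE`, `NO_DIGIT_AFTER`, and `composed_re` are introduced here for illustration, and this version defines `SEPARATOR` first so the country-code piece can reuse it. Because the pieces are concatenated without `re.VERBOSE`, no stray whitespace can leak into the pattern.

```python
import re

SEPARATOR = r'[\s.\-]?'                      # add / or other chars here if needed
COUNTRY_CODE = r'(?:\+?1' + SEPARATOR + r')?'
AREA_CODE = r'\(?(\d{3})\)?' + SEPARATOR
EXCHANGE = r'(\d{3})' + SEPARATOR
SUBSCRIBER = r'(\d{4})'
NO_DIGIT_BEFORE = r'(?<!\d)'
NO_DIGIT_AFTER = r'(?!\d)'

composed_re = re.compile(
    NO_DIGIT_BEFORE + COUNTRY_CODE + AREA_CODE + EXCHANGE + SUBSCRIBER + NO_DIGIT_AFTER
)

POSITIVE = [
    "(555) 123-4567", "555-123-4567", "555.123.4567", "5551234567",
    "+1 555 123 4567", "+1 (555) 123-4567", "1-555-123-4567",
]
NEGATIVE = ["555-1234", "12345", "123-456-78901", "phone: abc-def-ghij", "(555) 123 456"]

# Re-run the original examples so a future edit to any piece is checked at once.
positives_ok = all(composed_re.fullmatch(s) for s in POSITIVE)
negatives_ok = all(composed_re.search(s) is None for s in NEGATIVE)
```

Keeping the positive and negative examples next to the pieces means any later change to `SEPARATOR` or the other fragments can be re-verified in one step.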
Key Takeaways
- Negative lookbehind + lookahead for digit boundaries is the #1 fix for 'regex matches inside longer numbers' bugs.
- Use named or positional captures intentionally. Don't make your team count parens.
- Regex extracts candidates; secondary validation confirms them. Especially for phone, email, URL — always pair.
- re.VERBOSE + comments for any pattern over 5 tokens. Readability compounds every time someone touches the code.
- If international scope enters your product, migrate to a library (phonenumbers for phones, email-validator for emails). Regex is for the 95% case; libraries handle the hard 5%.
Common use cases
- Developers building input validation (emails, phone numbers, UUIDs)
- Data engineers cleaning messy CSV / log data
- SRE/DevOps writing alerting regex for log streams
- QA engineers building test assertion patterns
- Security engineers writing SIEM detection rules
- Journalists / researchers extracting specific patterns from unstructured text
- Students learning regex who need working examples with explanation
Best AI model for this
Claude Sonnet 4.5 or GPT-5. Regex is a precise skill — model needs to reason about character classes, quantifier behavior, and anchoring correctly. Haiku-tier models produce patterns that look right but fail on negation and lookaround.
Pro tips
- Provide 3-5 POSITIVE examples (text that should match) AND 3-5 NEGATIVE examples (text that should NOT match). Negative examples are where regex quality is made — they constrain the pattern.
- Always specify your regex engine: PCRE (most languages), POSIX (grep, sed), RE2 (Go), .NET. Lookahead/lookbehind support differs: RE2 and POSIX engines have none, and lookbehind rules vary among the rest.
- If your data is multilingual or contains emoji, flag it upfront. Unicode regex requires different flags and handling.
- For validation regex (email, phone, URL), don't trust a 'perfect' regex — validation is never regex-alone. The Original will explicitly say when to combine regex with other checks.
- Test the generated regex on your REAL data, not just the examples you provided. The Original makes this easier by giving you the edge-case catalog to test against.
- If the use case involves nested structures (HTML, JSON, code), regex is almost always the wrong tool. The 'when to NOT use regex' section will catch this.
Customization tips
- Always provide both positive AND negative examples. The negatives are where quality lives — patterns that only match positives are silently over-eager.
- For production regex, test on a sample of your real data (100+ cases) before shipping. The edge-case catalog is a starting point for that testing, not the final validation.
- If you're matching inside HTML, JSON, or code, reconsider — parsers are almost always better. The 'Should You Use Regex' section will flag this.
- For validation use cases, use the Validation Mode variant — it layers regex with the other checks (DNS lookup for email, Luhn check for credit cards) that validation actually needs.
- Save the breakdown table with the pattern in your codebase. When the pattern needs revision in 6 months, you or the next dev will need to understand WHY each token is there.
Variants
Log Parsing Mode
For SRE/DevOps use cases — optimizes for alert-style log parsing with named capture groups for extracted fields, efficient anchoring, and multiline handling.
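To give a feel for what named capture groups and multiline handling look like in practice, here is a minimal sketch. The log layout (`timestamp LEVEL [service] message`), the group names, and the `parse_errors` helper are all hypothetical; the actual variant output will depend on the log format you supply.

```python
import re

# Hypothetical log layout: "2024-01-15 10:23:46 ERROR [payments] Timeout after 30s"
LOG_LINE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}:\d{2}:\d{2})[ ]'
    r'(?P<level>[A-Z]+)[ ]'
    r'\[(?P<service>[^\]]+)\][ ]'
    r'(?P<message>[^\r\n]*)\r?$',   # \r? keeps a CRLF ending out of the message
    re.MULTILINE,                   # ^ and $ anchor at every line, not just the text
)


def parse_errors(log_text: str) -> list[dict[str, str]]:
    """Return the named fields of every ERROR line in the text."""
    return [
        m.groupdict()
        for m in LOG_LINE.finditer(log_text)
        if m["level"] == "ERROR"
    ]


sample = (
    "2024-01-15 10:23:45 INFO [api] started\r\n"
    "2024-01-15 10:23:46 ERROR [payments] Timeout after 30s\r\n"
)
errors = parse_errors(sample)
```

Named groups let downstream alerting code read `m["service"]` rather than count parentheses, and the explicit `\r?` handles Windows line endings without a separate cleanup pass.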
Validation Mode
For form-input validation. Produces a layered approach: regex for structural shape, plus what OTHER checks belong in validation (DNS lookup for email domain, Luhn check for credit cards, etc.).
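The layering idea can be sketched for the credit-card case: a regex checks the structural shape, then the Luhn checksum catches typos the shape cannot. The helper names and the 13-19 digit length range are assumptions for this example, and passing both checks still says nothing about whether a card actually exists.

```python
import re

# Structural shape only: 13 to 19 ASCII digits (length range is an assumption).
CARD_SHAPE = re.compile(r'\d{13,19}', re.ASCII)


def luhn_ok(number: str) -> bool:
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9   # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def looks_like_card(raw: str) -> bool:
    """Layer 1: regex shape. Layer 2: Luhn checksum."""
    compact = re.sub(r'[\s\-]', '', raw)   # drop spaces and hyphens first
    return CARD_SHAPE.fullmatch(compact) is not None and luhn_ok(compact)


valid = looks_like_card("4111 1111 1111 1111")
typo = looks_like_card("4111 1111 1111 1112")
```

The first input passes both layers; the second has the right shape but fails the checksum, which is exactly the kind of error a regex alone cannot see.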
Scraping / Extraction
For pulling specific patterns out of unstructured text. Returns the regex + the capture group structure + the post-processing clean-up (whitespace, encoding) typically needed.
Frequently asked questions
How do I use the Advanced Regex Composer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Advanced Regex Composer?
Claude Sonnet 4.5 or GPT-5. Regex is a precise skill — model needs to reason about character classes, quantifier behavior, and anchoring correctly. Haiku-tier models produce patterns that look right but fail on negation and lookaround.
Can I customize the Advanced Regex Composer prompt for my use case?
Yes. Every Promptolis Original is designed to be customized. Key levers: provide 3-5 POSITIVE examples (text that should match) AND 3-5 NEGATIVE examples (text that should NOT match), because negative examples are where regex quality is made and they constrain the pattern; and always specify your regex engine: PCRE (most languages), POSIX (grep, sed), RE2 (Go), .NET. Lookahead/lookbehind support differs between engines, and RE2 and POSIX have none.