Toolverse

How to Write Better Regular Expressions: A Practical Guide

Regular expressions are one of the most powerful tools in a developer's toolkit — and one of the most misunderstood. A well-written regex can validate input, extract data, and transform text in a single line. A poorly written one can freeze your application or match things you never intended. This guide covers the patterns, pitfalls, and practices that separate good regex from bad.

The Building Blocks

Every regex is built from a few core concepts. Understanding these deeply is more valuable than memorizing patterns.

Character Classes

A character class matches one character from a set. The syntax is square brackets:

  • [abc] — matches a, b, or c
  • [a-z] — matches any lowercase letter
  • [^0-9] — matches anything that is NOT a digit
  • \d — shorthand for [0-9]
  • \w — shorthand for [a-zA-Z0-9_]
  • \s — whitespace (space, tab, newline)
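As a quick check of the list above, the following sketch tests one character against each class. The object name `samples` and its keys are made up for illustration.

```javascript
// Each entry tests a single character against one class from the list.
const samples = {
  setOfThree: /[abc]/.test("b"),   // true: "b" is in the set
  lowercase: /[a-z]/.test("Q"),    // false: "Q" is uppercase
  notDigit: /[^0-9]/.test("7"),    // false: "7" is a digit
  digit: /\d/.test("7"),           // true
  wordChar: /\w/.test("_"),        // true: underscore counts as a word character
  whitespace: /\s/.test("\t"),     // true: tab is whitespace
};

console.log(samples);
```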

Quantifiers

Quantifiers control how many times a pattern matches:

  • * — zero or more (greedy)
  • + — one or more (greedy)
  • ? — zero or one
  • {3} — exactly 3
  • {2,5} — between 2 and 5

Adding ? after a quantifier makes it lazy (match as little as possible): *?, +?. This is critical when matching content between delimiters. For example, <.*> matches the entire string <a>text</a>, while <.*?> matches only <a>.
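The greedy and lazy behavior described above can be reproduced with a short sketch. The third pattern, a negated character class, is an alternative added here for comparison rather than something the text prescribes; the variable names are illustrative.

```javascript
const html = "<a>text</a>";

// Greedy: .* runs to the end, then backtracks to the LAST ">".
const greedy = html.match(/<.*>/)[0];      // "<a>text</a>"

// Lazy: .*? stops at the FIRST ">" it can reach.
const lazy = html.match(/<.*?>/)[0];       // "<a>"

// Negated class: says "anything except >" directly.
const negated = html.match(/<[^>]*>/)[0];  // "<a>"

console.log({ greedy, lazy, negated });
```

The negated class often states the intent more directly than a lazy quantifier, because it cannot run past the closing delimiter in the first place.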

Anchors

  • ^ — start of string (or line, with m flag)
  • $ — end of string (or line, with m flag)
  • \b — word boundary

Always use anchors in validation patterns. \d+ matches "abc123def" (it finds "123"). ^\d+$ correctly rejects it because the entire string is not digits.
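A minimal sketch of the difference anchors make, using the same input as above plus two extra strings chosen for illustration:

```javascript
const unanchored = /\d+/;
const anchored = /^\d+$/;

const looseResult = unanchored.test("abc123def");  // true: finds "123" inside
const strictResult = anchored.test("abc123def");   // false: whole string must be digits
const allDigits = anchored.test("123");            // true

// \b matches only at word boundaries, so "concat" is skipped.
const wholeWords = "cat concat cat.".match(/\bcat\b/g);  // ["cat", "cat"]

// With the m flag, ^ and $ apply to each line instead of the whole input.
const digitLines = "1\n22\nx".match(/^\d+$/gm);          // ["1", "22"]

console.log({ looseResult, strictResult, allDigits, wholeWords, digitLines });
```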

Capture Groups and Lookaheads

Named Capture Groups

Use named capture groups instead of numbered ones in production code. They survive refactoring: adding or removing a group shifts the numbered indexes, but names stay the same.

// Numbered (fragile): indexes shift if a group is added or removed
const numbered = "2026-03-10".match(/(\d{4})-(\d{2})-(\d{2})/);
const yearByIndex = numbered[1]; // "2026"

// Named (robust): names stay stable when the pattern changes
const named = "2026-03-10".match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const yearByName = named.groups.year; // "2026"

Lookahead and Lookbehind

Zero-width assertions match a position without consuming characters:

  • (?=...) — positive lookahead: position is followed by the pattern
  • (?!...) — negative lookahead: position is NOT followed by the pattern
  • (?<=...) — positive lookbehind: position is preceded by the pattern
  • (?<!...) — negative lookbehind: position is NOT preceded by the pattern

Example: match a number only if it is preceded by $: (?<=\$)\d+ matches "42" in "$42" but not in "42 items".
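The four assertions above can be tried out as follows. The sample strings are invented for illustration, and lookbehind needs an ES2018-capable engine.

```javascript
// Positive lookbehind: digits preceded by "$" (the "$" is not part of the match).
const price = /(?<=\$)\d+/;
const dollars = "$42".match(price)[0];       // "42"
const noDollar = price.test("42 items");     // false

// Positive lookahead: digits followed by "px", without consuming "px".
const pixels = "12px 3em".match(/\d+(?=px)/)[0];   // "12"

// Negative lookahead: "foo" only when "bar" does not follow.
const fooBar = /foo(?!bar)/.test("foobar");  // false
const fooBaz = /foo(?!bar)/.test("foobaz");  // true

// Negative lookbehind: numbers NOT preceded by "$". The \d in the class stops
// the engine from matching the "2" of "$42" (which is preceded by "4", not "$").
const unpriced = "$42 and 17".match(/(?<![$\d])\d+/g);  // ["17"]

console.log({ dollars, noDollar, pixels, fooBar, fooBaz, unpriced });
```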

Practical Patterns

Email Validation

The well-known regex that implements the full RFC 822 address grammar is over 6,000 characters long. In practice, use a simple anchored pattern and verify with a confirmation email:

/^[^\s@]+@[^\s@]+\.[^\s@]+$/

This rejects obvious non-emails while accepting nearly every address you will meet in practice. Over-engineering email validation with regex is a common anti-pattern: stricter patterns end up rejecting valid addresses.
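A minimal sketch of how the simple check might be wrapped, using an anchored form of the pattern above. The names `SIMPLE_EMAIL` and `looksLikeEmail` are hypothetical, and the function is only a syntactic pre-check before the confirmation email, not proof that the mailbox exists.

```javascript
// Hypothetical helper: cheap syntactic pre-check only.
const SIMPLE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function looksLikeEmail(input) {
  // Guard against non-string input, then ignore surrounding whitespace.
  return typeof input === "string" && SIMPLE_EMAIL.test(input.trim());
}

console.log(looksLikeEmail("ada@example.com"));    // true
console.log(looksLikeEmail("ada@example"));        // false: no dot after the @
console.log(looksLikeEmail("ada lovelace@x.com")); // false: whitespace inside
console.log(looksLikeEmail("a@b@c.com"));          // false: more than one @
```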

URL Matching

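A deliberately loose pattern for finding http and https URLs in running text (it also captures trailing punctuation such as a sentence-ending period):

/https?:\/\/[^\s/$.?#].[^\s]*/i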
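As a rough sketch, the pattern can be combined with the g flag to pull URLs out of free text. The sample text and the trailing-punctuation cleanup are assumptions added for illustration.

```javascript
// Same pattern as above, with g added to collect every match.
const URL_PATTERN = /https?:\/\/[^\s/$.?#].[^\s]*/gi;

const text = "Docs: https://example.com/guide?x=1 and HTTP://test.org.";
const found = text.match(URL_PATTERN);
// ["https://example.com/guide?x=1", "HTTP://test.org."]

// [^\s]* swallows the sentence-ending period, so strip trailing punctuation.
const cleaned = found.map((url) => url.replace(/[.,;:!?)]+$/, ""));
// ["https://example.com/guide?x=1", "HTTP://test.org"]

console.log(cleaned);
```

For strict validation rather than extraction, the built-in URL constructor is usually a safer choice than a regex.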

Duplicate Word Detection

Using backreferences to find repeated words like "the the":

/\b(\w+)\s+\1\b/gi
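A short sketch of the backreference in use, both for listing duplicates and for removing them. The sample sentence is invented for illustration.

```javascript
const DUPLICATE = /\b(\w+)\s+\1\b/gi;
const sentence = "Paris in the the spring is is lovely";

// matchAll requires the g flag; m[0] is the full "word word" match.
const duplicates = [...sentence.matchAll(DUPLICATE)].map((m) => m[0]);
// ["the the", "is is"]

// Replace each pair with its first captured word.
const fixed = sentence.replace(DUPLICATE, "$1");
// "Paris in the spring is lovely"

console.log(duplicates, fixed);
```

The leading \b matters: without it, the "is" at the end of "this" followed by a standalone "is" would be reported as a duplicate.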

Common Mistakes

Pattern         | Problem                                        | Better version
.*              | Greedy, matches too much                       | [^"]* for content inside quotes
\d+ for phones  | Too loose, matches any number                  | ^\+?[1-9]\d{1,14}$ (E.164)
[a-zA-Z0-9]+    | ASCII only                                     | [\p{L}\p{N}]+ with u flag
^.*$            | Anchored to the whole input, not to each line  | Add the m flag to match per line
Catastrophic Backtracking

This is the most dangerous regex pitfall. Certain patterns cause the JavaScript engine to try exponentially many match paths, effectively freezing your application. This is known as ReDoS (Regular expression Denial of Service).

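The classic example: ^(a+)+$ applied to the string aaaaab. Because the trailing b can never match, the engine tries on the order of 2^n ways of splitting the n 'a' characters between the inner and outer quantifier before concluding there is no match. With just 25 'a' characters, this can take seconds. With 30, it can take minutes.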

How to avoid it: Never nest quantifiers on the same characters ((a+)+, (a*)*, (a|a)+). Use possessive quantifiers or atomic groups when your engine supports them (JavaScript's built-in engine does not). In performance-critical applications, consider RE2-style engines (such as Go's regexp package), which guarantee linear-time matching and, as a trade-off, do not support backreferences or lookaround.
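One way to apply the "no nested quantifiers" rule is to flatten the pattern. The sketch below assumes the anchored form ^(a+)+$ of the classic example and shows that ^a+$ accepts the same strings. The risky pattern is only run on short inputs here, on purpose.

```javascript
const risky = /^(a+)+$/; // nested quantifier: exponential on failing input
const safe = /^a+$/;     // flattened: same language, linear backtracking

// Both patterns agree on every short sample.
const shortSamples = ["a", "aaaa", "", "aab", "b"];
const agree = shortSamples.every((s) => risky.test(s) === safe.test(s));

// The flattened pattern stays fast on a long string that cannot match.
// Do NOT run `risky` on this input: it may not finish in any reasonable time.
const longInput = "a".repeat(50000) + "b";
const longResult = safe.test(longInput); // false

console.log({ agree, longResult });
```

When a pattern cannot be flattened, capping the input length before matching is a common extra safeguard.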

Flags That Matter

  • g (global) — find all matches, not just the first
  • i (case-insensitive) — /hello/i matches "Hello"
  • m (multiline) — ^ and $ match line boundaries
  • s (dotAll) — . matches newlines
  • u (unicode) — enables Unicode property escapes like \p{L}

The ECMAScript specification also defines the v flag (ES2024), which enables set operations in character classes: [\p{Letter}&&\p{ASCII}] for intersection and [\p{Letter}--[aeiou]] for subtraction.
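The effect of the g, i, m, s, and u flags can be seen side by side in a small sketch. The sample text is made up, and the v flag is left out because it needs a recent runtime.

```javascript
const text = "Hello\nhello world";

const caseSensitive = text.match(/hello/g);     // ["hello"]
const ignoreCase = text.match(/hello/gi);       // ["Hello", "hello"]

// Without m, ^ matches only at the very start of the input.
const startOnly = text.match(/^hello/gi);       // ["Hello"]
const everyLine = text.match(/^hello/gim);      // ["Hello", "hello"]

// Without s, "." refuses to match the newline between the two words.
const dotNoNewline = /Hello.hello/.test(text);  // false
const dotAll = /Hello.hello/s.test(text);       // true

// u enables \p{...}; "\u00EF" is a precomposed ï.
const unicodeLetters = /^\p{L}+$/u.test("na\u00EFve"); // true

console.log({ caseSensitive, ignoreCase, startOnly, everyLine, dotNoNewline, dotAll, unicodeLetters });
```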

When NOT to Use Regex

  • Parsing HTML/XML — Use a DOM parser. Regex cannot handle nested tags.
  • Complex JSON — Use JSON.parse().
  • Arithmetic expressions — Use a proper parser or the Function constructor (carefully).
  • When a simple string.includes() or split() works — regex adds complexity. Use the simplest tool that gets the job done.
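For the simpler cases in the list above, the built-in alternatives look roughly like this. The sample data and variable names are invented for illustration.

```javascript
const csvLine = "red,green,blue";

// Substring check: no pattern needed.
const hasGreen = csvLine.includes("green");  // true

// Fixed delimiter: split does the job.
const colors = csvLine.split(",");           // ["red", "green", "blue"]

// Nested JSON: parse it instead of pattern-matching the text.
const payload = '{"user":{"name":"Ada","tags":["x","y"]}}';
const userName = JSON.parse(payload).user.name; // "Ada"

console.log({ hasGreen, colors, userName });
```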

Key Takeaways

  • Use anchors (^, $) in all validation patterns
  • Prefer named groups over numbered groups for maintainability
  • Never nest quantifiers on the same characters — it causes catastrophic backtracking
  • Simple email validation + confirmation email beats a complex regex
  • Use the u flag for Unicode-aware patterns
  • Test every regex with both matching and non-matching inputs

Practice your patterns in real time with our Regex Tester — it highlights matches as you type, shows capture groups, and supports all JavaScript flags including global, multiline, and dotAll.

Frequently Asked Questions

What is catastrophic backtracking in regex?
Catastrophic backtracking occurs when a regex engine tries exponentially many paths to match a pattern. For example, ^(a+)+$ on the string 'aaaaab' causes the engine to try on the order of 2^n combinations, where n is the number of 'a' characters. This can freeze your application. Avoid nested quantifiers on the same characters.

Should I use regex to validate email addresses?
Use a simple pattern like ^[^\s@]+@[^\s@]+\.[^\s@]+$ for basic validation, then verify with a confirmation email. The regex that implements the full RFC 822 address grammar is over 6,000 characters long and still does not guarantee the address exists.

What is the difference between greedy and lazy quantifiers?
Greedy quantifiers (*, +, ?) match as much text as possible, then backtrack. Lazy quantifiers (*?, +?, ??) match as little as possible, then expand. Use lazy quantifiers when you need the shortest match, such as extracting content between HTML tags.