Regex to Match URLs in Text — Stop Copy-Pasting Broken Patterns

Every project needs to extract URLs from text at some point. Parsing log files for error links. Rendering clickable links in user-generated content. Validating bookmark imports. Extracting references from markdown or HTML.

And every project seems to start the same way: with a regex copied from Stack Overflow, a GitHub comment, or a ten-year-old blog post.

Most of those patterns are wrong. They miss valid URLs with uncommon TLDs, reject perfectly valid internationalized domain names, fail on URLs embedded in parentheses, or — worst of all — match strings that are not URLs at all.

This guide presents one production-tested regex pattern, dissects it piece by piece, and then shows you the practical test suite you should run against any URL-matching pattern before deploying it.

Test patterns interactively as you read using the Regex Tester.

The Naive Pattern Most Developers Start With

/https?:\/\/[^\s]+/g

This matches any string starting with http:// or https:// followed by non-whitespace characters.

Problems:

Matches trailing punctuation: https://example.com). includes the parenthesis and period
Matches incomplete URLs: https:// followed by any single non-whitespace character, such as https://. or https://), counts as a match
No protocol-relative URL support (//example.com)
No support for ftp://, mailto:, or other schemes
Matches garbage like https://///
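
To make these problems concrete, here is a small sketch that runs the naive pattern over three inputs. The helper name naiveExtract and the sample sentences are illustrative assumptions, not part of any library.

```javascript
// The naive pattern from above, applied to inputs that expose its problems.
const NAIVE_URL_REGEX = /https?:\/\/[^\s]+/g;

function naiveExtract(text) {
  // With the g flag, match() returns every match, or null when there is none.
  return text.match(NAIVE_URL_REGEX) || [];
}

const naiveSamples = [
  'See (https://example.com).',        // the trailing ")." is swallowed
  'Broken link: https:///// here',     // slashes only, no host at all
  'Mirror at ftp://files.example.com', // non-HTTP schemes are ignored
];

for (const sample of naiveSamples) {
  console.log(JSON.stringify(sample), '->', naiveExtract(sample));
}
// 'See (https://example.com).'        -> ['https://example.com).']
// 'Broken link: https:///// here'     -> ['https://///']
// 'Mirror at ftp://files.example.com' -> []
```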

The Production Pattern

/(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi

This is a modified version of the pattern used in several open-source markdown renderers and linkification libraries. Let me explain each component.

Pattern Decomposition

Scheme or www. Prefix Match

(https?:\/\/|ftp:\/\/|www\.)

This matches:

http://
https://
ftp://
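www. (no scheme; this is not the same as a protocol-relative //example.com, which the pattern does not match)

The www. alternative catches the common case where users type www.example.com without a scheme. The i flag makes the match case-insensitive, so WWW.EXAMPLE.COM also matches; the g flag makes the regex find every URL in the text instead of stopping at the first.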
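
The prefix group can be checked in isolation with a short sketch. The helper name detectPrefix and the added ^ anchor are assumptions made for this illustration only.

```javascript
// Only the prefix group from the pattern, anchored so it is tested on its own.
// The i flag is what makes 'WWW.' and 'HTTPS://' acceptable.
const PREFIX_REGEX = /^(https?:\/\/|ftp:\/\/|www\.)/i;

function detectPrefix(candidate) {
  const match = PREFIX_REGEX.exec(candidate);
  return match ? match[1] : null; // the captured prefix, or null if none
}

console.log(detectPrefix('HTTPS://Example.com'));  // 'HTTPS://'
console.log(detectPrefix('WWW.EXAMPLE.COM'));      // 'WWW.'
console.log(detectPrefix('//example.com'));        // null (protocol-relative)
console.log(detectPrefix('mailto:a@example.com')); // null (scheme not listed)
```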

Core URL Characters

[^\s<>"{\}|\\^`()\[\]]+

This character class defines what characters are allowed inside a URL. Anything not in this exclusion set is valid. The excluded characters are:

Characters	Reason for Exclusion
`\s`	Whitespace terminates the URL
`<>`	HTML angle brackets: indicate markup boundaries
`"`	Quotes: indicate attribute boundaries
`{}|`	Braces and pipe: not valid unencoded in a URL, so a literal one marks a boundary
`\\`	Escape character in many languages
`^`	Caret in regex contexts
`` ` ``	Backtick: code fence boundaries
`()`	Parentheses: excluded here, re-admitted as balanced pairs below
`[]`	Brackets: excluded outright, so IPv6 literals such as http://[::1]/ are not matched

This handles the most common case: a URL is terminated by whitespace or markup delimiters.
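
A quick way to see where the exclusion class stops is to apply it on its own. This is a sketch; the helper name urlBodyPrefix and the ^ anchor are assumptions for the demonstration.

```javascript
// The exclusion class by itself: how far would a URL body extend from the
// start of the given text?
const URL_BODY_REGEX = /^[^\s<>"{\}|\\^`()\[\]]+/;

function urlBodyPrefix(text) {
  const match = URL_BODY_REGEX.exec(text);
  return match ? match[0] : '';
}

console.log(urlBodyPrefix('example.com/path?x=1">link')); // 'example.com/path?x=1'
console.log(urlBodyPrefix('example.com/a|b'));            // 'example.com/a'
console.log(urlBodyPrefix('example.com. Next'));          // 'example.com.'
console.log(urlBodyPrefix('[::1]/admin'));                // ''
```

Note the third case: a period is not in the exclusion set, so a sentence-ending period stays attached to the match.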

Trailing Punctuation and Balanced Parentheses

(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*

This is the critical refinement. It allows parentheses inside the URL as long as they are balanced:

(example) → allowed (balanced)
( → URL stops at ( (unbalanced)
)text → URL stops before ) unless preceded by (

Without this, Wikipedia URLs like https://en.wikipedia.org/wiki/Regex_(programming) would be truncated at the first parenthesis.

The entire group is optionally repeated (*) to handle URLs with multiple parenthetical segments: https://example.com/a(b)c(d)e.
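
The balanced-parentheses behaviour can be observed with a few inputs. This is a sketch; firstUrl is a hypothetical helper, and the pattern is repeated under a different name so the snippet runs on its own.

```javascript
// Same pattern as in this guide, repeated so the snippet is self-contained.
const PAREN_DEMO_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

function firstUrl(text) {
  const matches = text.match(PAREN_DEMO_REGEX);
  return matches ? matches[0] : null;
}

// A balanced pair inside the URL is kept.
console.log(firstUrl('See https://en.wikipedia.org/wiki/Regex_(programming) for details.'));
// 'https://en.wikipedia.org/wiki/Regex_(programming)'

// A closing parenthesis that belongs to the sentence is left out.
console.log(firstUrl('(see https://example.com/page)'));
// 'https://example.com/page'

// Several parenthetical segments in a row are fine.
console.log(firstUrl('https://example.com/a(b)c(d)e'));
// 'https://example.com/a(b)c(d)e'

// Nested parentheses are not supported: the match stops before the first "(".
console.log(firstUrl('https://example.com/a(b(c))'));
// 'https://example.com/a'
```

The last case is a limit of the approach: the inner class excludes parentheses, so only one level of nesting is recognised.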

Putting It Together

/(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi

This pattern:

Matches http://, https://, ftp://, or www.
Requires at least one valid URL character
Allows balanced parentheses
Stops at whitespace, HTML markup, quotes, or other structural delimiters

Language-Specific Implementations

JavaScript

const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

function extractUrls(text) {
  return text.match(URL_REGEX) || [];
}

function linkify(text) {
  return text.replace(URL_REGEX, (url) => {
    // The regex is case-insensitive, so 'WWW.' must be detected the same way.
    const href = /^www\./i.test(url) ? `https://${url}` : url;
    return `<a href="${href}" target="_blank" rel="noopener noreferrer">${url}</a>`;
  });
}

Python

import re

URL_PATTERN = re.compile(
    r'(https?://|ftp://|www\.)'
    r'[^\s<>"{\}|\\^`()\[\]]+'
    r'(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*',
    re.IGNORECASE
)

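def extract_urls(text: str) -> list[str]:
    # finditer + group(0) returns whole URLs; findall would return only the
    # captured prefix group ('https://', 'www.', ...).
    return [m.group(0) for m in URL_PATTERN.finditer(text)]

def linkify(text: str) -> str:
    def replace_url(match: re.Match) -> str:
        # re.sub passes a Match object to the callback, not a string.
        url = match.group(0)
        href = f'https://{url}' if url.lower().startswith('www.') else url
        return f'<a href="{href}" target="_blank" rel="noopener noreferrer">{url}</a>'
    return URL_PATTERN.sub(replace_url, text)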

Go

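Go's regexp package is built on RE2, which supports neither lookahead nor backreferences. The pattern in this guide needs neither, but for URL extraction in Go a dedicated library such as github.com/mvdan/xurls is usually the more maintainable choice:

// Current releases are imported from mvdan.cc/xurls/v2, where Strict is a
// function that returns a compiled *regexp.Regexp.
import "mvdan.cc/xurls/v2"

// Compile once and reuse.
var strictURL = xurls.Strict()

func ExtractURLs(text string) []string {
    return strictURL.FindAllString(text, -1)
}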

This library maintains its own TLD list and URL validation logic, which is more maintainable than a single regex in Go's engine.

Test Suite

Run these test cases against any URL-matching pattern before deploying:

const testCases = [
  // [input, expected first match, or null when nothing should match]
  ['https://example.com', 'https://example.com'],
  ['http://example.com/path?query=1&lang=en#section', 'http://example.com/path?query=1&lang=en#section'],
  ['https://example.com/path/file.html', 'https://example.com/path/file.html'],
  ['ftp://files.example.com/document.pdf', 'ftp://files.example.com/document.pdf'],
  ['www.example.com', 'www.example.com'],
  ['https://en.wikipedia.org/wiki/Regex_(programming)', 'https://en.wikipedia.org/wiki/Regex_(programming)'],
  ['https://example.com/a(b)c', 'https://example.com/a(b)c'],
  ['https://example.com/中文路径', 'https://example.com/中文路径'],
  ['https://example.com:8080/path', 'https://example.com:8080/path'],
  ['https://192.168.1.1/admin', 'https://192.168.1.1/admin'],
  ['https://subdomain.example.co.uk/path', 'https://subdomain.example.co.uk/path'],
  ['https://example.com/path%20with%20spaces', 'https://example.com/path%20with%20spaces'],
  ['https://example.com?param=value&other=123', 'https://example.com?param=value&other=123'],

  // Should NOT match
  ['example.com', null],        // No scheme or www.
  ['https://', null],           // Empty host
  ['just some text', null],

  // Surrounding delimiters must be left out of the match
  ['<https://example.com>', 'https://example.com'],
  ['(https://example.com)', 'https://example.com'],
  ['"https://example.com"', 'https://example.com'],

  // Known limitation: the pattern does not validate the host, so this
  // matches. Reject it in a later validation pass.
  ['https://.com', 'https://.com'],
];

testCases.forEach(([input, expected]) => {
  const matches = input.match(URL_REGEX);
  const actual = matches === null ? null : matches[0];
  if (actual !== expected) {
    console.log(`FAIL: "${input}" expected ${expected} got ${actual}`);
  }
});

Edge Cases That Break Most Patterns

Trailing Punctuation

Input: Visit https://example.com. It is great.

Bad pattern matches: https://example.com.

Good pattern matches: https://example.com

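The trailing period after the URL is not part of the URL. A character class alone cannot tell this, because a period is also valid inside a URL (example.com, file.html). The pattern in this guide therefore still includes the period in its raw match, and the good result comes from stripping trailing punctuation afterwards.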

Parentheses in URLs

Input: See https://en.wikipedia.org/wiki/Regex_(programming) for details.

Bad pattern: https://en.wikipedia.org/wiki/Regex_(programming or https://en.wikipedia.org/wiki/Regex_

Good pattern: https://en.wikipedia.org/wiki/Regex_(programming)

URL at End of Sentence

Input: The site is https://example.com.

The period terminates the sentence, not the URL. Most patterns either include the period (wrong) or reject the entire URL because the period looks like an invalid character.

Our pattern does not handle this on its own: its character class excludes only whitespace and structural delimiters, so the final period ends up in the match. A post-processing step that strips common trailing punctuation (.,;:!?) is needed for production use.
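
The post-processing step can be sketched as follows. The helper names stripTrailingPunctuation and extractCleanUrls are assumptions for this illustration; the pattern is repeated under a different name so the snippet runs on its own.

```javascript
// Same pattern as in this guide, repeated so the snippet is self-contained.
const SENTENCE_URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// Punctuation that usually belongs to the sentence rather than the URL.
const TRAILING_PUNCTUATION = /[.,;:!?]+$/;

function stripTrailingPunctuation(url) {
  return url.replace(TRAILING_PUNCTUATION, '');
}

function extractCleanUrls(text) {
  const matches = text.match(SENTENCE_URL_REGEX) || [];
  return matches.map(stripTrailingPunctuation);
}

console.log(extractCleanUrls('The site is https://example.com.'));
// ['https://example.com']
console.log(extractCleanUrls('Docs: https://example.com/faq?! More at www.example.org, too.'));
// ['https://example.com/faq', 'www.example.org']
```

The trade-off is that a URL which legitimately ends in one of these characters, such as a bare trailing ?, loses it. For sentence-level extraction that is usually the better error to make.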

Internationalized Domain Names

Input: Check https://例子.example.com for examples.

IDNs (Internationalized Domain Names) use non-ASCII characters. Ensure your pattern includes Unicode characters if you expect international domains. The pattern in this guide does by default, because its character classes are negated: any character that is not explicitly excluded, including non-ASCII, is accepted. In JavaScript this works without the u flag.
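
As a rough illustration, the sketch below extracts the IDN URL from the input above and then asks the built-in URL parser for the ASCII (Punycode) form of the host. The helper name idnHostToAscii is an assumption, and the function assumes the match carries a scheme, since new URL() throws on a bare www. match.

```javascript
// Same pattern as in this guide, repeated so the snippet is self-contained.
const IDN_URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

function idnHostToAscii(text) {
  const matches = text.match(IDN_URL_REGEX);
  if (!matches) return null;
  return {
    matched: matches[0],
    // The URL parser converts non-ASCII labels to their 'xn--' form.
    asciiHost: new URL(matches[0]).hostname,
  };
}

const idnResult = idnHostToAscii('Check https://例子.example.com for examples.');
console.log(idnResult.matched);   // 'https://例子.example.com'
console.log(idnResult.asciiHost); // an 'xn--' label followed by '.example.com'
```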

Many of these edge cases map to Common Regex Mistakes Developers Keep Making — trailing punctuation, character class confusion, and anchor misuse are responsible for most broken URL patterns.

When Regex Is Not Enough

Regex-based URL matching has inherent limitations:

No DNS validation: A pattern may match https://thisdomaindoesnotexistzzz.com which is syntactically valid but non-existent
No TLD validation: https://example.404 passes the regex but .404 is not a valid TLD
No protocol validation: ftp:// is structurally valid but your application may not support FTP links
Context-dependent boundaries: A URL followed by a closing parenthesis in a markdown link ([text](url)) is structurally different from a URL inside sentence parentheses

For critical applications (security scanners, content filters, link preview generators), use regex as a first pass, then validate extracted URLs against a TLD list, perform DNS resolution, or use a URL parsing library. If your regex behaves differently in different environments, read Regex Works in Regex101 but Not in JavaScript — Why. A minimal validation pass with the built-in URL parser:

function validateUrl(url) {
  try {
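    // Give scheme-less candidates a scheme first; new URL() throws on
    // 'www.example.com', which extractUrls can return.
    let candidate = url;
    if (/^www\./i.test(url)) candidate = `https://${url}`;
    else if (url.startsWith('//')) candidate = `https:${url}`;
    const parsed = new URL(candidate);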
    return parsed.protocol === 'https:' || parsed.protocol === 'http:';
  } catch {
    return false;
  }
}

function extractAndValidateUrls(text) {
  return extractUrls(text).filter(validateUrl);
}
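
A TLD check could be layered on in the same style. The sketch below is only an illustration: KNOWN_TLDS is a hypothetical, deliberately tiny allowlist (a real one would be loaded from the IANA list), and hasKnownTld is an assumed helper name.

```javascript
// Hypothetical, deliberately tiny allowlist for the example.
const KNOWN_TLDS = new Set(['com', 'org', 'net', 'uk']);

function hasKnownTld(url) {
  let hostname;
  try {
    hostname = new URL(url).hostname;
  } catch {
    return false; // not parseable as an absolute URL
  }
  // Everything after the last dot, lower-cased.
  const tld = hostname.slice(hostname.lastIndexOf('.') + 1).toLowerCase();
  return KNOWN_TLDS.has(tld);
}

console.log(hasKnownTld('https://subdomain.example.co.uk/path')); // true
console.log(hasKnownTld('https://example.404'));                  // false
console.log(hasKnownTld('www.example.com'));                      // false: no scheme, new URL() throws
```

IP-based URLs such as https://192.168.1.1/admin also fail this check, so apply it only where host names are expected.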

FAQ

What is the best regex for URL matching?

There is no single "best" regex. The right pattern depends on your use case. For linkification in user-generated content, use the pattern in this guide. For security-critical URL validation, combine regex extraction with DNS resolution and TLD verification. For log parsing, consider whether you need to match IP-based URLs, non-standard ports, or unusual schemes.

Should I use a regex or a URL parsing library for validation?

Regex is excellent for extraction from unstructured text. URL parsing libraries are better for validation. Use both: regex to find candidate URLs, then a parser to validate each one.

How do I match URLs without the protocol?

Make the prefix optional with (?:https?://|ftp://|www\.)? and require a domain-like host (at least one dot) after it, but be prepared for more false positives. Without a protocol, every dot-separated string looks like a URL candidate.
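
To see how quickly false positives appear once the prefix is optional, here is a rough sketch. SCHEMELESS_URL_REGEX is a hypothetical pattern written for this illustration, not the pattern from this guide: it requires a dotted host ending in at least two letters, followed by an optional path.

```javascript
// Hypothetical pattern: optional prefix, dotted host, alphabetic last label,
// optional path.
const SCHEMELESS_URL_REGEX = /(?:https?:\/\/|ftp:\/\/|www\.)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:\/[^\s<>"]*)?/gi;

function extractSchemeless(text) {
  return text.match(SCHEMELESS_URL_REGEX) || [];
}

console.log(extractSchemeless('Docs live at example.com/docs and in README.md, e.g. version 1.2.3.'));
// ['example.com/docs', 'README.md']
```

The file name README.md is reported as a URL because nothing in its shape distinguishes it from a domain. Checking the last label against a TLD list reduces this kind of noise but does not eliminate it.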

Why does my URL regex match trailing periods?

Most naive patterns use [^\s]+ which includes periods. A URL followed by a period at the end of a sentence gets the period included in the match. Strip trailing punctuation from matched URLs as a post-processing step.

How do I handle URLs inside markdown or HTML?

Do not use the same regex for raw URLs and URLs inside markup. For markdown, parse the document structure first to distinguish [link](https://example.com) from raw https://example.com. Applying a URL regex to already-rendered HTML can double-encode or corrupt existing links.
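
For simple inputs, the two kinds of URL can be separated with two passes. The sketch below is a simplification and not a substitute for a real markdown parser: MARKDOWN_LINK_REGEX is a hypothetical pattern that ignores nested brackets, titles, and parentheses inside the target, and extractFromMarkdown is an assumed helper name.

```javascript
// Hypothetical, simplified pattern for inline links of the form [text](target).
const MARKDOWN_LINK_REGEX = /\[[^\]]*\]\(([^)\s]+)\)/g;

// Same raw-URL pattern as in this guide, repeated so the snippet is self-contained.
const RAW_URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

function extractFromMarkdown(markdown) {
  const linked = [];
  // Pass 1: collect link targets and blank the links out, so their targets
  // are not picked up a second time as raw URLs.
  const remainder = markdown.replace(MARKDOWN_LINK_REGEX, (whole, target) => {
    linked.push(target);
    return ' ';
  });
  // Pass 2: raw URLs in whatever text is left.
  const raw = remainder.match(RAW_URL_REGEX) || [];
  return { linked, raw };
}

const mdResult = extractFromMarkdown(
  'Read [the docs](https://example.com/docs) or visit https://example.org directly.'
);
console.log(mdResult.linked); // ['https://example.com/docs']
console.log(mdResult.raw);    // ['https://example.org']
```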

Final Thoughts

URL extraction with regex is a solved problem that developers keep solving incorrectly because they copy patterns without understanding the edge cases. Balanced parentheses, trailing punctuation, international domains, and protocol-relative URLs all require explicit handling that most ten-line Stack Overflow answers do not provide.

The pattern in this guide is a starting point, not a final answer. Adapt it to your specific use case. Strip trailing punctuation for sentence-level extraction. Add scheme restrictions for security-sensitive contexts. Combine it with URL parsing for applications that need verified, actionable links. If your URL regex fails on input that looks correct, Why Your Regex Is Not Matching — And How to Fix It covers the diagnostic steps.

For testing your patterns against real-world text — log dumps, user comments, API responses — the Regex Tester provides instant feedback on matches, captures, and edge cases. The URL Encoder/Decoder helps when extracted URLs contain percent-encoded characters that need decoding. And the Base64 Encoder & Decoder is useful when URLs contain Base64-encoded payloads that need inspection.