Regex to Match URLs in Text — Stop Copy-Pasting Broken Patterns
Every project needs to extract URLs from text at some point. Parsing log files for error links. Rendering clickable links in user-generated content. Validating bookmark imports. Extracting references from markdown or HTML.
And every project seems to start the same way: with a regex copied from Stack Overflow, a GitHub comment, or a ten-year-old blog post.
Most of those patterns are wrong. They miss valid URLs with uncommon TLDs, reject perfectly valid internationalized domain names, fail on URLs embedded in parentheses, or — worst of all — match strings that are not URLs at all.
This guide presents one production-tested regex pattern, dissects it piece by piece, and then shows you the practical test suite you should run against any URL-matching pattern before deploying it.
Test patterns interactively as you read using the Regex Tester.
The Naive Pattern Most Developers Start With
/https?:\/\/[^\s]+/g
This matches any string starting with http:// or https:// followed by non-whitespace characters.
Problems:
- Matches trailing punctuation: https://example.com). includes the parenthesis and period
- Matches incomplete URLs: https://alone followed by anything
- No protocol-relative URL support (//example.com)
- No support for ftp://, mailto:, or other schemes
- Matches garbage like https://///
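To see these problems concretely, here is a minimal sketch that runs the naive pattern over a few sample strings. The sample inputs and the helper name naiveExtract are made up for illustration.

```javascript
// The naive pattern from above: a scheme followed by any run of non-whitespace.
const NAIVE_URL_REGEX = /https?:\/\/[^\s]+/g;

// Hypothetical sample inputs, one per problem.
const samples = [
  'See (https://example.com).',   // the trailing ")" and "." are swallowed
  'Broken link: https://///',     // garbage after the scheme still matches
  'Mirror at ftp://example.com',  // ftp:// is not recognised at all
];

function naiveExtract(text) {
  return text.match(NAIVE_URL_REGEX) || [];
}

for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', JSON.stringify(naiveExtract(sample)));
}
```

The first sample yields https://example.com). with both trailing characters attached, the second yields https://///, and the third yields nothing.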
The Production Pattern
/(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi
This is a modified version of the pattern used in several open-source markdown renderers and linkification libraries. Let me explain each component.
Pattern Decomposition
Scheme or www. Prefix Match
(https?:\/\/|ftp:\/\/|www\.)
This matches:
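- http://
- https://
- ftp://
- www. (no scheme)
The www. alternative catches the common case where users type www.example.com without a protocol. The i flag makes the match case-insensitive, so WWW.EXAMPLE.COM also matches; the g flag makes it global, so every URL in the input is found rather than only the first. Protocol-relative URLs such as //example.com are not matched.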
Core URL Characters
[^\s<>"{\}|\\^`()\[\]]+
This character class defines what characters are allowed inside a URL. Anything not in this exclusion set is valid. The excluded characters are:
| Characters | Reason for Exclusion |
|---|---|
| \s | Whitespace terminates the URL |
| <> | HTML angle brackets: indicate markup boundaries |
| " | Quotes: indicate attribute boundaries |
| {} | Braces: not valid unencoded in a URL |
| \| | Pipe: not valid unencoded in a URL |
| \\ | Backslash: escape character in many languages |
| ^ | Caret: not valid unencoded in a URL |
| ` | Backtick: code fence boundaries |
| () | Parentheses: handled by the balanced matching below |
| [] | Brackets: excluded outright, which also rules out bracketed IPv6 hosts such as http://[::1]/ |
This handles the most common case: a URL is terminated by whitespace or markup delimiters.
Balanced Parentheses
(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*
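This is the critical refinement. It allows parentheses inside the URL as long as they are balanced and not nested:
- (example) → allowed (balanced)
- ( → URL stops at ( (unbalanced)
- )text → URL stops before ) unless preceded by a matching (
A nested pair such as a(b(c)d) ends the match at the first (, because the parenthesized group cannot itself contain parentheses.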
Without this, Wikipedia URLs like https://en.wikipedia.org/wiki/Regex_(programming) would be truncated at the first parenthesis.
The entire group is optionally repeated (*) to handle URLs with multiple parenthetical segments: https://example.com/a(b)c(d)e.
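The behaviour is easier to trust once you have seen it run. The sketch below applies the full pattern to a handful of inputs; the helper name firstUrl and the sample strings are illustrative, not part of any library.

```javascript
const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// Return the first URL found in the text, or null if there is none.
function firstUrl(text) {
  const matches = text.match(URL_REGEX);
  return matches ? matches[0] : null;
}

const samples = [
  'See https://en.wikipedia.org/wiki/Regex_(programming) for details.', // balanced pair kept
  '(https://example.com)',               // wrapping parentheses stay outside the match
  'Unclosed: https://example.com/a(b',   // match ends before the unbalanced "("
  'https://example.com/a(b)c(d)e',       // several balanced segments
  'Nested: https://example.com/a(b(c)d)', // nesting is not supported; match ends at "/a"
];

for (const sample of samples) {
  console.log(firstUrl(sample));
}
```

The last sample shows the limit of the approach: one level of parentheses is handled, deeper nesting is not.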
Putting It Together
/(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi
This pattern:
- Matches http://, https://, ftp://, or www.
- Requires at least one valid URL character
- Allows balanced parentheses
- Stops at whitespace, HTML markup, quotes, or other structural delimiters
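Because the same character class appears three times, the one-line literal is hard to review. One option, sketched here, is to assemble the pattern from named string parts with the RegExp constructor. The constant names (EXCLUDED, URL_CHAR, PREFIX, PAREN_GROUP) are my own labels for the pieces described above, and the result is intended to behave like the literal, not to replace it.

```javascript
// Body of the exclusion class: whitespace, angle brackets, double quote,
// braces, pipe, backslash, caret, backtick, parentheses, and square brackets.
const EXCLUDED = '\\s<>"{}|\\\\^`()\\[\\]';
const URL_CHAR = '[^' + EXCLUDED + ']';

// Scheme or www. prefix, then one balanced (non-nested) parenthesized run.
const PREFIX = '(https?://|ftp://|www\\.)';
const PAREN_GROUP = '\\(' + URL_CHAR + '*\\)';

const COMPOSED_URL_REGEX = new RegExp(
  PREFIX + URL_CHAR + '+(?:' + URL_CHAR + '|' + PAREN_GROUP + ')*',
  'gi'
);

function extractComposed(text) {
  return text.match(COMPOSED_URL_REGEX) || [];
}

console.log(extractComposed('Docs: www.example.com/a(b)c and <https://example.org/x>'));
```

If you later change what terminates a URL, you edit EXCLUDED once instead of hunting for three copies of the class.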
Language-Specific Implementations
JavaScript
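const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// Return every URL-like match in the text, or an empty array if there are none.
function extractUrls(text) {
  return text.match(URL_REGEX) || [];
}

// Wrap each match in an anchor tag. Assumes `text` is already HTML-escaped.
function linkify(text) {
  return text.replace(URL_REGEX, (url) => {
    // The regex is case-insensitive, so check the www. prefix the same way.
    const href = /^www\./i.test(url) ? `https://${url}` : url;
    return `<a href="${href}" target="_blank" rel="noopener noreferrer">${url}</a>`;
  });
}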
Python
import re

URL_PATTERN = re.compile(
    r'(?:https?://|ftp://|www\.)'  # non-capturing, so findall() returns whole matches
    r'[^\s<>"{\}|\\^`()\[\]]+'
    r'(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*',
    re.IGNORECASE
)

def extract_urls(text: str) -> list[str]:
    return URL_PATTERN.findall(text)

def linkify(text: str) -> str:
    # re.sub() passes a Match object to the callback, not a string.
    def replace_url(match: re.Match) -> str:
        url = match.group(0)
        href = f'https://{url}' if url.lower().startswith('www.') else url
        return f'<a href="{href}" target="_blank" rel="noopener noreferrer">{url}</a>'
    return URL_PATTERN.sub(replace_url, text)
Go
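Go's regexp package uses RE2 syntax, which has no lookahead or backreferences, so many URL patterns written for JavaScript or PCRE will not compile unchanged. For URL extraction in Go, consider a dedicated library such as xurls (current import path mvdan.cc/xurls/v2; the project is hosted at github.com/mvdan/xurls):
import "mvdan.cc/xurls/v2"

// ExtractURLs returns every URL in text that has an explicit scheme.
// In v2 of the library, Strict is a function that returns a compiled regexp.
func ExtractURLs(text string) []string {
    return xurls.Strict().FindAllString(text, -1)
}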
This library maintains its own TLD list and URL validation logic, which is more maintainable than a single regex in Go's engine.
Test Suite
Run these test cases against any URL-matching pattern before deploying. Each case pairs an input with the exact string the pattern should extract; a case with no second element expects the whole input, and null means nothing should match:
const testCases = [
  // The whole input is the URL
  ['https://example.com'],
  ['http://example.com/path?query=1&lang=en#section'],
  ['https://example.com/path/file.html'],
  ['ftp://files.example.com/document.pdf'],
  ['www.example.com'],
  ['https://en.wikipedia.org/wiki/Regex_(programming)'],
  ['https://example.com/a(b)c'],
  ['https://example.com/中文路径'],
  ['https://example.com:8080/path'],
  ['https://192.168.1.1/admin'],
  ['https://subdomain.example.co.uk/path'],
  ['https://example.com/path%20with%20spaces'],
  ['https://example.com?param=value&other=123'],
  // Nothing should match
  ['example.com', null], // No scheme or www.
  ['https://', null], // Empty host
  ['https://.com', null], // Invalid host; the pattern in this guide fails this case, so catch it in a validation step
  ['just some text', null],
  // Only the inner URL should match, without the wrapping characters
  ['<https://example.com>', 'https://example.com'],
  ['(https://example.com)', 'https://example.com'],
  ['"https://example.com"', 'https://example.com'],
];
// The default applies only when the second element is missing, so null is preserved.
testCases.forEach(([input, expected = input]) => {
  const matches = input.match(URL_REGEX);
  const actual = matches ? matches[0] : null;
  if (actual !== expected) {
    console.log(`FAIL: "${input}" expected ${expected} got ${actual}`);
  }
});
Edge Cases That Break Most Patterns
Trailing Punctuation
Input: Visit https://example.com. It is great.
Raw match: https://example.com.
Desired result: https://example.com
The trailing period is not part of the URL, but a period is a legal URL character, so any pattern built on a character class (including the one in this guide) captures it. Strip trailing punctuation from each match as a post-processing step.
Parentheses in URLs
Input: See https://en.wikipedia.org/wiki/Regex_(programming) for details.
Bad pattern: https://en.wikipedia.org/wiki/Regex_(programming or https://en.wikipedia.org/wiki/Regex_
Good pattern: https://en.wikipedia.org/wiki/Regex_(programming)
URL at End of Sentence
Input: The site is https://example.com.
The period terminates the sentence, not the URL. Most patterns include the period (wrong); stricter ones reject the entire URL because the trailing period looks like an invalid character.
The pattern in this guide falls into the first group. Its character class does not exclude periods, because they are legal in hosts and paths, so the raw match here is https://example.com. with the period attached. For production use, add a post-processing step that strips common trailing punctuation (.,;:!?) from each match.
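One way to do that stripping is sketched below. The helper name extractCleanUrls and the exact set of stripped characters are assumptions; adjust the set to the kind of text you process.

```javascript
const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// One or more sentence-punctuation characters at the very end of a match.
const TRAILING_PUNCTUATION = /[.,;:!?]+$/;

// Extract URLs, then drop punctuation that most likely belongs to the sentence.
function extractCleanUrls(text) {
  const matches = text.match(URL_REGEX) || [];
  return matches.map((url) => url.replace(TRAILING_PUNCTUATION, ''));
}

console.log(extractCleanUrls('The site is https://example.com.'));
console.log(extractCleanUrls('Try https://example.com/a?b=1, then https://example.org!'));
```

The trade-off is that a URL which genuinely ends in one of these characters is shortened too, so this fits prose-like input better than machine-generated logs.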
Internationalized Domain Names
Input: Check https://例子.example.com for examples.
IDNs (Internationalized Domain Names) use non-ASCII characters. Ensure your pattern accepts them if you expect international domains. The pattern in this guide does, because it is built from a negated character class: any character not explicitly excluded is allowed, non-ASCII included. In JavaScript this works without the u flag.
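If downstream code needs the ASCII form of an internationalized host, the standard URL class can produce it after extraction. The following is a rough sketch; extractWithAsciiHost is a hypothetical helper, and it assumes a runtime (a browser or a recent Node.js) whose URL implementation converts hosts to Punycode.

```javascript
const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// Pair each raw match with the ASCII (Punycode) form of its host.
// Matches that the URL parser rejects are skipped.
function extractWithAsciiHost(text) {
  const results = [];
  for (const raw of text.match(URL_REGEX) || []) {
    try {
      // www. matches carry no scheme, so add one before parsing.
      const parsed = new URL(/^www\./i.test(raw) ? `https://${raw}` : raw);
      results.push({ raw, asciiHost: parsed.hostname });
    } catch {
      // Not parseable as a URL; leave it out.
    }
  }
  return results;
}

console.log(extractWithAsciiHost('Check https://例子.example.com for examples.'));
```

The raw field keeps the text as the user wrote it, while asciiHost holds a label beginning with xn-- that is safe to pass to DNS or allow-list checks.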
Many of these edge cases map to Common Regex Mistakes Developers Keep Making — trailing punctuation, character class confusion, and anchor misuse are responsible for most broken URL patterns.
When Regex Is Not Enough
Regex-based URL matching has inherent limitations:
- No DNS validation: A pattern may match https://thisdomaindoesnotexistzzz.com, which is syntactically valid but non-existent
- No TLD validation: https://example.404 passes the regex but .404 is not a valid TLD
- No protocol validation: ftp:// is structurally valid but your application may not support FTP links
- Context-dependent boundaries: A URL followed by a closing parenthesis in a markdown link ([text](url)) is structurally different from a URL inside sentence parentheses
For critical applications — security scanners, content filters, link preview generators — use regex as a first pass, then validate extracted URLs against a TLD list, perform DNS resolution, or use a URL parsing library. If your regex behaves differently in different environments, read Regex Works in Regex101 but Not in JavaScript — Why:
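function validateUrl(url) {
  let candidate = url;
  if (candidate.startsWith('//')) {
    candidate = `https:${candidate}`; // protocol-relative input
  } else if (/^www\./i.test(candidate)) {
    candidate = `https://${candidate}`; // www. matches carry no scheme
  }
  try {
    const parsed = new URL(candidate);
    // Accept only web links; ftp: and any other scheme are rejected here.
    return parsed.protocol === 'https:' || parsed.protocol === 'http:';
  } catch {
    return false; // new URL() throws on input it cannot parse
  }
}

function extractAndValidateUrls(text) {
  return extractUrls(text).filter(validateUrl);
}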
FAQ
What is the best regex for URL matching?
There is no single "best" regex. The right pattern depends on your use case. For linkification in user-generated content, use the pattern in this guide. For security-critical URL validation, combine regex extraction with DNS resolution and TLD verification. For log parsing, consider whether you need to match IP-based URLs, non-standard ports, or unusual schemes.
Should I use a regex or a URL parsing library for validation?
Regex is excellent for extraction from unstructured text. URL parsing libraries are better for validation. Use both: regex to find candidate URLs, then a parser to validate each one.
How do I match URLs without the protocol?
Make the prefix optional by writing it as (?:https?://|ftp://|www\.)? and follow it with a pattern that requires a dot-separated host, but be prepared for more false positives. Without a protocol, every dot-separated string looks like a URL candidate.
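As an illustration of that trade-off, here is a deliberately loose sketch. LOOSE_URL_REGEX is a hypothetical pattern written for this example, not the guide's production pattern: it makes the prefix optional and requires a dotted host whose last label is at least two letters.

```javascript
// Optional prefix, then one or more "label." parts, then an alphabetic
// final label of 2+ letters, then an optional path.
const LOOSE_URL_REGEX = /(?:https?:\/\/|ftp:\/\/|www\.)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:\/[^\s<>"()]*)?/gi;

function extractLoose(text) {
  return text.match(LOOSE_URL_REGEX) || [];
}

// "example.com/docs" is wanted; "README.md" is a file name that also matches.
console.log(extractLoose('Docs live at example.com/docs and in README.md'));
```

The file name is picked up alongside the real link, which is the kind of false positive to expect. Checking the final label against a TLD list reduces the noise but does not remove it, since some file extensions are also valid TLDs.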
Why does my URL regex match trailing periods?
Most naive patterns use [^\s]+ which includes periods. A URL followed by a period at the end of a sentence gets the period included in the match. Strip trailing punctuation from matched URLs as a post-processing step.
How do I handle URLs inside markdown or HTML?
Do not use the same regex for raw URLs and URLs inside markup. For markdown, parse the document structure first to distinguish [link](https://example.com) from raw https://example.com. Applying a URL regex to already-rendered HTML can double-encode or corrupt existing links.
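For markdown, a rough two-pass approach looks like the sketch below: collect inline link targets first, blank them out, then run the raw-URL pattern over what is left. MARKDOWN_LINK_REGEX is a simplified assumption (no nested brackets in the label, at most one level of parentheses in the target, no reference-style links), so treat it as a stand-in for a real markdown parser.

```javascript
const URL_REGEX = /(https?:\/\/|ftp:\/\/|www\.)[^\s<>"{\}|\\^`()\[\]]+(?:[^\s<>"{\}|\\^`()\[\]]|\([^\s<>"{\}|\\^`()\[\]]*\))*/gi;

// Simplified inline link: [label](target), where the target may contain
// balanced, non-nested parentheses.
const MARKDOWN_LINK_REGEX = /\[([^\]]*)\]\(([^()\s]*(?:\([^()\s]*\)[^()\s]*)*)\)/g;

function extractFromMarkdown(source) {
  const linked = [];
  // Replace each markdown link with spaces so it cannot be matched again
  // as a bare URL, while keeping character offsets unchanged.
  const remainder = source.replace(MARKDOWN_LINK_REGEX, (whole, label, target) => {
    linked.push(target);
    return ' '.repeat(whole.length);
  });
  const bare = remainder.match(URL_REGEX) || [];
  return { linked, bare };
}

console.log(extractFromMarkdown(
  'Read [the docs](https://example.com/docs) or visit https://example.org/raw today'
));
```

Keeping the two lists separate lets you leave existing links untouched and linkify only the bare URLs.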
Final Thoughts
URL extraction with regex is a solved problem that developers keep solving incorrectly because they copy patterns without understanding the edge cases. Balanced parentheses, trailing punctuation, international domains, and protocol-relative URLs all require explicit handling that most ten-line Stack Overflow answers do not provide.
The pattern in this guide is a starting point, not a final answer. Adapt it to your specific use case. Strip trailing punctuation for sentence-level extraction. Add scheme restrictions for security-sensitive contexts. Combine it with URL parsing for applications that need verified, actionable links. If your URL regex fails on input that looks correct, Why Your Regex Is Not Matching — And How to Fix It covers the diagnostic steps.
For testing your patterns against real-world text — log dumps, user comments, API responses — the Regex Tester provides instant feedback on matches, captures, and edge cases. The URL Encoder/Decoder helps when extracted URLs contain percent-encoded characters that need decoding. And the Base64 Encoder & Decoder is useful when URLs contain Base64-encoded payloads that need inspection.