A regular expression is a group of characters or symbols used to find a specific pattern in a text.
A regular expression is a pattern that is matched against a subject string from left to right. Because "regular expression" is a mouthful, you will usually find the term abbreviated as "regex" or "regexp". Regular expressions are used for replacing text within a string, validating forms, extracting a substring from a string based on a pattern match, and much more.
Imagine you are writing an application and you want to set the rules for when a user chooses their username. We want to allow the username to contain letters, numbers, underscores and hyphens. We also want to limit the number of characters in the username so it does not look ugly. We use the following regular expression to validate a username:
The above regular expression accepts the strings `john_doe`, `jo-hn_doe` and `john12_as`. It does not match `Jo` because that string contains an uppercase letter and is also too short.
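
The username rules described above can be sketched in Python. The pattern below is an assumption reconstructed from the description (lowercase letters, digits, underscore and hyphen); in particular the length bounds of 3 and 15 are illustrative, and `is_valid_username` is a hypothetical helper name.

```python
import re

# Assumed pattern: lowercase letters, digits, underscore and hyphen.
# The bounds {3,15} are illustrative, not taken from the text.
USERNAME_RE = re.compile(r"^[a-z0-9_-]{3,15}$")

def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the assumed username rules."""
    return USERNAME_RE.fullmatch(name) is not None

for candidate in ["john_doe", "jo-hn_doe", "john12_as", "Jo"]:
    print(candidate, is_valid_username(candidate))
```

With these assumed bounds the first three candidates are accepted and `Jo` is rejected, for both of the reasons given above.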
In its simplest form, a regular expression is just a pattern of letters and digits that we use to search a text. For example, the regular expression `cat` means: the letter `c`, followed by the letter `a`, followed by the letter `t`.

`"cat"` => The cat sat on the mat
The regular expression `123` matches the string "123". The regular expression is matched against an input string by comparing each character in the regular expression to each character in the input string, one after another. Regular expressions are normally case-sensitive, so the regular expression `Cat` would not match the string "cat".

`"Cat"` => The cat sat on the Cat
Meta characters are the building blocks of regular expressions. Meta characters do not stand for themselves but are instead interpreted in some special way. Some meta characters have a special meaning when written inside square brackets. The meta characters are as follows:
| Meta character | Description |
|---|---|
| . | Period matches any single character except a line break. |
| [ ] | Character class. Matches any character contained between the square brackets. |
| [^ ] | Negated character class. Matches any character that is not contained between the square brackets. |
| * | Matches 0 or more repetitions of the preceding symbol. |
| + | Matches 1 or more repetitions of the preceding symbol. |
| ? | Makes the preceding symbol optional. |
| {n,m} | Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol. |
| (xyz) | Character group. Matches the characters xyz in that exact order. |
| \| | Alternation. Matches either the characters before or the characters after the symbol. |
| \ | Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ \| |
| ^ | Matches the beginning of the input. |
| $ | Matches the end of the input. |
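
A few rows of the table can be tried out directly. The sketch below assumes Python's `re` module; the sample strings and the `full_matches` helper are made up for illustration.

```python
import re

def full_matches(pattern, text):
    """Return the full text of every non-overlapping match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

sentence = "The car is parked in the garage."

# Quantifiers: * (zero or more), + (one or more), ? (optional), {n,m} (bounded)
print(full_matches(r"ar*", "a ar arr"))      # ['a', 'ar', 'arr']
print(full_matches(r"ar+", "a ar arr"))      # ['ar', 'arr']
print(full_matches(r"ar?", "a ar arr"))      # ['a', 'ar', 'ar']
print(full_matches(r"[0-9]{2,3}", "9.9997 rounds to 10.0"))  # ['999', '10']

# Group with alternation: c, g or p, followed by "ar"
print(full_matches(r"(c|g|p)ar", sentence))  # ['car', 'par', 'gar']

# Anchors: ^ ties the match to the start of the input, $ to the end
print(full_matches(r"^(T|t)he", sentence))   # ['The']

# Escaping: \. matches a literal period instead of "any character"
print(full_matches(r"ge\.$", sentence))      # ['ge.']
```

The helper collects `group(0)` from `re.finditer` because `re.findall` returns only the captured group when a pattern contains parentheses.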
The full stop `.` is the simplest example of a meta character. The meta character `.` matches any single character. It will not match return or new line characters. For example, the regular expression `.ar` means: any character, followed by the letter `a`, followed by the letter `r`.

`".ar"` => The car parked in the garage.
Character sets are also called character classes. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters and ranges inside the square brackets doesn't matter. For example, the regular expression `[Tt]he` means: an uppercase `T` or lowercase `t`, followed by the letter `h`, followed by the letter `e`.

`"[Tt]he"` => The car parked in the garage.
A period inside a character set, however, means a literal period. The regular expression `ar[.]` means: a lowercase character `a`, followed by the letter `r`, followed by a period.

`"ar[.]"` => A garage is a good place to park a car.

In general, the caret symbol represents the start of the string, but when it is typed directly after the opening square bracket it negates the character set. For example, the regular expression `[^c]ar` means: any character except `c`, followed by the character `a`, followed by the letter `r`.

`"[^c]ar"` => The car parked in the garage.

The meta characters `+`, `*` and `?` specify how many times a subpattern can occur. These meta characters act differently in different situations.

The symbol `*` matches zero or more repetitions of the preceding matcher. The regular expression `a*` means: zero or more repetitions of the preceding lowercase character `a`. If it appears after a character set or class, it repeats the whole character set. For example, the regular expression `[a-z]*` means: any number of lowercase letters in a row.

`"[a-z]*"` => The car parked in the garage #21.

The `*` symbol can be used with the meta character `.` to match any string of characters: `.*`. It can also be used with the whitespace character `\s` to match a string of whitespace characters. For example, the expression `\s*cat\s*` means: zero or more whitespace characters, followed by a lowercase `c`, followed by a lowercase `a`, followed by a lowercase `t`, followed by zero or more whitespace characters.

`"\s*cat\s*"` => The fat cat sat on the cat.

The symbol `+` matches one or more repetitions of the preceding character. For example, the regular expression `c.+t` means: a lowercase letter `c`, followed by at least one character, followed by a lowercase `t`. Because `+` is greedy, the `t` in question is the last `t` in the sentence.

`"c.+t"` => The fat cat sat on the mat.

In regular expressions, the meta character `?` makes the preceding character optional: it matches zero or one occurrence of the preceding character. For example, the regular expression `[T]?he` means: an optional uppercase letter `T`, followed by a lowercase `h`, followed by a lowercase `e`.

`"[T]he"` => The car is parked in the garage.

`"[T]?he"` => The car is parked in the garage.

The first pattern matches only `The`; the second also matches the `he` in `the`.

In regular expressions, braces (also called quantifiers) specify the number of times that a character or a group of characters can be repeated. For example, the regular expression `[0-9]{2,3}` means: match at least 2 digits but not more than 3 (characters in the range 0 to 9).

`"[0-9]{2,3}"` => The number was 9.9997 but we rounded it off to 10.0.

We can leave out the second number. For example, the regular expression `[0-9]{2,}` means: match 2 or more digits. If we also remove the comma, the regular expression `[0-9]{2}` means: match exactly 2 digits.

`"[0-9]{2,}"` => The number was 9.9997 but we rounded it off to 10.0.

`"[0-9]{2}"` => The number was 9.9997 but we rounded it off to 10.0.

A character group is a group of sub-patterns written inside parentheses `(...)`. As discussed before, a quantifier placed after a character repeats that character, but a quantifier placed after a character group repeats the whole group. For example, the regular expression `(ab)*` matches zero or more repetitions of the sequence `ab`. We can also use the alternation meta character `|` inside a character group. For example, the regular expression `(c|g|p)ar` means: a lowercase `c`, `g` or `p`, followed by the character `a`, followed by the character `r`.

`"(c|g|p)ar"` => The car is parked in the garage.

In regular expressions, the vertical bar `|` is used to define alternation. Alternation is like a condition between multiple expressions. You may be thinking that a character set and alternation work the same way, but the big difference is that a character set works at the character level while alternation works at the expression level. For example, the regular expression `(T|t)he|car` means: either an uppercase `T` or lowercase `t` followed by `h` and `e`, or a lowercase `c` followed by `a` and `r`.

`"(T|t)he|car"` => The car is parked in the garage.

The backslash `\` is used in regular expressions to escape the next character. This allows us to specify a symbol as a matching character, including the reserved characters `{ } [ ] / \ + * . $ ^ | ?`. To use a special character as a matching character, prepend `\` to it. For example, the regular expression `.` matches any character except a new line. To match a literal `.` in an input string, escape it: the regular expression `(f|c|m)at\.?` means: a lowercase letter `f`, `c` or `m`, followed by a lowercase `a`, followed by a lowercase `t`, followed by an optional `.` character.

`"(f|c|m)at\.?"` => The fat cat sat on the mat.

In regular expressions, we use anchors to check whether the matching symbol is the first or the last symbol of the input string. Anchors are of two types: the caret `^`, which checks whether the matching character is the first character of the input, and the dollar sign `$`, which checks whether the matching character is the last character of the input string.

The caret `^` is used to check whether the matching character is the first character of the input string. If we apply the regular expression `^a` to the input string `abc`, it matches `a`. But if we apply the regular expression `^b` to the same input, it does not match anything, because in `abc` the character `b` is not the starting symbol. Let's look at another regular expression, `^(T|t)he`, which means: an uppercase `T` or lowercase `t` at the start of the input string, followed by a lowercase `h`, followed by a lowercase `e`.

`"(T|t)he"` => The car is parked in the garage.

`"^(T|t)he"` => The car is parked in the garage.

Without the anchor both `The` and `the` match; with it, only the leading `The` matches.

The dollar sign `$` is used to check whether the matching character is the last character of the input string. For example, the regular expression `(at\.)$` means: a lowercase `a`, followed by a lowercase `t`, followed by a `.` character, and the match must be at the end of the string.

`"(at\.)"` => The fat cat. sat. on the mat.

`"(at\.)$"` => The fat cat. sat. on the mat.

The first pattern matches every `at.` in the sentence; the second matches only the final one.

Regular expressions provide shorthands for commonly used character sets. The shorthand character sets are as follows:

| Shorthand | Description |
|---|---|
| . | Any character except new line |
| \w | Matches alphanumeric characters: [a-zA-Z0-9_] |
| \W | Matches non-alphanumeric characters: [^\w] |
| \d | Matches digit: [0-9] |
| \D | Matches non-digit: [^\d] |
| \s | Matches whitespace character: [\t\n\f\r\p{Z}] |
| \S | Matches non-whitespace character: [^\s] |

Lookbehind and lookahead (together known as lookaround) are specific types of non-capturing group: they are used to match a pattern without including it in the match. Lookarounds are used when a pattern must be preceded or followed by another pattern. For example, suppose we want to get all numbers that are preceded by a `$` character from the input string `$4.44 and $10.88`. We can use the regular expression `(?<=\$)[0-9\.]*`, which means: get all the numbers, possibly containing a `.` character, that are preceded by a `$` character. The following lookarounds are used in regular expressions:

| Symbol | Description |
|---|---|
| ?= | Positive Lookahead |
| ?! | Negative Lookahead |
| ?<= | Positive Lookbehind |
| ?<! | Negative Lookbehind |

The positive lookahead asserts that the first part of the expression must be followed by the lookahead expression. The returned match only contains the text that is matched by the first part of the expression. To define a positive lookahead, parentheses are used, and within those parentheses a question mark with an equal sign is used, like this: `(?=...)`. The lookahead expression is written after the equal sign inside the parentheses. For example, the regular expression `(T|t)he(?=\sfat)` means: match either a lowercase `t` or an uppercase `T`, followed by the letter `h`, followed by the letter `e`. In the parentheses we define a positive lookahead which tells the regular expression engine to match only a `The` or `the` that is followed by the word `fat`.

`"(T|t)he(?=\sfat)"` => The fat cat sat on the mat.

A negative lookahead is used when we need to get all matches from the input string that are not followed by a pattern. A negative lookahead is defined in the same way as a positive lookahead; the only difference is that instead of the equal sign `=` we use the negation character `!`, i.e. `(?!...)`. Let's take a look at the regular expression `(T|t)he(?!\sfat)`, which means: get all `The` or `the` words from the input string that are not followed by the word `fat` preceded by a space character.

`"(T|t)he(?!\sfat)"` => The fat cat sat on the mat.

A positive lookbehind is used to get all the matches that are preceded by a specific pattern. A positive lookbehind is denoted by `(?<=...)`. For example, the regular expression `(?<=(T|t)he\s)(fat|mat)` means: get all `fat` or `mat` words from the input string that come after the word `The` or `the`.

`"(?<=(T|t)he\s)(fat|mat)"` => The fat cat sat on the mat.

A negative lookbehind is used to get all the matches that are not preceded by a specific pattern. A negative lookbehind is denoted by `(?<!...)`. For example, the regular expression `(?<!(T|t)he\s)(cat)` means: get all `cat` words from the input string that do not come after the word `The` or `the`.

`"(?<!(T|t)he\s)(cat)"` => The cat sat on cat.

Flags are also called modifiers because they modify the output of a regular expression. These flags can be used in any order or combination, and are an integral part of the RegExp.

| Flag | Description |
|---|---|
| i | Case insensitive: Sets matching to be case-insensitive. |
| g | Global Search: Search for a pattern throughout the input string. |
| m | Multiline: Anchor meta characters work on each line. |

The `i` modifier is used to perform case-insensitive matching. For example, the regular expression `/The/gi` means: an uppercase letter `T`, followed by a lowercase `h`, followed by the character `e`. The `i` flag at the end of the regular expression tells the engine to ignore case. As you can see, we also provided the `g` flag because we want to search for the pattern in the whole input string.

`"The"` => The fat cat sat on the mat.

`"/The/gi"` => The fat cat sat on the mat.

The first pattern matches only `The`; with the `gi` flags the lowercase `the` matches as well.

The `g` modifier is used to perform a global match (find all matches rather than stopping after the first match). For example, the regular expression `/.(at)/g` means: any character except a new line, followed by a lowercase `a`, followed by a lowercase `t`. Because we provided the `g` flag at the end of the regular expression, it now finds every match in the whole input string.

`".(at)"` => The fat cat sat on the mat.

`"/.(at)/g"` => The fat cat sat on the mat.

The `m` modifier is used to perform a multi-line match. As discussed earlier, the anchors `^` and `$` check whether a pattern is at the beginning or the end of the input string. If we want the anchors to work on each line, we use the `m` flag. For example, the regular expression `/.at(.)?$/gm` means: any character except a new line, followed by a lowercase `a`, followed by a lowercase `t`, optionally followed by any character except a new line. Because of the `m` flag, the regular expression engine now matches the pattern at the end of each line in a string.

`"/.at(.)?$/"` => The fat cat sat on the mat.

`"/.at(.)?$/gm"` => The fat cat sat on the mat.

On this single-line input both patterns match only the final `mat.`; the difference appears when the input spans several lines, where the `m` flag lets `$` match before each line break.
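
The `/pattern/flags` notation above is the one used by JavaScript-style regex literals. As a rough sketch of how the three flags behave, here is a Python equivalent. The mapping is our assumption for illustration: `i` becomes `re.IGNORECASE`, `m` becomes `re.MULTILINE`, and `g` corresponds to choosing `re.findall` or `re.finditer` instead of `re.search`. The three-line input is also made up.

```python
import re

sentence = "The fat cat sat on the mat."
# Same words spread over three lines (assumed input for the m flag).
lines = "The fat\ncat sat\non the mat."

# i flag: ignore case.
case_sensitive = re.findall(r"The", sentence)                          # ['The']
case_insensitive = re.findall(r"The", sentence, flags=re.IGNORECASE)   # ['The', 'the']

# g flag: re.search stops at the first match, re.findall collects all of them.
first_only = re.search(r".at", sentence).group(0)                      # 'fat'
every_match = re.findall(r".at", sentence)                # ['fat', 'cat', 'sat', 'mat']

# m flag: without it $ matches only at the end of the input,
# with it $ also matches at the end of every line.
end_of_input = [m.group(0) for m in re.finditer(r".at(.)?$", lines)]
end_of_lines = [m.group(0) for m in re.finditer(r".at(.)?$", lines, flags=re.MULTILINE)]

print(end_of_input)   # ['mat.']
print(end_of_lines)   # ['fat', 'sat', 'mat.']
```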
- Positive Integers: `^\d+$`
- Negative Integers: `^-\d+$`
- US Phone Number: `^\+?[\d\s]{3,}$`
- US Phone with code: `^\+?[\d\s]+\(?[\d\s]{10,}$`
- Integers: `^-?\d+$`
- Username: `^[\w\d_.]{4,16}$`
- Alpha-numeric characters: `^[a-zA-Z0-9]*$`
- Alpha-numeric characters with spaces: `^[a-zA-Z0-9 ]*$`
- Password: `^(?=^.{6,}$)((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.*$`
- Email: `^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})*$`
- IPv4 address: `^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$`
- Lowercase letters only: `^([a-z])*$`
- Uppercase letters only: `^([A-Z])*$`
- URL: `^(((http|https|ftp):\/\/)?([a-zA-Z0-9\-\.])+(\.)([a-zA-Z0-9]){2,4}([a-zA-Z0-9\/+=%&_\.~?\-]*))*$`
- VISA credit card numbers: `^(4[0-9]{12}(?:[0-9]{3})?)*$`
- Date (MM/DD/YYYY): `^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$`
- Date (YYYY/MM/DD): `^(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])$`
- MasterCard credit card numbers: `^(5[1-5][0-9]{14})*$`
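
A few of the patterns above can be exercised with a small script. This is a sketch assuming Python's `re` module; the dictionary keys, the `is_valid` helper and the sample inputs are all made up for illustration.

```python
import re

# Patterns copied from the list above; the keys are illustrative names.
PATTERNS = {
    "integer": r"^-?\d+$",
    "ipv4": r"^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
            r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$",
    "date_mdy": r"^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$",
    "password": r"^(?=^.{6,}$)((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.*$",
}

def is_valid(kind: str, value: str) -> bool:
    """Return True if value matches the pattern registered under kind."""
    return re.match(PATTERNS[kind], value) is not None

samples = [
    ("integer", "-42"), ("integer", "4.2"),
    ("ipv4", "192.168.1.1"), ("ipv4", "256.1.1.1"),
    ("date_mdy", "12/31/2024"), ("date_mdy", "13/01/2024"),
    ("password", "Passw0rd"), ("password", "password"),
]
for kind, value in samples:
    print(kind, repr(value), is_valid(kind, value))
```

Note that the patterns ending in `)*$` (the IPv4 one among them) also accept the empty string, so an application that requires a value would need a separate non-empty check.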
- Report issues
- Open a pull request with improvements
- Spread the word
- Reach out to me directly at [email protected]
MIT © Zeeshan Ahmed

