Regex
All about regex in Ruby
- Regular Expression
- A pattern that describes a set of Strings
- are used to describe regular languages
- tool used to search for text
- Regex is built-in and is a class in Ruby
- Regexp
- Syntax:
/pattern/
or/pat/
# Find the word 'like'
"Do you like cats?" =~ /like/ # Returns the index of the occurrance or nil
=> 7
- Another way
if "Do you like cats?".match(/like/)
puts "Match found!"
end
Character Class 🔡
-
A character class is delimited with square brackets
[
, `] -
[ab]
means a or b -
/ab/
means a followed by b -
Lets define a range or a list of characters to match.
[aeiou]
=> matches any vowel- Enclosed by
/[a]/
matches for one character. - To match two characters,
/[a][b]/
def contains_vowel(str)
str =~ /[aeiou]/
end
contains_vowel("test") # returns 1
contains_vowel("sky") # returns nil
Ranges 🔍
/./
=> Any character except newline/./m
=> Any character,m
enables multiline mode\w
=> A word character ([a-zA-Z0-9_]
)\W
=> A non-word character ([^a-zA-Z0-9_]
)\d
- A digit character ([0-9]
)\D
- A non-digit character ([^0-9]
)\h
- A hexdigit character ([0-9a-fA-F]
)\H
- A non-hexdigit character ([^0-9a-fA-F]
)\s
- A whitespace character:/[ \t\r\n\f\v]/
\S
- A non-whitespace character:/[^ \t\r\n\f\v]/
\R
- A linebreak:\n
,\v
,\f
,\r
\u0085
(NEXT LINE),\u2028
(LINE SEPARATOR),\u2029
(PARAGRAPH SEPARATOR) or\r\n
.
Anchors ⚓️
Anchors are metacharacters that match to a specific position:
^
- Matches beginning of line$
- Matches end of line\A
- Matches beginning of string.\Z
- Matches end of string. If string ends with a newline, it matches just before newline\z
- Matches end of string\G
- Matches first matching position:\b
- Matches word boundaries when outside brackets; backspace (0x08) when inside brackets\B
- Matches non-word boundaries
Repetition 🔁
To match multiple characters we can use pattern modifiers
+
=> Matches 1 or more characters*
=> Matches 0 or more?
=> Matches 0 or 1{ n }
=> Matches exactly 1{n ,}
=> Matchesn
or more{, m}
=> Matchesm
or less{n,m}
- Matches betweenx
and `y
# example of '+'
"Hello".match(/[A-Z]+[a-z]+l{2}o/) #=> #<MatchData "Hello">
# [1 or more uppercase], [1 or more lowercase], 2 times 'l' characters,
# and one 'o' character
Parenthesis in Regex 🔘
Capturing
We can backreference to an n
group of parenthesis with \n
/(\d) (\w)/
\1
references the first group parenthesis(\d)
\2
references the second group(\w)
- and so on
- only 1-9 for
n
are supported using\n
- Otherwise we can use Regex Global Variables
- Example
"The cat sat in the hat".match(/[csh](..) [csh]\1 in/)
# [csh] = c
# (..) = at
# space
# [csh] = s
# \1 = (..) = at
- When using
.match
, it returns the group of parenthesis matched in an index
# From previous example, its return value
=> <MatchData "cat sat in" 1:"at"> # at index 1: "at"
"The cat sat in the hat".match(/[csh](..) [csh]\1 in/)[1] # returns 'at'
\0
represents the whole matched string- We can use
\0
and.gsub
to substitute a matched pattern. Very useful
"The cat sat in the hat".gsub(/[csh]at/, '\0s') # => "The cats sats in the hats"
# To every occurrence of "[csh]at" = '\0', add an 's' such that '\0s'
# Therefore, we'll have cats, sats, hats
No Capturing
- When we want to group a pattern but not capture it, we use
?:
(?:[0-9])([0-9])
Grouping
Parenthesis group terms, allowing the pattern only work for this group.
/(pat)/
Backreference (Global Variables)
We can use these Regex global variables to backreference group parenthesis
.
$~
is equivalent toRegexp.last_match
;$&
contains the complete matched text;$`
contains string before match;$'
contains string after match;$1
,$2
and so on contain text matching first, second, etc capture group;$+
contains last capture group
# examples
m = 'haystack'.match(/s(\w{2}).*(c)/) #=> #<MatchData "stac" 1:"ta" 2:"c">
$~ #=> #<MatchData "stac" 1:"ta" 2:"c">
Regexp.last_match #=> #<MatchData "stac" 1:"ta" 2:"c">
$& #=> "stac" # same as m[0]
$` #=> "hay" # same as m.pre_match
$' #=> "k" # same as m.post_match
$1 #=> "ta" # same as m[1]
$2 #=> "c" # same as m[2]
$3 #=> nil # no third group in pattern
$+ #=> "c" # same as m[-1]
Alternation
The vertical bar metacharacter a|b
is like OR
in programming, either matches a
or b
# example 1
"Feliformia".match(/\w(and|or)\w/) #=> #<MatchData "form" 1:"or">
# \w = f
# (and|or) = or
# \w = m
# Match = form
# example 2
"furandi".match(/\w(and|or)\w/) #=> #<MatchData "randi" 1:"and">
# \w = r
# (and|or) = and
# \w = i
# Match = randi
Options
End delimiter of a regex can be followed by one or more single-letter options:
/pat/i
- Ignore case/pat/m
- Treat a newline as a character matched by.
/pat/x
- Ignore whitespace and comments in the pattern/pat/o
- Perform#{}
interpolation only once
Encoding
/_pat_/u
- UTF-8/_pat_/e
- EUC-JP/pat/s
- Windows-31J/pat/n
- ASCII-8BIT
Regex with Ruby Methods
- Regular expression can be used with many Ruby methods
- .split
- .scan
- .gsub