Regex

All about regex in Ruby


  • Regular Expression
    • A pattern that describes a set of strings
    • Formally, regular expressions describe regular languages
    • In practice, a tool for searching and matching text
  • Regular expressions are built into Ruby as a class
    • Regexp
    • Syntax: /pattern/, for example /pat/
# Find the word 'like'
"Do you like cats?" =~ /like/ # Returns the index of the first occurrence, or nil if there is no match
=> 7
  • Another way: .match returns a MatchData object, or nil if there is no match
if "Do you like cats?".match(/like/)
	puts "Match found!"
end
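
As a minimal sketch of the basics above: Regexp.new and String#match? are standard Ruby but are not mentioned in these notes, so treat them as additions. The variable names are made up for illustration.

```ruby
# Two ways to build the same pattern: a literal and the Regexp constructor.
literal     = /like/
constructed = Regexp.new("like")
same_source = literal.source == constructed.source

sentence = "Do you like cats?"

# =~ returns the index of the first match, or nil when nothing matches.
index   = sentence =~ literal
missing = sentence =~ /dogs/

# match? answers true or false without building a MatchData object.
found = sentence.match?(constructed)

puts index          # 7
puts missing.inspect # nil
puts found          # true
```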

Character Class 🔡

  • A character class is delimited by square brackets, [ and ]

  • [ab] means a or b

  • /ab/ means a followed by b

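  • A character class lets us define a range or a list of characters to match.

    • [aeiou] => matches any vowel
    • [a-z] => matches any lowercase letter from a to z
    • A single class such as /[a]/ matches exactly one character.
    • To match two characters in sequence, use two classes: /[a][b]/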
def contains_vowel(str)
	str =~ /[aeiou]/
end
 
contains_vowel("test") # returns 1
contains_vowel("sky") # returns nil
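
The difference between a class and a plain sequence can be sketched like this; the sample strings and variable names are made up for illustration.

```ruby
# /[ab]/ matches ONE character that is either "a" or "b".
class_in_cab = "cab" =~ /[ab]/     # the "a" at index 1
class_in_cba = "cba" =~ /[ab]/     # the "b" at index 1

# /ab/ matches the two-character sequence "a" followed by "b".
seq_in_cab = "cab" =~ /ab/         # found at index 1
seq_in_cba = "cba" =~ /ab/         # not found

# Two one-character classes in a row behave like the sequence.
two_classes = "cab" =~ /[a][b]/    # found at index 1

p [class_in_cab, class_in_cba, seq_in_cab, seq_in_cba, two_classes]
# prints [1, 1, 1, nil, 1]
```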

Ranges 🔍

  • /./ => Any character except newline
  • /./m => Any character, m enables multiline mode
  • \w => A word character ([a-zA-Z0-9_])
  • \W => A non-word character ([^a-zA-Z0-9_])
  • \d => A digit character ([0-9])
  • \D => A non-digit character ([^0-9])
  • \h => A hexdigit character ([0-9a-fA-F])
  • \H => A non-hexdigit character ([^0-9a-fA-F])
  • \s => A whitespace character: /[ \t\r\n\f\v]/
  • \S => A non-whitespace character: /[^ \t\r\n\f\v]/
  • \R => A linebreak: \n, \v, \f, \r, \u0085 (NEXT LINE), \u2028 (LINE SEPARATOR), \u2029 (PARAGRAPH SEPARATOR), or \r\n
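
A short sketch of a few of these shorthand classes in use. The sample strings and variable names are hypothetical; .scan and .split are ordinary String methods.

```ruby
line = "room 42\tfloor 7"

numbers = line.scan(/\d+/)        # runs of digits
words   = line.scan(/\w+/)        # runs of word characters
fields  = line.split(/\s+/)       # split on any whitespace, including the tab
hex     = "ff00aa".scan(/\h\h/)   # pairs of hex digits

# . does not match a newline unless the m option is given.
dot_plain     = "a\nb" =~ /a.b/
dot_multiline = "a\nb" =~ /a.b/m

# \R treats "\r\n" as a single linebreak.
lines = "one\r\ntwo\nthree".split(/\R/)

p numbers        # ["42", "7"]
p words          # ["room", "42", "floor", "7"]
p hex            # ["ff", "00", "aa"]
p dot_plain      # nil
p dot_multiline  # 0
p lines          # ["one", "two", "three"]
```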

Anchors ⚓️

Anchors are metacharacters that match a position rather than a character:

  • ^ - Matches beginning of line
  • $ - Matches end of line
  • \A - Matches beginning of string.
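  • \Z - Matches end of string. If the string ends with a newline, it matches just before that newline
  • \z - Matches end of string
  • \G - Matches the first matching position: the start of the string at first and, in methods like .scan and .gsub, the point where the previous match finished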
  • \b - Matches word boundaries when outside brackets; backspace (0x08) when inside brackets
  • \B - Matches non-word boundaries
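
The line anchors and string anchors are easy to confuse, so here is a small sketch comparing them on a two-line string. The strings and variable names are made up for illustration.

```ruby
text = "first line\nsecond line\n"

line_start   = text =~ /^second/   # ^ matches at the start of any line
string_start = text =~ /\Asecond/  # \A matches only at the start of the string
line_end     = text =~ /line$/     # $ matches at the end of the first line
before_last_newline = text =~ /line\Z/  # \Z ignores one trailing newline
very_end     = text =~ /line\z/    # \z does not, so the final "\n" blocks it

whole_word  = "online line" =~ /\bline\b/  # "line" as a separate word
inside_word = "online line" =~ /\Bline/    # "line" inside "online"

# With .scan, \G ties each match to where the previous one ended.
leading_as = "aab aa".scan(/\Ga/)

p [line_start, string_start, line_end, before_last_newline, very_end]
# prints [11, nil, 6, 18, nil]
p [whole_word, inside_word]  # prints [7, 2]
p leading_as                 # prints ["a", "a"]
```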

Repetition 🔁

To match repeated characters we can use quantifiers, which apply to the element that precedes them

  • + => Matches 1 or more times
  • * => Matches 0 or more times
  • ? => Matches 0 or 1 time
  • {n} => Matches exactly n times
  • {n,} => Matches n or more times
  • {,m} => Matches m or fewer times
  • {n,m} => Matches at least n and at most m times
# example of '+'
"Hello".match(/[A-Z]+[a-z]+l{2}o/) #=> #<MatchData "Hello">
# [1 or more uppercase], [1 or more lowercase], 2 times 'l' characters, 
# and one 'o' character
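
The remaining quantifiers can be sketched the same way. The sample strings and variable names are made up for illustration.

```ruby
optional_u  = "color colour".scan(/colou?r/)   # ? makes the "u" optional
zero_plus   = "ac abc abbc".scan(/ab*c/)       # * allows no "b" at all
one_plus    = "ac abc abbc".scan(/ab+c/)       # + needs at least one "b"

numbers = "1 12 123 1234"
exactly_two  = numbers.scan(/\b\d{2}\b/)       # {n}
three_plus   = numbers.scan(/\b\d{3,}\b/)      # {n,}
two_to_three = numbers.scan(/\b\d{2,3}\b/)     # {n,m}

at_most_two = "aaaa"[/a{,2}/]                  # {,m} stops after two "a"s

p optional_u    # ["color", "colour"]
p zero_plus     # ["ac", "abc", "abbc"]
p one_plus      # ["abc", "abbc"]
p exactly_two   # ["12"]
p three_plus    # ["123", "1234"]
p two_to_three  # ["12", "123"]
p at_most_two   # "aa"
```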

Parentheses in Regex 🔘

Capturing

We can refer back to the nth group of parentheses with \n

  • /(\d) (\w)/
    • \1 references the first parenthesized group, (\d)
    • \2 references the second group, (\w)
    • and so on
    • only n from 1 to 9 is supported with \n
    • Otherwise we can use the Regexp global variables
  • Example
"The cat sat in the hat".match(/[csh](..) [csh]\1 in/)
# [csh] = c
# (..) = at
# space
# [csh] = s
# \1 = (..) = at
  • .match returns a MatchData object in which each captured group is available by index
# From the previous example, its return value
=> #<MatchData "cat sat in" 1:"at">   # at index 1: "at"
 
"The cat sat in the hat".match(/[csh](..) [csh]\1 in/)[1] # returns 'at'
  • \0 represents the whole matched string
  • In a .gsub replacement string, \0 lets us reuse each match, for example to append text to it
"The cat sat in the hat".gsub(/[csh]at/, '\0s') # => "The cats sats in the hats"
# Each match of /[csh]at/ is available as \0, so '\0s' appends an 's' to it
# Therefore, we get cats, sats, hats

No Capturing

  • When we want to group a pattern but not capture it, we start the group with ?:
  • /(?:[0-9])([0-9])/ captures only the second digit
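
A minimal sketch of the difference, using MatchData#captures to list what was captured. The input "42" and the variable names are made up for illustration.

```ruby
# Both groups capture, so both digits are returned.
both = "42".match(/([0-9])([0-9])/).captures

# The first group is non-capturing, so only the second digit is captured,
# although the overall match still covers both digits.
non_capturing = "42".match(/(?:[0-9])([0-9])/)
only_second   = non_capturing.captures
whole_match   = non_capturing[0]

p both         # ["4", "2"]
p only_second  # ["2"]
p whole_match  # "42"
```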

Grouping

Parentheses group terms together, so that a quantifier or alternation applies to the group as a whole.

  • /(pat)/
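
To illustrate, here is a small sketch of how grouping changes what a quantifier applies to. The sample strings and variable names are made up.

```ruby
# Without a group, + applies only to the "b".
ungrouped      = "ababab"[/ab+/]    # "a" followed by one or more "b"
ungrouped_long = "abbb"[/ab+/]

# With a group, + applies to the whole "ab" unit.
grouped = "ababab"[/(ab)+/]

p ungrouped       # "ab"
p ungrouped_long  # "abbb"
p grouped         # "ababab"
```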

Backreference (Global Variables)

We can use these Regexp global variables to refer back to the last match and its capture groups.

  • $~ is equivalent to Regexp.last_match;
  • $& contains the complete matched text;
  • $` contains the string before the match;
  • $' contains the string after the match;
  • $1, $2, and so on contain the text matched by the first, second, etc. capture group;
  • $+ contains the last capture group
# examples
m = 'haystack'.match(/s(\w{2}).*(c)/) #=> #<MatchData "stac" 1:"ta" 2:"c">
$~                                    #=> #<MatchData "stac" 1:"ta" 2:"c">
Regexp.last_match                     #=> #<MatchData "stac" 1:"ta" 2:"c">
 
$&   #=> "stac" # same as m[0]
$`   #=> "hay"  # same as m.pre_match
$'   #=> "k"    # same as m.post_match
$1   #=> "ta"   # same as m[1]
$2   #=> "c"    # same as m[2]
$3   #=> nil    # no third group in pattern
$+   #=> "c"    # same as m[-1]

Alternation

The vertical bar metacharacter | works like OR in programming: a|b matches either a or b

# example 1
"Feliformia".match(/\w(and|or)\w/) #=> #<MatchData "form" 1:"or">
# \w = f
# (and|or) = or
# \w = m
# Match = form
# example 2
"furandi".match(/\w(and|or)\w/)    #=> #<MatchData "randi" 1:"and">
# \w = r
# (and|or) = and
# \w = i
# Match = randi

Options

The closing delimiter of a regex can be followed by one or more single-letter options:

  • /pat/i - Ignore case
  • /pat/m - Treat a newline as a character matched by .
  • /pat/x - Ignore whitespace and comments in the pattern
  • /pat/o - Perform #{} interpolation only once
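
A brief sketch of the i, m, and x options (o is left out). The date pattern and the variable names are hypothetical examples, not from the notes.

```ruby
# i: ignore case.
case_insensitive = "RUBY" =~ /ruby/i

# m: let . match a newline as well.
without_m = "a\nb".match?(/a.b/)
with_m    = "a\nb".match?(/a.b/m)

# x: whitespace and # comments inside the pattern are ignored,
# which allows a pattern to be laid out over several lines.
date_pattern = /
  (\d{4})   # year
  -
  (\d{2})   # month
/x
parts = "2024-05".match(date_pattern).captures

p case_insensitive  # 0
p without_m         # false
p with_m            # true
p parts             # ["2024", "05"]
```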

Encoding

  • /pat/u - UTF-8
  • /pat/e - EUC-JP
  • /pat/s - Windows-31J
  • /pat/n - ASCII-8BIT

Regex with Ruby Methods

  • Regular expressions can be used with many Ruby methods, for example
    • .split
    • .scan
    • .gsub
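
A minimal sketch of these three methods taking a regex; the sample strings and variable names are made up for illustration.

```ruby
# .split: break a string wherever the pattern matches.
items = "a, b;c".split(/[,;]\s*/)

# .scan: collect every match into an array.
counts = "10 apples and 3 pears".scan(/\d+/)

# .gsub with a replacement string: \1, \2, \3 refer to the capture groups.
reordered = "2024-05-17".gsub(/(\d+)-(\d+)-(\d+)/, '\3/\2/\1')

# .gsub with a block: the block receives each match and returns its replacement.
shouted = "hello world".gsub(/o/) { |match| match.upcase }

p items      # ["a", "b", "c"]
p counts     # ["10", "3"]
p reordered  # "17/05/2024"
p shouted    # "hellO wOrld"
```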