Diffstat (limited to 'doc/_regexp.rdoc')
| -rw-r--r-- | doc/_regexp.rdoc | 123 |
1 files changed, 69 insertions, 54 deletions
diff --git a/doc/_regexp.rdoc b/doc/_regexp.rdoc index da323e913f..4ad6118ddd 100644 --- a/doc/_regexp.rdoc +++ b/doc/_regexp.rdoc @@ -26,20 +26,20 @@ A regexp may be used: re.match('good') # => nil See sections {Method match}[rdoc-ref:Regexp@Method+match] - and {Operator =~}[rdoc-ref:Regexp@Operator+-3D~]. + and {Operator =~}[rdoc-ref:Regexp@Operator-]. - To determine whether a string matches a given pattern: re.match?('food') # => true re.match?('good') # => false - See section {Method match?}[rdoc-ref:Regexp@Method+match-3F]. + See section {Method match?}[rdoc-ref:Regexp@Method+match]. - As an argument for calls to certain methods in other classes and modules; most such methods accept an argument that may be either a string or the (much more powerful) regexp. - See {Regexp Methods}[rdoc-ref:regexp/methods.rdoc]. + See {Regexp Methods}[rdoc-ref:language/regexp/methods.rdoc]. == \Regexp Objects @@ -64,7 +64,7 @@ A regular expression may be created with: /foo/ # => /foo/ - A <tt>%r</tt> regexp literal - (see {%r: Regexp Literals}[rdoc-ref:syntax/literals.rdoc@25r-3A+Regexp+Literals]): + (see {%r: Regexp Literals}[rdoc-ref:syntax/literals.rdoc@r-regexp+literals]): # Same delimiter character at beginning and end; # useful for avoiding escaping characters @@ -78,9 +78,9 @@ A regular expression may be created with: %r(foo) # => /foo/ %r<foo> # => /foo/ -- \Method Regexp.new. +- Method Regexp.new. -== \Method <tt>match</tt> +== Method <tt>match</tt> Each of the methods Regexp#match, String#match, and Symbol#match returns a MatchData object if a match was found, +nil+ otherwise; @@ -99,7 +99,7 @@ each also sets {global variables}[rdoc-ref:Regexp@Global+Variables]: 'foo bar' =~ /bar/ # => 4 /baz/ =~ 'foo bar' # => nil -== \Method <tt>match?</tt> +== Method <tt>match?</tt> Each of the methods Regexp#match?, String#match?, and Symbol#match? 
returns +true+ if a match was found, +false+ otherwise; @@ -113,7 +113,7 @@ none sets {global variables}[rdoc-ref:Regexp@Global+Variables]: Certain regexp-oriented methods assign values to global variables: - <tt>#match</tt>: see {Method match}[rdoc-ref:Regexp@Method+match]. -- <tt>#=~</tt>: see {Operator =~}[rdoc-ref:Regexp@Operator+-3D~]. +- <tt>#=~</tt>: see {Operator =~}[rdoc-ref:Regexp@Operator-]. The affected global variables are: @@ -127,6 +127,9 @@ The affected global variables are: Note that <tt>$0</tt> is quite different; it returns the name of the currently executing program. +These variables, except for <tt>$~</tt>, are shorthands for methods of +<tt>$~</tt>. See MatchData@Global+variables+equivalence. + Examples: # Matched string, but no matched groups. @@ -411,21 +414,21 @@ Each of these anchors matches a boundary: Lookahead anchors: -- <tt>(?=_pat_)</tt>: Positive lookahead assertion: +- <tt>(?=pat)</tt>: Positive lookahead assertion: ensures that the following characters match _pat_, but doesn't include those characters in the matched substring. -- <tt>(?!_pat_)</tt>: Negative lookahead assertion: +- <tt>(?!pat)</tt>: Negative lookahead assertion: ensures that the following characters <i>do not</i> match _pat_, but doesn't include those characters in the matched substring. Lookbehind anchors: -- <tt>(?<=_pat_)</tt>: Positive lookbehind assertion: +- <tt>(?<=pat)</tt>: Positive lookbehind assertion: ensures that the preceding characters match _pat_, but doesn't include those characters in the matched substring. -- <tt>(?<!_pat_)</tt>: Negative lookbehind assertion: +- <tt>(?<!pat)</tt>: Negative lookbehind assertion: ensures that the preceding characters do not match _pat_, but doesn't include those characters in the matched substring. @@ -436,6 +439,10 @@ without including the tags in the match: /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favors the <b>bold</b>.") # => #<MatchData "bold"> +The pattern in lookbehind must be fixed-width. 
+But top-level alternatives can be of various lengths. +ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed. + ==== Match-Reset Anchor - <tt>\K</tt>: Match reset: @@ -477,7 +484,7 @@ Each alternative is a subexpression, and may be composed of other subexpressions re.match('bar') # => #<MatchData "b" 1:"b"> re.match('ooz') # => #<MatchData "z" 1:"z"> -\Method Regexp.union provides a convenient way to construct +Method Regexp.union provides a convenient way to construct a regexp with alternatives. === Quantifiers @@ -495,7 +502,7 @@ An added _quantifier_ specifies how many matches are required or allowed: /\w*/.match('x') # => #<MatchData "x"> /\w*/.match('xyz') - # => #<MatchData "yz"> + # => #<MatchData "xyz"> - <tt>+</tt> - Matches one or more times: @@ -554,9 +561,9 @@ Quantifier matching may be greedy, lazy, or possessive: More: - About greedy and lazy matching, see - {Choosing Minimal or Maximal Repetition}[https://doc.lagout.org/programmation/Regular%20Expressions/Regular%20Expressions%20Cookbook_%20Detailed%20Solutions%20in%20Eight%20Programming%20Languages%20%282nd%20ed.%29%20%5BGoyvaerts%20%26%20Levithan%202012-09-06%5D.pdf#tutorial-backtrack]. + {Choosing Minimal or Maximal Repetition}[https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch02s13.html]. - About possessive matching, see - {Eliminate Needless Backtracking}[https://doc.lagout.org/programmation/Regular%20Expressions/Regular%20Expressions%20Cookbook_%20Detailed%20Solutions%20in%20Eight%20Programming%20Languages%20%282nd%20ed.%29%20%5BGoyvaerts%20%26%20Levithan%202012-09-06%5D.pdf#tutorial-backtrack]. + {Eliminate Needless Backtracking}[https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch02s14.html]. 
=== Groups and Captures @@ -567,7 +574,7 @@ A simple regexp has (at most) one match: re.match('1943-02-04').size # => 1 re.match('foo') # => nil -Adding one or more pairs of parentheses, <tt>(_subexpression_)</tt>, +Adding one or more pairs of parentheses, <tt>(subexpression)</tt>, defines _groups_, which may result in multiple matched substrings, called _captures_: @@ -640,8 +647,8 @@ A regexp may contain any number of groups: - For a large number of groups: - - The ordinary <tt>\\_n_</tt> notation applies only for _n_ in range (1..9). - - The <tt>MatchData[_n_]</tt> notation applies for any non-negative _n_. + - The ordinary <tt>\\n</tt> notation applies only for _n_ in range (1..9). + - The <tt>MatchData[n]</tt> notation applies for any non-negative _n_. - <tt>\0</tt> is a special backreference, referring to the entire matched string; it may not be used within the regexp itself, @@ -654,7 +661,7 @@ A regexp may contain any number of groups: As seen above, a capture can be referred to by its number. 
A capture can also have a name, -prefixed as <tt>?<_name_></tt> or <tt>?'_name_'</tt>, +prefixed as <tt>?<name></tt> or <tt>?'name'</tt>, and the name (symbolized) may be used as an index in <tt>MatchData[]</tt>: md = /\$(?<dollars>\d+)\.(?'cents'\d+)/.match("$3.67") @@ -669,7 +676,7 @@ When a regexp contains a named capture, there are no unnamed captures: /\$(?<dollars>\d+)\.(\d+)/.match("$3.67") # => #<MatchData "$3.67" dollars:"3"> -A named group may be backreferenced as <tt>\k<_name_></tt>: +A named group may be backreferenced as <tt>\k<name></tt>: /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy') # => #<MatchData "ototo" vowel:"o"> @@ -682,7 +689,7 @@ the captured substrings are assigned to local variables with corresponding names dollars # => "3" cents # => "67" -\Method Regexp#named_captures returns a hash of the capture names and substrings; +Method Regexp#named_captures returns a hash of the capture names and substrings; method Regexp#names returns an array of the capture names. ==== Atomic Grouping @@ -706,7 +713,7 @@ Analysis: 1. The leading subexpression <tt>"</tt> in the pattern matches the first character <tt>"</tt> in the target string. -2. The next subexpression <tt>.*</tt> matches the next substring <tt>Quote“</tt> +2. The next subexpression <tt>.*</tt> matches the next substring <tt>Quote"</tt> (including the trailing double-quote). 3. Now there is nothing left in the target string to match the trailing subexpression <tt>"</tt> in the pattern; @@ -725,10 +732,10 @@ see {Atomic Group}[https://www.regular-expressions.info/atomic.html]. 
==== Subexpression Calls -As seen above, a backreference number (<tt>\\_n_</tt>) or name (<tt>\k<_name_></tt>) +As seen above, a backreference number (<tt>\\n</tt>) or name (<tt>\k<name></tt>) gives access to a captured _substring_; the corresponding regexp _subexpression_ may also be accessed, -via the number (<tt>\\g<i>n</i></tt>) or name (<tt>\g<_name_></tt>): +via the number n (<tt>\\gn</tt>) or name (<tt>\g<name></tt>): /\A(?<paren>\(\g<paren>*\))*\z/.match('(())') # ^1 @@ -757,16 +764,16 @@ The pattern: 9. Matches the fourth character in the string, <tt>')'</tt>. 10. Matches the end of the string. -See {Subexpression calls}[https://learnbyexample.github.io/Ruby_Regexp/groupings-and-backreferences.html?highlight=subexpression#subexpression-calls]. +See {Subexpression calls}[https://learnbyexample.github.io/Ruby_Regexp/groupings-and-backreferences.html#subexpression-calls]. ==== Conditionals -The conditional construct takes the form <tt>(?(_cond_)_yes_|_no_)</tt>, where: +The conditional construct takes the form <tt>(?(cond)yes|no)</tt>, where: - _cond_ may be a capture number or name. - The match to be applied is _yes_ if _cond_ is captured; otherwise the match to be applied is _no_. -- If not needed, <tt>|_no_</tt> may be omitted. +- If not needed, <tt>|no</tt> may be omitted. 
Examples: @@ -795,7 +802,7 @@ The absence operator is a special group that matches anything which does _not_ m ==== Unicode Properties -The <tt>/\p{_property_name_}/</tt> construct (with lowercase +p+) +The <tt>/\p{property_name}/</tt> construct (with lowercase +p+) matches characters using a Unicode property name, much like a character class; property +Alpha+ specifies alphabetic characters: @@ -814,7 +821,7 @@ Or by using <tt>\P</tt> (uppercase +P+): /\P{Alpha}/.match('1') # => #<MatchData "1"> /\P{Alpha}/.match('a') # => nil -See {Unicode Properties}[rdoc-ref:regexp/unicode_properties.rdoc] +See {Unicode Properties}[rdoc-ref:language/regexp/unicode_properties.rdoc] for regexps based on the numerous properties. Some commonly-used properties correspond to POSIX bracket expressions: @@ -836,8 +843,9 @@ Some commonly-used properties correspond to POSIX bracket expressions: These are also commonly used: - <tt>/\p{Emoji}/</tt>: Unicode emoji. -- <tt>/\p{Graph}/</tt>: Non-blank character - (excludes spaces, control characters, and similar). +- <tt>/\p{Graph}/</tt>: Characters excluding <tt>/\p{Cntrl}/</tt> and <tt>/\p{Space}/</tt>. + Note that invisible characters under the Unicode + {"Format"}[https://www.compart.com/en/unicode/category/Cf] category are included. - <tt>/\p{Word}/</tt>: A member in one of these Unicode character categories (see below) or having one of these Unicode properties: @@ -897,7 +905,7 @@ Numbers: - {Nl, Letter_Number}[https://www.compart.com/en/unicode/category/Nl]. - {No, Other_Number}[https://www.compart.com/en/unicode/category/No]. -Punctation: +Punctuation: - +P+, +Punctuation+: +Pc+, +Pd+, +Pe+, +Pf+, +Pi+, +Po+, or +Ps+. - {Pc, Connector_Punctuation}[https://www.compart.com/en/unicode/category/Pc]. @@ -922,7 +930,7 @@ Punctation: - +C+, +Other+: +Cc+, +Cf+, +Cn+, +Co+, or +Cs+. - {Cc, Control}[https://www.compart.com/en/unicode/category/Cc]. - {Cf, Format}[https://www.compart.com/en/unicode/category/Cf]. 
-- {Cn, Unassigned}[https://www.compart.com/en/unicode/category/Cn]. +- {Cn, Unassigned}[http://zuga.net/articles/unicode/category/unassigned/]. - {Co, Private_Use}[https://www.compart.com/en/unicode/category/Co]. - {Cs, Surrogate}[https://www.compart.com/en/unicode/category/Cs]. @@ -1025,23 +1033,23 @@ See also {Extended Mode}[rdoc-ref:Regexp@Extended+Mode]. Each of these modifiers sets a mode for the regexp: -- +i+: <tt>/_pattern_/i</tt> sets +- +i+: <tt>/pattern/i</tt> sets {Case-Insensitive Mode}[rdoc-ref:Regexp@Case-Insensitive+Mode]. -- +m+: <tt>/_pattern_/m</tt> sets +- +m+: <tt>/pattern/m</tt> sets {Multiline Mode}[rdoc-ref:Regexp@Multiline+Mode]. -- +x+: <tt>/_pattern_/x</tt> sets +- +x+: <tt>/pattern/x</tt> sets {Extended Mode}[rdoc-ref:Regexp@Extended+Mode]. -- +o+: <tt>/_pattern_/o</tt> sets +- +o+: <tt>/pattern/o</tt> sets {Interpolation Mode}[rdoc-ref:Regexp@Interpolation+Mode]. Any, all, or none of these may be applied. Modifiers +i+, +m+, and +x+ may be applied to subexpressions: -- <tt>(?_modifier_)</tt> turns the mode "on" for ensuing subexpressions -- <tt>(?-_modifier_)</tt> turns the mode "off" for ensuing subexpressions -- <tt>(?_modifier_:_subexp_)</tt> turns the mode "on" for _subexp_ within the group -- <tt>(?-_modifier_:_subexp_)</tt> turns the mode "off" for _subexp_ within the group +- <tt>(?modifier)</tt> turns the mode "on" for ensuing subexpressions +- <tt>(?-modifier)</tt> turns the mode "off" for ensuing subexpressions +- <tt>(?modifier:subexp)</tt> turns the mode "on" for _subexp_ within the group +- <tt>(?-modifier:subexp)</tt> turns the mode "off" for _subexp_ within the group Example: @@ -1056,7 +1064,7 @@ Example: re.match('tEst') # => #<MatchData "tEst"> re.match('tEST') # => nil -\Method Regexp#options returns an integer whose value showing +Method Regexp#options returns an integer whose value showing the settings for case-insensitivity mode, multiline mode, and extended mode. 
=== Case-Insensitive Mode @@ -1070,7 +1078,7 @@ Modifier +i+ enables case-insensitive mode: /foo/i.match('FOO') # => #<MatchData "FOO"> -\Method Regexp#casefold? returns whether the mode is case-insensitive. +Method Regexp#casefold? returns whether the mode is case-insensitive. === Multiline Mode @@ -1120,6 +1128,13 @@ Regexp in extended mode: re = /#{pattern}/x re.match('MCMXLIII') # => #<MatchData "MCMXLIII" 1:"CM" 2:"XL" 3:"III"> +Comments in regexp literals cannot include unescaped terminator +characters: + + / + foo # the following slash \/ must be escaped + /x + === Interpolation Mode Modifier +o+ means that the first time a literal regexp with interpolations @@ -1158,22 +1173,22 @@ A regular expression containing non-US-ASCII characters is assumed to use the source encoding. This can be overridden with one of the following modifiers. -- <tt>/_pat_/n</tt>: US-ASCII if only containing US-ASCII characters, +- <tt>/pat/n</tt>: US-ASCII if only containing US-ASCII characters, otherwise ASCII-8BIT: /foo/n.encoding # => #<Encoding:US-ASCII> /foo\xff/n.encoding # => #<Encoding:ASCII-8BIT> /foo\x7f/n.encoding # => #<Encoding:US-ASCII> -- <tt>/_pat_/u</tt>: UTF-8 +- <tt>/pat/u</tt>: UTF-8 /foo/u.encoding # => #<Encoding:UTF-8> -- <tt>/_pat_/e</tt>: EUC-JP +- <tt>/pat/e</tt>: EUC-JP /foo/e.encoding # => #<Encoding:EUC-JP> -- <tt>/_pat_/s</tt>: Windows-31J +- <tt>/pat/s</tt>: Windows-31J /foo/s.encoding # => #<Encoding:Windows-31J> @@ -1243,7 +1258,7 @@ the potential vulnerability arising from this is the {regular expression denial- \Regexp matching can apply an optimization to prevent ReDoS attacks. When the optimization is applied, matching time increases linearly (not polynomially or exponentially) -in relation to the input size, and a ReDoS attach is not possible. +in relation to the input size, and a ReDoS attack is not possible. 
This optimization is applied if the pattern meets these criteria: @@ -1264,13 +1279,13 @@ because the optimization uses memoization (which may invoke large memory consump == References -Read (online PDF books): +Read: -- {Mastering Regular Expressions}[https://ia902508.us.archive.org/10/items/allitebooks-02/Mastering%20Regular%20Expressions%2C%203rd%20Edition.pdf] +- <i>Mastering Regular Expressions</i> by Jeffrey E.F. Friedl. -- {Regular Expressions Cookbook}[https://doc.lagout.org/programmation/Regular%20Expressions/Regular%20Expressions%20Cookbook_%20Detailed%20Solutions%20in%20Eight%20Programming%20Languages%20%282nd%20ed.%29%20%5BGoyvaerts%20%26%20Levithan%202012-09-06%5D.pdf] +- <i>Regular Expressions Cookbook</i> by Jan Goyvaerts & Steven Levithan. -Explore, test (interactive online editor): +Explore, test: -- {Rubular}[https://rubular.com/]. +- {Rubular}[https://rubular.com/]: interactive online editor. |
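A few of the behaviors this patch documents can be checked directly in Ruby. The sketch below is not part of the commit: the helper names `bold_word` and `compiles?` are hypothetical, chosen only for illustration. It exercises the lookaround example, the look-behind width note added by the patch, and the corrected `/\w*/` result. Patterns are built with `Regexp.new` so that an invalid pattern surfaces as a `RegexpError` at run time rather than as a parse failure of the whole file.

```ruby
# Hypothetical helper: return the word enclosed in <b>...</b>, or nil.
# The lookbehind and lookahead check for the tags without including
# them in the matched substring.
def bold_word(text)
  md = /(?<=<b>)\w+(?=<\/b>)/.match(text)
  md && md[0]
end

# Hypothetical helper: report whether a pattern source compiles.
# Regexp.new raises RegexpError for a pattern the engine rejects.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts bold_word('Fortune favors the <b>bold</b>.')  # prints "bold"

# Top-level alternatives of different lengths inside a lookbehind;
# the patch text describes this form as OK.
puts compiles?('(?<=a|bc)d')

# Variable-length alternatives nested below the top level; the patch
# text describes this form as not allowed, so a rejection is expected.
puts compiles?('(?<=aaa(?:b|cd))e')

# The * quantifier is greedy, so the whole word is matched.
puts(/\w*/.match('xyz')[0])                         # prints "xyz"
```

The second `compiles?` call is left without a stated result on purpose: whether the pattern is rejected depends on the regexp engine version in use, so it is worth running rather than assuming.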
