summaryrefslogtreecommitdiff
path: root/doc/strscan/strscan.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/strscan/strscan.md')
-rw-r--r--doc/strscan/strscan.md544
1 files changed, 544 insertions, 0 deletions
diff --git a/doc/strscan/strscan.md b/doc/strscan/strscan.md
new file mode 100644
index 0000000000..bbdeccd75e
--- /dev/null
+++ b/doc/strscan/strscan.md
@@ -0,0 +1,544 @@
+\Class `StringScanner` supports processing a stored string as a stream;
+this code creates a new `StringScanner` object with string `'foobarbaz'`:
+
+```rb
+require 'strscan'
+scanner = StringScanner.new('foobarbaz')
+```
+
+## About the Examples
+
+All examples here assume that `StringScanner` has been required:
+
+```rb
+require 'strscan'
+```
+
+Some examples here assume that these constants are defined:
+
+```rb
+MULTILINE_TEXT = <<~EOT
+Go placidly amid the noise and haste,
+and remember what peace there may be in silence.
+EOT
+
+HIRAGANA_TEXT = 'こんにちは'
+
+ENGLISH_TEXT = 'Hello'
+```
+
+Some examples here assume that certain helper methods are defined:
+
+- `put_situation(scanner)`:
+ Displays the values of the scanner's
+ methods #pos, #charpos, #rest, and #rest_size.
+- `put_match_values(scanner)`:
+ Displays the scanner's [match values][9].
+- `match_values_cleared?(scanner)`:
+ Returns whether the scanner's [match values][9] are cleared.
+
+See examples at [helper methods](doc/strscan/helper_methods.md).
+
+## The `StringScanner` \Object
+
+This code creates a `StringScanner` object
+(we'll call it simply a _scanner_),
+and shows some of its basic properties:
+
+```rb
+scanner = StringScanner.new('foobarbaz')
+scanner.string # => "foobarbaz"
+put_situation(scanner)
+# Situation:
+# pos: 0
+# charpos: 0
+# rest: "foobarbaz"
+# rest_size: 9
+```
+
+The scanner has:
+
+* A <i>stored string</i>, which is:
+
+ * Initially set by StringScanner.new(string) to the given `string`
+ (`'foobarbaz'` in the example above).
+ * Modifiable by methods #string=(new_string) and #concat(more_string).
+ * Returned by method #string.
+
+ More at [Stored String][1] below.
+
+* A _position_;
+ a zero-based index into the bytes of the stored string (_not_ into its characters):
+
+ * Initially set by StringScanner.new to `0`.
+ * Returned by method #pos.
+ * Modifiable explicitly by methods #reset, #terminate, and #pos=(new_pos).
+ * Modifiable implicitly (various traversing methods, among others).
+
+ More at [Byte Position][2] below.
+
+* A <i>target substring</i>,
+ which is a trailing substring of the stored string;
+ it extends from the current position to the end of the stored string:
+
+ * Initially set by StringScanner.new(string) to the given `string`
+ (`'foobarbaz'` in the example above).
+ * Returned by method #rest.
+ * Modified by any modification to either the stored string or the position.
+
+ <b>Most importantly</b>:
+ the searching and traversing methods operate on the target substring,
+ which may be (and often is) less than the entire stored string.
+
+ More at [Target Substring][3] below.
+
+## Stored \String
+
+The <i>stored string</i> is the string stored in the `StringScanner` object.
+
+Each of these methods sets, modifies, or returns the stored string:
+
+| Method | Effect |
+|----------------------|-------------------------------------------------|
+| ::new(string) | Creates a new scanner for the given string. |
+| #string=(new_string) | Replaces the existing stored string. |
+| #concat(more_string) | Appends a string to the existing stored string. |
+| #string | Returns the stored string. |
+
+## Positions
+
+A `StringScanner` object maintains a zero-based <i>byte position</i>
+and a zero-based <i>character position</i>.
+
+Each of these methods explicitly sets positions:
+
+| Method | Effect |
+|--------------------------|-----------------------------------------------------------|
+| #reset | Sets both positions to zero (beginning of stored string). |
+| #terminate | Sets both positions to the end of the stored string. |
+| #pos=(new_byte_position) | Sets byte position; adjusts character position. |
+
+### Byte Position (Position)
+
+The byte position (or simply _position_)
+is a zero-based index into the bytes in the scanner's stored string;
+for a new `StringScanner` object, the byte position is zero.
+
+When the byte position is:
+
+* Zero (at the beginning), the target substring is the entire stored string.
+* Equal to the size of the stored string (at the end),
+ the target substring is the empty string `''`.
+
+To get or set the byte position:
+
+* \#pos: returns the byte position.
+* \#pos=(new_pos): sets the byte position.
+
+Many methods use the byte position as the basis for finding matches;
+many others set, increment, or decrement the byte position:
+
+```rb
+scanner = StringScanner.new('foobar')
+scanner.pos # => 0
+scanner.scan(/foo/) # => "foo" # Match found.
+scanner.pos # => 3 # Byte position incremented.
+scanner.scan(/foo/) # => nil # Match not found.
+scanner.pos # => 3 # Byte position not changed.
+```
+
+Some methods implicitly modify the byte position;
+see:
+
+* [Setting the Target Substring][4].
+* [Traversing the Target Substring][5].
+
+The values of these methods are derived directly from the values of #pos and #string:
+
+- \#charpos: the [character position][7].
+- \#rest: the [target substring][3].
+- \#rest_size: `rest.size`.
+
+### Character Position
+
+The character position is a zero-based index into the _characters_
+in the stored string;
+for a new `StringScanner` object, the character position is zero.
+
+\Method #charpos returns the character position;
+its value may not be reset explicitly.
+
+Some methods change (increment or reset) the character position;
+see:
+
+* [Setting the Target Substring][4].
+* [Traversing the Target Substring][5].
+
+Example (string includes multi-byte characters):
+
+```rb
+scanner = StringScanner.new(ENGLISH_TEXT) # Five 1-byte characters.
+scanner.concat(HIRAGANA_TEXT) # Five 3-byte characters
+scanner.string # => "Helloこんにちは" # Twenty bytes in all.
+put_situation(scanner)
+# Situation:
+# pos: 0
+# charpos: 0
+# rest: "Helloこんにちは"
+# rest_size: 20
+scanner.scan(/Hello/) # => "Hello" # Five 1-byte characters.
+put_situation(scanner)
+# Situation:
+# pos: 5
+# charpos: 5
+# rest: "こんにちは"
+# rest_size: 15
+scanner.getch # => "こ" # One 3-byte character.
+put_situation(scanner)
+# Situation:
+# pos: 8
+# charpos: 6
+# rest: "んにちは"
+# rest_size: 12
+```
+
+## Target Substring
+
+The target substring is the part of the [stored string][1]
+that extends from the current [byte position][2] to the end of the stored string;
+it is always either:
+
+- The entire stored string (byte position is zero).
+- A trailing substring of the stored string (byte position positive).
+
+The target substring is returned by method #rest,
+and its size is returned by method #rest_size.
+
+Examples:
+
+```rb
+scanner = StringScanner.new('foobarbaz')
+put_situation(scanner)
+# Situation:
+# pos: 0
+# charpos: 0
+# rest: "foobarbaz"
+# rest_size: 9
+scanner.pos = 3
+put_situation(scanner)
+# Situation:
+# pos: 3
+# charpos: 3
+# rest: "barbaz"
+# rest_size: 6
+scanner.pos = 9
+put_situation(scanner)
+# Situation:
+# pos: 9
+# charpos: 9
+# rest: ""
+# rest_size: 0
+```
+
+### Setting the Target Substring
+
+The target substring is set whenever:
+
+* The [stored string][1] is set (position reset to zero; target substring set to stored string).
+* The [byte position][2] is set (target substring adjusted accordingly).
+
+### Querying the Target Substring
+
+This table summarizes (details and examples at the links):
+
+| Method | Returns |
+|------------|-----------------------------------|
+| #rest | Target substring. |
+| #rest_size | Size (bytes) of target substring. |
+
+### Searching the Target Substring
+
+A _search_ method examines the target substring,
+but does not advance the [positions][11]
+or (by implication) shorten the target substring.
+
+This table summarizes (details and examples at the links):
+
+| Method | Returns | Sets Match Values? |
+|-----------------------|-----------------------------------------------|--------------------|
+| #check(pattern) | Matched leading substring or +nil+. | Yes. |
+| #check_until(pattern) | Matched substring (anywhere) or +nil+. | Yes. |
+| #exist?(pattern) | Matched substring (anywhere) end index. | Yes. |
+| #match?(pattern) | Size of matched leading substring or +nil+. | Yes. |
+| #peek(size) | Leading substring of given length (bytes). | No. |
+| #peek_byte | Integer leading byte or +nil+. | No. |
+| #rest | Target substring (from byte position to end). | No. |
+
+### Traversing the Target Substring
+
+A _traversal_ method examines the target substring,
+and, if successful:
+
+- Advances the [positions][11].
+- Shortens the target substring.
+
+
+This table summarizes (details and examples at links):
+
+| Method | Returns | Sets Match Values? |
+|----------------------|------------------------------------------------------|--------------------|
+| #get_byte | Leading byte or +nil+. | No. |
+| #getch | Leading character or +nil+. | No. |
+| #scan(pattern) | Matched leading substring or +nil+. | Yes. |
+| #scan_byte | Integer leading byte or +nil+. | No. |
+| #scan_until(pattern) | Matched substring (anywhere) or +nil+. | Yes. |
+| #skip(pattern) | Matched leading substring size or +nil+. | Yes. |
+| #skip_until(pattern) | Position delta to end-of-matched-substring or +nil+. | Yes. |
+| #unscan | +self+. | No. |
+
+## Querying the Scanner
+
+Each of these methods queries the scanner object
+without modifying it (details and examples at links)
+
+| Method | Returns |
+|---------------------|----------------------------------|
+| #beginning_of_line? | +true+ or +false+. |
+| #charpos | Character position. |
+| #eos? | +true+ or +false+. |
+| #fixed_anchor? | +true+ or +false+. |
+| #inspect | String representation of +self+. |
+| #pos | Byte position. |
+| #rest | Target substring. |
+| #rest_size | Size of target substring. |
+| #string | Stored string. |
+
+## Matching
+
+`StringScanner` implements pattern matching via Ruby class [Regexp][6],
+and its matching behaviors are the same as Ruby's
+except for the [fixed-anchor property][10].
+
+### Matcher Methods
+
+Each <i>matcher method</i> takes a single argument `pattern`,
+and attempts to find a matching substring in the [target substring][3].
+
+| Method | Pattern Type | Matches Target Substring | Success Return | May Update Positions? |
+|--------------|-------------------|--------------------------|--------------------|-----------------------|
+| #check | Regexp or String. | At beginning. | Matched substring. | No. |
+| #check_until | Regexp or String. | Anywhere. | Substring. | No. |
+| #match? | Regexp or String. | At beginning. | Match size. | No. |
+| #exist? | Regexp or String. | Anywhere. | Substring size. | No. |
+| #scan | Regexp or String. | At beginning. | Matched substring. | Yes. |
+| #scan_until | Regexp or String. | Anywhere. | Substring. | Yes. |
+| #skip | Regexp or String. | At beginning. | Match size. | Yes. |
+| #skip_until | Regexp or String. | Anywhere. | Substring size. | Yes. |
+
+<br>
+
+Which matcher you choose will depend on:
+
+- Where you want to find a match:
+
+ - Only at the beginning of the target substring:
+ #check, #match?, #scan, #skip.
+ - Anywhere in the target substring:
+ #check_until, #exist?, #scan_until, #skip_until.
+
+- Whether you want to:
+
+ - Traverse, by advancing the positions:
+ #scan, #scan_until, #skip, #skip_until.
+ - Keep the positions unchanged:
+ #check, #check_until, #match?, #exist?.
+
+- What you want for the return value:
+
+ - The matched substring: #check, #scan.
+ - The substring: #check_until, #scan_until.
+ - The match size: #match?, #skip.
+ - The substring size: #exist?, #skip_until.
+
+### Match Values
+
+The <i>match values</i> in a `StringScanner` object
+generally contain the results of the most recent attempted match.
+
+Each match value may be thought of as:
+
+* _Clear_: Initially, or after an unsuccessful match attempt:
+ usually, `false`, `nil`, or `{}`.
+* _Set_: After a successful match attempt:
+ `true`, string, array, or hash.
+
+Each of these methods clears match values:
+
+- ::new(string).
+- \#reset.
+- \#terminate.
+
+Each of these methods attempts a match based on a pattern,
+and either sets match values (if successful) or clears them (if not);
+
+- \#check(pattern)
+- \#check_until(pattern)
+- \#exist?(pattern)
+- \#match?(pattern)
+- \#scan(pattern)
+- \#scan_until(pattern)
+- \#skip(pattern)
+- \#skip_until(pattern)
+
+#### Basic Match Values
+
+Basic match values are those not related to captures.
+
+Each of these methods returns a basic match value:
+
+| Method | Return After Match | Return After No Match |
+|-----------------|----------------------------------------|-----------------------|
+| #matched? | +true+. | +false+. |
+| #matched_size | Size of matched substring. | +nil+. |
+| #matched | Matched substring. | +nil+. |
+| #pre_match | Substring preceding matched substring. | +nil+. |
+| #post_match | Substring following matched substring. | +nil+. |
+
+<br>
+
+See examples below.
+
+#### Captured Match Values
+
+Captured match values are those related to [captures][16].
+
+Each of these methods returns a captured match value:
+
+| Method | Return After Match | Return After No Match |
+|-----------------|-----------------------------------------|-----------------------|
+| #size | Count of captured substrings. | +nil+. |
+| #\[\](n) | <tt>n</tt>th captured substring. | +nil+. |
+| #captures | Array of all captured substrings. | +nil+. |
+| #values_at(*n) | Array of specified captured substrings. | +nil+. |
+| #named_captures | Hash of named captures. | <tt>{}</tt>. |
+
+<br>
+
+See examples below.
+
+#### Match Values Examples
+
+Successful basic match attempt (no captures):
+
+```rb
+scanner = StringScanner.new('foobarbaz')
+scanner.exist?(/bar/)
+put_match_values(scanner)
+# Basic match values:
+# matched?: true
+# matched_size: 3
+# pre_match: "foo"
+# matched : "bar"
+# post_match: "baz"
+# Captured match values:
+# size: 1
+# captures: []
+# named_captures: {}
+# values_at: ["bar", nil]
+# []:
+# [0]: "bar"
+# [1]: nil
+```
+
+Failed basic match attempt (no captures);
+
+```rb
+scanner = StringScanner.new('foobarbaz')
+scanner.exist?(/nope/)
+match_values_cleared?(scanner) # => true
+```
+
+Successful unnamed capture match attempt:
+
+```rb
+scanner = StringScanner.new('foobarbazbatbam')
+scanner.exist?(/(foo)bar(baz)bat(bam)/)
+put_match_values(scanner)
+# Basic match values:
+# matched?: true
+# matched_size: 15
+# pre_match: ""
+# matched : "foobarbazbatbam"
+# post_match: ""
+# Captured match values:
+# size: 4
+# captures: ["foo", "baz", "bam"]
+# named_captures: {}
+# values_at: ["foobarbazbatbam", "foo", "baz", "bam", nil]
+# []:
+# [0]: "foobarbazbatbam"
+# [1]: "foo"
+# [2]: "baz"
+# [3]: "bam"
+# [4]: nil
+```
+
+Successful named capture match attempt;
+same as unnamed above, except for #named_captures:
+
+```rb
+scanner = StringScanner.new('foobarbazbatbam')
+scanner.exist?(/(?<x>foo)bar(?<y>baz)bat(?<z>bam)/)
+scanner.named_captures # => {"x"=>"foo", "y"=>"baz", "z"=>"bam"}
+```
+
+Failed unnamed capture match attempt:
+
+```rb
+scanner = StringScanner.new('somestring')
+scanner.exist?(/(foo)bar(baz)bat(bam)/)
+match_values_cleared?(scanner) # => true
+```
+
+Failed named capture match attempt;
+same as unnamed above, except for #named_captures:
+
+```rb
+scanner = StringScanner.new('somestring')
+scanner.exist?(/(?<x>foo)bar(?<y>baz)bat(?<z>bam)/)
+match_values_cleared?(scanner) # => false
+scanner.named_captures # => {"x"=>nil, "y"=>nil, "z"=>nil}
+```
+
+## Fixed-Anchor Property
+
+Pattern matching in `StringScanner` is the same as in Ruby's,
+except for its fixed-anchor property,
+which determines the meaning of `'\A'`:
+
+* `false` (the default): matches the current byte position.
+
+ ```rb
+ scanner = StringScanner.new('foobar')
+ scanner.scan(/\A./) # => "f"
+ scanner.scan(/\A./) # => "o"
+ scanner.scan(/\A./) # => "o"
+ scanner.scan(/\A./) # => "b"
+ ```
+
+* `true`: matches the beginning of the target substring;
+ never matches unless the byte position is zero:
+
+ ```rb
+ scanner = StringScanner.new('foobar', fixed_anchor: true)
+ scanner.scan(/\A./) # => "f"
+ scanner.scan(/\A./) # => nil
+ scanner.reset
+ scanner.scan(/\A./) # => "f"
+ ```
+
+The fixed-anchor property is set when the `StringScanner` object is created,
+and may not be modified
+(see StringScanner.new);
+method #fixed_anchor? returns the setting.
+