Diffstat (limited to 'doc/encoding.rdoc')
1 files changed, 169 insertions, 1 deletions
diff --git a/doc/encoding.rdoc b/doc/encoding.rdoc
index 6f663b14cd..490066b5df 100644
--- a/doc/encoding.rdoc
+++ b/doc/encoding.rdoc
@@ -132,7 +132,175 @@ returns the \Encoding of the concatenated string, or +nil+ if incompatible:
s1 = "\xa1\xa1".force_encoding('euc-jp') # => "\x{A1A1}"
Encoding.compatible?(s0, s1) # => nil
-==== \Encoding Options
+=== \String \Encoding
+A Ruby String object has an encoding that is an instance of class \Encoding.
+The encoding may be retrieved by method String#encoding.
+The default encoding for a string literal is the script encoding
+(see Encoding@Script+encoding):
+ 's'.encoding # => #<Encoding:UTF-8>
+The default encoding for a string created with method String.new is:
+- For a \String object argument, the encoding of that string.
+- For a string literal, the script encoding (see Encoding@Script+encoding).
+In either case, any encoding may be specified:
+ s = String.new(encoding: 'UTF-8') # => ""
+ s.encoding # => #<Encoding:UTF-8>
+ s = String.new('foo', encoding: 'ASCII-8BIT') # => "foo"
+ s.encoding # => #<Encoding:ASCII-8BIT>
+The encoding for a string may be changed:
+ s = "R\xC3\xA9sum\xC3\xA9" # => "Résumé"
+ s.encoding # => #<Encoding:UTF-8>
+ s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9"
+ s.encoding # => #<Encoding:ISO-8859-1>
+Changing the assigned encoding does not alter the content of the string;
+it changes only the way the content is to be interpreted:
+ s # => "R\xC3\xA9sum\xC3\xA9"
+ s.force_encoding('UTF-8') # => "Résumé"
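As a rough illustration of the point above (a sketch, not part of the original text), the following compares the bytes and the character count of one string under two assigned encodings. The variable names are illustrative only.

```ruby
# The bytes of "Résumé" in UTF-8 (each "é" is the two bytes C3 A9).
utf8_bytes = [0x52, 0xC3, 0xA9, 0x73, 0x75, 0x6D, 0xC3, 0xA9]
s = utf8_bytes.pack('C*').force_encoding('UTF-8')

bytes_before  = s.bytes
chars_as_utf8 = s.length      # characters when the bytes are read as UTF-8

s.force_encoding('ISO-8859-1')

bytes_after     = s.bytes
chars_as_latin1 = s.length    # characters when the same bytes are read as ISO-8859-1

same_bytes = (bytes_before == bytes_after)
```

The bytes are identical before and after; only the character count differs, because ISO-8859-1 treats every byte as one character.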
+The actual content of a string may also be altered;
+see {Transcoding a String}[#label-Transcoding+a+String].
+Here are a couple of useful query methods:
+ s = "abc".force_encoding("UTF-8") # => "abc"
+ s.ascii_only? # => true
+ s = "abc\u{6666}".force_encoding("UTF-8") # => "abc晦"
+ s.ascii_only? # => false
+ s = "\xc2\xa1".force_encoding("UTF-8") # => "¡"
+ s.valid_encoding? # => true
+ s = "\xc2".force_encoding("UTF-8") # => "\xC2"
+ s.valid_encoding? # => false
+=== \Symbol and \Regexp Encodings
+The string stored in a Symbol or Regexp object also has an encoding;
+the encoding may be retrieved by method Symbol#encoding or Regexp#encoding.
+The default encoding for these, however, is:
+- US-ASCII, if all characters are US-ASCII.
+- The script encoding, otherwise (see Encoding@Script+encoding).
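A minimal sketch of the two cases listed above. The non-ASCII values are built with <tt>\u</tt> escapes, which always yield UTF-8; this matches the script encoding only under the assumption that the script encoding is the default UTF-8.

```ruby
# All characters are US-ASCII: the encoding is US-ASCII.
ascii_symbol_encoding = :foo.encoding
ascii_regexp_encoding = /foo/.encoding

# A non-ASCII character is present: the encoding is no longer US-ASCII.
non_ascii_symbol_encoding = "\u00e9".to_sym.encoding
non_ascii_regexp_encoding = Regexp.new("\u00e9").encoding
```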
+=== Filesystem \Encoding
+The filesystem encoding is the default \Encoding for a string from the filesystem:
+ Encoding.find("filesystem") # => #<Encoding:UTF-8>
+=== Locale \Encoding
+The locale encoding is the default encoding for a string from the environment,
+other than from the filesystem:
+ Encoding.find('locale') # => #<Encoding:IBM437>
+=== \IO Encodings
+An IO object (an input/output stream), and by inheritance a File object,
+has at least one, and sometimes two, encodings:
+- Its _external_ _encoding_ identifies the encoding of the stream.
+- Its _internal_ _encoding_, if not +nil+, specifies the encoding
+ to be used for the string constructed from the stream.
+==== External \Encoding
+Bytes read from the stream are decoded into characters via the external encoding;
+by default (that is, if the internal encoding is +nil+),
+those characters become a string whose encoding is set to the external encoding.
+The default external encoding is:
+- UTF-8 for a text stream.
+- ASCII-8BIT for a binary stream.
+ f = File.new('t.rus', 'rb')
+ f.external_encoding # => #<Encoding:ASCII-8BIT>
+The external encoding may be set by the open option +external_encoding+:
+ f = File.new('t.txt', external_encoding: 'ASCII-8BIT')
+ f.external_encoding # => #<Encoding:ASCII-8BIT>
+The external encoding may also be set by method #set_encoding:
+ f = File.new('t.txt')
+ f.set_encoding('ASCII-8BIT')
+ f.external_encoding # => #<Encoding:ASCII-8BIT>
+==== Internal \Encoding
+If not +nil+, the internal encoding specifies that the characters read
+from the stream are to be converted to characters in the internal encoding;
+those characters become a string whose encoding is set to the internal encoding.
+The default internal encoding is +nil+ (no conversion).
+The internal encoding may be set by the open option +internal_encoding+:
+ f = File.new('t.txt', internal_encoding: 'ASCII-8BIT')
+ f.internal_encoding # => #<Encoding:ASCII-8BIT>
+The internal encoding may also be set by method #set_encoding:
+ f = File.new('t.txt')
+ f.set_encoding('UTF-8', 'ASCII-8BIT')
+ f.internal_encoding # => #<Encoding:ASCII-8BIT>
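To make the interplay of the two encodings concrete, here is a hedged sketch that writes a single ISO-8859-1 byte to a temporary file and reads it back twice: once with only an external encoding, and once with an internal encoding as well. The file name and variable names are hypothetical, and the sketch assumes Encoding.default_internal has been left at +nil+.

```ruby
require 'tmpdir'

external_only, transcoded = Dir.mktmpdir do |dir|
  path = File.join(dir, 't.txt')            # hypothetical file name
  File.binwrite(path, [0xE9].pack('C'))     # byte E9 is "é" in ISO-8859-1

  # External encoding only: the string keeps the stream's bytes and encoding.
  a = File.open(path, external_encoding: 'ISO-8859-1', &:read)

  # External and internal: characters are converted to the internal encoding.
  b = File.open(path, external_encoding: 'ISO-8859-1',
                      internal_encoding: 'UTF-8', &:read)
  [a, b]
end

external_only.encoding  # the external encoding
transcoded.encoding     # the internal encoding
```

In the first read the string holds the original byte; in the second the same character is stored as the two-byte UTF-8 sequence C3 A9.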
+=== Script \Encoding
+A Ruby script has a script encoding, which may be retrieved by:
+ __ENCODING__ # => #<Encoding:UTF-8>
+The default script encoding is UTF-8;
+a Ruby source file may set its script encoding with a magic comment
+on the first line of the file (or second line, if there is a shebang on the first).
+The comment must contain the word +coding+ or +encoding+,
+followed by a colon, space and the Encoding name or alias:
+ # encoding: ISO-8859-1
+ __ENCODING__ # => #<Encoding:ISO-8859-1>
+=== Transcoding
+_Transcoding_ is the process of revising the content of a string or stream
+by changing its encoding.
+==== Transcoding a \String
+Each of these methods transcodes a string:
+String#encode :: Transcodes a string into a new string
+ according to a given destination encoding,
+ a given or default source encoding, and encoding options.
+String#encode! :: Like String#encode,
+ but transcodes the string in place.
+String#scrub :: Transcodes a string into a new string
+ by replacing invalid byte sequences
+ with a given or default replacement string.
+String#scrub! :: Like String#scrub, but transcodes the string in place.
+String#unicode_normalize :: Transcodes a string into a new string
+ according to Unicode normalization.
+String#unicode_normalize! :: Like String#unicode_normalize,
+ but transcodes the string in place.
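The methods listed above can be sketched briefly as follows. This is an illustrative example rather than part of the original text; strings are built from explicit bytes so that the result does not depend on the script encoding, and the variable names are assumptions.

```ruby
# String#encode: "Résumé" stored as ISO-8859-1 (é is the single byte E9),
# transcoded to UTF-8 (é becomes the two bytes C3 A9).
latin1 = [0x52, 0xE9, 0x73, 0x75, 0x6D, 0xE9].pack('C*').force_encoding('ISO-8859-1')
utf8   = latin1.encode('UTF-8')

# String#scrub: a lone C2 byte is an invalid UTF-8 sequence;
# scrub replaces it with the given replacement string.
broken   = [0x61, 0xC2, 0x62].pack('C*').force_encoding('UTF-8')
scrubbed = broken.scrub('?')

# String#unicode_normalize: "e" followed by a combining acute accent
# is composed into the single character U+00E9 under NFC.
decomposed = "e\u0301"
composed   = decomposed.unicode_normalize(:nfc)
```

Unlike String#force_encoding, each of these returns a string whose bytes differ from the receiver's; the bang variants apply the same change in place.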
+=== \Encoding Options
A number of methods in the Ruby core accept keyword arguments as encoding options.