summaryrefslogtreecommitdiff
path: root/doc/marshal.rdoc
diff options
context:
space:
mode:
Diffstat (limited to 'doc/marshal.rdoc')
-rw-r--r--doc/marshal.rdoc313
1 files changed, 313 insertions, 0 deletions
diff --git a/doc/marshal.rdoc b/doc/marshal.rdoc
new file mode 100644
index 0000000..f97db00
--- /dev/null
+++ b/doc/marshal.rdoc
@@ -0,0 +1,313 @@
+= Marshal Format
+
+The Marshal format is used to serialize ruby objects. The format can store
+arbitrary objects through three user-defined extension mechanisms.
+
+For documentation on using Marshal to serialize and deserialize objects, see
+the Marshal module.
+
+This document calls a serialized set of objects a stream. The Ruby
+implementation can load a set of objects from a String, an IO or an object
+that implements a +getc+ method.
+
+== Stream Format
+
+The first two bytes of the stream contain the major and minor version, each as
+a single byte encoding a digit. The version implemented in Ruby is 4.8
+(stored as "\x04\x08") and is supported by ruby 1.8.0 and newer.
+
+Different major versions of the Marshal format are not compatible and cannot
+be understood by other major versions. Lesser minor versions of the format
+can be understood by newer minor versions. Format 4.7 can be loaded by a 4.8
+implementation but format 4.8 cannot be loaded by a 4.7 implementation.
+
+Following the version bytes is a stream describing the serialized object. The
+stream contains nested objects (the same as a Ruby object) but objects in the
+stream do not necessarily have a direct mapping to the Ruby object model.
+
+Each object in the stream is described by a byte indicating its type followed
+by one or more bytes describing the object. When "object" is mentioned below
+it means any of the types below that defines a Ruby object.
+
+=== true, false, nil
+
+These objects are each one byte long. "T" is represents +true+, "F"
+represents +false+ and "0" represents +nil+.
+
+=== Fixnum and long
+
+"i" represents a signed 32 bit value using a packed format. One through five
+bytes follows the type. The value loaded will always be a Fixnum. On
+32 bit platforms (where the precision of a Fixnum is less than 32 bits)
+loading large values will cause overflow on CRuby.
+
+The fixnum type is used to represent both ruby Fixnum objects and the sizes of
+marshaled arrays, hashes, instance variables and other types. In the
+following sections "long" will mean the format described below, which supports
+full 32 bit precision.
+
+The first byte has the following special values:
+
+"\x00"::
+ The value of the integer is 0. No bytes follow.
+
+"\x01"::
+ The total size of the integer is two bytes. The following byte is a
+ positive integer in the range of 0 through 255. Only values between 123
+ and 255 should be represented this way to save bytes.
+
+"\xff"::
+ The total size of the integer is two bytes. The following byte is a
+ negative integer in the range of -1 through -256.
+
+"\x02"::
+ The total size of the integer is three bytes. The following two bytes are a
+ positive little-endian integer.
+
+"\xfe"::
+ The total size of the integer is three bytes. The following two bytes are a
+ negative little-endian integer.
+
+"\x03"::
+ The total size of the integer is four bytes. The following three bytes are
+ a positive little-endian integer.
+
+"\xfd"::
+ The total size of the integer is two bytes. The following three bytes are a
+ negative little-endian integer.
+
+"\x04"::
+ The total size of the integer is five bytes. The following four bytes are a
+ positive little-endian integer. For compatibility with 32 bit ruby,
+ only Fixnums less than 1073741824 should be represented this way. For sizes
+ of stream objects full precision may be used.
+
+"\xfc"::
+ The total size of the integer is two bytes. The following four bytes are a
+ negative little-endian integer. For compatibility with 32 bit ruby,
+ only Fixnums greater than -10737341824 should be represented this way. For
+ sizes of stream objects full precision may be used.
+
+Otherwise the first byte is a sign-extended eight-bit value with an offset.
+If the value is positive the value is determined by subtracting 5 from the
+value. If the value is negative the value is determined by adding 5 to the
+value.
+
+There are multiple representations for many values. CRuby always outputs the
+shortest representation possible.
+
+=== Symbols and Byte Sequence
+
+":" represents a real symbol. A real symbol contains the data needed to
+define the symbol for the rest of the stream as future occurrences in the
+stream will instead be references (a symbol link) to this one. The reference
+is a zero-indexed 32 bit value (so the first occurrence of <code>:hello</code>
+is 0).
+
+Following the type byte is byte sequence which consists of a long indicating
+the number of bytes in the sequence followed by that many bytes of data. Byte
+sequences have no encoding.
+
+For example, the following stream contains the Symbol <code>:hello</code>:
+
+ "\x04\x08:\x0ahello"
+
+";" represents a Symbol link which references a previously defined Symbol.
+Following the type byte is a long containing the index in the lookup table for
+the linked (referenced) Symbol.
+
+For example, the following stream contains <code>[:hello, :hello]</code>:
+
+ "\x04\b[\a:\nhello;\x00"
+
+When a "symbol" is referenced below it may be either a real symbol or a
+symbol link.
+
+=== Object References
+
+Separate from but similar to symbol references, the stream contains only one
+copy of each object (as determined by #object_id) for all objects except
+true, false, nil, Fixnums and Symbols (which are stored separately as
+described above) a one-indexed 32 bit value will be stored and reused when the
+object is encountered again. (The first object has an index of 1).
+
+"@" represents an object link. Following the type byte is a long giving the
+index of the object.
+
+For example, the following stream contains an Array of the object
+<code>"hello"</code> twice:
+
+ "\004\b[\a\"\nhello@\006"
+
+=== Instance Variables
+
+"I" indicates that instance variables follow the next object. An object
+follows the type byte. Following the object is a length indicating the number
+of instance variables for the object. Following the length is a set of
+name-value pairs. The names are symbols while the values are objects. The
+symbols must be instance variable names (<code>:@name</code>).
+
+An Object ("o" type, described below) uses the same format for its instance
+variables as described here.
+
+For a String and Regexp (described below) a special instance variable
+<code>:E</code> is used to indicate the Encoding.
+
+=== Extended
+
+"e" indicates that the next object is extended by a module. An object follows
+the type byte. Following the object is a symbol that contains the name of the
+module the object is extended by.
+
+=== Array
+
+"[" represents an Array. Following the type byte is a long indicating the
+number of objects in the array. The given number of objects follow the
+length.
+
+=== Bignum
+
+"l" represents a Bignum which is composed of three parts:
+
+sign::
+ A single byte containing "+" for a positive value or "-" for a negative
+ value.
+length::
+ A long indicating the number of bytes of Bignum data follows, divided by
+ two. Multiply the length by two to determine the number of bytes of data
+ that follow.
+data::
+ Bytes of Bignum data representing the number.
+
+The following ruby code will reconstruct the Bignum value from an array of
+bytes:
+
+ result = 0
+
+ bytes.each_with_index do |byte, exp|
+ result += (byte * 2 ** (exp * 8))
+ end
+
+=== Class and Module
+
+"c" represents a Class object, "m" represents a Module and "M" represents
+either a class or module (this is an old-style for compatibility). No class
+or module content is included, this type is only a reference. Following the
+type byte is a byte sequence which is used to look up an existing class or
+module, respectively.
+
+Instance variables are not allowed on a class or module.
+
+If no class or module exists an exception should be raised.
+
+For "c" and "m" types, the loaded object must be a class or module,
+respectively.
+
+=== Data
+
+"d" represents a Data object. (Data objects are wrapped pointers from ruby
+extensions.) Following the type byte is a symbol indicating the class for the
+Data object and an object that contains the state of the Data object.
+
+To dump a Data object Ruby calls _dump_data. To load a Data object Ruby calls
+_load_data with the state of the object on a newly allocated instance.
+
+=== Float
+
+"f" represents a Float object. Following the type byte is a byte sequence
+containing the float value. The following values are special:
+
+"inf"::
+ Positive infinity
+
+"-inf"::
+ Negative infinity
+
+"nan"::
+ Not a Number
+
+Otherwise the byte sequence contains a C double (loadable by strtod(3)).
+Older minor versions of Marshal also stored extra mantissa bits to ensure
+portability across platforms but 4.8 does not include these. See
+[ruby-talk:69518] for some explanation.
+
+=== Hash and Hash with Default Value
+
+"{" represents a Hash object while "}" represents a Hash with a default value
+set (<code>Hash.new 0</code>). Following the type byte is a long indicating
+the number of key-value pairs in the Hash, the size. Double the given number
+of objects follow the size.
+
+For a Hash with a default value, the default value follows all the pairs.
+
+=== Module and Old Module
+
+=== Object
+
+"o" represents an object that doesn't have any other special form (such as
+a user-defined or built-in format). Following the type byte is a symbol
+containing the class name of the object. Following the class name is a long
+indicating the number of instance variable names and values for the object.
+Double the given number of pairs of objects follow the size.
+
+The keys in the pairs must be symbols containing instance variable names.
+
+=== Regular Expression
+
+"/" represents a regular expression. Following the type byte is a byte
+sequence containing the regular expression source. Following the type byte is
+a byte containing the regular expression options (case-insensitive, etc.) as a
+signed 8-bit value.
+
+Regular expressions can have an encoding attached through instance variables
+(see above). If no encoding is attached escapes for the following regexp
+specials not present in ruby 1.8 must be removed: g-m, o-q, u, y, E, F, H-L,
+N-V, X, Y.
+
+=== String
+
+'"' represents a String. Following the type byte is a byte sequence
+containing the string content. When dumped from ruby 1.9 an encoding instance
+variable (<code>:E</code> see above) should be included unless the encoding is
+binary.
+
+=== Struct
+
+"S" represents a Struct. Following the type byte is a symbol containing the
+name of the struct. Following the name is a long indicating the number of
+members in the struct. Double the number of objects follow the member count.
+Each member is a pair containing the member's symbol and an object for the
+value of that member.
+
+If the struct name does not match a Struct subclass in the running ruby an
+exception should be raised.
+
+If there is a mismatch between the struct in the currently running ruby and
+the member count in the marshaled struct an exception should be raised.
+
+=== User Class
+
+"C" represents a subclass of a String, Regexp, Array or Hash. Following the
+type byte is a symbol containing the name of the subclass. Following the name
+is the wrapped object.
+
+=== User Defined
+
+"u" represents an object with a user-defined serialization format using the
++_dump+ instance method and +_load+ class method. Following the type byte is
+a symbol containing the class name. Following the class name is a byte
+sequence containing the user-defined representation of the object.
+
+The class method +_load+ is called on the class with a string created from the
+byte-sequence.
+
+=== User Marshal
+
+"U" represents an object with a user-defined serialization format using the
++marshal_dump+ and +marshal_load+ instance methods. Following the type byte
+is a symbol containing the class name. Following the class name is an object
+containing the data.
+
+Upon loading a new instance must be allocated and +marshal_load+ must be
+called on the instance with the data.
+