External data representation and marshalling.


Inside a program, you refer to data through the data structures of your programming language: a linked list, a hash map, a binary tree, a set, a string, an integer. Outside the program, data lives in files on a filesystem or in network messages, and those formats are independent of whatever your favorite programming language is.
Most files are just a big sequence of bytes or characters, and you can’t tell what’s in them without extra information, such as a filename extension or “magic” bytes at the start of the file.
In general, there are far more ways to represent data inside a program than in a file on disk. So if you want to move information back and forth between your program and pretty much anything outside it (like a disk or a network packet), you need a systematic way to turn data structures (like sets or integers) into a stream of bytes that can later be turned back into the original structures. In other words, you need a reversible encoding for arbitrary data structures in the language of your choice. Producing that encoding is called marshalling, and the encoding you choose is your external data representation.

Marshalling: the process of taking a collection of data items and assembling them into a form suitable for transmission. Unmarshalling: disassembling that form on arrival to restore the original data items.
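
As a minimal sketch of these two definitions (not the format of any particular middleware), the following Java code marshals a string and an integer into a byte array and unmarshals them again. The class name MarshalSketch, its method names and the sample values are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MarshalSketch {
    // Marshal: flatten a (name, year) pair into a sequence of bytes.
    static byte[] marshal(String name, int year) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name); // 2-byte length, then the characters in modified UTF-8
            out.writeInt(year); // 4 bytes, most significant byte first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Unmarshal: read the items back in the same order they were written.
    static Object[] unmarshal(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int year = in.readInt();
            return new Object[] { name, year };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = marshal("Smith", 1984);
        Object[] back = unmarshal(wire);
        // prints: 11 bytes -> Smith, 1984
        System.out.println(wire.length + " bytes -> " + back[0] + ", " + back[1]);
    }
}
```

Note that the byte array carries no type information: the reader has to know that a string comes first and an integer second.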

Three alternative approaches to external data representation and marshalling:
  1. CORBA’s common data representation (CDR)
  2. Java’s object serialization
  3. XML (Extensible Markup Language)

Although we are interested in the use of external data representation for the arguments and results of RMIs and RPCs, it has a more general use for representing data structures, objects, or structured documents in a form suitable for transmission or storing in files.

CORBA CDR
15 primitive types, including: short, long, unsigned short, unsigned long, float, double, char, boolean, octet and any
Constructed types: sequence, string, array, struct, enum and union. Note that CDR does not deal with objects (only Java serialization does: single objects and trees of objects).

Type          Representation
sequence      length (unsigned long) followed by elements in order
string        length (unsigned long) followed by characters in order (can also have wide characters)
array         array elements in order (no length specified because it is fixed)
struct        in the order of declaration of the components
enumerated    unsigned long (the values are specified by the order declared)
union         type tag followed by the selected member

The type of a data item is not given in the marshalled form: it is assumed that the sender and the recipient have common knowledge of the order and types of the data items. The types of data structures and of basic data items are described in CORBA IDL, which provides a notation for describing the types of the arguments and results of RMI methods. For example:

struct Person {
    string name;
    string place;
    unsigned long year;
};
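
To make the representation table concrete, here is a rough Java sketch that lays out a Person in a CDR-like way: each string is a 4-byte length followed by its characters, padded with zeros to a 4-byte boundary, and the year is a 4-byte unsigned long. This is a simplification, not a conforming CDR encoder: as far as I know, real CDR also counts a terminating NUL character in the string length and lets the sender choose the byte order. The class CdrSketch, its helpers and the sample values are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class CdrSketch {
    // Round a byte count up to the next multiple of 4.
    static int padded(int n) {
        return (n + 3) / 4 * 4;
    }

    // string: length (unsigned long), characters in order, then zero padding.
    static void putString(ByteBuffer buf, byte[] chars) {
        buf.putInt(chars.length);
        buf.put(chars);
        for (int i = chars.length; i < padded(chars.length); i++) {
            buf.put((byte) 0);
        }
    }

    static String getString(ByteBuffer buf) {
        int length = buf.getInt();
        byte[] chars = new byte[length];
        buf.get(chars);
        buf.position(buf.position() + padded(length) - length); // skip padding
        return new String(chars, StandardCharsets.ISO_8859_1);
    }

    // struct: components in the order of declaration (name, place, year).
    static byte[] marshalPerson(String name, String place, long year) {
        byte[] n = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] p = place.getBytes(StandardCharsets.ISO_8859_1);
        int size = 4 + padded(n.length) + 4 + padded(p.length) + 4;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.BIG_ENDIAN);
        putString(buf, n);
        putString(buf, p);
        buf.putInt((int) year); // unsigned long: keep the low 32 bits
        return buf.array();
    }

    // The receiver must already know the order and types of the items.
    static Object[] unmarshalPerson(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        String name = getString(buf);
        String place = getString(buf);
        long year = buf.getInt() & 0xFFFFFFFFL; // read back as unsigned
        return new Object[] { name, place, year };
    }
}
```

With these assumptions, marshalPerson("Smith", "London", 1984) produces 28 bytes: 4 + 8 for the name, 4 + 8 for the place and 4 for the year. No type tags appear anywhere in the message.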

Java object serialization

Both objects and primitive data values may be passed as arguments and results of method invocations. The following Java class is equivalent to the Person struct defined in CORBA IDL.

import java.io.Serializable;

public class Person implements Serializable {
    private String name;
    private String place;
    private int year;

    public Person(String aName, String aPlace, int aYear) {
        name = aName;
        place = aPlace;
        year = aYear;
    }
    // followed by methods for accessing the instance variables
}

A class that implements the Serializable interface (provided in the java.io package) allows its instances to be serialized.
Serialization: flattening an object, or a connected set of objects, into a serial form suitable for storing on disk or transmitting in a message. Deserialization: restoring the state of the objects from their serialized form.
  • The process that deserializes is assumed to have no prior knowledge of the types of the objects in the serialized form
  • Therefore some information about the class of each object is included in the serialized form
Java objects can contain references to other objects. When an object is serialized:
  1. All the objects it references are serialized together with it
  2. References are serialized as handles
  • A handle is a reference to an object within the serialized form
  • Each object is written once only
  • On subsequent occurrences of the same object, its handle is written instead
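
One observable consequence of handles can be sketched as follows: if two fields refer to the same object before serialization, they still refer to a single object after deserialization. The class HandleSketch, its Node class and the roundTrip helper are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class HandleSketch {
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        String label;
        Node left;
        Node right;

        Node(String label) {
            this.label = label;
        }
    }

    // Serialize an object graph to bytes and read it straight back.
    static Object roundTrip(Object original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node shared = new Node("shared");
        Node root = new Node("root");
        root.left = shared;  // first occurrence: the object itself is written
        root.right = shared; // second occurrence: only a handle is written

        Node copy = (Node) roundTrip(root);
        System.out.println(copy.left == copy.right); // true: sharing is preserved
        System.out.println(copy.left == shared);     // false: the copy is a new object
    }
}
```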
To serialize an object, the following are written out:
  1. its class information: the class name and a version number
  2. the types and names of its instance variables
  3. the values of its instance variables
  • If an instance variable belongs to a new class, then the information for that class must also be written out, recursively
  • Each class is given a handle, so no class is written more than once
To make use of Java serialization:
  • To serialize: create an instance of ObjectOutputStream
  • Invoke its writeObject method, passing the Person object as argument
  • To deserialize: create an instance of ObjectInputStream
  • Invoke its readObject method to reconstruct the original object
ObjectOutputStream out = new ObjectOutputStream(…);
out.writeObject(originalPerson);
ObjectInputStream in = new ObjectInputStream(…);
Person thePerson = (Person) in.readObject(); // readObject returns Object, so a cast is needed
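
A self-contained version of this round trip might look like the sketch below. It assumes in-memory byte array streams in place of the elided stream arguments, which is my choice for the example and not something the notes specify; a file or socket stream would be used in the same way. The wrapper class SerializationRoundTrip and the getters are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationRoundTrip {
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        private final String place;
        private final int year;

        Person(String aName, String aPlace, int aYear) {
            name = aName;
            place = aPlace;
            year = aYear;
        }

        String getName() { return name; }
        String getPlace() { return place; }
        int getYear() { return year; }
    }

    // Serialize: wrap an output stream in ObjectOutputStream and call writeObject.
    static byte[] serialize(Person person) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(person);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize: wrap an input stream in ObjectInputStream and call readObject.
    static Person deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Person) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Person original = new Person("Smith", "London", 1984);
        Person copy = deserialize(serialize(original));
        System.out.println(copy.getName() + ", " + copy.getPlace() + ", " + copy.getYear());
    }
}
```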

Use of reflection
  • Reflection: the ability to inquire about the properties of a class, e.g., the names and types of its methods and instance variables.
  • It allows serialization and deserialization to be done in a generic manner, unlike in CORBA, which needs IDL specifications
For serialization, reflection is used to find out (1) the class name of the object to be serialized, (2) the names and types of its instance variables and (3) their values.
For deserialization, (1) the class name in the serialized form is used to create a class, (2) the class is then used to create a constructor with argument types corresponding to those specified in the serialized form, and (3) the new constructor is used to create a new object with instance variables whose values are read from the serialized form.
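
The serialization side of this can be illustrated with the java.lang.reflect API. The sketch below only collects the class name and the names, types and values of the instance variables of an arbitrary object; it is not how ObjectOutputStream is actually implemented. The class ReflectionSketch and its describe method are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class ReflectionSketch {
    static class Person {
        private String name;
        private String place;
        private int year;

        Person(String name, String place, int year) {
            this.name = name;
            this.place = place;
            this.year = year;
        }
    }

    // Describe any object generically, with no knowledge of its class at compile time.
    static List<String> describe(Object obj) {
        List<String> lines = new ArrayList<>();
        Class<?> cls = obj.getClass();
        lines.add("class " + cls.getName());                  // (1) class name
        for (Field field : cls.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers())) {
                continue;                                     // instance variables only
            }
            field.setAccessible(true);                        // allow reading private fields
            try {
                lines.add(field.getType().getSimpleName()     // (2) type and name
                        + " " + field.getName()
                        + " = " + field.get(obj));            // (3) value
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(new Person("Smith", "London", 1984))) {
            System.out.println(line);
        }
    }
}
```

The same describe method works for any class, which is the sense in which reflection makes serialization generic.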

Each process contains objects, some of which can receive remote invocations, while others can receive only local invocations. Those that can receive remote invocations are called remote objects. Both Java and CORBA support the distributed object model. An object needs to know the remote object reference of an object in another process in order to invoke its methods. The remote interface specifies which methods can be invoked remotely. Remote object references can be passed as arguments and can be compared to check whether they refer to the same remote object.

A remote object reference must be unique over space and time.
Over space: there may be many processes hosting remote objects.
Over time: a reference should not be reused after its object is deleted, because potential invokers may retain obsolete references.

One way to construct such a reference: (IP address + port #) + (time of creation + local object number)
  • the local object number is incremented each time an object is created in that process
  • it identifies the object within the process
  • if objects live only in the process that created them, the reference can be used as the address of the remote object
  • to allow remote objects to be relocated to a different process on a different computer, the reference cannot be used as an address
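
The four components can be pictured as a small value class. The following Java sketch is only an illustration of the scheme, not the reference format of Java RMI or CORBA; the class RemoteRefSketch and its field names are made up, and the creation time is taken here to be the moment the reference is created, which is my reading of "time of creation".

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

final class RemoteRefSketch {
    // Incremented each time an object is created in this process.
    private static final AtomicInteger NEXT_OBJECT_NUMBER = new AtomicInteger();

    final String ipAddress;  // unique over space: which computer
    final int port;          // unique over space: which process
    final long creationTime; // unique over time: never reused later
    final int objectNumber;  // identifies the object within the process

    RemoteRefSketch(String ipAddress, int port, long creationTime, int objectNumber) {
        this.ipAddress = ipAddress;
        this.port = port;
        this.creationTime = creationTime;
        this.objectNumber = objectNumber;
    }

    // Create a fresh reference for a new remote object in this process.
    static RemoteRefSketch create(String ipAddress, int port) {
        return new RemoteRefSketch(ipAddress, port,
                System.currentTimeMillis(), NEXT_OBJECT_NUMBER.getAndIncrement());
    }

    // Two references are equal exactly when all four components match.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof RemoteRefSketch)) {
            return false;
        }
        RemoteRefSketch that = (RemoteRefSketch) other;
        return ipAddress.equals(that.ipAddress) && port == that.port
                && creationTime == that.creationTime && objectNumber == that.objectNumber;
    }

    @Override
    public int hashCode() {
        return Objects.hash(ipAddress, port, creationTime, objectNumber);
    }

    @Override
    public String toString() {
        return ipAddress + ":" + port + "/" + creationTime + "/" + objectNumber;
    }
}
```

Comparing references with equals is what lets a receiver decide whether two references it holds denote the same remote object.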

XML (Extensible Markup Language)


XML stands for Extensible Markup Language. “Extensible” means that the language is a shell, or skeleton, that anyone can extend to create additional ways to use XML. “Markup” means that XML's primary task is to give definition to text and symbols.

XML data items are tagged with markup strings. The tags are used to describe the logical structure of the data and to associate attribute-value pairs with logical structures.
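
As an illustration of tags and attribute-value pairs, the Java sketch below parses a small XML description of a person with the JDK's built-in DOM parser. The element names, the id attribute and the values are invented for the example, as are the class XmlPersonSketch and its helpers.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlPersonSketch {
    // Tags give the logical structure; id="..." is an attribute-value pair.
    static final String PERSON_XML =
            "<person id=\"123456789\">"
            + "<name>Smith</name>"
            + "<place>London</place>"
            + "<year>1984</year>"
            + "</person>";

    // Parse an XML string and return its root element.
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Text content of the first descendant element with the given tag name.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element person = parse(PERSON_XML);
        System.out.println(person.getTagName() + " id=" + person.getAttribute("id"));
        System.out.println(childText(person, "name") + ", "
                + childText(person, "place") + ", " + childText(person, "year"));
    }
}
```

Because each value is found by its tag name, the reader does not depend on the order of the elements, and extra elements it does not ask for are simply ignored.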

XML is used to enable clients to communicate with web services and to define the interfaces and other properties of web services. It is also used in many other ways, including archiving and retrieval systems: although an XML archive may be larger than a binary one, it has the advantage of being readable on any computer.

XML is extensible in the sense that users can define their own tags. If an XML document is intended to be used by more than one application, then the names of the tags must be agreed between them.
Some external data representations, such as CORBA CDR, don’t need to be self-describing, because it is assumed that the client and server exchanging a message have prior knowledge of the order and the types of the information it contains. XML, by contrast, was intended to be used by multiple applications for different purposes. The provision of tags, together with the use of namespaces to define the meaning of the tags, has made this possible. In addition, the use of tags enables an application to select just those parts of a document it needs to process, so it is not affected by the addition of information relevant to other applications.

XML documents, being textual, can be read by humans. In practice, most XML documents are generated and read by XML processing software, but the ability to read XML can be useful when things go wrong. The use of text also makes XML independent of any particular platform. On the other hand, the use of a textual rather than binary representation, together with the use of tags, makes messages large, so they require longer processing and transmission times as well as more space to store. Messages in CORBA CDR are therefore more efficient than the same messages in the SOAP XML format.

In the first two approaches, marshalling and unmarshalling are intended to be carried out by a middleware layer without any involvement on the part of the application programmer. Even in the case of XML, which is textual and therefore more accessible to hand-encoding, software for marshalling and unmarshalling is available for all commonly used platforms and programming environments. Because marshalling requires consideration of the finest details of the representation of the primitive components of composite objects, the process is likely to be error-prone if done by hand. Another concern is compactness, which can be addressed in the design of automatically generated marshalling procedures. In the first two approaches the primitive data types are marshalled into a binary form, but in XML they are represented textually. The textual representation of a data value will generally be longer than the equivalent binary encoding.
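
The size difference is easy to check for a single integer. The Java sketch below compares a 4-byte binary encoding with a plain decimal string and with the same string wrapped in a tag; the class EncodingSizeSketch, the tag name and the sample value are arbitrary choices for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {
    // Binary form: always 4 bytes for a Java int.
    static byte[] asBinary(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Textual form: one byte per decimal digit (plus a sign if negative).
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    }

    // Textual form wrapped in an opening and closing tag, as in XML.
    static byte[] asXml(String tag, int value) {
        String element = "<" + tag + ">" + value + "</" + tag + ">";
        return element.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int value = 1234567890;
        System.out.println("binary: " + asBinary(value).length + " bytes");      // 4
        System.out.println("text:   " + asText(value).length + " bytes");        // 10
        System.out.println("xml:    " + asXml("year", value).length + " bytes"); // 23
    }
}
```

The word "generally" matters: a small value such as 7 takes one byte as text but still four bytes in this binary form, although the surrounding tags usually outweigh that saving.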

Another issue in the design of a marshalling method is whether the marshalled data should include information about the types of its contents. CORBA’s representation includes nothing about types; it contains only the values of the transmitted objects. Java’s object serialization and XML do include type information, but in different ways. Java puts all the required type information into the serialized form, whereas XML documents may refer to externally defined sets of names called namespaces.

Although external data representation is discussed here for the arguments and results of RMIs and RPCs, it has a more general use for representing data structures, objects or structured documents in a form suitable for transmission in messages or storing in files.
