Mark Overmeer

XML and Perl

 

This article was published in the Summer 2012 edition of $foo, Perl Magazin, where it got translated into German. German version (PDF, 190kB)

Perl as glue language

When the internet was young, Perl became the ultimate glue language between the operating system, databases and websites. It was not that hard to surpass the feature list of awk, shell and sed; Perl soon became a major player. However, you do not stay on top automatically: a language has to evolve with the world around it. Perl has lost its leading position in many areas, for instance by totally ignoring XML-based standards.

Not without reason, XML has a bad name among Perl programmers. Perl-people like programs which are powerful, DWIM, and run efficiently. The XML environment is extremely verbose, formal, and often designed by people with little programming experience.

For instance, XML schemas are clearly designed by librarians. The default type for an element is "anyType", which in programmer-speak is short for: "I don't care, it's your problem now". Programmers start with a well-defined single bit, then glue bits into bytes, and so on.

So, Perl-people hate XML by nature. "They" fell in love with YAML, which seemed very simple and was therefore much preferred. Rule 1 of programming: when you think that something new is much better and faster than something existing, then you probably do not understand all the complications. So, there is currently no full support for the new YAML standard in Perl. The new ideal pet syntax is JSON; YAML has been dismissed.

Introducing XML::Compile

Meanwhile, the `Professional World' has standardized on XML. If Perl wants to regain some terrain as a generic glue language, it should provide implementations for most XML-based standards. There are only very few of those on CPAN, probably because there was no up-to-standard basic library to implement them with.

On CPAN, you can find a few dozen modules which can read and write XML, but those only deal with XML in the simplest way. Modern XML protocols are schema based: types and structure are described in detail. The schemas grow larger and larger. It is horrible to translate the hundreds of elements in such a standard into Perl statement by statement, node by node, as most CPAN modules require.

Since 2006, I have been working on intelligent support for XML schemas, and later also for the SOAP and WSDL standards. Schemas are large and Perl is relatively slow, so I decided to put the burden of schema processing in the initialization phase of the program, to make the run-time fast. In other implementations (like Java libraries) messages get matched against the schema at run-time; they run interpreted. XML::Compile, however, translates schemas into reusable code references which can translate XML into Perl, Perl into XML, and even produce example XML messages and Perl hashes.
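
The compile-once, run-often idea can be sketched in plain Perl. This is a toy, not how XML::Compile is implemented: the schema here is just a hypothetical HASH of element names and two simple types, and all names (`%toy_schema`, `compile_toy_reader`) are made up for the illustration.

```perl
use strict;
use warnings;

# Toy "schema": element name => simple type (hypothetical, for illustration).
my %toy_schema = (number => 'string', amount => 'int');

# One converter per simple type; each validates and returns a Perl value.
my %converter = (
    string => sub { $_[0] },
    int    => sub {
        $_[0] =~ /^[+-]?[0-9]+$/ or die "not an integer: $_[0]\n";
        $_[0] + 0;
    },
);

# Compile once: resolve all type lookups now and return a code reference.
sub compile_toy_reader {
    my ($schema) = @_;
    my %handler;
    for my $element (keys %$schema) {
        $handler{$element} = $converter{ $schema->{$element} }
            or die "unknown type for element $element\n";
    }

    # Run often: the closure only dispatches, it never consults the schema.
    return sub {
        my ($raw) = @_;
        my %data;
        for my $element (keys %$raw) {
            my $convert = $handler{$element}
                or die "unexpected element $element\n";
            $data{$element} = $convert->($raw->{$element});
        }
        return \%data;
    };
}

my $reader = compile_toy_reader(\%toy_schema);
my $data   = $reader->({number => 'PO1234', amount => '42'});
print "$data->{number}: $data->{amount}\n";
```

All the expensive decisions are taken while the closure is built; each message only pays for the conversions themselves.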

Reading XML

Actually, schema-driven reading is quite simple with XML::Compile:

  # compile once
  use XML::Compile::Schema;
  my $schema = XML::Compile::Schema->new($schemafn);
  my $reader = $schema->compile(READER => $type);

  # run often
  my $hash   = $reader->($xml);

  # Data::Dumper is your friend
  use Data::Dumper;
  print Dumper $hash;

The $xml to be processed can be XML as a string, the name of a file containing XML, or an XML::LibXML::Document. XML::Compile is based on XML::LibXML, an XS wrapper around GNOME's libxml2. Be sure to have a recent version of both installed, because older versions of both contain many bugs.

A bit harder is the $type. You have to figure out which element in the schema is the top-level element of your message, maybe from the documentation. These elements have both a namespace and a name within that namespace: a pair. Pairs are not very practical to pass around. Therefore, you construct $type like this:

  my $type = "{$namespace}$name";

  # cleaner
  use XML::Compile::Util 'pack_type';
  my $type = pack_type $namespace, $name;

After some months, I figured out that it is quite inconvenient to pass these compiled code references around between functions and modules. It would be much nicer to have a kind of generic schema manager. So a new module was added to collect these compiled structures. This module also simplifies your type specification with prefixes. The first example rewritten:

  # compile once
  use XML::Compile::Cache;
  my $schema = XML::Compile::Cache->new($schemafn, allow_undeclared => 1);
  $schema->prefix(xyz => $namespace);

  # run often
  my $hash = $schema->reader("xyz:$name")->($xml);

The reader will be compiled the first time it is used. This is not what you want in daemons: there you always want to compile all possible readers before you start forking off children. In that case, use this:

  # compile once in the parent
  use XML::Compile::Cache;
  my $schema = XML::Compile::Cache->new($schemafn);
  $schema->prefix(xyz => 'http://something');
  $schema->declare(READER => "xyz:$name");
  $schema->compileAll;

  ...fork...
  # run often in any child
  my $hash = $schema->reader("xyz:$name")->($xml);

Validation

All elements in the $xml are validated during reading. At the same time, they get converted to useful Perl values. For instance, for some datatypes, optional blanks need to be removed (whitespace `collapse'). The dateTime values are translated into convenient time ticks, base64 encoded data gets decoded, booleans `false' and `true' become `0' and `1', and so on.
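
To give an impression of what such conversions involve, here is a minimal sketch in core Perl. It is not XML::Compile's code; the function names are invented for this example, and only whitespace collapse, booleans and base64 are shown.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Whitespace `collapse': squeeze runs of XML blanks, then trim both ends.
sub collapse_ws {
    my ($value) = @_;
    $value =~ s/[\x20\t\r\n]+/ /g;
    $value =~ s/^ //;
    $value =~ s/ $//;
    return $value;
}

# Schema boolean: true/1 and false/0 are the only legal forms.
sub read_boolean {
    my $value = collapse_ws(shift);
    return 1 if $value eq 'true'  || $value eq '1';
    return 0 if $value eq 'false' || $value eq '0';
    die "illegal boolean value: $value\n";
}

# base64Binary: decode to the raw bytes.
sub read_base64 {
    my ($value) = @_;
    return decode_base64($value);
}

print read_boolean("  true\n"), "\n";     # prints 1
print read_base64('aWpzamU='), "\n";      # prints ijsje
```

Every value-level rule like these is one less check in your own program.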

A few parameters can be used to tune the decoding. With their defaults:

  validation      => true
  check_values    => true
  check_occurs    => true
  ignore_facets   => false
  sloppy_integers => false
  sloppy_floats   => false

The latter two can often be applied. The schema types "integer" and "float" allow values which do not fit in Perl's integers and floats, so XML::Compile puts those in inefficient Math::BigInt and Math::BigFloat objects. In many cases, the schema designer would have done better to use types like `unsignedShort'. Being sloppy here means: you promise that there are no big values.

  $schema->declare(READER => "xyz:$name", sloppy_floats => 1);
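
Why the careful default matters can be shown with core Perl alone. The sketch below is independent of XML::Compile; the 20-digit value is chosen for illustration because it is legal for the schema type "integer" but does not fit in a 64-bit native integer.

```perl
use strict;
use warnings;
use Math::BigInt;

# Legal for schema type "integer", too large for a 64-bit native integer.
my $text = '18446744073709551617';

# Exact arithmetic, at the cost of one object per value.
my $exact = Math::BigInt->new($text)->badd(1)->bstr;

# Native arithmetic: Perl typically falls back to floating point here
# and drops the low digits without any warning.
my $native = $text + 1;

print "exact:  $exact\n";
print "native: $native\n";
```

Being sloppy is safe only when you know that such values will never appear in your messages.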

My typical program contains many input validation checks, on the structure of the data and the sanity of the values. In my code, about 3/4 of the logic is needed for this task: producing and handling input parameters of scripts and functions. However, when the schema is strict, the XML is (automatically) validated before it enters your program. Far fewer tests are needed. For instance, when the schema contains:

  <element name="gender">
    <simpleType>
      <restriction base="token">
        <enumeration value="male"    />
        <enumeration value="female"  />
        <enumeration value="unknown" />
      </restriction>
    </simpleType>
  </element>

XML::Compile automatically produces an error message on gender "none", so your program doesn't need to verify it anymore. Reality is harder: many schemas are unspecific.

Nested HASHes

The result of reading is a single (sometimes very deep) structure of nested HASHes and ARRAYs. Let's have a look at the translation of the usual constructs. XML schemas offer the following building blocks:

  <record>
    <price>3.14</price>
    <name lang="nl">ijsje</name>
    <order number="PO1234">
      <paid>false</paid>
    </order>
  </record>

The first element within the record is a "simple" type. Then comes a simple type with attributes (complexType with simpleContent), followed by a container (complexType, which may have attributes). Only the middle case is a bit difficult to map onto a single HASH: the value has no name. The reader will return the example above as:

  record =>
    { price => Math::BigFloat->new('3.14')
    , name  => {lang => 'nl', _ => 'ijsje'}
    , order => {number => 'PO1234', paid => 0}
    }

When an element may be repeated, it will always return an ARRAY. For instance:

  <element name="x" type="int" maxOccurs="unbounded" />
  x => [ 1, 2, 3 ]
  x => [ 4 ]

The many non-schema-driven XML processors, like XML::Simple, would not return an ARRAY in the latter case: they can only handle element repetition if they see it happen in the message. That makes using the data a little complex:

  # when you use XML::Simple
  my $r = $data->{...}{x} || [];
  my @x = ref $r eq 'ARRAY' ? @$r : $r;

  # when you use XML::Compile
  my @x = @{$data->{...}{x} || []};

Why not simply XML::(LibXML::)Simple?

A side-effect of schema-based processing is value sanitation. Often, schema types are quite different from Perl types, but you only notice when you study them in detail. The schema type "integer", for instance, may be used with blanks around it, but also with white-space between the digits! And it must support at least 19 decimal digits (64 bit). Is your (customer's) Perl compiled with 64 bit integers?
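
Whether a given Perl has 64 bit integers is easy to check with the core Config module. A small sketch; the variable names are mine:

```perl
use strict;
use warnings;
use Config;

# Size of a native Perl integer (IV) in bytes: 4 or 8 on common builds.
my $iv_bytes = $Config{ivsize};

# Largest native unsigned integer; its length shows how many digits fit.
my $max_uv = ~0;

printf "native integers: %d bits, largest value has %d digits\n",
    $iv_bytes * 8, length $max_uv;

print "values of schema type integer may need Math::BigInt here\n"
    if $iv_bytes < 8;
```

On the command line, `perl -V:ivsize` reports the same setting.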

When you use XML::Simple to read your files, you must either spend a lot of time on validation and sanitation, or trust the other party to use the same subset of the value-space as you expect from the examples. You may think you know that all XML integers fit in Perl integers. You may predict that a boolean is always encoded as `0' or `1', not `true' and `false'. However, this is an unstable situation in the long run.

Key rewrites

By default, the keys presented in the Perl HASH are the plain names of the elements in XML; the namespace is ignored. Often, that's no problem, because different schemas are rarely entangled enough to produce name collisions. On the other hand, you may wish to see the prefixes for more clarity about the namespace, just for fun or for documentation.

  $schema->addKeyRewrite('PREFIXES');  # all!
  $schema->addKeyRewrite('PREFIXES(abc,xyz)');

Also, the style preferred by the XML designers uses lower-cased names with dashes between the words. Hash keys with dashes are inconvenient in Perl 5 (Perl 6 loves them). With the next line, dashes will be rewritten to underscores:

  $schema->addKeyRewrite('UNDERSCORES');

You may even pass HASHes with name-to-key mappings. You may use that to directly map elements into database fields. These key-mappings only influence the Perl keys, not the XML being read or written.

For example, with the PREFIXES and UNDERSCORES rewrite rules enabled, the element named "{http://something}my-name" would not result in the key "my-name", but in the bareword "xyz_my_name".
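
The effect of these two rules on a single name can be imitated in a few lines of plain Perl. This is only a sketch of the resulting mapping, not the module's implementation; `%prefix_for` and `rewrite_key` are names invented here, and the prefix table assumes that `xyz' was registered for "http://something".

```perl
use strict;
use warnings;

# Assumed prefix registration: namespace => prefix.
my %prefix_for = ('http://something' => 'xyz');

sub rewrite_key {
    my ($type) = @_;

    # Split "{namespace}local" into its two parts.
    my ($ns, $local) = ('', $type);
    if ($type =~ m/^\{(.*?)\}(.*)$/) {
        ($ns, $local) = ($1, $2);
    }

    # PREFIXES: put the registered prefix in front of the local name.
    my $key = exists $prefix_for{$ns} ? "$prefix_for{$ns}_$local" : $local;

    # UNDERSCORES: dashes become underscores.
    $key =~ tr/-/_/;
    return $key;
}

print rewrite_key('{http://something}my-name'), "\n";   # prints xyz_my_name
```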

Producing examples

These data-structures can be large and the schemas can be very complex, spread over multiple files with complex inheritance relations. To help you understand the data-structure, XML::Compile can produce annotated examples.

  print $schema->template(PERL => $type);
  print $schema->template(XML  => $type);

The output cannot directly be used (as it dumps all options in a choice) but will prove very useful in understanding the data-structure. When you have defined key-rewrite rules, they will be applied to the Perl examples as well.

Writing

Finally, you can also create writers for XML elements. You do not need to be aware of element order, namespaces, rounding errors, type conversions and so on: it DWIMs! The result will be validated. I will only show you the version with ::Cache.

  use XML::Compile::Cache;
  my $schema = XML::Compile::Cache->new($xsdfn
    , allow_undeclared => 1);
  $schema->prefix(p => $namespace);
  
  my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
  my $node = $schema->writer("p:$name")->($doc, $hash);
  $doc->setDocumentElement($node);
  print $doc->toString(1);

When your schema uses ugly constructs like "any", you need to call more than one writer to build the whole DOM tree. All of these sub-trees of the resulting document need to be constructed on the same document $doc. This is the purpose of these extra three lines around the actual writer.

Be warned that, in XML::LibXML, there are a few differences between calling toString() on a single node and on the document. Only the latter case promises to get the utf8 encoding right.

Implementing a protocol

Supporting a protocol which is based on XML is actually quite easy, as long as there is a schema available. We have seen readers and writers in the previous examples: just a dozen lines. We need to wrap that up into a package and provide some abstraction. You can find a few examples on CPAN, for different uses of XML, like handling files and SOAP. Have a look at the ones which suit you best. The base modules carry some examples as well.

First, where do you install the xsd and wsdl files? Traditionally, on UNIX systems, constant files are kept in /usr/share/ or /usr/local/share. This is rather inconvenient in general: when your program starts, where are those files? And it is not platform independent either. However, did you know that everything in the lib directory of a distribution gets installed? For instance, in Geo::KML, I use this:

 lib/Geo/KML.pm
 lib/Geo/KML/xsd/kml-2.1/kml21.xsd
 lib/Geo/KML/xsd/kml-2.2.0/kml22gx.xsd
 lib/Geo/KML/xsd/kml-2.2.0/ogckml22.xsd

Schemas for different versions of the standard are provided, relative to the main module. That module collects these files along these lines:

  package Geo::KML;
  sub new(%)
  {   my ($class, %args) = @_;
      my $version = $args{version} || '2.2.0';
      (my $dir    = __FILE__) =~ s!\.pm$!/xsd/kml-$version!;
      my @xsd     = glob "$dir/*.xsd";
      my $schema  = XML::Compile::Cache->new(\@xsd);
      bless {schema => $schema}, $class;
  }

  sub writeKML($$$)
  {   my ($self, $data, $filename) = @_;
      my $doc    = XML::LibXML::Document->new('1.0', 'UTF-8');
      my $xml    = $self->{schema}->writer('kml')->($doc, $data);
      ...
  }

It is easy to calculate the location of the constant files based on the location of a package via __FILE__, the constant holding the path from which the current source file was loaded. Be warned that `glob' does not correctly handle paths with blanks in them.
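
One way around the problem with blanks is `bsd_glob' from the core module File::Glob, which does not split its pattern on whitespace. The sketch below builds a throw-away directory with a blank in its name to show it; the directory and file names are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Throw-away directory whose name contains a blank.
my $base = tempdir(CLEANUP => 1);
my $dir  = "$base/my schemas";
mkdir $dir or die "cannot create $dir: $!\n";

# Two empty files which stand in for real schemas.
for my $name (qw(a.xsd b.xsd)) {
    open my $fh, '>', "$dir/$name" or die "cannot write $name: $!\n";
    close $fh;
}

# bsd_glob treats the whole string as one pattern, blank included.
my @found = bsd_glob("$dir/*.xsd");
@found    = sort @found;

print scalar(@found), " schema files found\n";
```

The plain `glob' would split the same pattern on the blank into two separate patterns.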

Of course, you will need to add `declare' and `prefix' declarations to the $schema object. These are often also version related, and the number of versions may grow, so usually I create a HASH with these settings.

The set-up above provides a nice abstraction, where the schema is a little out of reach for the end-users. On the one hand that is clean, on the other hand it is inconvenient for debugging and for template generation. The other solution would be to use inheritance:

  package Geo::KML;
  use base 'XML::Compile::Cache';
  sub init($)
  {   my ($self, $args) = @_;
      $self->SUPER::init($args);   # let the base class set itself up first
      my $version = $args->{version} || '2.2.0';
      (my $dir    = __FILE__) =~ s!\.pm$!/xsd/kml-$version!;
      $self->importDefinitions( [glob "$dir/*.xsd"] );
      $self;
  }

  sub writeKML($$$)
  {   my ($self, $data, $filename) = @_;
      my $doc    = XML::LibXML::Document->new('1.0', 'UTF-8');
      my $xml    = $self->writer('kml')->($doc, $data);
      ...
  }

You can load additional schemas with `importDefinitions()' at any time: as many as you want, and different versions of the schema if you like. It can also be used to overrule erroneous schema components: the last definition loaded will overrule any previous definition with the same name.

Now, a little bit harder is the reader: when a file is provided to be read, we do not know which top-level element to expect. Nor can we initialize with an explicit version until we know the root element.

   my $data = Geo::KML->readKML($source);

   package Geo::KML;
   sub readKML($)
   {   my ($class, $source) = @_;
       my ($root, %details) = XML::Compile->dataToXML($source);
       my $rootns  = $root->namespaceURI;
       my $version = $class->namespace2version($rootns);
       my $self    = $class->new(version => $version);
       $self->reader('kml')->($root);
   }

Normally, the reader will automatically parse filenames and strings into XML, using `dataToXML()'. In this case, we have to parse the source ourselves, to figure out which namespace is used in the root element of the document. Some magic is needed to translate the namespace into the version of the protocol used.
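
That magic can be as simple as a reverse lookup in the HASH with per-version settings. A minimal sketch, in a hypothetical package My::KML so as not to suggest that this is the real Geo::KML code; the namespace URIs in the table are assumptions as well.

```perl
use strict;
use warnings;

{   package My::KML;    # hypothetical stand-in for the real module

    # Per-version settings; the namespace URIs are assumptions.
    my %info = (
        '2.1'   => { namespace => 'http://earth.google.com/kml/2.1' },
        '2.2.0' => { namespace => 'http://www.opengis.net/kml/2.2'  },
    );

    # Reverse index, built once: namespace => version.
    my %version_of;
    for my $version (keys %info) {
        $version_of{ $info{$version}{namespace} } = $version;
    }

    # Class method: translate a root namespace into a protocol version.
    sub namespace2version {
        my ($class, $ns) = @_;
        defined $ns or die "root element has no namespace\n";
        my $version = $version_of{$ns}
            or die "unsupported namespace: $ns\n";
        return $version;
    }
}

# prints 2.2.0
print My::KML->namespace2version('http://www.opengis.net/kml/2.2'), "\n";
```

Keeping the table keyed on version means that prefix and `declare' settings can live in the same HASH.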

At this point, the lowest level of support has been reached. Users are expected to understand and produce the complex data structure which represents their information in XML. Will we only support such a low level? It would be much nicer to add an abstraction layer on top of this module, to hide not only the message structure but also the version differences from the user's view. Sometimes that's easy, often it is a huge task. For KML, it's too much work.

Conclusion

XML::Compile is the base of a growing number of modules on CPAN: XML::Compile::SOAP, ::SOAP::Daemon, Geo::KML, Geo::GML, and many more. If you like this story, I will explain some of these in the next edition of Perl-Foo.