::::: : the wood : davidrobins.com

Perl Office suite generates first document!

News · Wednesday January 31, 2007 @ 20:09 EST

Driving home today, around 1910 on Avondale Road: a silver Mazda sedan, plate WA 092 VMW, was weaving drunkenly back and forth onto the shoulder, and was generally lethargic when it was time to go. Not quite sure what drug he was on.

The Office series of modules that I'm developing generated its first valid Word document today; AB, a Word file format developer, helped me work through a minor hiccup (which also turned up a minor bug), and then everything worked as planned. The new .docx format (and .xlsx, etc.) is actually a zip file, although that's just the physical implementation of the package; you can easily look inside. LW, a recent ex-Word developer who moved into the web applications group, was also experimenting with generating new Word documents on the fly, but using C# he already had the System.IO.Packaging layer, whereas I had to write that from scratch. Well, not entirely; I drew on a host of existing CPAN modules: Archive::Zip for the physical package; XML::SAX and a host of filters for parsing and generating XML (I sent in a patch for one of them, XML::Filter::NSNormalise (sic), and the author got back to me the same day); DateTime (with DateTime::Format::W3CDTF for xsi dates); Params::Validate; etc.
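To make the "you can easily look inside" point concrete, here is a minimal sketch using Archive::Zip. It is not code from the Office modules: it writes a toy zip whose member names merely resemble the parts of a package (the bodies are placeholders, so the result is not a valid Word document), then reads it back and lists what is inside. The file name toy.docx and all variable names are made up for illustration.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/toy.docx";    # hypothetical file name

# Build a toy package. The member names follow the usual package layout,
# but the bodies are placeholders, NOT real package XML.
my $out = Archive::Zip->new;
$out->addString('<Types/>',         '[Content_Types].xml');
$out->addString('<Relationships/>', '_rels/.rels');
$out->addString('<document/>',      'word/document.xml');
$out->writeToFileNamed($path) == AZ_OK or die "cannot write $path";

# Open it again as a plain zip file and list the parts.
my $in = Archive::Zip->new;
$in->read($path) == AZ_OK or die "cannot read $path";
my @names = $in->memberNames;
print "$_\n" for @names;

# Pull one part out as a string.
my $body = $in->contents('word/document.xml');
print "word/document.xml holds: $body\n";
```

Pointing the read half at a real .docx should list its actual parts in the same way.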

There are probably about 20 modules in the Office namespace. Off the top of my head: Pkg, Pkg::Part, Pkg::Rels, Pkg::Rel, Pkg::ContentTypes, Pkg::CoreProps, Document, Document::AppProps, Word::Document (the abstract document), Word::Document::Document (represents the word/document.xml part), Word::Document::Para, Word::Document::Run, SAX::Writer, and SAX::Parser (parser helpers; I've been using ExpatXS as my SAX parser of choice and Writer for output). And I've barely started: I can only write basic paragraphs with bold and italics. But this is a big step; adding other properties and part handlers will be incremental. Already it's a usable suite of modules, although I hope to get it to a somewhat more mature form before I make a release (I've registered the Office namespace in preparation). Sample working code, does what it says:
use strict;
use warnings;
use Office::Word::Document;

my $docx = Office::Word::Document->new;

$docx->add_para('First paragraph.');

# Runs can be chained onto a paragraph; the first one added here is bold.
my $p = $docx->add_para('This second paragraph has some ');
$p->add_run({ bold => 1 }, 'BOLD TEXT')->add_run(' in it.');

$docx->save('test.docx');
Doesn't do much yet, but watch this space, and watch for an alpha release coming soon. Note: This is not an officially supported Microsoft product; I'm doing it on my own using publicly available specifications. Right now I'm defaulting the Company property to Gippazoid Novelty; I'm debating leaving it in, and seeing how many documents show up with it set like that!
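For a rough idea of what the sample above has to end up producing in the word/document.xml part, here is a standalone sketch that builds the WordprocessingML by plain string concatenation. It is not how the Office modules work (they go through SAX), and the helper names esc, run_xml, para_xml, and document_xml are invented for this example. The element names and namespace are the ones from the published specification; everything else is an assumption.

```perl
use strict;
use warnings;

# WordprocessingML main namespace from the published specification.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: escape text for use in XML character data.
sub esc {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# One run: optional bold/italic properties, then the text.
# xml:space="preserve" keeps leading and trailing spaces in the run.
sub run_xml {
    my ($props, $text) = @_;
    my $rpr = '';
    $rpr .= '<w:b/>' if $props->{bold};
    $rpr .= '<w:i/>' if $props->{italic};
    $rpr = "<w:rPr>$rpr</w:rPr>" if length $rpr;
    return qq{<w:r>$rpr<w:t xml:space="preserve">} . esc($text) . '</w:t></w:r>';
}

# One paragraph: a list of [ \%props, $text ] runs.
sub para_xml {
    my @runs = @_;
    return '<w:p>' . join('', map { run_xml(@$_) } @runs) . '</w:p>';
}

# The whole part: a list of paragraphs, each an array ref of runs.
sub document_xml {
    my @paras = @_;
    return qq{<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n}
         . qq{<w:document xmlns:w="$W_NS"><w:body>}
         . join('', map { para_xml(@$_) } @paras)
         . '</w:body></w:document>';
}

my $xml = document_xml(
    [ [ {}, 'First paragraph.' ] ],
    [ [ {}, 'This second paragraph has some ' ],
      [ { bold => 1 }, 'BOLD TEXT' ],
      [ {}, ' in it.' ] ],
);
print "$xml\n";
```

This only covers the document part; a real file also needs the content types and relationship parts wrapped up in the zip package.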

The pH (think acidity) markup language, an invention of mine that is basically just a more concise form of XML and that I use to write this log, among other things, has undergone its first change in a long time. Like Perl, pH is defined by its single implementation, a Perl XS C++ module called pH::Parser (no, it's not on CPAN, since it's not general enough, although that never stopped anyone else). pH is also the name of my local wiki (the one here is a fairly old quick hack), which is about to get a big upgrade (for the Word development internal wiki, but I'll port it back here).

To summarize, the pH markup language is XML without the close tags or other unnecessary baggage, e.g. instead of <element attribute="value">some text</element>, it's <element attribute=value some text>. The equivalent of CDATA is << ... >>. Computer scientists have probably already noticed a few seeming flaws in it. What if your text begins with name=value? Well, you can escape the = as \= (and <, >, and & similarly), but that's annoying to check for in generated text (although it could just be auto-escaped along with the other markup characters). The new addition is to allow = as a lone pseudo-attribute which enforces the end of attributes for that element, e.g. <element a=b c=d = a=1 whenever b=2>. However, I also decided this was all somewhat silly (close tags are only a few bytes, the expansion takes time, and there were other logistical problems getting pH expansion into the right place in the processing chain), and am primarily sticking with XHTML for the new wiki templates.
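As an illustration of the expansion described above, here is a toy pH-to-XML converter. It is emphatically not pH::Parser: it only handles the constructs mentioned in this entry, and it guesses at the details (attribute values are assumed to be single unquoted tokens, << ... >> is emitted as escaped text rather than a real CDATA section, and the function names ph_to_xml, ph_element, and xml_escape are made up).

```perl
use strict;
use warnings;

# Hypothetical helper: escape text for XML output.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Parse one element; the opening '<' has already been consumed.
# $src is a reference to the input; pos($$src) tracks the position.
sub ph_element {
    my ($src) = @_;
    $$src =~ /\G([A-Za-z_][\w:.-]*)/gc
        or die "expected element name at offset " . (pos($$src) // 0);
    my $name = $1;
    my $xml  = "<$name";

    # Attributes: whitespace-separated name=value tokens (assumed unquoted).
    while ($$src =~ /\G\s+([A-Za-z_][\w:.-]*)=([^\s<>=]+)/gc) {
        my ($attr, $value) = ($1, $2);
        $xml .= sprintf ' %s="%s"', $attr, xml_escape($value);
    }
    $$src =~ /\G\s+=(?=[\s>])/gc;    # lone '=' ends the attribute list
    $$src =~ /\G\s+/gc;              # skip whitespace before the content
    $xml .= '>';

    # Content: text, escapes, << ... >> blocks, nested elements, until '>'.
    while (1) {
        if    ($$src =~ /\G>/gc)             { return "$xml</$name>" }
        elsif ($$src =~ /\G<<(.*?)>>/gcs)    { $xml .= xml_escape($1) }
        elsif ($$src =~ /\G</gc)             { $xml .= ph_element($src) }
        elsif ($$src =~ /\G\\([=<>&\\])/gc)  { $xml .= xml_escape($1) }
        elsif ($$src =~ /\G([^<>\\]+)/gc)    { $xml .= xml_escape($1) }
        else  { die "unterminated element <$name>" }
    }
}

# Convert a string holding a single pH element into XML.
sub ph_to_xml {
    my ($text) = @_;
    pos($text) = 0;
    $text =~ /\G\s*</gc or die "input must start with an element";
    return ph_element(\$text);
}

print ph_to_xml('<element attribute=value some text>'), "\n";
# prints <element attribute="value">some text</element>
print ph_to_xml('<element a=b c=d = a=1 whenever b=2>'), "\n";
# prints <element a="b" c="d">a=1 whenever b=2</element>
```

The second call shows the lone = at work: without it, a=1 would have been swallowed as a third attribute instead of starting the text.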

Back in reality-land, we're still waiting on the insurance (they're working on their estimate with the contractor, who thinks they're lowballing it a bit, and the contractor is getting more estimates to prove it). There may be some more (mainly cosmetic) damage downstairs: some ripples in the ceiling and some nail pops. And I have a dental appointment (bi-annual cleaning) tomorrow, fun.

Honey gave me a Leatherman tool and knife set for my birthday, and a nice card; we went out to eat on the day (yesterday), at a local Teriyaki-Sushi place (former for her, latter for me). One of the new GPM developers at Hilton wished me happy birthday—apparently I'd left my birthday in the code somewhere (in a test, I think, although I don't remember noting that it was my birthday; Honey's was there too). The spirit was nice, the English was a little broken. I don't remember Bob mentioning him, but apparently he's working out alright, which is nice after the hordes of idiots that have been paraded past them by the headhunters.

Next: designing the Office modules, and the new pH wiki (yes, I overloaded the name a little; it's short and succinct and fairly meaningless and I like it), and a little on the new Word COM dispatch. First, why I built my own wiki: most of those out there are file-backed (ick, use a database; I'm using PostgreSQL but the layer is fairly flexible), or in PHP (a filthy language), or don't do revisions. It's definitely time for a good technical article.