Computer grammars, Unicode in Eclipse, and problems with Java’s XML parsers…
Death to proprietary Grammars!
In my first attempts at writing Korean applications, before I had any font rendering up and running, and with no special XML editor to hand, I had a simple but nasty problem: how do you store Korean characters persistently? I wanted to have to type out the vocabulary once and once only, and re-load it in multiple applications.
I knew lots of theory about how Unicode “might” work, but also knew that a lot of it was poorly implemented in practice, so I tried to avoid depending on everything working “as it is supposed to”. Instead, I went with a simple hack: I took the “standard” transliterations used in English/Korean today – a,i,o,eo,u,ya,yo,yeo,yu etc – and used them as an intermediate language that I could:
- write out to an ordinary ASCII file
- read in from an ASCII file
- guaranteed readable *correctly* in every IDE, tool, editor etc (everything speaks ASCII…)
- 100% correctly translate from this into Korean for display
- 100% correctly translate into this from typed Korean characters to check that the player typed the “correct” translation in-game
But…that’s pretty nasty to write a grammar for, and yacc and lex and friends were still horrendously under-documented last time I checked (NB: years ago at University, back when the smorgasbord of Compiler Theory was still in my head, these apps were merely tricky to use, but I could manage. In the years since, as the knowledge faded from the front of my mind, and I’ve had reason to re-use those tools, I’ve found them a complete nightmare. The last few times I needed them, I actually didn’t use them at all and knowingly wrote *bad* source code that I knew to be buggy, fragile, and I even knew was taking me substantially longer to write than it would have taken just to use the tools. Why? Because they are so obtuse and weakly documented that I gave up all hope of working out / remembering how to use them. I felt pretty guilty. I also felt very frustrated. I hope things have improved by now, but I’m trying to write these Korean games as rapidly as possible, and didn’t want to risk the loss of time finding out)
So, I took the short-cut, and converted the standard transliterations into a context-free grammar (because context-free grammars are exceedingly easy to write parsers for by hand). This means I have a bunch of vocab files with things like:
“ngaan1 n1yeong haa s2ay ngyo”
where the real transliteration would have been:
“an nyeong ha say yo”
Similar enough to be workable, but confusing and, frankly, starting to get irritating – I know I’ll have to fix it sooner or later (assuming I ever want anyone else to write their own vocab files), and it’s such an ugly hack, and definitely not the best way to do things long-term.
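To give a flavour of why the standard transliteration is awkward to parse by hand, here is a minimal sketch (not the actual game code) of a longest-match tokenizer for just the vowel spellings listed earlier. The class and method names are made up for illustration, and the output uses Hangul Compatibility Jamo written as \u escapes so that the source file itself stays pure ASCII.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class VowelTokenizer {
    // Hypothetical table: romanized vowel -> Hangul Compatibility Jamo.
    // Insertion order matters: longer spellings come first so that
    // "yeo" is not mis-read as "y" + "eo" or "ye" + "o".
    private static final Map<String, String> VOWELS = new LinkedHashMap<>();
    static {
        VOWELS.put("yeo", "\u3155");
        VOWELS.put("eo", "\u3153");
        VOWELS.put("ya", "\u3151");
        VOWELS.put("yo", "\u315B");
        VOWELS.put("yu", "\u3160");
        VOWELS.put("a", "\u314F");
        VOWELS.put("i", "\u3163");
        VOWELS.put("o", "\u3157");
        VOWELS.put("u", "\u315C");
    }

    /** Converts a run of romanized vowels into jamo, trying the longest spelling first. */
    static String toJamo(String romanized) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        outer:
        while (pos < romanized.length()) {
            for (Map.Entry<String, String> entry : VOWELS.entrySet()) {
                if (romanized.startsWith(entry.getKey(), pos)) {
                    out.append(entry.getValue());
                    pos += entry.getKey().length();
                    continue outer;
                }
            }
            throw new IllegalArgumentException(
                    "No vowel spelling matches at index " + pos + " of \"" + romanized + "\"");
        }
        return out.toString();
    }
}
```

Even this toy version needs the ordering trick to cope with overlapping spellings; once consonants join in, the ambiguities multiply, which is what pushed me towards giving every letter its own unique name.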
Although, incidentally, there was a minor beneficial side-effect to this proprietary grammar: by accident, I ended up learning my consonants in Korean better – I learnt which can conjoin together, which can be doubled, which are single-only, etc. This is mostly because I had to explicitly make each different letter unique in my grammar, and then whenever writing out vocab – or debugging my games – I had to remember what I’d named each letter, and also which letters were transliterated as one Latin letter (e.g. h, which is only ever alone), and which as two Latin letters (e.g. s, which can be single or doubled, leading to two letters in my grammar: s1 and s2).
But the time has come to “fix” this: my font rendering is, dare I say it, pretty good now, with my applications happily switching back and forth between conjoining jamos and single-ideograph rendering pipelines on the fly, and with auto-detection of fonts (they find the prettiest font you have installed when they startup, and adapt to use that one – or pop up an error message to tell you to go install some Korean fonts).
And now the fun starts. There’s only one correct way to do this: store Unicode characters directly in my vocab files (this is practically the raison d’etre for the existence of Unicode; it excels at doing this well)…
Unicode, unicode everywhere, but ne’er a char to read
…and this is where I had some good news, and some bad news.
Good news: Eclipse’s ultra-simple XML editor not only fully supports typing Unicode Korean characters directly into its editing window, and automatically renders them in a Korean font, but it even saves the entire XML file correctly with whatever encoding you specify in the first line of the file (it notices when I change that line and silently changes the save format when I next save! Rocking!).
(NB: I thought this was one that I’d installed manually, but when I tried to find out what the name was and who wrote it so I could give them a link here, I could only find references in Eclipse’s plugin list to a generic, Eclipse-provided, XML plugin. Anyone want to confirm/deny?)
I had assumed I’d need to go and buy a commercial XML editor to get full Unicode support (Unicode support being something that most apps do “partially” these days – cf. my first post on writing Korean apps, and the detailed look at how Microsoft’s Unicode fonts get all the conjoined jamos wrong, so that they won’t even render Korean :( ). But instead, I just type direct into the IDE.
Although…I can’t type Korean characters in Windows yet (sadly, Windows comes with this functionality uninstalled by default, and it has to be installed off the original CD – but my Windows CD is several hundred miles away :( ). However, using Character Map, which is part of the base Windows install, I can pick the characters out one at a time, then copy/paste them into the text editor, and mostly get by for now.
Bad news: Java’s standard libraries corrupted the Unicode characters when reading them from disk; in some cases, they even outright refused to load the files at all (SAXParseExceptions being thrown for invalid file data) … which was what gave the game away in the end. This took me a long time to figure out … it wasn’t until I started experimenting with different UTF encodings that I saw the parser crashes and worked it out. The root cause was the last thing I suspected: depending upon a very minor change in how you invoke it, the XML parser from the standard library completely ignores the encoding of the file. Argh.
Various exceptions can be thrown as symptoms of this problem (which one you get depends on the encoding you’ve used):
Exception in thread "main" org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
Exception in thread "main" org.xml.sax.SAXParseException: Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
Or, if you use UTF-8 encoding, you just get gibberish printing to the screen, such as this:
For reference, once I fixed the problem, the same code rendered like this (as I intended):
This is bad. Really bad. You get no warnings about this. Also, as far as I can remember, the way I’ve usually seen this library invoked in all the hundreds of times I’ve seen it in source-code on the web and in books and tutorials … is the way I was using it – which simply isn’t correct. It works in some situations, but it assumes too much. So long as you only ever read and write XML with US-ASCII, and only ever live in America or England, you’ll probably never even notice. The moment someone starts giving your app a 100% valid perfectly-formed XML file which just happens to have an encoding different from the one that your JVM uses (based on your OS settings and language settings) … BOOM!
This is documented in the library docs – as soon as I realised it was the parser doing something it shouldn’t be doing, it took literally only a couple of minutes to find the ref in the docs, see the problem, and fix it. Here’s the documentation:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So, where I had this line of code:
xpath.evaluate( "vocab/*", new InputSource( new FileReader( vocabFile ) ), XPathConstants.NODESET );
I had to change it to this:
xpath.evaluate( "vocab/*", new InputSource( new FileInputStream( vocabFile ) ), XPathConstants.NODESET );
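The first line asks the Java standard library to do the byte-to-character conversion itself: FileReader decodes the file using the JVM’s default charset (taken from your OS and language settings) and never looks at the encoding declared in the first line of the XML file. Because the parser is then handed a character stream, it disregards that declaration too. The second line hands the parser the raw bytes, so the parser can read the encoding declaration (or auto-detect the encoding) and do the conversion itself. Java programmers normally reach for a Reader whenever they are dealing with text, and in general that is the “right” approach, because it gets the character conversion out of the way early. If you are worried about the cost of unbuffered reads from the hard disk, you can wrap the stream in a BufferedInputStream.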
But in this case … raw bytes are the only way to go.
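Here is a minimal, self-contained sketch of the difference, assuming nothing beyond the JDK’s own XML classes. The class name, the one-word vocab document and the deliberately wrong ISO-8859-1 reader are all invented for the demonstration; the wrong reader stands in for a FileReader running on a machine whose default charset does not match the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class VocabEncodingDemo {
    // U+BB3C is the Hangul syllable for "water"; written as an escape so
    // this source file does not depend on its own encoding.
    static final String WORD = "\uBB3C";
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><vocab><word>" + WORD + "</word></vocab>";

    /** Parses the source and returns the text of the first vocab/word element. */
    static String firstWord(InputSource source) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (String) xpath.evaluate("vocab/word", source, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not parse the vocab document", e);
        }
    }

    /** Byte stream: the parser sees the raw bytes and honours the declared encoding. */
    static String viaByteStream() {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        return firstWord(new InputSource(new ByteArrayInputStream(bytes)));
    }

    /** Character stream decoded with the wrong charset: the declaration is ignored. */
    static String viaWrongReader() {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        return firstWord(new InputSource(reader));
    }

    public static void main(String[] args) {
        System.out.println("byte stream gives the original word: "
                + WORD.equals(viaByteStream()));
        System.out.println("mis-decoded reader gives the original word: "
                + WORD.equals(viaWrongReader()));
    }
}
```

If a Reader really is unavoidable, constructing it with an explicit charset that matches the file (for example new InputStreamReader(stream, StandardCharsets.UTF_8)) should also work, at the cost of hard-coding the encoding instead of letting the parser read it from the file.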
IMHO, this is a flaw in the XML parser – I would have expected it to at least notice when an incompatible encoding was being specified and signal an alert, perhaps output something to stderr? Or even throw an exception – if you get an incorrect encoding, that’s a pretty big sign that incorrect data is being fed in, and “do nothing, but crash a few thousand instructions later” doesn’t seem to me to be the most appropriate response…
Conjoining revisited – more valuable than first thought?
I mentioned in my first post the concept of Conjoining Jamos – a system of rendering Korean ideographs/glyphs by assembling them from the individual letters that make them up (which works because Korean is an alphabet-based language, but the letters have a more interesting positioning scheme than Latin’s right,right,right-a-bit-more, etc). I pointed out that Microsoft’s Unicode fonts that provide Korean characters are fundamentally broken and can’t render conjoined characters at all (the tables of metadata inside the font itself have the wrong information). A commenter pointed out that in practice conjoined jamos may rarely get used anyway – instead, nearly all applications simply calculate the pre-made ideograph (because Unicode can have hundreds of thousands of “letters”, they’ve actually worked out every possible combination of letters into ideographs, and put ALL of them inside Unicode. Cool).
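For the curious, “calculating the pre-made ideograph” is plain arithmetic. The sketch below follows the Hangul syllable composition formula from the Unicode standard; the class and method names are my own for illustration, and the indices are the standard orderings of the 19 leading consonants, 21 vowels and 28 trailing positions (index 0 meaning “no trailing consonant”).

```java
final class HangulCompose {
    private static final int SYLLABLE_BASE = 0xAC00; // first precomposed syllable
    private static final int LEAD_COUNT = 19;        // leading consonants
    private static final int VOWEL_COUNT = 21;       // vowels
    private static final int TAIL_COUNT = 28;        // trailing consonants, plus "none"

    /**
     * Returns the precomposed syllable for the given jamo indices.
     * lead: 0..18, vowel: 0..20, tail: 0..27 (0 = no trailing consonant).
     */
    static char compose(int lead, int vowel, int tail) {
        if (lead < 0 || lead >= LEAD_COUNT
                || vowel < 0 || vowel >= VOWEL_COUNT
                || tail < 0 || tail >= TAIL_COUNT) {
            throw new IllegalArgumentException("jamo index out of range");
        }
        return (char) (SYLLABLE_BASE + (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail);
    }

    public static void main(String[] args) {
        // h (index 18) + a (index 0) + n (tail index 4)
        System.out.printf("U+%04X%n", (int) compose(18, 0, 4));
    }
}
```

That gives 19 × 21 × 28 = 11,172 precomposed syllables, all sitting in one contiguous block starting at U+AC00.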
And, for completeness, I added an implementation using these pre-baked characters (which actually render correctly even with Microsoft fonts!). Lovely. Ah. But, when adding some new features to my game, I realised that maybe I *have* to use conjoining jamos. How the heck do you highlight individual letters within an ideograph if you’re NOT using conjoining jamos?
I would guess that in every situation where you can highlight text, you probably need to use conjoining jamos – which means a large percentage of computer applications. I’m not entirely sure (I can think of ways you could hack around it) but conjoining seems the only sensible way to do it.
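Getting from a pre-baked character back to its individual letters is at least easy: the JDK’s java.text.Normalizer will split a precomposed syllable into conjoining jamos (canonical decomposition, NFD) and join them back up again (NFC). This is only a sketch of the text-processing half, with class and method names of my own invention; rendering each jamo in its own colour is a separate problem that depends on the font.

```java
import java.text.Normalizer;

final class JamoSplit {
    /** Splits precomposed Hangul syllables into conjoining jamos (NFD). */
    static String toConjoiningJamo(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** Recombines conjoining jamos into precomposed syllables (NFC). */
    static String toPrecomposed(String jamo) {
        return Normalizer.normalize(jamo, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+D55C is a single precomposed syllable made of three letters.
        String jamo = toConjoiningJamo("\uD55C");
        for (int i = 0; i < jamo.length(); i++) {
            // Each char is now one letter that could be coloured separately.
            System.out.printf("U+%04X%n", (int) jamo.charAt(i));
        }
    }
}
```

One caveat: NFD decomposes everything it can, not just Korean, so accented Latin letters in the same string would be split into base letter plus combining mark as well.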
All *I* want to do is change the colours of individual letter-groups within the ideographs. Because I’ve written a new minigame: words and phrases fly towards you, and you have to type them in Korean to nullify them before they collide with your avatar.
I haven’t worked out how I’m going to do this yet – I don’t have any fonts that I can really use. I need a font that is:
- Pre-installed on every Windows PC, or easy to install
  - Only Microsoft can pre-install fonts on Windows, and they don’t pre-install ANY Unicode fonts. If you have MS Office, that automatically installs some Unicode fonts, so … “partial”
  - For everyone else, I recommend UnBatang, which I’ve been using. It’s free, but sadly until/unless I find a license I can’t include it in the game – you have to download it manually
- Includes all the letters, and renders them correctly
  - You may think this is guaranteed – but no, Microsoft’s fonts seem to be broken for the conjoining jamos. Non-conjoined characters render perfectly, but you’re out of luck with conjoining.
  - UnBatang renders ALL the characters correctly – but the conjoining jamos it has are extremely ugly, so low-resolution that they’re almost impossible to read a lot of the time