Writing games in Korean – part 1

First, check whether I can actually render Korean characters. I know that Unicode supports nearly all of the glyphs (or, more correctly it seems, “ideographs”) needed for Chinese, Japanese, Korean (typically abbreviated as “CJK”) – but I also know that MS Windows fonts are infamous for missing some/most/all of the unicode characters (there are tens of thousands of them, so this is not all that surprising – except that Microsoft has been shipping OS’s localized to those countries for many years now, so I’d hoped maybe they would include at least ONE “CJK + Latin” font on all machines by now, no? No; they don’t).

Quick test program using Java (NB: java one of the easier languages for this because it was invented late enough that Unicode had settled down, and so it has extremely good Unicode support built-in to the core libraries, unlike C++ et al. By default, Java uses unicode almost everywhere, so I don’t need to debug unicode support, yay!):

NB: …but then when I tried it I quickly discovered a problem. I quickly gave up on the obvious route, so might have missed a workaround, but… I could have done this using Java’s built-in GUI toolkit (Swing), which supports using any font as a text label. However, it seems there’s a design flaw in Swing that’s been around since 1997 (yes, folks, 11 years and counting): the only component that can have the font changed – JLabel – which is a “general-purpose” label for any GUI component (good idea!) … is incompatible with the only component capable of being a button – JButton – which is a “general-purpose” button. Crap. If I’m not missing something here, with one small piece of poor API design, Sun have made their *almost* international-friendly GUI system almost completely useless for multi-language GUIs. Just to be clear: it’s not specifically internationalization they’ve broken here: its a general issue – any forms of customized rendering cannot use the JButton / JLabel combo. Why? Why, why, why would you do such a stupid thing, after going to the effort of making these generic widgets? Oh, well. Custom rendering it is…

/**
 * Simple test program that renders Unicode characters from a particular font of
 * your choice, a couple of thousand at a time, automatically wrapping them based on
 * render width on screen.
 * 
 * You need to manually change the font name and the start/end co-ordinates of where
 * in the unicode constellation to render.
 */

import java.awt.*;
import javax.swing.*;

public class PaintSomeUnicodeChars extends JFrame
{
	void setup()
	{
		getContentPane().add( new charPane() );
	}
	
	public static void main( String[] args )
	{
		PaintSomeUnicodeChars window = new PaintSomeUnicodeChars();
		window.setDefaultCloseOperation( JFrame.EXIT_ON_CLOSE );
		window.setup();
		window.pack();
		window.setSize( 600, 500 );
		
		window.setVisible( true );
	}
}

class charPane extends JPanel
{
	public void paint( Graphics g )
	{
		int si = 44032;
		int ei = 46000;
		g.setColor( Color.red );
		g.setFont( new Font( "Verdana", 10, 40 ) );
		g.drawString( "Chars from " + si + " to " + ei, 30, 30 );
		
		g.setColor( Color.black );
		g.setFont( new Font( "Lucida Sans Unicode", 10, 20 ) );
		
		int accumulator = 0;
		int height = 20;
		int xwidth = getWidth();
		for( int i = si; i < ei; i++ )
		{
			char[] chars = Character.toChars( i );
			String drawString = String.copyValueOf( chars );
			int advance = g.getFontMetrics().charsWidth( chars, 0, chars.length );
			
			if( accumulator + advance > xwidth )
			{
				accumulator = 0;
				height += g.getFontMetrics().getHeight();
			}
			
			g.drawString( drawString, accumulator, height );
			
			accumulator += advance;
		}
	}
}

Why choose Lucida Sans Unicode?

Apart from the fact that it’s present on all modern Windows machines automatically, I cheated a bit and used Character Map (Start > Programs > Accessories > System Tools > Character Map) to quickly inspect each of the fonts already installed and find out which ones had lots of unicode in them. Character Map has a neat feature where it completely ignores missing characters, automatically skipping over them, so you can very quickly see if a font has lots of unicode, some, or very little.

Stepping through in the java app, I fairly quickly found lots of Unicode characters. success – I can correctly render arbitrary Unicode characters in java.

NB: important because it took a lot more lines of source than I expected; java’s built-in “char” datatype is incapable of being used to render Unicode, because it’s too small a range of values. That also means you can’t use ANY methods that take a char as argument (this is documented in the Java API docs for Character). I’d never really used the methods that take int’s as argument before…

Fonts…

Not really happy with the font, though, and it’s clearly missing a whole bunch of characters. Some googling for free CJK fonts found me that allegedly Arial Unicode MS would work – which comes free with Microsoft Office (which I have on one of my machines).

NB: the way font licensing works, it’s probably illegal to copy a font you have on one machine to another machine you own. Unless you can obtain the font from the original source (e.g. in this case by buying a second copy of MS Office).

Arial Unicode MS has lots of characters, but … they’re wrong. At least, one third of the ranges of Korean characters are completely useless, as far as I can tell, because whoever made this font (whoever Microsoft licensed it from?) didn’t read the spec carefully and stuck the wrong characters in from U+1100-U+11FF. More on this later…

Giving up on that font, I went looking for others. The first four or five I tried that didn’t look ugly were all from download sites in Japan or Korea and kept crashing on the download. I found a couple of places hosting UnBatang (“UnBatangOdal.ttf”) and claiming it was free, but the first I found where the download didn’t crash was on http://www.i18nl10n.com/fonts

UnBatang (the name you have to use in Windows / Java from inside an application in order to load it) seems to work very well. It has nicely painted ideographs for the main range of Hangul characters/syllables, and it has *correct* glyphs for the other two ranges (although they’re a bit ugly).

Unicode Hangul

Specifications for official standards tend to be big. Really big.

Specifications for anything to do with internationalization tend to be huge.

Add in localization and data-transfer between different cultures, and you can expect something gargantuan.

So, I was really really happy to find a downloadable copy of only the CJK Chapter of the Unicode 5.0 specification (it’s still a big document!). This contains a full explanation of what Hangul is in Unicode, where to find it (yay!), and why there are not one but three separate sets of Hangul glyphs.

Here’s the preface to the Chapter 12 PDF I downloaded, in case you want to find the spec / electronic version yourself:

Electronic Edition

This file is part of the electronic edition of The Unicode Standard, Version 5.0, provided for online
access, content searching, and accessibility. It may not be printed. Bookmarks linking to specific
chapters or sections of the whole Unicode Standard are available at

http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html

Purchasing the Book

For convenient access to the full text of the standard as a useful reference book, we recommend purchasing
the printed version. The book is available from the Unicode Consortium, the publisher, and
booksellers. Purchase of the standard in book format contributes to the ongoing work of the Unicode
Consortium. Details about the book publication and ordering information may be found at

http://www.unicode.org/book/aboutbook.html

Unicode 5, Hangul, and Arial Unicode MS

So, armed with the offical spec, I read up on Hangul. What did I find?

The Unicode Standard contains both the complete set of precomposed modern Hangul syllable
blocks and the set of conjoining Hangul jamo. This set of conjoining Hangul jamo can
be used to encode all modern and ancient syllable blocks.

(this is the glyphs at U+1100–U+11FF)

“conjoining Hangul jamo” means “these glyphs have been positioned inside their spaces in the font so that if you need to make one ideograph out of, say, four Korean letters, you just pick the top-left version of the first letter, the top-right version of the second, etc, and OVERLAY all the glyphs, and what comes out will autamatically be correctly spaced out etc”.

What did the author of Arial Unicode MS do?

Made all those glyphs take up the full available space, and centre them horizontally and vertically.

Why? Seriously, why? Because any application that tries to render those characters is going to render them on top of each other (according to the specification, this is the ONLY point of having those characters), and you won’t be able to read at all what the letters say.

If you want to render just individual letters from the Korean alphabet, there’s a different range of Unicode where you can find them all centred etc (which Arial Unicode MS also has … so it seems to be just copy/pasting internally).

I guess the font author just didn’t read the spec. Or I’m completely misunderstanding the spec. But the fact that UnBatang spaces the conjoining jamos out in such a way that this works as I expected it to suggests to me that it’s the Arial Unicode that’s broken…

The joy of Conjoining

I couldn’t get hold of the Unicode spec section on how to conjoin, because of some mimetype problems between their server and my PDA’s web browser. I figured I could probably work it out by trial and error fairly quickly.

I can’t remember how to spell my name in Korean yet – I know the letters, I’m just a bit flaky on what the placement of them is. So, a quick experiment, to see how easy it is to position characters using the Conjoining Jamo from UnBatang:

	... constants used later - the Unicode values for the letters of my name: ...
	
	int a = 0x1161; // jungseong vowel
	int d = 0x1103; // choseong consonant
	int d2 = 0x11ae; // jongseong consonant
	int m = 0x1106; // choseong consonant
	int m2 = 0x11b7; // jongseong consonant
	int blank = 0x110b; // silent char you place at front if first letter is a vowel
	
	... the body of the paint method: ...
	
	g.setColor( Color.black );
	g.setFont( new Font( "UnBatang", 10, 30 ) );
		
	int w = 0;
	int lead = 0;
	w += renderIdeograph( g, blank, lead, 40  );
	w += renderIdeograph( g, a, lead+w, 40  );
	w += renderIdeograph( g, d2, lead+w, 40  );

	lead+=30; w = 0;
	w += renderIdeograph( g, blank, lead+w, 40  );
	w += renderIdeograph( g, a, lead+w, 40  );
	w += renderIdeograph( g, m2, lead+w, 40  );
		
	lead=0; w = 0;
	w += renderIdeograph( g, blank, lead+w, 80  );
	w += renderIdeograph( g, a, lead+w, 80  );
		
	lead+=30; w = 0;
	w += renderIdeograph( g, d, lead+w, 80  );
	w += renderIdeograph( g, a, lead+w, 80  );
	w += renderIdeograph( g, m2, lead+w, 80  );
	
	... and then this method to do the paint of each part of an ideograph: ...
	
	/** Doesn't really render an ideograph, renders a single glyph from the Font */
	public int renderIdeograph( Graphics g, int i, int x, int y )
	{
		char[] chars = Character.toChars( i );
		String drawString = String.copyValueOf( chars );
		int advance = g.getFontMetrics().charsWidth( chars, 0, chars.length );
		
		g.drawString( drawString, x, y );
		
		return advance;
	}

Which renders exactly like this:

hangul-adam-unbatang.PNG

Which makes me want to point out the nice thing about properly specified fonts: in the source code, I simply did “the natural thing”, as if I were outputting characters in an arbitrary conjoined language:

  1. Render the first part of the first ideograph at (x,y)
  2. Ask the font to tell you how many pixels wide (w) it just rendered that part
  3. Render the next part of the first ideograph at (x+w,y)
  4. …repeat until first ideograph is complete, increasing w more and more each time…
  5. Choose a value for how much you want the ideographs separated from start to start (ideograph_width)
  6. Render the first part of the second ideograph at (x + ideograph_width, y)
  7. …repeat as for first ideograph

And it worked. First time. I didn’t expect it to – I expected to have to do something strange like manually “reset” the (w) value each time I went from the first line of the ideograph to the second (Korean orders letters top-left, top-right, bottom-left, bottom-right, (repeat) … as opposed to Latin which is just left, right, more right, even more right, etc).

For this to have worked, it means the font is deliberately rendering the lower letters (e.g. d2 and m2 in my constants) a long way to the left of the origin that you tell it to render them at. This would be very, very confusing if you just tried to render these letters individually (well, duh).

Of course, if you try to run that code using Microsoft’s Arial Unicode MS font, you get a complete mess instead, because that font is FUBAR, as mentioned before. You get this:

hangul-adam-arial-unicode.PNG

…which is completely incomprehensible.

8 thoughts on “Writing games in Korean – part 1

  1. elFarto

    The setFont method is specified in java.awt.Component, which is the parent class for every GUI component either AWT or Swing.

    This means JButton does have a setFont method and change be used to display any Unicode character (I’ve just tested this and it works).

    You could also use the UIDefaults class to change the fontd of the current Look & Feel of Swing, but that’s a bit trickier.

    Regards
    elFarto

  2. elFarto

    Also, regarding:

    “java’s built-in “char” datatype is incapable of being used to render Unicode, because it’s too small a range of values”

    This is incorrect at worse, and misleading at best. A single char in Java is able to represent the Basic Multilingual Plane in Unicode, which covers most of the worlds popular languages.

    Regards
    elFarto

  3. adam Post author

    Thanks for the tips.

    However, since I’m dealing with translation between languages, I need to override the setText functionality. The correct way of doing that in OOP terms (and best for me) is to make just ONE custom rendering component, which is what I was getting at – JLabel acts like this and works for almost everything I need EXCEPT for buttons?

  4. adam Post author

    Thanks also for the improvements to how I described char, although I think it’s only as bad as you imply if you take the staement out of context.

    A char simply, be definition, cannot hold all unicode code-points. There is a workaround to this provided as part of the standard libraries, which I used and drew attention to. If you write your own apps only using chars you will be deliberately writing code that is no more than “mostly” compatible with the standard, no? Doesn’t this kind of approach lead to the kinds of problems with the arial font that I notioced – it is mostly correct (has the correct glyphs), looks ok on a basic inspection, but isn’t actually correct (wrong layout info, such that that part of the font is mostly useless in practice)?

    It doesn’t seem to be an issue for my current Hangul rendering, but this particularly matters to me when the context is of rendering ideographic languages, which seem to expose a lot of places where software has incomplete i18n or unicode support. For the want of a couple of lines of source I can avoid creating undocumented incompatibilities with the standard I’m using.

  5. sigman

    As far as I know there is no korean software (perhaps with the exception of specific word processing tools) which actually renders ideographs by composing jamo. Most of the time text is stored “precomposed” and displayed with these composite glyphs. Perhaps for the obvious reason – composite glyphs might look different from their jamo counterparts (as it happens in handwritten script). Interestingly (and I suppose counterintuitively) such approach of dealing with the complete syllables greatly simplifies many things, like searching the text, text selections, strings manipulations and such.

    my 5 cents

  6. adam Post author

    I’ve got it working both ways around now :), wasn’t hard once I got hold of the Unicode spec chapter which gives the algorithm they use for assigning codepoints to jamo combinations.

  7. Kim Hyoun Woo

    Totally what *sigman* said is right. There are two way to render text: one is to use the way compisiting and the other is to use precomposed one as *sigman* mentioned. And MOST of software uses the second at the current time. Sorry for bloating with the duplicate comment but I want to point it out – this is what Korean engineer says. :-)

Leave a Reply

Your email address will not be published. Required fields are marked *