Even Better Java i18n Pluralisation using ICU4J

Last week I wrote about Java’s built-in ChoiceFormat class and the support it provides for pluralisation. It is a very useful class, but as pointed out by two commenters (btw… thanks for the feedback!) it doesn’t cater well for all languages – particularly those that have more complex rules. This led me to investigate further, as I was certain there would be something useful out there – after all, internationalisation is a very common requirement of a large number of applications. After a little digging, I found that the one library that stands out is ICU4J. So who uses it? Well… pretty much everyone!

So for those that have more complex internationalisation requirements, this is an excellent library to use! I generally find that the best way to find out how something works is to see an example, so I’ve used the pluralisation example provided in the comments of my previous post to demonstrate ICU4J. I chose this example for a few reasons: firstly, because someone took the time to ask a question and I want to answer it; secondly, because it is clearly not supported by the JDK ChoiceFormat class; and lastly, because I only know languages with simple pluralisation rules.

I wrote a very basic class that simply prints out a localised message looked up from a ResourceBundle – which is probably the most commonly used approach and therefore familiar to most readers.

import com.ibm.icu.text.MessageFormat;

import java.util.Locale;
import java.util.ResourceBundle;

public class IcuDemo {

    private static final int[] NUMBERS = new int[] {0, 1, 2, 5, 11, 22, 39};

    public static void main(String[] args) {
        printLocalisedMessages("plural", Locale.ENGLISH, new Locale("pl"));
    }

    private static void printLocalisedMessages(String key, Locale... locales) {
        for (Locale locale : locales) {
            System.out.println(locale.getDisplayLanguage() + ":");
            printLocalisedMessage(key, locale);
        }
    }

    private static void printLocalisedMessage(String key, Locale locale) {
        ResourceBundle bundle = ResourceBundle.getBundle("icu", locale);
        String pattern = bundle.getString(key);
        MessageFormat msgFormat = new MessageFormat(pattern, locale);

        for (int i : NUMBERS) {
            System.out.println(msgFormat.format(new Object[] {i}));
        }

        System.out.println();
    }
}

The code above should be familiar to everyone, as it shouldn’t be all that different from how you’re already doing i18n. However, note that I’ve imported com.ibm.icu.text.MessageFormat instead of the usual java.text.MessageFormat. The really interesting part comes in when we use ICU4J’s “plural” format type, which is shown in the following properties files:

icu.properties:

plural=Undefined

icu_en.properties:

plural={0} {0, plural, one{car}other{cars}}

icu_pl.properties:

plural={0} {0, plural, one{auto}few{auta}many{aut}other{aut}}

I’m sure you’ll immediately notice that I’m not specifying numbers in these patterns, as we did with ChoiceFormat. Instead, I’m simply referring to categories of numbers by predefined mnemonics. This really cool feature is available because a number of language pluralisation rules have already been defined by the Unicode CLDR (Common Locale Data Repository). In particular, we’re using the Language Plural Rules, which are provided in the ICU4J package. To explain how this works, let’s look at the English example and then work our way up to the Polish example.

English has two categories – singular/plural. These two categories are named as “one” and “other” – fairly straightforward. What this really means in terms of plural rule definition is:

one: n is 1
(by implication, every other number falls into the "other" category)

Polish is more complex than this and requires a number of rules to be defined:

one: n is 1
few: n mod 10 in 2..4 and n mod 100 not in 12..14
many: n is not 1 and n mod 10 in 0..1 or n mod 10 in 5..9 or n mod 100 in 12..14
(by implication, every other number falls into the "other" category)

Clearly the definition of rules makes our lives a lot easier. All we need to know is which category of numbers we want to provide a pluralisation for, and define the message against that name using the format “keyword{message}”.

Note: The CLDR points out that the names are just mnemonics and aren’t inteded to describe the exact contents of the category, so try not to focus too much on them. It’s merely providing categorisation by a recognisable name.

The above example only uses the predefined number categories, but we could easily mix this with explicit values if needed. In this case, the explicit values would be checked first for an exact match, and if none was found then the categories would be searched, and failing that the “other” category would be used. Here’s an example of how you can mix the two concepts together:

example={0, plural, =1{one}=5{five}other{#}}

If we formatted this with the numbers 1 to 5 in a loop, this would be formatted as follows:

one
2
3
4
five

Of course, there may be circumstances where the predefined rules don’t do what you want (although, we’re probably talking about exceptional circumstances now). In this case, you can simply define your own set of rules. This can be done using the PluralRules class or by customising the locale data that’s available to ICU4J.

I’ve only scratched the surface of what you can do with this library – and pluralisation is only one very small part of what it provides – but I hope this is useful and is able to help get you started using it.

Advertisements

Java i18n Pluralisation using ChoiceFormat

Betfair‘s site is hugely popular all around the world, and obviously needs to provide a fully localised experience for users across different locales. Yesterday I was looking at how best to provide internationalisation (i18n) support within our core platform and came across the really useful ChoiceFormat class. I had seen it before but hadn’t actually explored what it can do until now. It’s an extremely basic class in terms of the functionality it provides, but it is as powerful as it is simple.

I’m sure we all know about Java resource bundles for i18n and have probably used them quite a lot – so I won’t go into any detail on that. I’ll just assume it’s common ground. However, if you haven’t come across java.text.ChoiceFormat, you’re missing out! This is the guts of pluralisation within the i18n stable, and is definitely your friend when you realise that users don’t like messages that say “Updated 1 second(s) ago”. You know it’s 1 second, right? So why not just say that and skip the “(s)” bit? I’m sure many have tried to solve this the hard way by doing the intelligence themselves (defining multiple keys and choosing which to use based on the argument to be passed into it), but this is where ChoiceFormat comes into play.

As per the docs, “The choice is specified with an ascending list of doubles, where each item specifies a half-open interval up to the next item”. So let’s use the example above to show how it would be defined in a resource bundle:

lastUpdated=Updated {0} {0,choice,0#seconds|1#second|1<seconds} ago

What this does is creates a bunch of contiguous ranges (in ascending order) and finds the best match. I said best match because sometimes there isn’t an exact match. In the example above, negative numbers don’t match any range – but because they’re smaller than the starting range, the first choice is selected. The same logic applies for values that are larger than the highest range (which isn’t possible in this example as the highest range ends at positive infinity).

So let’s run through the scenarios very quickly:

  • If you pass in a negative number, the first choice “seconds” is returned (because it is too small to match anything and the ranges work in ascending order)
  • If you pass in a number between 0 (inclusive) and 1 (exclusive), the first choice matches and “seconds” is returned
  • If you pass in number 1 (exactly), the second choice matches and “second” is returned
  • If you pass in a number greater than 1, the last choice matches and “seconds” is returned

You may have noticed that the definition is different for the last range. What this is effectively doing in code is calling ChoiceFormat.nextDouble(1) which returns the smallest double greater than 1, which is then used as the start of the range. This is not restricted to being used the last range, but can actually be used anywhere. There is a similar ChoiceFormat.previousDouble(double d) that is fairly self-explanatory.

Nifty! Even more so when you consider that some languages have multiple pluralisations to consider (e.g. Russian). So you can’t reasonably assume that you’re always dealing with a simple singular/plural as we have in English – sometimes there are many different plurals to take into account.


Bug in Java 6 DecimalFormat.format()

While trying out some options for decimal formatting, I came across what appears to be a bug in Java 6. I haven’t yet traced back exactly which version introduced the bug, but I have managed to verify it on both Windows 7 running Java 6 Update 23 & 24 and on Mac OS X Snow Leopard running Java 6 Update 26.

According to the JavaDocs for java.text.DecimalFormat:

If there is an explicit negative subpattern, it serves only to specify the negative prefix and suffix; the number of digits, minimal digits, and other characteristics are all the same as the positive pattern. That means that “#,##0.0#;(#)” produces precisely the same behavior as “#,##0.0#;(#,##0.0#)”.

However, when putting this into practice, the formatter truncates the final character from the output. This is shown in the following JUnit test:

@Test
public void testDecimalFormat() {
  double value = -4000d;

  final String expected = "(4,000.00)";
  final String actualA = new DecimalFormat("#,##0.00;(#,##0.00)").format(value);
  final String actualB = new DecimalFormat("#,##0.00;(#)").format(value);

  // passes
  assertEquals(expected, actualA);

  // fails - actualB = "(4,000.00"
  assertEquals(expected, actualB);
}

I have logged this on the Java Bugs Database and will update this post once I have a response from Oracle. But hopefully this helps someone else that has come across the same issue.

UPDATE: Yes, it is a bug. You can track it here (may take a day or two to appear on the external bug database apparently).