Character Sets

Java

Java provides a relatively convenient recoding mechanism through the String class.
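As a minimal sketch of that mechanism: decode the bytes into a String using the source encoding, then ask the String for its bytes in the target encoding. The class and method names here (RecodeExample, recode) are made up for illustration, and the Charset overloads used assume Java 6 or later; older JDKs take the encoding name as a String instead.

```java
import java.nio.charset.Charset;

class RecodeExample {
    // Decode the input with the source charset, then re-encode the
    // resulting String with the target charset.
    static byte[] recode(byte[] input, String from, String to) {
        String text = new String(input, Charset.forName(from));
        return text.getBytes(Charset.forName(to));
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };   // 'é' in ISO-8859-1
        byte[] utf8 = recode(latin1, "ISO-8859-1", "UTF-8");
        for (byte b : utf8) {
            // Mask to print the byte as an unsigned value.
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```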

Unfortunately, the list of supported encodings is rather difficult to find. The Java specification requires US-ASCII, ISO-8859-1, UTF-8, and UTF-16 in big-endian, little-endian, and specified (byte order mark) byte order.

Sun's JDK contains many more encodings in the international version. EBCDIC, for example, is Cp037, though you need to be careful, because EBCDIC can mean any number of codepages. If in doubt (and in an English-speaking country), codepage 037 is probably your best bet.
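Whether Cp037 is actually present depends on the JDK build, so it can be worth checking before relying on it. The following is only a sketch; the class and method names (Cp037Check, toEbcdic) are hypothetical, and java.nio.charset.Charset requires Java 1.4 or later.

```java
import java.nio.charset.Charset;

class Cp037Check {
    // Encode text as EBCDIC codepage 037, failing loudly if this JVM
    // does not ship the charset.
    static byte[] toEbcdic(String text) {
        if (!Charset.isSupported("Cp037")) {
            throw new IllegalStateException("Cp037 is not available in this JVM");
        }
        return text.getBytes(Charset.forName("Cp037"));
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("Cp037");
        // Print the name the JVM uses for this charset and its aliases.
        System.out.println(cs.name() + " " + cs.aliases());
        for (byte b : toEbcdic("A")) {
            System.out.printf("%02X%n", b & 0xFF);
        }
    }
}
```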

Since I can never find the list of supported encodings and their names when I'm looking, here is the link for Sun's JDK 1.3.
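On Java 1.4 and later (so newer than the JDK 1.3 mentioned above), the running JVM can also report the list itself. A minimal sketch, with the class name ListCharsets and the helper allRequiredSupported invented for this example:

```java
import java.nio.charset.Charset;
import java.util.Map;

class ListCharsets {
    // The charsets every Java platform implementation is required to support.
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    static boolean allRequiredSupported() {
        for (String name : REQUIRED) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("required charsets present: " + allRequiredSupported());
        // One line per charset: canonical name followed by its aliases.
        for (Map.Entry<String, Charset> e : Charset.availableCharsets().entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().aliases());
        }
    }
}
```

The output differs from one JDK build to the next, which is the point: it shows what this particular installation supports.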


Conversion

You can of course also use recode (or even dd) in order to convert from one character set to another.

recode lat1..cp037 $FILENAME
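For comparison, roughly the same conversion can be sketched in Java. Unlike the recode call above, which rewrites the file in place, this version writes to a second file. The class and method names (ConvertFile, convert) are made up, the paths come from the command line, and java.nio.file requires Java 7 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ConvertFile {
    // Stream characters from one file to another, decoding with `from`
    // and encoding with `to`.
    static void convert(Path in, Charset from, Path out, Charset to) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, from);
             Writer writer = Files.newBufferedWriter(out, to)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ConvertFile input.txt output.txt
        convert(Paths.get(args[0]), Charset.forName("ISO-8859-1"),
                Paths.get(args[1]), Charset.forName("Cp037"));
    }
}
```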

For conversion inside of a C program, Unix provides the iconv facility (man 3 iconv), though it seems the Unix Spec doesn't provide a list of charsets required in the implementations. This discussion of GNU's implementation is the closest I've been able to find.


'Official' Names

IANA maintains a list of canonical charset names here, and not at ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets as stated in RFC 2278 (IANA Charset Registration Procedures). The IANA list is useful because it gives the official names as well as the known aliases.
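Java's Charset class follows the same idea of one canonical name plus aliases, so it can be used to resolve an alias. A small sketch, with the class and method names (CharsetNames, canonicalName) invented for illustration and Java 1.4 or later assumed:

```java
import java.nio.charset.Charset;

class CharsetNames {
    // Resolve any alias the JVM knows to the charset's canonical name.
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        for (String alias : new String[] { "latin1", "ASCII", "UTF8" }) {
            Charset cs = Charset.forName(alias);
            System.out.println(alias + " -> " + cs.name() + " " + cs.aliases());
        }
    }
}
```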


EBCDIC

EBCDIC stands for Extended Binary Coded Decimal Interchange Code. Chances are you're lucky enough never to have to deal with it. EBCDIC is a character set IBM uses on mainframes, so you'll probably only come into contact with it when dealing with legacy business applications. EBCDIC characters are 8 bits wide. They're table-based because this corresponds roughly to positioning on punchcards. This also explains why characters in EBCDIC aren't consecutive, i.e. 'i' isn't followed by 'j'. Characters are basically sorted into a table: the first four bits of the value specify the column, while the last four bits specify the row. For example, the character 't' has the hex value 0xA3, so you can find 't' in column A, row 3 in the table below.

     4    5    6    7    8    9    A    B    C    D    E    F
0    sp   &    -    ø    Ø    °    µ    ^    {    }    \    0
1    nbsp é    /    É    a    j    ~    £    A    J    ÷    1
2    â    ê    Â    Ê    b    k    s    ¥    B    K    S    2
3    ä    ë    Ä    Ë    c    l    t    ·    C    L    T    3
4    à    è    À    È    d    m    u    ©    D    M    U    4
5    á    í    Á    Í    e    n    v    §    E    N    V    5
6    ã    î    Ã    Î    f    o    w    ¶    F    O    W    6
7    å    ï    Å    Ï    g    p    x    ¼    G    P    X    7
8    ç    ì    Ç    Ì    h    q    y    ½    H    Q    Y    8
9    ñ    ß    Ñ    `    i    r    z    ¾    I    R    Z    9
A    ¢    !    ¦    :    «    ª    ¡    [    shy  ¹    ²    ³
B    .    $    ,    #    »    º    ¿    ]    ô    û    Ô    Û
C    <    *    %    @    ð    æ    Ð    ¯    ö    ü    Ö    Ü
D    (    )    _    '    ý    ¸    Ý    ¨    ò    ù    Ò    Ù
E    +    ;    >    =    þ    Æ    Þ    ´    ó    ú    Ó    Ú
F    |    ¬    ?    "    ±    ¤    ®    ×    õ    ÿ    Õ

Columns 0 through 3 and position 0xFF hold control characters and are not shown. sp is the space, nbsp the no-break space, and shy the soft hyphen.

EBCDIC codepage 037
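The column and row lookup described above is just a split of the byte into its two nibbles. A minimal sketch, assuming the JVM ships Cp037; the class and method names (EbcdicCell, cell) are made up for this example.

```java
import java.nio.charset.Charset;

class EbcdicCell {
    // Encode one character in Cp037 and report where it sits in the table:
    // the high four bits are the column, the low four bits are the row.
    static String cell(char c) {
        byte[] encoded = String.valueOf(c).getBytes(Charset.forName("Cp037"));
        int value = encoded[0] & 0xFF;   // treat the byte as unsigned
        int column = value >> 4;
        int row = value & 0x0F;
        return String.format("column %X, row %X", column, row);
    }

    public static void main(String[] args) {
        System.out.println("t: " + cell('t'));
        // 'i' and 'j' are neighbours in ASCII but not in EBCDIC.
        System.out.println("i: " + cell('i'));
        System.out.println("j: " + cell('j'));
    }
}
```

Here 'i' lands at 0x89 and 'j' at 0x91, which shows the gap between the two letters.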

In case you're wondering: line feed (\n) is 0x15 and carriage return (\r) is 0x0d; the latter is the same as in ASCII! The space character is 0x40.

If you're looking at a hex dump of EBCDIC characters, it's fairly easy to single out the digits: they're the bytes whose first hex digit is 'F', so:

123456789

looks like:

F1F2F3F4F5F6F7F8F9

in hex. The same applies in ASCII, by the way, only with '3':

313233343536373839
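Both dumps can be reproduced with a few lines of Java. This is a sketch only, assuming Cp037 is available in the JVM; the class and method names (HexDump, hex) are invented for illustration.

```java
import java.nio.charset.Charset;

class HexDump {
    // Encode the text in the named charset and return the bytes as
    // upper-case hex, two digits per byte.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("123456789", "Cp037"));     // EBCDIC digits
        System.out.println(hex("123456789", "US-ASCII"));  // ASCII digits
    }
}
```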
