Character Sets
Java
Java provides a relatively convenient recoding mechanism through the
String class.
Converting bytes from a specific encoding:
String str = new String (bytes, encoding);
Converting a String to a specific encoding:
byte [] bytes = str.getBytes(encoding);
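Put together, the two calls recode data from one character set to another. The following is a minimal sketch, assuming Java 7 or later for `StandardCharsets`; the class and method names are made up for illustration. It uses the `Charset` overloads rather than the string-named ones shown above, because the string-named overloads throw a checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class RecodeExample {
    /** Decodes ISO-8859-1 bytes into a String, then re-encodes that String as UTF-8. */
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };           // 'é' in ISO-8859-1
        byte[] utf8 = latin1ToUtf8(latin1);
        System.out.println(Arrays.toString(utf8)); // [-61, -87], i.e. 0xC3 0xA9
    }
}
```

The String in the middle is the pivot: Java strings are always Unicode internally, so recoding is always decode-then-encode.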
Unfortunately, the list of supported encodings is rather difficult to
find. The Java specification requires ASCII, ISO-8859-1, UTF-8, and
UTF-16 in big-endian, little-endian and specified (byte-order-mark)
order.
Sun's JDK contains many more encodings in the international version.
EBCDIC, for example, is Cp037, though you need to be careful, because
EBCDIC can mean any number of codepages. If in doubt (and in an
English-speaking country), codepage 037 is probably your best bet.
Since I can never find the list of supported encodings and their names when I'm looking, here is the link for Sun's JDK 1.3.
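On Java 1.4 and later, the runtime can also be asked directly which encodings it supports. This is a small sketch built on `java.nio.charset.Charset`; the class and method names are illustrative, and the output depends on the JDK build and installed modules.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class ListCharsets {
    /** Returns the canonical names of all charsets this runtime supports, sorted. */
    static List<String> canonicalNames() {
        // availableCharsets() is a sorted map from canonical name to Charset.
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : canonicalNames()) {
            // Print each canonical name followed by the aliases Java accepts for it.
            System.out.println(name + " " + Charset.forName(name).aliases());
        }
        // isSupported() answers the question for a single name or alias.
        System.out.println("Cp037 supported: " + Charset.isSupported("Cp037"));
    }
}
```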
Conversion
You can of course also use recode (or even dd) to convert from one
character set to another.
recode lat1..cp037 $FILENAME
For conversion inside a C program, Unix provides the iconv facility
(man 3 iconv), though the Unix specification doesn't seem to list the
charsets an implementation is required to support.
This discussion of GNU's implementation is the closest I've been
able to find.
'Official' Names
IANA maintains a list of canonical charset names here, and not at
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
as stated in RFC 2278, IANA Charset Registration Procedures. The IANA
list is useful because it provides the official names as well as
known aliases.
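Java's `Charset` class follows the same name-plus-aliases model, so an alias can be resolved to its canonical name programmatically. A minimal sketch, with an illustrative class and method name; which aliases are recognised depends on the runtime, though `latin1` and `ASCII` are accepted by the standard JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of any library.
final class CharsetAliases {
    /** Resolves a supported charset name or alias to Java's canonical name for it. */
    static String canonicalName(String nameOrAlias) {
        // forName() throws UnsupportedCharsetException for unknown names.
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1")); // ISO-8859-1
        System.out.println(canonicalName("ASCII"));  // US-ASCII
        // The full alias set Java knows for one charset:
        System.out.println(Charset.forName("ISO-8859-1").aliases());
    }
}
```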
EBCDIC
EBCDIC stands for Extended Binary Coded Decimal Interchange Code.
Chances are, you're lucky enough to never have to deal with it.
EBCDIC is a character set IBM uses on mainframes, so you'll probably
only come in contact with it when dealing with legacy business
applications. EBCDIC characters are 8 bits wide. They're table-based
because this corresponds roughly to positioning on punchcards. This
also explains why the letters in EBCDIC aren't consecutive, i.e. 'i'
(0x89) isn't directly followed by 'j' (0x91). Characters are basically
sorted into a table, where the first four bits of the value specify the
column and the second four bits specify the row. For example, the
character 't' has the hex value 0xA3, so you can find 't' in column A,
row 3 of the table below.
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |   |   |   |   |   | & | - | ø | Ø | ° | µ | ^ | { | } | \ | 0 |
| 1 |   |   |   |   |   | é | / | É | a | j | ~ | £ | A | J | ÷ | 1 |
| 2 |   |   |   |   | â | ê | Â | Ê | b | k | s | ¥ | B | K | S | 2 |
| 3 |   |   |   |   | ä | ë | Ä | Ë | c | l | t | · | C | L | T | 3 |
| 4 |   |   |   |   | à | è | À | È | d | m | u | © | D | M | U | 4 |
| 5 |   |   |   |   | á | í | Á | Í | e | n | v | § | E | N | V | 5 |
| 6 |   |   |   |   | ã | î | Ã | Î | f | o | w | ¶ | F | O | W | 6 |
| 7 |   |   |   |   | å | ï | Å | Ï | g | p | x | ¼ | G | P | X | 7 |
| 8 |   |   |   |   | ç | ì | Ç | Ì | h | q | y | ½ | H | Q | Y | 8 |
| 9 |   |   |   |   | ñ | ß | Ñ | ` | i | r | z | ¾ | I | R | Z | 9 |
| A |   |   |   |   | ¢ | ! | ¦ | : | « | ª | ¡ | [ |   | ¹ | ² | ³ |
| B |   |   |   |   | . | $ | , | # | » | º | ¿ | ] | ô | û | Ô | Û |
| C |   |   |   |   | < | * | % | @ | ð | æ | Ð | ¯ | ö | ü | Ö | Ü |
| D |   |   |   |   | ( | ) | _ | ' | ý | ¸ | Ý | ¨ | ò | ù | Ò | Ù |
| E |   |   |   |   | + | ; | > | = | þ | Æ | Þ | ´ | ó | ú | Ó | Ú |
| F |   |   |   |   | \| | ¬ | ? | " | ± | ¤ | ® | × | õ | ÿ | Õ |   |
EBCDIC codepage 037
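The column/row lookup described for the table is plain bit arithmetic: the high four bits of the byte give the column, the low four bits the row. A small sketch of that split follows; the class and method names are invented for illustration.

```java
// Hypothetical helper class, not part of any library.
final class EbcdicCell {
    /** Table column: the high four bits of the byte, as an uppercase hex digit. */
    static char column(int value) {
        return Character.toUpperCase(Character.forDigit((value >> 4) & 0xF, 16));
    }

    /** Table row: the low four bits of the byte, as an uppercase hex digit. */
    static char row(int value) {
        return Character.toUpperCase(Character.forDigit(value & 0xF, 16));
    }

    public static void main(String[] args) {
        int t = 0xA3; // 't' in codepage 037, as stated in the text
        System.out.println("column " + column(t) + ", row " + row(t)); // column A, row 3
    }
}
```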
In case you're wondering: the newline (\n) is 0x15 (strictly speaking
EBCDIC's NL control; its separate LF control sits at 0x25), while
carriage return (\r) is 0x0d, the same as in ASCII! The space character
is 0x40.
If you're looking at a hex dump of EBCDIC characters, it's fairly
easy to single out the digits: they're the bytes starting with 'F', so:
123456789
looks like:
F1F2F3F4F5F6F7F8F9
in hex. The same applies in ASCII, by the way, only with '3':
313233343536373839
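Both dumps can be reproduced from Java. This is a sketch with illustrative class and method names. Because Cp037 is not among the charsets every Java runtime must provide, the code checks for it before use, and the EBCDIC line only prints where that charset is installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class DigitDump {
    /** Formats bytes as uppercase hex, two characters per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String digits = "123456789";
        System.out.println(hex(digits.getBytes(StandardCharsets.US_ASCII))); // 313233343536373839
        // Cp037 is optional, so guard the lookup instead of assuming it exists.
        if (Charset.isSupported("Cp037")) {
            System.out.println(hex(digits.getBytes(Charset.forName("Cp037")))); // F1F2F3F4F5F6F7F8F9
        }
    }
}
```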