Unicode Implementations Overview

This overview groups the database vendors into three categories by chronological availability, then by physical data storage characteristics. The physical storage type was chosen because it directly affects how the data is processed (double-byte or multi-byte) and the default operation of the Data Definition Language (DDL). A "CHAR(30) Unicode" statement with UCS-2 will store 30 Unicode characters. This is more efficient to process, but doubles the space required for Latin-1 data.
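The space cost of UCS-2 for Latin-1 data can be checked with a few lines of Python. This is only an illustrative sketch: the sample string and the helper name `ucs2_bytes` are hypothetical, and Python's "utf-16-be" codec stands in for UCS-2, which it matches for characters that need no surrogate pairs.

```python
# Illustrative only: compare the storage cost of the same text in
# Latin-1 (one byte per character) and UCS-2 (two bytes per character).
# "utf-16-be" stands in for UCS-2; the two agree when no surrogates are needed.

def ucs2_bytes(text: str) -> int:
    """Return the bytes needed to store text as UCS-2."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent characters above U+FFFF")
    return len(text.encode("utf-16-be"))

sample = "Unicode Implementations"      # hypothetical Latin-1 column value
latin1_size = len(sample.encode("latin-1"))
ucs2_size = ucs2_bytes(sample)

print(latin1_size, ucs2_size)           # the UCS-2 size is twice the Latin-1 size

# A full 30-character UCS-2 column always occupies 30 * 2 bytes.
full_column = ucs2_bytes("a" * 30)
print(full_column)
```

The fixed two-byte width is what makes processing simple: the n-th character always starts at byte offset 2n.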

A "CHAR(30)" statement with UTF-8 will store between 10 and 30 Unicode characters, because each character occupies one to three bytes. This has a minor performance penalty due to "byte counting", but it does not significantly increase the space needed for Latin-1 data and can be implemented using existing multibyte schemes. Asian data, at three bytes per character instead of two, will increase its data footprint by 50% with UTF-8.
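The "byte counting" involved can be sketched as follows, assuming the column reserves 30 bytes (which is what the 10-to-30 range implies). The function name `chars_that_fit` and the sample strings are hypothetical and used only for illustration.

```python
# Illustrative only: count how many characters fit in a fixed byte budget
# when the text is stored as UTF-8. Assumes the column reserves 30 bytes.

def chars_that_fit(text: str, byte_limit: int = 30) -> int:
    """Return how many leading characters of text fit in byte_limit UTF-8 bytes."""
    used = 0
    count = 0
    for ch in text:
        width = len(ch.encode("utf-8"))   # 1 to 3 bytes for characters up to U+FFFF
        if used + width > byte_limit:
            break                         # never split a character across the limit
        used += width
        count += 1
    return count

ascii_text = "a" * 40        # one byte per character in UTF-8
kanji_text = "\u6f22" * 40   # U+6F22, three bytes per character in UTF-8

print(chars_that_fit(ascii_text))   # 30
print(chars_that_fit(kanji_text))   # 10

# The same kanji need two bytes each in UCS-2 and three in UTF-8.
ucs2_size = len(kanji_text.encode("utf-16-be"))
utf8_size = len(kanji_text.encode("utf-8"))
print(ucs2_size, utf8_size)         # 80 120, a 50% increase
```

The per-character loop is the price of a variable-width encoding: finding the n-th character means scanning from the start rather than computing an offset.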

Note that you may not know how the data is processed internally, and that a UCS-2 datatype does not ensure true Unicode internals.

Unicode available now

Available now, March 1997

UCS-2 can be manipulated and stored today with the IBM DB2 database using the GRAPHIC datatype with the CCSID set to UCS-2 [18]. ADABAS D [16] and Teradata [27] offer separate Unicode datatypes, along with the standard CHAR types. ADABAS D can also store in the UTF-7 and UTF-8 encodings.

UTF-8 is offered today as an alternate default character set on a server-wide (Sybase), database-wide (Oracle), and per-column (Interbase and ADABAS D) basis.

Note: At the time this was written, it was unclear whether Interbase [26] used UCS-2 or UTF-8. An outside source indicated that it used the UTF-8 encoding. The author was unable to confirm this from Borland sources, so it is included here with a question mark.