Are you maintaining an application that has been around for more than five or six years? Using a pre-Unicode Delphi version (pre-D2009) to do so?
Would you like to switch to a newer Delphi version to gain the advantages of Unicode, generics, extended RTTI, 64-bit, the REST client library and other such niceties? So you can build new features for your users in ways that are not possible, or not cost-effective, with your current version? Or perhaps so you can offer your application as a multi-platform solution?
Or are you already on a Unicode-enabled Delphi version and now faced with textual data coming at you from all sorts of sources in all sorts of character encodings (ASCII / Ansi being just one of them)?
Is having to deal with the Ansi/ASCII to Unicode conversion holding you back?
Do you dread having to deal with all the string types when reading or writing your data?
Strings are still strings, albeit with a different encoding.
As long as you haven’t done any fancy tricks, or (ab)used arrays of chars where you should have used arrays of bytes, your application should make the transition from the pre-Unicode world without too much hassle.
The additions to the RTL to support Unicode make dealing with files using different character encodings and Unicode transformation formats relatively straightforward.
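To give an idea of what "relatively straightforward" looks like, here is a minimal sketch that reads a Windows-1252 file and saves it again as UTF-8 using TEncoding and TFile. The file names are made up for the illustration, and the sketch assumes Windows and a Delphi version with unit scope names (XE2 or later); on Delphi 2010 and XE the units would be plain SysUtils and IOUtils.

```delphi
program EncodingDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

const
  // Hypothetical file names, chosen only for this illustration.
  LegacyFile = 'legacy-1252.txt';
  Utf8File = 'converted-utf8.txt';

var
  Win1252: TEncoding;
  Content: string;

begin
  // GetEncoding returns a new instance that the caller must free.
  Win1252 := TEncoding.GetEncoding(1252);
  try
    // Write a sample file in Windows-1252 so the demo is self-contained.
    // #$00E9 is "e with acute"; the escape avoids source-file encoding issues.
    TFile.WriteAllText(LegacyFile, 'Caf' + #$00E9, Win1252);

    // Read it back: the bytes are decoded into a regular (UTF-16) string.
    Content := TFile.ReadAllText(LegacyFile, Win1252);
  finally
    Win1252.Free;
  end;

  // Save the same text as UTF-8.
  TFile.WriteAllText(Utf8File, Content, TEncoding.UTF8);

  // Compare the number of UTF-16 code units with the number of UTF-8 bytes.
  Writeln('Characters:  ', Length(Content));
  Writeln('UTF-8 bytes: ', Length(TEncoding.UTF8.GetBytes(Content)));
end.
```

The point of the sketch is that the string in memory never changes; only the TEncoding instance you pass in decides how it is turned into bytes and back.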
Below are 20 resources to help you deal with your data in a Unicode world.
1
Computerphile provides a thoroughly enjoyable explanation of why Unicode came to be in the first place. The video also illustrates the “greatest hack” that is nowadays Unicode’s most ubiquitous transformation format, UTF-8.
2 and 3
The number one and two resources on Unicode and Delphi are Cary Jensen’s white paper Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines (direct link to the PDF), and Marco Cantù’s white paper Delphi and Unicode (direct link to the PDF).
Both include a technical overview on how Delphi implements Unicode support and what parts of your application may be affected by it. Marco’s was the original white paper that accompanied the Delphi 2009 version. Cary’s was published about a year later and has the benefit of including advice based on practical experience.
4
A list of Unicode resources isn’t complete without a reference to Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
A must-read for every developer, even if you are not that technically minded and leave the nitty-gritty of reading and writing characters to your component and database vendors or to colleagues who are more technically inclined.
5, 6 and 7
Nick Hodges wrote a triad of articles on Delphi and Unicode that accompanied the Delphi 2009 launch. They are not as comprehensive as Marco’s and Cary’s white papers, but they do give a quick overview of Delphi’s Unicode capabilities and introduce users of pre-D2009 versions to a couple of interesting additions to the RTL.
- Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?
- Delphi in a Unicode World Part II: New RTL Features and Classes to Support Unicode
- Delphi in a Unicode World Part III: Unicodifying Your Code
8, 9, 10, 11, 12 and 13
Delphi’s Unicode implementation comes with a gotcha. Many text functions come in two flavors: the plain one and the Ansi one. For example:
- CompareText compares text without giving any thought to the locale in which the text is used.
- AnsiCompareText is locale sensitive.
When these were introduced, their names seemed like a good idea. After all, the way to deal with locale issues was to use the Ansi “extension” of the ASCII character encoding.
With the introduction of Unicode support it became obvious that naming locale-sensitive functions after the implementation of that locale sensitivity wasn’t the brightest idea, especially as the names needed to be kept for backwards compatibility.
In the Unicode world, where you need to deal with the difference between Unicode and ASCII/Ansi character encodings, having to use Ansi-named functions for locale sensitivity is confusing, to say the least.
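A small sketch of the difference, assuming Windows and a Unicode Delphi version (XE2 or later for the System.SysUtils unit name). The two words are my own sample data: they differ only in case, but the first letter lies outside ASCII. The plain CompareText only folds the case of 'a' to 'z', so it should report a difference, while AnsiCompareText hands the comparison to the operating system's locale-aware routines and should report the words as equal.

```delphi
program CompareDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  // Sample data for this illustration: the French word for "school" in
  // upper and lower case. #$00C9 and #$00E9 are capital and small
  // "e with acute".
  UpperWord = #$00C9 + 'COLE';
  LowerWord = #$00E9 + 'cole';

begin
  // Locale-independent: case folding is limited to the ASCII letters,
  // so the two accented first letters are expected to compare as different
  // (a nonzero result).
  Writeln('CompareText:     ', CompareText(UpperWord, LowerWord));

  // Locale-sensitive despite the "Ansi" in the name: works on Unicode
  // strings and is expected to treat the two words as equal (zero).
  Writeln('AnsiCompareText: ', AnsiCompareText(UpperWord, LowerWord));
end.
```

So in Unicode Delphi the "Ansi" prefix is best read as "locale-aware", not as a statement about the character encoding of the arguments.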
Another couple of interesting tidbits and useful experiences can be found in these posts:
- Be careful with Ord function in Unicode Delphi versions by Serg
- Widechar sets for Unicode by Peter Below
- Will The Real UTF8String Stand Up? by Jan Goyvaerts
- I Like My Bytes Raw by Jan Goyvaerts
- Using RawByteString Effectively by Jan Goyvaerts
- What should I use? UTF8 or UTF16? question on StackOverflow
14
If you need to convert to and from Unicode a lot, then the DIConverters library (LGPL open source) may be of help to you. It is a Delphi character conversion library that provides conversion functions for a dozen dozen character encodings.
15
Working with Unicode in XML files can present some challenges. Guidelines by the World Wide Web Consortium (W3C) can be found in Unicode in XML and other Markup Languages.
16, 17, 18, 19 and 20
The Unicode specification is incredibly extensive. You can quite literally get lost in there. The pages and “entry points” I have found most useful are:
- The Unicode® Standard: A Technical Introduction. An important realisation I took away from this page is that most software can get away with supporting “only” the Basic Multilingual Plane (BMP for short), which comprises 64K code points and covers the majority of common-use characters, including the “simplified” Asian character sets.
- Frequently asked questions
- Unicode 7.0 Character Code Charts. Provides character charts much like the familiar ASCII / Ansi character charts for code pages.
- Unicode Transformation Formats: UTF-8 & Co.
- UTF-8 Encoding: a written illustration of the “greatest hack” that is UTF-8.
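To make the BMP remark and the UTF-8 “hack” a bit more tangible, here is a minimal sketch that prints how many UTF-16 code units (Delphi Chars) and how many UTF-8 bytes a few characters need. The Report procedure and the choice of characters are mine, purely for illustration; the sketch assumes Windows and Delphi XE2 or later. The last character lies outside the BMP, so it is written as a surrogate pair.

```delphi
program PlanesDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: prints the storage cost of S in UTF-16 and UTF-8.
procedure Report(const Name, S: string);
begin
  Writeln(Name, ': ', Length(S), ' UTF-16 code unit(s), ',
    TEncoding.UTF8.GetByteCount(S), ' UTF-8 byte(s)');
end;

begin
  // Inside the BMP: always one UTF-16 code unit, one to three UTF-8 bytes.
  Report('U+0041 LATIN CAPITAL LETTER A', 'A');               // 1 unit, 1 byte
  Report('U+00E9 LATIN SMALL LETTER E WITH ACUTE', #$00E9);   // 1 unit, 2 bytes
  Report('U+20AC EURO SIGN', #$20AC);                         // 1 unit, 3 bytes

  // Outside the BMP: a surrogate pair in UTF-16, four bytes in UTF-8.
  Report('U+1F600 GRINNING FACE', #$D83D#$DE00);              // 2 units, 4 bytes
end.
```

Code that assumes one Char per character keeps working as long as the text stays inside the BMP, which is why supporting “only” the BMP is a workable shortcut for a lot of software. Plain ASCII text is byte-for-byte identical in UTF-8, which is a large part of why that format became so ubiquitous.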
The Unicode specification contains not only letters, but also punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. Version 7.0 provides codes for 112,956 characters from the world’s alphabets, ideograph sets, and symbol collections.
A few examples from the character code charts are: braille, dingbats, chess, domino tiles, Mahjong tiles, playing cards, musical symbols, cuneiform, technical symbols, transport and map symbols, mathematical symbols and operators, and much, much more.
You name it and Unicode probably has it. Just have a look through the Miscellaneous Symbols and Pictographs chart. It would seem that there is hardly anything left that you can’t depict with a single Unicode character.
Just one caveat: the font of your choice needs to support these characters if you want to display them.