Should UTF-16 be considered harmful?

Solutions:


This is an old answer.
See UTF-8 Everywhere for the latest updates.

Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that, some time ago, there was a misguided belief that widechar was going to be what UCS-4 now is.

Despite the “anglo-centrism” of UTF-8, it should be considered the only useful encoding for text. One can argue that program source code, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But since they do exist, text is not only for human readers.

On the other hand, the overhead of UTF-8 is a small price to pay for its significant advantages, such as compatibility with unaware code that just passes strings around as char*. This is a great thing. There are few useful characters that are SHORTER in UTF-16 than they are in UTF-8.
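To make the size trade-off concrete, here is a minimal sketch that computes how many bytes one code point takes in each encoding. The function names utf8_length and utf16_length are hypothetical, chosen for illustration only. It shows that UTF-16 is shorter only for U+0800 through U+FFFF (2 bytes against 3), equal for U+0080 through U+07FF and for everything above U+FFFF, and longer for ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True for values that can be encoded: up to U+10FFFF, excluding surrogates.
bool is_scalar_value(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Bytes needed to encode one code point in UTF-8 (hypothetical helper).
std::size_t utf8_length(std::uint32_t cp)
{
    assert(is_scalar_value(cp));
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Latin with diacritics, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Bytes needed to encode the same code point in UTF-16 (hypothetical helper).
std::size_t utf16_length(std::uint32_t cp)
{
    assert(is_scalar_value(cp));
    return cp < 0x10000 ? 2 : 4; // one code unit, or a surrogate pair
}
```

For example, utf8_length(0x41) is 1 while utf16_length(0x41) is 2, and utf8_length(0x4E2D) is 3 while utf16_length(0x4E2D) is 2.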

I believe that all other encodings will die eventually. This means that MS-Windows, Java, ICU and Python must stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except in OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly.

To people who say “use what is needed where it is needed”, I say: there’s a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations, though, is that every std::string or char* parameter be considered Unicode-compatible.

I am also against the “use what you want” approach. I see no reason for such liberty. There’s enough confusion on the subject of text, resulting in all this broken software. Having said that, I am convinced that programmers must finally reach consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I’d be the last person expected to attack UTF-16 on religious grounds.)

I’d like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked Unicode correctness, ease of use and better portability of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations led to the same conclusion. So here goes:

  • Do not use wchar_t or std::wstring anywhere other than at the point adjacent to APIs accepting UTF-16.
  • Don’t use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard, as a part of UTF-16 deprecation).
  • Don’t use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, keep _UNICODE always defined, so that passing char* strings to WinAPI does not get silently compiled.
  • std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise).
  • All my strings are std::string, though you can pass char* or a string literal to convert(const std::string &).
  • Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:

    ::SetWindowTextW(hwnd, Utils::convert(someStdString or "string literal").c_str())
    

    (The policy uses conversion functions below.)

  • With MFC strings:

    CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
    
    std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
    AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
    
  • Working with files, filenames and fstream on Windows:

    • Never pass std::string or const char* filename arguments to fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
    • Convert std::string arguments to std::wstring with Utils::Convert:

      std::ifstream ifs(Utils::Convert("hello"),
                        std::ios_base::in |
                        std::ios_base::binary);
      

      We’ll have to remove the conversion manually when MSVC’s attitude to fstream changes.

    • This code is not multi-platform and may have to be changed manually in the future
    • See fstream unicode research/discussion case 4215 for more info.
    • Never produce text output files with non-UTF8 content
    • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString();   // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();   
#endif
}
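The conversion bodies above are left elided. On Windows they would most likely wrap the Win32 calls MultiByteToWideChar and WideCharToMultiByte. As a portable illustration only, and explicitly not the author's implementation, here is a minimal sketch of the UTF-8 to UTF-16 direction written by hand. The name utf8_to_utf16, the use of std::u16string instead of std::wstring, and the choice to throw on malformed input are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top payload bits.
        if      (lead < 0x80)           { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Above the BMP: split into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Throwing on bad input is only one possible policy; substituting U+FFFD for malformed sequences is a common alternative when the text comes from an untrusted source.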

Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).

Some examples:

  • Roman numeral codepoints like “ⅲ”. (A single character that looks like “iii”.)
  • Accented characters like “á”, which can be represented as either a single combined character “\u00e1” or a character and separated diacritic “\u0061\u0301”.
  • Characters like Greek lowercase sigma, which have different forms for middle (“σ”) and end (“ς”) of word positions, but which should be considered synonyms for search.
  • Unicode discretionary hyphen U+00AD, which might or might not be visually displayed, depending on context, and which is ignored for semantic search.

The only ways to get Unicode editing right are to use a library written by an expert, or to become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
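The accented-character example can be checked with a few lines of code. This is a minimal sketch, assuming valid UTF-8 input; count_code_points is a hypothetical helper name. It counts codepoints by skipping UTF-8 continuation bytes, which is exactly the kind of naive counting that does not tell you how many characters a reader sees.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of codepoints in a valid UTF-8 string.
// Every byte that is not a continuation byte (10xxxxxx) starts a codepoint.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}
```

With the precomposed form "\xC3\xA1" (U+00E1) the count is 1, and with the decomposed form "a\xCC\x81" (U+0061 U+0301) it is 2, although both display as “á”. The two strings also compare unequal byte for byte, which is why searching and comparison need normalization from a proper Unicode library.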

There is a simple rule of thumb on what Unicode Transformation Format (UTF) to use:
– UTF-8 for storage and communication
– UTF-16 for data processing
– you might go with UTF-32 if most of the platform API you use is UTF-32 (common in the UNIX world).

Most systems today use UTF-16 (Windows, Mac OS, Java, .NET, ICU, Qt).
Also see this document: http://unicode.org/notes/tn12/

Back to “UTF-16 as harmful”, I would say: definitely not.

People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don’t understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.

Just read this series here http://www.siao2.com/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.
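To show how small the surrogate part of the problem is, here is a minimal sketch of decoding UTF-16 code units into code points. The name utf16_to_utf32 and the policy of replacing an unpaired surrogate with U+FFFD are assumptions for this example, not something taken from the linked series.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-16 code units into code points.
// An unpaired surrogate is replaced with U+FFFD (an assumed policy).
std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine ten bits from each half.
            const char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(0xFFFD); // lone high or low surrogate
        } else {
            out.push_back(unit);   // ordinary BMP code unit
        }
    }
    return out;
}
```

The pair handling is one branch. What it does not solve is the harder part: u"e\u0301" still decodes to two code points for what a reader perceives as one character, and that is true in every UTF.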
