3

I am developing an application that will be used primarily by English and Spanish readers. However, in the future I would like to be able to support more extended languages, such as Japanese. While thinking about the design of the program, I have hit a wall on the question of UTF-8 vs. UTF-16 vs. multibyte. I would like to compile my program to support either UTF-8 or UTF-16 (for when languages such as Chinese are used). For that to happen, I was thinking I should have something such as

#if _UTF8
typedef char char_type;
#elif _UTF16
typedef unsigned short char_type;
#else
#error
#endif

That way, in the future when I use UTF-16, I can just switch the #define (and of course, have the same kind of #if/#endif for things such as sprintf, etc.). I have my own custom string type, so that would make use of this scheme as well.
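For illustration, the kind of switching I have in mind might look something like this (str_len, str_printf, and STR are just placeholder names for this example, not an existing API, and I use wchar_t rather than unsigned short here so the standard wide-character functions line up):

#include <stdio.h>
#include <string.h>
#include <wchar.h>

#if _UTF16
typedef wchar_t char_type;                         /* one UTF-16 code unit (16-bit on Windows) */
#define STR(x)                  L##x               /* wide string literal */
#define str_len(s)              wcslen(s)
#define str_printf(buf, n, ...) swprintf(buf, n, __VA_ARGS__)
#else
typedef char char_type;                            /* one UTF-8 code unit */
#define STR(x)                  x
#define str_len(s)              strlen(s)
#define str_printf(buf, n, ...) snprintf(buf, n, __VA_ARGS__)
#endif

/* Usage would then be the same in either build, e.g.:
   char_type buf[32];
   str_printf(buf, 32, STR("%d"), 42);             */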

Would replacing every single use of "char" with my "char_type", in the scenario mentioned above, be considered a "bad idea"? If so, why is it considered a bad idea, and how could I achieve what I described?

The reason I would like to use one or the other is memory efficiency. I would rather not pay for UTF-16 all the time when I am not actually using it.

Peter Mortensen
chadb
  • 1
    So English and Spanish are "basic" languages, and Japanese is an "extended" language? Like, basic healthcare vs dental whitening? – Kerrek SB Jan 22 '12 at 04:52
  • Don't bother. Make your internal representation UTF32, using `char32_t` if you can, and provide clean interfaces. – Kerrek SB Jan 22 '12 at 04:53
I am not sure if that was a joke; however, I did not mean for it to seem like I was being rude about one language or another. I was simply trying to explain why I was asking this question. I am truly sorry if any disrespect was read into my question – chadb Jan 22 '12 at 04:55
  • @Kerrek SB, please see my revised edit. – chadb Jan 22 '12 at 04:56
  • 3
The point was that you shouldn't be thinking of any one language as being any more basic than any other. Just design your program from the start to work with any and every input, and you'll get a much cleaner result. (Also, never use UTF16 internally, as it's a pain without gain. It's still variable-length, and it adds other problems. You don't need to worry about space in the program's memory.) – Kerrek SB Jan 22 '12 at 04:57
  • 2
I would like to be more memory efficient and not just always use the largest possible value – chadb Jan 22 '12 at 04:58
@chadb there is another built-in type, wchar_t, which is used for UTF-16. On Linux it is 32-bit, on Windows 16-bit. It is used in std::wstring. So using wchar_t makes your life easier. – David Feurle Jan 22 '12 at 06:36
  • 2
    I have yet to encounter a situation where wchar_t made my life easier. It has consistently caused a mess and added complexity. – StilesCrisis Jan 22 '12 at 07:10
  • @KerrekSB: UTF-8 is also variable-length, and not without its share of issues. UTF-8 is more compact than UTF-16 for Latin-based languages, but less compact for Eastern Asian languages. UTF-16 is easier to seek through than UTF-8, especially backwards. UTF-16 tends to be easier to parse than UTF-8, since UTF-8 has more variations to account for (Unicode characters can be 1, 2, 3, or 4 bytes) than UTF-16 does (Unicode characters are either 2 or 4 bytes). Most popular programming languages/libraries tend to use/favor UTF-16 over UTF-8. UTF-8 tends to be better used for storage and communications – Remy Lebeau Jan 04 '17 at 18:28

3 Answers

5

UTF-8 can represent every Unicode character. If your application properly supports UTF-8, you are golden for any language.

Note that if you are writing a Windows application, Windows' native controls do not have APIs to set UTF-8 text in them. However, it's easy to make an application that uses UTF-8 internally for everything, converts UTF-8 -> UTF-16 when setting text in Windows, and converts UTF-16 -> UTF-8 when getting text from Windows. I've done it, and it worked great and was MUCH nicer than writing a WCHAR application. It's trivial to convert UTF-8 <-> UTF-16; Windows has APIs for it, or you can find a simple (one-page) function to do it in your own code.
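For example, a sketch of that kind of one-page helper pair, built on the Win32 MultiByteToWideChar / WideCharToMultiByte calls (error handling omitted for brevity):

#include <string>
#include <windows.h>

// UTF-8 -> UTF-16, e.g. before handing text to a SetWindowTextW-style call.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// UTF-16 -> UTF-8, e.g. after reading text back from a control.
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), &utf8[0], len, NULL, NULL);
    return utf8;
}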

Peter Mortensen
StilesCrisis
If I should just always use UTF8, then why is there UTF16, or why are there options in some IDEs, say Visual Studio, for "Unicode or Multibyte"? – chadb Jan 22 '12 at 05:03
  • 1
    @chadb, the option for "Multibyte" is for older programs that still worked with code pages. Since Windows went to UTF-16 internally there's never a reason to use it. As far as Windows is concerned "Unicode" means "UTF-16", which is a shame since UTF-8 is better for most purposes. – Mark Ransom Jan 22 '12 at 05:38
  • @StilesCrisis, if UTF8 can represent every Unicode character, then why isn't Windows just UTF8 rather than UTF16? – chadb Jan 22 '12 at 06:13
  • 2
    Microsoft did all their Unicode stuff before UTF8 caught on, unfortunately. – StilesCrisis Jan 22 '12 at 06:13
  • 2
    Microsoft switched to UTF-16 in Windows 2000. Before then, Windows NT4 used UCS-2 instead. Windows had to use UTF-16 to remain backwards compatible with existing code, and continues to do so to this day. – Remy Lebeau Jan 22 '12 at 06:14
Please note that UTF-16 itself is also a multibyte character set (it adds the same complexity that UTF-8 has). UCS-2 wasn't, but it is no longer used. – David Feurle Jan 22 '12 at 06:34
  • 2
    @chadb: Urgently read http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful. And huge +1 from me to StilesCrisis! – Yakov Galka Jan 22 '12 at 06:58
It would seem that I should just use UTF8 (char) everywhere, and not use a UTF16 type or my own custom type that I would switch between. Is this the conclusion? – chadb Jan 22 '12 at 08:37
  • What do you mean by a custom type? – StilesCrisis Jan 22 '12 at 15:38
  • By custom type I mean what I posted in my original post of 'char_type'. – chadb Jan 22 '12 at 17:23
  • No. With UTF8 you just use 'char.' Using a custom type misses the point. – StilesCrisis Jan 22 '12 at 20:28
  • Alright, that seems to clear everything up. I will just use char and not worry about UTF16 support. – chadb Jan 22 '12 at 23:25
2

I believe choosing UTF-8 is just enough for your needs. Keep in mind that char_type as defined above is less than a full character in either encoding.
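A quick way to see that (illustrative only; the UTF-16 line assumes a Windows-style 16-bit wchar_t):

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    const char utf8_e_acute[] = "\xC3\xA9";          /* U+00E9 'é' encoded in UTF-8 */
    const wchar_t utf16_smiley[] = L"\xD83D\xDE00";  /* U+1F600 as a UTF-16 surrogate pair */

    printf("%u\n", (unsigned)strlen(utf8_e_acute));  /* prints 2: two code units, one character */
    printf("%u\n", (unsigned)wcslen(utf16_smiley));  /* prints 2: two code units, one character */
    return 0;
}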

You may wish to have a look at this discussion: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful for a comparison of the benefits of the different popular encodings.

Community
Pavel Radzivilovsky
0

This is essentially what Windows does with TCHAR (except that the Windows API interprets char as the "ANSI" code page instead of UTF-8).
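Roughly speaking, the mechanism looks like this (a simplified sketch, not the actual contents of <tchar.h>):

#include <string.h>
#include <wchar.h>

#ifdef _UNICODE
typedef wchar_t TCHAR;
#define _T(x)    L##x
#define _tcslen  wcslen
#define _tcscpy  wcscpy
#else
typedef char TCHAR;
#define _T(x)    x
#define _tcslen  strlen
#define _tcscpy  strcpy
#endif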

I think it's a bad idea.

Community
dan04