Special Character Conversion Problems – ISO-8859-1 to Unicode


Ever see one of these funny characters (black diamond with question mark) or (4-digits in a box).

The black diamond with a question mark appears when the value for a special character doesn’t match a character in the character set used for displaying the text. I found this happened when working with text data that was encoded in ISO-8859-1 and displayed as UTF-8. This is a common problem, as text has to be converted from its original character set before it can be displayed in another one.
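To make the mismatch concrete, here is a minimal sketch that decodes the same bytes once as ISO-8859-1 and once as UTF-8. The class and method names (`MojibakeDemo`, `DecodeLatin1`, `DecodeAsUtf8`) are made up for illustration and are not part of the original project.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes the bytes with the encoding they were actually written in.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Decodes the same bytes as if they were UTF-8. Encoding.UTF8 substitutes
    // U+FFFD (the replacement character) for byte sequences that are not valid UTF-8.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For example, the bytes 0x63 0x61 0x66 0xE9 spell "café" in ISO-8859-1, but the lone 0xE9 is not a complete UTF-8 sequence, so decoding them as UTF-8 produces a replacement character where the é should be.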

Conversion Problems

But what do you do when you convert the data from the incoming encoding to the outgoing encoding and still end up with the black diamond with a question mark or the box with four digits in it?

Well, that was the problem I encountered.

For about a week I banged my head trying to solve that character conversion problem in a .NET project. I received data in ISO-8859-1 format and displayed it in UTF-8. The problem occurred when the special character bullet (•) was in the data received. It showed up as the black diamond with a question mark because I was not doing a conversion to UTF-8. So I did some research and found:

// Requires: using System.Text;
public static string iso8859ToUnicode(string textToConvert)
{
    Encoding iso8859 = Encoding.GetEncoding("iso-8859-1");
    Encoding unicode = Encoding.Unicode;
    byte[] srcTextBytes = iso8859.GetBytes(textToConvert);
    byte[] destTextBytes = Encoding.Convert(iso8859, unicode, srcTextBytes);
    char[] destChars = new char[unicode.GetCharCount(destTextBytes, 0, destTextBytes.Length)];
    unicode.GetChars(destTextBytes, 0, destTextBytes.Length, destChars, 0);
    // Build the string from the chars; destChars.ToString() would return "System.Char[]".
    return new string(destChars);
}

After I used this function and displayed the text, I found that the bullet had been converted to U+0095, which was displayed as a box with 0095 in it. I searched Google for u0095 and kept getting references to Unicode, so I started to suspect that the conversion was incorrect. I came across "Bullet – Unicode Character", which listed the conversion chart for a bullet and gave U+2022 as the correct Unicode character. That did not match the U+0095 I was getting, so I wondered if the conversion was broken. I researched a little more and found "Message Waiting – Unicode Character", which is the U+0095 character.

So I had converted successfully from ISO-8859-1 to Unicode, but when the text was displayed in a browser as UTF-8, the browser did not seem to recognize that character, and I ended up with the box with four digits in it.
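One possible explanation for the U+0095 versus U+2022 mismatch, offered as an assumption rather than something confirmed about this data: in the Windows-1252 code page the byte 0x95 is the bullet, while in ISO-8859-1 the same byte is a control character. Data labeled ISO-8859-1 is often really Windows-1252. The sketch below compares the two decodings; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class BulletByteDemo
{
    // ISO-8859-1 maps every byte straight to the code point with the same number,
    // so 0x95 becomes the control character U+0095.
    public static string DecodeIso8859(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Windows-1252 maps 0x95 to U+2022 (bullet).
    public static string DecodeWindows1252(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before use. On .NET Framework code page 1252 is
        // built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }
}
```

If the incoming bytes are available before they are turned into a string, decoding them with code page 1252 in this way would give the real bullet character directly. Whether that applies depends on where the data actually comes from.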

How To Get the Browser To Display The Special Unicode Characters

When I examined the chart at FileFormat for Message Waiting, it indicated that &#149; (•) is the HTML entity for the Message Waiting Dot. So I looked for how to convert Unicode to HTML entities in .NET. The method to use is:

string html = Server.HtmlEncode(str);

But this didn’t solve my problem. In my case HtmlEncode only converted special characters below 127 in the ASCII table. My research led me to a post about expanding HtmlEncode to include special characters above 127. Apparently the integer value of the Unicode character is also the HTML entity number. So putting &# in front of the integer value and a semicolon after it gives the HTML entity for that Unicode character. Example:

&#149; (•).

The code for the special character conversion is:

StringBuilder result = new StringBuilder(textToConvert.Length + (int)(textToConvert.Length * 0.1));
foreach (char c in destChars)
{
    int value = Convert.ToInt32(c);
    if (value > 127)
        result.AppendFormat("&#{0};", value);
    else
        result.Append(c);
}
string html = result.ToString();

The Final Conversion Method

I put the ISO-8859-1 conversion to Unicode together with the special character conversion to make sure the data will display in the browser. The entire method is:

// Requires: using System; using System.Text;
public static string iso8859ToUnicode(string textToConvert)
{
    Encoding iso8859 = Encoding.GetEncoding("iso-8859-1");
    Encoding unicode = Encoding.Unicode;
    byte[] srcTextBytes = iso8859.GetBytes(textToConvert);
    byte[] destTextBytes = Encoding.Convert(iso8859, unicode, srcTextBytes);
    char[] destChars = new char[unicode.GetCharCount(destTextBytes, 0, destTextBytes.Length)];
    unicode.GetChars(destTextBytes, 0, destTextBytes.Length, destChars, 0);

    // Replace every character above 127 with its numeric HTML entity.
    StringBuilder result = new StringBuilder(textToConvert.Length + (int)(textToConvert.Length * 0.1));
    foreach (char c in destChars)
    {
        int value = Convert.ToInt32(c);
        if (value > 127)
            result.AppendFormat("&#{0};", value);
        else
            result.Append(c);
    }
    return result.ToString();
}
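To show what the entity step produces on its own, here is a small self-contained sketch with a usage example. `EntityEncoder` and `ToNumericEntities` are hypothetical names, and the sketch assumes the input is already a decoded .NET string. It covers only the entity replacement, not the encoding conversion.

```csharp
using System;
using System.Text;

public static class EntityEncoder
{
    // Replaces every char above 127 with a decimal numeric character reference
    // ("&#" + code + ";") and copies everything else through unchanged.
    public static string ToNumericEntities(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var result = new StringBuilder(text.Length + text.Length / 10);
        foreach (char c in text)
        {
            if (c > 127)
                result.Append("&#").Append((int)c).Append(';');
            else
                result.Append(c);
        }
        return result.ToString();
    }
}
```

Calling `EntityEncoder.ToNumericEntities("a\u0095b")` gives `a&#149;b`, and a real bullet (`"\u2022"`) gives `&#8226;`. As far as I know, browsers interpret numeric references in the 128 to 159 range using the Windows-1252 table, which would explain why `&#149;` ends up rendered as a bullet. One caveat of working char by char: a character outside the Basic Multilingual Plane would be emitted as two separate references, which is not a concern for text that started out as ISO-8859-1.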
