literal thoughts

May 1, 2008

IRC and character encoding

Filed under: english, posted by Hinrik Örn Sigurðsson @ 1:44 am

A while ago, I wrote an IRC logger for POE::Component::IRC, an IRC client module for Perl. The main challenge I faced was character encodings. Since IRC is rife with clients that use different encodings, messages must be reliably decoded before they are written to a file.

You see, RFC 1459, the standards document describing the IRC protocol, does not regulate the use of character encodings:

2.2 Character codes
 
   No specific character set is specified. The protocol is based on a
   set of codes which are composed of eight (8) bits, making up an
   octet. Each message may be composed of any number of these octets;
   however, some octet values are used for control codes which act as
   message delimiters.
 
   Regardless of being an 8-bit protocol, the delimiters and keywords
   are such that protocol is mostly usable from USASCII terminal and a
   telnet connection.

ASCII only defines the values 0 to 127, which fit in seven bits. So, from the looks of it, you can only rely on octets below 0x80 representing ASCII characters, the interpretation of any octet with the eighth bit set being anyone’s guess. That’s bad.
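
To make the ambiguity concrete, here is a small sketch that decodes one and the same high-bit octet under three different single-byte encodings, using Perl’s core Encode module. The octet and the three code pages are picked purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet with the eighth bit set, as it might arrive over IRC.
my $octet = "\xE9";

# Decode the same octet under three single-byte code pages.
my %decoded;
for my $encoding (qw(cp1252 cp1251 iso-8859-7)) {
    $decoded{$encoding} = decode($encoding, $octet);
}

# Print the resulting code point for each encoding.
for my $encoding (sort keys %decoded) {
    printf "%-10s U+%04X\n", $encoding, ord $decoded{$encoding};
}
```

Each code page yields a different character for the same octet (é in CP1252, й in CP1251, ι in ISO-8859-7), and nothing in the octet itself says which one the sender meant.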

For most of IRC’s history, the most popular IRC client has been mIRC. Until recently, mIRC decoded incoming messages using whichever ANSI code page was active on the user’s Windows system. This meant that mIRC users who wanted to communicate using anything other than ASCII characters had better all be using the same code page. Later versions decode incoming messages as UTF-8 if they look UTF-8 encoded, and as code page 1252 (used by most Westerners) otherwise. As for how mIRC decides this, I cannot know, since it is closed-source.

The open-source client irssi handles the situation similarly. It uses GLib’s g_utf8_validate() function to check whether the incoming message is valid UTF-8, and falls back to CP1252 by default if it is not. XChat uses the same GLib function, but if it determines that the message is not UTF-8, it decodes the message in a rather novel way. Here is an excerpt from its src/common/text.c:

/* converts a CP1252/ISO-8859-1(5) hybrid to UTF-8                           */
/* Features: 1. It never fails, all 00-FF chars are converted to valid UTF-8 */
/*           2. Uses CP1252 in the range 80-9f because ISO doesn't have any- */
/*              thing useful in this range and it helps us receive from mIRC */
/*           3. The five undefined chars in CP1252 80-9f are replaced with   */
/*              ISO-8859-15 control codes.                                   */
/*           4. Handles 0xa4 as a Euro symbol ala ISO-8859-15.               */
/*           5. Uses ISO-8859-1 (which matches CP1252) for everything else.  */
/*           6. This routine measured 3x faster than g_convert :)            */
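
The features listed in that comment can be approximated in pure Perl. The following is only a sketch based on the comment, not a port of XChat’s C routine. The names %HYBRID_MAP and decode_hybrid are made up for this example, and passing the five undefined octets through as control characters with the same values is one reading of feature 3.

```perl
use strict;
use warnings;

# Octets that differ from ISO-8859-1: the CP1252 characters in 0x80-0x9F
# plus 0xA4 as the Euro sign. The five octets CP1252 leaves undefined
# (0x81, 0x8D, 0x8F, 0x90, 0x9D) are absent from the table, so they pass
# through as control characters with the same value (an assumption).
my %HYBRID_MAP = (
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    0xA4 => 0x20AC,    # Euro sign, as in ISO-8859-15
);

# Takes a byte string and returns a character string. It never fails:
# every octet 0x00-0xFF maps to exactly one code point.
sub decode_hybrid {
    my ($octets) = @_;
    return join '', map {
        chr(exists $HYBRID_MAP{$_} ? $HYBRID_MAP{$_} : $_)
    } unpack 'C*', $octets;
}

# Both 0x80 (CP1252) and 0xA4 (ISO-8859-15) come out as the Euro sign.
printf "U+%04X\n", ord($_) for split //, decode_hybrid("\x80\xA4");
```

A lookup table keeps the routine dependency-free, at the cost of being slower than the C original.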

How would I handle this in Perl? I don’t want to depend on GLib, and I don’t want to write any C code (requiring the user to have a C compiler). At first I tried using Encode::Detect, but there are two problems with it. It’s an extra dependency, and more importantly, it works heuristically, deciding which character set is being used based on the number of occurrences of each character code. As such, it’s only reliable when large amounts of data are involved, such as a whole web page, which is what the code was written for. Then I learned of Encode::Guess, which has been included with Perl since version 5.8.0. The following decodes $line as UTF-8 if Encode::Guess is sure that it’s UTF-8. Otherwise it decodes it as CP1252.

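use Encode qw(decode);
use Encode::Guess;

# guess_encoding() returns an encoding object on success and a plain
# error string on failure, so ref() tells the two cases apart.
my $utf8 = guess_encoding($line, 'utf8');
$line = ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);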
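
Wrapped in a subroutine, the same idea can be tried on sample input. The helper name decode_irc_line and the two sample lines below are assumptions made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;

# Hypothetical helper: decode one raw IRC line (a byte string) as UTF-8
# if Encode::Guess accepts it as such, otherwise as CP1252.
sub decode_irc_line {
    my ($line) = @_;
    my $utf8 = guess_encoding($line, 'utf8');
    return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}

my $from_utf8   = decode_irc_line("caf\xC3\xA9");    # "café" sent as UTF-8
my $from_cp1252 = decode_irc_line("caf\xE9");        # "café" sent as CP1252

# Both senders should end up as the same character string.
print $from_utf8 eq $from_cp1252 ? "same text\n" : "different text\n";
```

Pure ASCII lines take the UTF-8 branch as well, which is harmless because ASCII is a subset of UTF-8.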

So far this method has worked flawlessly for me on channels with mixed encodings. However, I don’t know exactly how Encode::Guess works, so I’m not as confident in this method as I could be. Any feedback on this issue would be quite welcome.
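
For comparison, a stricter check that does not involve Encode::Guess is to attempt a strict UTF-8 decode and fall back to CP1252 when it fails, which is closer in spirit to the g_utf8_validate() approach described above. This is a sketch of that alternative, not the method used in the logger, and decode_strict is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical alternative: accept the octets as UTF-8 only if they form
# valid UTF-8 from start to end, otherwise treat them as CP1252.
sub decode_strict {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $octets intact so that it can still be used for the fallback.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };

    return defined $text ? $text : decode('cp1252', $octets);
}

my $euro_utf8   = decode_strict("\xE2\x82\xAC");    # Euro sign as UTF-8
my $euro_cp1252 = decode_strict("\x80");            # Euro sign as CP1252

print $euro_utf8 eq $euro_cp1252 ? "same text\n" : "different text\n";
```

The trade-off is the same as with any validate-then-fall-back scheme: a CP1252 message whose octets happen to form valid UTF-8 will be decoded as UTF-8.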

Comments

  1. The only feedback I have is that I want all my encoding problems fixed :)

    Comment by Viðar — May 10, 2008 @ 8:41 am

  2. Your reading of the docs/source of E::G are as good as mine, but it looks sane and comparable to g_utf8_validate() to me.

    Comment by LionsPhil — August 20, 2008 @ 7:52 pm
