Conversion of Ancient Buddhist Texts to Unicode

Last year I updated the Ancient Buddhist Texts website so that the texts could be read in Unicode. However, the underlying encoding was still in the old standard and was only changed over through javascript, so each page took a couple of seconds to convert, giving a flash of the old encoding.

Over the past couple of weeks I have made a new Unicode font (ITM_TMS_UNI) and I have now converted the whole website to the new format, and will only be publishing in this format from now on. Later I also plan to convert the .pdf documents to the same format, but as the font is embedded in these it is less urgent.

The conversion itself was a bit nerve-racking as I had over 2,200 html documents to convert and it was all done in one go. I did of course have back-ups, but there was always the possibility that something would go wrong, and I wouldn’t find out until too late. As it was I spent a whole day checking documents for errors.

I used a modification of the script that was developed for the javascript on-the-fly conversion previously used on the site, and I tested it carefully beforehand. As far as I can see up till now, the whole conversion went off very well, with only a few idiosyncrasies to sort out in the files themselves.
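A conversion of this kind can be sketched roughly as follows. This is not the script actually used on the site: the mapping table, the `latin-1` legacy encoding, the `.bak` back-up suffix and the function names are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical table: legacy font slot -> Unicode character.
# The real slots of the old font are not listed in this post.
LEGACY_TO_UNICODE = str.maketrans({
    "\u00e0": "\u0101",  # assumed slot for a with macron
    "\u00ec": "\u012b",  # assumed slot for i with macron
    "\u00f2": "\u1e6d",  # assumed slot for t with dot below
})


def convert_text(legacy_text: str) -> str:
    """Replace every legacy slot with its Unicode equivalent."""
    return legacy_text.translate(LEGACY_TO_UNICODE)


def convert_tree(root: Path, legacy_encoding: str = "latin-1") -> int:
    """Convert all .htm/.html files under root, keeping a .bak copy of each."""
    pages = [p for p in sorted(root.rglob("*"))
             if p.is_file() and p.suffix.lower() in (".htm", ".html")]
    for page in pages:
        original = page.read_bytes()
        # Keep the untouched bytes next to the page before overwriting it.
        page.with_name(page.name + ".bak").write_bytes(original)
        converted = convert_text(original.decode(legacy_encoding))
        page.write_bytes(converted.encode("utf-8"))
    return len(pages)
```

Only characters listed in the table are changed, so ASCII markup is left alone. The sketch does not update any charset declaration inside the pages, which would need separate handling.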

The advantage of Unicode is that even a casual visitor can read the documents in two sections of the website (Texts & Translations and English Only), and won’t need to install any special font.

The disadvantage is that although Unicode can encode nearly every script, I use many characters that have been given no font code-slot, which makes every font that uses that character set essentially a “private” font, requiring (as with the old encoding) that the font be installed to be able to read the documents successfully.

This is very problematic, especially for Indologists. For instance, there is no font-slot for ring under r, which signifies one of the main vowel characters in Sanskrit. This also applies, of course, to all its derivatives, like ring under r with macron, ring under r with accent, etc., as well as to ring under l and its derivatives.
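The gap can be checked against the Unicode character database, for instance with Python's standard `unicodedata` module. In this small sketch (the helper name `has_precomposed` is made up for illustration), a base letter plus a combining mark is normalised to NFC: if a single code-slot exists for the pair it collapses into one character, otherwise it stays as two. Ring under r can only be written as r followed by the combining ring below (U+0325), and how well that displays depends on the font.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
RING_BELOW = "\u0325"  # COMBINING RING BELOW


def has_precomposed(base: str, mark: str) -> bool:
    """True if base + mark composes into a single code point under NFC."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


for letter in ("r", "l"):
    print(letter, "with dot below has its own slot: ",
          has_precomposed(letter, DOT_BELOW))
    print(letter, "with ring below has its own slot:",
          has_precomposed(letter, RING_BELOW))
```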

Therefore anybody wishing to display Sanskrit in Roman script has two choices: use a private font, which will leave the characters unreadable on most machines unless the special font is installed, or change the Unicode encoding to something non-standard, but which is close to the encoding required.

In preparing the new site I have in fact used both solutions. As ring under r is the only unallocated character that is used in the two sections named above, I have replaced it with dot under r, which has a code-slot, and is found in most extended Unicode fonts, but is the wrong character for the vowel.
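The substitution of dot under r for ring under r might look something like the following. It is only a sketch under a stated assumption: that ring under r is present in the text as r plus the combining ring below (U+0325). In a private font it may well sit in some other slot, in which case the pattern would need changing. The function name `ring_to_dot` is invented for the example.

```python
import re
import unicodedata

RING_BELOW = "\u0325"  # COMBINING RING BELOW
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW


def ring_to_dot(text: str) -> str:
    """Swap ring under r/R for dot under r/R, leaving other letters alone."""
    # Decompose first, so that the ring directly follows its base letter
    # even when a macron or accent was typed before it.
    decomposed = unicodedata.normalize("NFD", text)
    swapped = re.sub("([rR])" + RING_BELOW, "\\1" + DOT_BELOW, decomposed)
    # Recompose, so that r + dot below becomes the single allocated character.
    return unicodedata.normalize("NFC", swapped)
```

Because the macron is kept, ring under r with macron comes out as dot under r with macron, which also has its own code-slot.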

But in the Original Texts and Prosody sections there are dozens more characters that have no code-slots, so there would be no advantage in this solution, as the font will have to be installed anyway. There I have maintained strict encoding, and the only solution is to download and install the font first.

I might mention that the .epub and .mobi files which come from the English Only section were already prepared in Unicode. The Reference and Maps section has been converted to quasi-Unicode with the dot under r character in the few places where it applies.

I am now eagerly awaiting the promised update to the text editor I use to prepare the site, NoteTab Pro, which is in the process of being updated to deal with Unicode and add extra syntax highlighting. I rely heavily on this editor because of its scripting ability, and it was one of the main reasons I held back on the conversion to Unicode for so long.

