Development/Howto/UTF8 Migration

From Mandriva

UTF8 Migration HowTo

This page describes what changes when migrating from earlier encodings to UTF-8, and the points that administrators and users should check once the migration is complete.


Unicode, UTF-8 and migration

From the 2007 distribution onwards, the default encoding is UTF-8. It had already been available for some time, but it is now the encoding of choice because its advantages far outweigh its drawbacks.

If you do a fresh installation, there is nothing special to do: everything is in UTF-8 from the start. If you upgrade, however, you may have files or data in older encodings that you wish to convert.

Many data formats already record the encoding in which they were written, so they need no conversion: the data is recoded automatically, on the fly, whenever needed.

The problem lies with data files that carry no information about their encoding. They are assumed to be in the user's default encoding, and that default changes with this update if you move from an old encoding to UTF-8.

The different places where a conversion may be needed are:

  • the filesystem encoding (file names)
  • the names of the users (as used by the system)
  • the user's language configuration (if some users have a different config)
  • the text data files

1. The encoding of the file systems

Each partition or removable disk is mounted on the assumption that the data on it is in a given encoding. From now on, these file systems have to be mounted as UTF-8, and their file names may need to be converted.

In the /etc/fstab file (or through diskdrake, the graphical configuration tool in the Mandriva Control Center), change the option 'iocharset=xxxxx' to 'iocharset=utf8'.
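
As a rough sketch of that edit, the function below rewrites every iocharset option in an fstab-style file and prints the result, so that you can review it before touching the real file. The function name rewrite_iocharset and the /tmp path in the usage comment are only illustrative.

```shell
#!/bin/bash
# Sketch: replace any iocharset=... mount option with iocharset=utf8.
# The result goes to standard output; the input file is left untouched.
rewrite_iocharset() {
  # $1 = fstab-style input file
  sed 's/iocharset=[^,[:space:]]*/iocharset=utf8/g' "$1"
}

# Hypothetical usage (review the diff before replacing /etc/fstab):
#   rewrite_iocharset /etc/fstab > /tmp/fstab.new
#   diff -u /etc/fstab /tmp/fstab.new
```

Other mount options on the same line are left as they are.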

You can then convert the file names themselves with the convmv package (available on the Contrib media). Read its man page for the parameters.
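
A minimal sketch of a convmv run, assuming the old file-name encoding and the directory are passed as arguments (the function names preview_names and apply_names are made up for this example). Without --notest, convmv only prints what it would rename, so it is worth running the preview first.

```shell
#!/bin/bash
# Sketch: convert file names under a directory from an old encoding to UTF-8.
# $1 = old encoding (for example iso-8859-15), $2 = directory to process

preview_names() {
  # Dry run: convmv only reports the renames it would perform.
  convmv -f "$1" -t utf-8 -r "$2"
}

apply_names() {
  # Real run: --notest makes convmv actually rename the files.
  convmv -f "$1" -t utf-8 -r --notest "$2"
}

# Hypothetical usage:
#   preview_names iso-8859-15 /home
#   apply_names   iso-8859-15 /home
```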


2. The users' names

In the /etc/passwd file, the 5th field holds text data such as the user's full name, which may contain non-ASCII characters.

You can use userdrake, the graphical user management tool in MCC, to check and correct that data if need be. If you have a lot of users and all the data was in the same old encoding, you can do this instead (for example, if the old encoding was EUC-JP):

# keep a backup copy, then recode it back into /etc/passwd
cat /etc/passwd > /etc/passwd-
iconv -f EUC-JP -t UTF-8 < /etc/passwd- > /etc/passwd

(Should something go wrong, restore passwd from the backup copy in /etc/passwd-.)
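
Before converting anything, it can help to see which accounts are affected at all. The sketch below lists the login names whose 5th field contains bytes outside printable ASCII; the function name list_non_ascii_gecos is hypothetical, and the check is done in the C locale so that it works byte by byte.

```shell
#!/bin/bash
# Sketch: print the login name of every account whose 5th field
# (full name etc.) contains a byte outside the printable ASCII range.
list_non_ascii_gecos() {
  # $1 = passwd-format file, defaults to /etc/passwd
  LC_ALL=C awk -F: '$5 ~ /[^ -~]/ { print $1 }' "${1:-/etc/passwd}"
}

# Hypothetical usage:
#   list_non_ascii_gecos            # check /etc/passwd
#   list_non_ascii_gecos /etc/passwd-
```

If nothing is printed, the file is plain ASCII and needs no conversion.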


3. The language configuration

Each user can have a language configuration that differs from the system default, but on a UTF-8 system it is much simpler if everybody uses UTF-8. We can therefore change every existing /home/*/.i18n file to put ".UTF-8" as the encoding in the LANG and LC_* variables:

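#!/bin/bash
# Set the encoding of LANG and LC_* to UTF-8 in each user's ~/.i18n file.
for i in /home/*/.i18n
do
  [ -f "${i}" ] || continue    # skip when no .i18n file matches
  cp -p "${i}" "${i}~"         # keep a backup copy next to the original
  sed 's/^\(LC_[^=]*\|LANG\)\(=[^_\.]*_[^\.][^\.]\)\(\.[^@]*\)*\(@.*\)*$/\1\2.UTF-8\4/' "${i}~" > "${i}"
done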
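
To see what this substitution does before running it on real files, you can feed it a few sample lines. The sketch below wraps the same sed expression in a function (the name utf8ify_i18n is made up here) that reads from standard input; like the expression itself, it relies on GNU sed for the \| alternation.

```shell
#!/bin/bash
# Sketch: apply the .i18n substitution to standard input, for testing.
utf8ify_i18n() {
  sed 's/^\(LC_[^=]*\|LANG\)\(=[^_\.]*_[^\.][^\.]\)\(\.[^@]*\)*\(@.*\)*$/\1\2.UTF-8\4/'
}

# Hypothetical usage with sample lines:
#   printf 'LANG=fr_FR\nLC_CTYPE=ru_RU.KOI8-R\n' | utf8ify_i18n
```

For instance, LC_CTYPE=ru_RU.KOI8-R comes out as LC_CTYPE=ru_RU.UTF-8, and a modifier such as @valencia is kept after the new encoding. Lines that do not start with LANG= or an LC_ variable are passed through unchanged.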

4. Text data

Whether to convert text files depends on each user's needs. Most advanced text editors let you specify which encoding a file is in and can then display it correctly.

If you want to convert, you can use iconv. For example, if the old encoding is KOI8-R:

iconv -f KOI8-R -t UTF-8 < oldfile > newfile
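
Running iconv twice on the same file would garble it, so a cautious variant first checks whether the file is already valid UTF-8. This is only a sketch: the function name to_utf8 is hypothetical, and the check uses the fact that iconv exits with an error when its input is not valid in the source encoding.

```shell
#!/bin/bash
# Sketch: convert a text file to UTF-8 unless it already is valid UTF-8.
# $1 = old encoding, $2 = input file, $3 = output file
to_utf8() {
  if iconv -f UTF-8 -t UTF-8 "$2" >/dev/null 2>&1; then
    # Already valid UTF-8 (plain ASCII files also end up here): just copy.
    cp "$2" "$3"
  else
    iconv -f "$1" -t UTF-8 "$2" > "$3"
  fi
}

# Hypothetical usage:
#   to_utf8 KOI8-R oldfile newfile
```

The check cannot tell which old encoding a file is in; you still have to supply that yourself.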


5. vim

The vim text editor can recognize non-UTF-8 text files and convert them back and forth on the fly. By default it assumes latin1. If you use another encoding you have to set it, e.g. by putting a line like this in your ~/.vimrc:

set fencs=ucs-bom,utf-8,default,xxxxx

with 'xxxxx' being your old encoding.


6. How to know the old encoding

If you remove the ".UTF-8" part of the locale name, you get the locale name as it was used up to the 2006 distribution, and from it you can find out which encoding was used.

To get the locale name, type locale in a terminal and look at the LC_CTYPE line; it will be something like: LC_CTYPE=xx_YY.UTF-8.

To learn the old encoding, type:

LC_CTYPE=xx_YY locale charmap

(that is, without the ".UTF-8" part) and it will tell you the old encoding. In some cases it was already UTF-8, and then there is nothing to change.
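
The two steps can be combined in a small sketch. The function old_locale_name is a made-up helper that strips the encoding part from a locale name while keeping any @modifier; the usage comment assumes your current locale name is fr_FR.UTF-8, which is only an example.

```shell
#!/bin/bash
# Sketch: derive the pre-UTF-8 locale name from the current one.
old_locale_name() {
  # $1 = locale name such as xx_YY.UTF-8 or xx_YY.UTF-8@modifier
  printf '%s\n' "$1" | sed 's/\.[^@]*//'
}

# Hypothetical usage (LC_ALL is emptied because, when set, it overrides LC_CTYPE):
#   old=$(old_locale_name "fr_FR.UTF-8")
#   LC_ALL= LC_CTYPE="$old" locale charmap
```

The answer is only meaningful if the old locale is still installed on the system.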


A small script

Here is a small script that automates these steps. It converts a whole system with several users and must be run as root: Media:Media-utf8-migration.sh

See also convert-filenames-to-utf8.pl and Releases/Mandriva/2007/Errata#UTF8 issue when reinstalling and keeping a previous /home that was not in UTF8.
