Policies/Charset
From Mandriva Community Wiki
This page describes why the text encoding in spec files should be in UTF-8 or ASCII, and what kind of problems would arise otherwise.
Introduction
In short, if a spec file contains any non-ASCII characters, the whole file should be written in UTF-8. Such characters most often appear in changelogs, descriptions, summaries and names. Mandriva has used UTF-8 by default for some time, but that is not the only reason: a legacy encoding inside a spec file introduces a number of risks and, at worst, can leave the spec file broken.
Why?
rpm supports translated descriptions (a sparsely documented feature) through the -l option after %description. For Summary and Group, the ISO 639 language code is added in parentheses before the colon, like:
- Summary: summary
- Summary(fr): sommaire
- %description pkg
- This is text.
- %description -l fr pkg
- C'est texte.
This has proven to be a nightmare: each packager uses their own encoding of choice, so the result is a single file containing several encodings. No editor can handle such a spec file properly once more than one translation exists. For description or summary translations, a po file is therefore the way to go.
Most Mandriva spec files affected by this use ISO-8859-1, so that encoding serves as the example here.
In any case, such translations are forbidden by policy. Summary translations must be done in the mdv-rpm-summary package.
Problematic to read your file
Everybody has their own system, and people may not migrate because the old encoding "just works". However, many of the world's encodings are incompatible with each other. The only part they share is the ASCII characters (which is why ASCII is safe for spec files). ISO-8859-1 characters show up as junk in other locales.
Using UTF-8 does, of course, require both writer and reader to migrate; but a fair number of users are already doing so, and you will get more help from people all around the world. Otherwise people are less likely to help, because what you code or write is unreadable to them (unless not being helped is exactly your intention).
Editors may damage spec file
This is related to the previous point, but worse. On systems whose legacy charset is not ISO-8859-1, text editors may be unable to handle the "junk" text and may even try to "correct" it by modifying characters, leaving the file even more broken.
How to fix?
In spec file itself
It is the spec file that matters, so remember to use UTF-8 throughout the file. To test if your spec file is indeed in UTF-8, use iconv to filter it:
iconv -f UTF-8 -t UTF-8 -o /dev/null yourpackage.spec
If it doesn't complain, the file is in UTF-8 (or ASCII). Otherwise it reports the position at which the file stops being valid UTF-8; you can also remove the -o /dev/null argument to look at the output yourself.
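If the check fails, the file has to be converted. The following is only a minimal sketch: the function name to_utf8 is hypothetical, and the default source encoding of ISO-8859-1 is an assumption that matches the common Mandriva case mentioned earlier. If the file was written in another legacy charset, pass that charset as the second argument.

```shell
# to_utf8 FILE [FROM]: convert FILE to UTF-8 in place, keeping the original
# as FILE.orig. FROM defaults to ISO-8859-1, which is only an assumption;
# pass the real legacy charset if it differs.
to_utf8() {
    file=$1
    from=${2:-ISO-8859-1}
    # Already valid UTF-8 (or plain ASCII)? Then leave the file alone, since
    # converting it a second time would corrupt the accented characters.
    if iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
        return 0
    fi
    # Keep a backup, then write the converted text over the original name.
    cp -p "$file" "$file.orig" &&
    iconv -f "$from" -t UTF-8 "$file.orig" > "$file"
}
```

Calling to_utf8 yourpackage.spec converts the file and leaves yourpackage.spec.orig behind for comparison. Note that a spec file that already mixes several encodings cannot be repaired this way and has to be fixed by hand.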
In various config files
Remember to use UTF-8 in ~/.rpmmacros, from where your name is read (in %packager line):
- %distribution Mandriva Linux
- %vendor Mandriva
- %packager José Huang <blahblah@mandriva.org>
If you also use rebuild-rpm and similar building tools, please remember to change your name in the corresponding config files too.
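Several config files can be checked in one go with the same iconv test. This is a sketch under assumptions: list_non_utf8 is a hypothetical helper name, and which files your building tools actually read is something you have to look up yourself.

```shell
# list_non_utf8 FILE...: print the name of every existing FILE that is not
# valid UTF-8. Plain ASCII files pass, and missing files are skipped.
list_non_utf8() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

For example, list_non_utf8 ~/.rpmmacros prints nothing if the file is fine, and prints the file name if your name is still stored in a legacy encoding.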
UTF-8 enabled editors
Many text editors support UTF-8 natively, both graphical and text-mode ones.
Language environment
Use UTF-8 in your shell environment too whenever possible; that eliminates a lot of headaches. For example, this is my locale setting, as printed by the locale command:
- LANG=zh_HK.UTF-8
- LC_CTYPE=zh_HK.UTF-8
- LC_NUMERIC=zh_HK.UTF-8
- LC_TIME=zh_HK.UTF-8
- LC_COLLATE=zh_HK.UTF-8
- LC_MONETARY=zh_HK.UTF-8
- LC_MESSAGES=zh_TW
- LC_PAPER=zh_HK.UTF-8
- LC_NAME=zh_HK.UTF-8
- LC_ADDRESS=zh_HK.UTF-8
- LC_TELEPHONE=zh_HK.UTF-8
- LC_MEASUREMENT=zh_HK.UTF-8
- LC_IDENTIFICATION=zh_HK.UTF-8
- LC_ALL=
You can change the language settings in the ~/.i18n file.
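To spot categories that do not name a UTF-8 codeset, the output of locale can be filtered. This is a rough sketch: flag_non_utf8 is a hypothetical helper, and it only looks at the suffix of each locale name rather than consulting the system's locale database, so a name without an explicit codeset (such as zh_TW) is always flagged.

```shell
# flag_non_utf8: read NAME=VALUE lines (as printed by `locale`) on stdin and
# print the NAME of every category whose value is set but does not name a
# UTF-8 codeset.
flag_non_utf8() {
    while IFS='=' read -r name value; do
        # locale puts double quotes around values implied by LANG or LC_ALL
        value=${value#\"}
        value=${value%\"}
        case "$value" in
            '') ;;                                   # unset, e.g. LC_ALL=
            *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) ;;    # explicit UTF-8 codeset
            *) printf '%s\n' "$name" ;;
        esac
    done
}
```

Run it as locale | flag_non_utf8. For settings like the ones listed above it prints LC_MESSAGES, because zh_TW carries no .UTF-8 suffix.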

