Re: Unicode normalisation

Sunday Bolaji
      I tried to download OOo 3.1, but after searching from page to page I could only find OOo 3.0 to download. After installing OOo 3.0, I tried to use the temporary Unicode normalisation in Hunspell, but it did not work. My input conversion table is shown below.


ICONV ọ  ọ

ICONV ọ̀  ọ̀

ICONV ọ́  ọ́

ICONV ṣ  ṣ

ICONV ẹ̀  ẹ̀

ICONV ẹ́  ẹ́

ICONV ẹ  ẹ

The characters in the second column were written in this sequence: letter first, followed by the tone mark ( ), and then the underdot (.) last, while the characters in the third column were written in this sequence: letter first, followed by the underdot (.), and then the tone mark ( ). OpenOffice Writer did not recognise the characters as the same.
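
The two orderings described above are canonically equivalent in Unicode, so a normaliser should map both to the same string. A minimal Python sketch, assuming the standard `unicodedata` module and using o with a grave tone mark and an underdot as the example (the variable names and the `nfc` helper are made up for illustration):

```python
import unicodedata

# Two spellings of "o with underdot and grave tone mark".
tone_first = "o\u0300\u0323"  # letter, tone mark (U+0300), underdot (U+0323)
dot_first = "o\u0323\u0300"   # letter, underdot (U+0323), tone mark (U+0300)


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of text."""
    return unicodedata.normalize("NFC", text)


# The raw code point sequences differ...
print(tone_first == dot_first)            # False
# ...but both normalise to the same sequence.
print(nfc(tone_first) == nfc(dot_first))  # True
print([f"U+{ord(c):04X}" for c in nfc(tone_first)])  # ['U+1ECD', 'U+0300']
```

An ICONV table that maps every variant to one such form gives the dictionary a single spelling to match against.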

--- On Mon, 1/5/09, Németh László <[hidden email]> wrote:
From: Németh László <[hidden email]>
Subject: Re: [lingu-dev] Unicode normalisation
To: [hidden email]
Date: Monday, January 5, 2009, 7:10 AM


Really, this is not only a spell checking problem. OOo has
problems with both visual and functional equivalence of
characters.  For example, here is the result of the Find all ä
operation on ÄÄää, i.e. on the "A U+0308 (COMBINING DIAERESIS) Ä a
U+0308 ä" character sequence:
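
The same mismatch can be reproduced outside OOo. A minimal Python sketch, assuming only the standard `unicodedata` module (the helper name `count_matches` is made up for illustration): a plain search for ä finds only the precomposed character, while normalising both sides to NFC first finds both lowercase forms, and case folding finds all four.

```python
import unicodedata


def count_matches(text: str, needle: str, fold_case: bool = False) -> int:
    """Count needle in text after NFC-normalising both sides."""
    text = unicodedata.normalize("NFC", text)
    needle = unicodedata.normalize("NFC", needle)
    if fold_case:
        text, needle = text.casefold(), needle.casefold()
    return text.count(needle)


# A + U+0308, precomposed Ä, a + U+0308, precomposed ä
sample = "A\u0308\u00c4a\u0308\u00e4"

naive = sample.count("\u00e4")                # code point search only
normalised = count_matches(sample, "\u00e4")  # NFC on both sides
folded = count_matches(sample, "\u00e4", fold_case=True)

print(naive, normalised, folded)  # 1 2 4
```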

It would be good to solve this problem in future
versions by automatic Unicode normalization, and also by OpenType support.
Hunspell 1.2.x (I hope it will be in OOo 3.1) has a temporary
solution for Unicode normalization (canonical and compatibility), the
optional input/output conversion:

ICONV 가 ᄀ ᅡ
ICONV ﬁ fi

The first three conversions are canonical normalizations: two compositions
and a Hangul decomposition. The conversion of the fi ligature is a
compatibility normalization (but spell checking of words with
f-ligatures also needs fixed word breaking in OOo).

Conversion of the spell checking suggestions to the original composed form:

OCONV ᄀ ᅡ 가
OCONV fi ﬁ
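
The round trip can be pictured as two plain replacement tables, one applied to the input word before lookup and one applied to each suggestion. The following Python sketch only illustrates that idea: the `convert` helper and its longest-match, left-to-right scanning are assumptions made for this example, not Hunspell's actual implementation.

```python
def convert(table: dict, word: str) -> str:
    """Replace table keys in word, scanning left to right, longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


# Hypothetical one-entry tables mirroring the ligature example:
# U+FB01 (LATIN SMALL LIGATURE FI) <-> "f" + "i".
ICONV = {"\ufb01": "fi"}
OCONV = {"fi": "\ufb01"}

typed = "de\ufb01ne"               # the user typed the ligature
lookup = convert(ICONV, typed)      # form compared against the dictionary
shown = convert(OCONV, lookup)      # form handed back as a suggestion

print(lookup == "define")  # True
print(shown == typed)      # True
```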

Special spell checking requirements need special solutions. For
example, German typography uses f-ligatures only within words, but not
at compound word boundaries, so the previous OCONV ligature conversion is
not right for German. A redundant dictionary with non-suggested
decomposed forms, and dictionary words with ligatures, helps to check
the correct typography of a German text:

--- affix file ---
REP fi fi
REP fi fi

--- dictionary file ----

Hyphenation of both composed and decomposed characters is possible
in OOo by redundant hyphenation patterns in
Compatibility equivalent ligatures can be handled by non-standard
hyphenation (alternations):


For thesauri, a temporary solution is to use redundant items or


The upcoming Hunspell-based stemming in the OOo thesaurus can also handle
the normalization problem temporarily.
ICONV input conversion or explicit stems (
--- dic file ---
finden st:finden
) can give the normalized stems to the thesaurus component.

Maybe a new Hunspell tool could help the spelling dictionary
developers by the automatic generation of the ICONV normalization
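
As a rough idea of what such a generator might look like, here is a Python sketch. It is not an existing Hunspell tool: the function name `iconv_table` is hypothetical, it assumes each target has a single base letter followed by combining marks, and it assumes the usual affix-table layout in which a count line precedes the entries. Partially composed variants (for example a precomposed letter plus one further mark) are not generated.

```python
import itertools
import unicodedata


def iconv_table(targets):
    """Build ICONV lines mapping mark-order variants of each target to NFC."""
    rules = {}
    for target in targets:
        composed = unicodedata.normalize("NFC", target)
        decomposed = unicodedata.normalize("NFD", target)
        base, marks = decomposed[0], decomposed[1:]
        for order in itertools.permutations(marks):
            variant = base + "".join(order)
            # Keep only reorderings that are canonically equivalent to the
            # target and that differ from the composed form itself.
            if variant == composed:
                continue
            if unicodedata.normalize("NFC", variant) != composed:
                continue
            rules[variant] = composed
    lines = [f"ICONV {len(rules)}"]
    lines += [f"ICONV {src} {dst}" for src, dst in sorted(rules.items())]
    return lines


# Example: o with underdot (U+0323) and grave tone mark (U+0300).
table = iconv_table(["o\u0323\u0300"])
for line in table:
    print(line)
```

Both mark orders end up mapped to the same composed form, which is what the hand-written table earlier in the thread attempts.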


2009/1/5 Stephan Bergmann <[hidden email]>:
> On 01/02/09 09:51, F Wolff wrote:
>> Hallo all
>> We recently had a discussion on a list for African localisation about
>> the utility of having Unicode normalisation automatically done in
>> Hunspell, so that creators of spell checkers wouldn't need to worry
>> about that.
>> Is this a feature that would be useful to more people? Is there
>> something generic in OOo that handles normalisation issues for other
>> purposes? (searching, thesaurus, indexes, etc.)  I can think of many
>> places where it could be relevant.
>> I'm curious to hear what other people think.
> I brought this up years ago as point 4 of
<>, but
> nothing became of it back then...
> -Stephan