Re: Unicode normalisation


Sunday Bolaji
Hi,
I tried to download OOo 3.1, but after searching from page to page I could only find OOo 3.0. After installing OOo 3.0, I tried to use the temporary Unicode normalisation in Hunspell, but it did not work. My input conversion table is shown below.


ICONV 7

ICONV ọ  ọ

ICONV ọ̀  ọ̀

ICONV ọ́  ọ́

ICONV ṣ  ṣ

ICONV ẹ̀  ẹ̀

ICONV ẹ́  ẹ́

ICONV ẹ  ẹ

The characters in the second column were written in this sequence: the letter first, followed by the tone mark ( ) and then the underdot (.) last, while the characters in the third column were written in this sequence: the letter first, followed by the underdot (.) and then the tone mark ( ). OpenOffice Writer did not recognise the characters as the same.
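
To make the ordering issue concrete, here is a minimal Python sketch using the standard unicodedata module. It is only an illustration, not part of Hunspell or OOo, and it assumes the letter o with a grave tone mark (U+0300) and an underdot (U+0323) as the example. The two typing orders differ as code point sequences, but they are canonically equivalent, so normalization maps both to the same string.

```python
import unicodedata

# Same letter typed in the two orders described above (assumed example: o).
tone_first = "o\u0300\u0323"  # letter, tone mark (grave), underdot
dot_first = "o\u0323\u0300"   # letter, underdot, tone mark (grave)

# Compared code point by code point, the two strings differ.
same_raw = tone_first == dot_first

# Canonical normalization reorders the marks by combining class
# (underdot = 220, grave = 230), so both orders end up identical.
nfc_tone_first = unicodedata.normalize("NFC", tone_first)
nfc_dot_first = unicodedata.normalize("NFC", dot_first)
same_nfc = nfc_tone_first == nfc_dot_first

print(same_raw, same_nfc)
print([f"U+{ord(c):04X}" for c in nfc_tone_first])
```

An ICONV table therefore has to list every typing order that users may produce on the left-hand side, with one agreed form on the right-hand side.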
Regards,
Bolaji
 

--- On Mon, 1/5/09, Németh László <[hidden email]> wrote:
From: Németh László <[hidden email]>
Subject: Re: [lingu-dev] Unicode normalisation
To: [hidden email]
Date: Monday, January 5, 2009, 7:10 AM

Hi,

Really, this is not only a spell checking problem. OpenOffice.org has
problems with both the visual and the functional equivalence of Unicode
characters.  For example, here is the result of the Find All ä
operation on ÄÄää, i.e. on the "A U+0308 (COMBINING DIAERESIS) Ä a
U+0308 ä" character sequence:
http://www.flickr.com/photos/85171764@N00/3170574450/
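
The mismatch can be reproduced outside OpenOffice.org. This is a minimal Python sketch, assuming the standard unicodedata module and using a plain substring count as a stand-in for Find All (it is not how OOo implements search):

```python
import unicodedata

# "A U+0308  Ä  a U+0308  ä": decomposed and precomposed forms mixed.
text = "A\u0308\u00c4a\u0308\u00e4"
needle = "\u00e4"  # precomposed ä

# A code point comparison misses the decomposed "a U+0308".
raw_hits = text.count(needle)

# Normalizing both sides to NFC makes the equivalent forms identical.
nfc_text = unicodedata.normalize("NFC", text)
nfc_hits = nfc_text.count(unicodedata.normalize("NFC", needle))

print(raw_hits, nfc_hits)
```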

It would be good to solve this problem in future OpenOffice.org
versions by automatic Unicode normalization, and also by OpenType support.
Hunspell 1.2.x (which I hope will be in OOo 3.1) has a temporary
solution for Unicode normalization (canonical and compatibility), namely
the optional input/output conversion:

ICONV 4
ICONV Ä Ä
ICONV ä ä
ICONV 가 ᄀ ᅡ
ICONV fi fi

The first three conversions are canonical normalizations: two compositions
and a Hangul decomposition. The conversion of the fi ligature is a
compatibility normalization (but spell checking of words with
f-ligatures also needs fixed word breaking in OOo).
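
The difference between the two kinds of normalization can be checked with Python's standard unicodedata module. This is a small sketch for illustration only. The code points are my reading of the table above: U+00C4 for Ä, U+AC00 for the Hangul syllable, and U+FB01 for the fi ligature.

```python
import unicodedata

# Canonical composition: A + COMBINING DIAERESIS becomes precomposed Ä.
composed = unicodedata.normalize("NFC", "A\u0308")

# Canonical decomposition: a Hangul syllable splits into its jamo.
jamo = unicodedata.normalize("NFD", "\uac00")

# The fi ligature is only compatibility-equivalent to "fi":
# canonical forms leave it alone, compatibility forms replace it.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")

for label, value in [("NFC A+U+0308", composed), ("NFD U+AC00", jamo),
                     ("NFC U+FB01", ligature_nfc), ("NFKC U+FB01", ligature_nfkc)]:
    print(label, [f"U+{ord(c):04X}" for c in value])
```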

Conversion of the spell checking suggestions to
the original composed form:

OCONV 2
OCONV ᄀ ᅡ 가
OCONV fi fi
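
The round trip through the two tables can be pictured as plain string replacement. The Python sketch below is only a model under assumptions: `convert` is a hypothetical helper, the longest-match rule is my own choice, and Hunspell's actual matching may differ. The table entries assume U+AC00 for the Hangul syllable and U+FB01 for the fi ligature.

```python
def convert(text, table):
    """Replace the longest matching table key at each position (illustrative only)."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

# Assumed contents, mirroring the Hangul and ligature rows above.
ICONV = {"\uac00": "\u1100\u1161", "\ufb01": "fi"}
OCONV = {"\u1100\u1161": "\uac00", "fi": "\ufb01"}

checked_form = convert("\ufb01nden", ICONV)    # what the dictionary sees
suggested_form = convert("finden", OCONV)      # what the user gets back
print(checked_form, suggested_form)
```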

(Special spell checking requirements need special solutions. For
example, German typography uses f-ligatures only within words, but not
across compound word boundaries, so the previous OCONV fi fi conversion is
not right for German. A redundant dictionary with non-suggested
decomposed forms, plus dictionary words with ligatures, helps to check
the correct typography of a German text:

--- affix file ---
NOSUGGEST *
REP 2
REP fi fi
REP fi fi

--- dictionary file ----
finden/*
finden
)

Hyphenation of both composed and decomposed characters is possible
in OpenOffice.org by using redundant hyphenation patterns.
Compatibility-equivalent ligatures can be handled by non-standard
hyphenation (alternations):

fi1/f=i,1,1

For thesauri, a temporary solution is to use redundant items or
references:

finden->finden

The upcoming Hunspell-based stemming in the OOo thesaurus can also
handle the normalization problem temporarily.
ICONV input conversion or explicit stems (
--- dic file ---
finden st:finden
) can give the normalized stems to the thesaurus component.

Maybe a new Hunspell tool could help spelling dictionary
developers by automatically generating the ICONV normalization
table.
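
As a rough idea of what such a tool might do, here is a Python sketch. It is not an existing Hunspell utility; `iconv_table` is a hypothetical name. It assumes each target is one base letter followed by combining marks, and it only covers fully decomposed orderings, not partially composed variants. It emits one ICONV line for every canonically equivalent mark ordering, mapped to the NFC form.

```python
import itertools
import unicodedata

def iconv_table(targets):
    """Build ICONV lines mapping equivalent mark orderings to NFC (sketch)."""
    rules = {}
    for target in targets:
        nfc = unicodedata.normalize("NFC", target)
        nfd = unicodedata.normalize("NFD", target)
        base, marks = nfd[0], nfd[1:]
        # Try every ordering of the marks that follow the base letter.
        for perm in set(itertools.permutations(marks)):
            variant = base + "".join(perm)
            # Keep only orderings that are canonically equivalent to the
            # target and that actually need converting.
            if variant != nfc and unicodedata.normalize("NFC", variant) == nfc:
                rules[variant] = nfc
    lines = [f"ICONV {len(rules)}"]
    lines += [f"ICONV {src} {dst}" for src, dst in sorted(rules.items())]
    return lines

# Assumed sample alphabet: o and e with underdot plus grave or acute.
sample = ["o\u0323\u0300", "o\u0323\u0301", "e\u0323\u0300", "e\u0323\u0301"]
table = iconv_table(sample)
print("\n".join(table))
```

Each of the four sample letters yields two rules (tone mark first and underdot first), so the generated header line counts eight conversions.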

Regards,
László


2009/1/5 Stephan Bergmann <[hidden email]>:
> On 01/02/09 09:51, F Wolff wrote:
>>
>> Hallo all
>>
>> We recently had a discussion on a list for African localisation about
>> the utility of having Unicode normalisation automatically done in
>> Hunspell, so that creators of spell checkers wouldn't need to worry
>> about that.
>>
>> Is this a feature that would be useful to more people? Is there
>> something generic in OOo that handles normalisation issues for other
>> purposes? (searching, thesaurus, indexes, etc.)  I can think of many
>> places where it could be relevant.
>>
>> I'm curious to hear what other people think.
>
> I brought this up years ago as point 4 of
> <http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=7099>, but
> nothing became of it back then...
>
> -Stephan