Unicode normalisation


Unicode normalisation

F Wolff

Hallo all

We recently had a discussion on a list for African localisation about
the utility of having Unicode normalisation automatically done in
Hunspell, so that creators of spell checkers wouldn't need to worry
about that.

Is this a feature that would be useful to more people? Is there
something generic in OOo that handles normalisation issues for other
purposes? (searching, thesaurus, indexes, etc.)  I can think of many
places where it could be relevant.

I'm curious to hear what other people think.

Keep well
Friedel


--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/re-bringing-all-translation-management-tools-together


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Unicode normalisation

stephan.bergmann
On 01/02/09 09:51, F Wolff wrote:

> Hallo all
>
> We recently had a discussion on a list for African localisation about
> the utility of having Unicode normalisation automatically done in
> Hunspell, so that creators of spell checkers wouldn't need to worry
> about that.
>
> Is this a feature that would be useful to more people? Is there
> something generic in OOo that handles normalisation issues for other
> purposes? (searching, thesaurus, indexes, etc.)  I can think of many
> places where it could be relevant.
>
> I'm curious to hear what other people think.

I brought this up years ago as point 4 of
<http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=7099>, but
nothing became of it back then...

-Stephan



Re: Unicode normalisation

Németh László-2
Hi,

Indeed, this is not only a spell checking problem. OpenOffice.org has
problems with both visual and functional equivalence of Unicode
characters. For example, here is the result of a "Find All ä"
operation on ÄÄää, i.e. on the character sequence "A U+0308 (COMBINING
DIAERESIS), Ä, a U+0308, ä":
http://www.flickr.com/photos/85171764@N00/3170574450/

It would be good to solve this problem in future OpenOffice.org
versions by automatic Unicode normalization, and also by OpenType
support. Hunspell 1.2.x (which I hope will be in OOo 3.1) has a
temporary solution for Unicode normalization (canonical and
compatibility): an optional input/output conversion:

ICONV 4
ICONV Ä Ä
ICONV ä ä
ICONV 가 ᄀ ᅡ
ICONV fi fi

The first three conversions are canonical normalizations: two
compositions and a Hangul decomposition. The conversion of the fi
ligature is a compatibility normalization (though spell checking of
words with f-ligatures also needs fixed word breaking in OOo).

Conversion of the spell checking suggestions back to the original
composed form:

OCONV 2
OCONV ᄀ ᅡ 가
OCONV fi fi
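The equivalences behind the ICONV/OCONV rules above can be verified with
Python's standard unicodedata module. A minimal sketch (not part of the
original thread) covering the same three cases — diaeresis composition,
Hangul decomposition, and the fi ligature:

```python
import unicodedata

# "Ä" as one precomposed code point vs. "A" + U+0308 COMBINING DIAERESIS.
composed = "\u00C4"      # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # A followed by combining diaeresis

# The two sequences are canonically equivalent: NFC maps both to one form.
assert unicodedata.normalize("NFC", decomposed) == composed

# Hangul: the precomposed syllable 가 (U+AC00) decomposes under NFD into
# the jamo ᄀ (U+1100) + ᅡ (U+1161), and recomposes under NFC.
assert unicodedata.normalize("NFD", "\uAC00") == "\u1100\u1161"
assert unicodedata.normalize("NFC", "\u1100\u1161") == "\uAC00"

# The fi ligature (U+FB01) is only *compatibility*-equivalent to "fi":
# canonical NFC leaves it alone, while NFKC folds it to plain letters.
assert unicodedata.normalize("NFC", "\uFB01") == "\uFB01"
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"
```

This is why the thread distinguishes the first three (canonical) rules
from the ligature (compatibility) rule.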

(Special spell checking requirements need special solutions. For
example, German typography uses f-ligatures only within words, but not
at compound word boundaries, so the previous OCONV fi fi conversion is
not right for German. A redundant dictionary with non-suggested
decomposed forms, plus dictionary words with ligatures, helps to check
the correct typography of a German text:

--- affix file ---
NOSUGGEST *
REP 2
REP fi fi
REP fi fi

--- dictionary file ----
finden/*
finden
)

Hyphenation of both composed and decomposed characters is possible in
OOo via redundant hyphenation patterns. Compatibility-equivalent
ligatures can be handled by non-standard hyphenation (alternations):

fi1/f=i,1,1

For thesauri, a temporary solution is to use redundant items or
references:

finden->finden

The upcoming Hunspell-based stemming in the OOo thesaurus can also
handle the normalization problem temporarily.
ICONV input conversion or explicit stems (
--- dic file ---
finden st:finden
) can supply normalized stems to the thesaurus component.

Maybe a new Hunspell tool could help spelling dictionary developers by
automatically generating the ICONV normalization table.

Regards,
László


>

Re: Unicode normalisation

F Wolff
In reply to this post by stephan.bergmann
On Mon, 2009-01-05 at 11:08 +0100, Stephan Bergmann wrote:

> I brought this up years ago as point 4 of
> <http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=7099>, but
> nothing became of it back then...
>
> -Stephan

Thank you for your reply, Stephan.

In your mail you asked whether it is severe enough. I would think it is
a relevant problem. Unfortunately, it is probably mostly a problem for
languages that are not usually well represented in developer
communities. Many African languages have not yet standardised their
keyboard layouts, and for some there are several competing designs. This
means that documents could be created with different "encodings" of the
same text, which will make searching not work correctly (unless proper
normalisation is done), as Németh indicated.

While somebody might be able to see that certain text is present
(instead of searching for it), it is unrealistic for spell checker
authors to take all possible ways of writing letters into account in
all possible combinations for each word. In the case of Yoruba, vowels
can have zero, one or two diacritics. These can be represented with
one, two or three code points. As far as I know there are several
keyboard layouts for Yoruba, so this is not a theoretical issue we are
describing.
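To illustrate the multiple-encodings problem (my addition, with
hypothetical keyboard output; the thread mentions Yoruba but not these
exact sequences), Python's unicodedata shows three distinct code point
sequences for the same accented Yoruba vowel collapsing under NFC:

```python
import unicodedata

# Three ways a keyboard might produce ọ́ (o with dot below + acute accent):
seq_a = "\u1ECD\u0301"   # ọ + combining acute       (2 code points)
seq_b = "\u00F3\u0323"   # ó + combining dot below   (2 code points)
seq_c = "o\u0323\u0301"  # o + dot below + acute     (3 code points)

# Visually identical, but a naive string comparison treats them as different:
assert seq_a != seq_b and seq_b != seq_c and seq_a != seq_c

# NFC normalization maps all three to the same sequence, so searching
# and spell checking can compare them reliably:
forms = {unicodedata.normalize("NFC", s) for s in (seq_a, seq_b, seq_c)}
assert len(forms) == 1
```

No single precomposed code point exists for this letter, so even the
normalized form keeps one combining mark; normalization guarantees a
*unique* representation, not necessarily a one-code-point one.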

Németh, the ICONV solution sounds interesting, and I guess it would
work. I don't know if that would then also work in Firefox. (Do they
update their copy of Hunspell from time to time?) Automatic conversion
means that people would benefit from the normalisation even if the
spell checker authors didn't think about the problem, which is probably
ideal. I can't imagine there being a very large overhead for this,
although it probably won't come for free either.

Friedel



--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/re-bringing-all-translation-management-tools-together




Re: Unicode normalisation

stephan.bergmann
On 01/05/09 19:50, F Wolff wrote:
> In your mail you ask if it is severe enough. I would think that it is a
> relevant problem. Unfortunately, it is probably mostly a problem for
> languages that are not usually well represented in the developer
> communities. Many African languages have not yet standardised their
> keyboard layouts, and for some there are several competing designs. What
> this means is that documents could be created with different "encodings"
> of the same text, which will make searching not work correctly (unless
> proper normalisation is done), as Németh indicated.

If you want to further discuss this general problem outside the rather
narrow scope of spell checking (and I too think it is indeed a problem,
but unfortunately do not have time to help address it), I would suggest
moving to a more general mailing list (like [hidden email]) to get the
necessary attention.

-Stephan



Re: Unicode normalisation

Németh László-2
In reply to this post by F Wolff
Hi,

2009/1/5 F Wolff <[hidden email]>:
> Németh, the ICONV solution sounds interesting, and I guess would work. I
> don't know if that would then also work in Firefox. (Do they update

In fact, the development version of Firefox already contains Hunspell
1.2.8, unlike OpenOffice.org.

> their copy of Hunspell from time to time?) Automatic conversion means
> that people would benefit from the normalisation even if the spell
> checker authors didn't think about the problem, which is probably ideal.
> I can't image there being a very large overhead for this, although it
> probably won't come for free either.

I was thinking only of an automatic preprocessor utility, based on the
Unicode canonical and compatibility equivalence data.
A preprocessed dictionary needs only a small subset of this data, so
there is no extra overhead.
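A rough sketch of such a preprocessor in Python, under the assumption
that NFC composition covers the needed canonical mappings (the function
names are mine for illustration, not an actual Hunspell tool):

```python
import unicodedata

def iconv_table(words):
    """Collect ICONV mappings: for every character appearing in the
    dictionary whose NFD (decomposed) spelling differs from its NFC
    (composed) form, map the decomposed sequence to the composed
    character. Only characters the dictionary actually uses end up
    in the table, keeping it small."""
    table = {}
    for word in words:
        for ch in unicodedata.normalize("NFC", word):
            nfd = unicodedata.normalize("NFD", ch)
            if nfd != ch:
                table[nfd] = ch
    return table

def format_iconv(table):
    """Render the table as Hunspell affix-file ICONV lines."""
    lines = ["ICONV %d" % len(table)]
    for src, dst in sorted(table.items()):
        lines.append("ICONV %s %s" % (src, dst))
    return "\n".join(lines)

# A dictionary containing "Äpfel" and "grün" needs exactly two rules:
# decomposed Ä -> composed Ä, and decomposed ü -> composed ü.
print(format_iconv(iconv_table(["Äpfel", "grün"])))
```

A real tool would also need compatibility mappings (NFKC) for cases
like the fi ligature, which this canonical-only sketch omits.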

Regards,
László





Re: Unicode normalisation

Sunday Bolaji
Hi,
  Please can anybody assist with how to make Hunspell recognize a word
formed with a hyphen, like "marun-un", as a word in the dictionary file?
 Also, any assistance on how to use ICONV in OOo 3.0 would be
appreciated.



Re: Unicode normalisation

Németh László-2
Hi,

The hyphen will be a default word character only in OOo 3.1 or later; see
http://www.openoffice.org/issues/show_bug.cgi?id=64400. For now, the best
(though not perfect) method is to split your words in the spelling
dictionary:

marun
un

For command-line spell checking, use

WORDCHARS -

in the affix file.

It seems the ICONV Hunspell feature will be supported by OOo 3.1. In
place of the missing Unicode normalization, you can use a redundant
dictionary containing the different encodings of the same words, using
the NOSUGGEST feature to restrict suggestions to one encoding.
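One way to generate such a redundant dictionary is sketched below with
Python's unicodedata. The helper name is mine, and the .aff file is
assumed to declare NOSUGGEST * (as in the German example earlier in the
thread):

```python
import unicodedata

def redundant_entries(words, nosuggest_flag="*"):
    """For each word, emit the composed (NFC) form as the normal entry
    and, when it differs, the decomposed (NFD) form flagged NOSUGGEST,
    so decomposed input is accepted but never offered as a suggestion.
    Assumes the affix file declares NOSUGGEST with the same flag."""
    entries = []
    for word in words:
        nfc = unicodedata.normalize("NFC", word)
        nfd = unicodedata.normalize("NFD", word)
        entries.append(nfc)
        if nfd != nfc:
            entries.append(nfd + "/" + nosuggest_flag)
    return entries

# "marun-un" carries no combining marks, so it yields a single entry;
# a word with diacritics, e.g. "Äpfel", also gets a second,
# NOSUGGEST-flagged entry in its decomposed spelling.
for entry in redundant_entries(["marun-un", "Äpfel"]):
    print(entry)
```

This doubles the dictionary only for words that actually contain
composable characters, which keeps the redundancy modest in practice.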

Regards,
László

