
Re: Changed hyphen policy for Dutch (and other languages)

thomas.lange

Hi all,

Ruud Baars wrote:

> Together with one of the OOo people, we have tested our newest
> dictionary for Dutch, using the new way OOo 3.2 treats words with
> hyphens in the middle.
>
> I am happy to report that our tests were successful.
> And indeed, it is a promising improvement for Dutch.
>
> Even more so once we are finally able to finish our dictionary that
> uses compounding algorithms.
>
> Thanks.
>
> Ruud
>  

One question just occurred to me, though:

Firefox and Thunderbird can use Hunspell dictionaries, at least as
add-ons (I don't know whether they otherwise still use MySpell
dictionaries), but in both applications the hyphen is still a word
breaker. Does that mean they have to use different dictionary versions?
Or will it still be possible to share the dictionaries system-wide,
as most Linux installations currently do?


Thomas



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Changed hyphen policy for Dutch (and other languages)

Rene Engelhard-7
Hi,

On Fri, Aug 28, 2009 at 10:20:12AM +0200, Thomas Lange - Sun Germany - ham02 - Hamburg wrote:
> Since Firefox and Thunderbird can use Hunspell dictionaries at least as
> add-on (don't know if otherwise they are still using Myspell
> dictionaries), but for both applications the hyphen is still a word
> breaker, does it mean they have to use different dictionary versions?

Good question, yes.

> Or will it still be possible to share the dictionaries system wide?
> Since that is what most Linux installations currently do.

OTOH, at least Debian builds the dict directly out of "dutch" (the ispell
version). So unless that build system is adapted, we probably won't get
this "improvement" anyway.

Grüße/Regards,

Rene


Re: Changed hyphen policy for Dutch (and other languages)

Németh László-2
In reply to this post by thomas.lange
Hi,

2009/8/28 Thomas Lange - Sun Germany - ham02 - Hamburg <[hidden email]>:

>
> Hi all,
>
> Ruud Baars wrote:
>> Together with one of the OOo people, we have tested our newest
>> dictionary for Dutch, using the new way OOo 3.2 treats words with
>> hyphens in the middle.
>>
>> I am happy to report that our tests were successful.
>> And indeed, it is a promising improvement for Dutch.

I'm glad to hear it.

>>
>> Even more so once we are finally able to finish our dictionary that
>> uses compounding algorithms.
>>
>> Thanks.
>>
>> Ruud
>>
>
> One curious question just popped up in my mind though:
>
> Since Firefox and Thunderbird can use Hunspell dictionaries at least as
> add-on (don't know if otherwise they are still using Myspell
> dictionaries), but for both applications the hyphen is still a word
> breaker, does it mean they have to use different dictionary versions?
> Or will it still be possible to share the dictionaries system wide?
> Since that is what most Linux installations currently do.

It is up to the dictionary developers to check the modifications and
decide about them.
They could also provide different dictionaries for Firefox and
OpenOffice.org. 98% of the users use Windows and Firefox with its
bundled dictionaries.

For example, this modification has very limited impact on English.
Some of the related words (words that occur only in hyphenated
compounds), for example "scot", are missing from the recent en_US
dictionary, so the word "scot-free" is already incorrectly flagged as a
spelling mistake in Firefox and OpenOffice.org.

I believe this modification will force a fix for the imperfect spell
checking in Firefox. For example, the lack of abbreviation handling
(the ambiguous dot/full stop) is an old spell checking problem of
Firefox, too. (But the most annoying problems for me are the
size-limited on-the-fly spell checking in text areas and the lack of
paragraph-based automatic language detection.)

Best regards,

László

>
>
> Thomas
>
>
>

Re: Changed hyphen policy for Dutch (and other languages)

Olivier R.-2
In reply to this post by thomas.lange
Hi Thomas,

Thomas Lange - Sun Germany - ham02 - Hamburg wrote:

> Since Firefox and Thunderbird can use Hunspell dictionaries at least as
> add-on (don't know if otherwise they are still using Myspell
> dictionaries), but for both applications the hyphen is still a word
> breaker, does it mean they have to use different dictionary versions?
> Or will it still be possible to share the dictionaries system wide?

It’s not necessary to provide different dictionaries, because if a word
with a hyphen is not recognized by Hunspell, Hunspell will check both
parts of the word separately.

László Németh wrote: “Without any dictionary modification the nearly
integrated Hunspell 1.2.8 can break the input token at hyphens like the
tokenizator of OpenOffice.org (default back compatibility), but Hunspell
checks also the whole token (with hyphen, like in "scot-free") before
this tokenization. Now Hunspell has also hyphenated multiword
suggestions ("good-words-badd-words"->"good-words-bad-words").”
http://www.openoffice.org/issues/show_bug.cgi?id=64400
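
The fallback described in the quote above can be sketched roughly as follows. This is only a toy model in Python, not Hunspell's actual code: the `check_token` function and the small word set are made up for illustration, and a plain set stands in for a real dictionary.

```python
def check_token(token, words):
    """Return True if the token is accepted.

    Toy model of the quoted behaviour: try the whole token (hyphens
    included) first, then fall back to checking every
    hyphen-separated part on its own.
    """
    if token in words:
        return True
    parts = token.split("-")
    # Fall back only when there really is a hyphen; empty parts
    # (as in "a--b" or "-a") are never in the word set, so they fail.
    return len(parts) > 1 and all(part in words for part in parts)


# Hypothetical mini-dictionary: "scot-free" is listed as a whole,
# "scot" on its own is not.
WORDS = {"scot-free", "free", "good", "words", "bad"}

print(check_token("scot-free", WORDS))   # whole token is in the set
print(check_token("good-words", WORDS))  # accepted through its parts
print(check_token("good-badd", WORDS))   # "badd" is unknown
```

Checking the whole token first is what lets a dictionary list "scot-free" without also having to accept "scot" as a word of its own.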

A question: is it planned to cancel this back-compatibility in the future?

I do not say it should be done. The current behavior is fine for me now.
:)

Best regards,
--
Olivier R.

== Adresse mail réservée aux listes de discussion.                ==
== Les messages venant d’ailleurs sont _automatiquement_ effacés. ==
** E-mail dedicated to mailing-lists.                             **
** Messages from anywhere else are _automatically_ erased.        **


Re: Changed hyphen policy for Dutch (and other languages)

Németh László-2
Hi

2009/8/28 Olivier R. <[hidden email]>:

> Hi Thomas,
>
> Thomas Lange - Sun Germany - ham02 - Hamburg wrote:
>
>> Since Firefox and Thunderbird can use Hunspell dictionaries at least as
>> add-on (don't know if otherwise they are still using Myspell
>> dictionaries), but for both applications the hyphen is still a word
>> breaker, does it mean they have to use different dictionary versions?
>> Or will it still be possible to share the dictionaries system wide?
>
> It’s not necessary to provide different dictionaries, because if a word
> with a hyphen is not recognized by Hunspell, Hunspell will check both
> parts of the word separately.

The possible problem is using the old tokenization with the new
dictionaries. Firefox's tokenization hasn't been modified yet, and
replacing the word parts with hyphenated forms in Hunspell dictionaries
could result in false error alerts instead of the quiet errors (word
parts accepted when used on their own). Luckily, in most cases this is
not a problem, because the dictionary developers didn't add the rare
word parts to the dictionary. For example, the English word "scot-free"
contains the word part "scot". It's possible to add it to the dictionary
to avoid the unnecessary error alerts, but this is an imperfect solution.
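
To make the mismatch concrete, here is a small Python sketch of the two tokenization styles run against the same dictionary. It is an illustration under simplifying assumptions, not how Firefox or OpenOffice.org actually tokenize: the function names, the regular expressions and the four-word dictionary are all invented for the example.

```python
import re


def split_at_hyphens(text):
    """Old-style tokenization: the hyphen is a word breaker."""
    return re.findall(r"[A-Za-z']+", text)


def keep_hyphens(text):
    """New-style tokenization: hyphens inside a word stay in the token."""
    return re.findall(r"[A-Za-z']+(?:-[A-Za-z']+)*", text)


def misspelled(tokens, words):
    """Return the tokens that are not in the word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in words]


# Hypothetical "new" dictionary: only the hyphenated form is listed.
WORDS = {"he", "got", "off", "scot-free"}
TEXT = "He got off scot-free"

print(misspelled(split_at_hyphens(TEXT), WORDS))  # parts flagged separately
print(misspelled(keep_hyphens(TEXT), WORDS))      # nothing flagged
```

With the old tokenizer the checker never sees "scot-free" as one token, so a dictionary that lists only the hyphenated form produces alerts for the parts; adding "scot" and "free" back would silence them, at the cost of accepting "scot" on its own again.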

>
> László Németh wrote: “Without any dictionary modification the nearly
> integrated Hunspell 1.2.8 can break the input token at hyphens like the
> tokenizator of OpenOffice.org (default back compatibility), but Hunspell
> checks also the whole token (with hyphen, like in "scot-free") before this
> tokenization. Now Hunspell has also hyphenated multiword suggestions
> ("good-words-badd-words"->"good-words-bad-words").”
> http://www.openoffice.org/issues/show_bug.cgi?id=64400

Yes, this is the back-compatibility for using the new tokenization
with old dictionaries.

>
> A question: is it planned to cancel this back-compatibility in the future?

It is not planned.

>
> I do not say it should be done. The current behavior is fine for me now.
> :)

And I hope, for the French users, too. :)

Best regards,
László


>
> Best regards,
> --
> Olivier R.
>
> == Adresse mail réservée aux listes de discussion.                ==
> == Les messages venant d’ailleurs sont _automatiquement_ effacés. ==
> ** E-mail dedicated to mailing-lists.                             **
> ** Messages from anywhere else are _automatically_ erased.        **
>