Anyone familiar with the ICU?


Anyone familiar with the ICU?

thomas.lange

Hi all,

Does anyone know how to modify the ICU rules (probably the word.txt file)
so that a leading or trailing "HYPHEN-MINUS" or "EN DASH" counts as part
of the word (so that it gets passed on to the spell checker as well)?

This would be useful e.g. for German where there are correct word parts
like
  "Arbeits- und Verwaltungsrecht"

Regards,
Thomas


BTW: Is there any other language where hyphens/dashes should be handled
similarly?



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Anyone familiar with the ICU?

Németh László-2
Hi,

See the extended ALetter definitions in the Hungarian word breaking rules:

http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/dict_word_hu.txt
http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/edit_word_hu.txt

By the way, it also contains numbers and other special signs, because
Hungarian uses their affixed forms. (For example, "with 25%" is
"25%-kal" in Hungarian, and not the frequent bad form "25%-al"):

$ALetter   = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
                [:name = PERCENT SIGN:] [:name = PER MILLE SIGN:] [:name = PER TEN THOUSAND SIGN:]
                [:name = SECTION SIGN:] [:name = DEGREE SIGN:] [:name = EURO SIGN:]
                [:name = HYPHEN-MINUS:] [:name = EN DASH:] [:name = EM DASH:]
                [:name = DIGIT ZERO:]
                [:name = DIGIT ONE:]
                [:name = DIGIT TWO:]
                [:name = DIGIT THREE:]
                [:name = DIGIT FOUR:]
                [:name = DIGIT FIVE:]
                [:name = DIGIT SIX:]
                [:name = DIGIT SEVEN:]
                [:name = DIGIT EIGHT:]
                [:name = DIGIT NINE:]
                           - $Ideographic
                           - $Katakana
                           - $Hangul
                           - [:Script = Thai:]
                           - [:Script = Lao:]
                           - [:Script = Hiragana:]];
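
As a rough illustration of what the extended set buys you, here is a
small Python regex sketch. It is not ICU code: the character classes
below are hand-picked approximations of the rule above, and the names
EXTRA_SIGNS, DASHES, PLAIN and EXTENDED are made up for this example.

```python
import re

# Approximation only: a few of the signs named in the Hungarian rule.
EXTRA_SIGNS = "@%\u2030\u00a7\u00b0\u20ac"  # @ % per-mille, section, degree, euro
DASHES = "\u002d\u2013\u2014"               # HYPHEN-MINUS, EN DASH, EM DASH

# Plain tokenizer: runs of Unicode letters and digits only.
PLAIN = re.compile(r"[^\W_]+")

# Extended tokenizer: the signs and dashes also count as word characters.
EXTENDED = re.compile("(?:[^\\W_]|[" + re.escape(EXTRA_SIGNS + DASHES) + "])+")


def tokens(pattern, text):
    """Return all tokens that the given pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(tokens(PLAIN, "25%-kal"))     # the affixed form is split apart
    print(tokens(EXTENDED, "25%-kal"))  # the affixed form stays together
```

With the plain class the spell checker would only ever see "25" and
"kal"; with the extended class it receives "25%-kal" as one unit and
can judge the suffix.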

Best regards,
László





Re: Anyone familiar with the ICU?

thomas.lange

Hello László,  :-)

I tried something similar. I made the following changes:

$ALetter   = [\u0002 [:name = HYPHEN-MINUS:] [:name = EN DASH:] [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
                           - $Ideographic
                           - $Katakana
                           - $Hangul
                           - [:Script = Thai:]
                           - [:Script = Lao:]
                           - [:Script = Hiragana:]];
...
$SufixLetter = [:name= FULL STOP:] [:name = HYPHEN-MINUS:] [:name = EN DASH:];

Basically it worked, but an unwanted side effect was that multiple
dashes got accepted at the start or end of the word. That is, "---water"
and "river---" were each regarded as one word. Whereas if I use text like
"...water" and "river...", only one of the dots is ever included
with the word. So I am wondering whether the dashes could be handled
in a similar way...
Also, since I'm completely new to the ICU, I don't know whether my
attempt above has any other unwanted side effects.
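
The difference between the two behaviours can be mimicked with plain
regular expressions. This is only an analogy in Python, not the ICU
engine, and it is just one plausible reading of why the dashes and the
full stops behave differently; the pattern names are invented for the
example.

```python
import re

LETTER = r"[^\W\d_]"    # one Unicode letter
DASHES = "\\-\u2013"    # HYPHEN-MINUS and EN DASH, escaped for a class

# Variant A: the dashes are members of the letter class itself, so any
# run of dashes is absorbed into the word.
DASH_AS_LETTER = re.compile(f"(?:{LETTER}|[{DASHES}])+")

# Variant B: a mark may appear at most once before and at most once
# after the letters, mirroring what was observed for FULL STOP.
SINGLE_AFFIX = re.compile(rf"\.?{LETTER}+\.?")

if __name__ == "__main__":
    for text in ("---water", "river---"):
        print(text, DASH_AS_LETTER.findall(text))
    for text in ("...water", "river..."):
        print(text, SINGLE_AFFIX.findall(text))
```

In variant A nothing limits how many dashes a word may collect, whereas
variant B allows exactly one optional mark on each side.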

Do you have any clues for me?

Regards,
Thomas



Re: Anyone familiar with the ICU?

Németh László-2
2009/6/10 Thomas Lange - Sun Germany - ham02 - Hamburg <[hidden email]>:
>
> Hello László,  :-)

Hello Thomas,

Glad to hear about the word breaking fixes. :) Unfortunately, I have
not yet had time to follow Issue 64400. I have also found a
relevant bug in Hunspell 1.2.8. I fixed it in OpenOffice.org at
the last minute before the OOo 3.1 code freeze, but not yet for the
OpenOffice.org distributions that use an external Hunspell 1.2.8. The
bug is quite specific: words with these dashes at the ends cause a
segfault when the thesaurus is used (the improved thesaurus uses
Hunspell for stemming), but I didn't want to force a fix for this
issue before the Hunspell 1.2.9 release.


ICU seems to use a regex-like syntax, so a definition like the
following may help with the repeated dashes:

$attheend = [\u0002 [:name = HYPHEN-MINUS:] [:name = EN DASH:]];

Then modify the first line of the $LetterSequence definition to allow
the optional dashes:

$LetterSequence = $attheend? $ALetterEx ($FormatEx* $MidLetterEx? $FormatEx* $ALetterEx $attheend?)*;
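
To see what such a rule would accept, here is the same shape written as
a Python regex. It is a simplified sketch, not ICU: $FormatEx is left
out, LETTER stands in for $ALetterEx, and MID is an assumed stand-in
for $MidLetterEx (apostrophes only).

```python
import re

LETTER = r"[^\W\d_]"     # stand-in for $ALetterEx: one Unicode letter
DASH = "[\\-\u2013]"     # stand-in for $attheend: HYPHEN-MINUS or EN DASH
MID = "['\u2019]"        # assumed stand-in for $MidLetterEx

# Same shape as: $attheend? $ALetterEx ($MidLetterEx? $ALetterEx $attheend?)*
WORD = re.compile(f"{DASH}?{LETTER}(?:{MID}?{LETTER}{DASH}?)*")

if __name__ == "__main__":
    for text in ("Arbeits- und Verwaltungsrecht", "---water", "river---"):
        print(text, WORD.findall(text))
```

In this sketch at most one dash is kept on either side of the word.
Because the optional trailing dash sits inside the repeated group, a
single dash between letters (as in "ab-cd") also stays inside the word,
and a one-letter word cannot take a trailing dash. Whether that is
wanted would need checking against the real rules.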

Regards,
László




Re: Anyone familiar with the ICU?

Marcin Miłkowski
In reply to this post by thomas.lange
Hi Thomas,

>
> BTW: Is there any other language where hyphens/dashes should be handled
> similarly?

Yes, Polish can have dashes at the end of prefixes or of the first part
of certain compound adjectives, but they don't normally occur at the
beginning of a word.

Regards
Marcin
