Question about twofold suffix stripping

Question about twofold suffix stripping

Mehmet D. AKIN-3
Hi, we are working on a Turkish Hunspell dictionary and affix file, and I have a
question about twofold affix stripping.

Below is an example affix file for 2 basic suffixes for Turkish, Plural
(-ler -lar) and direction (-e , -a).

My question is this: Hunspell supports twofold suffix stripping, so it sounds like it
should be possible to define suffix combinations like -ler-e and -lar-a (as
in kedilere = to cats, elmalara = to apples).

But in my tests:

mdakin@pardus:~$ hunspell -m -d turkish
kedi
kedi  st:kedi
kediler
kediler  st:kedi fl:L
kedilere
kedilere  st:kedi fl:L fl:E
kedilera
kedilera  st:kedi fl:L fl:E

Though it worked for "kedilere", the word "kedilera" is definitely wrong. It
seems Hunspell does not check the condition defined for the second suffix. Does
this mean that the conditions defined for suffixes (like SFX E1 0 ye
[eiöü]) are only valid when the suffix is attached to a dictionary word, and not
when it is attached to another suffix?

If that is the case, we'll have to define combinations of suffixes as
separate affix rules like:

SFX L1 0 lere [eiöü]
...

thanks for any help,

Mehmet



example dictionary (turkish.dic):

3
kedi/L0E1C1S1
elma/L0E1C1S1
dolap/L0E2C1S2

turkish.aff :

LANG tr_TR
SET UTF-8
TRY İiIıŞşÇçĞğÜüÖö-qwertyuopasdfghjklzxcvbnmQWERTYUOPASDFGHJKLZXCVBNM'

# Names, plural suffixes. -ler -lar
# if last vowel is (eiöü) -lar is appended
# if last vowel is (aıou) -lar is appended
# Examples: kedi-ler elma-lar Kurt-lar
SFX L0 Y 6
SFX L0 0 ler/E1 [eiöü]
SFX L0 0 ler/E1 [eiöü][^aeioöuü]
SFX L0 0 ler/E1 [eiöü][^aeioöuü][^aeioöuü]
SFX L0 0 lar/E1 [aıou]
SFX L0 0 lar/E1 [aıou][^aeioöuü]
SFX L0 0 lar/E1 [aıou][^aeioöuü][^aeioöuü]

# Names, direction -e -a
# if last vowel is (eiöü) -e is appended sepet-e
# if last vowel is (aıou) -a is appended salon-a
# if last letter of word is a vowel, an extra y is appended: elma-ya or kedi-ye
# if word has more than 1 sound, last consonant is (pçtk) and comes after a vowel,
#    it transforms into (bcdğ) dolap-a -> dolab-a.
SFX E1 Y 6
SFX E1 0 ye [eiöü]
SFX E1 0 e [eiöü][bcdfgğhjlmnrsşvyz]
SFX E1 0 e [eiöü][^aeioöuü][bcdfgğhjlmnrsşvyz]
SFX E1 0 ya [aıou]
SFX E1 0 a [aıou][bcdfgğhjlmnrsşvyz]
SFX E1 0 a [aıou][^aeioöuü][bcdfgğhjlmnrsşvyz]

SFX E2 Y 8
SFX E2 p ba [aıou]p
SFX E2 ç ca [aıou]ç
SFX E2 t da [aıou]t
SFX E2 k ğa [aıou]k
SFX E2 p be [eiöü]p
SFX E2 ç ce [eiöü]ç
SFX E2 t de [eiöü]t
SFX E2 k ğe [eiöü]k
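
For readers following along, here is a minimal Python sketch of the behavior the question expects from twofold suffix stripping: the condition of each suffix is checked against the form it attaches to, whether that is a dictionary word or a word that already carries another suffix. It is a simplified model, not Hunspell's implementation. It assumes that L0 and E1 are meant as two-character flags, and it reuses only the L0 and E1 rules and the kedi and elma entries from above. The names Rule, parse_rules, candidates and check are made up for illustration.

```python
import re
from typing import NamedTuple

# L0 and E1 rule lines copied from turkish.aff above (header lines omitted).
AFF = """
SFX L0 0 ler/E1 [eiöü]
SFX L0 0 ler/E1 [eiöü][^aeioöuü]
SFX L0 0 ler/E1 [eiöü][^aeioöuü][^aeioöuü]
SFX L0 0 lar/E1 [aıou]
SFX L0 0 lar/E1 [aıou][^aeioöuü]
SFX L0 0 lar/E1 [aıou][^aeioöuü][^aeioöuü]
SFX E1 0 ye [eiöü]
SFX E1 0 e [eiöü][bcdfgğhjlmnrsşvyz]
SFX E1 0 e [eiöü][^aeioöuü][bcdfgğhjlmnrsşvyz]
SFX E1 0 ya [aıou]
SFX E1 0 a [aıou][bcdfgğhjlmnrsşvyz]
SFX E1 0 a [aıou][^aeioöuü][bcdfgğhjlmnrsşvyz]
"""

# Dictionary entries; flags split into two-character units (an assumption).
DIC = {
    "kedi": {"L0", "E1", "C1", "S1"},
    "elma": {"L0", "E1", "C1", "S1"},
}


class Rule(NamedTuple):
    flag: str          # flag that enables this rule
    strip: str         # characters removed from the base before adding
    add: str           # suffix that is appended
    cond: re.Pattern   # condition, anchored at the end of the base
    cont: frozenset    # continuation flags carried by the suffix


def parse_rules(text):
    rules = []
    for line in text.split("\n"):
        parts = line.split()
        if len(parts) != 5 or parts[0] != "SFX":
            continue  # skip blank lines and anything that is not a rule line
        _, flag, strip, add, cond = parts
        add, _, cont = add.partition("/")
        rules.append(Rule(flag, "" if strip == "0" else strip, add,
                          re.compile(f"(?:{cond})$"),
                          frozenset([cont]) if cont else frozenset()))
    return rules


RULES = parse_rules(AFF)


def candidates(word):
    """Yield (base, rule) for each rule whose suffix can be removed from word
    and whose condition matches the end of the resulting base."""
    for rule in RULES:
        if word.endswith(rule.add) and len(word) > len(rule.add):
            base = word[:len(word) - len(rule.add)] + rule.strip
            if rule.cond.search(base):
                yield base, rule


def check(word):
    if word in DIC:
        return True
    for base, outer in candidates(word):
        # One suffix: the base is a dictionary word that allows the flag.
        if outer.flag in DIC.get(base, ()):
            return True
        # Two suffixes: the inner suffix must allow the outer flag as a
        # continuation, and the root must allow the inner flag.
        for root, inner in candidates(base):
            if outer.flag in inner.cont and inner.flag in DIC.get(root, ()):
                return True
    return False


for w in ["kedilere", "kedilera", "elmalara", "elmalare"]:
    print(w, "ok" if check(w) else "incorrect")
```

In this model kedilere and elmalara are accepted, while kedilera and elmalare are rejected, because the -a and -e conditions are tested against kediler and elmalar rather than being skipped.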

Re: Question about twofold suffix stripping

Erdal Ronahi
A small correction:

> # Names, plural suffixes. -ler -lar
> # if last vowel is (eiöü) -lar is appended
> # if last vowel is (aıou) -lar is appended

In the second line "-lar" should probably be replaced with "-ler". Of
course, this does not affect the actual spelling engine. But it may
confuse people who don't know Turkish.

Erdal

Re: Question about twofold suffix stripping

ge-7
In reply to this post by Mehmet D. AKIN-3
Mehmet,

I set up a test for your case.
I created /tmp/turkish.dic and /tmp/turkish.aff as you wrote them in your email.
I am using Hunspell 1.2.1.

I then ran the example tool, and I get the expected (and in my opinion correct) result:
en@gepem:~/program/hunspell-1.2.1/src/tools$ example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1
"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is incorrect!
   suggestions:
    ..."kediler"

Am I overlooking something?

-eleonora


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Question about twofold suffix stripping

ge-7
In reply to this post by Mehmet D. AKIN-3
Mehmet,

Some additional tests, they all seem to be ok:
en@gepem:~/program/hunspell-1.2.1/src/tools$ example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1
"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is incorrect!
   suggestions:
    ..."kediler"
"kedilar" is incorrect!
   suggestions:
    ..."kedi"
    ..."elmalar"
    ..."dolaplar"
    ..."dolap"
"kedilara" is incorrect!
   suggestions:
    ..."kedi"
"elma" is okay
"elmalar" is okay
"elmalara" is okay
"elmaler" is incorrect!
   suggestions:
    ..."elma"
    ..."kediler"
"elmalere" is incorrect!
   suggestions:
    ..."elma"
"elmalera" is incorrect!
   suggestions:
    ..."elma"
"elmalare" is incorrect!
   suggestions:
    ..."elmalar"

What do you think?

-eleonora



Re: Question about twofold suffix stripping

Mehmet D. AKIN-3
In reply to this post by Mehmet D. AKIN-3
Hi,

Thanks a lot for testing. The weird thing is that I am using Hunspell 1.2.7 and,
though your test results show that my aff file is correct, Hunspell on Linux
still gives me wrong results:

test.txt:

kediler
kedilar
kedilere
kedilera


mdakin@pardus ~ $ hunspell -d turkish < test.txt
Hunspell 1.2.7
+ kedi

& kedilar 2 0: kediler, kedi

+ kedi

+ kedi

So it still thinks kedilera is correct. Could this be a regression in the
new version? Can anyone with hunspell 1.2.7 confirm this?

Mehmet

On Mon, Sep 22, 2008 at 11:06 PM, ge <[hidden email]> wrote:

> Mehmet,
>
> Some additional tests, they all seem to be ok:

Re: Question about twofold suffix stripping

Mehmet D. AKIN-3
In reply to this post by Erdal Ronahi
Thanks Erdal,

A copy-paste accident. Let's see how far we will get for Turkish with these
affix files :)

Mehmet


2008/9/22 Erdal Ronahi <[hidden email]>

> A small correction:
>
> > # Names, plural suffixes. -ler -lar
> > # if last vowel is (eiöü) -lar is appended
> > # if last vowel is (aıou) -lar is appended
>
> In the second line "-lar" should probably be replaced with "-ler". Of
> course, this does not affect the actual spelling engine. But it may
> confuse people who don't know Turkish.
>
> Erdal
>

Re: Question about twofold suffix stripping

ge-7
In reply to this post by Mehmet D. AKIN-3
Mehmet:

I downloaded and freshly compiled hunspell 1.2.7, and I still get correct results:
en@gepem:~/program/hunspell-1.2.7/src/tools$ example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1
"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is incorrect!
   suggestions:
    ..."kediler"
"kedilar" is incorrect!
   suggestions:
    ..."kedi"
    ..."elmalar"
    ..."dolaplar"
    ..."dolap"
"kedilara" is incorrect!
   suggestions:
    ..."kedi"
"elma" is okay
"elmalar" is okay
"elmalara" is okay
"elmaler" is incorrect!
   suggestions:
    ..."elma"
    ..."kediler"
"elmalere" is incorrect!
   suggestions:
    ..."elma"
"elmalera" is incorrect!
   suggestions:
    ..."elma"
"elmalare" is incorrect!
   suggestions:
    ..."elmalar"

Could you try your words using example? It gets built automatically
by make, so you should have it (in src/tools).

Hunspell manual says:
(http://sourceforge.net/docman/display_doc.php?docid=90720&group_id=143754)
Correct words signed with an '*', '+' or '-', unrecognized words signed with '#' or '&' in output lines (see later). (Close the standard input with Ctrl-d on Unix/Linux and Ctrl-Z Enter or Ctrl-C on Windows.)

So you are right: hunspell gives an incorrect result for kedilera (the "+ kedi" line marks it as correct). However, example works perfectly with hunspell 1.2.7 on Linux, at least for me.

If you tell me how to configure hunspell properly, I can repeat the test with hunspell. (I use example all the time here, never hunspell as you do.)
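
To make the marker convention quoted from the manual concrete, here is a small Python sketch that classifies the output lines of hunspell's pipe mode and pairs them with the input words. The sample text is Mehmet's output from earlier in the thread. The function name parse_line and the tuple layout are made up for illustration, and the layout assumed for the '&' and '#' lines is inferred from that sample and the ispell-style format, so treat it as an assumption.

```python
def parse_line(line):
    """Classify one output line as (status, detail, suggestions), or None."""
    line = line.strip()
    mark, _, rest = line.partition(" ")
    if mark in ("*", "-"):   # accepted: dictionary word or compound
        return ("ok", None, [])
    if mark == "+":          # accepted after affix removal; rest is the root
        return ("ok", rest, [])
    if mark == "&":          # rejected: "& word count offset: sug1, sug2"
        head, _, sugg = rest.partition(": ")
        return ("miss", head.split()[0], sugg.split(", "))
    if mark == "#":          # rejected, no suggestions: "# word offset"
        return ("miss", rest.split()[0], [])
    return None              # banner line or blank separator


WORDS = ["kediler", "kedilar", "kedilere", "kedilera"]
OUTPUT = """Hunspell 1.2.7
+ kedi

& kedilar 2 0: kediler, kedi

+ kedi

+ kedi
"""

results = [r for r in map(parse_line, OUTPUT.splitlines()) if r is not None]
for word, (status, detail, suggestions) in zip(WORDS, results):
    print(word, status, detail, suggestions)
```

The last pair shows the problem discussed here: kedilera comes back as ok with root kedi.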


-eleonora



Re: Question about twofold suffix stripping

Mehmet D. AKIN-3
In reply to this post by Mehmet D. AKIN-3
Thanks Eleonora, but it seems there is something seriously wrong here :)

mdakin@pardus tools $ example ~/turkish.aff ~/turkish.dic ~/test.txt
"kediler" is okay

"kedilar" is incorrect!
   suggestions:
    ..."kediler"
    ..."kedi"
    ..."elmalar"
    ..."dolaplar"
    ..."dolap"

"kedilere" is okay

"kedilera" is okay


My OS is Pardus 2007.3 and I compiled hunspell myself. I am not sure what
the difference between our systems is.

Maybe the example application is using the system's default installed hunspell
library, which is probably still the old version on your machine? Can you
first uninstall hunspell completely (make uninstall), then do a fresh
install of hunspell 1.2.7?

Mehmet


On Mon, Sep 22, 2008 at 11:57 PM, ge <[hidden email]> wrote:

> Mehmet:
>
> I downloaded and compiled hunspell 1.2.7 freshly, I still get correct
> results:

Re: Question about twofold suffix stripping

ge-7
In reply to this post by Mehmet D. AKIN-3
Mehmet:

You are right, 1.2.7 is wrong:
en@gepem:~/program/hunspell-1.2.7/src/tools$ ./example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1
"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is okay
"kedilar" is incorrect!
   suggestions:
    ..."kedi"
    ..."elmalar"
    ..."dolaplar"
    ..."dolap"
"kedilara" is incorrect!
   suggestions:
    ..."kedi"
"elma" is okay
"elmalar" is okay
"elmaler" is incorrect!
   suggestions:
    ..."elma"
    ..."kediler"
"elmalere" is incorrect!
   suggestions:
    ..."elma"
"elmalera" is incorrect!
   suggestions:
    ..."elmalare"
"elmalara" is okay
"elmalare" is okay

1.2.1 works well. I can email you its source tree for Linux.

Here are the 1.2.1 results:
en@gepem:~/program/hunspell-1.2.1/src/tools$ ./example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1 g
kedi
kediler
kedilere
elma
elmalar
elmalara
en@gepem:~/program/hunspell-1.2.1/src/tools$ ./example /tmp/turkish.aff /tmp/turkish.dic /tmp/x1 b
kedilera
kedilar
kedilara
elmaler
elmalere
elmalera
elmalare
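
The comparison between the two versions can also be done mechanically. The following Python sketch extracts the accepted words from the example tool's output for each version and lists the words that only 1.2.7 accepts. The status lines are copied from the runs in this thread, with the suggestion lines omitted; the helper name accepted is made up for illustration.

```python
import re

OUT_1_2_1 = '''"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is incorrect!
"kedilar" is incorrect!
"kedilara" is incorrect!
"elma" is okay
"elmalar" is okay
"elmalara" is okay
"elmaler" is incorrect!
"elmalere" is incorrect!
"elmalera" is incorrect!
"elmalare" is incorrect!
'''

OUT_1_2_7 = '''"kedi" is okay
"kediler" is okay
"kedilere" is okay
"kedilera" is okay
"kedilar" is incorrect!
"kedilara" is incorrect!
"elma" is okay
"elmalar" is okay
"elmaler" is incorrect!
"elmalere" is incorrect!
"elmalera" is incorrect!
"elmalara" is okay
"elmalare" is okay
'''


def accepted(example_output):
    """Return the set of words that the example tool reported as okay."""
    return set(re.findall(r'^"([^"]+)" is okay$', example_output,
                          flags=re.MULTILINE))


# Words accepted by 1.2.7 but rejected by 1.2.1.
regressions = sorted(accepted(OUT_1_2_7) - accepted(OUT_1_2_1))
print(regressions)
```

For these two runs the difference is exactly kedilera and elmalare, the two words whose second suffix violates its vowel condition.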


-eleonora



Re: Question about twofold suffix stripping

Mehmet D. AKIN-3
In reply to this post by Mehmet D. AKIN-3
Hi,
So this is confirmed. Thanks. If you can send it, I could try the old version,
but maybe it is best to report this and confirm that it is a bug and not
a feature. Where should I report it: on Hunspell's SourceForge site or as an
OpenOffice bug? I do not have an account on the SourceForge site, but I can open
one for this anyway.

Mehmet

On Tue, Sep 23, 2008 at 12:27 AM, ge <[hidden email]> wrote:

> Mehmet:
>
> You are right, 1.2.7 is wrong:

Re: Question about twofold suffix stripping

ge-7
In reply to this post by Mehmet D. AKIN-3
Mehmet:

Entered issue (artifact) in SF tracker, issue #2124180
( http://sourceforge.net/tracker/index.php )

Sent hunspell 1.2.1 to your private email.

-eleonora



Re: Question about twofold suffix stripping

Németh László-2
In reply to this post by Mehmet D. AKIN-3
Hi,

You are right, it is still a bug in the new condition-checking
algorithm (I have compared Hunspell 1.2.7 with Hunspell 1.1.12).
I am working on a quick fix. I have also got a bug report from Badral
Sanligiin (Mongolian NLP).

Thanks,
László


2008/9/23 ge <[hidden email]>:

> Mehmet:
>
> Entered issue (artifact) in SF tracker, issue #2124180
> ( http://sourceforge.net/tracker/index.php )
>
> Sent hunspell 1.2.1 to your private email.
>
> -eleonora