Adding affixation to a thesaurus

Andrea Pescetti
Reading http://www.openoffice.org/issues/show_bug.cgi?id=114774 I
understood that the OOo thesaurus supports affixation, i.e., that if
"river" admits "stream" as a synonym, then looking for a synonym of
"rivers" will bring up "streams".

Now, this has never worked in the Italian thesaurus: only the base form is
proposed. For example, if "piccolo" (Italian for "small") admits
"limitato" (Italian for "limited") as a synonym, looking for synonyms of
the plural form "piccoli" does not show the plural "limitati", but the
base form "limitato". This happens for all words, in OOo 3.2.1 too,
where the English thesaurus has working affixation and is unaffected
by the issue mentioned above.

It should thus be possible to improve the Italian thesaurus so that it
supports affixation like the English one. Can anybody point me to some
resources on how to do it? I had a look at
http://lingucomponent.openoffice.org/thesaurus.html but I wasn't able to
find an answer there.

Thanks,
  Andrea Pescetti - Italian N-L Project Lead.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Adding affixation to a thesaurus

Németh László-2
Hi,

[From my previous letters, with new links]:

The new stemming in the OpenOffice.org thesaurus works in most languages
without modifying the spelling dictionary (for example, the word form
"cats" now has synonyms in English). Morphological generation (for
example, listing "kitties" instead of "kitty" as a synonym for "cats" in
English) and word forms without (real) stems need some new dictionary
data, however. See issue 19563
(http://www.openoffice.org/issues/show_bug.cgi?id=19563), the Hunspell
manual (https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/hunspell4.pdf,
morphological analysis section), and the morphological regression tests,
the analyze tool and the new -s/-m options of the hunspell executable in
the Hunspell distribution.

The standalone OpenOffice.org MyThes thesaurus has a configuration option
to test your thesaurus with stemming and affixation:
https://sourceforge.net/projects/hunspell/files/MyThes/1.2.1/mythes-1.2.1.tar.gz

See README.NEW and README for compiling.

Test example

Make an input.txt file with two lines, "rodents" and "consumed", and
run MyThes with the test dictionary:
./example morph.idx morph.dat input.txt morph.aff morph.dic

Thesaurus uses encoding ISO8859-1

stem: rodent
rodent has 1 meanings
   meaning 0: (n) mouse
       mice

stem: consume
consume has 1 meanings
   meaning 0: (v) eat
       eaten, ate
       ingested
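
The output above can be read as three steps: stem the input word, look up
the stem in the thesaurus, then regenerate each synonym with the same
inflection. A minimal Python sketch of that idea follows. It does not call
Hunspell or MyThes; the ANALYSES, THESAURUS and FORMS tables and the
inflected_synonyms function are hypothetical, hand-written stand-ins for
what the real tools derive from the .aff/.dic/.dat files, and the order of
the generated forms may differ from the real tool.

```python
# Hand-written stand-ins for Hunspell analysis/generation (assumption: the
# real tools compute these from morph.aff, morph.dic and morph.dat).
ANALYSES = {
    # inflected form -> list of (stem, inflection tag)
    "rodents": [("rodent", "is:plur")],
    "consumed": [("consume", "is:past_1"), ("consume", "is:past_2")],
}

THESAURUS = {
    # stem -> synonyms (base forms only, as in the .dat file)
    "rodent": ["mouse"],
    "consume": ["eat", "ingest"],
}

FORMS = {
    # (synonym stem, inflection tag) -> generated form
    ("mouse", "is:plur"): "mice",
    ("eat", "is:past_1"): "ate",
    ("eat", "is:past_2"): "eaten",
    ("ingest", "is:past_1"): "ingested",
    ("ingest", "is:past_2"): "ingested",
}


def inflected_synonyms(word):
    """Return {synonym stem: [forms matching the inflection of `word`]}."""
    analyses = ANALYSES.get(word, [])
    if not analyses:
        return {}
    stem = analyses[0][0]  # this toy table has one stem per word
    result = {}
    for synonym in THESAURUS.get(stem, []):
        forms = []
        for _, tag in analyses:
            # Without generation data, fall back to the base form.
            form = FORMS.get((synonym, tag), synonym)
            if form not in forms:
                forms.append(form)
        result[synonym] = forms
    return result


for word in ("rodents", "consumed"):
    print(word, "->", inflected_synonyms(word))
```

The fallback to the base form corresponds to what a thesaurus without
generation data shows: the synonym is found, but not inflected.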

The example Hunspell dictionary follows. Meanings of the morphological
fields (see
http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754#Morphological%20analysis):

po: part of speech category
ts: terminal suffix
al: allomorph
st: stem
is: inflectional suffix

$ cat morph.dic
8
rodent/S        po:n    ts:nom
mouse           po:n    al:mice         ts:nom
mice            po:n    st:mouse        is:plur
consume/TQD     po:v    ts:present
ingest/TQD      po:v    ts:present
eat/QT          po:v    al:ate          al:eaten        ts:present
ate             po:v    st:eat          is:past_1
eaten           po:v    st:eat          is:past_2
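
To make the layout of these entries concrete, here is a small Python
sketch that splits one .dic line into the word, its affix flags and its
morphological fields. The function name parse_dic_line is hypothetical,
and the sketch assumes the default single-character flags and
whitespace-separated key:value fields, as in the listing above; it is not
how Hunspell itself reads the file.

```python
def parse_dic_line(line):
    """Split a .dic entry into (word, flags, fields).

    Assumptions: fields are separated by spaces or tabs, and a field key
    such as "al" may occur more than once on a line.
    """
    head, *field_tokens = line.split()
    word, _, flags = head.partition("/")  # flags is "" when there is no "/"
    fields = {}
    for token in field_tokens:
        key, _, value = token.partition(":")
        fields.setdefault(key, []).append(value)
    return word, flags, fields


word, flags, fields = parse_dic_line(
    "eat/QT  po:v    al:ate  al:eaten        ts:present"
)
print(word, flags, fields)
```

Storing each field as a list keeps both allomorphs of "eat" ("ate" and
"eaten") instead of overwriting the first with the second.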

$ cat morph.aff
# example for morphological analysis, stemming and generation
SFX D Y 4
SFX D   0 ed [^e] is:past_1
SFX D   0 d e     is:past_1
SFX D   0 ed [^e] is:past_2
SFX D   0 d e     is:past_2

SFX S Y 1
SFX S   0 s . is:plur

SFX Q Y 1
SFX Q   0 s . is:sg_3

SFX T Y 2
SFX T   0 ing [^e] is:pr_part
SFX T   e ing e    is:pr_part
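
Each SFX rule line reads: flag, characters to strip ("0" for none),
characters to add, a condition on the end of the stem, and a morphological
tag. The following Python sketch applies the rules above to generate an
inflected form. The names parse_suffix_rules and generate are
hypothetical, and treating the condition as a regular expression anchored
at the end of the word is a simplifying assumption that happens to fit
these rules; it is not Hunspell's implementation.

```python
import re

AFF = """\
SFX D Y 4
SFX D   0 ed [^e] is:past_1
SFX D   0 d e     is:past_1
SFX D   0 ed [^e] is:past_2
SFX D   0 d e     is:past_2
SFX S Y 1
SFX S   0 s . is:plur
SFX Q Y 1
SFX Q   0 s . is:sg_3
SFX T Y 2
SFX T   0 ing [^e] is:pr_part
SFX T   e ing e    is:pr_part
"""


def parse_suffix_rules(text):
    """Return (flag, strip, add, condition, tag) tuples for SFX rule lines."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "SFX":
            continue  # skips class headers such as "SFX D Y 4"
        flag, strip, add, cond = parts[1:5]
        tag = parts[5] if len(parts) > 5 else ""
        rules.append((flag, "" if strip == "0" else strip, add, cond, tag))
    return rules


def generate(stem, flags, tag, rules):
    """Generate the forms of `stem` that carry the morphological `tag`."""
    forms = []
    for flag, strip, add, cond, rule_tag in rules:
        if flag not in flags or rule_tag != tag:
            continue
        # Assumption: the condition behaves like a regex on the stem ending.
        if not re.search(cond + "$", stem) or not stem.endswith(strip):
            continue
        base = stem[: len(stem) - len(strip)]
        forms.append(base + add)
    return forms


RULES = parse_suffix_rules(AFF)
print(generate("consume", "TQD", "is:past_1", RULES))
print(generate("consume", "TQD", "is:pr_part", RULES))
print(generate("rodent", "S", "is:plur", RULES))
```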

and the thesaurus (without any extra morphological information):

$ cat morph.dat
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume
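
The layout of this file, as far as it can be read from the listing, is: an
encoding line, then for each entry a "word|number of meanings" line
followed by that many "(part of speech)|synonym|synonym..." lines. A
minimal Python sketch that reads this layout follows; parse_thesaurus is a
hypothetical helper written for illustration, not part of MyThes.

```python
DAT = """\
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume
"""


def parse_thesaurus(text):
    """Return (encoding, {word: [(part_of_speech, [synonyms]), ...]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    encoding = lines[0].strip()
    entries = {}
    i = 1
    while i < len(lines):
        word, count = lines[i].split("|")
        count = int(count)
        meanings = []
        for line in lines[i + 1 : i + 1 + count]:
            pos, *synonyms = line.split("|")
            meanings.append((pos, synonyms))
        entries[word] = meanings
        i += 1 + count  # jump to the next "word|count" line
    return encoding, entries


encoding, entries = parse_thesaurus(DAT)
print(encoding)
print(entries["consume"])
```

Note that the entries contain base forms only; the inflected forms come
from the Hunspell dictionary, not from the thesaurus.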

Regards,
László


2010/9/27 Andrea Pescetti <[hidden email]>:

> [...]
