Questions to implement an hunspell stemmer in Java


Questions to implement an hunspell stemmer in Java

Frédéric Glorieux

Hi all,

Sorry for cross-posting, in case someone has already read this message
on the users list.

I work on European medieval languages (Latin for now, and soon Old
French and Occitan). Searching in these texts needs stemming. I have
implemented a first working prototype of a Java stemmer that uses
hunspell dic and aff files, on medieval Latin (the code will of course
be open source). We started from the dict-la extension project
http://extensions.services.openoffice.org/project/dict-la (thanks a
lot). I have two questions that I would like to solve in your way, to be
sure that the lexical resources developed in our context can also be
used with hunspell. Sadly, I cannot read C well, so I have to ask my
questions in (bad) English.

The first problem with medieval languages is that they have no fixed
orthography. For example, “philosophia”, “fylosofya”, “phylozofia”, and
all the possible combinations are correct spellings, because these are
the spellings found in the manuscripts. The Latin stems~lemmas in
la.dic should, if possible, keep their classical spelling, with "ph"
for words coming from Greek (philosophia) but "f" for others (faber),
and the same for "y" (gymnasium, icon) and others. If I understand
hunspell correctly, "ph f" or "y i" should therefore not be ICONV rules
(unlike "æ ae, è e, ę e...").

I tried REP rules for a while, but I was afraid of all the possible
combinations (y i, i y, z s, s z, ph f, f ph). In a spellchecker it is
not critical if the right word is not suggested, or if the suggestion
takes time, but for a stemmer too many lookups are expensive. So I
implemented a kind of PHONE rules. The code works, but I am not really
proud of what I have done. First, I have not really understood the
aspell syntax, which seems to date from a pre-regex era, like Porter's
Snowball, so I concluded that I would not be able to explain it to
linguists. For now, to stay compatible with hunspell, I only use simple
substitutions (like REP rules) such as "ph f", sometimes verbose (bb b,
cc c, dd d...).

The implementation is also a problem. How should the rules be applied?
I chose the way that is easiest for the rule writer to understand: a
sequence, a program. Real example: 1) ph f, 2) ch k, 3) h _ (strip the
remaining 'h' once 'ph' and 'ch' have been resolved). What should be
done with a PHONE result? For now, I maintain a map of the dic file
with the phone reduction as key and the stems~lemmas as values. Should
I apply the phone rules to the affixes? I must confess that I simply
added the needed affixes (ex: (ros)-ae=(ros)-e), which was faster than
coding it. Any advice is welcome on the best way to keep linguistic
knowledge about medieval Latin in hunspell syntax.
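
A minimal Java sketch of the approach described above (an ordered
sequence of plain substitutions, then a map from the phone reduction to
the stems~lemmas). It is not the actual stemmer code and not hunspell's
PHONE implementation. The class and method names are hypothetical, and
the directions chosen for "y i" and "z s" are assumptions made for the
illustration; only "ph f", "ch k" and "h _" come from the real example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: ordered substitutions plus a reduced-key index. */
final class PhoneIndexSketch {
    // LinkedHashMap keeps insertion order, so the rules run as a sequence.
    private final Map<String, String> rules = new LinkedHashMap<>();
    // Phone reduction -> stems~lemmas that share this reduction.
    private final Map<String, List<String>> index = new TreeMap<>();

    /** Appends a rule; an empty replacement strips the matched letters. */
    PhoneIndexSketch rule(String from, String to) {
        rules.put(from, to);
        return this;
    }

    /** Applies every rule once, in declaration order, to the lowercased word. */
    String reduce(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> r : rules.entrySet()) {
            key = key.replace(r.getKey(), r.getValue());
        }
        return key;
    }

    /** Registers a dictionary stem under its phone reduction. */
    void add(String lemma) {
        index.computeIfAbsent(reduce(lemma), k -> new ArrayList<>()).add(lemma);
    }

    /** Returns the stems whose reduction equals the reduction of the form. */
    List<String> lookup(String form) {
        return index.getOrDefault(reduce(form), Collections.emptyList());
    }

    public static void main(String[] args) {
        PhoneIndexSketch idx = new PhoneIndexSketch()
            .rule("ph", "f").rule("ch", "k").rule("h", "") // rules 1) 2) 3)
            .rule("y", "i").rule("z", "s");                // assumed directions
        idx.add("philosophia");
        idx.add("faber");
        System.out.println(idx.lookup("phylozofia"));
    }
}
```

With these rules, “philosophia”, “fylosofya” and “phylozofia” all
reduce to the same key, so one map lookup replaces the many candidate
lookups that symmetric REP pairs would generate. Inflected forms would
still need the affix rules applied before or after the reduction.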

The second problem is irregular verbs. Like English (write, wrote,
written), Latin (classical or medieval) has a lot (~3500) of irregular
verbs (ex: concedo, concessi, concessum). In the dic files I was able to
understand (in fact, English and Latin), the solution was to open a dic
entry for each irregular verbal radical. That is surely perfect for a
spellchecker, but it is a big problem for stemming (searching for
"concedo" will not find "concessimus", because this form is stemmed as
"concessi"). The documentation seems to promote another approach, the
optional data fields:
  sing al:sang al:sung
  sang st:sing
  sung st:sing
English affix files do not seem to follow this syntax yet. Is it too
early to use it? What could break? For very irregular conjugations (ex:
la:sum, fr:être), the common solution seems to be to open a dic line for
each form. But in Latin, a verb like sum appears in different compounds
with very different meanings. It is not a good idea to reduce
"presentes" to the stem "sum" by a "prae" (or "pre") prefix rule. A
better approach seems to be to keep the complete conjugation of "sum" in
the affix rules. But is it still a hunspell limitation that a stem
cannot be stripped completely? (ex: "sum", "erat"; "sum/." "SFX . sum
erat sum").
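
To make the stemming use of the optional data fields concrete, here is
a small Java sketch that reads dic lines like the ones quoted above and
maps each entry to its stem, falling back to the entry itself when no
st: field is present. The class and method names are hypothetical, the
"concessi st:concedo" line is an invented illustration and not taken
from any existing dictionary, and the parsing is deliberately
simplified (no entries containing spaces, no escaped slashes).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: read st: fields from dic lines to find stems. */
final class StemFieldSketch {

    /** Maps each dictionary entry to its st: value, or to itself. */
    static Map<String, String> stemsOf(List<String> dicLines) {
        Map<String, String> stems = new HashMap<>();
        for (String line : dicLines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens[0].isEmpty()) {
                continue; // blank line
            }
            // Drop the flags after '/', e.g. "sum/." -> "sum".
            String word = tokens[0].split("/", 2)[0];
            String stem = word; // default: the entry is its own stem
            for (int i = 1; i < tokens.length; i++) {
                if (tokens[i].startsWith("st:")) {
                    stem = tokens[i].substring(3);
                }
            }
            stems.put(word, stem);
        }
        return stems;
    }

    public static void main(String[] args) {
        Map<String, String> stems = stemsOf(Arrays.asList(
            "sing al:sang al:sung",
            "sang st:sing",
            "sung st:sing",
            "concessi st:concedo")); // invented example line
        System.out.println(stems.get("sung"));
        System.out.println(stems.get("concessi"));
    }
}
```

In a stemmer, the affix rules would first reduce "concessimus" to the
entry "concessi", and a lookup of this kind would then return
"concedo" as the stem to index.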

Sorry for such a long and compact message; your patience is rewarded by
a little demo:

http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=gratia
gratia finds gratiam (an inflection rule) but also graciam (a phone rule)
http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=dico
dico also finds dictum or dixerunt (st: optional field).
The idea came from this project, http://code.google.com/p/lucene-hunspell/,
but the code is written from scratch and is less Lucene-centric.

Thanks in advance for any advice; I would be glad not to code on sand.

--
Frédéric Glorieux

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Questions to implement an hunspell stemmer in Java

Frédéric Glorieux
Answering myself, for the archives.

Here is a working public example of a hunspell dictionary used as a
stemmer with some specific Java code:
http://ducange.enc.sorbonne.fr/?q=philosophia&f=text
The “Du Cange” is a dictionary of medieval Latin (10 vol.). Searching
for “philosophia” finds “Philosophiam” and “Phylosophya” (and even
φιλοσοφία, through some transliteration outside the hunspell resources).

2.2) (être, fero, sum…) FULLSTRIP
SOLVED.

2.1) (irregular verbs) st:, al:. I checked the Hungarian files: no
example found in hu_HU.dic, some in hu_HU, but I am unable to understand
the linguistic reasons behind them.
SOLVED ?

1) PHONE
advice still needed, thanks in advance.
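
For point 2.2, a minimal Java sketch of what "full strip" means on the
stemmer side: a suffix rule such as "SFX . sum erat sum" is applied in
reverse, and the form is accepted even though nothing of the stem is
left once the added suffix is removed. This is an illustration only,
not hunspell's own code; the class and method names are hypothetical,
and the condition is treated as a literal ending rather than as
hunspell's condition pattern (pass "" for "no condition").

```java
import java.util.Optional;

/** Hypothetical sketch: reverse one suffix rule, with or without full strip. */
final class FullStripSketch {

    /**
     * Undoes a suffix rule (strip, add, condition) on an inflected word.
     * Returns the candidate stem, or empty if the rule does not apply.
     */
    static Optional<String> unsuffix(String word, String strip, String add,
                                     String condition, boolean fullStrip) {
        if (!word.endsWith(add)) {
            return Optional.empty();
        }
        // What remains of the stem once the added suffix is removed.
        String root = word.substring(0, word.length() - add.length());
        if (root.isEmpty() && !fullStrip) {
            return Optional.empty(); // the whole stem was stripped: refused
        }
        String stem = root + strip;
        // Simplification: the condition is a literal ending of the stem.
        return stem.endsWith(condition) ? Optional.of(stem) : Optional.empty();
    }

    public static void main(String[] args) {
        // "SFX . sum erat sum": strip "sum", add "erat", condition "sum".
        System.out.println(unsuffix("erat", "sum", "erat", "sum", true));
        System.out.println(unsuffix("erat", "sum", "erat", "sum", false));
        // An ordinary rule (strip "a", add "ae") works either way.
        System.out.println(unsuffix("rosae", "a", "ae", "a", false));
    }
}
```

The candidate stem would still have to be looked up in the dic file
with the matching flag before it is accepted.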

--
Frédéric Glorieux
http://elec.enc.sorbonne.fr/



Re: Questions to implement an hunspell stemmer in Java

Olivier R.-2
Hi,

On 30/11/2010 10:29, Frédéric Glorieux wrote:

> 1) PHONE
> advice still needed, thanks in advance.

Have a look at the latest English dictionary, which uses the PHONE
command. For a long time I thought that the morphological field “ph:”
was necessary to make this command work, but it seems that I was wrong.

Best regards,
--
Olivier R.

** E-mail dedicated to mailing-lists.                             **
** Messages from anywhere else are _automatically_ erased.        **


Re: Questions to implement an hunspell stemmer in Java

Frédéric Glorieux
On 30/11/10 10:41, Olivier R. wrote:

> Hi,
>
> On 30/11/2010 10:29, Frédéric Glorieux wrote:
>
>> 1) PHONE
>> advice still needed, thanks in advance.
>
> Have a look at the latest English dictionary, which uses the PHONE
> command. For a long time I thought that the morphological field “ph:”
> was necessary to make this command work, but it seems that I was wrong.
>
> Best regards,

Hi Olivier,

I read the historical aspell rules in en_AU.aff. As I said in my
previous message, the syntax seems tricky and not easy to teach to
linguists. Before implementing something more complete, I would be glad
to have László's opinion on this kind of feature. It is critical for the
languages I am working on (medieval).

About the ph: field found in the Hungarian files: it would be
interesting to understand how the program uses it.

--
Frédéric
