Notes on hunspell-ln


Kevin Scannell-2
This is a follow-up to my previous announcement with some notes on the
Lingala hunspell package.

Lingala is a Bantu language and as such has a very complicated verbal
morphology.  This complexity has made it difficult to develop
open-source spell checkers for other Bantu languages - the existing
packages are simple word lists that don't even attempt to crack the
verb system.  Only the Swahili package is large enough to provide
decent coverage of everyday texts, but it could still be improved.

The hunspell-ln package is the first attempt I know of to handle Bantu
verbal morphology completely in an affix file.  In addition, Lingala
is a tone language and has vowel harmony marked orthographically -
both of these features are handled correctly in the affix file as
well.   With all of this in mind I'd encourage anyone interested in
Bantu languages to have a look at the "developer's pack" for
hunspell-ln in the SVN repository here:

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/

It's best to start with the README-dev file, which describes the
different files in the developer's pack, and some useful makefile
targets for dictionary maintenance and development.

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/README-dev?view=markup

Noun classes are stored in the nc* files, and the words in them are
assigned appropriate affix flags that generate the correct plurals.
This much is straightforward.

Verbs in Lingala are formed from a "radical", e.g. "bák", to which
various optional semantic adjuncts can be added in more-or-less
predictable ways, "bákis", "bákisam", "bákisel", etc.   To these are
added obligatory prefixes and suffixes indicating personal pronouns
and tense.   After a lot of experimentation, the best solution for
spell checking seems to be to store the radicals+adjuncts as words in
the .dic file, and add the prefixes and suffixes using the affix file.
  There are many reasons for this choice - among them the fact that
it keeps the necessary affixes within the scope of what
hunspell can handle.   Also, it is very difficult to predict which
semantic adjuncts "work" with which radicals, and in which order and
in which combinations.  This all depends on semantics, and so (in my
view) is best left to lexicography rather than automatic generation.  And in
reality, there aren't that many of these combinations used in everyday
Lingala.   The 900 or so in the first release account for at least 90%
of the verb forms in the web corpus (up to diacritic differences).

In any case, the combined radicals+adjuncts are stored in the
(poorly-named) "radicals.txt" in the repository.   A perl script adds
the correct affix flags (two for each word in radicals.txt - "W" for
high tone suffixes and "V" for low tone suffixes) and enters the words
in ln_CD.dic.
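
As a rough illustration of that step (not the actual perl script, which
lives in the repository), here is a minimal Python sketch.  The function
name build_dic and the flag string "ZVW" are assumptions made for the
example; the only hunspell convention relied on is that a .dic file
starts with a line giving the number of entries.

```python
def build_dic(radicals, flags="ZVW"):
    """Turn lines of radicals+adjuncts into the text of a .dic file.

    `flags` is an assumed combination: Z (NEEDAFFIX), V (low tone
    suffixes) and W (high tone suffixes), as described in the post.
    """
    # Drop blank lines and duplicates, then sort for a stable output.
    entries = sorted({line.strip() for line in radicals if line.strip()})
    # The first line of a .dic file is the entry count.
    lines = [str(len(entries))] + [f"{entry}/{flags}" for entry in entries]
    return "\n".join(lines) + "\n"
```

Calling build_dic(open("radicals.txt", encoding="utf-8")) would return
the text to write to the .dic file.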

The affix file uses some special features of hunspell to handle these
words.  First, the words from radicals.txt also get a "Z" flag, which
is declared as the "NEEDAFFIX" flag in the affix file - this is because
the tense/pronoun affixes are required, and the radicals themselves are
usually not valid words.    Then, the required prefix/suffix pair is
treated as a circumfix X (see the "CIRCUMFIX" declaration in the affix
file) - so a typical suffix (here a subset of "V" for the habitual
tense) is implemented as follows:


SFX V Y 63
...
SFX V 0 aka/PX  .   +HABITUAL1
SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́]   +HABITUAL1
SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]   +HABITUAL1
SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
...


PFX P Y 8
PFX P 0 na/X    .
PFX P 0 o/X .
PFX P 0 a/X .
PFX P 0 e/X .
PFX P 0 to/X    .
PFX P 0 bo/X    .
PFX P 0 ba/X    .
PFX P 0 i/X .

Starting from the .dic entry "bikol/ZV", the V suffix produces
"bikolaka/PX", and the P prefix then gives "nabikolaka", "obikolaka",
etc.   The complicated-looking cases in the V suffix handle vowel
harmony - "bɔtɔl/ZV" would become "bɔtɔlɔkɔ/PX", etc.
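
To make the interaction of these rules concrete, here is a small Python
sketch that mimics the excerpt above.  It is not hunspell itself: the
names habitual_stems, habitual_forms and accepts are made up for the
example, and only the habitual rules and the eight prefixes quoted above
are modelled.  Each suffix whose condition matches the end of the
radical is applied, so as excerpted the unconditional "aka" rule applies
to every radical, alongside any harmonizing rule that also matches.

```python
import re
import unicodedata

VOWELS = "aeiou\u025b\u0254"  # a e i o u plus open e and open o
ACUTE = "\u0301"              # combining acute accent (high tone)
CONS = f"[^{VOWELS}{ACUTE}]"  # any character that is not a vowel or tone mark

# (suffix, condition on the end of the radical), mirroring the SFX V lines.
HABITUAL_RULES = [("aka", re.compile(""))]  # "." condition: always applies
for v in ("\u025b", "\u0254"):
    for n in (1, 2, 3):  # harmonizing vowel followed by 1 to 3 consonants
        cond = re.compile(f"{v}{ACUTE}?{CONS}{{{n}}}$")
        HABITUAL_RULES.append((v + "k" + v, cond))

# The eight subject prefixes from the PFX P block.
SUBJECT_PREFIXES = ["na", "o", "a", "e", "to", "bo", "ba", "i"]

def habitual_stems(radical):
    """Apply every habitual suffix whose condition matches the radical."""
    radical = unicodedata.normalize("NFD", radical)
    return [radical + sfx for sfx, cond in HABITUAL_RULES if cond.search(radical)]

def habitual_forms(radical):
    """Circumfix behaviour: a suffixed stem only counts with a prefix."""
    return [unicodedata.normalize("NFC", pfx + stem)
            for stem in habitual_stems(radical)
            for pfx in SUBJECT_PREFIXES]

def accepts(word, radicals):
    """Rough NEEDAFFIX check: bare or half-affixed radicals are rejected."""
    word = unicodedata.normalize("NFC", word)
    return any(word in habitual_forms(r) for r in radicals)
```

With this sketch, accepts("nabikolaka", ["bikol"]) holds, while "bikol"
and "bikolaka" on their own are rejected, which is the effect the
NEEDAFFIX and CIRCUMFIX flags are meant to have.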

Everything appears to work nicely.   To this point I've only looked at
the morphology of three Bantu languages in any deep way - Lingala,
Kinyarwanda, and Swahili, but my naive hope is that this approach
could serve as a model for developing hunspell packages for other
Bantu languages.   The top candidates (based on having found a
sufficient amount of text on the web with a crawler) would be:
Kikongo, Kikuyu, Luganda, Ndebele (nd/nr), Ndonga, Northern Sotho,
Nyanja/Chichewa, Rundi, Kinyarwanda, Swati, Sesotho, Swahili,
Setswana, Tsonga, Venda, Xhosa, and Zulu.

Comments, questions, suggestions appreciated.

Kevin

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Notes on hunspell-ln

Németh László-2
Hi,

I'm very glad to hear your success story about handling complex
morphology. There are tools for simpler morphologies (I will post a
letter about automatic compression of 900 thousand Breton words next
week), but we cannot list and compress the possible words of a
language with complex morphology. I plan to develop statistical
versions of the affixcompress tools, but it would be good to support
dictionary development based on formal descriptions, too. Any
automated method has a big advantage (for example, the source of the
Hungarian spelling dictionary uses (undocumented) awk scripts and M4
macros), but the best tool would be a (restricted) compiler that
supports morphological descriptions based on two-level morphology for
Hunspell dictionaries (see also
http://www.lrec-conf.org/proceedings/lrec2008/pdf/274_paper.pdf), or a
similar generalized morphology description language.

Regards,
László



2009/4/8 Kevin Scannell <[hidden email]>:
