Automatic dictionary development and optimization (Re: Questions regarding Hunspell format for new Norwegian dictionaries)


Németh László-2
Hi,

2009/2/6 Karl Ove Hufthammer <[hidden email]>:
> ... could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file. It seems to work well. Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

The affixcompress script compresses a huge word list by searching for
potential stems and affixes. The most important results of affix
compression are a smaller memory footprint and a shorter loading time.
(In fact, affix-rich languages need affix compression.) Lookups of the
compressed words may be slower, but the time-consuming dictionary-based
(n-gram and phonetic) suggestion is much faster with smaller dic files,
and suggestion speed is the bottleneck during normal use of a spelling
dictionary.
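
A minimal sketch of the idea behind affix compression, in Python. This is
not the algorithm used by the affixcompress script; the function name
compress, the toy word list and the fixed suffix set are assumptions made
for illustration only. A word is folded into a shorter entry when removing
a known suffix leaves another word from the list:

```python
from collections import defaultdict

def compress(words, suffixes):
    """Fold 'stem + suffix' words into the entry of a stem found in the list."""
    wordset = set(words)
    stems = defaultdict(set)   # stem -> suffixes that attach to it
    covered = set()            # words that no longer need their own entry
    # Shorter words first, so a stem is always decided before its derivatives.
    for word in sorted(wordset, key=lambda w: (len(w), w)):
        for suf in suffixes:
            stem = word[: len(word) - len(suf)]
            if word.endswith(suf) and stem in wordset and stem not in covered:
                stems[stem].add(suf)
                covered.add(word)
                break
    return {w: sorted(stems.get(w, ())) for w in sorted(wordset - covered)}

# Toy word list (illustrative data, not taken from /usr/share/dict/words).
words = ["wind", "winds", "window", "windows",
         "walk", "walks", "walked", "walking"]
entries = compress(words, ["s", "ed", "ing"])
for stem, sufs in entries.items():
    print(stem, sufs)
print(len(words), "words ->", len(entries), "entries")
```

The fewer entries remain, the less work there is for any step that has to
scan the whole dic file.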

Example

Generating a compressed dictionary from the standard English word list
on Linux (/usr/share/dict/words):

$ LC_ALL=C sort /usr/share/dict/words >en
$ hunspell-1.2.8/src/tools/affixcompress en 1000

The compressed dictionary contains only 30 thousand words in the file
en.dic instead of the 99 thousand words of the original word list. The
file en.aff contains the requested 1000 affixes, but it lacks some basic
settings (the SET character encoding, the TRY definition, etc.).

Alias compression is an optimization method for the dictionaries of
affix-rich (agglutinative) languages, but it reduces memory usage and
improves the affix analysis here, too:

$ hunspell-1.2.8/src/tools/makealias en.dic en.aff
output: en_alias.dic, en_alias.aff

Memory usage (RSS, VSZ fields are in kB):

$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0 &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias &
$ ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid | sed -n '1p;/lt-hunspell/p'
  PID  PPID   RSS    VSZ %CPU %MEM CMD
 6767  6314  5564   8956  0.1  0.2 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0
 6768  6314  3444   6776  0.0  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en
 6769  6314  3160   6492  0.1  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias

(The "en0" dictionary is the original word list without affix compression:
$ cp /usr/share/dict/words en0.dic
$ touch en0.aff)

Also Hunspell 1.2.x has a ZIP-like compression format for dictionary
compression:

$ hunspell-1.2.8/src/tools/hzip en.* en_alias.*
$ ls -lh en.* en_alias.*
-rw-r--r-- 1 laci laci  29K 2009-02-19 15:16 en.aff
-rw-r--r-- 1 laci laci 9,5K 2009-02-19 16:04 en.aff.hz
-rw-r--r-- 1 laci laci 425K 2009-02-19 15:16 en.dic
-rw-r--r-- 1 laci laci 197K 2009-02-19 16:04 en.dic.hz
-rw-r--r-- 1 laci laci 166K 2009-02-19 15:36 en_alias.aff
-rw-r--r-- 1 laci laci  74K 2009-02-19 16:04 en_alias.aff.hz
-rw-r--r-- 1 laci laci 311K 2009-02-19 15:36 en_alias.dic
-rw-r--r-- 1 laci laci 131K 2009-02-19 16:04 en_alias.dic.hz

The hzip-compressed en_alias dictionary needs 205 kB of disk space. (The
original Linux English word list was 910 kB.)
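
The 205 kB figure is the sum of the two .hz files in the listing above.
A small Python check of that arithmetic (the variable names are
illustrative; the sizes are the rounded values printed by ls -lh):

```python
# Sizes in kB, copied from the ls -lh listing and the text above.
hz_sizes = {"en_alias.aff.hz": 74, "en_alias.dic.hz": 131}
plain_sizes = {"en_alias.aff": 166, "en_alias.dic": 311}
original_word_list = 910

total_hz = sum(hz_sizes.values())
total_plain = sum(plain_sizes.values())
ratio = total_hz / original_word_list

print(f"plain alias dictionary: {total_plain} kB")
print(f"hzip alias dictionary:  {total_hz} kB "
      f"({ratio:.1%} of the original word list)")
```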

Measuring suggestion speed

(The affix file of the dictionaries was extended with the following header:

TRY qwertzuiopasdfghjklyxcvbnm'-
WORDCHARS '-)
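
Adding such a header can be scripted. A minimal Python sketch, assuming
the generated affix text has no TRY line yet; the names HEADER and
with_header are hypothetical helpers, not part of the Hunspell tools:

```python
HEADER = "TRY qwertzuiopasdfghjklyxcvbnm'-\nWORDCHARS '-\n"

def with_header(aff_text, header=HEADER):
    """Return aff_text with the header prepended unless TRY is already set."""
    has_try = any(line.startswith("TRY ") for line in aff_text.splitlines())
    return aff_text if has_try else header + aff_text

# Illustrative affix fragment, not the real en.aff.
sample = "SFX A Y 1\nSFX A 0 s .\n"
patched = with_header(sample)
print(patched, end="")
```

Applying it twice leaves the text unchanged, so the script can be rerun
after each affixcompress run.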

Generate random misspelled words, 9 characters long, from the Linux word list:

$ sed -n '1~500p' /usr/share/dict/words | tr -d '\n' | grep -o '.........' >bad.txt
$ wc -l <bad.txt
184
$ tail bad.txt
'sunderta
kersunoff
iciallyup
rootedvea
ledvindic
ationwage
redwavere
dwhiniest
wintrywra
pzipper's

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en0
...
& dwhiniest 6 0: whiniest, d whiniest, shiniest, whinniest, grainiest, brainiest
& wintrywra 4 0: wintry, wintrier, wintriest, wintery
& pzipper's 6 0: zipper's, p zipper's, clipper's, shipper's, skipper's, slipper's
13.75user 0.01system 0:13.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4945minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en
...
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.20user 0.01system 0:04.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en_alias
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.17user 0.01system 0:04.18elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4297minor)pagefaults 0swaps

So suggestion time depends strongly on the word count of the dic file.
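
Why the entry count matters can be seen in a simplified n-gram
suggester. The Python sketch below is not Hunspell's scoring function;
the names ngrams, ngram_score and suggest and the tiny word list are
assumptions for illustration. The point is only that every dictionary
entry has to be scored for each misspelled word:

```python
def ngrams(word, n):
    """Return the set of character n-grams of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_score(a, b, max_n=3):
    """Count the distinct n-grams (n = 1..max_n) shared by a and b."""
    return sum(len(ngrams(a, n) & ngrams(b, n)) for n in range(1, max_n + 1))

def suggest(bad, dic_words, limit=3):
    # Every entry is scored, so the cost grows with len(dic_words).
    ranked = sorted(dic_words, key=lambda w: (-ngram_score(bad, w), w))
    return ranked[:limit]

dic = ["whiniest", "shiniest", "wintry", "zipper", "window"]
print(suggest("dwhiniest", dic, limit=2))
```

With one third of the entries, roughly one third of the scoring work
remains, which is in line with the timings above.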

Languages with complex morphology can use the second-level affixation
of Hunspell. There is a new tool, "doubleaffixcompress"
(http://downloads.sourceforge.net/hunspell/doubleaffixcompress), that
compresses the output of the affixcompress script, or other Hunspell
dictionaries, using second-level affixes. For example, on the old en_US
dictionary of OpenOffice.org we got a 50% compression rate:

$ doubleaffixcompress en_US
$ wc -l en_US.dic new_en_US.dic
  62157 en_US.dic
  30442 new_en_US.dic
$ grep abolish en_US.dic
abolisher/M
abolish/LZRSDG
abolishment/MS
$ grep abolish new_en_US.dic
abolish/5193,6535,64991,64993,64995,64996,64997,65001
$ grep '\(5193\|6535\)' new_en_US.aff
SFX  5193 Y 1
SFX  5193 0 er/64999 .
SFX  6535 Y 1
SFX  6535 0 ment/64997,64999 .
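
How one stem entry can replace three is easier to see in a small model
of second-level suffixes. In the Python sketch below the flag names
(ER, MENT, POSS, PL) and the rule table are hypothetical; the contents
of the numeric classes such as 64999 are not shown above, so the
continuation suffixes here are an assumption based on the original
abolisher/M and abolishment/MS entries:

```python
def expand(stem, flags, rules):
    """Return stem plus every form reachable through at most two suffix levels."""
    forms = {stem}
    for flag in flags:
        for suffix, continuation in rules.get(flag, []):
            first = stem + suffix
            forms.add(first)
            # Second level: suffixes allowed after the first-level suffix.
            for flag2 in continuation:
                for suffix2, _ in rules.get(flag2, []):
                    forms.add(first + suffix2)
    return sorted(forms)

# Hypothetical rule table: flag -> list of (suffix, continuation flags).
rules = {
    "ER": [("er", ["POSS"])],
    "MENT": [("ment", ["POSS", "PL"])],
    "POSS": [("'s", [])],
    "PL": [("s", [])],
}
forms = expand("abolish", ["ER", "MENT"], rules)
print(forms)
```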

A more important result was obtained on the (too big) he_IL dictionary,
which recognizes more than 100 million Hebrew word forms:

$ LC_ALL=C doubleaffixcompress he_IL
$ wc he_IL.dic new_he_IL.dic
 329237  328996 3212113 he_IL.dic
  37913   37879 1940612 new_he_IL.dic
$ LC_ALL=C ~/hunspell-1.2.8/src/tools/makealias new_he_IL.{dic,aff}
output: new_he_IL_alias.dic, new_he_IL_alias.aff

Memory usage has been reduced from 19 MB to 5.5 MB by
doubleaffixcompress and makealias.

2009/2/6 Karl Ove Hufthammer <[hidden email]>:
> Hi!
>
> I couldn't find a mailing list for questions regarding Hunspell, so I'm writing
> to you. Please feel free to direct me to the relevant mailing list or forum
> instead of answering me directly.

I will post your letter to the Lingucomponent development list of
OpenOffice.org with a detailed example.

>
> I am about to create a new spellchecker for the Norwegian Nynorsk language
> (and possibly Norwegian Bokmål too), based on Hunspell. However, I have some
> questions on how to best proceed.
>
> We are lucky enough to have access to a (GPL 3+-based) fullform dictionary for
> Norwegian, which most other languages using Hunspell don't seem to have.
> But I'm not sure how to best make use of the information in this database. Here
> is an example output, for the word «hoppe»:
>
> 37933   hoppe   hoppe   subst fem appell eint ub
> 37933   hoppe   hoppa   subst fem appell eint ub
> 37933   hoppe   hoppa   subst fem appell eint bu
> 37933   hoppe   hopper  subst fem appell fl ub
> 37933   hoppe   hoppor  subst fem appell fl ub
> 37933   hoppe   hoppene subst fem appell fl bu
> 37933   hoppe   hoppone subst fem appell fl bu
> 37934   hoppe   hoppe   verb inf
> 37934   hoppe   hoppa   verb inf
> 37934   hoppe   hoppar  verb pres
> 37934   hoppe   hoppast verb inf pres st-form
> 37934   hoppe   hoppa   verb pret
> 37934   hoppe   hoppa   verb perf-part
> 37934   hoppe   hoppa   adj <perf-part> nøyt ub eint
> 37934   hoppe   hoppa   adj <perf-part> m/f ub eint
> 37934   hoppe   hoppa   adj <perf-part> bu eint
> 37934   hoppe   hoppa   adj <perf-part> fl
> 37934   hoppe   hoppande        adj <pres-part>
> 37934   hoppe   hopp    verb imp
> 37934   hoppe   hoppe   verb imp
> 37934   hoppe   hoppa   verb imp
>
> (Here the code «subst» means noun. And yes, we *do* have words with more
> irregular inflection in Norwegian too. :) )
>
> As indicated by the numeric code, there are actually two root words «hoppe».
> One (37933) is a noun, meaning mare (female horse), and the other (37934) is a
> verb, meaning «jump». The adjective (code «adj») is derived from the
> verb, and therefore has the same code as it. «fem» is the gender, «eint» means
> singular, and «ub» and «bu» mean indefinite and definite form, respectively.
>
> Is this information of any use when generating the dictionary file, and how can
> I use it? From what I've read about hunspell, the main part of the affix file is
> only used as a way to compress the dictionary, and doesn't have any effect on
> which words are suggested by hunspell.
>
> If so, could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file. It seems to work well. Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

Automatic compression is perfect for a spelling dictionary, but the
upcoming thesaurus extension needs real data for stemming and needs
extra information for morphological generation.

Automatic dictionary compression has one drawback for stemming: it can
produce artificial morphology, for example:

$ hunspell -d en
windows
+ wind

(This is not good for dictionary-based suggestion, either.)

Future versions of affixcompress will be able to use word frequency
data to correct the stem analysis.
Your dictionary development needs a new script that keeps the real stems
(you can add irregular forms to the dic file:
"mice st:mouse", see
http://www.openoffice.org/issues/show_bug.cgi?id=19563) and encodes the
morphological information in the dictionary. When you need this
development for the Norwegian thesaurus, I will help you.
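
A rough sketch of the first step of such a script, in Python. It reads
rows in the fullform layout quoted above (id, lemma, word form, tags)
and writes each form that differs from its lemma with an st: field, as
in the "mice st:mouse" example. The function name fullform_to_dic is
hypothetical, the output is not affix-compressed, and a real script
would reserve this for forms that the affix rules cannot generate:

```python
def fullform_to_dic(rows):
    """Turn whitespace-separated fullform rows into 'form st:lemma' lines."""
    seen = set()
    lines = []
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        lemma, form = fields[1], fields[2]
        if (form, lemma) in seen:
            continue  # the same form can appear with several tag sets
        seen.add((form, lemma))
        lines.append(form if form == lemma else f"{form} st:{lemma}")
    return lines

rows = [
    "37933   hoppe   hoppe   subst fem appell eint ub",
    "37933   hoppe   hoppa   subst fem appell eint bu",
    "37934   hoppe   hoppa   verb pret",
    "37934   hoppe   hoppar  verb pres",
]
dic_lines = fullform_to_dic(rows)
print("\n".join(dic_lines))
```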

Thanks for your questions.
Regards,
László

>
> Thanks in advance for your reply.
>
> --
> Regards,
> Karl Ove Hufthammer
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Automatic dictionary development and optimization (Re: Questions regarding Hunspell format for new Norwegian dictionaries)

ge-7
Hi, Karl Ove,

I set up a working example for the word you gave: hoppe

norge.aff:
-------------------------------
SET ISO8859-1
TRY esianrtolcdugmphbyfvkwäüößáéêàâñESIANRTOLCDUGMPHBYFVKWÄÜÖ

SFX A Y 9
SFX A e a e
SFX A e er e
SFX A e or e
SFX A e ar e
SFX A e ene e
SFX A e one e
SFX A e ast e
SFX A e ande e
SFX A e  0   e
---------------------------------

norge.dic:
---------------------------------
1
hoppe/A
---------------------------------
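
To check which forms these two files accept, the suffix rules can be
applied by hand or with a few lines of Python. The sketch below is a
simplified reading of the SFX lines (flag, stripped ending, added
ending, condition); it handles only literal conditions and "." and is
not a replacement for Hunspell's own tools. The names parse_sfx and
generate are illustrative:

```python
AFF = """\
SFX A Y 9
SFX A e a e
SFX A e er e
SFX A e or e
SFX A e ar e
SFX A e ene e
SFX A e one e
SFX A e ast e
SFX A e ande e
SFX A e 0 e
"""

def parse_sfx(aff_text):
    """Collect SFX rules as flag -> [(strip, add, condition), ...]."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields; the four-field header line is skipped.
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules

def generate(entry, rules):
    """Return the dictionary word followed by its suffixed forms."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:
        for strip, add, cond in rules.get(flag, []):
            strip = "" if strip == "0" else strip   # 0 means "nothing"
            add = "" if add == "0" else add
            cond = "" if cond == "." else cond      # . matches any ending
            if word.endswith(cond) and word.endswith(strip):
                forms.append(word[: len(word) - len(strip)] + add)
    return forms

forms = generate("hoppe/A", parse_sfx(AFF))
print(forms)
```

The result covers the distinct forms in the third column of the
fullform listing for «hoppe».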

The German .aff and .dic files are a good example for you, if
you can read German.

-eleonora

