How to get list of valid word in hunspell

Sunday Bolaji
Hi,
    Is there any way or command to get a list of all valid words in the Hunspell library, both the ones in the dictionary file and the ones generated by affix rules?
  Secondly, is there any way to let Hunspell know that one combined character written in different ways is the same character? For example, the character " ọ́ " can be written by first writing " o " and adding the dot below and the tone mark, or by first writing " ọ " and adding the tone mark, or by first writing " ó " and adding the dot below.

Regards,
Jeje




--- On Wed, 2/25/09, Németh László <[hidden email]> wrote:
From: Németh László <[hidden email]>
Subject: [lingu-dev] Automatic dictionary development and optimization (Re: Questions  regarding Hunspell format for new Norwegian dictionaries)
To: [hidden email], "Karl Ove Hufthammer" <[hidden email]>
Date: Wednesday, February 25, 2009, 7:12 AM

Hi,

2009/2/6 Karl Ove Hufthammer <[hidden email]>:
> ... could (should) I just use affixcompress on the words in the third
> column to generate a dictionary file. It seems to work well. Or is there a
> way to use the information on each word to *automatically* improve the
> suggestions (I will of course also add suggestion hints in the affix file
> manually), or reduce the dictionary size, or improve the speed for lookups
> and suggestions?

The affixcompress script compresses a huge word list by searching for
potential stems and affixes. The most important results of affix
compression are a smaller memory footprint and a shorter loading time.
(In fact, affix-rich languages need affix compression.) Compressed
dictionaries may have slower lookups for the compressed words, but the
time-consuming dictionary-based (n-gram and phonetic) suggestion is
much faster with smaller dic files (suggestion speed is the bottleneck
during normal use of spelling dictionaries).

Example

Generating a compressed dictionary from the standard English
word list of Linux (/usr/share/dict/words):

$ LC_ALL=C sort /usr/share/dict/words >en
$ hunspell-1.2.8/src/tools/affixcompress en 1000

The compressed dictionary contains only 30 thousand words in the file
en.dic instead of the 99 thousand words of the original word list. The
file en.aff contains the predefined 1000 affixes (but it misses some
of the default settings: SET character encoding, TRY definition, etc.).

Alias compression is an optimization method for the dictionaries of
affix-rich (agglutinative) languages, but it reduces memory
usage and improves affix analysis here, too:

$ hunspell-1.2.8/src/tools/makealias en.dic en.aff
output: en_alias.dic, en_alias.aff

Memory usage (RSS, VSZ fields are in kB):

$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0 &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias &
$ ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid | sed -n '1p;/lt-hunspell/p'
  PID  PPID   RSS    VSZ %CPU %MEM CMD
 6767  6314  5564   8956  0.1  0.2 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0
 6768  6314  3444   6776  0.0  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en
 6769  6314  3160   6492  0.1  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias

(The "en0" dictionary is the original word list without affix
compression:
$ cp /usr/share/dict/words en0.dic
$ touch en0.aff)

Also Hunspell 1.2.x has a ZIP-like compression format for dictionary
compression:

$ hunspell-1.2.8/src/tools/hzip en.* en_alias.*
$ ls -lh en.* en_alias.*
-rw-r--r-- 1 laci laci  29K 2009-02-19 15:16 en.aff
-rw-r--r-- 1 laci laci 9,5K 2009-02-19 16:04 en.aff.hz
-rw-r--r-- 1 laci laci 425K 2009-02-19 15:16 en.dic
-rw-r--r-- 1 laci laci 197K 2009-02-19 16:04 en.dic.hz
-rw-r--r-- 1 laci laci 166K 2009-02-19 15:36 en_alias.aff
-rw-r--r-- 1 laci laci  74K 2009-02-19 16:04 en_alias.aff.hz
-rw-r--r-- 1 laci laci 311K 2009-02-19 15:36 en_alias.dic
-rw-r--r-- 1 laci laci 131K 2009-02-19 16:04 en_alias.dic.hz

The hzip-compressed en_alias dictionary needs 205 kB of disk space. (The
size of the original Linux English word list was 910 kB.)

Measuring suggestion speed

(The affix file of the dictionaries was extended with the following header:

TRY qwertzuiopasdfghjklyxcvbnm'-
WORDCHARS '-)

Generate random misspelled words, 9 characters long, from the Linux word list:

$ sed -n '1~500p' /usr/share/dict/words | tr -d '\n' | grep -o '.........' >bad.txt
$ wc -l <bad.txt
184
$ tail bad.txt
'sunderta
kersunoff
iciallyup
rootedvea
ledvindic
ationwage
redwavere
dwhiniest
wintrywra
pzipper's

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en0
...
& dwhiniest 6 0: whiniest, d whiniest, shiniest, whinniest, grainiest, brainiest
& wintrywra 4 0: wintry, wintrier, wintriest, wintery
& pzipper's 6 0: zipper's, p zipper's, clipper's, shipper's, skipper's, slipper's
13.75user 0.01system 0:13.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4945minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en
...
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.20user 0.01system 0:04.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en_alias
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.17user 0.01system 0:04.18elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4297minor)pagefaults 0swaps

So suggestion time depends strongly on the word count of the dic file.

Languages with complex morphology can use the second-level affixation
of Hunspell. There is a new tool "doubleaffixcompress"
(http://downloads.sourceforge.net/hunspell/doubleaffixcompress) to
compress the output dictionary of the affixcompress script or other
Hunspell dictionaries using second-level affixes. For example, on the
old en_US dictionary of OpenOffice.org we got a 50% compression rate:

$ doubleaffixcompress en_US
$ wc -l en_US.dic new_en_US.dic
  62157 en_US.dic
  30442 new_en_US.dic
$ grep abolish en_US.dic
abolisher/M
abolish/LZRSDG
abolishment/MS
$ grep abolish new_en_US.dic
abolish/5193,6535,64991,64993,64995,64996,64997,65001
$ grep '\(5193\|6535\)' new_en_US.aff
SFX  5193 Y 1
SFX  5193 0 er/64999 .
SFX  6535 Y 1
SFX  6535 0 ment/64997,64999 .

A more significant result comes from the (too big) he_IL dictionary, which
recognizes more than 100 million Hebrew word forms:

$ LC_ALL=C doubleaffixcompress he_IL
$ wc he_IL.dic new_he_IL.dic
 329237  328996 3212113 he_IL.dic
  37913   37879 1940612 new_he_IL.dic
$ LC_ALL=C ~/hunspell-1.2.8/src/tools/makealias new_he_IL.{dic,aff}
output: new_he_IL_alias.dic, new_he_IL_alias.aff

Memory usage has been reduced from 19 MB to 5.5 MB by
doubleaffixcompress and makealias.

2009/2/6 Karl Ove Hufthammer <[hidden email]>:
> Hi!
>
> I couldn't find a mailing list for questions regarding Hunspell, so I'm
> writing to you. Please feel free to direct me to the relevant mailing list
> or forum instead of answering me directly.

I will post your letter to the Lingucomponent development list of
OpenOffice.org with a detailed example.

>
> I am about to create a new spellchecker for the Norwegian Nynorsk language
> (and possibly Norwegian Bokmål too), based on Hunspell. However, I have some
> questions on how to best proceed.
>
> We are lucky enough to have access to a (GPL 3+-based) fullform dictionary
> for Norwegian, which most other languages using Hunspell don't seem to have.
> But I'm not sure how to best make use of the information in this database.
> Here is an example output, for the word «hoppe»:
>
> 37933   hoppe   hoppe   subst fem appell eint ub
> 37933   hoppe   hoppa   subst fem appell eint ub
> 37933   hoppe   hoppa   subst fem appell eint bu
> 37933   hoppe   hopper  subst fem appell fl ub
> 37933   hoppe   hoppor  subst fem appell fl ub
> 37933   hoppe   hoppene subst fem appell fl bu
> 37933   hoppe   hoppone subst fem appell fl bu
> 37934   hoppe   hoppe   verb inf
> 37934   hoppe   hoppa   verb inf
> 37934   hoppe   hoppar  verb pres
> 37934   hoppe   hoppast verb inf pres st-form
> 37934   hoppe   hoppa   verb pret
> 37934   hoppe   hoppa   verb perf-part
> 37934   hoppe   hoppa   adj <perf-part> nøyt ub eint
> 37934   hoppe   hoppa   adj <perf-part> m/f ub eint
> 37934   hoppe   hoppa   adj <perf-part> bu eint
> 37934   hoppe   hoppa   adj <perf-part> fl
> 37934   hoppe   hoppande        adj <pres-part>
> 37934   hoppe   hopp    verb imp
> 37934   hoppe   hoppe   verb imp
> 37934   hoppe   hoppa   verb imp
>
> (Here the code «subst» means noun. And yes, we *do* have words with more
> irregular inflection in Norwegian too. :) )
>
> As indicated by the numeric code, there are actually two root words «hoppe».
> One (37933) is a noun, meaning mare (female horse), and the other (37934) is
> a verb, meaning «jump». The adjective (code «adj») is derived from the verb,
> and therefore has the same code as it. «fem» is the gender, «eint» means
> singular, and «ub» and «bu» mean indefinite and definite form, respectively.
>
> Is this information of any use when generating the dictionary file, and how
> can I use it? From what I've read about hunspell, the main part of the affix
> file is only used as a way to compress the dictionary, and doesn't have any
> effect on which words are suggested by hunspell.
>
> If so, could (should) I just use affixcompress on the words in the third
> column to generate a dictionary file. It seems to work well. Or is there a
> way to use the information on each word to *automatically* improve the
> suggestions (I will of course also add suggestion hints in the affix file
> manually), or reduce the dictionary size, or improve the speed for lookups
> and suggestions?

Automatic compression is perfect for a spelling dictionary, but the
upcoming thesaurus extension needs real data for stemming and needs
extra information for morphological generation.

Automatic dictionary compression has one drawback for stemming, namely
possible artificial morphology:

$ hunspell -d en
windows
+ wind

(This is not good for dictionary-based suggestion, either.)

Future versions of affixcompress will be able to use word frequency
data to correct the stem analysis.
Your dictionary development needs a new script to keep the real stems
(you can add irregular forms to the dic file:
"mice st:mouse", see
http://www.openoffice.org/issues/show_bug.cgi?id=19563) and encode the
morphological information in the dictionary. When you need this
development for the Norwegian thesaurus, I will help you.

Thanks for your questions.
Regards,
László

>
> Thanks in advance for your reply.
>
> --
> Regards,
> Karl Ove Hufthammer
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




Re: How to get list of valid word in hunspell

ge-7
Hi, Jeje,

The munch and unmunch utilities help to get all valid words; you must provide the affix and dic files, and they generate all valid words.
I am not sure where they are located right now; they were part of MySpell, which has now been replaced by Hunspell.
Maybe http://hunspell.sf.net gives some clue.
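
For reference, a minimal sketch of the unmunch route. It assumes the unmunch binary from the Hunspell (formerly MySpell) src/tools directory is on the PATH and that the dictionary uses only MySpell-style affix rules. The wrapper name all_forms and the file names in the usage line are made up for illustration.

```shell
# all_forms DIC AFF: expand a MySpell-style dictionary into a sorted word list.
# (Hypothetical wrapper name; only the unmunch call itself comes from Hunspell.)
all_forms() {
    # unmunch takes the dic file first and the affix file second and
    # writes the generated word forms, one per line, to standard output.
    # sort -u removes forms that several rules generate identically.
    unmunch "$1" "$2" | sort -u
}
```

Usage would look like this:

$ all_forms en_US.dic en_US.aff >all_words.txt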

To the second question: I believe there is no such option. However, it would be very useful for Hungarian too, and I would support filing an issue for such an option.
For example:
SAME őó
SAME űú
Something like this, and then hunspell would treat ő and ó as identical.

Regards: Eleonora






Re: How to get list of valid word in hunspell

Marcin Miłkowski
In reply to this post by Sunday Bolaji
ge wrote:
> Hi, Jeje,
>
> The munch and unmunch utilities help to get all valid words; you must provide affix and dic file, and they create all valid words.
> I am not sure, where they are located right now, they were part of myspell, that is now replaced by hunspell.
> maybe http://hunspell.sf.net gives some clue.

Munch and unmunch work only for myspell dictionaries without any
hunspell-specific additions. You can use a special script wordforms to
get all possible forms of a given word (and you could feed it with the
list of all base words from the dictionary just by chopping off the
flags). But it is _slow_.
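
Chopping off the flags can be sketched in shell as below. This is only a sketch under stated assumptions: the helper name dic_stems is made up, entries are assumed to contain no escaped slashes, and everything after the first "/" or tab is treated as flags or morphological data.

```shell
# dic_stems FILE.dic: print the base words of a Hunspell dictionary file.
# (Hypothetical helper name, not part of Hunspell.)
dic_stems() {
    # Line 1 of a .dic file holds the entry count, so start at line 2.
    # Fields are split at "/" (flags) or tab (morphological data);
    # the first field is the base word. Empty lines are skipped.
    awk -F'[/\t]' 'NR > 1 && $1 != "" { print $1 }' "$1"
}
```

The list can then be fed word by word to the wordforms script. The argument order below (aff file, dic file, word) is an assumption based on the script's usage text in the Hunspell source tree, so check it against your copy:

$ dic_stems en_US.dic | while read -r w; do wordforms en_US.aff en_US.dic "$w"; done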

I was writing my own tool in awk but I didn't have time to finish it.
Works fairly well with myspell files but doesn't know all hunspell
additions yet.

> To the 2-nd question: I believe, there is no such option, however that would be very useful also for Hungarian, and I would support to enter an issue for such an option.

This is the normalization issue that was mentioned not so long ago on
this list.

Regards
Marcin



Re: How to get list of valid word in hunspell

Ruud Baars-2
In reply to this post by Sunday Bolaji
A complete generation of all possible words, certainly when using the
compounding options, would produce a really enormous list.

But I have felt the need for such a tool too.

I work around it for now by running hunspell, with the option to output
good and bad words, on the full list of correct words (to see what is
missing), and also on the list of known erroneous words and on the list
of words of 'unknown status'.
Checking the output gives a good idea of the coverage (correct and
incorrect).
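
A rough sketch of such a coverage check follows. The message does not say which options are meant; this assumes the hunspell command-line switches -l (list misspelled words) and -G (print only correct words). The helper name coverage and the list file names are made up for illustration.

```shell
# coverage DICT GOOD BAD: count how a dictionary handles two test lists.
# GOOD holds known-correct words, BAD known-misspelled words, one per line.
# (Hypothetical helper; it only wraps the hunspell command-line tool.)
coverage() {
    # Assumption: -l prints the words hunspell rejects,
    # -G prints the words it accepts.
    missing=$(hunspell -d "$1" -l < "$2" | sort -u | awk 'END { print NR }')
    leaked=$(hunspell -d "$1" -G < "$3" | sort -u | awk 'END { print NR }')
    echo "correct words rejected: $missing"
    echo "misspellings accepted: $leaked"
}
```

Usage would look like this:

$ coverage en_US correct.txt erroneous.txt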

Maybe something like this might work for you too, Jeje ...

Ruud


Re: How to get list of valid word in hunspell

ge-7
In reply to this post by Sunday Bolaji
>>
Munch and unmunch work only for myspell dictionaries without any
hunspell-specific additions. You can use a special script wordforms to
get all possible forms of a given word (and you could feed it with the
list of all base words from the dictionary just by chopping off the
flags). But it is _slow_.

I was writing my own tool in awk but I didn't have time to finish it.
Works fairly well with myspell files but doesn't know all hunspell
additions yet.
<<

I also have some awk scripts.
However, maybe for Jeje's purposes munch and unmunch are perfect.

>>
> To the 2-nd question: I believe, there is no such option, however that would be very useful also for Hungarian, and I would support to enter an issue for such an option.

This is the normalization issue that was mentioned not so long ago on
this list.
<<

I searched back; however, the only messages I found in the
normalization thread are about the ICONV and OCONV options,
which are useful only for suggestions.

I would need another option, similar to what Jeje asks for, that
says: character a is equal to character b at spell checking time.

For example, if I say:
SAME óő

then both the words tór and tőr should be recognized
as correct ones, even if the dictionary contains only tór
(and of course, with all the proper affixes the dic/aff file contains
for tór).

-eleonora


Re: How to get list of valid word in hunspell

Németh László-2
In reply to this post by Sunday Bolaji
Hi,

I have made a shell script called unmunch.sh. It supports several
Hunspell features: Unicode encoding, different flag types and double
suffixes (so it can process the output of the doubleaffixcompress
script):

http://downloads.sourceforge.net/hunspell/unmunch.sh

(Also I have updated the doubleaffixcompress script:
http://downloads.sourceforge.net/hunspell/doubleaffixcompress).

Unfortunately, compound words and special options are not supported by unmunch.sh.

2. The ICONV feature is for general input conversion, so you can use it for
normalization:

ICONV 2
ICONV ọ́ ọ́
ICONV ọ́ ọ́

(Check the correct encoding with GNU recode:
$ cat your_aff | recode u8..h4
ICONV 2
ICONV o&#803;&#769; &oacute;&#803;
ICONV &#7885;&#769; &oacute;&#803;)
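
Because the variants look identical on screen, it may be safer to generate the table from explicit byte sequences. The sketch below does that with a made-up helper, iconv_table. It assumes a UTF-8 aff file and that the dictionary stores the character as U+00F3 followed by U+0323, matching the recode output above. That target form is also Jeje's third spelling, which is why two entries are enough.

```shell
# iconv_table TARGET VARIANT...: print an ICONV table for a Hunspell .aff
# file that maps every VARIANT spelling to TARGET. (Hypothetical helper name.)
iconv_table() {
    target=$1
    shift
    printf 'ICONV %d\n' "$#"    # header line: number of entries that follow
    for variant in "$@"; do
        printf 'ICONV %s %s\n' "$variant" "$target"
    done
}

# UTF-8 bytes written as octal escapes so the code points are unambiguous:
#   \314\243     = U+0323 combining dot below
#   \314\201     = U+0301 combining acute accent
#   \341\273\215 = U+1ECD o with dot below
#   \303\263     = U+00F3 o with acute
target=$(printf '\303\263\314\243')        # U+00F3 U+0323 (assumed .dic form)
variant1=$(printf 'o\314\243\314\201')     # o, U+0323, U+0301
variant2=$(printf '\341\273\215\314\201')  # U+1ECD, U+0301

iconv_table "$target" "$variant1" "$variant2"
```

The three printed lines can be pasted into the aff file and checked with recode as shown above.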

Regards,
László


