unmunch separator


Oleg Burlaca
Hello,

I've searched the web and the archive of this list but couldn't find out
how to separate the generated forms of one word from those of another.

A longer explanation:
I run "unmunch dic_file aff_file > output.txt"
I want the output.txt file to look like this:

word1
word1_form1
word1_form2
--separator--
word2
word2_form1
word2_form2

where wordN is a word from the DIC file and wordN_formM are its
generated forms. Right now the "--separator--" is empty, i.e. I can't
tell which word forms belong to which word.

Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: unmunch separator

Jancs
I suppose you would have to edit the unmunch source to get such an option.

Janis



--
Jancs
Laps Cileecish

267 more months until retirement...

http://openoffice-lv.sourceforge.net
http://tehvi.dv.lv
***


Re: unmunch separator

Oleg Burlaca
Jancs wrote:
> I suppose you would have to edit the unmunch source to get such an option.
>
> Janis

Yes Jancs, you were right. I've modified the /src/tools/unmunch.c file
from the hunspell package.
I just added one line:
   fprintf(stdout, "%s\n", "---");
after the block that writes out the word forms:
    for (i=0; i < numwords; i++) {
      fprintf(stdout,"%s\n",wlist[i].word);
      free(wlist[i].word);
      wlist[i].word = NULL;
      wlist[i].pallow = 0;
    }


It was easier than I thought :))
Thanks.


Re: unmunch separator

Kevin B. Hendricks
Hi,

Please remember that unmunch does not guarantee a one-to-one mapping
between words and root forms.  For example, an unmunched word may be
generated by many different root words and affixes and not just once.

That is why the unmunched list of words is typically uniquely sorted  
to remove duplicates.

The basic idea is that a raw word list when compressed by affix  
compression (munch) will always expand (unmunch) to exactly the same  
raw word list after sorting uniquely with no additions or deletions.
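
A minimal sketch of the unique sort described above, assuming the word forms are already held in memory as an array of C strings. The names cmp_words and sort_unique are hypothetical and are not part of hunspell or myspell; on the command line, piping the output through "sort -u" achieves the same thing.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C string pointers. */
static int cmp_words(const void *a, const void *b)
{
    const char *const *wa = a;
    const char *const *wb = b;
    return strcmp(*wa, *wb);
}

/* Hypothetical helper: sorts words[0..n-1] in place and moves the
 * distinct entries to the front of the array.
 * Returns the number of distinct words.
 * Only the pointers are rearranged; no string is copied or freed. */
size_t sort_unique(const char **words, size_t n)
{
    size_t i, kept;

    if (n == 0)
        return 0;

    qsort(words, n, sizeof words[0], cmp_words);

    kept = 1;
    for (i = 1; i < n; i++) {
        /* after sorting, duplicates are adjacent */
        if (strcmp(words[i], words[kept - 1]) != 0)
            words[kept++] = words[i];
    }
    return kept;
}
```

After the call, the first sort_unique(words, n) entries of the array hold the sorted list without duplicates.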

FWIW,

Kevin


On Apr 10, 2007, at 2:31 PM, Oleg Burlaca wrote:

> Jancs wrote:
>> I suppose you would have to edit the unmunch source to get such an option.
>>
>> Janis
>
> Yes Jancs, you were right. I've modified the /src/tools/unmunch.c
> file from the hunspell package.
> Just added a line:
>   fprintf(stdout, "%s\n", "---");
> after the block that writes out wordforms:
>    for (i=0; i < numwords; i++) {
>      fprintf(stdout,"%s\n",wlist[i].word);
>      free(wlist[i].word);
>      wlist[i].word = NULL;
>      wlist[i].pallow = 0;
>    }
>
>
> It was easier than I thought :))
> Thanks.
>

Re: unmunch separator

Oleg Burlaca
Kevin B. Hendricks wrote:
> Please remember that unmunch does not guarantee a one-to-one mapping
> between words and root forms.  For example, an unmunched word may be
> generated by many different root words and affixes and not just once.
>
> That is why the unmunched list of words is typically uniquely sorted
> to remove duplicates.
It's OK if the same word is generated several times. I wanted to
generate a list like this:
root1,word11
root1,word12
root1,word13
root2,word21
...
and feed it to the mnoGoSearch search engine in order to
enable fuzzy search.
i.e. when searching for word12, the search engine will also find docs
with word11, word13.
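
A rough sketch of how the separator-delimited output could be turned into such a root,word list. It assumes that the first line of each group is the root word (as in the desired output shown earlier in the thread) and that each group ends with a separator line such as "---". The function name groups_to_pairs and the MAX_LINE limit are hypothetical; lines longer than MAX_LINE - 1 characters are not handled.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 512

/* Hypothetical helper: reads grouped word forms from `in` and writes
 * one "root,form" line per generated form to `out`.
 * Assumption: the first non-blank line of each group is the root, and
 * a line equal to `sep` closes the group.
 * The root itself is not paired with itself.
 * Returns the number of pairs written. */
long groups_to_pairs(FILE *in, FILE *out, const char *sep)
{
    char line[MAX_LINE];
    char root[MAX_LINE];
    int have_root = 0;
    long pairs = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';    /* strip the line ending */

        if (strcmp(line, sep) == 0) {          /* group finished */
            have_root = 0;
            continue;
        }
        if (line[0] == '\0')                   /* ignore blank lines */
            continue;

        if (!have_root) {
            strcpy(root, line);                /* same buffer size, so it fits */
            have_root = 1;
            continue;
        }
        fprintf(out, "%s,%s\n", root, line);
        pairs++;
    }
    return pairs;
}
```

Called as groups_to_pairs(stdin, stdout, "---"), it can sit at the end of a pipe that starts with the patched unmunch.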

Kevin, thanks for the comments.


Re: unmunch separator

Kevin B. Hendricks
Hi,

FYI: the unmunch algorithm for any one word and affix file is quite  
fast so that instead of pre-expanding the root/word list you could in  
fact simply take pieces of code from myspell that takes a word and  
finds a root with affix flags and then expand it for all affixes on  
the fly so to speak (at least for English).

Effectively, simply spellcheck each word in the search query (this can
be done on the fly while typing, just as in OOo). The check identifies
the entry in the hash table formed from the .dic file, which you can
then expand on the fly using the .aff info stored in memory to create
the fuzzy word list for each word if you wanted.
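
To illustrate the idea of expanding a root on the fly, here is a toy sketch. It is not the myspell or hunspell code or API: the suffix_rule struct is a simplified stand-in for an SFX entry in an .aff file (strip some characters, then append others), the condition pattern of real rules is left out, and all names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Toy suffix rule: remove `strip` from the end of the root (may be ""),
 * then append `add`. The rule applies only if the root ends with `strip`. */
struct suffix_rule {
    char flag;            /* affix flag as it would appear in the .dic entry */
    const char *strip;
    const char *add;
};

/* Writes the generated form into out[0..outsz-1].
 * Returns 1 on success, 0 if the rule does not apply or the buffer
 * is too small. */
int apply_suffix(const char *root, const struct suffix_rule *r,
                 char *out, size_t outsz)
{
    size_t rl = strlen(root);
    size_t sl = strlen(r->strip);
    size_t al = strlen(r->add);

    if (sl > rl || strcmp(root + rl - sl, r->strip) != 0)
        return 0;
    if (rl - sl + al + 1 > outsz)
        return 0;

    memcpy(out, root, rl - sl);
    memcpy(out + rl - sl, r->add, al + 1);   /* copies the NUL as well */
    return 1;
}

/* Prints every form of `root` generated by the rules whose flag occurs
 * in `flags` (the flags attached to the root in the dictionary).
 * Returns the number of forms printed. */
size_t expand_root(const char *root, const char *flags,
                   const struct suffix_rule *rules, size_t nrules,
                   FILE *out)
{
    char form[128];
    size_t i, count = 0;

    for (i = 0; i < nrules; i++) {
        if (rules[i].flag == '\0' || strchr(flags, rules[i].flag) == NULL)
            continue;
        if (apply_suffix(root, &rules[i], form, sizeof form)) {
            fprintf(out, "%s\n", form);
            count++;
        }
    }
    return count;
}
```

In a real setup the rules and the root's flags would come from the loaded .aff and .dic data rather than from hand-written tables.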

Another nice feature of using a spellchecker with affix compression  
in that way is that you would catch typos and could offer suggestions  
to replace mistyped words very very easily.

In fact, you could just incorporate myspell as a library (it is BSD  
licensed) (or any other spellchecker with a compatible license) into  
your search code and get all of these features.

My 2 cents,

Kevin



On Apr 12, 2007, at 5:02 AM, Oleg Burlaca wrote:

> Kevin B. Hendricks wrote:
>> Please remember that unmunch does not guarantee a one-to-one
>> mapping between words and root forms.  For example, an unmunched  
>> word may be generated by many different root words and affixes and  
>> not just once.
>>
>> That is why the unmunched list of words is typically uniquely  
>> sorted to remove duplicates.
> It's ok that the same word will be generated several times. I  
> wanted to generate a list:
> root1,word11
> root1,word12
> root1,word13
> root2,word21
> ...
> and to feed this list to the mnoGoSearch search engine in order to  
> enable fuzzy search.
> i.e. when searching for word12, the search engine will also find  
> docs with word11, word13.
>
> Kevin, thanks for the comments.
>

Re: unmunch separator

Oleg Burlaca
Kevin B. Hendricks wrote:
> FYI: the unmunch algorithm for any one word and affix file is quite
> fast so that instead of pre-expanding the root/word list you could in
> fact simply take pieces of code from myspell that takes a word and
> finds a root with affix flags and then expand it for all affixes on
> the fly so to speak (at least for English).
dataparksearch (http://www.dataparksearch.org/) and mnoGoSearch
(http://www.mnogosearch.org/)
use ispell dictionaries the way you have described. I've written a
message to both search engines' mailing lists
about adding hunspell/myspell support.  Ispell is too old, and I
don't see the need to maintain both
ispell and hunspell dictionaries; only hunspell should remain IMHO. I
hope they will add support for hunspell.

> Another nice feature of using a spellchecker with affix compression
> in that way is that you would catch typos and could offer suggestions
> to replace mistyped words very very easily.
Yes, 100% agree, it's a very useful side effect.

> In fact, you could just incorporate myspell as a library (it is BSD
> licensed) (or any other spellchecker with a compatible license) into
> your search code and get all of these features.
I've posted your entire message to the SE mailing lists.
Both search engines are written in C++, and I think it wouldn't be
hard to add hunspell support.

Kevin, thanks for your 2 cents, I think they are worth $2 :)

Kind Regards,
Oleg Burlaca
