compound words

compound words

ge-7
On Friday 30 June 2006 09:52 am, Simon Brouwer wrote:
> > The shorter the words, the more catastrophic the error rate.
>
> It might then be a good idea if the spell checker would reject guessed
> compounds below a certain minimum length (configurable in the affix file).

Yes, I know that aspell allows this. Hunspell has so many compound flags that I am not sure what it allows and what it does not. This could be the first step in creating quality word lists: take only the words shorter than, say, 9 characters from a web corpus into the list, and leave the rest to machine compounding. Then raise the limit to 12 characters, then to 15 characters, and finally cover the rest. That way we can eliminate the error-prone mechanical compounding step by step.
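
A minimal sketch of this staged split, assuming the corpus is already available as a plain list of words. The function name, the sample words, and the exact limits are illustrative only; nothing here is taken from aspell or hunspell.

```python
def split_by_length(words, max_len):
    """Split words into (listed, deferred).

    listed   - words shorter than max_len, to be listed explicitly
    deferred - longer words, still left to mechanical compounding
    """
    unique = set(words)
    listed = sorted(w for w in unique if len(w) < max_len)
    deferred = sorted(w for w in unique if len(w) >= max_len)
    return listed, deferred


# Hypothetical sample corpus, for illustration only.
corpus = ["Hand", "Hund", "Arbeit", "Arbeitgeber", "Treuhand", "Handschuhfach"]

# Raise the limit stage by stage: 9, then 12, then 15 characters.
for limit in (9, 12, 15):
    listed, deferred = split_by_length(corpus, limit)
    print(limit, "listed:", listed, "deferred:", deferred)
```

Each stage moves more words from the deferred group into the explicitly listed group, so the share of words that depends on mechanical compounding shrinks with every pass.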

> > I assume, that the results are in German analogous, because the first
> > investigations showed that quite clearly, if I have time, I
> > will look also into that somewhat deeper.
>
> I notice that the German examples show mostly wrong compounds that are
> misspellings of other words. Maybe that list is not representative, but
> such errors would be more common and are more difficult to spot by the
> user.
> So a possible improvement could be to disqualify a guessed compound if
> it is too similar to a word that is actually in the word list. The
> existing suggestion mechanism could be used to determine this.
>
> Or maybe such mechanisms have already been implemented?

You are right: I found that very often it is just one character more or less that makes the mechanically compounded word senseless and erroneous. However, this would eliminate a lot of potentially good words, so it needs to be verified. Maybe this approach would throw the child out with the bath water.
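
To make the "one character more or less" test concrete, here is a small sketch. It does not use a spell checker's suggestion mechanism; it swaps in a plain edit-distance-of-one comparison, and the function names and sample words are assumptions for illustration.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                    # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                         # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]          # exactly one extra character in b


def near_misses(guess, known_words):
    """Listed words that are one edit away from a guessed compound."""
    return sorted(w for w in known_words if w != guess and within_one_edit(guess, w))


# Hypothetical word list, for illustration only.
known = ["Arbeitgeber", "Treuhand", "Handschuh"]
print(near_misses("Arbeitsgeber", known))  # ['Arbeitgeber']
print(near_misses("Treuhund", known))      # ['Treuhand']
```

A guessed compound with a non-empty result would be a candidate for rejection. How many good words such a rule would also reject is exactly what would need to be verified on real data.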

It would be best to eliminate mechanical compounding completely, the sooner the better.

Regards: Eleonora


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: compound words

Simon Brouwer
Hi Eleonora

ge wrote:
>
> You are right: I found that very often it is just one character more or less that makes the mechanically compounded word senseless and erroneous. However, this would eliminate a lot of potentially good words, so it needs to be verified. Maybe this approach would throw the child out with the bath water.
>
> It would be best to eliminate mechanical compounding completely, the sooner the better.

I think *that* would be throwing away the child. In languages like Dutch
it is so easy to form a new but perfectly valid word by compounding, it
is impossible to include all the possible combinations in a word list.

It might be an idea to identify problem cases by running the list of
known-good words through the suggestion mechanism, and making a list of
all the variations that are accepted (only) using the mechanical
compound mechanism. This list could then be reviewed and the words that
are incorrectly spelled and/or nonsensical placed on a "reject list".
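
A rough sketch of how such a review list could be produced. It makes two loud assumptions: one-edit variants stand in for the output of the suggestion mechanism, and a toy function that accepts any concatenation of at least two listed parts stands in for the mechanical compounding of a real checker. All names and sample data are hypothetical.

```python
import string

# Assumed alphabet: ASCII lowercase plus the German special letters.
ALPHABET = string.ascii_lowercase + "äöüß"


def one_edit_variants(word, alphabet=ALPHABET):
    """All strings one insertion, deletion, or substitution away from word."""
    variants = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            variants.add(word[:i] + c + word[i:])          # insertion
        if i < len(word):
            variants.add(word[:i] + word[i + 1:])          # deletion
            for c in alphabet:
                variants.add(word[:i] + c + word[i + 1:])  # substitution
    variants.discard(word)
    return variants


def is_guessed_compound(word, parts):
    """Toy stand-in for mechanical compounding: at least two listed parts in a row."""
    def segment(rest, used):
        if not rest:
            return used >= 2
        return any(segment(rest[len(p):], used + 1)
                   for p in parts if rest.startswith(p))
    return segment(word, 0)


def review_candidates(known_good, parts):
    """Variants of known-good words that are accepted only through compounding."""
    known = set(known_good)
    found = set()
    for word in known:
        for variant in one_edit_variants(word):
            if variant not in known and is_guessed_compound(variant, parts):
                found.add(variant)
    return sorted(found)


# Hypothetical data, lowercased for simplicity.
print(review_candidates(["treuhand"], {"treu", "hand", "hund"}))  # ['treuhund']
```

The output is only a list for human review; deciding which entries go on the reject list remains a manual step.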

--
Kind regards,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |


Re: compound words

Daniel Naber-4
On Friday 30 June 2006 11:42, Simon Brouwer wrote:

> It might be an idea to identify problem cases by running the list of
> known-good words through the suggestion mechanism, and making a list of
> all the variations that are accepted (only) using the mechanical
> compound mechanism. This list could then be reviewed and the words that
> are incorrectly spelled and/or nonsensical placed on a "reject list".

What I did is this: I collected (and automatically generated) similar
German words like Hand, Hund. I then replaced Hand by Hund and vice versa
in a large list of compounds. Then I checked whether results like
"Treuhund" are accepted. These cases have been reported to Björn Jacke,
the maintainer of the German hunspell list.
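
The substitution test described above can be sketched as follows. The checker is passed in as a function, and the toy checker used here (any concatenation of listed parts is accepted) is only a stand-in for a real spell checker; all names and word lists are hypothetical, and words are lowercased for simplicity.

```python
def swap_variants(compound, pairs):
    """Forms of compound with one member of a confusable pair replaced by the other."""
    low = compound.lower()
    variants = set()
    for a, b in pairs:
        for src, dst in ((a.lower(), b.lower()), (b.lower(), a.lower())):
            start = low.find(src)
            while start != -1:
                variants.add(low[:start] + dst + low[start + len(src):])
                start = low.find(src, start + 1)
    variants.discard(low)
    return variants


def accepted_swaps(compounds, pairs, accepts):
    """(original, variant) pairs whose swapped variant the checker accepts."""
    hits = []
    for word in compounds:
        for variant in swap_variants(word, pairs):
            if accepts(variant):
                hits.append((word, variant))
    return sorted(hits)


def toy_accepts(word, parts):
    """Hypothetical checker: accepts any concatenation of listed parts."""
    if word == "":
        return True
    return any(toy_accepts(word[len(p):], parts)
               for p in parts if word.startswith(p))


parts = {"treu", "hand", "hund", "schuh"}
compounds = ["Treuhand", "Handschuh", "Handlung"]
pairs = [("Hand", "Hund")]

result = accepted_swaps(compounds, pairs, lambda w: toy_accepts(w, parts))
print(result)  # [('Handschuh', 'hundschuh'), ('Treuhand', 'treuhund')]
```

Every hit is a swapped form that the checker wrongly lets through, which makes it a candidate to report to the dictionary maintainer.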

Regards
 Daniel

--
http://www.danielnaber.de


Re: compound words

Simon Brouwer
Hi Daniel,

Daniel Naber wrote:

> On Friday 30 June 2006 11:42, Simon Brouwer wrote:
>
>> It might be an idea to identify problem cases by running the list of
>> known-good words through the suggestion mechanism, and making a list of
>> all the variations that are accepted (only) using the mechanical
>> compound mechanism. This list could then be reviewed and the words that
>> are incorrectly spelled and/or nonsensical placed on a "reject list".
>>    
>
> What I did is this: I collected (and automatically generated) similar
> German words like Hand, Hund. I then replaced Hand by Hund and vice versa
> in a large list of compounds. Then I checked whether results like
> "Treuhund" are accepted. These cases have been reported to Björn Jacke,
> the maintainer of the German hunspell list.
>  
That sounds like a useful approach. I will keep it in mind!

--
Kind regards,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |


Re: compound words

Bjoern JACKE-3
In reply to this post by Daniel Naber-4
On 2006-06-30 at 21:17 +0200 Daniel Naber sent off:

>On Friday 30 June 2006 11:42, Simon Brouwer wrote:
>
>> It might be an idea to identify problem cases by running the list of
>> known-good words through the suggestion mechanism, and making a list of
>> all the variations that are accepted (only) using the mechanical
>> compound mechanism. This list could then be reviewed and the words that
>> are incorrectly spelled and/or nonsensical placed on a "reject list".
>
>What I did is this: I collected (and automatically generated) similar
>German words like Hand, Hund. I then replaced Hand by Hund and vice versa
>in a large list of compounds. Then I checked whether results like
>"Treuhund" are accepted. These cases have been reported to Björn Jacke,
>the maintainer of the German hunspell list.

Additionally, I check every compoundable word for commonness against a
big list of words which also contains compounds. If a compoundable
word occurs in only very few compound words, I put those few compound
words into the dictionary instead of adding the first part of the
compound as a compoundable word. Adding compoundable words to the
dictionary should be done very carefully.

It can also happen that silly or bogus words are accepted: if "Zieh"
is accepted as a compoundable word, "Ziehren" will be considered
correct. Strictly speaking there might be a "pulling reindeer", but
usually this is a typo. Cases like this, and cases like the ones
Daniel mentions, have to be put into a blacklist which has to be
flagged with hunspell's FORBIDDENWORD flag.

Finding those cases can partly be done by a script that generates
typos automatically, but it also has to be done while building up the
dictionary: grep huge word lists for substrings of the newly added
words and look at each match to check whether the compound to be
added is still correct, or whether other, incorrect forms are created.

For example, Arbeits- is a common compoundable word. Before adding it,
grep a huge word list for "Arbeit" (the word without any suffix) ...
you will find Arbeitgeber. Adding Arbeits- as a compoundable word
would make Arbeitsgeber a correct word, so you have to put
Arbeitsgeber with the FORBIDDENWORD flag into your blacklist,
including all affix flags, so that other variants of the bogus
"Arbeitsgeber" are blacklisted too. There are many cases similar to
this, where you find out by grepping that new compoundable words
produce more or less nasty typos.
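
The grep step for the Arbeit / Arbeits- example can be sketched like this. The function names and the sample word list are hypothetical. The dictionary line format assumes single-character flags and that the flag letter used (here X) has been declared as FORBIDDENWORD in the affix file; the letters themselves are illustrative.

```python
def forbidden_candidates(stem, linking_form, word_list):
    """Forms that adding linking_form would newly accept next to attested stem compounds.

    For every listed word that continues the bare stem (Arbeit + geber),
    build the form with the linking element (Arbeits + geber). If that
    form is not itself a listed word, it needs a manual look.
    """
    known = set(word_list)
    candidates = []
    for word in sorted(known):
        continues_stem = word.startswith(stem) and len(word) > len(stem)
        if continues_stem and not word.startswith(linking_form):
            variant = linking_form + word[len(stem):]
            if variant not in known:
                candidates.append((word, variant))
    return candidates


def dic_entry(word, forbidden_flag, affix_flags=""):
    """One blacklist line: the word, the forbidden flag, plus any affix flags."""
    return f"{word}/{forbidden_flag}{affix_flags}"


# Hypothetical word list, for illustration only.
words = ["Arbeit", "Arbeitgeber", "Arbeitnehmer", "Arbeitsamt", "Arbeitszeit"]

for attested, bogus in forbidden_candidates("Arbeit", "Arbeits", words):
    print(attested, "->", dic_entry(bogus, "X"))
```

The script only proposes candidates. A derived word such as Arbeiter would also produce a proposal, so each line still has to be checked by hand before it goes into the blacklist.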

Bjoern


compound words

ge-7
BJ wrote:
> Cases like this
> and cases like Daniel mentions have to be put into a blacklist which
> has to be flagged with hunspell's FORBIDDENWORD flag.

I recently found that aspell cannot (yet) handle the FORBIDDENWORD flag or an analogous mechanism. Of course, ispell cannot either. This might also be the case for other spell checkers, such as Mozilla's.

Regards: Eleonora

Re: compound words

Bjoern JACKE-3
On 2006-07-03 at 15:16 +0200 [hidden email] sent off:
>BJ wrote:
>> Cases like this
>> and cases like Daniel mentions have to be put into a blacklist which
>> has to be flagged with hunspell's FORBIDDENWORD flag.
>
>I recently found that aspell cannot (yet) handle the FORBIDDENWORD flag or an analogous mechanism. Of course, ispell cannot either. This might also be the case for other spell checkers, such as Mozilla's.

Yes, this is only for hunspell, of course. Mozilla, which still uses
myspell at the moment, can't make use of the hunspell-optimized
compound word dictionaries. Mozilla is going to change its spell
engine, hopefully to hunspell or to a meta spell-check engine which
can make use of hunspell.

Unfortunately, neither ispell nor aspell has sufficient support for
agglutinative languages with complex compound word rules.

Bjoern


Re: compound words

Simon Brouwer
In reply to this post by Bjoern JACKE-3
Hi Bjoern,

Bjoern JACKE wrote:

> On 2006-06-30 at 21:17 +0200 Daniel Naber sent off:
>> On Friday 30 June 2006 11:42, Simon Brouwer wrote:
>>
>>> It might be an idea to identify problem cases by running the list of
>>> known-good words through the suggestion mechanism, and making a list of
>>> all the variations that are accepted (only) using the mechanical
>>> compound mechanism. This list could then be reviewed and the words that
>>> are incorrectly spelled and/or nonsensical placed on a "reject list".
>>
>> What I did is this: I collected (and automatically generated) similar
>> German words like Hand, Hund. I then replaced Hand by Hund and vice
>> versa in a large list of compounds. Then I checked whether results
>> like "Treuhund" are accepted. These cases have been reported to Björn
>> Jacke, the maintainer of the German hunspell list.
>
> Additionally, I check every compoundable word for commonness against a
> big list of words which also contains compounds. If a compoundable
> word occurs in only very few compound words, I put those few compound
> words into the dictionary instead of adding the first part of the
> compound as a compoundable word. Adding compoundable words to the
> dictionary should be done very carefully.
>
> It can also happen that silly or bogus words are accepted: if "Zieh"
> is accepted as a compoundable word, "Ziehren" will be considered
> correct. Strictly speaking there might be a "pulling reindeer", but
> usually this is a typo. Cases like this, and cases like the ones
> Daniel mentions, have to be put into a blacklist which has to be
> flagged with hunspell's FORBIDDENWORD flag.
>
> Finding those cases can partly be done by a script that generates
> typos automatically, but it also has to be done while building up the
> dictionary: grep huge word lists for substrings of the newly added
> words and look at each match to check whether the compound to be
> added is still correct, or whether other, incorrect forms are created.
>
> For example, Arbeits- is a common compoundable word. Before adding it,
> grep a huge word list for "Arbeit" (the word without any suffix) ...
> you will find Arbeitgeber. Adding Arbeits- as a compoundable word
> would make Arbeitsgeber a correct word, so you have to put
> Arbeitsgeber with the FORBIDDENWORD flag into your blacklist,
> including all affix flags, so that other variants of the bogus
> "Arbeitsgeber" are blacklisted too. There are many cases similar to
> this, where you find out by grepping that new compoundable words
> produce more or less nasty typos.
>
> Bjoern
Thanks for this useful explanation! I will take your recommendations to
heart when implementing compounding in the Dutch spell checker files.
Did you do the checking manually, or did you use some software for this?

--
Kind regards,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |


Re: compound words

ge-7
In reply to this post by ge-7
BJ wrote:
--------
>Unfortunately, neither ispell nor aspell has sufficient support for
>agglutinative languages with complex compound word rules.

Just for the sake of fairness and to be precise:
Ispell was the first spell checker to introduce the powerful affix concept, which is especially useful for agglutinative languages. Agglutinative languages do not necessarily have complex compound word rules, or any at all. For example Turkish, an agglutinative language of the Turanian type, uses compound words only sparingly. Ispell is an excellent product, and without its superior concept I seriously doubt that we would have such high-quality tools as myspell or its successor, hunspell.

Aspell started out as a very good suggestion engine, but without affixing it was an absolutely poor checker for agglutinative languages. Since its author had the ambition to replace ispell, he introduced the affix concept in version 0.6, that is, only last year. However, aspell does not yet have a forbidden-word concept, nor two-fold affixing, which is a useful second-level technique for reducing the word count (for Hungarian it reduces 1.01 million words to 830 thousand).

Regards: Eleonora



Re: compound words

Bjoern JACKE-3
In reply to this post by Simon Brouwer
Hi Simon,
On 2006-07-03 at 20:01 +0200 Simon Brouwer sent off:
>Thanks for this useful explanation! I will take your recommendations to
>heart when implementing compounding in the Dutch spell checker files.
>Did you do the checking manually, or did you use some software for this?

I did the checking manually; I don't think there's a way to do this
"trustworthy automatically" :-). But to make your work more efficient
in the beginning, you might do this: take a big list of compound words
and sort it; then look for runs of lines whose words start with "lots
of the same characters". If you find the 2000 most frequent
compoundable words, you probably cover >90% of all compound words in
use. That was my experience :-)
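
One way to approximate the "sort and look for shared beginnings" step with a script: sort the list, take the longest common prefix of each pair of neighbouring words, and count how often each prefix turns up. This is only a heuristic sketch; the function name, the minimum prefix length, and the sample list are assumptions.

```python
import os.path
from collections import Counter


def shared_prefix_counts(compounds, min_len=4):
    """Count the common prefixes of neighbouring words in the sorted list."""
    words = sorted(set(compounds))
    counts = Counter()
    for left, right in zip(words, words[1:]):
        # os.path.commonprefix compares character by character.
        prefix = os.path.commonprefix([left, right])
        if len(prefix) >= min_len:
            counts[prefix] += 1
    # Most frequent first; ties in alphabetical order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical compound list, for illustration only.
compounds = ["Arbeitgeber", "Arbeitnehmer", "Arbeitsamt", "Arbeitszeit",
             "Handschuh", "Handtuch"]
print(shared_prefix_counts(compounds))
# [('Arbeit', 2), ('Arbeits', 1), ('Hand', 1)]
```

Prefixes near the top of the result are the first candidates to look at as compoundable words; the decision itself stays manual.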

Bjoern
