compound words

compound words

ge-7
Dear All,

The investigation below should interest those whose languages make
active use of compound words (Hungarian, German, Dutch, Swedish, ...).

I have now completed the investigation of Hungarian compound word creation.

The results are at
http://tkltrans.sourceforge.net/tklspell/compound.htm#c01

The full lists of bad words are at
http://tkltrans.sf.net/magyar/bbbad2.txt.gz
http://tkltrans.sf.net/magyar/betus94okx.txt.gz
http://tkltrans.sf.net/magyar/sav2koz.txt.gz

The conclusion:

Checking a corpus of 30 million words showed that automatically
created compound words contain approximately 10% wrong words of the
above types. Automatic word compounding is a quick and dirty mechanism
that cannot produce quality word lists, and therefore cannot deliver
quality spell checking. Manually created word lists, if compiled
carefully, tend to contain less than 0.5% wrong words.

The number of words.
-------------------
Here are the words that are in reality bad:

[en@noname nagy_fajlok]$ wc el_bad/*
  7889   7889  95583 el_bad/bbbad2.txt
 38175  38175 467018 el_bad/betus94okx.txt
  8401   8401 142604 el_bad/sav2koz.txt  -- long words, over 15 characters
 54465  54465 705205 total

Here are the words that the checker, using the compounder, considers good:

[en@noname nagy_fajlok]$ wc NL_jo/*
  64054   64054  802530 NL_jo/bbbad2.txt
 341204  341204 4309353 NL_jo/betus94okx.txt
 135044  135044 2302654 NL_jo/sav2koz.txt   -- long words, over 15 characters
 540302  540302 7414537 total
 
The shorter the words, the more catastrophic the error rate.
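
To make the per-file comparison explicit, here is a small sketch that
divides the "bad" counts by the "accepted" counts from the two wc
listings above. The variable and function names are only illustrative,
and it assumes that each bad list is a subset of the accepted list of
the same name. Only sav2koz.txt is labelled as holding the long words;
nothing is assumed about word length in the other two files.

```python
# Word counts copied from the two `wc` listings (first column = lines = words).
bad = {"bbbad2.txt": 7889, "betus94okx.txt": 38175, "sav2koz.txt": 8401}
accepted = {"bbbad2.txt": 64054, "betus94okx.txt": 341204, "sav2koz.txt": 135044}


def error_rates(bad, accepted):
    """Return the percentage of accepted words that are bad, per file and in total."""
    rates = {name: 100.0 * bad[name] / accepted[name] for name in accepted}
    rates["total"] = 100.0 * sum(bad.values()) / sum(accepted.values())
    return rates


rates = error_rates(bad, accepted)
for name, pct in rates.items():
    print(f"{name:16s} {pct:5.2f}%")
```

This gives roughly 12.3% for bbbad2.txt, 11.2% for betus94okx.txt and
6.2% for sav2koz.txt (the long words), with about 10.1% overall.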

I assume that the results for German are analogous, because the first
investigations showed that quite clearly. If I have time, I
will look into that more deeply as well.

Regards: Eleonora


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: compound words

Simon Brouwer
Hi Eleonora,

ge wrote:

> The shorter the words, the more catastrophic the error rate.

It might then be a good idea for the spell checker to reject guessed
compounds below a certain minimum length (configurable in the affix file).
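
A toy sketch of what I mean, not the actual checker code: the
compounder here is a naive recursive splitter, and the names
split_compound, accept and min_compound_len as well as the threshold
of 8 are made up for illustration.

```python
def split_compound(word, lexicon, min_part=2):
    """Return a list of lexicon words that concatenate to `word`, or None."""
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head in lexicon:
            tail = split_compound(word[i:], lexicon, min_part)
            if tail is not None:
                return [head] + tail
    return None


def accept(word, lexicon, min_compound_len=8):
    """Accept listed words always; accept guessed compounds only if long enough."""
    if word in lexicon:
        return True
    if len(word) < min_compound_len:
        # Too short: do not even try to guess a compound.
        return False
    return split_compound(word, lexicon) is not None


LEXICON = {"black", "board", "to", "on", "in", "at"}

print(accept("blackboard", LEXICON))                 # long guessed compound
print(accept("toon", LEXICON))                       # short guess, rejected
print(accept("toon", LEXICON, min_compound_len=0))   # accepted without the limit
```

With the limit in place "toon" (to + on) is rejected while
"blackboard" still passes; without the limit both are accepted.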

> I assume that the results for German are analogous, because the first
> investigations showed that quite clearly. If I have time, I
> will look into that more deeply as well.

I notice that the German examples mostly show wrong compounds that are
misspellings of other words. Maybe that list is not representative, but
such errors would be more common and are harder for the user to spot.
So a possible improvement could be to disqualify a guessed compound if
it is too similar to a word that is actually in the word list. The
existing suggestion mechanism could be used to determine this.
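
As a rough illustration only: the sketch below stands in a plain
edit-distance check for the real suggestion mechanism, which works
differently and is not shown here. The function names and the
threshold of one edit are assumptions of mine.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def near_misses(word, lexicon, max_dist=1):
    """Listed words within max_dist edits of `word`, excluding `word` itself."""
    return sorted(w for w in lexicon
                  if w != word
                  and abs(len(w) - len(word)) <= max_dist
                  and edit_distance(word, w) <= max_dist)


def keep_guessed_compound(word, lexicon, max_dist=1):
    """Disqualify a guessed compound that looks like a typo of a listed word."""
    return not near_misses(word, lexicon, max_dist)


LEXICON = {"car", "pat", "carpet", "tea", "pot"}

print(near_misses("carpat", LEXICON))            # car + pat, but close to "carpet"
print(keep_guessed_compound("carpat", LEXICON))
print(keep_guessed_compound("teapot", LEXICON))  # tea + pot, nothing similar listed
```

Here "carpat" would be a valid guess (car + pat) but is dropped
because it is one edit away from "carpet", while "teapot" is kept.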

Or maybe such mechanisms have already been implemented?

--
Vriendelijke groet,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |
