Compound words

Compound words

ge-7
Dear Simon,

>>
I think *that* would be throwing away the child. In languages like Dutch
it is so easy to form a new but perfectly valid word by compounding, it
is impossible to include all the possible combinations in a word list.
<<

Including all of them is impossible, I agree. But it is not impossible to find 99.99% of them, and then you have an acceptable error rate and an acceptable hit rate.

>>It might be an idea to identify problem cases by running the list of
known-good words through the suggestion mechanism, and making a list of
all the variations that are accepted (only) using the mechanical
compound mechanism. This list could then be reviewed and the words that
are incorrectly spelled and/or nonsensical placed on a "reject list".
<<
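
The review step proposed in the quote above could be sketched roughly as follows. This is only an illustration under loud assumptions: the "suggestion mechanism" is approximated by single-edit variants, the "mechanical compound mechanism" by a naive split into two known words, and the function names and the toy Dutch word list are hypothetical, not taken from any real spell checker.

```python
import string

def edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | swaps | replaces | inserts) - {word}

def is_mechanical_compound(candidate, known, min_part=3):
    """True if candidate splits into two known words (naive compounding rule)."""
    return any(
        candidate[:i] in known and candidate[i:] in known
        for i in range(min_part, len(candidate) - min_part + 1)
    )

def compound_only_variants(known, min_part=3):
    """Variants of known words that are accepted only via the compounding rule."""
    review = set()
    for word in known:
        for variant in edit_variants(word):
            if variant not in known and is_mechanical_compound(variant, known, min_part):
                review.add(variant)
    return review

# Hypothetical toy word list; a real list would be far larger.
known_good = {"zon", "zin", "licht", "zonlicht"}
for candidate in sorted(compound_only_variants(known_good)):
    print(candidate)  # candidates for manual review / the "reject list"
```

With this toy list, a one-letter variant of "zonlicht" that happens to split into "zin" + "licht" is flagged, which is the kind of nonsensical but mechanically valid compound a reviewer would then place on the reject list.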

Simon, this is up to you for Dutch. However, there are at least 4.9 billion (4.9 milliard) bad words (you can see why in my study), so I decided not to handle bad words at all. My life is not long enough to handle them, and it would bring no useful result anyway. Assuming that you can work with them is a mistaken technology and way of thinking. There are just too many of them. I only illustrate a few of them in my table, that's all.

I did the selection of good words for Hungarian, and I can tell you, it was a LOT of work.

If you do that, I strongly advise using a mechanical compounder for preselection. Words that are wrong after preselection are thrown away. After that I created word lists by word length: up to 8 chars, 8-10 chars, 11-15 chars, and above 15 chars. I checked every word in every list with Yahoo/Google, so each length group was split into two groups: found by Google/Yahoo and not found by Google/Yahoo. These tricks helped me to speed up from 300 words/hour to 6000 words/hour.
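
A rough sketch of the grouping described above, for anyone who wants to try the same workflow. Everything here is an assumption made for illustration: `passes_preselection` stands in for the mechanical compounder, `found_on_web` stands in for the manual Yahoo/Google lookup (no real search API is used), and because the stated length ranges overlap at 8 characters, the sketch arbitrarily puts 8-character words in the first bucket.

```python
from collections import defaultdict

# Inclusive upper bounds of the length buckets; None means "no upper bound".
# Assumption: 8-character words go into the first bucket.
BUCKETS = [("up to 8", 8), ("9-10", 10), ("11-15", 15), ("above 15", None)]

def bucket_label(word):
    """Return the label of the length bucket the word falls into."""
    for label, upper in BUCKETS:
        if upper is None or len(word) <= upper:
            return label

def group_for_review(words, passes_preselection, found_on_web):
    """Group candidate words by (length bucket, found / not found)."""
    groups = defaultdict(list)
    for word in sorted(set(words)):
        if not passes_preselection(word):
            continue  # thrown away at preselection
        status = "found" if found_on_web(word) else "not found"
        groups[(bucket_label(word), status)].append(word)
    return dict(groups)

# Hypothetical stand-ins for the compounder and the search-engine check.
candidates = ["alma", "almafa", "abcdefghij", "abcdefghijklmnopq", "xx"]
groups = group_for_review(
    candidates,
    passes_preselection=lambda w: w != "xx",
    found_on_web=lambda w: len(w) < 10,
)
for key in sorted(groups):
    print(key, groups[key])
```

Reviewing one (length, found/not found) group at a time keeps similar words together, which is presumably where most of the speed-up comes from.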

Machine compounding helped me a lot to filter the web corpus. It is a useful technology, but the error rate it creates is unacceptable for quality spell checking. Tricks do not help: if the pig remains in the room, it will stink there, no matter what you try.

Regards, Eleonora

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Compound words

Simon Brouwer
Hi Eleonora,

ge wrote:

> Dear Simon,
>
> >>
> I think *that* would be throwing away the child. In languages like Dutch
> it is so easy to form a new but perfectly valid word by compounding, it
> is impossible to include all the possible combinations in a word list.
> <<
>
> Including all of them is impossible, I agree. But it is not impossible to find 99.99% of them, and then you have an acceptable error rate and an acceptable hit rate.
>  
Hmmm... 99.99% means that on average, for every 10,000 words of text
only one word is not in the word list. As new compounds are formed all
the time, this seems improbable.
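
The arithmetic behind that reading can be checked with a few lines. This is only a sketch, assuming that "99.99%" is interpreted as coverage of running words in a text; the function name is made up for the example.

```python
def expected_unknown_words(text_length, coverage):
    """Expected number of running words missing from the word list."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return text_length * (1.0 - coverage)

# Compare a few coverage levels for a text of 10,000 words.
for coverage in (0.99, 0.999, 0.9999):
    print(coverage, round(expected_unknown_words(10_000, coverage), 2))
```

At 99.99% coverage this gives roughly one unknown word per 10,000 words of text.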

Also, Dutch spell checking in MS Word, which is generally considered
quite good, appears to use mechanical compounding as it accepts certain
contrived, nonsensical compounds. This at least suggests that it is not
an *obviously* bad idea, for Dutch at least.

In any case I agree that mechanical compounding has serious limitations
to be aware of.

(...)

--
Kind regards,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |
