Re: compound recognition and typos

Re: compound recognition and typos

Németh László-2
Hi Ruud,

You are absolutely right. Compound recognition will let a lot of typos
through, but Hunspell already has the feature you suggest, which forbids
the ugliest spelling mistakes accepted by the compound analysis: if the
(pseudo) compound word can be produced from a dictionary word (or from
one of its affixed forms) by one of the REP replacement rules, Hunspell
won't accept it. For example, one of the most typical Hungarian spelling
mistakes is the i↔í replacement.
Using the

REP i í
REP í i

rules, the bad "szer+víz" or "elit+élt" compounds aren't accepted,
because the dictionary contains the words "szerviz" and "elítélt". You
may also have to extend the REP rules with similar 1-character
replacements to catch the most important spelling mistakes of your
language.
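
The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hunspell's actual code: the names `rep_candidates`, `compound_is_rejected`, `REP_RULES` and `DICTIONARY` are made up for the example, the word list is a toy one, and affixed forms are ignored. In a real affix file the check is, as far as I know, switched on with the CHECKCOMPOUNDREP option, and the REP table starts with a count line.

```python
def rep_candidates(word, rep_rules):
    """Yield every string obtained by applying one REP rule at one position."""
    for pattern, replacement in rep_rules:
        start = word.find(pattern)
        while start != -1:
            yield word[:start] + replacement + word[start + len(pattern):]
            start = word.find(pattern, start + 1)


def compound_is_rejected(compound, dictionary, rep_rules):
    """Return True if one REP replacement turns the compound into a dictionary word."""
    return any(c in dictionary for c in rep_candidates(compound, rep_rules))


# Hypothetical toy data: the two REP rules from this message and a tiny word list.
REP_RULES = [("i", "í"), ("í", "i")]
DICTIONARY = {"szer", "víz", "elit", "élt", "szerviz", "elítélt"}

for compound in ("szervíz", "elitélt", "vízszer"):
    verdict = "rejected" if compound_is_rejected(compound, DICTIONARY, REP_RULES) else "allowed"
    print(compound, verdict)
```

Here "szervíz" and "elitélt" are rejected because a single replacement gives "szerviz" and "elítélt", while the made-up compound "vízszer" has no such neighbour in the toy list and passes.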

I think that for average word processing in a language with an
arbitrary number of compound words, it is much better to use the
compound recognition feature of Hunspell. But for other tasks,
especially checking and editing artificially distorted texts such as
the output of an OCR program, you may need to add new REP rules (for
the typical OCR errors) or to offer an optional dictionary without
compound recognition.

Regards,
László


2009/2/17 R.J. Baars <[hidden email]>:

> Laszlo,
>
> One of my colleagues in OpenTaal (also project leader of OOo NL) is worried
> about the compounding accepting compounds that could easily be a mistake.
>
> Of course we can try to find these and flag them as forbiddenword, but
> did you ever think of a function that detects whether the compounded word is
> a possible typo for a word that is in the list itself, and if so, forbids
> it?
>
> Ruud
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Re: compound recognition and typos

R.J. Baars
Laszlo, that's good news.

I see two (minor) buts:
1) the words have to be in the dictionary as a whole, while they could be
compound-generated
2) we do have a lot of REP rules that might interfere.

But it is great that the option is in there ...

In case one of these minor buts gets in the way, I'll contact you again.

Thanks, good work!
