About proofreader and spell checker interaction

thomas.lange
Hi all,

Here I'm going to post some e-mail conversation about spell checkers and
proofreaders with spell checking support in order for anyone who is
interested to participate and comment.


The first one is my reply to Marcin's initial mail:



Hi Marcin,

Marcin Miłkowski wrote:

> > Hi Thomas,
> >
> > I tried my own suggestion - replace TextMarkupType.PROOFREADING with SPELLCHECK in the error returned by LanguageTool. However, this seems to have no effect - the curly underline is still blue and not red. Moreover, even if I return true for isSpellChecker(), I still get blue. Is this still unimplemented? (I'm using 3.0.1 right now).
> >
> > Actually, I wanted to return red underlines for some of the errors that LT catches - there are quite sure cases of context-dependent serious spelling mistakes and some of the language maintainers would like to mark them up in red. Is this possible at all?
> >  
>  
Not yet, because of the lines quoted below from gciterator.cxx. For the
time being you can just report them as spelling errors (as that is
correct) and accept that they are treated as results from the grammar
checker, with no explicit hint that they are about spelling only.


> >     // the proofreader may return SPELLING but right now our core
> >     // does only handle PROOFREADING if the result is from the
> >     // proofreader...
> >     // (later on we may wish to color spelling errors found by the
> >     // proofreader differently for example. But no special handling
> >     // right now.)
> >     if (rDesc.nType == text::TextMarkupType::SPELLCHECK)
> >         rDesc.nType = text::TextMarkupType::PROOFREADING;
>  
Currently, while spell checking is still a separate process and there is
no coordination between it and proofreading, this is explicitly disabled.
The reason is that it may be bad to have two different and independent
components spell check the same text. There is no mechanism to prevent
or resolve inconsistencies.

In the longer run we would like to move spell checking to the gciterator
as well. Then it should be possible to nicely solve the related problems
in some way.

The main question arises from the idea that a proofreader might have a
better understanding of the text, and thus, if it is also a spell
checker, should it be more trusted? That is, should we even go so far as
to not use other spell checkers if the proofreader for that language is
also a spell checker?

Currently spell checkers are chained (that is up for discussion as well
though, since without chaining the route to take seems to be rather
obvious). That means if any of several spell checkers for a given
language says this text is correct, then no error will be reported. That
allows spell checker A to check normal English text and spell checker B
to know only about English medical words. Those two spell checkers can
easily be chained and you will get a result that is better than using
just a single one. Without chaining you would need a spell checker that
takes care of both tasks in one sweep.
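
To make the chaining rule concrete, here is a minimal sketch of the
"any checker accepts the word" logic. It is only an illustration: the
names WordChecker, makeDictionaryChecker and isAcceptedByChain are made
up for this example and are not part of the OOo linguistic API.

```cpp
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a word-based spell checker: returns true if
// the word is accepted. This is NOT the real OOo spell checker interface.
using WordChecker = std::function<bool(const std::string&)>;

// Builds a checker from a plain word list (illustrative only).
WordChecker makeDictionaryChecker(std::unordered_set<std::string> words) {
    return [words = std::move(words)](const std::string& word) {
        return words.count(word) > 0;
    };
}

// Chaining as described above: a word is reported as an error only if
// no checker in the chain accepts it (logical OR over "is correct").
bool isAcceptedByChain(const std::vector<WordChecker>& chain,
                       const std::string& word) {
    for (const auto& checker : chain) {
        if (checker(word)) {
            return true;
        }
    }
    return false;
}
```

With a general English checker A and a medical checker B in the chain, a
word known only to B is still accepted, which is the benefit described
above.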

But having only a spell checker will usually let incorrectly
capitalized words within a sentence go by unnoticed. E.g. in
    This text is not Correct.
This happens because the spell checker does not have the information
that 'Correct' is not at the start of a sentence. A spell checker that
is also a proofreader, however, can easily notice that 'Correct' should
not be capitalized. But at least in this case, if chaining were still
allowed, there would still be no error since the other spell checker
says the word is fine.
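
The capitalization case can be sketched as follows. This is a deliberately
naive illustration, assuming whitespace tokenization and a caller-supplied
list of proper nouns; findMidSentenceCapitals is a hypothetical helper,
not something a real proofreader exposes.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical context-aware check: flags words that are capitalized
// although they are not the first word of the sentence and are not
// known proper nouns. Tokenization is naive whitespace splitting with
// trailing punctuation stripped.
std::vector<std::string> findMidSentenceCapitals(
        const std::string& sentence,
        const std::set<std::string>& properNouns) {
    std::vector<std::string> flagged;
    std::istringstream in(sentence);
    std::string word;
    bool first = true;
    while (in >> word) {
        while (!word.empty() &&
               std::ispunct(static_cast<unsigned char>(word.back()))) {
            word.pop_back();
        }
        if (!first && !word.empty() &&
            std::isupper(static_cast<unsigned char>(word.front())) &&
            properNouns.count(word) == 0) {
            flagged.push_back(word);
        }
        first = false;
    }
    return flagged;
}
```

A word-only checker sees 'Correct' in isolation and has no way to run
such a test, because it never learns the position of the word within the
sentence.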


Thus the problems at hand and to be discussed are:

a) should we give up on chained spell checkers even though there are
good uses for them? The simple fact that vanilla OOo has only one spell
checker does not mean there aren't other spell checkers around that
already make use of that chaining... Or that someone would like to make
use of it in the future.

b) The easy case is having no spell-checker-only component for that
language but a grammar checker that does also spell checking. Nothing
much to think about here.
But even if we give up on chaining but still have a grammar checker that
is also a spell checker AND a second, spell-checking-only component, we
still have to decide if we want to make use of the second one. If we
want to make use of that one as well, how do we merge the results?
Should it simply be that the grammar checker's spell checker is only
allowed to mark errors where the second one has found none? That is, to
introduce additional ones? (See the capitalization problem mentioned
above.) Or should it be allowed to overrule errors found by the second
one as not-to-be-reported as well?
Or do we need even more complex handling for this problem?


On short notice however a) can be treated as a special case of b) as
well. ^_-
Thus we probably do not need to change the current behavior of that.


Thomas





Re: About proofreader and spell checker interaction

thomas.lange

Here is Marcin's reply (2nd posting):



Hi Thomas,

[snip]


>> >>     // the proofreader may return SPELLING but right now our core
>> >>     // does only handle PROOFREADING if the result is from the
>> >>     // proofreader...
>> >>     // (later on we may wish to color spelling errors found by the
>> >>     // proofreader differently for example. But no special handling
>> >>     // right now.)
>> >>     if (rDesc.nType == text::TextMarkupType::SPELLCHECK)
>> >>         rDesc.nType = text::TextMarkupType::PROOFREADING;
>>    

Ah...

I was using bonsai before and couldn't find the sources yesterday,
that's why I asked.


> > Currently, where spell checking is still a separate process and there is
> > no coordination between it and proofreading it is explicitly disabled.
> > The reason for this is that it may be bad to have to different and
> > independent components spell check the same text. There is no mechanism
> > to prevent/solve inconsistencies.
>  

ah, you're right. It's non-trivial.

[...]


> > Currently spell checkers are chained (that is up for discussion as well
> > though, since without chaining the route to take seems to be rather
> > obvious). That means if any of several spell checker for a given
> > language says this text is correct than no error will be reported. That
> > would allow for spell checker A to check normal English text, and for
> > spell checker B to know only about English medical words. Those two
> > spell checkers can easily be chained and you will get a result that is
> > better than using just a single one. Without chaining you would need a
> > spell checker that has to take care of both tasks in one sweep.
>  

I'd say that chaining is OK as far as normal (non-context) spellers are
concerned. For grammar checkers it should be different, as they work on
a different principle (most of the time): instead of accepting a word
from a finite list, they search for an error from a finite list to say
that they don't accept the text. So instead of using OR (a disjunction)
of results, use AND (a conjunction) here - all proof-readers should not
raise any errors, but if any of them raises one, display it.

[...]


> > Thus the problems at hand and to be discussed are:
> >
> > a) should we give up on chained spell checkers even though there are
> > good uses for them? The simple fact that vanilla OOo has only one spell
> > checker does not mean there aren't other spell checkers around that
> > already make use of that chaining... Or that someone would like to make
> > use of it in the future.
>  

The easiest solution would be to define that a proofreader that has
isSpellChecker() should be chained as all checkers are. If not, then it
should be treated in the following manner: whenever a proofreader
returns an error marked as spellcheck, display it in red, unless this
error has been found earlier by another checker. Yet, in such a case, a
comment should be in place, so only change the color, nothing else.
(Even in a spellchecking dialog, the error could be reported later than
normal spelling errors).


> > But even if we give up on chaining but still have a grammar checker that
> > is also a spell checker AND a second only spell checker, we still have
> > to decide if we want to make use of the second one. If we want to make
> > use of that one as well, how to merge the results? Should it simply be
> > that the grammar checkers spell checker is only allowed to mark errors
> > where the second one hat found none?
>  

That seems reasonable, otherwise multiple errors would be displayed in
the same position.


> > Or should it be allowed to
> > overrule errors found by the second one as not-to-be-reported as well?
>  

That is interesting. Well, I didn't think of it as we never say "this is
acceptable", we only return errors. The API has no way of overruling
results. I would say an easier solution would be to explicitly say that
spellcheckers should accept all words regardless of context, so they
would accept "Sri" without "Lanka" or "Burkina" without "Faso". Next, a
grammar checker would see if Lanka is preceded by Sri, or Sri is
followed by Lanka, etc.
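
As a rough sketch of that division of labour: the word-based speller
accepts each token on its own, and a separate context rule checks that
the two parts of a fixed pair occur together. The function
findBrokenPairs and its parameters are invented for this example and the
sentence is assumed to be tokenized already.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical context rule for a fixed two-word name: firstPart must be
// directly followed by secondPart, and secondPart must be directly
// preceded by firstPart. Returns the indices of the tokens that violate
// the rule.
std::vector<std::size_t> findBrokenPairs(
        const std::vector<std::string>& tokens,
        const std::string& firstPart,
        const std::string& secondPart) {
    std::vector<std::size_t> errors;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == firstPart) {
            if (i + 1 >= tokens.size() || tokens[i + 1] != secondPart) {
                errors.push_back(i);
            }
        } else if (tokens[i] == secondPart) {
            if (i == 0 || tokens[i - 1] != firstPart) {
                errors.push_back(i);
            }
        }
    }
    return errors;
}
```

Calling it with "Sri" and "Lanka" flags a lone "Sri" or a lone "Lanka"
but leaves "Sri Lanka" alone, without the speller having to know
anything about context.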

Of course, this presupposes that developers of proofreaders are in touch
with developers of spellchecker dictionaries so that dictionaries would
be properly prepared.

Yet, as you probably know, Laci Nemeth wants to add some limited
context-check to hunspell. That would already create some problems...
Probably in such a case, another process in hunspell should use the
proofreader mechanism to lookup the context, but first the individual
words should be normally accepted.

My proposal doesn't require any change to the API - it would only define
what to do with text markup = spellcheck in case when the grammar
checker is not a spellchecker, and when it is a spellchecker.


> > Or do we need even more complex handling for this problem?
>  

I cannot see a use for it.

Marcin

Ps. BTW, I've heard that the comment being visible only after clicking
"Explain" is definitely less usable than the previous dialog box that we
had in LanguageTool. Users I talked to prefer to have the explanation
displayed without clicking. I find this intuitive as well. Maybe we
should ask people from the UX project to comment on this?


Re: About proofreader and spell checker interaction

thomas.lange

The most recent (3rd) posting so far:



Hello Marcin,


>> > > Currently spell checkers are chained (that is up for discussion as well
>> > > though, since without chaining the route to take seems to be rather
>> > > obvious). That means if any of several spell checker for a given
>> > > language says this text is correct than no error will be reported. That
>> > > would allow for spell checker A to check normal English text, and for
>> > > spell checker B to know only about English medical words. Those two
>> > > spell checkers can easily be chained and you will get a result that is
>> > > better than using just a single one. Without chaining you would need a
>> > > spell checker that has to take care of both tasks in one sweep.
>>    
> >
> > I'd say that chaining is OK as far as normal (non-context) spellers are
> > concerned. For grammar checkers it should be different, as they work on
> > a different principle (most of the time): instead of accepting a word
> > from a finite list, they search for an error from a finite list to say
> > that they don't accept the text. So instead of using OR (a disjunction)
> > of results, use AND (a conjunction) here - all proof-readers should not
> > raise any errors, but if any of them raises one, display it.
> >  
>  

If what you meant here was chaining of grammar checkers, then that
probably will never happen. Currently only one per language is
allowed.
Originally we also had in mind that grammar checkers could be chained.
And we were also told that the relation for chaining them should be AND.
But in the end we dropped that idea for two reasons:

1) chaining grammar checkers will likely be very time consuming and
often enough the process is already somewhat slow.

2) That, however, is far outweighed by the fact that there is absolutely
NO way to resolve what to do if two grammar checkers have different
ideas about where the sentence ends. And we don't want to go with
separate sentence ends for each checker. After all, the whole process is
sentence based, so a disagreement about the sentence end would be a
major problem. The only way to always prevent such disagreements is to
have only one grammar checker per language, since the sentence end
detection must be left to the specific implementation.
Chaining of grammar checkers would only be OK if OOo did the sentence
end analysis and enforced the results. But that is not an option.


> > [...]
> >
>  
>> > > Thus the problems at hand and to be discussed are:
>> > >
>> > > a) should we give up on chained spell checkers even though there are
>> > > good uses for them? The simple fact that vanilla OOo has only one spell
>> > > checker does not mean there aren't other spell checkers around that
>> > > already make use of that chaining... Or that someone would like to make
>> > > use of it in the future.
>>    
> >
> > The easiest solution would be to define that a proofreader that has
> > isSpellChecker() should be chained as all checkers are.
>  

Nope.
All other spell checkers already have the limitation that they are word
based.
Thus chaining is also only possible for word based spell checkers. After
all an easy chaining would require the same kind of API interface...
Of course the proofreader component is free to also implement a 'normal'
spell checker as well. (Actually the third party component we coded does
this.)
But you can't chain the word 'Correct' from the example to the proof
reader API on its own without the context (sentence).

Thus I believe it should be something like this:
The grammar checker or more likely the grammar checking iterator has to
make a separate run for all words of the current sentence with the
respective spell checkers. If we decide on a fixed logic of merging the
results with spell checking results from the proofreader then it can be
implemented in the gciterator, otherwise it probably needs to be
implemented by the proofreader itself. In the latter case we should
provide an API for the proofreader to make use of that. At least it
should already take care of presenting only the overall result after
chaining all independent word based spell checkers.

My preference would be to have an overall logic that can be implemented
in the gciterator, since it would spare the proofreader implementation
the extra burden.
Thus the question would be whether we can decide on a fixed logic for
merging the results.
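
One candidate for such a fixed merge logic, purely as a sketch: keep all
errors from the chained word-based checkers and add proofreader spelling
errors only where nothing overlapping has been reported yet. The
SpellError struct and mergeSpellingErrors are assumptions made for this
example; they do not reflect the actual gciterator code or the OOo error
structures, and this policy is only one of the options discussed in this
thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical error record: the half-open character range
// [start, start + length) within the sentence.
struct SpellError {
    std::size_t start;
    std::size_t length;
    bool fromProofreader;
};

// True if the two character ranges share at least one position.
bool overlaps(const SpellError& a, const SpellError& b) {
    return a.start < b.start + b.length && b.start < a.start + a.length;
}

// Keeps every error from the chained word-based checkers and adds a
// proofreader spelling error only if it does not overlap one of them.
// The result is sorted by position in the sentence.
std::vector<SpellError> mergeSpellingErrors(
        const std::vector<SpellError>& fromChain,
        const std::vector<SpellError>& fromProofreader) {
    std::vector<SpellError> merged = fromChain;
    for (const auto& candidate : fromProofreader) {
        const bool clash = std::any_of(
            fromChain.begin(), fromChain.end(),
            [&](const SpellError& e) { return overlaps(e, candidate); });
        if (!clash) {
            merged.push_back(candidate);
        }
    }
    std::sort(merged.begin(), merged.end(),
              [](const SpellError& a, const SpellError& b) {
                  return a.start < b.start;
              });
    return merged;
}
```

Under this policy the proofreader can only introduce additional spelling
errors; it can never suppress one found by the word-based chain.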



> > If not, then it
> > should be treated in the following manner: whenever a proofreader
> > returns an error marked as spellcheck, display it in red, unless this
> > error has been found earlier by another checker. Yet, in such a case, a
> > comment should be in place, so only change the color, nothing else.
> > (Even in a spellchecking dialog, the error could be reported later than
> > normal spelling errors).
> >  
>  
The spelling errors found by a proofreader need to be reported (and
taken care of by the user) first. The reason for this is that grammar
checking requires the proofreader to properly identify/tokenize each
word, and usually that can't be done if there are spelling errors. Thus
the quality of proofreading depends on the spelling errors being
resolved first.
In which order the spelling errors from different sources are displayed
does not matter much. But probably they should be sorted by their
occurrence in the sentence.
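
The ordering described here (spelling errors before grammar errors, each
group sorted by position) could look like the sketch below. ErrorKind,
ReportedError and orderForPresentation are hypothetical names used only
for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class ErrorKind { Spelling, Grammar };

// Hypothetical record: where the error starts in the sentence and
// whether it is a spelling or a grammar error.
struct ReportedError {
    std::size_t start;
    ErrorKind kind;
};

// Puts all spelling errors first, then all grammar errors; within each
// group the errors are ordered by their position in the sentence.
void orderForPresentation(std::vector<ReportedError>& errors) {
    std::stable_sort(
        errors.begin(), errors.end(),
        [](const ReportedError& a, const ReportedError& b) {
            if (a.kind != b.kind) {
                return a.kind == ErrorKind::Spelling;
            }
            return a.start < b.start;
        });
}
```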


>> > > But even if we give up on chaining but still have a grammar checker that
>> > > is also a spell checker AND a second only spell checker, we still have
>> > > to decide if we want to make use of the second one. If we want to make
>> > > use of that one as well, how to merge the results? Should it simply be
>> > > that the grammar checkers spell checker is only allowed to mark errors
>> > > where the second one hat found none?
>>    
> >
> > That seems reasonable, otherwise multiple errors would be displayed in
> > the same position.
> >  
>  
Yes, avoiding overlapping errors should also be done if possible. So
which error is going to win if the chained spell checkers and the proof
reader report a spelling error at overlapping but NOT identical positions?



>> > > Or should it be allowed to
>> > > overrule errors found by the second one as not-to-be-reported as well?
>>    
> >
> > That is interesting. Well, I didn't think of it as we never say "this is
> > acceptable", we only return errors.
>  
Sure.
I also don't expect any proofreader to implement a
'this-is-100%-correct' check function. It was probably just a useless
thought of mine: if the spell checker cannot provide detailed
information about the type of error found, then the only way for the
proofreader to overrule the spell checker results would be to discard
all of them. That essentially says: if the proofreader returns spelling
errors as well, then don't use word-only spell checkers at all. So let's
just forget about this thought of mine, since getting any additional
information out of the word-only spell checker would probably need a
completely new dictionary implementation.

Or can additional information be provided by Hunspell only?
And more pressingly, what kind of information could a spell checker
return so that a proofreader (or the gciterator) can decide whether a
specific error found by e.g. Hunspell should be discarded?


> > The API has no way of overruling
> > results. I would say an easier solution would be to explicitly say that
> > spellcheckers should accept all words disregarding the context, so they
> > would accept "Sri" without "Lanka" or "Burkina" without "Faso". Next, a
> > grammar checker would see if Lanka is preceded with Sri, or Sri is
> > followed by Lanka etc.
> >
> > Of course, this presupposes that developers of proofreaders are in touch
> > with developers of spellchecker dictionaries so that dictionaries would
> > be properly prepared.
> >  
>  
Why would that be the case?
The word-only spell checkers like Hunspell will be just fine if "Sri"
and "Lanka" are encountered by themselves. And later on the proofreader
can decide to raise a spelling error if it encounters "Sri" without "Lanka".
Therefore in this case I see no need for closer collaboration between
dictionary providers and proofreader implementations.

The only thing that needs to be done is to extend the words (i.e. the
breakiterator's definition of a word) to such a level that the spell
checker will not get handed text parts that are not acceptable as a
single word. This is to prevent it from marking text parts as wrong that
are actually correct.
(A good example for such problems is issue #64400)
The rest is left to the proofreader.


> > Yet, as you probably know, Laci Nemeth wants to add some limited
> > context-check to hunspell.
>  

Yes, I know. And I'm sorry for still not having found the time to
provide him with a rudimentary C++ implementation. :-(



> > Ps. BTW, I've heard that the comment being visible only after clicking
> > "Explain" is definitely less usable than the previous dialog box that we
> > had in LanguageTool. Users I talked to prefer to have the explanation
> > displayed without clicking. I find this intuitive as well. Maybe we
> > should ask people from the UX project to comment on this?
> >  
>  

Sure you can.
The last time I asked I was told that the dialog is already cramped and
that its size should not increase either. Thus nothing was done to
display the text from the 'Explain' button in a more directly visible
way.


Thomas










Re: About proofreader and spell checker interaction

Ruud Baars-2
I don't know if it is really on-topic, but I want to contribute my thoughts:

The context should be proofreading, preferably on the document level.
I can imagine the document being about football, making hyphenation of
the Dutch word balletje as bal-le-tje much more likely than bal-let-je
(small ballet).
Of course, this is far out of reach yet.

Nevertheless, it makes it clear that grammar has more priority than just
spell-checking.

So I plead for a (maybe dummy) context checker, using the spell checker
as subcomponent.
Or, said differently: upgrading the spellchecker to a context-sensitive one.

The context version could also be a lot better at determining language and
'quoted' language parts.

So I would prefer the interface to be directed in a way that:
- the proofreader can ask the calling software for a scope (document,
paragraph)
- the proofreader does its proofreading, using subcomponents like a
spell checker through a defined interface.
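
A minimal sketch of what such a "dummy" context checker might look like,
with the spell checker passed in as a subcomponent. Everything here
(Scope, SpellCheck, DummyContextChecker) is invented for illustration
and is not an existing OOo interface.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scopes a proofreader could request from the caller.
enum class Scope { Paragraph, Document };

// Hypothetical subcomponent interface: returns true if the word is
// spelled correctly.
using SpellCheck = std::function<bool(const std::string&)>;

// Dummy context checker: it adds no context logic of its own yet and
// simply delegates every whitespace-separated word to the spell checker
// it was given.
class DummyContextChecker {
public:
    explicit DummyContextChecker(SpellCheck speller)
        : speller_(std::move(speller)) {}

    // The scope this checker would like to receive from the caller.
    Scope requestedScope() const { return Scope::Paragraph; }

    // Returns the words rejected by the spell checker subcomponent.
    std::vector<std::string> check(const std::string& text) const {
        std::vector<std::string> rejected;
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            if (!speller_(word)) {
                rejected.push_back(word);
            }
        }
        return rejected;
    }

private:
    SpellCheck speller_;
};
```

A real context-sensitive checker would replace the body of check() with
its own analysis and consult the subcomponent only where needed.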

I think it is time to integrate spell and grammar checking into one, by
defining the way to work together using interfaces.

Now that we are busy discussing this, I would appreciate defining more
warning colors in the text processor, maybe representing the level of
seriousness.

I think really integrating language support like this gives OOo (and by
using a standard API other word processors as well) a big advantage.

Though this is not technical at all, I still hope it contributes.

Ruud


Re: About proofreader and spell checker interaction

Marcin Miłkowski
In reply to this post by thomas.lange
Hi all, and Thomas :)

I guess nobody else comments, as these things are highly technical, but
they're of interest to the list. So let me continue our exchange :)


[snip]

> If what you meant here was chaining of grammar checkers, then that
> probably will never happen. Currently there is only one per language
> allowed.

No, I meant chaining spell-checkers. So you don't have to argue against
my proposal - there was none :)

[snip]

>>>>> Thus the problems at hand and to be discussed are:
>>>>>
>>>>> a) should we give up on chained spell checkers even though there are
>>>>> good uses for them? The simple fact that vanilla OOo has only one spell
>>>>> checker does not mean there aren't other spell checkers around that
>>>>> already make use of that chaining... Or that someone would like to make
>>>>> use of it in the future.
>>>    
>>>
>>> The easiest solution would be to define that a proofreader that has
>>> isSpellChecker() should be chained as all checkers are.
>>  
>
> Nope.
> All other spell checkers already have the limitation that they are word
> based.

OK, I see your point.

[snip]

> My preference would be to have an overall logic that can be implemented
> in the gciterator since it would prevent extra burden from the
> proofreader implementation.

+1

[snip]

>>  
> The spelling errors found by a proofreader need to be reported (and taken
> care of by the user) first. The reason for this is that grammar checking
> requires the proofreader to properly identify/tokenize each word, and
> usually that can't be done if there are spelling errors. Thus the
> quality of proofreading depends on the spelling errors being resolved first.
> In which order the spelling errors from different sources are displayed
> does not matter much. But probably they should be sorted by their
> occurrence in the sentence.

Agreed.

[snip]

> I also don't expect any proofreader to implement a
> 'this-is-100%-correct' check function. It was probably just a useless
> thought of mine, since if the spell checker can not provide some
> detailed information about the type of error found, then the only choice
> for overruling the spell checker results in this case would be for the
> proofreader to discard all of them. Thus essentially saying: if the
> proofreader returns spelling errors as well, then don't use word-only
> spell checkers at all. Thus lets just forget about this thought of mine,
> since providing any additional information from the word-only spell
> checker will probably need a complete new dictionary implementation to
> provide that kind of information.

Hm, the grammar checker can implement some kind of context checks, or
statistical processing that would mark some words as correct. I think,
however, that this kind of feature would be quite cumbersome - and it
can be implemented more easily by simply keeping spellcheckers
word-based and letting them accept all possible words regardless of
context. Then grammar checkers can highlight errors that depend on the
context.

[...]

>>>
>>> Of course, this presupposes that developers of proofreaders are in touch
>>> with developers of spellchecker dictionaries so that dictionaries would
>>> be properly prepared.
>>>  
>>  
> Why would that be the case?

You would need to talk to dictionary maintainers to keep Burkina and
Faso as separate entries. Some of them don't want that, and they want
hunspell to implement context-sensitive checks. But I think
context-sensitive checking should be the job of a grammar checker. And
most dictionaries are word-based anyway, and they accept both Sri and
Lanka.

[...]

>>> Ps. BTW, I've heard that the comment being visible only after clicking
>>> "Explain" is definitely less usable than the previous dialog box that we
>>> had in LanguageTool. Users I talked to prefer to have the explanation
>>> displayed without clicking. I find this intuitive as well. Maybe we
>>> should ask people from the UX project to comment on this?
>>>  
>>  
>
> Sure you can.
> The last time I asked I was told that the dialog is already cramped but
> the size of the dialog should not increase also. Thus nothing was done
> to display the text from the 'Explain' button in a more directly visible
> way.

I'd be quite happy without the dialog branding, which is nice but huge
and pretty much useless, with the explanation shown in its place and
branding limited to an icon. But I suppose that's not what all
third-party proofreader makers want.

Regards
Marcin
