Proofreading: Sentence tokenization problem


Proofreading: Sentence tokenization problem

Marcin Miłkowski
Hi all,

As some of you probably know, LanguageTool doesn't handle sentence
tokenization in OpenOffice.org nicely. We get whole paragraphs via the
doProofreading() API and we return the information for whole
paragraphs. This is of course wrong for multilingual paragraphs...
The reason is that I ran into some small problems that I don't really
know how to solve.
There is at least one, IMHO very useful, rule that checks if brackets,
quotation marks etc. come in pairs in the text. Obviously, you want to
check this in a whole paragraph, as quotations often contain many
sentences. Now, the problem is that if I tokenize the text at the
sentence level, I get the next bit of paragraph text with every call,
and that makes it very hard to track the number of unmatched quotation
marks. Let me explain:

"Blah blah. Blah blah".

gets zero matches with paragraph tokenization, because I can retain
information about rule matches across a whole paragraph in a single
pass. With sentence tokenization it produces two false alarms. (The
algorithm I use is to add an error match to an array, and then move it
to a "removed" array if a corresponding quotation mark appears later
on. When I'm finished with a paragraph, I delete the matches marked as
removed, and all unpaired matches are displayed. This cannot work in
sentence-tokenized mode, however.) There are two reasons:

(1) I don't get the possibility to remove previous matches - I can only
remove the match in the current sentence, not in the previous one. So
backtracking seems impossible when I get the second sentence ('Blah
blah"'). I could try to store the previous matches internally along with
the text, but I would have to call OOo APIs to set some errors as
ignored, it seems to me, to be able to remove the blue underlining of
the first quotation mark ('"Blah blah'). That seems like overkill, and
it is not reliable, as the user can simply edit the text and the
previous match will end up at a different position. I could try to signal
"recheck" to OOo, but I don't want to recheck the whole document...
There is no way to call "recheck text" on a single paragraph, which
would be needed in such a case.

(2) Worse still, if an English paragraph contains a single French word,
I would lose the rule state info as well, or I would have to keep an
instance of LanguageTool for every supported language in memory, as
rules and their state are implemented at the language level. That's
feasible, of course, as long as a document contains just a couple of
languages. But in a multilingual document (easy to imagine for any
European Union leaflet with 10 languages or so) you would have all
those checkers in memory...
Of course, I could try to store just the state of the paragraph-level
rules instead of the whole checker, but that would complicate the code
a lot.
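The matching strategy described earlier (record a potential error for
every opening quote or bracket, cancel it when the matching closer
arrives, and report whatever is left unpaired at the end of the
paragraph) can be sketched as follows. This is a hypothetical Python
illustration, not actual LanguageTool code:

```python
# Sketch of paragraph-level unpaired-delimiter detection. Every opener
# is recorded as a pending match; a matching closer cancels it; whatever
# survives to the end of the paragraph is reported as unpaired.

PAIRS = {"(": ")", "[": "]"}

def unpaired_positions(paragraph: str) -> list[int]:
    """Return character offsets of unpaired brackets/quotes."""
    stack = []      # (char, offset) of pending openers
    unpaired = []   # closers that never had an opener
    for i, ch in enumerate(paragraph):
        if ch == '"':
            # Straight quotes toggle: the next quote closes a pending one.
            if stack and stack[-1][0] == '"':
                stack.pop()
            else:
                stack.append((ch, i))
        elif ch in PAIRS:
            stack.append((ch, i))
        elif ch in PAIRS.values():
            if stack and PAIRS[stack[-1][0]] == ch:
                stack.pop()
            else:
                unpaired.append(i)
    unpaired.extend(i for _, i in stack)  # openers never closed
    return sorted(unpaired)
```

On '"Blah blah. Blah blah".' this reports nothing, while a
sentence-by-sentence scan would flag both quotes as unpaired.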

I was playing with different design strategies, and it seems to me that
it would be easiest for me if we had two more features in the API:

(1) checking whole paragraphs, and

(2) triggering a recheck of whole paragraphs for special
paragraph-level rules.

I could try to implement normal sentence-level checks via
doProofreading and iterate the text manually via the paragraph-text
APIs, with those special rules being called on whole paragraphs; but
maybe the same functionality would be needed by other checkers, and I
would be duplicating code... This would involve another change to the
APIs, which isn't the nicest thing, to say the least.

What do others think? Any thoughts or advice on that?

Thanks in advance
Marcin
--
www.languagetool.org

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Proofreading: Sentence tokenization problem

thomas.lange

Hi Marcin and everyone else,

Marcin Miłkowski wrote:

> Hi all,
>
> [snip]
>
> What do others think? Any thoughts or advice on that?

There are basically two, maybe three, solutions to this problem. One
you probably won't like, a second for a special case that might be hard
to implement, and a third that works but is not an optimal solution to
the problem.

(1) Stick to sentence analysis only and leave the thinking to the user.
That is, report the error for missing brackets and quotes and let the
user decide whether it is an error or not. If it is not, he can press
the "Ignore once" button (of course, our implementation for that may
need some improvements); otherwise he can add the missing bracket/quote.
Reasoning: are you sure you can decide that in cases like
    "Blah blah. Blah blah".
the user did not actually want to write
    "Blah blah." "Blah blah". ?
The only one who can decide about that is the user.

(2) In a case like "Blah blah. Blah blah"., isn't that usually a
quotation of one or more sentences within another sentence?
For example:
    The scribbling on the wall said: "Don't panic! Get drunk".
In that case, and because the API does not explicitly provide means for
embedded sentences, the proper choice would be that "Don't panic!" and
"Get drunk" should not be treated as sentences of their own. Instead
the whole text should be one sentence only. Then the quotes will again
have a match within the single sentence.

(3) It is not without reason that the XProofreader::doProofreading call
gets the text of the whole paragraph passed along with the sentence
start position. If we had wanted to restrict the proofreader to a
single sentence, it would have been possible to provide only the text
from sentence start to paragraph end. The whole text is provided
exactly to enable checking beyond sentence boundaries.
Thus your implementation is free to check the text before and after the
current sentence. The only restriction is that errors must be reported
only within the bounds of the current sentence.
That is, if you have text like:
    (This does not occur always. Sometimes there will be no problem. At
other times it can't be fixed.)
then the text should probably be decomposed into the following
sentences:
    - (This does not occur always.
    - Sometimes there will be no problem.
    - At other times it can't be fixed.)
If the closing bracket were missing, you would then have to report that
error in the first sentence, referring to the opening bracket. And if
the opening bracket were missing, you would have to report that in the
third sentence, with the closing bracket as the error position.
The only drawback of this solution is that you have to check for this
over and over for every sentence in the paragraph, searching the whole
paragraph each time.
But if you only check for things like matching brackets and quotes,
that should not be a big problem, since those can be found without
spending the much larger amount of time needed to tokenize the whole
paragraph on each call.
For such rather simple cases it should be possible for you to stick
with sentence-level tokenization only. And compared to the time you
will need to tokenize the sentence, the overhead of always looking for
matching brackets/quotes in the whole paragraph should still be small.
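This option can be sketched like so (a hypothetical Python
illustration, not OOo or LanguageTool code): the checker always
inspects the full paragraph text, but reports only the errors whose
offsets fall inside the sentence currently being proofread.

```python
def find_unpaired_parens(paragraph: str) -> list[int]:
    """Offsets of unmatched '(' and ')' in the whole paragraph."""
    stack, unpaired = [], []
    for i, ch in enumerate(paragraph):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if stack:
                stack.pop()
            else:
                unpaired.append(i)   # closer with no opener
    return sorted(unpaired + stack)  # plus openers never closed

def errors_for_sentence(paragraph: str, sent_start: int,
                        sent_end: int) -> list[int]:
    """Check the whole paragraph, but report only the errors that lie
    within [sent_start, sent_end), i.e. the current sentence."""
    return [pos for pos in find_unpaired_parens(paragraph)
            if sent_start <= pos < sent_end]
```

With the example above, a paragraph missing its closing bracket yields
an error only when the first sentence (which holds the opening bracket)
is checked; the later sentences report nothing.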


All of the above solutions/workarounds can be applied with the current
API.
Of course it would also be possible to extend the API, for example with
an interface providing a function like doParagraphLevelChecking.
The semantics of that function would be that it is called once for each
paragraph (probably after the last sentence was checked), and it likely
needs to be restricted to non-language-specific checks, since the
paragraph may not consist of a single language. For the needs of the
UI, however, any error found still has to be assigned to a single
sentence and a position within it.
Also, such a function would probably not be called for each language
but only once for each proofreader implementation, perhaps even if a
specific proofreader does not support any of the languages used in that
paragraph. (That's why the implementation should do only
non-language-specific checks.)
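The UI constraint mentioned above (a paragraph-level error must still
be assigned to a single sentence and a position within it) amounts to a
simple coordinate mapping. A hypothetical sketch, assuming the sentence
start offsets within the paragraph are known:

```python
def to_sentence_coords(offset: int, sentence_starts: list[int],
                       paragraph_len: int) -> tuple[int, int]:
    """Map a paragraph-relative error offset to
    (sentence_index, offset_within_sentence).
    sentence_starts is a sorted list of start offsets, first entry 0."""
    bounds = sentence_starts + [paragraph_len]
    for idx in range(len(sentence_starts)):
        if bounds[idx] <= offset < bounds[idx + 1]:
            return idx, offset - bounds[idx]
    raise ValueError("offset outside paragraph")
```

A paragraph-level rule could then report its findings in whatever
per-sentence form the proofreading UI requires.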
For the time being, however, I'd like to postpone such an
implementation until we have a larger list of problems where the
current API is not sufficient/efficient enough. If we have more cases
that need better attention, we may see that the required API extension
needs to look somewhat different, or, even worse, it may turn out later
that we need an incompatible API change. And I prefer to have as few
API changes as possible. So before we go with a function like this,
let's wait some more time and see what other API problems turn up, in
order to address them in one clean sweep.


Regards,
Thomas




Re: Proofreading: Sentence tokenization problem

Marcin Miłkowski
Hi Thomas, and all others :)

[snip]

> There are basically two, maybe three, solutions to this problem. One
> you probably won't like, a second for a special case that might be
> hard to implement, and a third that works but is not an optimal
> solution to the problem.

Actually, I like the third idea a lot :)

> (1) Stick to sentence analysis only and leave the thinking to the user.

This is what I decided to do yesterday when committing some other
changes to CVS. But it's suboptimal, as it increases false positives *a
lot*.

> (2) In a case like "Blah blah. Blah blah"., isn't that usually a
> quotation of one or more sentences within another sentence?
> For example:
>     The scribbling on the wall said: "Don't panic! Get drunk".
> In that case, and because the API does not explicitly provide means
> for embedded sentences, the proper choice would be that "Don't panic!"
> and "Get drunk" should not be treated as sentences of their own.
> Instead the whole text should be one sentence only. Then the quotes
> will again have a match within the single sentence.

That would be feasible under one condition: that you actually get both
quotation marks, and only quotation marks. Yet if you get parentheses
inside quotes, which is quite common, then you end up with strange
segmentation that makes false positives show up again:

Derrida wrote: "Blah (bla?) blah!".

It's not a rare case to have brackets inside parentheses, parentheses
with quotes, etc.

> (3) It is not without reason that the XProofreader::doProofreading
> call gets the text of the whole paragraph passed along with the
> sentence start position. If we had wanted to restrict the proofreader
> to a single sentence, it would have been possible to provide only the
> text from sentence start to paragraph end. The whole text is provided
> exactly to enable checking beyond sentence boundaries.
> Thus your implementation is free to check the text before and after
> the current sentence. The only restriction is that errors must be
> reported only within the bounds of the current sentence.

This is quite a nice setup. I can easily use only the paragraph-level
rules on the whole paragraph, even without tokenization, and then
return only the results within the currently analyzed sentence. I
assure you this is much easier than storing the state of the rules (via
class serialization, for example).

> The only drawback of this solution is that you have to check for this
> over and over for every sentence in the paragraph, searching the
> whole paragraph each time.
> But if you only check for things like matching brackets and quotes,
> that should not be a big problem, since those can be found without
> spending the much larger amount of time needed to tokenize the whole
> paragraph on each call.

Yes, I can skip tokenization of the paragraph altogether for this
rule. Mapping the results would be even easier if I skip tokenization.

[snip]

> For the time being, however, I'd like to postpone such an
> implementation until we have a larger list of problems where the
> current API is not sufficient/efficient enough. If we have more cases
> that need better attention, we may see that the required API
> extension needs to look somewhat different, or, even worse, it may
> turn out later that we need an incompatible API change. And I prefer
> to have as few API changes as possible. So before we go with a
> function like this, let's wait some more time and see what other API
> problems turn up, in order to address them in one clean sweep.

I agree. Much more important are some other changes that don't require
changes to the API (like proper checking of text that contains redlined
revisions, or supplying some basic formatting information in paragraph
properties).

Regards
Marcin


Re: Proofreading: Sentence tokenization problem

Mathias Bauer
In reply to this post by Marcin Miłkowski
Marcin Miłkowski wrote:

> There is at least one, IMHO very useful, rule that checks if brackets,
> quotation marks etc. come in pairs in the text. Obviously, you want to
> check this in a whole paragraph, as quotations often contain many
> sentences. Now, the problem is that if I tokenize the text on the
> sentence level, I get next bits of paragraph text with every call, and
> that makes it very hard to track the number of unmatched quotation
> marks.

IMHO checking whole paragraphs is only gradually better than checking
sentences only. Quotations may even span several paragraphs, and the
only way to catch missing matching quotation marks there is to always
check the whole text, which is obviously not a useful solution.

Basically there is nothing wrong with maintaining information about a
paragraph (here: "opening quote found") and postponing its judgement
until more information is available (here: waiting for a matching
closing quote, maybe in a later paragraph). What we then need is a way
(= an API) to report possible errors found this way.
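Carrying that state across paragraphs could look roughly like this (a
hypothetical Python sketch; no such reporting API exists yet, which is
exactly the point):

```python
class QuotePairTracker:
    """Keeps 'open quote seen' state across paragraph boundaries, so a
    quote opened in one paragraph can be matched by a closing quote in
    a later one, and judged only at the end of the document."""

    def __init__(self):
        self.pending = None  # (paragraph_index, offset) of an open quote

    def feed(self, para_index: int, text: str) -> None:
        """Scan one paragraph, updating the cross-paragraph state."""
        for i, ch in enumerate(text):
            if ch == '"':
                # Toggle: an open quote is matched by the next quote seen.
                if self.pending is None:
                    self.pending = (para_index, i)
                else:
                    self.pending = None

    def unmatched(self) -> list[tuple[int, int]]:
        """After the last paragraph: the still-open quote, if any."""
        return [self.pending] if self.pending is not None else []
```

The open question raised here is how the checker would report such a
deferred error back to the application once a later paragraph resolves
(or fails to resolve) it.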

Regards,
Mathias

--
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to "[hidden email]".
I use it for the OOo lists and only rarely read other mails sent to it.

