ZWSP

Javier SOLA
Hi Nemeth,

Regarding the issue of using /u2xxx characters in Hunspell, I
wanted to ask whether there is any news or progress on it.
Is there any chance that it can be fixed in OpenOffice.org 2.4 (or that
there is a patch we can use)?

For Khmer we need to use ZWSP as word separator (words are written one
after the other without separation), and the spellchecker so far does
not work in 2.4 (it did in prior versions).

I would be grateful for any information.

Cheers,

Javier

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: ZWSP

Németh László-2
Hi Javier,

I believe I have so far received only a report about tokenization
problems in the command-line version of Hunspell. I will add the requested
suggestion of ZWSP instead of space, which is relevant to OpenOffice.org,
but I don't know of any ZWSP problems in OOo 2.4, if they exist.
Could you send me a more detailed bug report using the new Hunspell
1.2.2 beta, perhaps with tests of a build compiled without the HAVE_ICONV
macro in config.h? I'd like to fix your problem in 1.2.2 and integrate it
into OpenOffice.org as soon as possible.

Thanks in advance,
László


2008/1/16, Javier SOLA <[hidden email]>:

> Hi Nemeth,
>
> In relation to the issue of using /u2xxx characters in Hunspell, I
> wanted to ask you if there is any more information or development on it.
> Any chances that it can be fixed in 2.4 (or have a patch that we can use).
>
> For Khmer we need to use ZWSP as word separator (words are written one
> after the other without separation), and the spellchecker so far does
> not work in 2.4 (it did in prior versions).
>
> I would be grateful for any information.
>
> Cheers,
>
> Javier

Re: ZWSP

Javier SOLA
Hi Németh,

Thanks !

Should I file a bug report in OpenOffice or in Hunspell?

We are still working with 2.1, because 2.2 and 2.3 had important issues
for Khmer. In 2.1 we can spellcheck Khmer without any problem. We
separate words with ZWSP because graphically the words need to be
together; the space is used to mark a pause in speech, equivalent to
the comma in English.

We do have the problem of Hunspell suggesting SPACE as a separator,
instead of ZWSP, but this is a different issue.

We have tested the latest builds of 2.4, and now ZWSP is not
interpreted as a word separator. If we separate words with ZWSP, they
are still treated as one word (as if the character were a ZWJ or
ZWNJ) and flagged as misspelled. If we separate them with SPACE,
then it works. This is just a long shot, but... could it be that when
you added support for ZWJ (U+200D) and ZWNJ (U+200C) as characters that can
be placed inside a word, ZWSP (U+200B) went into the same block of
characters that can be used inside words?
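
To make the hypothesis concrete, here is a minimal Python sketch of the two behaviours. It is only an illustration, not OpenOffice.org, ICU or Hunspell code; the function name `split_words` and the `zwsp_breaks` switch are made up for the example, and Latin letters stand in for Khmer words.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE: meant to mark a word boundary
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: may appear inside a word
ZWJ = "\u200d"   # ZERO WIDTH JOINER: may appear inside a word


def split_words(text, zwsp_breaks=True):
    """Split text on whitespace and, optionally, on ZWSP.

    ZWJ and ZWNJ are never treated as separators, so they stay
    inside the word that contains them.
    """
    separators = r"[\s" + (ZWSP if zwsp_breaks else "") + r"]+"
    return [word for word in re.split(separators, text) if word]


sample = "abc" + ZWSP + "de" + ZWNJ + "f gh"
print(split_words(sample))                     # ZWSP acts as a separator
print(split_words(sample, zwsp_breaks=False))  # ZWSP is kept inside the word
```

With `zwsp_breaks=False` the first two words reach the spellchecker as a single token, which matches the behaviour described for the 2.4 builds.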

Cheers,

Javier

Re: ZWSP

Németh László-2
Hi Javier,

2008/1/17, Javier SOLA <[hidden email]>:
> Hi Németh,
>
> Thanks !
>
> Should I file a bug report in OpenOffice or in Hunspell?

OpenOffice

>
> We are still working with 2.1, because 2.2 and 2.3 had important issues
> for Khmer. In 2.1 we can do spelling of Khmer without any problem. We
> separate words with ZWSP, because graphically the words need to be
> together, the space is used to mark a stop in the speech, equivalent to
> the comma in English.
>
> We do have the problem of Hunspell suggesting SPACE as a separator,
> instead of ZWSP, but this is a different issue.
>
> We have tested the latest builds of 2.4, and now ZWSP is not
> interpreted as a word separator. If we separate words with ZWSP, they
> are still considered as one word (as if the character was a ZWJ or
> ZWNJ), and considered as misspelled. If we separate them with SPACE,
> then it works. This is just a long shot, but... could it be that when
> you added support for ZWJ (200D) and ZWNJ (200C) as characters that can
> be placed inside a word, the ZWSP (200B) went into the same block of
> characters that can be used inside words?

Hunspell doesn't break the text into words in OpenOffice.org. OOo uses the
IBM ICU library for this task: http://wiki.services.openoffice.org/wiki/ICU

It seems that updating IBM ICU in OpenOffice.org has caused your problem.
Maybe the new ICU files have overwritten the correct rule definitions for
ZWSP tokenization.
We need a new OpenOffice.org l10n issue with a detailed bug report.

Cheers,
László

Re: ZWSP

Javier SOLA
László,

Németh László wrote

> Hi Javier,
>
> Hunspell doesn't break the text in OpenOffice.org. OOo uses IBM ICU library
> for this task: http://wiki.services.openoffice.org/wiki/ICU
>
> It seems, updating IBM ICU in OpenOffice.org has generated your problem.
> Maybe new ICU files have overwritten the good syntax definitions of
> ZWSP tokenization.
> We need a new l10n OpenOffice.org issue with detailed bug report.
>
>  
Thanks !

It is not the only problem we have had since the change to the new
version of ICU; collation was also affected.

I will file the issue.

Out of curiosity, and off topic for this list: we are developing a
localization editor for XLIFF files, and we are trying to integrate
Hunspell. Do we need to do our own tokenization (for ZWSP)?

Cheers,

Javier


Re: ZWSP

thomas.lange
In reply to this post by Javier SOLA

Hello Javier and László!

>> Hunspell doesn't break the text in OpenOffice.org. OOo uses IBM ICU library
>> for this task: http://wiki.services.openoffice.org/wiki/ICU
>>
>> It seems, updating IBM ICU in OpenOffice.org has generated your problem.
>> Maybe new ICU files have overwritten the good syntax definitions of
>> ZWSP tokenization.
>> We need a new l10n OpenOffice.org issue with detailed bug report.
>>

László is correct that the problem of word separation lies within the
breakiterator, which is implemented by means of ICU.

But I would guess that, even though your solution with ZWSP is
currently the only option for you, it might not give quite as good a
result as you want.
For example, if you have a piece of text delimited by the breakiterator
and consisting of, let's say, 3 Khmer words XYZ, abcd and PQRST, then as I
understand it they will be displayed (and presented to the spellchecker)
like this: XYZabcdPQRST. If there happens to be a single error within
the second word, you can't help but have the spellchecker return a
suggestion for the whole text, that is, all 3 words.

I don't know how long such constructs without spaces might get in Khmer.
But if they tend to be longer it will become troublesome, especially if
you have to deal with more than one error, e.g. one in the first word
and one in the third, with more than one reasonable choice available for
each of them. How are you going to handle the multitude of suggestions
you have to deal with if you have to return suggestions for the text
consisting of all three words?
It seems that this cannot be fixed at all by just modifying the current
spellchecker implementation.

There are two options I see to solve this:
a) If ZWSP is simple AND fast enough to apply AND OpenSource it
   might be integrated into the breakiterator and thus it may be
   fine. (We may still have follow-up issues with attributes being
   applied where they should not have been, though. I've already
   seen similar issues with Chinese translation and Hangul/Hanja
   conversion.)
b) You have to wait for our grammar checking (or better, proof
   reading) framework to get finished, because for that we will
   pass complete sentences on to the checker.
   Right now (in the CWS gcframework) we have implemented it to
   the point where it can be used for automatic checking and
   marking of wrong text, but without suggestions being available
   in the context menu.

Basically I'm just saying you should probably be prepared to implement a
grammar checker later on since that is likely to be the only correct
solution to the problem.


Regards,
Thomas


Re: ZWSP

Németh László-2
In reply to this post by Javier SOLA
Hi Javier,

2008/1/17, Javier SOLA <[hidden email]>:
> Out of curiosity, and off topic for this list. We are developing a
> localization editor for XLIFF files, and we are trying to integrate
> Hunspell. Do we need to do our own tokenization (for ZWSP)?

I have checked now, Hunspell handles ZWSP correctly:

echo xxx$(echo -ne '\x0B\x20' | iconv -f utf-16le -t utf-8)xxx | hunspell -d en_US
Hunspell 1.2.2b
& xxx 4 0: xx, xix, x xx, xx x
& xxx 4 6: xx, xix, x xx, xx x

You can use Hunspell tokenization via its pipe interface or parser library.
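
For an editor that wants to use the pipe interface, a rough Python sketch could look like the following. It assumes a `hunspell` binary on the PATH, an installed `en_US` dictionary, and the ispell-compatible `-a` pipe mode; the helper names `parse_miss` and `check_line` are invented for this example. The parser handles the "& word count offset: suggestions" lines shown above and the related "# word offset" lines that ispell-style output uses when there are no suggestions.

```python
import subprocess


def parse_miss(line):
    """Parse one ispell-style result line.

    Returns (word, offset, suggestions) for '&' and '#' lines,
    and None for every other line (banner, '*', '+', blank).
    """
    if line.startswith("& "):
        head, _, tail = line.partition(": ")
        _, word, _count, offset = head.split(" ")
        return word, int(offset), tail.split(", ")
    if line.startswith("# "):
        _, word, offset = line.split(" ")
        return word, int(offset), []
    return None


def check_line(text, dictionary="en_US"):
    """Send one line of text to hunspell and return the flagged words.

    Assumption: 'hunspell -a -d <dictionary>' is available on this system.
    """
    result = subprocess.run(
        ["hunspell", "-a", "-d", dictionary],
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    parsed = (parse_miss(line) for line in result.stdout.splitlines())
    return [miss for miss in parsed if miss is not None]


print(parse_miss("& xxx 4 0: xx, xix, x xx, xx x"))
```

Each flagged word comes back with the offset reported by Hunspell, so the editor can map it to a position in its own text buffer without doing its own tokenization.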

Cheers,
László

Re: ZWSP

Németh László-2
In reply to this post by thomas.lange
Hi Thomas,

2008/1/17, Thomas Lange - Sun Germany - ham02 - Hamburg <[hidden email]>:
> There are two options I see to solve this:

These options are not mutually exclusive: using optional ZWSP
characters as word breaks will not modify the grammar checking of the
sentences.  The problem is that ZWSP is not a word break character
now, but ZWSP "used to indicate word boundaries to text processing
systems when using scripts that do not use explicit spacing";
(http://en.wikipedia.org/wiki/Space_(punctuation))

Regards,
László

Re: ZWSP

Javier SOLA
Hi Thomas,

We have an easy way of typing ZWSP. It is on the spacebar of the Khmer
keyboard (SP is shift+spacebar). We teach everybody to use it when they
type.

It was working before with ICU 2.6, but there seems to be a
regression after the upgrade to the newer version.

As you say, using ZWSP is definitely not the best solution. The correct
thing would be for applications to do tokenization and line breaking
themselves, so that we do not have to separate the words, but this
will take some time (they already do it for Thai, though, through ICU).
The problem is that all applications need to implement the algorithm
for the text to be portable. For example, we cannot stop using ZWSP
in web pages until Internet Explorer handles line-breaking for Khmer...
and that is probably going to take a very long time.

It will be great to work with the new framework. So it is probably a
good idea to start working on a dic file that has all the word types and
characteristics that will be used by the framework...

Cheers,

Javier

Re: ZWSP

Javier SOLA
In reply to this post by Németh László-2
Hi László, Thomas,

We have ended up including ZWSP specifically in the word boundary rules
of OpenOffice (originally taken from ICU).

Meanwhile, I am writing a proposal to the Unicode Consortium to revert
ZWSP to its original state. Unicode is not supposed to change characters
just like that, much less retroactively through an erratum to the
standard, bypassing all the committees, as was done here.
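
For anyone who wants to check how a given environment classifies ZWSP, the short Python sketch below may help. It relies only on the standard `unicodedata` module, so the answer reflects the Unicode data bundled with the interpreter in use, not ICU or OpenOffice; the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return the Unicode name, general category and whitespace flag of ch."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "isspace": ch.isspace(),
    }


print(describe("\u200b"))  # ZERO WIDTH SPACE
print(describe(" "))       # ordinary SPACE, for comparison

# Whitespace-based splitting therefore leaves ZWSP-separated words joined.
print(("abc" + "\u200b" + "def").split())
```

On a current Python 3, ZWSP is reported as a format character (Cf) and not as a space separator (Zs), which is why generic whitespace rules do not break on it.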

With respect to the Wikipedia entry, it is ambiguous: "used to..." is
also short for "It is used to...", which I think is the meaning that the
writer intended (because ZWSP was changed in May 2008).

http://en.wikipedia.org/wiki/Space_(punctuation)

The good news is that it works again, and that we can do spellchecking.

We have written a dictionary-based breakiterator for Khmer, copying the
one in ICU 4.0... but OpenOffice 3.0 uses ICU 3.6, so we cannot yet
include it. I am looking into ICU 3.6, but Thai seems to be rule-based;
only Chinese and Japanese have dictionary-based breakiterators. Is this
right?
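
As a rough idea of what "dictionary-based" means here, the Python sketch below does greedy longest-match segmentation against a word list. This is a generic textbook technique and not the ICU 4.0 algorithm or our Khmer implementation; the function name `segment` and the tiny lexicon are invented, and Latin letters stand in for Khmer.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of unspaced text.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        size = next(
            (n for n in range(longest, 0, -1) if text[i:i + n] in lexicon),
            1,  # unknown character: fall back to a one-character token
        )
        words.append(text[i:i + size])
        i += size
    return words


lexicon = {"XYZ", "XY", "abcd", "PQRST"}
print(segment("XYZabcdPQRST", lexicon))
```

Greedy matching can pick the wrong boundary when a long entry overlaps two shorter words, which is one reason real breakiterators combine the dictionary with additional rules.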

Cheers,

Javier



Re: ZWSP

thomas.lange
In reply to this post by Javier SOLA
Hi all,

> We have written a dictionary-based breakiterator for Khmer, copying the
> one in ICU 4.0... but OpenOffice 3.0 uses ICU 3.6, so we cannot yet
> include it. I am looking into ICU 3.6, but Thai seems to be rule-based,
> only Chinese and Japanese have dictionary based breakiterators. Is this
> right ?


I asked the responsible developer to be sure about this and got the
following answer:

> Yes, we have our own dictionary-driven Chinese and Japanese word
> breakiterator, which has been implemented since SO 6.0, long before we
> started using ICU in SO 7. It is not the ICU dictionary-based breakiterator.
>
> We only use the rule-based word breakiterator in ICU for other languages.
> If there is a requirement for a word-based breakiterator, we need to
> enhance our code.


Thomas
