syllable and word.....

ge-7
Hello,

Why can't you store the Dzongkha words in the dictionary as whole words, together with the Tsheg marks?

Wordcount
Syl1
Syl1TshegSyl2/flags
Syl3TshegSyl4TshegSyl5/flags
Syl1TshegSyl4/flags
..

This is how all languages that use the Latin character set
store their words (except that they do not store Tshegs,
but checking would work perfectly well with Tshegs too).

Is the Tsheg also used between words in Dzongkha, or is there a space or a different symbol?

-eleonora


Hi,

Dzongkha text flows as a continuum. A Dzongkha word consists of one or more
syllables.
In a multisyllable word, the syllables are separated by the Tibetan
inter-syllabic mark called Tsheg [Unicode: U+0F0B].
The Tsheg is a small dot, typed on the Dzongkha keyboard with the [Space
Bar].

So the basic problem with the Dzongkha spell checker is that the Tsheg
causes hunspell to check a Dzongkha word syllable by syllable.
And if we fill the .dic file with syllables instead of words,
a multitude of invalid words would be accepted.

An example of the problem would be the Latin-borrowed English words
"ad hoc", "alma mater", etc.
If we list "ad", "hoc", "alma", "mater" separately in the .dic file, then
combinations such as "ad alma", "ad mater",
"alma hoc", and so on would also pass.
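
The "ad hoc" problem can be sketched in a few lines of Python. This is only an illustration, not how hunspell works internally: the word list is hypothetical, a plain space stands in for the Tsheg, and both function names are made up for the example.

```python
# Hypothetical word list; a space stands in for the Tsheg here.
WORDS = {"ad hoc", "alma mater"}
# The same entries broken down into individual syllables.
SYLLABLES = {s for w in WORDS for s in w.split(" ")}

def accepted_by_syllables(phrase):
    """Accept a phrase if every syllable is listed on its own."""
    return all(s in SYLLABLES for s in phrase.split(" "))

def accepted_by_words(phrase):
    """Accept a phrase only if it is listed as a whole word."""
    return phrase in WORDS

for phrase in ["ad hoc", "ad alma", "alma hoc"]:
    print(phrase, accepted_by_syllables(phrase), accepted_by_words(phrase))
```

The syllable-level check accepts all three phrases, while the word-level check accepts only "ad hoc".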

I see mentions of the ICU BreakIterator, ZWSP, etc. How do these work?
Are there any links about them?
How should we go about it? Any ideas and suggestions are greatly appreciated.

Thanks in advance
C. Norbu.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: syllable and word.....

Cthar
On Fri, Jun 12, 2009 at 5:52 PM, ge <[hidden email]> wrote:

> Hello,
>
> Why can not you store the Dzonkha words in the dictionary as words together
> with Tsheg marks:
>
> Wordcount
> Syl1
> syl1TshegSyl2/flags
> Syl3TshegSyl4TshegSyl5/flags
> Syl1TshegSyl4/flags
> ..
>
> ?

Thanks. This is what I did, and it doesn't seem to work. I even tried
including WORDCHARS [Tsheg] and BREAK [Tsheg] in the affix file.


> This is how all latin charset using languages
> store their words. (except: They do not store Tshegs,
> but checking would work perfectly also with Tshegs)
>
In our case, instead of a space, we use the Tsheg. It is typed with the
[SPACE BAR] on our keyboard system. So the moment we hit the SPACE BAR,
the first syllable is spell checked (even with the words stored as
above).

>
> Is Tsheg also between words in Dzongha, or there is space or a different
> symbol?

Yes. The Tsheg [visually, a small dot] appears between Dzongkha characters,
syllables, and words.

Does it have something to do with word boundaries in Dzongkha, or are the
.aff and .dic files perhaps incorrect?
How do you see it?


Thanks.
Regards,

C.Norbu


Re: syllable and word.....

Javier SOLA
Hi C. Norbu,

The Tsheg is used at the end of each syllable, and it marks both syllable
end and word end. If it is treated as a word boundary, words are broken
into syllables. If it is not, then you do not know where to break the text,
because there are no spaces. You find yourself in the same situation as
Khmer, Lao, Thai and Burmese, with text that cannot be broken unless
you introduce Zero Width Spaces between words (ZWSP was just
reclassified as a word boundary) or you use dictionary-based line breaking.

The only good solution that I see is to use dictionary-based breaking,
for line breaking and also for the spellchecker, but this takes some work
with ICU and with OpenOffice, as well as very good word lists.

For dictionary-based breaking, the Tsheg must be reclassified as a non-boundary.

Also, ICU works by script, and any change that you make would also apply to
Tibetan script. You would have to check that your dictionary-based
break iterator applies only to Dzongkha.
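
As a rough sketch of what dictionary-based grouping could look like once the Tsheg is no longer treated as a word boundary: the text is split into syllables at each Tsheg, and the syllables are then regrouped into the longest word found in a list. This is plain Python, not ICU code; the word list, the placeholder syllables, and the function name are all assumptions made for the illustration.

```python
TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG

# Hypothetical word list: each entry is a tuple of placeholder syllables.
WORDS = {("syl1",), ("syl1", "syl2"), ("syl3", "syl4", "syl5")}
MAX_LEN = max(len(w) for w in WORDS)

def group_syllables(text):
    """Group tsheg-separated syllables into words, longest match first."""
    syllables = [s for s in text.split(TSHEG) if s]
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if candidate in WORDS or n == 1:
                # Unknown single syllables pass through for the spell checker to flag.
                words.append(TSHEG.join(candidate))
                i += n
                break
    return words

text = TSHEG.join(["syl1", "syl2", "syl3", "syl4", "syl5", "syl1"]) + TSHEG
print(group_syllables(text))
```

Each returned item is a whole word with its internal Tshegs kept, which is the unit that would then be looked up in the .dic file.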

Cheers,

Javier


Re: syllable and word.....

Ruud Baars-2
In reply to this post by Cthar
C. Norbu wrote:

>  In our case, instead of space, we use Tsheg. it is with the keystroke
> [SPACE BAR] in our keyboard system. so, the moment we strike the SPACE BAR,
> the first syllable was spell checked (even after storing the words same as
> above).
>
>  
I had some problems like this, and spent a lot of time trying to find
the cause. In the end, it was the encoding that was causing the trouble. Make
sure the file has the same encoding as stated in the header of the .aff file.
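
A quick way to check this, sketched in Python: read the encoding named on the `SET` line of the .aff file and test whether the dictionary bytes decode cleanly in it. The file contents are inlined, hypothetical samples so the sketch runs without real files, and both helper names are made up for the example.

```python
import codecs
import re

def declared_encoding(aff_bytes):
    """Return the encoding named on the SET line of a .aff file, if any."""
    match = re.search(rb"^SET\s+(\S+)", aff_bytes, flags=re.MULTILINE)
    return match.group(1).decode("ascii") if match else None

def decodes_as(data, encoding):
    """True if the bytes decode cleanly in the given encoding."""
    try:
        codecs.decode(data, encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# Hypothetical file contents, inlined so the sketch runs without real files.
aff = b"SET UTF-8\nWORDCHARS \xe0\xbc\x8b\n"
dic_ok = "1\nsyl1\u0F0Bsyl2\n".encode("utf-8")
dic_bad = "1\nsyl1\u0F0Bsyl2\n".encode("utf-16")

enc = declared_encoding(aff)
print(enc, decodes_as(dic_ok, enc), decodes_as(dic_bad, enc))
```

A clean decode does not prove the encoding is right, but a failed one shows that the file and the `SET` line disagree.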


Re: syllable and word.....

ge-7
In reply to this post by ge-7
Javier Sola wrote:
> The only good solution that I see is to use dictionary-based line
> breaking, and also spellchecker, but this takes some work with ICU and
> with OpenOffice, as well as very good word lists.
>
> For dictionary-based breaking, Tsheg must be reclassified as non-boundary.

We thought long about this in the past for
Thai, and we could not find any solution.

Could you please give a concrete example of what you mean?
You probably mean line breaking = word breaking, right?
But that does not make it clear to me either...

A very good word list is a requirement for quality
spell checking in ANY language, exactly like a very good
affix file.

How does ICU come into this?

I think that when word breaks are the same as syllable
breaks, there is NO solution at all. Unfortunately.

He cannot change the original text, and cannot modify
either the syllable break or the word break.

A machine cannot find out from a syllable list alone which
combinations of syllables are valid and which are not.

-eleonora



Re: syllable and word.....

Javier SOLA
In reply to this post by ge-7
ge wrote:

> Javier Sola wrote:
>  
>> The only good solution that I see is to use dictionary-based line
>> breaking, and also spellchecker, but this takes some work with ICU and
>> with OpenOffice, as well as very good word lists.
>>
>> For dictionary-based breaking, Tsheg must be reclassified as non-boundary.
>>    
>
> We thought in the past long about this in the case of
> Thai, and we could not find any solution.
>  
For Thai it actually works quite well in ICU, but the list of words is
too short. It is nevertheless overridden by code in OpenOffice that
makes Thai line-breaking syllable based :-(

> Could you please give a concrete example what you mean?
> You probably mean line breaking = word breaking, right?
> But that does not clarify either, what you mean for me....
>  
Line breaking and word boundaries are different. For example, you do not
put a line break before a space (otherwise the space would be the first
character of the next line), but you put a word boundary before and
after the space. For example, in "is the mouse red?" the line breaks are "is
|the |mouse |red?" but the word boundaries are "is| |the| |mouse| |red|?", so
that spaces are not sent to the spellchecker attached to the words. Each
character in Unicode has line-breaking properties and word-boundary
properties.
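
The difference between the two kinds of boundary can be shown with a small Python sketch. It uses simple regular expressions as a stand-in for the Unicode line-breaking and word-boundary rules that ICU implements, so it only covers plain space-separated text like the example sentence. The function names are made up for the illustration.

```python
import re

def word_segments(text):
    """Split into runs of word characters, runs of whitespace, and single punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def line_segments(text):
    """Split after each run of whitespace, so spaces stay attached to the preceding word."""
    return re.findall(r"\S+\s*", text)

text = "is the mouse red?"
print("|".join(word_segments(text)))  # is| |the| |mouse| |red|?
print("|".join(line_segments(text)))  # is |the |mouse |red?
```

Only the `\w+` pieces from `word_segments` would be handed to the spellchecker, while `line_segments` gives the units that may be moved to the next line.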
> Very good word list is a requirement for ANY language
> for quality spell checking, exactly like a very good
> affix file.
>  
Spell-checking and line-breaking lists do not need to be identical.
There are words that you might not want to break, but that you spellcheck
separately, or vice versa.
> How comes ICU here?
>  
It does the tokenization (puts the word boundaries in) for openoffice,
as well as the rendering of complex scripts. Some of the library's code
has been integrated in OOo and modified, for specific uses or languages.
> I think, that when word breaks are the same as syllable
> breaks, there is NO solution at all. Unfortunately.
>  
They are different: one word can have one or several syllables. Breaking
by syllables is easy, especially in Dzongkha, where there is a character
that marks the end of a syllable.
> He can not change the original text, and can not modify
> either syllable break or word break.
>
> A machine can not find out from syllables, which combination
> is valid and which is not using just a syllable list.
>  
Actually, yes, because the scripts that have this problem are abugidas
(scripts that originate in Brahmi), and they have orthographic syllables
in recognizable clusters (Thai is the most complex one)... but breaking
into syllables is always a bad solution; words are better.

Cheers,

Javier

Re: syllable and word.....

ge-7
In reply to this post by ge-7
Javier Sola wrote:
>For Thai it actually works quite well in ICU

Nice to hear that. :-)

>> I think, that when word breaks are the same as syllable
>> breaks, there is NO solution at all. Unfortunately.
>>  
>They are different, one word can have one or several syllables. Breaking
>by syllables is easy, specially in Dzongkha, where there is a character
>that is the end-of-syllable character.

So using the ICU libraries made it possible to use
different characters to separate syllables and to
separate words. That is the key point. In my opinion
this must happen at the text entry stage (the user must press
a different key for syllable separation than for word separation);
later there is no longer any chance to differentiate.

In the case of Thai we were told that such a separation was
not possible, since text already exists with the same
break characters, and there was no possibility to
differentiate. That was obviously not the last word.


Thanks, eleonora



Re: syllable and word.....

Javier SOLA
In reply to this post by ge-7
ge wrote:

> So using the ICU libraries made it possible to use
> different characters to separate syllables and to
> separate words. That is the key point. In my opinion
> this must happen at the text entry stage (the user must press
> a different key for syllable separation than for word separation);
> later there is no longer any chance to differentiate.
>  
Nothing for syllable separation (OOo has an algorithm to recognize Thai
syllables). In Khmer we use the Zero Width Space (ZWSP) as a word
separator, but I have been doing some testing with dictionary-based
breaking, and it seems to work. Entering a ZWSP after each word is
unnatural for users. The problem is that dictionary-based breaking is not
supported by (for example) web browsers, and therefore you do need ZWSPs
in the text to have it displayed correctly in the browser.

So we have a small application (also dictionary based) that goes over
an ODF or HTML file and inserts the ZWSPs into the text...
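
The application itself is not shown in this thread. As a hedged sketch of the general idea only, a dictionary-based pass that takes the longest match at each position and joins the resulting words with ZWSP might look like the Python below. The word list is hypothetical, Latin letters stand in for an unspaced script, and `insert_zwsp` is a name invented for the example.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE

# Hypothetical word list; Latin letters stand in for an unspaced script.
WORDS = {"is", "the", "mouse", "red"}
MAX_LEN = max(len(w) for w in WORDS)

def insert_zwsp(text):
    """Insert a ZWSP between dictionary words, taking the longest match at each position."""
    pieces, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in WORDS or n == 1:
                # Unknown characters are emitted one at a time.
                pieces.append(text[i:i + n])
                i += n
                break
    return ZWSP.join(pieces)

print(insert_zwsp("isthemousered").replace(ZWSP, "|"))
```

A real tool would also need to skip markup in the ODF or HTML and handle unknown stretches of text more gracefully than one character at a time.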
> In case of Thai we were told, such a separation were
> not possible, since text exists already with same
> break characters, and there were no possibility to
> differentiate. That was obviously not the last word.
>  
No, the question is whether Thai users want to go to the dictionary-based
model or not, which I think is much better. We will go in that
direction for Khmer, and we already have the code... I just need one
week of holiday to start submitting things, first to ICU and then to OOo.
In Lao and Myanmar they have also tended towards syllable-based breaking,
for lack of anything better.

For Thai, a good word list would be easy to find (I know where to find it).

Cheers,

Javier


Re: syllable and word.....

ge-7
In reply to this post by ge-7
Javier Sola wrote:

>we have a small application (also dictionary based) that goes over
>an ODF or HTML file and includes the ZWSPs in the text...

Well, it is not possible to write that application fully correctly.
You can never know where the word boundary is.

For example:
1. "abetter" means someone who abets someone else in committing a crime.
2. "a better" is the two words "a" and "better".

If a text contains "abetter", which one is meant?
Even when you look at the word's context, the question
can still remain open.
It is not hard to construct lots of such examples...

-eleonora



Re: syllable and word.....

Javier SOLA
In reply to this post by ge-7
This is true. This has always been the criticism of the ICU breaker: it
always chooses the longest match. In this case the two words would be
kept together, and spellchecked together. Going further would require
statistical analysis of the context... and even then it would not
always be perfect.

This can be fixed by hand, by inserting a ZWSP in places where the breaker
does not break but a break is wanted.
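
The "abetter" case can be reproduced with a short Python sketch. It is not the ICU breaker, only a greedy longest-match routine written for the illustration, set next to a routine that lists every segmentation the word list allows. The word list and both function names are assumptions.

```python
# Hypothetical word list containing both readings of the ambiguous string.
WORDS = {"a", "better", "abetter"}

def all_segmentations(text):
    """Return every way of splitting text into words from WORDS."""
    if not text:
        return [[]]
    results = []
    for n in range(1, len(text) + 1):
        head = text[:n]
        if head in WORDS:
            results += [[head] + rest for rest in all_segmentations(text[n:])]
    return results

def longest_match(text):
    """Greedy segmentation: always take the longest word at each position."""
    words, i = [], 0
    while i < len(text):
        n = next((n for n in range(len(text) - i, 0, -1) if text[i:i + n] in WORDS), 1)
        words.append(text[i:i + n])
        i += n
    return words

print(all_segmentations("abetter"))
print(longest_match("abetter"))
```

Both readings are valid segmentations, but the greedy routine only ever returns the single long word, which is why a hand-inserted ZWSP is needed where the two-word reading is meant.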


