Minimal number of characters for hyphenation

Bugzilla from hatapitk@iki.fi
Hi!

In Options - Language Settings - Writing Aids there is an option, "Minimal
number of characters for hyphenation", which defaults to 5. The effect is that
a word with 4 characters or fewer is never hyphenated automatically. So far,
so good. But what exactly should count as a "character" here?

In some languages, at least in Finnish, compound words are hyphenated by
parts. Take the compound word "valokuva", which is composed of "valo"
and "kuva": the preferred position to split the word is between the
parts, "valo-kuva". It is also possible to hyphenate the individual parts.
Since "valokuva" has 8 characters and can therefore be hyphenated, the full
set of hyphenation points for this word is "va-lo-ku-va". I call this
Option 1.

When "Minimal number of characters for hyphenation" is 5, the words "valo"
and "kuva" are not hyphenated when they occur alone. One could then argue
that they should not be hyphenated inside compound words either, so that the
hyphenator would split "valokuva" only as "valo-kuva". This is Option 2.
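
To make the difference between the two options concrete, here is a minimal
Python sketch. It is only an illustration, not code from Voikko or OOo: the
function name, the per_part flag and the input format (compound parts given
with their internal breaks already marked by "-") are all assumptions made
for this example. It also assumes that the whole-word length check still
applies under Option 2.

```python
def hyphenate_compound(parts, min_chars=5, per_part=False):
    """Return the word with its allowed hyphenation points marked by '-'.

    parts     -- hypothetical input: compound parts with internal breaks
                 already marked, e.g. ["va-lo", "ku-va"] for "valokuva"
    min_chars -- the "Minimal number of characters for hyphenation" setting
    per_part  -- False: Option 1 (only the whole word's length is checked)
                 True:  Option 2 (each part's length is checked as well)
    """
    plain_parts = [p.replace("-", "") for p in parts]

    # Both options: a word shorter than the minimum is never hyphenated.
    if sum(len(p) for p in plain_parts) < min_chars:
        return "".join(plain_parts)

    result = []
    for marked, plain in zip(parts, plain_parts):
        if per_part and len(plain) < min_chars:
            # Option 2: a part that would not be hyphenated on its own
            # keeps no internal hyphenation points inside the compound.
            result.append(plain)
        else:
            result.append(marked)

    # The boundary between compound parts is always a hyphenation point.
    return "-".join(result)


if __name__ == "__main__":
    print(hyphenate_compound(["va-lo", "ku-va"]))                 # Option 1
    print(hyphenate_compound(["va-lo", "ku-va"], per_part=True))  # Option 2
    print(hyphenate_compound(["va-lo"]))                          # "valo" alone
```

With the default minimum of 5, the first call gives "va-lo-ku-va", the second
gives "valo-kuva", and "valo" on its own is left unhyphenated in both models.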

The Finnish spellchecker extension Voikko currently uses Option 1. We have
also implemented Option 2, which can be activated by adding one line of code
before building the extension. The OpenOffice.org built-in hyphenator uses
Option 1. As I understand it, it currently cannot do anything else, because
Option 2 requires morphological analysis of the words before hyphenation and
none is performed. But hunspell does support morphological analysis, and [1]
suggests that it might in the future be used in OOo to improve hyphenation
quality.

Therefore I would like to know: if Option 2 becomes technically possible for
OOo's built-in hyphenator, would it make sense to use it instead of the
current behaviour? Or should there perhaps be a separate option that lets
users choose, defaulting to the current model (Option 1)? I have no strong
opinion about which behaviour is actually better. I do not have MS Word, but
I have been told that it does something close to, but not quite the same as,
Option 2 for Finnish compound words. I do think, however, that it makes sense
for all backends (OOo built-in, Voikko, proprietary extensions) to interpret
the options in the same way, which is why I am bringing this up for
discussion.

Harri

[1] http://hunspell.sourceforge.net/tb87nemeth.pdf
