Re: Hyphenation for Dutch

Németh László-2
Dear Ruud and all,

2008/3/4, Ruud Baars <[hidden email]>:
> László, could you help with the following questions:
>  1) What is the best moment to take into account that a word cannot
>  hyphenate leaving 1 char alone at start or end? This is (in Dutch) even
>  true in compounds, where one char of the parts cannot be left alone!
>  So far, I have taken this into account while making the TeX patterns using
>  patgen.

For now, the best method for languages with open compounding is to
collect millions of real compound words (e.g. from web pages), analyze
them with Hunspell and its upcoming -m (analyze) option, and build a
huge hyphenated dictionary for pattern generation.
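
As a rough illustration of the dictionary-building step, here is a minimal
Python sketch. It assumes the compound analyses are already available as
lists of parts; the Hunspell output format is not shown in this message,
so the `analyses` table, the `part_hyphenation` table and the helper name
are all hypothetical toy data. The output uses "-" between syllables, one
word per line, as patgen expects in its hyphenated dictionary.

```python
def hyphenated_entry(parts, part_hyphenation):
    """Join compound parts with '-', hyphenating each part from a lookup table.

    Parts missing from the table are kept unhyphenated.
    """
    return "-".join(part_hyphenation.get(part, part) for part in parts)


# Hypothetical analyzer output: compound word -> list of parts.
analyses = {
    "koningshuis": ["konings", "huis"],
    "tichand": ["tic", "hand"],
}

# Hypothetical hyphenation of the individual (non-compound) parts.
part_hyphenation = {"konings": "ko-nings"}

entries = sorted(hyphenated_entry(parts, part_hyphenation)
                 for parts in analyses.values())

for entry in entries:
    print(entry)
```

Every compound boundary becomes a break point in the learning data, so the
generated patterns see the boundary explicitly instead of having to infer
it from letter context alone.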

>  2) There is often a hyphenation conflict in the compounding. E.g.: a ch is
>  never split in Dutch (it is like a g), unless in (rare) compounds like
>  tic+hand. Patgen treats this by creating rules and exceptions. This
>  generates a rather large pattern file. Did anyone ever try using full
>  (uncompounded) words as (perfect) patterns? Would that be feasible (it
>  is easier to maintain, at least ...)
>  I also have a (very slow) PHP program that generates only perfect
>  patterns, without exceptions. Is that a path that might be feasible?

Using full words as hyphenation patterns generates too much data after
conversion (for example, half a million patterns instead of 100 thousand
in a real example). I have also written a perfect pattern generator in
Perl that solves this problem with size optimization. Full words are
needed only for the learning and test corpora.

>  3) What is the way to explicitly code compound boundaries ? I saw
>  something like .. ? How does the (un)compounding work in hyphenation ?

Decomposition is supported in hyphenation through the learning data, so
the resulting patterns will hyphenate only this data perfectly. I plan to
use Hunspell for decomposition, but it is also not perfect for all
possible compounds. I will test the following lightweight "compound
hyphenation level" patch. Hyphenation dictionary development will
consist of two phases: generating the compound and the non-compound
patterns, and then integrating these patterns. Some of the
hyphenation levels hyphenate only at compound boundaries ("compound
hyphenation levels"), for example level 5 and level 7:


The hyphenator will break words at compound break points and
rehyphenate the parts. For example, ti3c5hand is hyphenated as
hyphenate(tic) and hyphenate(hand), so the bad break point (ti-chand)
is eliminated. Advantages of this method are better compound
decomposition, an optional minimum hyphenation distance from compound
breaks (it might be a hyphenation option, too), and maybe limited
perfect pattern generation (only for the compound breaks).
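
The split-and-rehyphenate idea can be sketched in Python as follows. This
is only an illustration, not the planned patch: the toy patterns are
invented, and treating every odd level of 5 or more as a compound break
is an assumption based on the "level 5 and level 7" remark above. The
pattern matching itself is the usual Liang scheme (highest level wins,
odd levels allow a break).

```python
import re

COMPOUND_LEVEL = 5  # assumption: odd levels >= 5 mark compound boundaries


def parse_pattern(pattern):
    """Split a pattern such as 'c5hand' into its letters and gap levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # level of the gap before letter `pos`
        else:
            pos += 1
    return letters, levels


def gap_levels(word, patterns):
    """Return a list where item j is the level of the gap before word[j]."""
    text = "." + word + "."
    levels = [0] * (len(text) + 1)
    for pattern in patterns:
        letters, pat_levels = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for k, level in enumerate(pat_levels):
                levels[start + k] = max(levels[start + k], level)
            start = text.find(letters, start + 1)
    return levels[1:len(word) + 1]


def split_at(word, positions):
    """Cut word at the given (sorted) character positions."""
    parts, prev = [], 0
    for pos in positions:
        parts.append(word[prev:pos])
        prev = pos
    parts.append(word[prev:])
    return parts


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Plain pattern hyphenation: break at every odd level within the margins."""
    levels = gap_levels(word, patterns)
    breaks = [j for j in range(left_min, len(word) - right_min + 1)
              if levels[j] % 2 == 1]
    return split_at(word, breaks)


def hyphenate_compound(word, patterns, left_min=2, right_min=2):
    """Split at compound-level breaks first, then hyphenate each part alone."""
    levels = gap_levels(word, patterns)
    boundaries = [j for j in range(1, len(word))
                  if levels[j] >= COMPOUND_LEVEL and levels[j] % 2 == 1]
    pieces = []
    for part in split_at(word, boundaries):
        pieces.extend(hyphenate(part, patterns, left_min, right_min))
    return pieces


# Invented toy patterns: a break before "ch", two compound breaks, one syllable break.
PATTERNS = ["3ch", "c5hand", "s5huis", "o1ni"]

print("-".join(hyphenate("tichand", PATTERNS)))
print("-".join(hyphenate_compound("tichand", PATTERNS)))
print("-".join(hyphenate_compound("koningshuis", PATTERNS)))
```

Because the left and right minimums are applied to each part separately,
no single character is left alone next to a compound boundary either,
which also covers the constraint from question 1.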

>  4) More Hunspell-like: does the uncompounding also support additional
>  characters? In Dutch (and German) koningshuis uncompounds to
>  koning+s+huis. (konings is not a word) Can the uncompounding support this?
>  Does uncompounding in some way relate to compounding rules in Hunspell?

I think that, with the suggested compound hyphenation level feature, the
hyphenator will handle this linking morpheme better, because the
hyphenation will be based more on compound breaks. In your example,
you will be able to use the konings|huis decomposition for the compound
hyphenation level (if you need to, also adding the non-word "konings" to
your hyphenation dictionary).

>  You see, I am trying to get a picture of the entire process, to make the
>  hyphenation as perfect as it can be. I think it is better not to hyphenate
>  than to hyphenate wrongly.

I believe the aim of hyphenation is perfect typesetting, not perfect
orthography, so it is better to hyphenate (especially long compound
words) than not. Fortunately, the pattern-based hyphenation of
TeX supports these and other extraordinary (for
example, mistyped) cases.

>  Compounding is an important issue in this (valk-uil and val-kuil are both
>  valid compounds, and there is no way to decide which is correct without
>  doing content analysis.)

Ambiguous compound hyphenation (valk|uil, val|kuil or the Hungarian
leg|elő-re, le-ge-lő-re) can be forbidden on a compound hyphenation
level, for example:
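
As a hedged sketch of how such ambiguous compounds could be found and
blocked, the Python snippet below looks for words with more than one
two-part decomposition and writes a pattern with an even level at every
competing boundary. The tiny lexicon, the function names and the choice
of level 6 (just above an assumed compound level of 5) are illustrative
assumptions, not the actual format of the patch.

```python
def decompositions(word, lexicon):
    """Return every split position where both halves are known words."""
    return [i for i in range(1, len(word))
            if word[:i] in lexicon and word[i:] in lexicon]


def forbidding_pattern(word, positions, level=6):
    """Build a whole-word pattern that puts an even level at each position.

    In Liang-style patterns an even level inhibits a break and overrides
    any lower odd level at the same position.
    """
    out = []
    for i, ch in enumerate(word):
        if i in positions:
            out.append(str(level))
        out.append(ch)
    return "." + "".join(out) + "."


# Toy lexicon for the Dutch example from this thread.
lexicon = {"valk", "uil", "val", "kuil"}

positions = decompositions("valkuil", lexicon)
if len(positions) > 1:
    # More than one analysis: forbid the compound break at all candidates.
    print(forbidding_pattern("valkuil", positions))
```

For valkuil both boundaries are blocked, so the word is left without a
compound break rather than risking the wrong one.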


>  Hope you can help me sketch the big picture.
>  Ruud

I hope so, too. :)
I have also posted this letter to lang-dev.

Best regards,