spell check dictionaries and ligatures

spell check dictionaries and ligatures

Thomas Lange - Oracle

Hi all,

For a short while now, Hunspell has been able to handle ligatures.
Additionally, I fixed the problem that spell checking treated ligatures
as word-breaking characters (targeted to OOo 3.4 only, though). Thus
everything is now ready for spell checking ligatures.

Looking at the latest English dictionary collection (provided by
László Németh, thanks!), which already supports ligatures, it looks
like the only thing to do is to add a few lines to the affix file.

It should have lines like this:

ICONV ﬀ ff
ICONV ﬁ fi
ICONV ﬂ fl
ICONV ﬃ ffi
ICONV ﬄ ffl
ICONV ﬅ st
ICONV ﬆ st

Although the second-to-last entry looks like an 'ft', it is an 'st'
(a long s followed by a t).
See http://unicode.org/charts/PDF/UFB00.pdf
For the more curious ones:
http://babelstone.blogspot.com/2006/06/rules-for-long-s.html
http://babelstone.blogspot.com/2006/07/long-and-short-of-letter-s.html
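
As a rough illustration of what such a conversion table asks for (this is
not Hunspell's own code), the following Python sketch maps each ligature
to its plain letters before a word would be looked up, and prints the
entries in affix-file form. The names LIGATURES, iconv_like, and
affix_block are made up for this example. Hunspell's affix format expects
the first ICONV line to give the number of entries, so the sketch emits
that count line as well.

```python
# Illustration only: a stand-in for the input conversion that ICONV
# entries request. This is not Hunspell's implementation.
LIGATURES = {
    "\uFB00": "ff",   # LATIN SMALL LIGATURE FF
    "\uFB01": "fi",   # LATIN SMALL LIGATURE FI
    "\uFB02": "fl",   # LATIN SMALL LIGATURE FL
    "\uFB03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\uFB04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\uFB05": "st",   # LATIN SMALL LIGATURE LONG S T
    "\uFB06": "st",   # LATIN SMALL LIGATURE ST
}


def iconv_like(word: str) -> str:
    """Replace each ligature character by its plain letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in word)


def affix_block() -> str:
    """Render the table as affix-file lines, led by the count line."""
    lines = [f"ICONV {len(LIGATURES)}"]
    lines += [f"ICONV {lig} {plain}" for lig, plain in LIGATURES.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(iconv_like("o\uFB03ce"))  # the word "office" typed with U+FB03
    print(affix_block())
```

The output of affix_block() has to be stored in a UTF-8 encoded affix
file, since the ligature characters do not exist in the legacy 8-bit
encodings.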

Note: The CWS with the respective fix is NOT yet integrated, and will
also not be part of OOo 3.3. I am writing this mail now so that it is
not forgotten once the CWS is actually integrated.


Best regards,
Thomas



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: spell check dictionaries and ligatures

Goran Rakic
Hi Thomas,

Thank you for your message. I have a question that may be off-topic or
just uninformed.

What puzzles me is this: should ligature decomposition not be implemented
further down, so that things like search, the thesaurus, or extensions
using the grammar checking API, such as LanguageTool, can work
transparently?

I can see that hyphenation or some typography style checkers would need
an exemption from such decomposition, with an option to work on the raw
text.

Best regards,
Goran


Re: spell check dictionaries and ligatures

Thomas Lange - Oracle
 Hi Goran,

On 25.08.2010 13:11, Goran Rakic wrote:

> Hi Thomas,
>
> Thank you for your message. I have a question that may be off-topic or
> just uninformed.
>
> What puzzles me is this: should ligature decomposition not be implemented
> further down, so that things like search, the thesaurus, or extensions
> using the grammar checking API, such as LanguageTool, can work
> transparently?
>
> I can see that hyphenation or some typography style checkers would need
> an exemption from such decomposition, with an option to work on the raw
> text.

From the user's point of view this is probably true. All I can say here
is that currently we do neither Unicode normalization nor decomposition,
and it is unclear whether we will do so in the future.
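
For readers unfamiliar with the terms: Unicode compatibility
normalization (NFKC or NFKD) is the operation that would turn a ligature
code point into its component letters. The small Python sketch below uses
the standard unicodedata module purely as an illustration; it does not
describe anything OOo does, and the function name is made up.

```python
import unicodedata


def decompose_ligatures(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC).

    Note that NFKC folds many other compatibility characters as well
    (for example the long s), not only the ligatures.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    raw = "\uFB01sh"  # U+FB01 followed by "sh"
    folded = decompose_ligatures(raw)
    print(folded)
    # The folded text is longer than the raw text, so character
    # offsets no longer line up between the two versions.
    print(len(raw), len(folded))
```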


As for some idle talk about the pros and cons:

- As you have already noticed, it might be a good idea to have the 'raw'
text available as well for some cases, but always keeping two versions
of the same text would waste too much memory. Therefore spot solutions
seem to be required, and someone would have to list all the relevant
cases and which version of the string is to be used in each. But even
that solution becomes troublesome if you need to keep the raw data while
working on the decomposed text, and later on maybe even have to match
modified decomposed text (or parts of it) back to the raw text. This is
likely to mean a lot of offset trouble and may have a negative
performance impact if something like that needs to be done on a regular
basis.

- According to unicode.org, ligatures should probably not have been added
to Unicode at all; they were added because they were present in quite a
number of the 'old' character set tables.

- Many fonts do not even support ligature characters (or only a few of
them), probably because font rendering is fairly advanced by now and
thus there is no real need for them anymore just to get a nice layout.

- The usual competing 'reference product' does not seem to support
them at all (aside from being able to display them). They are not
handled in uppercase, title case, or sentence case conversion. Any
word containing a ligature is reported as wrong by the spell checker
(aside from a standalone 'ﬀ'), and search does not work there as you
would like it to either.
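
The offset bookkeeping mentioned in the first point can be sketched as
follows. This is a hypothetical helper, not OOo code: it decomposes the
text character by character with NFKC and records, for every character
of the decomposed text, the index of the raw character it came from, so
that a match found in the decomposed text can be mapped back to a span
of the raw text.

```python
import unicodedata


def decompose_with_offsets(raw: str):
    """Return (decomposed_text, origin), where origin[i] is the index in
    `raw` of the character that produced decomposed_text[i].

    Simplification: each character is normalized on its own, which
    ignores combining sequences that span several characters.
    """
    pieces = []
    origin = []
    for i, ch in enumerate(raw):
        piece = unicodedata.normalize("NFKC", ch)
        pieces.append(piece)
        origin.extend([i] * len(piece))
    return "".join(pieces), origin


def raw_span(origin, start, end):
    """Map the non-empty half-open span [start, end) of the decomposed
    text to the half-open span of the raw text that covers it."""
    return origin[start], origin[end - 1] + 1


if __name__ == "__main__":
    raw = "o\uFB03ce"  # the word "office" typed with U+FB03
    text, origin = decompose_with_offsets(raw)
    start = text.find("fic")
    # The match starts inside the ligature, so the raw span has to
    # cover the whole ligature character.
    print(raw_span(origin, start, start + len("fic")))
```

Even this small version needs one extra integer per character, which
hints at the memory and performance cost described above.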

Thus, on the bright side, in OOo we now at least have:
- working case conversion with ligatures
- working spell checking with ligatures (i.e. once all the required
affix files are changed as listed)
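
What working case conversion means for a ligature can be seen in any
environment that applies the full Unicode case mappings. Python's
built-in str.upper() is used here only as a convenient stand-in, not as
a description of how OOo implements it; the function name is made up.

```python
def upper_with_ligatures(text: str) -> str:
    """Uppercase using Python's built-in full Unicode case mapping,
    which expands a ligature into separate capital letters."""
    return text.upper()


if __name__ == "__main__":
    word = "\uFB02ag"  # U+FB02 followed by "ag"
    print(upper_with_ligatures(word))
```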


Thomas



Re: spell check dictionaries and ligatures

Németh László
In reply to this post by Thomas Lange - Oracle
Hi Thomas and All,

Important note: for Hunspell's ICONV parameters to support the Unicode
ligature characters, the spelling dictionaries have to be converted
to UTF-8 encoding, too.
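
A minimal sketch of such a conversion in Python, assuming the dictionary
is currently stored in ISO-8859-1 (the function name and the source
encoding are assumptions for illustration). It re-encodes the text and
rewrites the SET line, the affix file's encoding declaration, so that it
names UTF-8.

```python
import re


def to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode the contents of a dictionary or affix file as UTF-8.

    The first line of the form "SET <encoding>" is rewritten to
    "SET UTF-8"; everything else is kept unchanged.
    """
    text = data.decode(source_encoding)
    text = re.sub(r"^SET[ \t]+\S+", "SET UTF-8", text,
                  count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = "SET ISO8859-1\nTRY \u00e9\n".encode("iso-8859-1")
    print(to_utf8(sample).decode("utf-8"))
```

The same function can be applied to the .dic file, which has no SET line
and is therefore only re-encoded.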

OpenOffice.org has great support for Unicode ligatures: searching, also
in the PDF output, works well (e.g. the query string "fi" will match
the U+FB01 Unicode ligature "ﬁ"). I detected the word boundary
problem with Unicode ligatures only in initial positions
(http://www.openoffice.org/issues/show_bug.cgi?id=56348). Thanks for
your fix and your kind words!

By the way, OpenOffice.org on the Mac OS X platform has automatic
ligature support. I am working on a font-based method for other
platforms using Graphite font technology. Some positive feedback, with
links:
http://user.services.openoffice.org/en/forum/viewtopic.php?f=49&t=32910.
There are some crucial fixes related to Graphite and hyphenation (e.g.
between ligatures) in the upcoming OOo 3.3, and I believe
OpenOffice.org is close to meeting the minimal DTP requirements. The
most important tasks in font/text handling are to add a UI for the smart
font features (for Graphite, see the Graphite Font Extension,
http://www.thanlwinsoft.org/GraphiteOOoExt/) and to extend the PDF
export to support text search/copy with Graphite (also OpenType under
Mac OS X) ligatures, too.

Best regards,
László

