Spell check dictionary update

Robert Ludvik
Hi all
In March 2006, Robert Vojta from the Czech team sent some mails about
upgrading the spell check dictionaries (I found just this one in the archives:
http://lingucomponent.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=1824)
I found some time, asked Dan from the Czech team to send me what Robert left
behind, put it all together and talked about this with Daniel and Simon
in Barcelona.
In just a few words: people can send words that are not yet in the spell
check dictionary through a web form or with the help of a macro, which is
for now only available for OOo but could be ported to MSO, KOffice(?).
Relevant people (linguists) would then review the submitted words and accept
them for inclusion in the dictionaries or reject them.
The dictionaries are in a form that can be used for Mozilla and KOffice
products as well.
I'd like to open a discussion about this. If you are interested, you
can read some more at http://r.aufbix.org/spell/, especially a *draft*
of a proposal for how this could be done
(http://r.aufbix.org/spell/spell-workflow.pdf or
http://r.aufbix.org/spell/spell-workflow.odg, if you prefer)

Regards
Robert Ludvik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Spell check dictionary update

Marcin Miłkowski
Hi Robert,

Robert Ludvik wrote:

> In just a few words: people can send words that are not yet in the spell
> check dictionary through a web form or with the help of a macro, which is
> for now only available for OOo but could be ported to MSO, KOffice(?).
> Relevant people (linguists) would then review the submitted words and accept
> them for inclusion in the dictionaries or reject them.
> The dictionaries are in a form that can be used for Mozilla and KOffice
> products as well.


That's a nice idea but some projects don't need it - for Polish, we have
a collaboration website (www.kurnik.pl/dictionary) where missing words
are being reported every day. So what I'd propose is a dual system that
includes something more: a bug-tracking system for dictionaries that
have no such system yet, and a link to the existing system in cases where
there is one.

Regards,
Marcin

Re: Spell check dictionary update

Harri Pitkänen
In reply to this post by Robert Ludvik
Hi!

On Sunday 30 September 2007, Robert Ludvik wrote:

> ...
> In just a few words: people can send words that are not yet in the spell
> check dictionary through a web form or with the help of a macro, which is
> for now only available for OOo but could be ported to MSO, KOffice(?).
> Relevant people (linguists) would then review the submitted words and accept
> them for inclusion in the dictionaries or reject them.
> The dictionaries are in a form that can be used for Mozilla and KOffice
> products as well.
> I'd like to open a discussion about this. If you are interested, you
> can read some more at http://r.aufbix.org/spell/, especially a *draft*
> of proposal how this could be done
> (http://r.aufbix.org/spell/spell-workflow.pdf or
> http://r.aufbix.org/spell/spell-workflow.odg, if you prefer)

I can offer some comments, because our development workflow for the Finnish
spell checker shares some features with your draft and has been in use for
about a year now.

- We do not have an OOo macro for sending suggestions, but I think it is a
great idea. We do have a web form [1] though. The form consists of a field to
enter the word, a drop-down box for selecting the type of the word ("general
vocabulary", "computing vocabulary", "medical vocabulary", ... , "foreign
words", "dialects", "words that should be removed from current vocabulary")
and a free-form text box for explaining the word if it needs an explanation.
The form has not been very popular: on average we get about one word per day
through it. It could be that we should have advertised it more.

Previously we had a form that only contained a field to enter the word and a
drop-down box for word class. That one was initially perhaps too popular: it
was occasionally misused by spamming it with useless strings. We have never
collected any personal information through these forms. We only track the
user's IP address to limit incoming suggestions to 20 words/IP/day to prevent
misuse. But some smart person worked around that limitation by using Tor to
access the form... So I recommend building the system so that the database
can be easily cleaned up if something like this happens.
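
To make the two points above concrete, here is a minimal sketch of a per-IP
daily cap combined with a bulk clean-up hook. It is not the code of our
application: the class name, method names and in-memory storage are
assumptions made up for illustration, and a real system would use a database.

```python
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 20  # per-IP daily cap, as described in the message


class SuggestionStore:
    """Hypothetical in-memory store; names are illustrative only."""

    def __init__(self, daily_limit=DAILY_LIMIT):
        self.daily_limit = daily_limit
        self.rows = []                  # list of (ip, day, word)
        self.counts = defaultdict(int)  # (ip, day) -> accepted submissions

    def submit(self, ip, word, day=None):
        """Store a suggestion unless this IP already hit today's cap."""
        day = day or date.today()
        key = (ip, day)
        if self.counts[key] >= self.daily_limit:
            return False
        self.counts[key] += 1
        self.rows.append((ip, day, word))
        return True

    def purge(self, predicate):
        """Remove every stored row matching predicate; return how many."""
        before = len(self.rows)
        self.rows = [row for row in self.rows if not predicate(row)]
        return before - len(self.rows)
```

Keeping the source IP and date with every row is what makes the clean-up
cheap: a spam run can be dropped with one call such as
store.purge(lambda row: row[0] == bad_ip). As the Tor incident shows, the cap
alone does not stop a determined abuser, so the clean-up path matters more.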

It should be noted that Finland has a population of only 5 million people. And
the majority of Finnish OOo users (especially on Windows) are still using a
non-free spell checker (released around 2002) for which our word suggestion
form is useless. Therefore most language teams could probably expect this
type of form to be more popular than what we have experienced.

- The review system we use is a lot simpler than the one in your draft. We
only have one compulsory review step for the suggested words, where a
registered user of the system either rejects the suggestion or moves it to
the master database, and populates the new record with necessary meta
information (inflection class etc.) However, the system maintains a change
log [2] of all changes made to the master database. Our project has three
active contributors, and we more or less regularly check each other's changes
from the log. So in practice there is an extra round of reviews, although it
is not enforced by the software.
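
As a rough illustration of this single review step plus change log, here is a
minimal sketch. It is not taken from our application; the class, its fields
and the shape of the log entries are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WordDatabase:
    """Hypothetical model: pending suggestions, master records, change log."""
    suggestions: list = field(default_factory=list)
    master: dict = field(default_factory=dict)     # word -> meta information
    changelog: list = field(default_factory=list)  # (reviewer, action, word)

    def suggest(self, word):
        """Queue a word for review."""
        self.suggestions.append(word)

    def review(self, reviewer, word, accept, inflection_class=None):
        """The one compulsory step: reject, or move to the master database."""
        self.suggestions.remove(word)  # raises ValueError if not pending
        if accept:
            self.master[word] = {"inflection_class": inflection_class}
            # Only changes to the master database are logged, so that other
            # contributors can check them afterwards.
            self.changelog.append((reviewer, "add", word))
```

The second, informal review round then amounts to other contributors reading
the changelog entries and correcting master records where needed.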

I think that for a small team like ours this simplified review works just
fine. We do not have any professional linguists in this project anyway. I
suppose this is the case for many other languages too. So if possible, it
would be nice to be able to merge the non-linguist and linguist reviews in
case some teams cannot afford to have both.

- The role of the technician at the end of the process is more or less the
same in our process and your draft. The only problem we have is that our
spell checker implementation does not allow merging dictionaries at runtime.
This is why there is currently no easy way for the users to add medical etc.
dictionaries, which in turn discourages people from contributing to them.
This is a technical problem that we must solve later. I believe that Hunspell
does not have this problem.


Of course the code of our web application is available to any teams who wish
to use it, since it is under the GPL. The core code has been designed to be
language independent and the application itself can be localised using po
files. But it does have a major limitation in that the same database cannot
be used simultaneously for multiple languages, and most of the technical
documentation has not been written yet. It is also written in Python, not
PHP, and it cannot (yet) export to the Hunspell format. So I think that
your proposed workflow, macros and PHP scripts will offer a better initial
design for solving the dictionary update and maintenance problem for many
languages.

Harri

[1] http://joukahainen.lokalisointi.org/ehdotasanoja
[2] http://joukahainen.lokalisointi.org/query/listchanges

Re: Spell check dictionary update

Nicolas Mailhot

On Sun, 30 September 2007 20:34, Harri Pitkänen wrote:

> - We do not have an OOo macro for sending suggestions, but I think it
> is a great idea.

IMHO this is a terrible idea. A web form can be integrated in the
translation/i18n/l10n web hubs big distributions like Ubuntu and
Fedora are building, an OO.o macro is app-specific and reduces
contributors to heavy OO.o users.

Remember that Hunspell is integrated in Firefox 3 and is likely to be used
by many different apps in the coming years.

--
Nicolas Mailhot


Re: Spell check dictionary update

Marcin Miłkowski
Nicolas Mailhot wrote:
> On Sun, 30 September 2007 20:34, Harri Pitkänen wrote:
>
>> - We do not have an OOo macro for sending suggestions, but I think it
>> is a great idea.
>
> IMHO this is a terrible idea. A web form can be integrated in the
> translation/i18n/l10n web hubs big distributions like Ubuntu and
> Fedora are building, an OO.o macro is app-specific and reduces
> contributors to heavy OO.o users.

Here I can second your opinion for a change. The Polish spelling
dictionary is based on the efforts of the word-gaming community
(scrabble and similar word games). It has absolutely nothing to do with
specific apps reusing the data.

My advice is: try to use the dictionaries directly for online word
games, like crosswords or scrabble-like multiplayer games. This way
you'll have lots of users wanting to confirm that specific words do
exist (or not!) in the language. This means a lot of user feedback =
good quality. (Take a look at www.kurnik.pl/slownik - thousands of
people commenting on ispell dictionary entries because they couldn't win
in scrabble). I can talk to the admin of the kurnik website so that he
could share the code or set up operator accounts for specific languages.

Any online solution seems more appropriate for that - and remember, most
languages need inflections, so raw word lists are pretty much useless
without adding hunspell flags. And we can gather raw word lists using
advanced linguistic software for extracting corpora from the web - such
tools are way more advanced than any OOo macro I could think of.

Regards,
Marcin

Re: Spell check dictionary update

Mathias Bauer
In reply to this post by Nicolas Mailhot
Nicolas Mailhot wrote:

> On Sun, 30 September 2007 20:34, Harri Pitkänen wrote:
>
>> - We do not have an OOo macro for sending suggestions, but I think it
>> is a great idea.
>
> IMHO this is a terrible idea. A web form can be integrated in the
> translation/i18n/l10n web hubs big distributions like Ubuntu and
> Fedora are building, an OO.o macro is app-specific and reduces
> contributors to heavy OO.o users.

Don't exaggerate. If that idea were "terrible", you wouldn't leave enough
room on your scale for things that are *really* terrible. ;-)

Besides that you are right, we should have an OOo-independent way to
provide suggestions.

That doesn't rule out creating an OOo extension that links directly to
the web form, ideally one that automatically carries the selected word
over into the form.
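
One way such an extension could hand the selected word over is by opening the
form with the word in the query string. The sketch below only shows the URL
construction; the form address and the parameter names "word" and "lang" are
made up for illustration, and a real form would define its own.

```python
from urllib.parse import urlencode


def suggestion_url(form_url, word, lang):
    """Build a link that opens the web form with the word pre-filled.

    The query parameter names are hypothetical, not those of any
    existing suggestion form.
    """
    return form_url + "?" + urlencode({"word": word, "lang": lang})


# Example with a placeholder address:
link = suggestion_url("https://example.org/suggest", "Pitkänen", "fi")
```

Because the link is plain HTTP, the same form stays usable from a browser or
from any other application, which keeps the web form as the independent entry
point and the extension as a mere convenience.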

Ciao,
Mathias

--
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to "[hidden email]".
I use it for the OOo lists and only rarely read other mails sent to it.
