Update on WWW based word list editing

Harri Pitkänen
In May [1] I wrote about the status of Finnish spellchecking and about plans
for creating a WWW application for editing spellchecking dictionaries. The
status of Finnish spellchecking is currently quite good, and our work is now
concentrated on the WWW application for maintaining and further developing
the vocabulary.

A test version of the WWW application "Joukahainen" is now available at
http://joukahainen.lokalisointi.org/
The current version more or less works, but it is still unfinished and
unfortunately only available in Finnish, and most of the interesting features
require a user account. I therefore do not expect the test installation to be
of much use to most readers of this list. However, I have now reached the
point where it would be interesting to internationalise the application so
that it is usable for other languages as well. I expect to be able to do this
by the end of this month. This means that I will translate the application
into English and replace the Finnish data with the contents of a current
English hunspell dictionary file. If any language teams are already
interested in using this application for maintaining their dictionaries, I
can provide instructions on what to do and where to start at any time; just
ask.

The current feature set includes a word editor that can be used to add string
attributes, flags and alternative spellings to any word. All edit actions
are logged, and comments can be added in much the same way as in a typical
Bugzilla installation (the process is just a little less complicated). Words
can be added either manually or by first storing a list of candidate words in
the database, which is then used to pre-fill the word entry form (I have
received some test material from the language-recognising web crawler by
Kevin Scannell, and this feature is built around that kind of data). Once the
data is in the database, creating a spellchecking dictionary only requires
writing an exporter for the particular spellchecker. For hunspell this will
be quite easy.
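
To give a rough idea of what such an exporter involves, here is a minimal
sketch in Python. It is not taken from the Joukahainen source: the function
name and the in-memory sample data are hypothetical, and it assumes
single-character affix flags. It only relies on the basic layout of a
Hunspell .dic file, where the first line holds the entry count and each
following line is a word, optionally followed by "/" and its flags.

```python
from typing import Dict, Iterable, List


def export_hunspell_dic(words: Dict[str, Iterable[str]]) -> str:
    """Render a word -> affix-flag mapping as the text of a Hunspell .dic file."""
    lines: List[str] = []
    for word in sorted(words):
        # Assumption: every flag is a single character, so flags can be concatenated.
        flags = "".join(sorted(words[word]))
        # An entry with flags is written as "word/FLAGS"; without flags, just the word.
        lines.append(f"{word}/{flags}" if flags else word)
    # The first line of a .dic file holds the number of entries.
    return "\n".join([str(len(lines))] + lines) + "\n"


# Hypothetical rows, standing in for data read from the word database.
sample = {"dog": {"M", "S"}, "sheep": {"M"}, "hello": set()}
dic_text = export_hunspell_dic(sample)
print(dic_text, end="")
```

A real exporter would read the words and flags from the database instead of
a dictionary literal, and would also need a matching .aff file.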

For languages that only need simple affix flags associated with each word,
this system may seem overly complex. I will try to make it easy to use in
such cases as well. The main benefit of the system is that it allows
distributed editing of the word list in a way that records the editing
history and allows anyone to review the changes.

I will write more later, once things are closer to ready and I have something
working to show in English as well.

Harri

[1] http://lingucomponent.openoffice.org/servlets/ReadMsg?list=dev&msgNo=1806

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Update on WWW based word list editing

Mikalai Karelin
Dear Harri,

I've seen your message on the lingu-dev mailing list concerning the
application for working with spellcheck dictionaries ("Joukahainen").

I'm interested in such an application, and would like to ask whether
you have it translated into English (unfortunately, I know no Finnish).

You planned to prepare an English version by the end of August, so
its release date is probably close.

--
Regards,
Mikalai                          mailto:[hidden email]



Re: Update on WWW based word list editing

Harri Pitkänen
On Wednesday 06 September 2006 14:13, Mikalai Karelin wrote:

> Dear Harri,
>
> I've seen your message on the lingu-dev mailing list concerning the
> application for working with spellcheck dictionaries ("Joukahainen").
>
> I'm interested in such an application, and would like to ask whether
> you have it translated into English (unfortunately, I know no Finnish).
>
> You planned to prepare an English version by the end of August, so
> its release date is probably close.

I have now finished separating the Finnish-specific and language-independent
parts of the code, and all messages in the source code are in English (they
can in turn be translated into other languages using gettext). The English
language pack containing the page templates and information about affix flags
is not yet done, but I will try to do that today or tomorrow.
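
For readers unfamiliar with gettext, the general pattern looks roughly like
the sketch below, which uses Python's standard gettext module. The domain
name "joukahainen", the "locale" directory and the sample message are
assumptions made for this illustration, not details of the actual code.

```python
import gettext

# Assumed domain name and catalogue directory; with fallback=True a missing
# catalogue yields NullTranslations, which returns the English text unchanged.
translation = gettext.translation(
    "joukahainen", localedir="locale", languages=["fi"], fallback=True
)
_ = translation.gettext

# Messages stay in English in the source and are looked up at run time.
message = _("Word not found")
print(message)
```

Because the English string doubles as the lookup key, the application keeps
working in English when no translation catalogue has been installed.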

Meanwhile, the installation instructions for Joukahainen are available at
http://svn.sourceforge.net/viewvc/hunspell-fi/trunk/joukahainen/doc/INSTALL?revision=409&view=markup
You will need those, as I am unfortunately not able to offer server space for
other languages. Step 5 is the one I will be doing for English to demonstrate
how it should be done.

Sorry that this is taking a bit longer than I had planned. There has been a
lot going on with Finnish spell checker development, which has kept me busy
for most of the last few weeks.

Harri

Re: Update on WWW based word list editing

Harri Pitkänen
On Wednesday 06 September 2006 15:18, Harri Pitkänen wrote:
> The English
> language pack containing the page templates and information about affix
> flags is not yet done, but I will try to do that today or tomorrow.

This is now done. The updated installation instructions are available at
http://svn.sourceforge.net/viewvc/hunspell-fi/trunk/joukahainen/doc/INSTALL?view=markup
and the English language pack at
http://svn.sourceforge.net/viewvc/hunspell-fi/trunk/joukahainen/langpacks/en_US/

What is missing at the moment is an importer and exporter for the Hunspell
dictionary format. These would be easy to write if the word flag attributes
in Joukahainen correspond directly to affix flags in Hunspell, which is how
the en_US language pack above is designed. But I think that, from the point
of view of a user of the word list editor, this is not the optimal choice.

For example, the noun "dog" in the en_US dictionary has the following flags
in Joukahainen if the current language pack is used:
[Hunspell-M] suffix -'s
[Hunspell-S] plural
But since most nouns should have these two flags, it might be better to
assume that they are set by default for all nouns. For the nouns that do not
need them, there could be flags like
[Hunspell-no-M] suffix -'s not allowed
[Hunspell-no-S] no plural
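
With default-on flags, an exporter would have to turn the negative editor
flags back into the positive Hunspell flags. A minimal sketch of that
mapping follows; the function and constant names are hypothetical and
nothing like this exists in the language pack yet.

```python
# Assumed defaults: every noun gets M (suffix -'s) and S (plural) unless told otherwise.
DEFAULT_NOUN_FLAGS = {"M", "S"}

# Hypothetical editor flags and the Hunspell flag each one switches off.
NEGATIVE_FLAGS = {"Hunspell-no-M": "M", "Hunspell-no-S": "S"}


def hunspell_flags_for_noun(editor_flags):
    """Return the Hunspell flag string for a noun, given its editor flags."""
    flags = set(DEFAULT_NOUN_FLAGS)
    for name in editor_flags:
        if name in NEGATIVE_FLAGS:
            flags.discard(NEGATIVE_FLAGS[name])
    return "".join(sorted(flags))


print(hunspell_flags_for_noun([]))                 # a regular noun such as "dog"
print(hunspell_flags_for_noun(["Hunspell-no-S"]))  # a noun without a plural form
```

The editor then only needs to store the exceptions, while the exported
dictionary still carries the explicit M and S flags.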


About having more than one language hosted on the same server: this should be
possible; you just need a separate database for each language. Dictionary
maintainers should be allowed direct access to the database so that they can
make batch changes if needed (such as setting flags for many words at a time
if it is later decided that having [Hunspell-no-S] is better than
[Hunspell-S]). Having the vocabulary data from many languages in the same
database would make such access more risky.
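
As an illustration of the kind of batch change meant here, the sketch below
switches nouns from [Hunspell-S] to [Hunspell-no-S] in one transaction. It
uses an in-memory SQLite database and a made-up two-table schema purely so
that it is self-contained; the real Joukahainen schema and database server
are different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample data, for illustration only.
conn.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, text TEXT, wclass TEXT);
CREATE TABLE flag (word_id INTEGER, name TEXT);
INSERT INTO word VALUES (1, 'dog', 'noun'), (2, 'sheep', 'noun');
INSERT INTO flag VALUES (1, 'Hunspell-S'), (1, 'Hunspell-M'), (2, 'Hunspell-M');
""")

with conn:  # both statements are committed together, or rolled back on error
    # Nouns that lacked the plural flag get the new negative flag...
    conn.execute("""
        INSERT INTO flag (word_id, name)
        SELECT id, 'Hunspell-no-S' FROM word
        WHERE wclass = 'noun'
          AND id NOT IN (SELECT word_id FROM flag WHERE name = 'Hunspell-S')
    """)
    # ...and the old positive flag is dropped from all nouns.
    conn.execute("""
        DELETE FROM flag
        WHERE name = 'Hunspell-S'
          AND word_id IN (SELECT id FROM word WHERE wclass = 'noun')
    """)

rows = conn.execute("""
    SELECT word.text, flag.name FROM word
    JOIN flag ON flag.word_id = word.id
    ORDER BY word.text, flag.name
""").fetchall()
print(rows)
```

A mistake in a statement like this touches every word in the database, which
is why keeping each language in its own database limits the damage.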

Let me know if you need more information. I know that the documentation is not
in the best possible state and there are some features that should still be
implemented, but I would rather improve these things based on the actual
needs of the people trying to use Joukahainen.

Harri
