Diacritic restoration + new spell checking packages

Kevin Scannell-2
Hello all,

[Sorry for cross-posting - sending this to aspell-devel and a12n as well]

 This is an announcement of a new package called "charlifter" that
does statistical diacritic restoration:

https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317046

and two new open source word lists, one for Lingala (joint work with
Denis Jacquerye):

https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317051

and one for Hawaiian:

http://borel.slu.edu/ispell/haw_US.zip



The charlifter script is language-independent - all you need to do is
provide it with some plain text in the language of interest with all
of the diacritical marks in place.   From this the script "learns"
where the diacritics belong, statistically.   You can also improve
performance by feeding it a word list during the training phase.
I've built and packaged pre-trained models for several languages,
including Irish, French, Lingala, Samoan, and Hawaiian - see the
directories "charlifter-*" here:

http://lingala.svn.sourceforge.net/viewvc/lingala/

Once you've trained a language model, or installed one of the models
above, you can feed plain ASCII text to the script and it restores the
diacritics or extended Unicode characters that are missing:

Irish:
$ echo "an chead teanga oifigiuil" | sf.pl -r ga
an chéad teanga oifigiúil

Lingala (note that the open vowel "ɔ" is restored correctly):
$ echo "Ngolo, nina, zambi ikamwisi bango." | sf.pl -r ln
Ngɔlɔ, niná, zambí ikamwísí bangó.

Hawaiian:
$ echo "Olelo aku 'o Papa" | sf.pl -r haw
ʻŌlelo aku ʻo Pāpā

etc....
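
As a rough illustration of the idea only: the sketch below is a simplified word-level stand-in written in Python, not charlifter's actual algorithm (which this message does not describe). It assumes that the training text is fully marked, that stripping combining marks plus a small hand-written table for "ɔ" and "ɛ" is enough to produce the ASCII form, and that the most frequent marked form of a word is the right one. The names asciify, train and restore are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Assumption: letters with no Unicode decomposition need an explicit table.
EXTRA = str.maketrans({"ɔ": "o", "ɛ": "e", "Ɔ": "O", "Ɛ": "E"})
# Words may contain combining marks (e.g. a tone mark on "ɔ"), so allow them.
WORD = re.compile(r"[\w\u0300-\u036f]+")


def asciify(text):
    """Strip combining marks and map the open vowels to plain letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(EXTRA)


def train(corpus):
    """For every stripped word, count the marked forms seen in the corpus."""
    model = defaultdict(Counter)
    for word in WORD.findall(corpus):
        model[asciify(word).lower()][word.lower()] += 1
    return model


def restore(model, text):
    """Replace each known word with its most frequent marked form."""
    def fix(match):
        word = match.group(0)
        forms = model.get(word.lower())
        if not forms:
            return word  # unseen word: leave it untouched
        best = forms.most_common(1)[0][0]
        # Simplification: only an initial capital is carried over.
        return best.capitalize() if word[0].isupper() else best
    return WORD.sub(fix, text)


model = train("an chéad teanga oifigiúil")
print(restore(model, "an chead teanga oifigiuil"))
```

A lookup like this can only restore words it has already seen in training, and it cannot choose between two marked forms that share the same ASCII spelling; a real tool needs context or character-level statistics for those cases.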


This work ties in closely with my Crúbadán project, which is gathering
text corpora for 400+ languages with a web crawler:

http://borel.slu.edu/crubadan/

Lingala is a good example.  When written properly, it uses diacritics
to indicate tone, and also uses the open vowels "ɔ" and "ɛ", but 95%
of what is written on the web is in plain ASCII (no tone marks, "o"
and "e" in place of "ɔ" and "ɛ").  Therefore, to use the web corpus
effectively for language modelling purposes, it is important to
restore these ASCII texts to the proper encoding as accurately as
possible.

The spell checkers for Lingala and Hawaiian came directly from this
approach - train charlifter on the small amount (say 5%) of web text
with correct diacritics in place, then restore the other 95% and use
the resulting large corpus to generate frequency lists for
hand-editing, just as we've done with many other Crúbadán languages.
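
The workflow in the paragraph above can be sketched in Python as follows. This is a toy outline under stated assumptions, not the code used for the Lingala or Hawaiian lists: it treats any document containing a non-ASCII character as "correctly marked" (a crude heuristic), and it uses a simple most-frequent-form lookup in place of charlifter. The names strip_marks and frequency_list are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Words may contain combining marks, so allow them alongside \w.
WORD = re.compile(r"[\w\u0300-\u036f]+")
# Assumption: the open vowels need an explicit mapping to plain letters.
OPEN_VOWELS = str.maketrans("ɔɛ", "oe")


def strip_marks(text):
    """Remove combining marks and map the open vowels to 'o' and 'e'."""
    nfd = unicodedata.normalize("NFD", text)
    bare = "".join(c for c in nfd if not unicodedata.combining(c))
    return bare.translate(OPEN_VOWELS)


def frequency_list(documents):
    """Train on marked documents, restore the plain ones, count all words."""
    # Step 1: crude split. Non-ASCII content is assumed to be properly marked.
    marked = [d for d in documents if not d.isascii()]
    plain = [d for d in documents if d.isascii()]

    # Step 2: learn the marked forms seen for each stripped word.
    model = defaultdict(Counter)
    for doc in marked:
        for word in WORD.findall(doc.lower()):
            model[strip_marks(word)][word] += 1

    # Step 3: count the marked text as is, then the restored plain text.
    counts = Counter()
    for doc in marked:
        counts.update(WORD.findall(doc.lower()))
    for doc in plain:
        for word in WORD.findall(doc.lower()):
            forms = model.get(word)
            counts[forms.most_common(1)[0][0] if forms else word] += 1

    # Most frequent first, ready for hand-editing.
    return counts.most_common()


# Toy input: one marked document and two plain ASCII ones.
docs = ["bangó na niná", "bango na nina", "bango mpe nina"]
for word, count in frequency_list(docs):
    print(count, word)
```

Words that never occur in the marked portion (here "mpe") pass through unchanged, which is one reason the resulting list still needs hand-editing.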

Please contact me if you're interested in trying to develop a new word
list using this approach.  I'm particularly interested in African
languages.

Kevin

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Diacritic restoration + new spell checking packages

Németh László-2
Hello,

This is an important feature for word processing in these languages.
With a Python version of your Perl script, we could make an
OpenOffice.org extension that restores diacritics across a whole
document (and also in OCRed text).

Regards,
László


2009/4/8 Kevin Scannell <[hidden email]>:

> [...]
