Perl support for the hunspell library

ge-7
Dear All,

I have written a Perl module (Text::Hunspell) that supports the hunspell
library's spell and suggest features (that is, its spell-checking features).

It is available at:
http://tkltrans.sourceforge.net/tklspell/text_hunspell.tar.gz

Regards: Eleonora

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Perl support for the hunspell library

Дмитрий Габинский
Dear Eleonora,

That's very interesting. I have a question. I work with a
translation memory program, OmegaT
(http://www.omegat.org/omegat/omegat.html). It's fine but has no
spellcheck capability. Is it possible to somehow interface your
script with OmegaT? Since I'm not a professional programmer, I haven't
the vaguest idea how to do it, but Perl is said to be very
powerful. So, maybe?

Best regards,

Dmitri Gabinski

P.S. I'm on Windows XP/JRE 1.5.0_06/OOo 2.0.3/Perl 5.8.6 by ActiveState
   
   
   
---
Summer is the time to buy goods for recreation and tourism!
http://shop.tut.by


Re: Perl support for the hunspell library

Marcin Miłkowski
Dear Dmitri,

Have a look at the hunspell website and try to find jmorph (a Java version
of hunmorph which uses hunspell dictionaries, I guess). It seems to me
there are some programs that interface with MySpell dictionaries in Java.
These could be adapted to OmegaT more easily, but you'll need to search further.

Best,
Marcin Milkowski


Re: Perl support for the hunspell library

Дмитрий Габинский

> try to find jmorph

Looks like this project is not alive. Its SourceForge page does not even
list any files. I've downloaded jmorph.jar from some FTP site, but can't
find any documentation. Anyway, the (sample?) Hungarian module in the
package is not in MySpell format. Looks like this door is not really open.

Well, thanks for the hint!

Best regards,

Dmitri Gabinski
   
   
   

Re: Re: Perl support for the hunspell library

Marcin Miłkowski
---- Original Message ----
From:
To: [hidden email]
Date: 10 July 2006 10:48
Subject: Re: [lingu-dev] Perl support for the hunspell library

> > try to find jmorph
>
> Looks like this project is not alive. Its SourceForge page does not even
> list any files. I've downloaded jmorph.jar from some FTP site, but can't
> find any documentation. Anyway, the (sample?) Hungarian module in the
> package is not in MySpell format. Looks like this door is not really open.


The project is alive and well, only the docs are in Hungarian :(

Anyway, you should look for jmorph at a Hungarian site, around here:

http://ftp.mokk.bme.hu/Tool/LgIndep/jmorph/

Regards,
Marcin


Re: Perl support for the hunspell library

Дмитрий Габинский
On Mon, 10 Jul 2006 12:31:50 +0200, Marcin Miłkowski <[hidden email]>
wrote:


> The project is well and alive, only the docs are in Hungarian :(

Really? So, where is it? Google does not give any reasonable results for
jmorph, only download servers. The SF page shows very little activity and
no documents, even in Hungarian :( Definitely, I'd prefer something in
another language, for example, Polish :)

Best regards,

Dmitri Gabinski
   
   
   

Re: Perl support for the hunspell library

ge-7
Hi, Dmitri and Marcin,

The name jmorph covers two things: a seemingly dead Japanese morphology project and a Hungarian Java morphology project. I read the readme of the second one (attached), and was none the wiser afterwards.

Dmitri, I have absolutely no experience with Java interfacing, so I cannot answer your question about the translation memory. I think your idea of spell checking before translation is a good one.

Regards: Eleonora

------------------- jmorph readme
Introduction
============
jmorph is the Java implementation of morphbase

[TODO: describe here nicely that it uses the same resources, etc.]


Installing the build tools
==========================

The JMorph build system is based on Jakarta Ant, which is a Java
building tool originally developed for the Jakarta Tomcat project but
now used in many other Apache projects and extended by many
developers.

Ant is a little but very handy tool that uses a build file written in
XML (build.xml) as building instructions. For more information refer
to "http://jakarta.apache.org/ant/".

The only thing that you have to make sure of is that the "JAVA_HOME"
environment property is set to match the top level directory
containing the JVM you want to use. For example:

C:\> set JAVA_HOME=C:\jdk1.2.2

or on Unix:

% setenv JAVA_HOME /usr/local/java           (csh)
$ JAVA_HOME=/usr/java; export JAVA_HOME      (ksh, bash)
That's it!


Building instructions
=====================

Ok, let's build the code. First, make sure your current working
directory is where the build.xml file is located. Then type

  ./build.sh (unix)

If everything is right and all the required packages are visible, this
action will generate a file called "opennlp-common-${version}.jar" in
the "./build" directory. Note that if you do further development,
compilation time is reduced, since Ant is able to detect which files
have changed and to recompile them as needed.

Also, you'll note that reusing a single JVM instance for all tasks
tremendously increases the performance of the whole build system compared to
other tools (e.g. make or shell scripts), where a new JVM is started for each task.


Build targets
=============

The build system is not only responsible for compiling Opennlp into a jar
file, but is also responsible for creating the HTML documentation in
the form of javadocs.

These are the meaningful targets for this build file:

 - package [default] -> creates ./build/opennlp-common.jar
 - compile -> compiles the source code
 - javadoc -> generates the API documentation in ./build/javadocs
 - clean -> restores the distribution to its original and clean state

For example, to build the Java API documentation, type

  build.sh javadoc (unix)

To learn the details of what each target does, read the build.xml file.
It is quite understandable.

Downloading Resources
=====================

Running JMorph
==============

Bug Reports
===========

Please report bugs at the bug section of the JMorph site


Special Note
============

This README and the directory structure and the build system for this
project were taken directly from the OpenNLP project.
-------------------------

--


"Feel free" - 10 GB mailbox, 100 free SMS per month ...
Try GMX TopMail now: http://www.gmx.net/de/go/topmail


Re: Perl support for the hunspell library

Дмитрий Габинский
> Dmitri, I have absolutely no experience with java interfacing, therefore
>I cannot answer you question with the translation memory.

Pity-pity-pity :-(

>I think your
>idea is good, to make spell checking before translation.

Not exactly. You see, what we (users of OmegaT) want is spellcheck DURING
translation or just after it. The workflow with OmegaT is as follows (very
briefly):
1) prepare files to translate in supported formats;
2) create a project and translate (when you load a project, OmegaT, like
any CAT tool, splits the text(s) into so-called segments, the minimal
units to translate; a segment may be a line, a sentence, or a paragraph,
depending on file types and settings);
3) create target documents.

So, until you complete step 3, you can't check for typing mistakes in
the translation. The idea is to somehow hook up a spellcheck engine to give
OmegaT this ability (possibly with some kind of highlighting of spelling
errors). Obviously, Hunspell would be a perfect option: it's free (LGPL,
if I'm not mistaken) and it can use MySpell dictionaries, which are already
numerous.

If embedding directly into OmegaT (Java) is not possible, would it be
possible to make a kind of bypass by checking the project's translation
memory? (I bet this should be possible with a Perl script!) Some
background: OmegaT stores translation memories as TMX files. TMX is an
XML application, so it's a well-structured format. All translated segments,
as described above, are stored as pairs of the source text and its
translation. The source and the target are clearly labeled with
language/locale tags. Such a pair is called a translation unit (TU).
Here's an example of such a file:

========================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
   <header
     creationtool="OmegaT"
     creationtoolversion="1"
     segtype="paragraph"
     o-tmf="OmegaT TMX"
     adminlang="EN-US"
     srclang="EN-US"
     datatype="plaintext"
   >
   </header>
   <body>
     <tu>
       <tuv lang="EN-US">
         <seg>Cancel</seg>
       </tuv>
       <tuv lang="PL-PL">
         <seg>Anuluj</seg>
       </tuv>
     </tu>
     <tu>
       <tuv lang="EN-US">
         <seg>Close</seg>
       </tuv>
       <tuv lang="PL-PL">
         <seg>Zamknij</seg>
       </tuv>
     </tu>
   </body>
</tmx>
==========================================================

So, I envisage a scenario approximately like this:

1) run a script that reads and parses a TM file (AFAIK, Perl has libraries
for handling XML);
2) the script reads each segment (I guess SAX would be OK) and checks
only translations (i.e., the contents of those <tuv></tuv> elements whose
“lang” attribute is DIFFERENT from the “srclang” in the header) and somehow
displays mistakes;
3) it would also be cool to be able to correct mistakes.

Something like this. Well, I understand, it can be a real job. But maybe?
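
A minimal sketch of steps 1) and 2) above, written in Python with only the standard library so that it runs without extra modules (the thread itself has a Perl script in mind). It loads the whole file with ElementTree rather than SAX, which should be fine for small memories. The function names, the `is_correct` callback (a stand-in for whatever speller gets plugged in, for example a hunspell check) and the one-word `known` list are assumptions made for illustration; TMX files that use `xml:lang` instead of `lang` would need the attribute name adjusted.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the TMX sample from the message above.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header creationtool="OmegaT" srclang="EN-US" datatype="plaintext"></header>
  <body>
    <tu>
      <tuv lang="EN-US"><seg>Cancel</seg></tuv>
      <tuv lang="PL-PL"><seg>Anuluj</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-US"><seg>Close</seg></tuv>
      <tuv lang="PL-PL"><seg>Zamknij</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# Runs of letters only (Unicode-aware); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def target_segments(tmx_bytes):
    """Yield (lang, text) for every <tuv> whose lang differs from srclang."""
    root = ET.fromstring(tmx_bytes)
    srclang = root.find("header").get("srclang", "").lower()
    for tuv in root.iter("tuv"):
        lang = tuv.get("lang", "")
        seg = tuv.find("seg")
        if seg is not None and lang.lower() != srclang:
            yield lang, "".join(seg.itertext())


def misspelled_words(tmx_bytes, is_correct):
    """Return (lang, word) pairs that the is_correct callback rejects."""
    return [
        (lang, word)
        for lang, text in target_segments(tmx_bytes)
        for word in WORD_RE.findall(text)
        if not is_correct(word)
    ]


if __name__ == "__main__":
    # Hypothetical word list standing in for a real Polish dictionary.
    known = {"Anuluj"}
    for lang, word in misspelled_words(SAMPLE_TMX, known.__contains__):
        print(f"{lang}: {word}")
```

Run as a script, this prints `PL-PL: Zamknij`, because only `Anuluj` is in the stand-in word list. Step 3) (correcting mistakes) is left out; it would mean writing the edited text back into the `<seg>` elements.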

I'll also send a copy of this letter to the OmegaT group. Maybe, someone
there can suggest something.

I'm afraid, I did not say this, though I should: THANK YOU :-)

Best regards,

Dmitri Gabinski
   
   
   

Re: Perl support for the hunspell library

ge-7
Dmitri,

Thanks for the information about OmegaT's internals.

The Perl interface to hunspell is really trivial:

1. Create a speller object for a language:
use Text::Hunspell;
my $speller = Text::Hunspell->new("/.../test.aff", "/.../test.dic");

2. Do the spell check:
my $found = $speller->check( $word );
(result: 1 if found, 0 if not)

3. If not found, get suggestions:
my @suggestions = $speller->suggest( $misspelled );

4. Delete the speller object:
$speller->delete($speller);

I think the above information helps a bit in designing a spelling interface for OmegaT. Maybe you could also forward this information to the OmegaT group.
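
The same check-then-suggest flow, sketched in Python for readers who do not speak Perl. `ToySpeller` is a hypothetical in-memory stand-in, not the Text::Hunspell API: its `check` and `suggest` only mirror the method names listed above, and its suggestions come from `difflib` string similarity rather than from hunspell's dictionary and affix rules.

```python
import difflib


class ToySpeller:
    """Hypothetical stand-in for a speller object built from a word list."""

    def __init__(self, words):
        self._words = set(words)

    def check(self, word):
        # 1 if the word is known, 0 otherwise, as in step 2 above.
        return 1 if word in self._words else 0

    def suggest(self, word):
        # Up to three known words that are sufficiently similar to the input.
        return difflib.get_close_matches(word, sorted(self._words), n=3, cutoff=0.6)


def review(words, speller):
    """Map each rejected word to its suggestions; accepted words are skipped."""
    return {w: speller.suggest(w) for w in words if not speller.check(w)}


if __name__ == "__main__":
    speller = ToySpeller(["cancel", "close"])
    for word, suggestions in review(["cancel", "clsoe"], speller).items():
        print(word, "->", suggestions)
```

Only words that fail `check` are passed to `suggest`, which matches the order of steps 2 and 3: suggestions are usually the expensive part, so they are requested only when needed.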

There is also a very similar Perl interface to aspell, Text::Aspell (it was my model for the hunspell one). Aspell is strong on suggestions, but at the moment it lacks forbidden words and twofold affixing.

Regards: Eleonora

Re: Perl support for the hunspell library

Дмитрий Габинский
Thank you, Eleonora!

> The perl interfacing to hunspell is really trivial:

Yes, for you :-) Well, if only it were Python!

> Maybe you could forward also this information to
>the Omega group.

Sure I will! I think someone there does speak Perl :-)

Best regards,

Dmitri Gabinski
   
   
   

JMySpell - Java port of MySpell

Marcin Miłkowski
Dear Dmitri,

I've just found an updated version of JMySpell, a pure Java port of MySpell:

http://jmyspell.javahispano.net/en/index.html

It seems pretty easy to integrate with any Java app, so it should be of
some use for OmegaT. It has docs in English and Spanish, so no need to
learn Hungarian first.

Regards,
Marcin


Re: JMySpell - Java port of MySpell

Simon Brouwer
Hi Marcin,

Marcin Miłkowski wrote:
> Dear Dmitri,
>
> I've just found an updated version of JMySpell, a pure Java port of
> MySpell:
Wow, I can think of some uses for that! Thanks!

--
Kind regards,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |
