Lightproof grammar checker 1.0

Lightproof grammar checker 1.0

Németh László-2
Hi,

Supported by the FSF.hu Foundation, Hungary, I have developed a fast
grammar checker and a simple framework in Python to speed up the
development of grammar checkers for OpenOffice.org:

http://extensions.services.openoffice.org/node/2301

Translating and slightly modifying the template rules is enough for a
minimalistic grammar checker for a new language.

Please report any language-specific problems you have with the
grammar checker (morphological and syntactic rules aren't supported
yet). I have already added special casing support for Turkish, Azeri
and Dutch but, for example, I haven't yet tried the grammar checking
rules with Asian languages.

Best regards,
László

P.S. I left out of the manual that the rules are sentence-level
regex patterns, so ^ and $ mean sentence boundaries.
P.S. 2 Documentation, sample English rules:

------------- doc/manual.txt ------------
Adding new language support

1. Rename data/tutorial.dat to your locale ID, i.e. xx_YY.dat or xx.dat
   (with language and country identifiers).

2. Translate messages, modify or add new rules (see doc/syntax.txt).

3. Type make in the root directory. Without a Unix or Cygwin
   environment, you can compile your dat file with the
   following commands in the pythonpath subfolder (replace slashes
   with backslashes under Windows):

   cd pythonpath
   python Convert.py ../data/your_locale.dat >lightproof_your_locale.py
   python Locale.py ../data/*.dat >lightproof_lang.py

4. Type make dist to zip the distribution, or use your zip compressor
   in the root directory, e.g.:

   zip -r lightproof.oxt .

5. Install the extension with the Tools->Extension Manager->Add menu
   item and dialog, then check it in OpenOffice.org under
   Tools->Options->Language Settings->Writing Aids.

   Note: Without country identifiers (xx.dat, not xx_YY.dat data files),
   grammar checking won't be enabled by default for this language. Choose
   the Lightproof grammar checker in the Writing Aids options page and click
   the Edit button. Select your language in the Edit Modules dialog and tick
   the check box of the grammar checker.
-----------------------
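
Where no zip tool is at hand, step 4 can also be done from Python, since the .oxt file is an ordinary zip archive. A minimal sketch with the standard zipfile module; make_oxt is a hypothetical helper and not part of the Lightproof distribution:

```python
import os
import zipfile

def make_oxt(root, target):
    """Zip every file below root into target, like 'zip -r lightproof.oxt .'"""
    target = os.path.abspath(target)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as oxt:
        for folder, _, files in os.walk(root):
            for name in files:
                path = os.path.join(folder, name)
                if os.path.abspath(path) == target:
                    continue  # do not add the archive to itself
                # Store paths relative to root, as zip does for "."
                oxt.write(path, os.path.relpath(path, root))

# Usage, from the root directory of the sources:
# make_oxt(".", "lightproof.oxt")
```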

---------------------- doc/syntax.txt -------------
= Encoding =

UTF-8

= Rule syntax =

pattern -> replacement # message

Basically, pattern and replacement are passed as parameters to the
standard Python re.sub() regular expression function (see the
Python re module documentation for the regular expression syntax).

Example 0. Report "foo" in the text and suggest "bar":

foo -> bar # Use bar instead of foo.

Note: this rule also matches "foo" inside words. To match
whole words only, use the zero-length word
boundary regex notation \b.

Example 1. Recognize and suggest missing hyphen:

\bfoo bar\b -> foo-bar # Missing hyphen.

(Here \b marks the beginning and the end of the words.)
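
As an illustration only (plain Python, not Lightproof's own code), Examples 0 and 1 behave like the following re.sub() calls. The sample texts are made up:

```python
import re

text = "foo and foot"

# Example 0: the bare pattern also matches inside longer words.
anywhere = re.sub(r"foo", "bar", text)

# With \b on both sides only the whole word "foo" is replaced.
whole_word = re.sub(r"\bfoo\b", "bar", text)

# Example 1: missing hyphen between two whole words.
hyphen = re.sub(r"\bfoo bar\b", "foo-bar", "a foo bar b")

print(anywhere)    # bar and bart
print(whole_word)  # bar and foot
print(hyphen)      # a foo-bar b
```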

Example 2. Recognize two or more spaces and suggest a single space:

"  +" -> " " # Extra space.

ASCII " characters protect spaces in the pattern and in the replacement text.
Plus sign means 1 or more repetitions of the previous space.

Example 3. Suggest a word with correct quotation marks:

\"(\w+)\" -> “\1” # Correct quotation marks.

(Here \" is an ASCII quotation mark, \w means an arbitrary word character
(letter, digit or underscore), and + means 1 or more repetitions of the
previous object. The parentheses define a regex group (the word). In the
replacement, \1 is a reference to the (first) group of the pattern.)

Example 4. Suggest the missing space after the !, ? or . signs:

\b([?!.])([a-zA-Z]+) -> \1 \2 # Missing space?

The [ and ] define a character class. The replacement will contain
the actual matching character (?, ! or .), a space and the word after
the punctuation character.
Note: the ? and . characters have special meanings in regular expressions;
use the [?] or [.] patterns to match "?" and "." signs in the text.
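
Examples 2 to 4 can be tried directly with re.sub(). This is a sketch in plain Python with made-up input texts; the quoting with ASCII " characters belongs to the rule file syntax and is not needed here:

```python
import re

# Example 2: two or more spaces collapse to one.
spaces = re.sub(r"  +", " ", "Extra   space")

# Example 3: straight quotes around a single word become curly quotes.
quotes = re.sub(r'"(\w+)"', r"“\1”", 'a "word" here')

# Example 4: insert the missing space after ?, ! or .
missing = re.sub(r"\b([?!.])([a-zA-Z]+)", r"\1 \2", "Really?Yes.Fine")

# A bare . matches any character, [.] only a literal dot.
any_char = re.findall(r".", "a.b")
only_dot = re.findall(r"[.]", "a.b")

print(spaces)    # Extra space
print(quotes)    # a “word” here
print(missing)   # Really? Yes. Fine
print(any_char)  # ['a', '.', 'b']
print(only_dot)  # ['.']
```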

== Case-insensitive patterns ==

Add the Python "(?i)" notation to the pattern for case insensitive
matching and capitalized suggestions:

(?i)\bfoo bar\b -> foo-bar # Missing hyphen.

The proofreader will also recognize "Foo bar" and "FOO BAR"
(and suggest "Foo-bar" instead of "foo-bar" for capitalized matches).

For more special casing, you can use grouping or name definitions (see
later):

(?i)\b(Foo) (Bar)\b -> \1-\2 # Missing hyphen.

or multiple rules:

\bFoo Bar\b -> Foo-Bar # Missing hyphen.
\bFOO BAR\b -> FOO-BAR # Missing hyphen.
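
The capitalized suggestions can be imitated in plain Python as follows. This is only a sketch of the behaviour described in this section: suggest is a hypothetical helper, and the assumption that a suggestion is capitalized whenever the matched text starts with an uppercase letter is mine, not taken from the Lightproof sources:

```python
import re

def suggest(pattern, replacement, text):
    """Apply one rule; capitalize the suggestion for capitalized matches."""
    def repl(match):
        suggestion = match.expand(replacement)  # resolves \1, \2, ...
        if match.group(0)[:1].isupper():        # assumed capitalization test
            suggestion = suggestion[:1].upper() + suggestion[1:]
        return suggestion
    return re.sub(pattern, repl, text)

print(suggest(r"(?i)\bfoo bar\b", "foo-bar", "a foo bar"))      # a foo-bar
print(suggest(r"(?i)\bfoo bar\b", "foo-bar", "Foo bar"))        # Foo-bar
# With groups, the suggestion keeps the casing of the matched words.
print(suggest(r"(?i)\b(Foo) (Bar)\b", r"\1-\2", "FOO BAR"))     # FOO-BAR
```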

== Multiple suggestions ==

Use \n (new line) in the replacement text to add multiple suggestions:

foo -> Foo\nFOO\nBar\nBAR # Did you mean:

(Foo, FOO, Bar and BAR suggestions for the input word "foo")
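
A minimal sketch of how such a replacement can be turned into a list of suggestions. The helper name suggestions is hypothetical, and splitting on the two-character sequence \n is an assumption about the rule file format rather than Lightproof's actual code:

```python
import re

def suggestions(pattern, replacement, text):
    """Return one corrected text per alternative in the replacement."""
    # In the rule file, \n is a literal backslash followed by "n".
    return [re.sub(pattern, alt, text) for alt in replacement.split("\\n")]

print(suggestions(r"foo", r"Foo\nFOO\nBar\nBAR", "foo"))
# ['Foo', 'FOO', 'Bar', 'BAR']
```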

== Tests ==

It is recommended to add tests for the rules with the TEST keyword:

foo([xy]) -> bar(\1) # Did you mean:
TEST: foox -> barx

The rule precompiler will check the matching and suggestions
of the TESTs.
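
The idea of such a check can be sketched in a few lines of Python. This is not the rule precompiler: parse_rule and run_test are hypothetical helpers, and the sketch ignores quoting with ASCII " characters, name definitions and multiple suggestions. The rule is Example 1 with a made-up TEST line:

```python
import re

def parse_rule(line):
    """Split 'pattern -> replacement # message' into its three parts."""
    rule, _, message = line.partition(" # ")
    pattern, _, replacement = rule.partition(" -> ")
    return pattern.strip(), replacement.strip(), message.strip()

def run_test(rule_line, test_line):
    """Apply the rule to the TEST input and compare with the expected text."""
    pattern, replacement, _ = parse_rule(rule_line)
    source, _, expected = test_line[len("TEST:"):].partition(" -> ")
    result = re.sub(pattern, replacement, source.strip())
    return result == expected.strip(), result

ok, result = run_test(r"\bfoo bar\b -> foo-bar # Missing hyphen.",
                      "TEST: a foo bar -> a foo-bar")
print(ok, result)  # True a foo-bar
```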

== Name definitions ==

Lightproof supports name definitions to simplify the
description of the complex rules.

Definition:

name pattern # name definition

Usage in the rules:

"{name} " -> "{name}. " # Missing dot?

{Name}s in the first part of the rules mean
subpatterns (groups). {Name}s in the second
part of the rules mean back references to the
matched texts of the subpatterns.

Example: thousand markers (10000 -> 10,000 or 10 000)

# definitions
d \d\d\d # name definition: 3 digits
d2 \d\d # 2 digits
D \d{1,3} # 1, 2 or 3 digits

# rules
# ISO thousand marker: space, here: no-break space (U+00A0)
\b{d2}{d}\b -> {d2},{d}\n{d2} {d} # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D},{d},{d}\n{D} {d} {d} # Use thousand markers (common or ISO).
TEST: 123456789 -> 123,456,789\n123 456 789

Note: Lightproof uses named groups for name definitions and
their references, adding a hidden number to the group names
in the form of "_n". You can use these explicit names in the replacement:

\b{d2}{d}\b -> {d2_1},{d_1}\n{d2_1} {d_1} # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D_1},{d_1},{d_2}\n{D_1} {d_1} {d_2} # Use thousand markers (common or ISO).

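Note: the numbering of the back references of name definitions restarts
after each new line (\n) in the replacement, so every suggestion refers
to the same matched texts. See the rules above and the following example: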

E ( |$) # name definition: space or end of sentence
"\b[.][.]{E}" -> .{E}\n…{E} # Period or ellipsis?

See data/template.dat for more examples.
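
The expansion of name definitions into named groups can be sketched in Python as follows. This only illustrates the "_n" numbering described above under my own assumptions: expand is a hypothetical helper, it does not handle explicit names such as {d_1} in the replacement, and it is not the code of the Lightproof compiler. The ISO suggestion uses a plain space here instead of U+00A0:

```python
import re

def expand(pattern, replacement, names):
    """Turn {name}s into named groups and back references."""
    counts = {}
    def to_group(m):
        name = m.group(1)
        if name not in names:
            return m.group(0)  # e.g. a regex quantifier like {3}
        counts[name] = counts.get(name, 0) + 1
        return "(?P<%s_%d>%s)" % (name, counts[name], names[name])
    regex = re.sub(r"\{(\w+)\}", to_group, pattern)

    alternatives = []
    for alt in replacement.split("\\n"):
        seen = {}  # the numbering restarts for every suggestion
        def to_ref(m):
            name = m.group(1)
            if name not in names:
                return m.group(0)
            seen[name] = seen.get(name, 0) + 1
            return r"\g<%s_%d>" % (name, seen[name])
        alternatives.append(re.sub(r"\{(\w+)\}", to_ref, alt))
    return regex, alternatives

names = {"d": r"\d\d\d", "d2": r"\d\d", "D": r"\d{1,3}"}
regex, alternatives = expand(r"\b{D}{d}{d}\b",
                             r"{D},{d},{d}\n{D} {d} {d}", names)
print(regex)  # \b(?P<D_1>\d{1,3})(?P<d_1>\d\d\d)(?P<d_2>\d\d\d)\b
print([re.sub(regex, alt, "123456789") for alt in alternatives])
# ['123,456,789', '123 456 789']
```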

-------------------- data/en_US.dat -----------------

# Sample proofreading rules for English

# punctuation

" ([.?!,:;)”—]($| ))" -> \1 # Extra space before punctuation.
"([(“—]) " -> \1 # Extra space after punctuation.

"^[-—] " -> "– " # Hyphen instead of n-dash.
" [-—]([ ,;])" -> " –\1" # Hyphen instead of n-dash.

TEST: ( item ) -> (item)
TEST: A small - but reliable - example. -> A small – but reliable – example.

# definitions
abc [a-z]+
ABC [A-Z]+
Abc [a-zA-Z]+
punct [?!,:;%‰‱˚“”‘]

{Abc}{punct}{Abc} -> {Abc}{punct} {Abc} # Missing space?
{abc}[.]{ABC} -> {abc}. {ABC} # Missing space?
TEST: missing,space -> missing, space
TEST: missing.Space -> missing. Space

(\d+)x(\d+) -> \1×\2 # Multiplication sign.
TEST: 800x600 -> 800×600

# typography
"[.]{3}" -> "…" # Three dot character.

(^|\b|{punct}|[.]) {2,3}\b -> "\1 " # Extra space.
TEST: Extra  space -> Extra space
TEST: End... -> End…

# quotation

\"(\w[^\"“”]*[\w.?!,])\" -> “\1” # Quotation marks.
\B'(\w[^']*[\w.?!,])'\B -> ‘\1’ # Quotation marks.
TEST: "The 'old' boy." -> “The ‘old’ boy.”

# apostrophe

w \w*
(?i){Abc}'{w} -> {Abc}’{w} # Apostrophe.
TEST: o'clock -> o’clock
TEST: singers' voices -> singers’ voices

# words

# frequent mistakes

# silent h
(?i)\ba (honest(y|ly)?|hour(ly|glass)?|honou?r(abl[ey]|ed|ing|ifics?|s)|heir(less|loom)?)\b -> an \1 # Did you mean:
TEST: A heirloom -> An heirloom

# possessive pronouns
(?i)\b(your|her|our|their)['’]s\b -> \1s # Did you mean:
TEST: Your's -> Yours

# duplicates
\b(and|or|for)\b \1 -> \1 # Did you mean:

(?i)\bcomprises of\b -> comprises # Did you mean:

# rare words (potential errors)

# multiword expressions

\bscot free\b -> scot-free\nscotfree # Did you mean:
TEST: scot free -> scot-free\nscotfree # Suggestions separated by new lines (\n)

(?i)\bying and yang\b -> yin and yang # Did you mean:

# accept foreign words only in multiword expressions

(?i)\bde(?! (facto|jure))\b -> de facto\nde jure # Missing Latin expression?
TEST: de standard -> de facto\nde jure standard

# formats

# Thousand separators: 10000 -> 10,000  (common) or 10 000 (ISO standard)

# definitions
d \d\d\d # name definition: 3 digits
d2 \d\d # 2 digits
D \d|\d\d|\d\d\d # 1, 2 or 3 digits

# ISO thousand separators: space, here: no-break space (U+00A0)
\b{d2}{d}\b -> {d2},{d}\n{d2} {d} # Use thousand separators (common or ISO).
\b{D}{d}{d}\b -> {D},{d},{d}\n{D} {d} {d} # Use thousand separators (common or ISO).
\b{D}{d}{d}{d}\b -> {D},{d},{d},{d}\n{D} {d} {d} {d} # Use thousand separators (common or ISO).
TEST: 1234567890 -> 1,234,567,890\n1 234 567 890

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Lightproof grammar checker 1.0

Olivier R.-2
Hi,

I wrote some rules for the French language, and tried the extension.
This tool looks promising.

Minor issues:

*1*
In the documentation Convert.py should be replaced by Compile.py


*2*
The script Locales.py does not seem to work.

The command:

   Locales.py ..\data\*.dat >lightproof_lang.py

generates only:

   locales = {'': ['', '', '']}

under Vista with Python 2.6.1

So I edited the file lightproof_lang.py manually to make the extension work.


*3*
The website launchpad.net/lightproof has disappeared?



> Please, report it, if you have any special language problem with the
> grammar checker (morphological and syntactic rules aren't supported
> yet).

Is this extension planned to use the grammatical fields of Hunspell
dictionaries, or will it use another specific dictionary?


Best regards,
Olivier R.

--

== N'écrivez pas à cette adresse. Réservée aux listes de discussion. ==
** Do not reply at this address. Mailing-list only. **


Re: Lightproof grammar checker 1.0

Németh László-2
In reply to this post by Németh László-2
Hello,

2009/4/25 ge <[hidden email]>:
> Hello,
>
> I am glad to hear that there is such a tool, and I think it will be the tool of choice for Turanian (agglutinating) languages.

I hope so. For example, the Hunspell morphological analyzer of OOo
3.1 will help to distinguish compounds from the simple (but numerous)
affixed forms of Hungarian words.

>
> I checked it out, and find the concept promising.

The concept was to develop a simple, portable, yet powerful pattern
matching and replacement tool for grammar checking tasks. The idea and
the syntax of the regex name definitions come from Flex, extended
with the simplified back reference notation.

>
> I am just missing one thing: I could not find any description of command line usage. That is the basis for using the tool in other systems, like Notepad++, KWrite, AbiWord and lots of others.
>
> Could you please provide such a description?

There is no command line interface yet; I have to provide one first.
In the meantime, you can use this shell script:

----------------------- lightproof-1.0/lightproof ---------------------
#!/bin/bash
python pythonpath/Compile.py \
  <(cat $1; sed 's/->/=>/g' | awk '{print "TEST: ", $0, "->", $0 }') \
  2>&1 >/dev/null | grep -v '^Warning: Non matched'
--------------------------------------------------------------------------------

Usage:

~/lightproof-1.0$ ./lightproof data/en_US.dat
Sample  input from the standard input: 45463432 a honest
[Ctrl-D]
Failed test in line 93:
    TEST: [u'Sample   input from the standard input: 45463432 a honest']
EXPECTED: [u'Sample  input from the standard input: 45463432 a honest']
  RESULT: [u'Sample input from the standard input: 45,463,432\\n45\xa0463\xa0432 an honest']
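
The trick of the script is that every input line becomes a TEST that expects the line to stay unchanged, so each rule that fires is reported as a failed test. The sed and awk part can be sketched in Python like this; to_test_lines is a hypothetical helper that only mirrors the text transformation and does not call Compile.py:

```python
def to_test_lines(text):
    """Mimic: sed 's/->/=>/g' | awk '{print "TEST: ", $0, "->", $0 }'"""
    lines = []
    for line in text.splitlines():
        # Keep "->" free for the separator of the TEST syntax.
        line = line.replace("->", "=>")
        # awk joins the print arguments with single spaces.
        lines.append(" ".join(["TEST: ", line, "->", line]))
    return lines

print(to_test_lines("a honest"))  # ['TEST:  a honest -> a honest']
```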

Regards,
László

>
> Thanks, Eleonora

Re: Lightproof grammar checker 1.0

Németh László-2
In reply to this post by Olivier R.-2
Hi,

2009/4/28 Olivier R. <[hidden email]>:
> Hi,
>
> I wrote some rules for the French language, and tried the extension.
> This tool looks promising.

I'm very glad to hear it.

>
> Minor issues:
>
> *1*
> In the documentation Convert.py should be replaced by Compile.py

Indeed. Thanks.

>
>
> *2*
> The script Locales.py does not seem to work.
>
> The command:
>
>  Locales.py ..\data\*.dat >lightproof_lang.py
>
> generates only:
>
>  locales = {'': ['', '', '']}
>
> under Vista with Python 2.6.1
>
> So I edited the file lightproof_lang.py manually to make the extension work.

It seems the Vista command line doesn't expand wildcards, so the script
receives the literal *.dat parameter. I will add a flag to the program
to scan the folder. A workaround:

python pythonpath\Locales.py fr_FR.dat en_US.dat >pythonpath\lightproof_lang.py
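
Because cmd.exe passes wildcards to programs unexpanded, a script can also expand them itself. A minimal sketch with the standard glob module; expand_args is a hypothetical helper, not code from Locales.py:

```python
import glob

def expand_args(args):
    """Expand wildcards ourselves; keep arguments without a match unchanged."""
    files = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        files.extend(matches if matches else [arg])
    return files

# Usage inside a script:
# import sys
# dat_files = expand_args(sys.argv[1:])
```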

>
>
> *3*
> The website launchpad.net/lightproof has disappeared?

Unfortunately, it was deactivated for a few days by the site maintainers.

>
>
>
>> Please, report it, if you have any special language problem with the
>> grammar checker (morphological and syntactic rules aren't supported
>> yet).
>
> Is this extension planned to use the grammatical fields of Hunspell
> dictionaries, or will it use another specific dictionary?

OOo 3.1 will support the XML query interface of Hunspell 1.2.8 for
stemming and morphological analysis/generation (see man 3 hunspell). I
plan to add these features to Lightproof with the following syntax:

v OpenOffice(-\w+)?\b   # "OpenOffice" or affixed forms of OpenOffice in Hungarian
w OpenOffice.org        # correct stem
{v} -> {w.gen(v.morph)}   # suggest "OpenOffice.org" with the correct affixes instead of "OpenOffice"

# v.morph is the result of the morphological analysis of the matched "OpenOffice" word form
# w.gen([category]) will suffix OpenOffice.org(=w) based on the category code

TEST: OpenOffice-szal -> OpenOffice.org-gal  # "with OpenOffice" and "with OpenOffice.org" in Hungarian

It is also possible to add, optionally, the fsa package used by
LanguageTool or other Python/C libraries and functions, or to fork the
project for a language with special requirements or libraries.

Thanks and best regards,
László

>
>
> Best regards,
> Olivier R.