C replacement for substrings.pl

Nanning Buitenhuis
Hi,

I wrote a C replacement for substrings.pl.
Although it uses the same algorithm, it is considerably faster (0.04 s
versus 1.09 s of user time on hyphen.us):

$ time ./substrings hyphen.us hyphen.new
0.04user 0.00system 0:00.05elapsed 84%CPU (0avgtext+0avgdata 0maxres)k
0inputs+0outputs (0major+381minor)pagefaults 0swaps

$ time perl substrings.pl hyphen.us hyphen.mashed
1.09user 0.00system 0:01.13elapsed 97%CPU (0avgtext+0avgdata 0maxres)k
0inputs+0outputs (0major+832minor)pagefaults 0swaps

It also fixes a minor bug in combine(): if a sub-pattern is found twice
(or more) in the main pattern, then all occurrences were changed instead
of (the correct) last occurrence. The only example in hyphen.us is 'tanta3'.

Other caveats are:
- the output of the C version is sorted in Unicode order
- the input should be UTF-8

Anybody interested?
   NaNning.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: C replacement for substrings.pl

Daniel Naber-4
On Wednesday 26 July 2006 13:17, Nanning Buitenhuis wrote:

> It also fixes a minor bug in combine(): if a sub-pattern is found twice
> (or more) in the main pattern, then all occurrences were changed instead
> of (the correct) last occurrence. The only example in hyphen.us is 'tanta3'.

I'm not that familiar with the algorithm, so: does that have an effect on
the final result, i.e. the way hyphenation works?

Regards
 Daniel

--
http://www.danielnaber.de

Re: C replacement for substrings.pl

Nanning Buitenhuis

>> It also fixes a minor bug in combine(): if a sub-pattern is found twice
>> (or more) in the main pattern, then all occurrences were changed instead
>> of (the correct) last occurrence. The only example in hyphen.us is 'tanta3'.
>
> I'm not that familiar with the algorithm, so: does that have an effect on
> the final result, i.e. the way hyphenation works?

There are two differences:
1) the output file is sorted (the Perl output wasn't)
2) 'tant3a' + '1ta' gets converted to 'tan1t3a' instead of '1tan1t3a'. As
the algorithm tries to find a right-side match, this seems to be the
correct solution. The Perl code found the right-side 'ta' and then
upgraded _all_ 'ta's in the main expression.
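
To make the second difference concrete, here is a minimal C sketch of the
merge step as described above. It is not the code from either program
(neither combine() implementation is shown in this thread). The names
combine_sketch, split_pattern, merge_at and join_pattern and the MAXPAT
limit are made up for illustration, and single-digit levels are assumed.
It merges the sub-pattern's levels into the main pattern, keeping the
larger level at each position, either at the last occurrence only or at
every occurrence.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAXPAT 64  /* illustrative limit on the number of letters */

/* Split a pattern such as "tant3a" into its letters ("tanta") and one
 * level per position around the letters (0 where no digit is given).
 * Returns the number of letters n; levels[] receives n + 1 entries. */
static size_t split_pattern(const char *pat, char *letters, int *levels)
{
    size_t n = 0;
    levels[0] = 0;
    for (; *pat != '\0'; pat++) {
        if (isdigit((unsigned char)*pat)) {
            levels[n] = *pat - '0';
        } else {
            assert(n + 1 < MAXPAT);
            letters[n++] = *pat;
            levels[n] = 0;
        }
    }
    letters[n] = '\0';
    return n;
}

/* Merge the sub-pattern levels into the main pattern at letter offset
 * off, keeping the larger level at each position. */
static void merge_at(int *plev, const int *slev, size_t sublen, size_t off)
{
    for (size_t k = 0; k <= sublen; k++)
        if (slev[k] > plev[off + k])
            plev[off + k] = slev[k];
}

/* Write letters and non-zero levels back out as a pattern string. */
static void join_pattern(const char *letters, const int *levels,
                         size_t n, char *out)
{
    for (size_t i = 0; i <= n; i++) {
        if (levels[i] > 0)
            *out++ = (char)('0' + levels[i]);
        if (i < n)
            *out++ = letters[i];
    }
    *out = '\0';
}

/* Merge sub into pat and store the result in out (at least 2 * MAXPAT
 * bytes). all_matches != 0 merges at every occurrence (the behaviour
 * reported for the Perl script); all_matches == 0 merges only at the
 * last, right-most occurrence. */
void combine_sketch(const char *pat, const char *sub,
                    int all_matches, char *out)
{
    char pl[MAXPAT], sl[MAXPAT];
    int plev[MAXPAT + 1], slev[MAXPAT + 1];
    size_t pn = split_pattern(pat, pl, plev);
    size_t sn = split_pattern(sub, sl, slev);

    if (sn <= pn) {
        /* Scan candidate offsets from right to left. */
        for (size_t off = pn - sn + 1; off-- > 0; ) {
            if (strncmp(pl + off, sl, sn) == 0) {
                merge_at(plev, slev, sn, off);
                if (!all_matches)
                    break;
            }
        }
    }
    join_pattern(pl, plev, pn, out);
}
```

With an output buffer of 2 * MAXPAT bytes, combine_sketch("tant3a", "1ta", 0, out)
gives "tan1t3a", and passing 1 for all_matches gives "1tan1t3a", the two
results quoted above.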

I just discovered that it is not supposed to work with UTF-8, but with
8-bit character sets. I will fix the code so that it works with both (it
is a pity that the OO code is 8-bit). The speed will not change much.
The reason I rewrote it is that we're using the OO hyphenation code for
a new TeX version, which will be UTF-8/Unicode based.




Re: C replacement for substrings.pl

nemeth-2
In reply to this post by Nanning Buitenhuis
Hi,

Many thanks for your work and bug report! Could you send me
a fixed substrings.pl? (If not, I will fix it.)
It would be good to test your code on the huhyphn Hungarian hyphenation
patterns (>60000 patterns, http://www.tipogral.hu/huhyphn.tex).

I think the sorting order and the input format don't matter,
but maybe you didn't use the newer substrings.pl of OpenOffice.org 2.0.2
with non-standard hyphenation and Unicode support (only non-standard
hyphenation pattern processing needs special UTF-8 code, because
the 8-bit hyphenation algorithm handles the UTF-8 patterns correctly).
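
A small illustration of why byte-oriented matching can cope with UTF-8
patterns: UTF-8 is self-synchronizing, so a valid UTF-8 needle found by a
plain byte search in valid UTF-8 text always starts on a character
boundary. The C sketch below is only an illustration and is not taken
from substrings.pl or Libhnj; byte_find and at_char_boundary are
hypothetical helper names.

```c
#include <assert.h>
#include <string.h>

/* Byte offset of the first occurrence of needle in haystack, or -1.
 * strstr() compares raw bytes and knows nothing about UTF-8. */
long byte_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}

/* True if the byte at s[off] is not a UTF-8 continuation byte
 * (10xxxxxx), i.e. it starts a character or is the terminating NUL. */
int at_char_boundary(const char *s, long off)
{
    return ((unsigned char)s[off] & 0xC0) != 0x80;
}
```

For example, searching the UTF-8 bytes of the Hungarian word "árvíztűrő"
for the bytes of "íz" with byte_find returns a byte offset for which
at_char_boundary is true, although the search itself is purely 8-bit.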

Standalone hyphenator and extended substrings.pl:
http://www.openoffice.org/nonav/issues/showattachment.cgi/33618/altlinuxHyph2.tar.gz
See also Issue 58558: http://www.openoffice.org/issues/show_bug.cgi?id=58558.

I'm interested in your TeX hyphenation development, because I plan
a TeX prehyphenator with non-standard hyphenation, word
disambiguation and compound word decomposition.
I also plan a compound word decomposition preprocessor for the
OpenOffice.org hyphenator, based on Hunspell and the non-standard
hyphenation extension of OOo 2.0.2. With compound word decomposition,
we will be able to make small hyphenation dictionaries along with
accurate compound word hyphenation. (At present the Huhyphn patterns
with Libhnj use 9 MB of memory, because compound word decomposition
is missing.)

Best regards,

Laci







Re: C replacement for substrings.pl

Nanning Buitenhuis
 > It would be good to test your code on the huhyphn Hungarian hyphenation
 > patterns (>60000 patterns, http://www.tipogral.hu/huhyphn.tex).

$ time perl substrings.pl huhyphn.tex  huhyphn-perl.mashed
  22.01user 0.07system 0:22.32elapsed 98%CPU (0avgtext+0avgdata 0maxres)k
  0inputs+0outputs (0major+5800minor)pagefaults 0swaps
$ time ./substrings-8bit huhyphn.tex  huhyphn-c8bit.mashed
  1.38user 0.03system 0:01.53elapsed 92%CPU (0avgtext+0avgdata 0maxres)k
  0inputs+0outputs (0major+4100minor)pagefaults 0swaps

 > I think the sorting order and the input format don't matter,
 > but maybe you didn't use the newer substrings.pl of OpenOffice.org 2.0.2
 > with non-standard hyphenation and Unicode support (only non-standard
 > hyphenation pattern processing needs special UTF-8 code, because
 > the 8-bit hyphenation algorithm handles the UTF-8 patterns correctly).

You're right, except for a silly feature in my code (debugging info). The C code is
now indifferent to the encoding of the input (as long as it isn't EBCDIC).
The fact that the output is sorted is a side effect, not a feature. It
slightly slows down the UTF-8 case as all invalid UTF-8 sequences are checked too.

 > I'm interested in your TeX hyphenation development, because I plan
 > a TeX prehyphenator with non-standard hyphenation, word
 > disambiguation and compound word decomposition.
 > I also plan a compound word decomposition preprocessor for the
 > OpenOffice.org hyphenator, based on Hunspell and the non-standard
 > hyphenation extension of OOo 2.0.2. With compound word decomposition,
 > we will be able to make small hyphenation dictionaries along with
 > accurate compound word hyphenation. (At present the Huhyphn patterns
 > with Libhnj use 9 MB of memory, because compound word decomposition
 > is missing.)

This is for Taco's MetaTeX project, funded by Colorado State University.
