Help needed - bulk extraction of words


Help needed - bulk extraction of words

Leif Lodahl
Hi all,
The Danish project has been fortunate enough to receive a large batch of
articles from a news magazine. These are ODT files, and we would like to
extract the words from them. We have programs for this purpose, but
we usually get donations one document at a time. This time we have
several thousand documents, and I believe it would take about a year to
load them one by one.

Does anyone have a program that can extract words from several documents at once?

The words will be loaded into our workflow for linguistic processing and
will end up in the Danish spelling dictionary.

Thanks in advance.

--
Med venlig hilsen - best regards,

Leif Lodahl
Native-Language coordinator DA.OpenOffice.org
Mail: [hidden email]
Blog: http://lodahl.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Help needed - bulk extraction of words

Marcin Miłkowski
Hi,

you would also need to convert these documents to plain text in order to
process them. You can try to spawn OOo for conversion in batch mode,
but the easier option is to use unzip in a script and take only
content.xml from each file. Then process the files using awk (define the
field separator just as you would define the word boundary) and filter
out all words that match <[a-z]+>. This should strip all the XML from the files.
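
For anyone who prefers a single script over unzip plus awk, here is a rough Python sketch of the same idea: read content.xml straight out of each .odt archive, blank out the XML tags, and collect the words. The function names (words_in_odt, count_words) and the regular expressions are my own assumptions, not an existing tool, and the tag stripping is deliberately crude: a word that is split by inline formatting inside the XML will come out in two pieces.

```python
import html
import re
import sys
import zipfile
from collections import Counter
from pathlib import Path

TAG = re.compile(r"<[^>]+>")      # any XML tag or processing instruction
WORD = re.compile(r"[^\W\d_]+")   # runs of Unicode letters, so æ, ø and å survive


def words_in_odt(path):
    """Return the words found in content.xml of one .odt file (hypothetical helper)."""
    with zipfile.ZipFile(path) as odt:
        xml = odt.read("content.xml").decode("utf-8")
    # Replace each tag with a space so neighbouring paragraphs do not merge.
    text = TAG.sub(" ", xml)
    # Turn entities such as &amp; back into characters before splitting.
    text = html.unescape(text)
    return WORD.findall(text)


def count_words(folder):
    """Count words over every .odt file below folder, skipping unreadable files."""
    counts = Counter()
    for path in sorted(Path(folder).rglob("*.odt")):
        try:
            counts.update(words_in_odt(path))
        except (zipfile.BadZipFile, KeyError) as err:
            # Not a zip archive, or no content.xml inside it.
            print(f"skipped {path}: {err}", file=sys.stderr)
    return counts
```

Calling count_words("articles") on a folder of donated files returns a Counter; printing sorted(counts) gives one unique word per line, which should be close to what the awk-and-sort route produces.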

Regards
Marcin

ge wrote:

> Hello,
>
> I have collected words several times from different sources (web sources).
>
> I use Linux, but these tools are also available
> for Windows as GNU tools.
>
> I used awk, like:
>  { for (i = 1; i <= NF; i++)   # NF is the number of fields on the line
>      print $i }
>
> This prints each word on its own line.
>
> Then I sorted the file using sort < infile > outfile
> and then used further awk scripts to get rid of word endings;
> this is probably much easier for Danish than for Hungarian.
>
> Good luck! Eleonora




Re: Help needed - bulk extraction of words

F Wolff
In reply to this post by Leif Lodahl
On Thursday 2008-02-07, Leif Lodahl wrote:

> Do any of you have a program that can extract words from several documents?

Hello Leif

My system has a command called sxw2txt which can simply print out the
plain text from a file. I also found a website with a tool called
odt2txt which might help:
http://stosberg.net/odt2txt/
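
To run such a converter over several thousand files, a small wrapper along these lines might do. It is only a sketch: it assumes the tool takes the file name as its last argument and writes the plain text to standard output in UTF-8, which is how I understand odt2txt to behave, so please check that on your own system. The function names and the command parameter are made up for this example.

```python
import subprocess
import sys


def text_from_converter(path, command=("odt2txt",)):
    """Run an external converter on one file and return what it printed.

    Assumes the converter is called as `<command> <file>` and prints
    the plain text on standard output.
    """
    result = subprocess.run(
        [*command, str(path)], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def convert_all(paths, command=("odt2txt",)):
    """Convert every file in paths; files that fail are reported and skipped."""
    texts = {}
    for path in paths:
        try:
            texts[str(path)] = text_from_converter(path, command)
        except (OSError, subprocess.CalledProcessError) as err:
            # The converter is missing, or it exited with an error.
            print(f"skipped {path}: {err}", file=sys.stderr)
    return texts
```

Because the command is a parameter, the same loop should also work with sxw2txt or any other converter that follows the same calling convention.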

Keep well
Friedel

