Writer - Word Frequency?

Writer - Word Frequency?

Harold Fuchs-6
Is there an extension (or other software) that will produce a word
frequency table in Writer (2.4.1 or 3.x)? Where, please?
Note: I do not mean a word count but a list of the number of times each
word is used in a document.

--
Harold Fuchs
London, England
Please reply *only* to [hidden email]

Re: Writer - Word Frequency?

Marcin Miłkowski
Save as a text file, and run this awk script on it from the command line
(gawk -f <scriptfile> <filename.txt>):

----------
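# Print a list of word frequencies
{
    # NF is the number of fields (words) on the current line;
    # $i is the i-th word, whereas $0 would be the whole line.
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}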

--------------

To get better results, you could remove all punctuation with a simple
search and replace before saving as a text file. An extension would be
easy to write, but it would be a nightmare in a language without hash
tables like the ones awk uses.

Best
Marcin




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Writer - Word Frequency?

Harold Fuchs-6
Thanks, but it's not exactly what I had in mind. As far as I know,
extensions to OOo can be written in Java, which, again as far as I know,
can handle the associative array you used in your awk example. So, for
someone familiar with the ODF structure and API, writing such an
extension should be quite simple. Or ???

In addition, OOo can already produce a word *count* so it knows what a
"word" is ...

Harold Fuchs
London, England
Please reply *only* to [hidden email]








Re: Writer - Word Frequency?

Marcin Miłkowski
Well, as I'm developing LanguageTool, I know too well that this is not
so trivial, as we painfully found out :( You would have to traverse the
whole document (including tables and footnotes) to find the text. We
didn't find a proper way to do it and were happy to abandon this as soon
as the new API was available. Actually, exporting to a text document and
running a script would be the easier option from a developer's point of
view. Or even using a standalone Java program that parses the ODF file
as XML.
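
The standalone "parse ODF as XML" route can be sketched as follows, in
Python rather than Java for brevity. An .odt file is a zip archive whose
body text lives in content.xml, with paragraphs and headings stored as
text:p and text:h elements. The function names are hypothetical, and
this is a rough sketch rather than a complete reader: tabs and line
breaks stored as their own elements are ignored, and footnote text is
folded into the paragraph that contains the footnote.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the ODF text elements (text:p, text:h, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def _paragraph_texts(element):
    """Yield the text of each outermost paragraph or heading."""
    if element.tag in PARAGRAPH_TAGS:
        # itertext() also collects text nested in spans, footnotes, etc.;
        # not descending further avoids counting nested paragraphs twice.
        yield "".join(element.itertext())
    else:
        for child in element:
            yield from _paragraph_texts(child)


def odt_paragraphs(source):
    """Yield paragraph texts from an .odt file (a path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    yield from _paragraph_texts(root)
```

Because tables are made of cells that themselves contain text:p
elements, table text is picked up by the same walk without special
handling.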

If you have to create frequency lists very often, then maybe it
could make some sense to create the kind of extension you describe.
What would be the use of the frequency list? I simply cannot see a
realistic usage scenario for a non-scripting environment.

Regards
Marcin




Re: Writer - Word Frequency?

brandelune

On Wednesday, 7 Jan 2009, at 08:42, Marcin Miłkowski wrote:

> If you have to create frequency lists very frequently, then maybe it
> could make some sense to create such an extension that you describe.
> What would be the use of the frequency list?

Glossary creation. Translators (in general) use that. Such a list
would be created for any and all documents that need to be translated.





Jean-Christophe Helary

------------------------------------
http://mac4translators.blogspot.com/



Re: Writer - Word Frequency?

Marcin Miłkowski
Jean-Christophe Helary wrote:
>
> On Wednesday, 7 Jan 2009, at 08:42, Marcin Miłkowski wrote:
>
>> If you have to create frequency lists very frequently, then maybe it
>> could make some sense to create such an extension that you describe.
>> What would be the use of the frequency list?
>
> Glossary creation. Translators (in general) use that. Such a list would
> be created for any and all documents that needs to be translated.

I'd rather create lists of repetitions and word clusters than a
single-word frequency list. You'd need to filter out stopwords as well
to get sensible results from that.
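
A rough Python sketch of what stopword filtering and word clusters could
look like. Everything here is illustrative: the stopword list is a tiny
English placeholder, the function names are made up, and treating a
"cluster" as a run of consecutive words that neither starts nor ends
with a stopword is just one possible definition.

```python
from collections import Counter

# A tiny placeholder list; a real one would be language-specific.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def content_word_counts(words):
    """Count single words, skipping stopwords."""
    return Counter(w for w in words if w not in STOPWORDS)


def cluster_counts(words, size=2):
    """Count runs of `size` consecutive words (words must be a list).

    Runs that start or end with a stopword are dropped, so "of the"
    is ignored while "word list" is kept.
    """
    runs = zip(*(words[i:] for i in range(size)))
    return Counter(
        " ".join(run)
        for run in runs
        if run[0] not in STOPWORDS and run[-1] not in STOPWORDS
    )
```

Both functions expect words that are already lower-cased and stripped of
punctuation.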

I think the developer of Anaphraseus could do something like that, as
this is the only CAT system that runs directly under OOo (though it has
its problems with 3.x, as I've heard). I suppose he has solved the
text-traversal problem effectively.

Regards
Marcin


Re: Writer - Word Frequency?

ge-7
In reply to this post by Harold Fuchs-6
Harold,

In that case you should file an enhancement request
through the standard bug/enhancement channel for OOo.

-Eleonora





Re: Writer - Word Frequency?

Olivier R.
In reply to this post by Harold Fuchs-6
Hi Harold,

The extension Linguist can list all words in a document, list all unrecognized words in a document, calculate number of sentences, etc.
http://extensions.services.openoffice.org/project/Linguist

I modified the extension to count the number of occurrences of each word:
Linguist-1.2.2-wordscount.oxt

Hope this helps. ;)

Regards,
Olivier R.


Re: Writer - Word Frequency?

Mathias Bauer
In reply to this post by Marcin Miłkowski
Hi Marcin,

Marcin Miłkowski wrote:

> Well, as I'm developing LanguageTool, I know too well that this is not
> so trivial - as we painfully found out :( You would have to traverse the
> whole document (including tables and footnotes) to find the text - we
> didn't find a proper way to do it and were happy to abandon this as soon
> as new API was available.

The new "Proof reading" feature in OOo uses a new API that provides the
whole document as so-called "flat paragraphs". It works on Writer
documents only, but it is perhaps exactly the API you need for your
traversal.

This API will give you the whole text of the document, including tables,
text frames, footnotes etc., separated into paragraphs.

You still need to parse this text yourself to find the word breaks,
but I think you can use the break iterator or any other code that is
able to do that.
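
The word-breaking step can be illustrated independently of the OOo API.
In the Python sketch below, a plain list of strings stands in for the
flat paragraphs and a regular expression stands in for the break
iterator. Neither is the actual OOo interface, and the pattern and
function name are assumptions made for illustration only.

```python
import re
from collections import Counter

# Unicode-aware word pattern: runs of letters/digits, optionally joined
# by an internal apostrophe or hyphen ("don't", "well-known").
WORD = re.compile(r"\w+(?:['’-]\w+)*")


def count_words(paragraphs):
    """Accumulate word counts over an iterable of paragraph strings."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(m.group().lower() for m in WORD.finditer(paragraph))
    return counts
```

A real break iterator is locale-aware, which matters for languages where
whitespace and punctuation do not delimit words; the regular expression
here does not attempt that.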

Regards,
Mathias

--
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to "[hidden email]".
I use it for the OOo lists and only rarely read other mails sent to it.


Re: Writer - Word Frequency?

Marcin Miłkowski
Mathias Bauer wrote:

> Hi Marcin,
>
> Marcin Miłkowski wrote:
>
>> Well, as I'm developing LanguageTool, I know too well that this is not
>> so trivial - as we painfully found out :( You would have to traverse the
>> whole document (including tables and footnotes) to find the text - we
>> didn't find a proper way to do it and were happy to abandon this as soon
>> as new API was available.
> The new "Proof reading" in OOo uses a new API that provides the whole
> document in so called "flat paragraphs". It works on Writer documents
> only but perhaps is exactly the API you might need for your traversing.

Yes, I didn't think of using the *new* API for counting words. With the
new API, it's trivially simple, as the new API is quite good :)

Regards
Marcin
