scripted multiplatform .doc to .html conversion


Kirk Is
So the folks at my new job decided to really give me a trial by
fire...they'd like me to put together a clear and detailed outline of
how to include .doc to .html conversion in our product, in an automated
kind of way.

OpenOffice seems to handle the basic task gracefully through the UI.
Can anyone tell me if there's a way to run the same conversion from
the command line?  Or, possibly even better, is there a specific
callable module responsible for this, and is there an intermediate
in-memory format that can be marshalled/unmarshalled to and from the
various file formats?

I'm at a bit of a loss to know where to start code diving...would it
be a better idea for a n00b to start using the CVS feed, or is there a
downloadable archive lurking around on one of the websites?

Thanks for any and all advice!  I'm really in dire straits here, so
suggestions are acts of mercy...

-Kirk

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: scripted multiplatform .doc to .html conversion

Laurent Godard-3
Hi

> Openoffice seems to handle the basic task gracefully through the UI.
> Can anyone tell me if there's a commandline version that would enable
> this from the commandline?  Or, possibly even better, is there a
> specific callable module responsible for this, is there an
> intermediate in-memory format that can be marshalled/unmarshalled with
> the various file formats?
>

you may want to have a look at this, as a very first shot:
http://oooconv.free.fr/oooconv/oooconv_en.html

Laurent

--
Laurent Godard <[hidden email]> - Ingénierie OpenOffice.org
Indesko >> http://www.indesko.com
Nuxeo CPS >> http://www.nuxeo.com - http://www.cps-project.org
Livre "Programmation OpenOffice.org", Eyrolles 2004


Re: scripted multiplatform .doc to .html conversion

Juergen Schmidt-3
In reply to this post by Kirk Is
Hi Kirk,

take a look at the SDK example java\DocumentHandling\DocumentConverter.
You can easily implement a Java remote client application that does the
conversion for you. But you always need an installed office working as a
server (for example with UI if necessary).
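
A rough sketch of what "an office working as a server" involves on the
client side, assuming the office is started with a socket -accept string.
The port 8100, the soffice path and the class name OfficeServerSketch are
placeholders chosen for illustration, not values from this thread. The class
only builds the command line and the matching UNO URL; it does not start
anything and does not use the UNO jars.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: the strings needed to run an office as a server and to reach it. */
final class OfficeServerSketch {

    /** Command line that makes the office listen on a local socket (path and port are assumptions). */
    static List<String> launchCommand(String sofficePath, int port) {
        return Arrays.asList(
                sofficePath,
                "-headless", // run without UI; newer LibreOffice builds spell this --headless
                "-accept=socket,host=localhost,port=" + port + ";urp;");
    }

    /** UNO URL a remote Java client would use to reach the instance started above. */
    static String unoUrl(int port) {
        return "uno:socket,host=localhost,port=" + port + ";urp;StarOffice.ServiceManager";
    }

    public static void main(String[] args) {
        // Printing only. On a machine with OOo present, the process could be
        // started with new ProcessBuilder(launchCommand(...)).start().
        System.out.println(String.join(" ", launchCommand("/opt/openoffice/program/soffice", 8100)));
        System.out.println(unoUrl(8100));
    }
}
```

With the OOo jars on the classpath, a client would hand that URL to the UNO
URL resolver to obtain the remote service manager, which is roughly the step
the SDK example performs before it loads and stores documents.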

Juergen



Re: scripted multiplatform .doc to .html conversion

Kirk Is
In reply to this post by Laurent Godard-3
On 12/9/05, Laurent Godard <[hidden email]> wrote:
>
> you may want to have a look at this, as a very first shot:
> http://oooconv.free.fr/oooconv/oooconv_en.html

So that's a PHP web page plus a macro that runs in an existing instance
of OOo, which together make a web application for that kind of conversion?


Re: scripted multiplatform .doc to .html conversion

Kirk Is
In reply to this post by Juergen Schmidt-3
On 12/9/05, Jürgen Schmidt <[hidden email]> wrote:
> Hi Kirk,
>
> take a look at the SDK example java\DocumentHandling\DocumentConverter.
> You can easily implement a Java remote client application that does the
> conversion for you. But you always need an installed office working as a
> server (for example with UI if necessary).

Hmm. Is your feeling, then, that "just" the document functionality
might be too difficult to extract at the source code level?


Re: scripted multiplatform .doc to .html conversion

Juergen Schmidt-3
In reply to this post by Kirk Is
Kirk Israel wrote:

> On 12/9/05, Jürgen Schmidt <[hidden email]> wrote:
>
>>Hi Kirk,
>>
>>take a look at the SDK example java\DocumentHandling\DocumentConverter.
>>You can easily implement a Java remote client application that does the
>>conversion for you. But you always need an installed office working as a
>>server (for example with UI if necessary).
>
>
> Hmm. Is your feeling, then, that "just" the document functionality
> might be too difficult to extract at the source code level?

Yes, exactly: the current architecture doesn't allow extracting only this
small part. Maybe it will be possible some time in the future ;-)
To use the API you always need a running office instance. The other
possibility is to work directly on the XML file format with XSL
transformations, but that of course is not possible for most of the
binary formats.
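
The XSL route can be sketched with nothing but the JDK. The example below
assumes a drastically simplified stand-in for an OpenDocument content.xml
that contains only text:p paragraphs; the stylesheet and the class name
XslToHtmlSketch are made up for illustration. A real document would first
have to be unzipped and would need a far larger stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Sketch only: turn text:p paragraphs of an XML document into HTML paragraphs via XSLT. */
final class XslToHtmlSketch {

    // Illustrative stylesheet: every text:p in the input becomes one HTML p element.
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'"
        + " exclude-result-prefixes='text'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'><html><body>"
        + "<xsl:apply-templates select='//text:p'/>"
        + "</body></html></xsl:template>"
        + "<xsl:template match='text:p'><p><xsl:value-of select='.'/></p></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML text and returns the resulting markup. */
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not have to deal with a checked exception.
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }
}
```

This only works because the input is already XML, which is why it does not
help with binary formats such as .doc.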

Juergen


Re: scripted multiplatform .doc to .html conversion

Laurent Godard-3
In reply to this post by Kirk Is
Hi Kirk

>
> So that's a webpage in PHP, and macro for use in an existing instance
> of OOo, making a web application for that kind of conversion?
>

It's a 2 1/2 year old first shot, just to give some ideas.
It can be done better, of course (and when I have time I will release a
tool like this based on Python, XML-RPC & OOo).

Laurent

--
Laurent Godard <[hidden email]> - Ingénierie OpenOffice.org
Indesko >> http://www.indesko.com
Nuxeo CPS >> http://www.nuxeo.com - http://www.cps-project.org
Livre "Programmation OpenOffice.org", Eyrolles 2004


Re: scripted multiplatform .doc to .html conversion

Kirk Is
In reply to this post by Juergen Schmidt-3
On 12/9/05, Jürgen Schmidt <[hidden email]> wrote:

> Kirk Israel wrote:
> > On 12/9/05, Jürgen Schmidt <[hidden email]> wrote:
> >
> > Hmm. Is your feeling then, that "just" the document functionality
> > might too difficult to extract on a source code level?
>
> Yes, exactly: the current architecture doesn't allow extracting only this
> small part. Maybe it will be possible some time in the future ;-)
> To use the API you always need a running office instance. The other
> possibility is to work directly on the XML file format with XSL
> transformations, but that of course is not possible for most of the
> binary formats.

Sorry, I'm not being willfully dense here...I understand that if I'm
doing this through the API, there has to be an instance of OOo
running. But are you saying that the segment of the source responsible
for reading in Doc (and the other segment, responsible for spitting
out HTML) is so tightly coupled with the rest of the system that
extracting those two segments isn't feasible? That saying "aha, THIS
is the conversion function" wouldn't get you anywhere, because it
depends on so much other stuff working in order to run?

Dang, if that IS the case my manager isn't going to like that I'm
shooting down the team's preferred cool new idea :-)

Thanks,
Kirk


Re: scripted multiplatform .doc to .html conversion

Mathias Bauer
Kirk Israel wrote:

> Sorry, I'm not being willfully dense here...I understand that if I'm
> doing this through the API, there has to be an instance of OOo
> running, but are you saying that the segment of the source responsible
> for reading in Doc (and the other segment, responsible for spitting
> out HTML) is so tightly coupled with the rest of the system as a whole
> that extracting those two segments isn't feasible, that saying "aha,
> THIS is the conversion function" wouldn't get you anywhere, because it
> depends on so much other stuff working to run?

I think you have a misconception about how document conversion in OOo
works. There is no direct "translation" between input and output format:
input filters always convert the input format into a representation in
memory (the "core" of a document), and the output filter converts this
into the output format. If you think about this a little you will see
that anything else wouldn't make sense. In the end OOo is an application
and not a conversion service: why should there be code that directly
translates from e.g. doc to html? OOo itself doesn't need such code.
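
The two-stage design can be illustrated with a toy model. This is not OOo
code: DocumentCore, InputFilter, OutputFilter and the two sample filters are
hypothetical names, and the "core" here is just a list of paragraphs. The
point is only that no code path goes from input format to output format
directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "input filter -> in-memory core -> output filter". */
final class FilterPipelineSketch {

    /** Stand-in for the in-memory core of a document: here just a list of paragraphs. */
    static final class DocumentCore {
        final List<String> paragraphs = new ArrayList<>();
    }

    interface InputFilter { DocumentCore read(String source); }

    interface OutputFilter { String write(DocumentCore core); }

    /** Toy input filter: every non-empty line becomes one paragraph. */
    static final InputFilter PLAIN_TEXT_IN = source -> {
        DocumentCore core = new DocumentCore();
        for (String line : source.split("\n")) {
            if (!line.trim().isEmpty()) {
                core.paragraphs.add(line.trim());
            }
        }
        return core;
    };

    /** Toy output filter: writes each paragraph as an HTML p element (no escaping). */
    static final OutputFilter HTML_OUT = core -> {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (String paragraph : core.paragraphs) {
            sb.append("<p>").append(paragraph).append("</p>");
        }
        return sb.append("</body></html>").toString();
    };

    /** A conversion is always read-into-core followed by write-from-core. */
    static String convert(String source, InputFilter in, OutputFilter out) {
        return out.write(in.read(source));
    }
}
```

Both filters depend on DocumentCore, which is the toy version of why the
filters cannot be lifted out without the document core.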

So it will never make sense to isolate the filter code; you always
need the code of the document core as well. Theoretically it is possible
to take the code of the filters and the core and make it a smaller
package, but until now nobody has needed something like this so badly
that he started the work to create such an environment. You will need
some kind of application anyway, you will need UNO and its
bootstrapping, you will need some of the services in OOo used by the
filters, etc.

So it's possible, but it is quite some work, and all you would gain
from it is that you save some MB on disk.
Is that worth the effort?

BTW: you don't need an *installed* version of OOo on your machine, it's
enough to have a runnable *copy* (though in this case you have to create
each UNO connection manually because your system doesn't provide a hint
where the OOo installation is).

Best regards,
Mathias

--
Mathias Bauer - OpenOffice.org Application Framework Project Lead
Please reply to the list only, [hidden email] is a spam sink.



Re: scripted multiplatform .doc to .html conversion

Kirk Is
Mathias, thank you for your feedback...I have a few responses.

> I think you have a misconception how document conversion in OOo works.
> There is no direct "translation" between input and output format, input
> filters always convert the input format into a representation in memory
> (the "core" of a document) and the output filter converts this into the
> output format. If you think about this a little bit you will see that
> anything else doesn't make sense, at the end OOo is an application and
> not a conversion service: why should there be code that directly
> translates from e.g. doc to html? OOo itself doesn't need such code.

I assumed that it would be a "doc to internal" unmarshalling followed
by an "internal to HTML" marshalling, for obvious reasons (like needing
2n filters rather than n(n-1) direct converters)...I guess I was
envisioning a small(ish) bit of code that would do something like this
(in pseudojava):

Document doc = OOoUtils.getDocument(DOC_CONVERTER,"somefile.doc");
OOoUtils.writeDocument(HTML_CONVERTER,doc,"output.html");

maybe with some Input/Output Streams or services instead, but that's
the general gist.
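
The filter-count argument can be made concrete. A minimal sketch, assuming
every one of n formats can be both read and written: direct translation
needs one converter per ordered pair of formats, n(n-1), while going through
an internal representation needs n import filters plus n export filters.
The class and method names are illustrative only.

```java
/** Arithmetic behind "filters via a core" versus "one converter per format pair". */
final class ConverterCountSketch {

    /** One dedicated converter for every ordered (from, to) pair of distinct formats. */
    static long directConverters(int formats) {
        return (long) formats * (formats - 1);
    }

    /** One import filter and one export filter per format. */
    static long filtersViaCore(int formats) {
        return 2L * formats;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 10, 50}) {
            System.out.println(n + " formats: " + directConverters(n)
                    + " direct converters vs " + filtersViaCore(n) + " filters");
        }
    }
}
```

For 10 formats that is 90 direct converters against 20 filters, and the gap
grows quadratically.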

> So it will never make sense to isolate the filter code, you always also
> need the code of the document core also. Theoretically it is possible to
> take the code of the filters and the core and make it a smaller package
> but until now nobody needed something like this so very badly that he
> started the work to create such an environment. You will need a kind of
> an application anyway and you will need UNO and its bootstrapping, you
> will need some of the services in OOo used by the filters etc.

I see what you're getting at: the conversion process isn't
self-contained but dependent on a series of services, structures, and
whatnot.

Just by reading some recent archives of this list, I'd say this kind
of scripting is fairly sought after...but maybe the people who want to
cherrypick the functionality aren't the same kind of people willing to
put in the work to make it an isolated tool.

> So it's possible but quite some work to do and all you earn from the
> work to make it happen would be that you save some MB on disk.
> Is that worth the effort?

Quite possibly not...I think it was a desire to more easily embed the
installation of "just the conversion stuff", rather than having OOo be
a separate install. If you could easily embed just a few filters and
some supporting classes at the source code level into a larger
project, that would make it more transparent to the user.

> BTW: you don't need an *installed* version of OOo on your machine, it's
> enough to have a runnable *copy* (though in this case you have to create
> each UNO connection manually because your system doesn't provide a hint
> where the OOo installation is).

Aha, good to know.

> Best regards,
> Mathias

Thank you!
Kirk


Re: scripted multiplatform .doc to .html conversion

Kirk Is
This project got backburnered but is now coming up again: the concept of
integrating OOo's doc to HTML conversion as seamlessly as possible into an
existing J2EE application.

My understanding is that OOo must be present (copied to the machine, but not
necessarily installed, based on Mathias' previous comments). At that point, it
should be fairly easy to go through the UNO libraries...is that about the size
of it? Am I missing anything, or are there any resources that might make this
easier?

Thanks,
Kirk

Re: scripted multiplatform .doc to .html conversion

Tom Schindl
Hi,

there's a fully functional code snippet available that shows how
document conversion can be done:

http://codesnippets.services.openoffice.org/Office/Office.ConvertDocuments.snip

If you are running this from a J2EE application you need to take into
consideration that ***one*** OO instance cannot deal with multiple
requests at the same time, so you must either:
- serialize access to OO, or
- create a pool of instances you connect to and serialize access to each of them
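
The first option, serializing access to a single instance, might look like
the sketch below. It uses only java.util.concurrent; the class name
SerializedOfficeAccess and the idea of passing the conversion in as a
Supplier are assumptions made for illustration, and the actual UNO calls
would go inside the supplied lambda.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Sketch only: lets at most one conversion at a time talk to the single office instance. */
final class SerializedOfficeAccess {

    // Fair lock, so waiting requests are served roughly in arrival order.
    private final ReentrantLock lock = new ReentrantLock(true);

    /** Runs the given conversion while holding the lock and returns its result. */
    <T> T withOffice(Supplier<T> conversion) {
        lock.lock();
        try {
            return conversion.get();
        } finally {
            lock.unlock(); // always release, even if the conversion throws
        }
    }
}
```

Every request thread in the application server would share one
SerializedOfficeAccess object per office instance, so throughput is limited
to one conversion at a time.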

Tom



Re: scripted multiplatform .doc to .html conversion

Kirk Is
Tom,
thanks, that is very cool.
I was able to get the snippet up and running...
Through trial and error I found the correct jars in my OOo directory that
I needed to compile against, and then Google indicated I needed to include
the "OOo/program" directory in the classpath.

Was there a "smarter way" I should have known the above?

And in terms of expanding on that so that it takes .doc as input (right now
it seems to only accept .odt) and produces HTML as output (currently not one
of the options listed in the program), are there any gotchas I should know
about, or is it just a matter of finding the appropriate API documentation
and doing the fairly obvious things?

This was a great first step, many thanks!
-Kirk





Re: scripted multiplatform .doc to .html conversion

Tom Schindl
Hi Kirk,

No, you simply have to find the right filters to use ;-)

A more appropriate place to ask is:
- [hidden email]
- http://api.openoffice.org/DevelopersGuide/DevelopersGuide.html
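
As an illustration of what "the right filters" means in practice, here is a
small lookup from output file extension to export filter name. The filter
names are the ones I recall for Writer documents in OOo 2.x and should be
checked against the filter list of the installation in use; the class name
ExportFilterSketch is hypothetical. The returned string is the kind of value
that is passed as the "FilterName" property when storing a document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: choose an export filter name for a Writer document from the target extension. */
final class ExportFilterSketch {

    // Assumed filter names for Writer documents; verify against your OOo version.
    private static final Map<String, String> WRITER_EXPORT_FILTERS = new HashMap<>();
    static {
        WRITER_EXPORT_FILTERS.put("html", "HTML (StarWriter)");
        WRITER_EXPORT_FILTERS.put("doc", "MS Word 97");
        WRITER_EXPORT_FILTERS.put("odt", "writer8");
        WRITER_EXPORT_FILTERS.put("pdf", "writer_pdf_Export");
    }

    /** Returns the filter name for the file's extension, or fails loudly if none is known. */
    static String filterFor(String outputFileName) {
        int dot = outputFileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : outputFileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String filter = WRITER_EXPORT_FILTERS.get(extension);
        if (filter == null) {
            throw new IllegalArgumentException("no export filter known for: " + outputFileName);
        }
        return filter;
    }
}
```

On the input side the office normally detects the file type by itself when
loading, so usually only the export filter has to be named explicitly.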

But once more: if you plan to use this snippet in a multi-threaded
environment like your J2EE server, you need to serialize access in your
application, or have a pool of OO instances and dispatch each
conversion to one of them.
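
The pool variant can be sketched with a blocking queue of idle connections.
The type parameter C stands for whatever object represents a connection to
one office instance; the class name OfficePoolSketch and its methods are
made up for this example and no UNO classes are involved.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Sketch only: hands each conversion to one idle office connection at a time. */
final class OfficePoolSketch<C> {

    private final BlockingQueue<C> idle;

    /** The collection must contain at least one connection. */
    OfficePoolSketch(Collection<C> connections) {
        idle = new ArrayBlockingQueue<>(connections.size(), true, connections);
    }

    /** Waits for a free connection, runs the conversion on it, then returns it to the pool. */
    <T> T dispatch(Function<C, T> conversion) {
        C connection;
        try {
            connection = idle.take(); // blocks while every instance is busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a free office instance", e);
        }
        try {
            return conversion.apply(connection);
        } finally {
            idle.add(connection); // a slot is free again because we took one out above
        }
    }

    /** Number of connections currently not in use. */
    int idleCount() {
        return idle.size();
    }
}
```

Because a connection is removed from the queue while in use, each office
instance still sees only one request at a time, while up to pool-size
conversions run in parallel.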

One more thing: I don't think that you need to have the OOo/program
directory in your classpath. The OpenOffice 2 libs should locate
soffice.bin themselves, but I could be mistaken here.

Tom



Re: scripted multiplatform .doc to .html conversion

Andreas Höhmann-2
In reply to this post by Kirk Is
Kirk Israel wrote:
>
> And in terms of expanding on that so that it have .doc as input (right now
> it seems to only accept .odt) and HTML as output (currently not one of the
> options listed in the program), are there any gotchas I should know about or
> is it just about finding some appropriate API documentation and doing the
> fairly obvious things?
>

Have a look at JOOConverter (http://jooreports.sourceforge.net/).


Re: scripted multiplatform .doc to .html conversion

Tom Schindl
Andreas Höhmann wrote:

> Kirk Israel wrote:
>
>>And in terms of expanding on that so that it have .doc as input (right now
>>it seems to only accept .odt) and HTML as output (currently not one of the
>>options listed in the program), are there any gotchas I should know about or
>>is it just about finding some appropriate API documentation and doing the
>>fairly obvious things?
>>
>
>
> have a look at jooconvert (http://jooreports.sourceforge.net/)
>
That's really cool stuff. I wasn't aware of this tool until now.
What one really would need is an OO add-on to edit complex documents using
OpenOffice.

Tom