[lang guesser] Next steps

[lang guesser] Next steps

Jocelyn Merand
Hi all lingu-workers,

Since the official end of my SoC, I have taken a good week of rest. I'm now
returning to the keyboard...

For those who are not up to date on the language guesser component project:
I have successfully completed the SoC, and the component can guess the
language of quite short texts. But it is often not able to return *all* the
languages included in the text.

Here are suggestions and remarks from Thomas Lange about the component, just
to make it clearer and smarter (not to develop new features):

   - "If this is about sentences, using the XBreakIterator's functionality to
     identify sentence boundaries might be useful." (about breaking sentences
     into sets of words)

   - "Ok. You may use an STL container for this. Even though that would be
     C++, mixing them in the source code should be fine." (about N-Gram
     memory allocation; a small sketch follows after this list)

   - To have a look at a specific data structure called a Bloom filter to
     store N-Grams

   - Results for mixed-language texts are quite bad (general conclusion)
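
To make the STL-container suggestion concrete, here is a minimal sketch of
N-Gram counting and ranking (names and structure are only illustrative, and a
real implementation would work on Unicode strings rather than plain bytes):

#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count every N-gram of length n in an already normalized text.
std::map<std::string, int> countNGrams(const std::string& text, std::size_t n)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        ++counts[text.substr(i, n)];
    return counts;
}

// True if a occurs more often than b (used to rank N-grams).
static bool moreFrequent(const std::pair<std::string, int>& a,
                         const std::pair<std::string, int>& b)
{
    return a.second > b.second;
}

// Turn the counts into a rank-ordered fingerprint (most frequent first).
std::vector<std::string> rankNGrams(const std::map<std::string, int>& counts)
{
    std::vector<std::pair<std::string, int> > items(counts.begin(), counts.end());
    std::sort(items.begin(), items.end(), moreFrequent);

    std::vector<std::string> ranked;
    for (std::size_t i = 0; i < items.size(); ++i)
        ranked.push_back(items[i].first);
    return ranked;
}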

All these points make me doubtful about the value of using libtextcat for
the next version of the component: it is coded in C, and the code does not
seem to be designed for reuse or easy modification.

In addition, if we want to guess the languages of texts which are composed
of several different languages, we have to find typical text parts such as
quoted or bracketed word sequences. I expect there is a UNO component to do
that, isn't there? I chatted with Thomas LEBARBE – a researcher at Grenoble
University (France) – during OOoCon, and he suggested I use something he
called "virgulo", which is a kind of grammatical separator. I also thought,
at the beginning of the summer, when I was searching for a good way to guess
multi-language texts, that language changes often occur at the beginning or
end of grammatical blocks. So analyzing the text block by block should be a
possible way to improve the accuracy of multi-language guessing.
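
To illustrate the block-by-block idea, here is a very rough sketch (the
separator set is only an example, and guessLanguage is a placeholder for the
existing component):

#include <cstddef>
#include <string>
#include <vector>

// Split a text at "grammatical separators" (quotes, brackets, strong
// punctuation) so that each block can be guessed on its own.
std::vector<std::string> splitIntoBlocks(const std::string& text)
{
    const std::string separators = "\"'()[]{},;:.!?";
    std::vector<std::string> blocks;
    std::string current;

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (separators.find(text[i]) != std::string::npos)
        {
            if (!current.empty())
            {
                blocks.push_back(current);
                current.clear();
            }
        }
        else
            current += text[i];
    }
    if (!current.empty())
        blocks.push_back(current);
    return blocks;
}

// Usage:
//   std::vector<std::string> blocks = splitIntoBlocks(text);
//   for (std::size_t i = 0; i < blocks.size(); ++i)
//       guessLanguage(blocks[i]);   // placeholder for the existing guesser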

About the Bloom filter: it is very interesting, but it is not useful if you
want to get the rank (frequency) of all the N-Grams. Thank you for making me
aware of this curious data structure.
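
For reference, here is a minimal Bloom filter sketch that shows exactly this
limitation: it can answer "have I probably seen this N-Gram?" but not how
often, so it cannot provide frequency ranks (sizes and hash count are
arbitrary):

#include <bitset>
#include <cstddef>
#include <string>

class BloomFilter
{
    enum { NBITS = 65536, NHASHES = 3 };
    std::bitset<NBITS> bits;

    // Simple FNV-style hash, varied by seed; good enough for a sketch.
    static std::size_t hash(const std::string& s, unsigned seed)
    {
        std::size_t h = 2166136261u + seed * 16777619u;
        for (std::size_t i = 0; i < s.size(); ++i)
            h = (h ^ static_cast<unsigned char>(s[i])) * 16777619u;
        return h % NBITS;
    }

public:
    void insert(const std::string& ngram)
    {
        for (unsigned k = 0; k < NHASHES; ++k)
            bits.set(hash(ngram, k));
    }

    // true  -> probably present (false positives are possible)
    // false -> definitely not present; either way there is no count.
    bool maybeContains(const std::string& ngram) const
    {
        for (unsigned k = 0; k < NHASHES; ++k)
            if (!bits.test(hash(ngram, k)))
                return false;
        return true;
    }
};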

It sounds like a complete refactoring will be needed if we want to
implement new functionality and if we want to have real multi-guess
features. I propose to develop a complete C++ library, of course not from
scratch: I will take inspiration from libtextcat, especially for the
fingerprint comparison, which is implemented in libtextcat in a very
efficient way. Unfortunately, this algorithm is ad hoc and I think I will
have to look at it closely. So we would have, for example, a component
called "XFingerprintMaker" that would also be very useful for other
linguistic purposes.
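
For the fingerprint comparison itself, the underlying idea in libtextcat is
the classic N-Gram "out-of-place" measure from Cavnar & Trenkle. The sketch
below shows that measure in plain C++; it is not libtextcat's actual code,
and "XFingerprintMaker" is only a name proposal, not an existing interface:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Distance between a rank-ordered text fingerprint and a rank-ordered
// language fingerprint (most frequent N-Gram first).  The language with the
// smallest distance is the guess.
std::size_t outOfPlaceDistance(const std::vector<std::string>& textFp,
                               const std::vector<std::string>& langFp)
{
    // Rank of each N-Gram in the language fingerprint.
    std::map<std::string, std::size_t> langRank;
    for (std::size_t i = 0; i < langFp.size(); ++i)
        langRank[langFp[i]] = i;

    const std::size_t maxPenalty = langFp.size();
    std::size_t distance = 0;

    for (std::size_t i = 0; i < textFp.size(); ++i)
    {
        std::map<std::string, std::size_t>::const_iterator it =
            langRank.find(textFp[i]);
        if (it == langRank.end())
            distance += maxPenalty;   // N-Gram unknown to this language
        else
            distance += (it->second > i) ? it->second - i : i - it->second;
    }
    return distance;
}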

It is probably not worth sending everybody the present version of the
component, because I think it will be modified.

None of the things I have said here are the priority; of course, these are
next steps. Thomas, can you please send me the latest snapshot of the
component in case there have been modifications on your side? I will restart
from there.

Best regards


 Jocelyn
Re: [lang guesser] Next steps

thomas.lange

Hi Jocelyn and all,

>    Results for mixed-language texts are quite bad (general conclusion)

That is, if the text is passed to the language guessing component in one
chunk and a single fingerprint is calculated for it. I was hoping that the
combined fingerprint would be 'close' to the two actual languages being
used, but that turned out to be too simple an approach.

> All these points make me doubtful about the value of using libtextcat for
> the next version of the component: it is coded in C, and the code does not
> seem to be designed for reuse or easy modification.

Basically I see no essential problem with libTextCat here. The code and
the fingerprint data do work. Of course, the original code was not
intended to run with Unicode strings...

> In addition, if we want to guess the languages of texts which are composed
> of several different languages, we have to find typical text parts such as
> quoted or bracketed word sequences. I expect there is a UNO component to do
> that, isn't there?

That would be the breakiterator. It can be used to identify word
boundaries and the start and end of sentences. I do not specifically
know how well it works with quoted or bracketed text, though.
After all, since it is used for cursor travelling and to identify words
for spell checking, it is required to be fast rather than accurate.
What I mean to say is that, in order to properly identify sentence
boundaries, one would already need a grammar checker or something of a
similar level of complexity. And that would be far too much overhead
for the purposes the breakiterator is used for.
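
Just as an illustration, calling it for sentences could look roughly like
the following (written from memory, so please check the exact signatures
against the com.sun.star.i18n.XBreakIterator IDL before relying on it):

#include <com/sun/star/i18n/XBreakIterator.hpp>
#include <com/sun/star/lang/Locale.hpp>
#include <com/sun/star/uno/Reference.hxx>
#include <rtl/ustring.hxx>
#include <vector>

using namespace com::sun::star;

// Collect the sentences of rText using the breakiterator.
std::vector<rtl::OUString> splitSentences(
    const uno::Reference<i18n::XBreakIterator>& xBreak,
    const rtl::OUString& rText,
    const lang::Locale& rLocale)
{
    std::vector<rtl::OUString> aSentences;
    sal_Int32 nStart = 0;
    while (nStart < rText.getLength())
    {
        sal_Int32 nEnd = xBreak->endOfSentence(rText, nStart, rLocale);
        if (nEnd <= nStart)    // no further sentence found
            break;
        aSentences.push_back(rText.copy(nStart, nEnd - nStart));
        nStart = nEnd;
    }
    return aSentences;
}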


> I chatted with Thomas LEBARBE – a researcher at Grenoble University
> (France) – during OOoCon, and he suggested I use something he called
> "virgulo", which is a kind of grammatical separator. I also thought, at the
> beginning of the summer, when I was searching for a good way to guess
> multi-language texts, that language changes often occur at the beginning or
> end of grammatical blocks. So analyzing the text block by block should be a
> possible way to improve the accuracy of multi-language guessing.

Remember to make it rather quick (maybe only a regular-expression-based
grammar checker?) because language guessing is likely to be used together
with the actual grammar checking, e.g. by guessing the primary language of a
text. If it is too slow it will have a severe impact on the usefulness of
grammar checking.


> It sounds like a complete refactoring will be needed if we want to
> implement new functionality and if we want to have real multi-guess
> features.

We probably first need to identify what type of functionality we
actually require. As already mentioned, the first client in mind
for such extended functionality would likely be grammar checking.

From my current point of view it would at the very least require
guessing the primary language of a text as accurately as possible.
Secondary tasks would be:
  - to guess all involved languages
and maybe
  - to identify the boundaries between those languages
    (or in other words: identify the language of each word)
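
To make that list a bit more tangible, a hypothetical interface could look
like the following (all names are placeholders, not an agreed design):

#include <cstddef>
#include <string>
#include <vector>

struct LanguageSpan
{
    std::size_t begin;      // start offset in the text
    std::size_t end;        // end offset (exclusive)
    std::string language;   // e.g. an ISO 639 code such as "fr"
};

class LanguageGuesser
{
public:
    virtual ~LanguageGuesser() {}

    // Primary requirement: the dominant language of the whole text.
    virtual std::string guessPrimaryLanguage(const std::string& text) const = 0;

    // Secondary: all languages that appear in the text.
    virtual std::vector<std::string> guessAllLanguages(const std::string& text) const = 0;

    // Maybe: where each language starts and ends (down to word level).
    virtual std::vector<LanguageSpan> guessLanguageSpans(const std::string& text) const = 0;
};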


> I propose to develop a complete C++ library, of course not from scratch:
> I will take inspiration from libtextcat, especially for the fingerprint
> comparison, which is implemented in libtextcat in a very efficient way.
> Unfortunately, this algorithm is ad hoc and I think I will have to look at
> it closely. So we would have, for example, a component called
> "XFingerprintMaker" that would also be very useful for other linguistic
> purposes.

Would be nice.

> It is probably not worth sending everybody the present version of the
> component, because I think it will be modified.

> None of the things I have said here are the priority; of course, these are
> next steps. Thomas, can you please send me the latest snapshot of the
> component in case there have been modifications on your side? I will
> restart from there.

Nothing has been done yet.
To be more precise, unfortunately it is still not decided in which form
the component should be integrated (as a UNO package, or as a library with
data files like most of the other code).
I need to inquire about this again.


Thomas

