[SoC][Report] Component for guessing the language of text


[SoC][Report] Component for guessing the language of text

Jocelyn Merand
Hi Thomas,



First of all, I'm surprised to see that the deadline of the project is
next Monday. I believed that since the project was sponsored by Intel
and started with a one-week delay, it would also end one week later
(around the end of August). No problem, I'll change my agenda to finish
in time, and I will keep contributing to the project freely after this
deadline to improve the guessing accuracy and to add new languages.



Now I'm worried about Unicode, because libtextcat is definitely not
designed for encodings larger than 8-bit ones. I have also thought about
the value of a Unicode-based analysis: I tried to build a set of rules
that guess the language of a text from the character codes alone, and it
hardly works. In the end, I think the N-gram analysis already includes a
code-based analysis that should be sufficient to pick out the most
probable languages; when the N-gram analyzer counts N-grams, it can also
count single characters. That is why I decided to use libtextcat for
short text too, rather than a dedicated algorithm. I tested libtextcat
on short text and defined and implemented some tricks for analyzing
short text, such as: "reduce the minimum size of an N-gram for short
text" or "add white space before and after single words to improve
categorisation by introducing marks for the beginning and end of the
word" (basically, "hello" has these 2-grams: "he", "el", "ll" and "lo";
if I add spaces, I also introduce the 2-grams " h" and "o ", which is
much more expressive, for example for words that end in "ing" in
English).
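
For illustration, here is a minimal sketch of the padding trick in plain
C++ (not the component's actual code):

    #include <iostream>
    #include <string>
    #include <vector>

    // Collect the 2-grams of a word; if pad is true, surround the word
    // with spaces first so word-boundary grams like " h" and "o " appear.
    std::vector<std::string> bigrams(std::string word, bool pad)
    {
        if (pad)
            word = " " + word + " ";
        std::vector<std::string> grams;
        for (std::string::size_type i = 0; i + 2 <= word.size(); ++i)
            grams.push_back(word.substr(i, 2));
        return grams;
    }

    int main()
    {
        // Prints: " h" "he" "el" "ll" "lo" "o "
        for (const std::string &g : bigrams("hello", true))
            std::cout << '"' << g << "\" ";
        std::cout << std::endl;
        return 0;
    }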



Today I have a problem with character encoding. The best way to guess
the language would be to always use the same character encoding for
every text and to compare its fingerprint with the languages'
fingerprints (all encoded the same way). The encoding that seems best
suited for this is UTF-16, but it is a 2-byte-based encoding. To use it
I have to modify libtextcat to accept 2-byte characters, which is a big
job (I am modifying the program that builds the fingerprints, and the
rest will be done before the end of the week).
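
For context, libtextcat follows the Cavnar & Trenkle scheme: the text
and each language fingerprint are ranked lists of their most frequent
N-grams, compared with an "out-of-place" distance, roughly like this
sketch (not libtextcat's actual code):

    #include <cstdlib>
    #include <map>
    #include <string>
    #include <vector>

    // Out-of-place distance between the text's ranked N-gram list and
    // one language fingerprint (N-gram -> rank). N-grams missing from
    // the fingerprint cost a fixed penalty; the language with the
    // smallest total distance is the best guess.
    long outOfPlace(const std::vector<std::string> &textRanked,
                    const std::map<std::string, long> &langRank,
                    long penalty)
    {
        long dist = 0;
        for (std::vector<std::string>::size_type i = 0;
             i < textRanked.size(); ++i)
        {
            std::map<std::string, long>::const_iterator it =
                langRank.find(textRanked[i]);
            dist += (it == langRank.end())
                        ? penalty
                        : std::labs(it->second - static_cast<long>(i));
        }
        return dist;
    }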



I have also added methods to configure the component (setting the
fingerprint DB and enabling/disabling languages). I have not written
tests for it yet, which is why I'm not sending it now.
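
To give an idea, the configuration surface looks roughly like this (the
names below are illustrative only, not the component's final API):

    #include <string>

    // Illustrative sketch of the configuration interface; the real
    // method names and types may differ.
    class LanguageGuesser
    {
    public:
        virtual ~LanguageGuesser() {}
        // Point the guesser at a directory of per-language fingerprints.
        virtual void setFingerprintDB(const std::string &path) = 0;
        // Exclude or re-include a language (e.g. "fr") from guessing.
        virtual void disableLanguage(const std::string &isoCode) = 0;
        virtual void enableLanguage(const std::string &isoCode) = 0;
    };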



I will write all documentation and comments next weekend and Monday.



About debugging: I have added the lines you sent me last week, but I
still debug manually.



I also read in this news item:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
that Google will soon publish its N-gram set (which is about 25 GB!). It
could be really interesting to use a subset of this huge database to
build our fingerprints, and I'm watching for its official release.



Regards, Jocelyn

Re: [SoC][Report] Component for guessing the language of text

Laurent Godard-3
Hi Jocelyn,

> I also read in this news item:
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
> that Google will soon publish its N-gram set (which is about 25 GB!). It
> could be really interesting to use a subset of this huge database to
> build our fingerprints, and I'm watching for its official release.

Let me know,
I'm also interested in it.

Laurent

--
Laurent Godard <[hidden email]> - OpenOffice.org Engineering
Indesko >> http://www.indesko.com
Nuxeo CPS >> http://www.nuxeo.com - http://www.cps-project.org
Book: "Programmation OpenOffice.org", Eyrolles 2004


Re: [SoC][Report] Component for guessing the language of text

thomas.lange
In reply to this post by Jocelyn Merand

Hi Jocelyn,

> First of all, I'm surprised to see that the deadline of the project is
> next Monday. I believed that since the project was sponsored by Intel
> and started with a one-week delay, it would also end one week later
> (around the end of August).

Well, if you want to be sure about this you need to ask Dhananjay.
When I inquired at the beginning whether the timeline would be the same
as for SoC, I was told 'yes'.
If you would like to work one more week on this and Dhananjay does not
object, I won't have a problem with it either. If so, please ask him.

> No problem, I'll change my agenda to finish in time, and I will keep
> contributing to the project freely after this deadline to improve the
> guessing accuracy and to add new languages.

See above.

> Now I'm worried about Unicode, because libtextcat is definitely not
> designed for encodings larger than 8-bit ones.

You probably have to know the preferred 8-bit encoding libtextcat
expects and then convert the Unicode text to it first.

> I have also thought about the value of a Unicode-based analysis: I
> tried to build a set of rules that guess the language of a text from
> the character codes alone, and it hardly works.

Why not? For at least some languages (Hebrew, Arabic, Thai, Japanese)
this should work quite well. For Arabic you probably won't be able to
determine the country part, though.
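
Such a code-point rule could be as simple as this sketch (standard
Unicode block ranges; it only helps where a script maps to essentially
one language):

    #include <string>

    // Guess a script from a single UTF-32 code point via Unicode block
    // ranges; useful only where a script identifies (almost) one language.
    std::string scriptOf(char32_t c)
    {
        if (c >= 0x0590 && c <= 0x05FF) return "Hebrew";
        if (c >= 0x0600 && c <= 0x06FF) return "Arabic";
        if (c >= 0x0E00 && c <= 0x0E7F) return "Thai";
        if (c >= 0x3040 && c <= 0x30FF) return "Japanese kana";
        return "unknown";
    }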

> In the end, I think the N-gram analysis already includes a code-based
> analysis that should be sufficient to pick out the most probable
> languages; when the N-gram analyzer counts N-grams, it can also count
> single characters. That is why I decided to use libtextcat for short
> text too, rather than a dedicated algorithm. I tested libtextcat on
> short text and defined and implemented some tricks for analyzing short
> text, such as: "reduce the minimum size of an N-gram for short text" or
> "add white space before and after single words to improve categorisation
> by introducing marks for the beginning and end of the word" (basically,
> "hello" has these 2-grams: "he", "el", "ll" and "lo"; if I add spaces,
> I also introduce the 2-grams " h" and "o ", which is much more
> expressive, for example for words that end in "ing" in English).

Those sound like good ideas. We will have to see how they work out on
actual short text.

> Today I have a problem with character encoding. The best way to guess
> the language would be to always use the same character encoding for
> every text and to compare its fingerprint with the languages'
> fingerprints (all encoded the same way).

Always using the same encoding for the same language is probably a must;
otherwise you are likely to need additional fingerprint data.
But since it should always be possible to convert the Unicode string to
libtextcat's preferred encoding for that language, there should be no
need for that.

> The encoding that seems best suited for this is UTF-16, but it is a
> 2-byte-based encoding. To use it I have to modify libtextcat to accept
> 2-byte characters, which is a big job (I am modifying the program that
> builds the fingerprints, and the rest will be done before the end of
> the week).

This also sounds reasonable. Of course you need to calculate new
fingerprint data from the originally supplied sample texts.
Also, since UTF-16 uses two bytes per character, one should consider
doubling the N-gram size as well.

But to keep the originally used N-gram size, why not convert to UTF-8?
It is a variable-length byte encoding and should also do the trick.
Maybe it is even more suitable than UTF-16.
(BTW, UTF-16 is what is stored in the OUStrings.)
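
The conversion itself would be a one-liner in OOo code, something like
this sketch (error handling omitted):

    #include <rtl/string.hxx>
    #include <rtl/textenc.h>
    #include <rtl/ustring.hxx>

    // Convert the UTF-16 OUString to a UTF-8 byte string that an
    // 8-bit-based library like libtextcat can consume.
    rtl::OString toUtf8(const rtl::OUString &rText)
    {
        return rtl::OUStringToOString(rText, RTL_TEXTENCODING_UTF8);
    }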

But since with UTF-8 a single character can be encoded in (if I'm
correct) up to 4 bytes, and in some languages this may happen quite
often, I would also suspect that the fingerprint loses some quality
because of that.

I think it would still be worth giving it a try.
It is probably also a good idea to ask the original author for his
opinion here.


> I will write all documentation and comments next weekend and Monday.
>
> About debugging: I have added the lines you sent me last week, but I
> still debug manually.

Didn't it work? Or is it just a matter of preference?

> I also read in this news item:
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
> that Google will soon publish its N-gram set (which is about 25 GB!). It
> could be really interesting to use a subset of this huge database to
> build our fingerprints, and I'm watching for its official release.

Would definitely be nice to be able to use it.


Regards,
Thomas
