rtl::OUString::iterateCodePoints

stephan.bergmann
Hi all,

<http://www.openoffice.org/issues/show_bug.cgi?id=76869> requests
functionality to work on an rtl::OUString as a sequence of Unicode
scalar values or code points, rather than a sequence of UTF-16 code units.

What I came up with is the minimalistic rtl_uString_iterateCodePoints in
rtl/ustring.h (see below) and an accompanying public rtl::OUString
member function

   inline sal_uInt32 iterateCodePoints(
     sal_Int32 * indexUtf16, sal_Int32 postIncrementCodePoints = 1);

that is an almost trivial wrapper around it.

Any comments?  Especially, I am interested in the following two points:

1  Would there be legitimate use cases for rtl_uString_iterateCodePoints
to adjust an incoming index that points into the middle of a surrogate
pair, or would that only hide broken code?

2  With the current setup where moving past the beginning or end of the
string is undefined behavior, is there any use for
postIncrementCodePoints outside [-1 .. 1]?  Or would there be legitimate
use cases for rtl_uString_iterateCodePoints to stop moving past the
beginning/end of the string when postIncrementCodePoints is too large?

-Stephan


/** Iterate through a string based on code points instead of UTF-16 code
     units.

     See Chapter 3 of The Unicode Standard 5.0 (Addison-Wesley, 2006)
     for definitions of the various terms used in this description.

     The given string is interpreted as a sequence of zero or more UTF-16
     code units.  For each index into this sequence (from zero to the
     length of the sequence, inclusive), a code point represented
     starting at the given index is computed as follows:

     - If the index points to the end of the sequence, the computed code
     point is the special marker SAL_MAX_UINT32.

     - Otherwise, if the UTF-16 code unit addressed by the index
     constitutes a well-formed UTF-16 code unit sequence, the computed
     code point is the scalar value encoded by that UTF-16 code unit
     sequence.

     - Otherwise, if the index is at least two UTF-16 code units away
     from the end of the sequence, and the sequence of two UTF-16 code
     units addressed by the index constitutes a well-formed UTF-16 code
     unit sequence, the computed code point is the scalar value encoded
     by that UTF-16 code unit sequence.

     - Otherwise, the computed code point is the UTF-16 code unit
     addressed by the index.  (This last case catches unmatched
     surrogates as well as indices pointing into the middle of surrogate
     pairs.)

     @param string
     pointer to a valid string; must not be null.

     @param indexUtf16
     pointer to a UTF-16 based index into the given string; must not be
     null.  On entry, the index must be in the range from zero to the
     length of the string (in UTF-16 code units), inclusive.  Upon
     successful return, the index will be updated to address the UTF-16
     code unit that is the given postIncrementCodePoints away from the
     initial index.

     @param postIncrementCodePoints
     the number of code points to move the given indexUtf16; can be
     negative.  The value must be such that the resulting UTF-16 based
     index is in the range from zero to the length of the string (in
     UTF-16 code units), inclusive.

     @return
     the code point (an integer in the range from 0 to 0x10FFFF,
     inclusive) or the special marker SAL_MAX_UINT32 that is represented at
     the given indexUtf16 starting index within the given string.

     @since UDK 3.2.7
*/
sal_uInt32 SAL_CALL rtl_uString_iterateCodePoints(
     rtl_uString const * string, sal_Int32 * indexUtf16,
     sal_Int32 postIncrementCodePoints);
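
As a rough illustration of the four cases in the comment above, the computation of the code point at a given index could be modelled on a plain std::u16string as follows. This is only a sketch, not the sal implementation: codePointAt, kEndMarker, and the surrogate helpers are hypothetical names, and kEndMarker merely stands in for SAL_MAX_UINT32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for SAL_MAX_UINT32 in this sketch (assumption, not the sal header).
constexpr std::uint32_t kEndMarker = 0xFFFFFFFFu;

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: the code point represented at `index`.
std::uint32_t codePointAt(std::u16string const & s, std::size_t index) {
    assert(index <= s.size()); // index may address the end, but not beyond
    if (index == s.size()) {
        return kEndMarker; // case 1: end of the sequence
    }
    char16_t const u = s[index];
    if (!isHighSurrogate(u) && !isLowSurrogate(u)) {
        return u; // case 2: a single code unit that is well-formed on its own
    }
    if (isHighSurrogate(u) && s.size() - index >= 2
        && isLowSurrogate(s[index + 1]))
    {
        // case 3: a well-formed surrogate pair starts at index
        return 0x10000 + ((std::uint32_t(u) - 0xD800) << 10)
            + (std::uint32_t(s[index + 1]) - 0xDC00);
    }
    return u; // case 4: unmatched surrogate, or index inside a pair
}
```

For the string u"a\U0001D11E" (three UTF-16 code units), index 1 yields 0x1D11E, while index 2 points into the middle of the pair and yields the lone low surrogate 0xDD1E.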

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: rtl::OUString::iterateCodePoints

Eike Rathke
Hi Stephan,

On Wednesday, 2007-05-09 14:10:35 +0200, Stephan Bergmann wrote:

> 1  Would there be legitimate use cases for rtl_uString_iterateCodePoints
> to adjust an incoming index that points into the middle of a surrogate
> pair, or would that only hide broken code?

I think that in its current state it would hide broken code more than
it would be useful. Instead, other functions like those mentioned in i76869
could be introduced, if synchronization is needed. On the other hand,
especially finding the start of a code point may be useful when
iterating backwards from the end of the string and a surrogate is the
last two code units. Maybe that's a special case?

> 2  With the current setup where moving past the beginning or end of the
> string is undefined behavior, is there any use for
> postIncrementCodePoints outside [-1 .. 1]?

There may be in scenarios like "next I'll be interested in the character
after the next", so postIncrementCodePoints would be 2.

> Or would there be legitimate
> use cases for rtl_uString_iterateCodePoints to stop moving past the
> beginning/end of the string when postIncrementCodePoints is too large?

I think it should stop if it is called with indexUtf16 being "outside"
the string, or resulting in such a value, so -1 and length would be the
min/max resulting values. Also,

| @param postIncrementCodePoints
| the number of code points to move the given indexUtf16; can be negative.
| The value must be such that the resulting UTF-16 based index is in the
| range from zero to the length of this string (in UTF-16 code units),
| inclusive.

leaves the impression that in

sal_Int32 nIndex = str.getLength() - 1;
str.iterateCodePoints( &nIndex, 2 )

the value of postIncrementCodePoints would be invalid because it would
increment nIndex beyond the length. Instead, the function should limit
nIndex to str.getLength() upon return.

  Eike

--
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to this [hidden email] account, which I use for
 mailing lists only and don't read from outside Sun. Thanks.

Re: rtl::OUString::iterateCodePoints

stephan.bergmann
Eike Rathke wrote:

> Hi Stephan,
>
> On Wednesday, 2007-05-09 14:10:35 +0200, Stephan Bergmann wrote:
>
>> 1  Would there be legitimate use cases for rtl_uString_iterateCodePoints
>> to adjust an incoming index that points into the middle of a surrogate
>> pair, or would that only hide broken code?
>
> I think that in its current state it would hide broken code more than
> it would be useful. Instead, other functions like those mentioned in i76869
> could be introduced, if synchronization is needed. On the other hand,
> especially finding the start of a code point may be useful when
> iterating backwards from the end of the string and a surrogate is the
> last two code units. Maybe that's a special case?

   sal_Int32 i = s.getLength();
   s.iterateCodePoints(&i, -1);

will make i point to the start of the last character (if s is nonempty).
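
A minimal sketch of what a single backward step has to do, written against std::u16string rather than rtl::OUString. The name previousCodePointStart and the surrogate helpers are hypothetical and only illustrate the index movement, not the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: start index of the code point that ends just
// before `index`.
std::int32_t previousCodePointStart(
    std::u16string const & s, std::int32_t index)
{
    assert(index > 0 && index <= static_cast<std::int32_t>(s.size()));
    auto const i = static_cast<std::size_t>(index);
    // Step over both code units only if they form a well-formed pair.
    if (i >= 2 && isLowSurrogate(s[i - 1]) && isHighSurrogate(s[i - 2])) {
        return index - 2;
    }
    return index - 1;
}
```

Starting from the length of u"a\U0001D11E" (3), one step back lands on index 1, the start of the surrogate pair, rather than on index 2 in its middle.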

>> 2  With the current setup where moving past the beginning or end of the
>> string is undefined behavior, is there any use for
>> postIncrementCodePoints outside [-1 .. 1]?
>
> There may be in scenarios like "next I'll be interested in the character
> after the next", so postIncrementCodePoints would be 2.

My point was that you can only safely make that call if you know that
there are at least two more code points after the current index, which
in general you can only know if you inspect the "surrogate structure" of
the OUString at the sal_Unicode level (which iterateCodePoints should
shield you from).  (Whether you can safely make a call with
postIncrementCodePoints in [-1 .. 1] is easily checkable by the caller,
on the other hand.)
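
To make that concrete: a caller who wanted to pass 2 safely would first need something like the following hypothetical helper, which has to look at the surrogate structure itself. This is a sketch on std::u16string under that assumption; codePointsRemaining is not part of any existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: number of code points represented from `index`
// up to the end of the string. Unmatched surrogates count as one each.
std::int32_t codePointsRemaining(std::u16string const & s, std::size_t index) {
    assert(index <= s.size());
    std::int32_t count = 0;
    while (index < s.size()) {
        bool const pair = isHighSurrogate(s[index]) && index + 1 < s.size()
            && isLowSurrogate(s[index + 1]);
        index += pair ? 2 : 1; // a well-formed pair is a single code point
        ++count;
    }
    return count;
}
```

A move by n code points from index stays within the string only if n is at most codePointsRemaining(s, index), which is exactly the kind of code-unit-level inspection that iterateCodePoints is meant to hide.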

>> Or would there be legitimate
>> use cases for rtl_uString_iterateCodePoints to stop moving past the
>> beginning/end of the string when postIncrementCodePoints is too large?
>
> I think it should stop if it is called with indexUtf16 being "outside"
> the string, or resulting in such a value, so -1 and length would be the
> min/max resulting values. Also,

Why -1 instead of 0?

> | @param postIncrementCodePoints
> | the number of code points to move the given indexUtf16; can be negative.
> | The value must be such that the resulting UTF-16 based index is in the
> | range from zero to the length of this string (in UTF-16 code units),
> | inclusive.
>
> leaves the impression that in
>
> sal_Int32 nIndex = str.getLength() - 1;
> str.iterateCodePoints( &nIndex, 2 )
>
> the value of postIncrementCodePoints would be invalid because it would
> increment nIndex beyond the length. Instead, the function should limit
> nIndex to str.getLength() upon return.

The nice thing about having it undefined behavior for now is that if
demand ever turns up to clip excessive moves at 0 or length, respectively,
then that can easily be implemented as a backwards compatible change.

-Stephan


Re: rtl::OUString::iterateCodePoints

Eike Rathke
Hi Stephan,

On Wednesday, 2007-05-30 16:26:34 +0200, Stephan Bergmann wrote:

> >especially finding the start of a code point may be useful when
> >iterating backwards from the end of the string and a surrogate is the
> >last two code units. Maybe that's a special case?
>
>   sal_Int32 i = s.getLength();
>   s.iterateCodePoints(&i, -1);
>
> will make i point to the start of the last character (if s is nonempty).

Ah, nice, a detail that wasn't clear to me.

> >>2  With the current setup where moving past the beginning or end of the
> >>string is undefined behavior, is there any use for
> >>postIncrementCodePoints outside [-1 .. 1]?
> >
> >There may be in scenarios like "next I'll be interested in the character
> >after the next", so postIncrementCodePoints would be 2.
>
> My point was that you can only safely make that call if you know that
> there are at least two more code points after the current index, which
> in general you can only know if you inspect the "surrogate structure" of
> the OUString at the sal_Unicode level (which iterateCodePoints should
> shield you from).

True. So, then I assume we don't need other postincrement values.


> >>Or would there be legitimate
> >>use cases for rtl_uString_iterateCodePoints to stop moving past the
> >>beginning/end of the string when postIncrementCodePoints is too large?
> >
> >I think it should stop if it is called with indexUtf16 being "outside"
> >the string, or resulting in such a value, so -1 and length would be the
> >min/max resulting values. Also,
>
> Why -1 instead of 0?

I thought of -1 signalling an end condition in reverse iteration, as
does 'length' in forward iteration, both point "outside" the string and
would follow the general [...[ inclusive/exclusive approach.

A forward loop would look like

    for(i=0; i<s.getLength(); )
    {
        c = s.iterateCodePoints( &i, +1);
    }

A similar reverse loop

    for(i=s.getLength(), s.iterateCodePoints( &i, -1); i>=0; )
    {
        c = s.iterateCodePoints( &i, -1);
    }

would not work if 0 were the smallest indexUtf16 value returned in i; one
would have to insert an if(i==0)break; condition at the end of the loop,
which is quite ugly. Furthermore, the length would have to be checked in
advance as well, so as not to enter the loop with an empty string.
Altogether nasty, I'd say.

  Eike


Re: rtl::OUString::iterateCodePoints

stephan.bergmann
Eike Rathke wrote:

>>>> 2  With the current setup where moving past the beginning or end of the
>>>> string is undefined behavior, is there any use for
>>>> postIncrementCodePoints outside [-1 .. 1]?
>>> There may be in scenarios like "next I'll be interested in the character
>>> after the next", so postIncrementCodePoints would be 2.
>> My point was that you can only safely make that call if you know that
>> there are at least two more code points after the current index, which
>> in general you can only know if you inspect the "surrogate structure" of
>> the OUString at the sal_Unicode level (which iterateCodePoints should
>> shield you from).
>
> True. So, then I assume we don't need other postincrement values.

Think so too.  Anyway, having the more general case available (even if
probably not of much use) does not really hurt, so I will leave that in.

>>>> Or would there be legitimate
>>>> use cases for rtl_uString_iterateCodePoints to stop moving past the
>>>> beginning/end of the string when postIncrementCodePoints is too large?
>>> I think it should stop if it is called with indexUtf16 being "outside"
>>> the string, or resulting in such a value, so -1 and length would be the
>>> min/max resulting values. Also,
>> Why -1 instead of 0?
>
> I thought of -1 signalling an end condition in reverse iteration, as
> does 'length' in forward iteration, both point "outside" the string and
> would follow the general [...[ inclusive/exclusive approach.

But what should

   sal_Int32 i = -1;
   s.iterateCodePoints(&i, 1);

mean then?  Pseudo-iterate forward to i == 0?

But you are right, reverse-iterating code does look more awkward.  Would
it help if postIncrementCodePoints actually acted as
preIncrementCodePoints if it is negative?  Is not that what we want?  Or
is it too confusing?

   sal_Int32 i = s.getLength();
   while (i != 0) {
     sal_uInt32 c = s.iterateCodePoints(&i, -1);
   }

would then neatly reverse-iterate through any string, and we would get
rid of the ugly SAL_MAX_UINT32 special-case return value.
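
The proposed behavior (move first when the count is negative, move afterwards otherwise) can be sketched on std::u16string as below. This is a hypothetical model for illustration only: it uses std::size_t for the index instead of sal_Int32, and pairAt and reverseCodePoints are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

inline bool isHigh(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLow(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if a well-formed surrogate pair starts at index i.
inline bool pairAt(std::u16string const & s, std::size_t i) {
    return i + 1 < s.size() && isHigh(s[i]) && isLow(s[i + 1]);
}

// Hypothetical model of the proposal, not the sal implementation.
std::uint32_t iterateCodePoints(
    std::u16string const & s, std::size_t * index, std::int32_t n)
{
    // Negative n: move backwards first, then read at the new index.
    for (; n < 0; ++n) {
        assert(*index > 0);
        *index -= (*index >= 2 && pairAt(s, *index - 2)) ? 2 : 1;
    }
    assert(*index < s.size()); // there must be a code point to read
    std::size_t const i = *index;
    std::uint32_t const c = pairAt(s, i)
        ? 0x10000 + ((std::uint32_t(s[i]) - 0xD800) << 10)
            + (std::uint32_t(s[i + 1]) - 0xDC00)
        : s[i];
    // Non-negative n: read at the old index, then move forwards.
    for (; n > 0; --n) {
        assert(*index < s.size());
        *index += pairAt(s, *index) ? 2 : 1;
    }
    return c;
}

// The reverse loop from the message above, collecting the code points.
std::vector<std::uint32_t> reverseCodePoints(std::u16string const & s) {
    std::vector<std::uint32_t> out;
    std::size_t i = s.size();
    while (i != 0) {
        out.push_back(iterateCodePoints(s, &i, -1));
    }
    return out;
}
```

In this model the read never happens at the end of the string, so no special end marker is needed, and the reverse loop does nothing for an empty string.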

-Stephan



Re: rtl::OUString::iterateCodePoints

Eike Rathke
Hi Stephan,

On Thursday, 2007-05-31 17:49:24 +0200, Stephan Bergmann wrote:

> >>Why -1 instead of 0?
> >
> >I thought of -1 signalling an end condition in reverse iteration, as
> >does 'length' in forward iteration, both point "outside" the string and
> >would follow the general [...[ inclusive/exclusive approach.
>
> But what should
>
>   sal_Int32 i = -1;
>   s.iterateCodePoints(&i, 1);
>
> mean then?  Pseudo-iterate forward to i == 0?

Yes, analogous to reverse-iterating with i=s.getLength(). However, that
case may be a bit pathological. It also needs to return SAL_MAX_UINT32
again. So, if we preincremented on reverse iteration as mentioned
below, what would this situation give? The same?


> But you are right, reverse-iterating code does look more awkward.  Would
> it help if postIncrementCodePoints actually acted as
> preIncrementCodePoints if it is negative?  Is not that what we want?

It is.

> Or is it too confusing?

I don't think so. Well, maybe at the beginning, but it does what we want :-)
It just reverses the entire behavior.

>   sal_Int32 i = s.getLength();
>   while (i != 0) {
>     sal_uInt32 c = s.iterateCodePoints(&i, -1);
>   }
>
> would then neatly reverse-iterate through any string, and we would get
> rid of the ugly SAL_MAX_UINT32 special-case return value.

What if 'i' is 0, and maybe 's' also an empty string? This wouldn't
happen in a proper loop, but a call to iterateCodePoints() in these
cases would result in what?

s.iterateCodePoints(&i, +1) => i== 0 ?  because getLength()==0
s.iterateCodePoints(&i,  0) => i== 0 ?  because not iterating
s.iterateCodePoints(&i, -1) => i==-1 ?  because preincremented past the beginning

And the return value?

  Eike



Re: rtl::OUString::iterateCodePoints

stephan.bergmann
Eike Rathke wrote:

>> But you are right, reverse-iterating code does look more awkward.  Would
>> it help if postIncrementCodePoints actually acted as
>> preIncrementCodePoints if it is negative?  Is not that what we want?
>
> It is.
>
>> Or is it too confusing?
>
> I don't think so. Well, maybe at the beginning, but it does what we want :-)
> It just reverses the entire behavior.
>
>>   sal_Int32 i = s.getLength();
>>   while (i != 0) {
>>     sal_uInt32 c = s.iterateCodePoints(&i, -1);
>>   }
>>
>> would then neatly reverse-iterate through any string, and we would get
>> rid of the ugly SAL_MAX_UINT32 special-case return value.
>
> What if 'i' is 0, and maybe 's' also an empty string? This wouldn't
> happen in a proper loop, but a call to iterateCodePoints() in these
> cases would result in what?
>
> s.iterateCodePoints(&i, +1) => i== 0 ?  because getLength()==0
> s.iterateCodePoints(&i,  0) => i== 0 ?  because not iterating
> s.iterateCodePoints(&i, -1) => i==-1 ?  because preincremented past the beginning
>
> And the return value?

All three cases would be undefined behavior.  I would change the
preconditions for

   iterateCodePoints(
     sal_Int32 * indexUtf16, sal_Int32 incrementCodePoints)

as follows:

- indexUtf16 must not be null
- if incrementCodePoints >= 0:
   - *indexUtf16 must be in [0 .. length[
   - *indexUtf16 + incrementCodePoints must be in [0 .. length]
- if incrementCodePoints < 0:
   - *indexUtf16 must be in [0 .. length]
   - *indexUtf16 + incrementCodePoints must be in [0 .. length[
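
The preconditions listed above can be written down as two small predicates. This is only a sketch: entryIndexValid and resultIndexValid are hypothetical names, and "result" is read here as the UTF-16 index after the move has been carried out.

```cpp
#include <cassert> // for assert() at call sites
#include <cstdint>

// Hypothetical predicate: is the index valid on entry?
// increment >= 0 requires [0 .. length[, increment < 0 allows [0 .. length].
bool entryIndexValid(
    std::int32_t length, std::int32_t index, std::int32_t increment)
{
    if (increment >= 0) {
        return index >= 0 && index < length;
    }
    return index >= 0 && index <= length;
}

// Hypothetical predicate: is the index valid after the move?
// increment >= 0 allows [0 .. length], increment < 0 requires [0 .. length[.
bool resultIndexValid(
    std::int32_t length, std::int32_t result, std::int32_t increment)
{
    if (increment >= 0) {
        return result >= 0 && result <= length;
    }
    return result >= 0 && result < length;
}
```

For an empty string (length 0) and index 0, the entry check fails for increments +1 and 0, and for -1 no resulting index can satisfy [0 .. length[, which matches the statement that all three cases are undefined.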

-Stephan
