Re: G_UTF8String: Boxed Type Proposal

From: Randall Sawyer <srandallsawyer hushmail me>
To: Matthias Clasen <matthias clasen gmail com>, gtk-devel-list <gtk-devel-list gnome org>
Subject: Re: G_UTF8String: Boxed Type Proposal
Date: Fri, 18 Mar 2016 09:57:49 -0400

On 03/17/2016 07:23 PM, Matthias Clasen wrote:

Sure, code point works too. Anyway, enough with the ontology, we're
not a standards body....

I still don't think that we need a utf8-string datatype.


I have questions, then.

Here are excerpts from the current master files:
"gstring.h"
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
};
...

"gstring.c"

...
/**
 * g_string_insert_len:
 * @string: a #GString
 * @pos: position in @string where insertion should
 *       happen, or -1 for at the end
 * @val: bytes to insert
 * @len: number of bytes of @val to insert
 *
 * Inserts @len bytes of @val into @string at @pos.
 * Because @len is provided, @val may contain embedded
 * nuls and need not be nul-terminated. If @pos is -1,
 * bytes are inserted at the end of the string.
 *
 * Since this function does not stop at nul bytes, it is
 * the caller's responsibility to ensure that @val has at
 * least @len addressable bytes.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_len (GString     *string,
                     gssize       pos,
                     const gchar *val,
                     gssize       len)
...
/**
 * g_string_insert_unichar:
 * @string: a #GString
 * @pos: the position at which to insert character, or -1
 *     to append at the end of the string
 * @wc: a Unicode character
 *
 * Converts a Unicode character into UTF-8, and insert it
 * into the string at the given position.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_unichar (GString  *string,
                         gssize    pos,
                         gunichar  wc)
...

1) Since GString handles insertion of both raw strings and gunichar
values, then it is safe to assume that the raw strings are treated as
UTF-8. In that case, does the value of the argument `pos' refer to C
array index or to UTF-8 offset? [I had to read the source code to find
out.]

2) If the former is true - which it is - then the developer will need to
call g_utf8_strlen() to determine if there are multi-byte sequences to
navigate - and if there are - g_utf8_offset_to_pointer() to locate the
array index. Doesn't this increase processing demand?

3) Wouldn't it be helpful to keep track of how many code points
("characters") are stored in the GString - a number which may be less
than the value of GString.len - without needing to call g_utf8_strlen()
each time to find out?

4) Would my efforts be better spent editing patches of "gstring.h" and
"gstring.c" - or - to proceed as I am to introduce a parallel
alternative?
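
To make questions 1 and 2 concrete, here is a minimal sketch in plain C (no GLib, so it compiles on its own). The two helpers are hypothetical stand-ins for what g_utf8_strlen() and g_utf8_offset_to_pointer() do; their names are not GLib API, and they assume the input is already valid UTF-8. The point is that a code point offset has to be translated into a byte index before it can be used as `pos', and that translation walks the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a GLib function): number of code points in a
 * nul-terminated string that is assumed to be valid UTF-8. */
static size_t
sketch_utf8_count (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)   /* not a continuation byte */
      n++;
  return n;
}

/* Hypothetical helper: byte index of the code point at `offset`.
 * Each step is proportional to the bytes skipped, so the whole lookup
 * is linear in the distance from the start of the string. */
static size_t
sketch_utf8_offset_to_index (const char *s, size_t offset)
{
  size_t i = 0;

  assert (offset <= sketch_utf8_count (s));
  while (offset > 0)
    {
      i++;                                      /* step past the lead byte */
      while (((unsigned char) s[i] & 0xC0) == 0x80)
        i++;                                    /* skip continuation bytes */
      offset--;
    }
  return i;
}
```

For the string "h\xC3\xA9llo" (five code points in six bytes), the first `l' is at code point offset 2 but at byte index 3, and it is the byte index that a byte-oriented insert expects.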


If the answer to (4) is that I should patch "gstring.h" and "gstring.c", then how about the following modifications?
Change "gstring.h":
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
  gsize utf8_len;
};
...
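
As a rough illustration of what carrying a code point count in the struct would involve, here is a sketch in plain C. SketchString and sketch_string_append() are hypothetical names that only mirror the field layout above; this is not GString, and it assumes every appended value is valid, nul-terminated UTF-8. The count is updated while the new bytes are scanned once, so it never has to be recomputed over the whole buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the modified struct; not the real GString. */
typedef struct
{
  char  *str;
  size_t len;            /* bytes, excluding the terminating nul */
  size_t allocated_len;
  size_t utf8_len;       /* code points */
} SketchString;

/* Append `val` (assumed valid UTF-8) and keep both lengths in sync. */
static void
sketch_string_append (SketchString *s, const char *val)
{
  size_t add, i;

  assert (s != NULL && val != NULL);
  add = strlen (val);

  if (s->len + add + 1 > s->allocated_len)
    {
      size_t new_alloc = (s->len + add + 1) * 2;
      char *p = realloc (s->str, new_alloc);

      if (p == NULL)
        abort ();
      s->str = p;
      s->allocated_len = new_alloc;
    }

  memcpy (s->str + s->len, val, add + 1);      /* copies the nul as well */
  s->len += add;

  /* Count only lead bytes: every code point has exactly one. */
  for (i = 0; i < add; i++)
    if (((unsigned char) val[i] & 0xC0) != 0x80)
      s->utf8_len++;
}
```

Starting from a zero-initialised SketchString, appending "na\xC3\xAFve" leaves len at 6 and utf8_len at 5. The cost is that every function that modifies the buffer has to maintain the extra field, including code that writes to `str' directly.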

Add to "gstring.h":
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_truncate_utf8       (GString      *string,
                                       gsize         utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_len_utf8     (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_utf8         (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_c_utf8       (GString      *string,
                                       gssize        offset,
                                       gchar         c);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_unichar_utf8 (GString      *string,
                                       gssize        offset,
                                       gunichar      wc);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_utf8      (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_len_utf8  (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);

Add to "utf8.c":
...
GLIB_AVAILABLE_IN_2_XX
void   g_utf8_measure                 (const gchar  *utf8,
                                       glong         max_len,
                                       gsize        *utf8_len,
                                       gsize        *byte_len,
                                       gboolean      validate);
GLIB_AVAILABLE_IN_2_XX
gchar* g_utf8_sized_offset_to_pointer (const gchar  *utf8,
                                       glong         offset,
                                       gsize         utf8_len,
                                       gsize         byte_len);
...

Note 1: The GString functions ending in *_utf8 would check if values of
GString.len and GString.utf8_len are equal - and directly access
contained gchar array if they are, thus dispensing with looking up
pointer from offset.

Note 2: The function g_utf8_measure() iterates the passed array once,
simultaneously arriving at the values which would be returned by
g_utf8_strlen() and strlen() - dispensing with the need to iterate over
the array twice, which the current means demand. If `validate' is set to
TRUE, then a private validating function is called. If `utf8' is known
to be valid, then the user calls the function with `validate' set to
FALSE - in which case a faster "skipping" private function is called.

Note 3: The function g_utf8_sized_offset_to_pointer() first compares
`utf8_len' and `byte_len', reverting to simple pointer arithmetic if
they are equal - or - if they are not, then comparing `offset' and
`utf8_len' to determine whether to call g_utf8_offset_to_pointer() from
the beginning or the end of the array.
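
The behaviour described in Notes 2 and 3 could look roughly like the following plain C sketch. It is not a patch against GLib: the sketch_* names are hypothetical, the `max_len' and `validate' parameters are left out for brevity (only the non-validating "skipping" path over a nul-terminated string is shown), and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IS_CONT(b) (((unsigned char) (b) & 0xC0) == 0x80)

/* One pass that yields both the code point count and the byte count. */
static void
sketch_utf8_measure (const char *utf8, size_t *utf8_len, size_t *byte_len)
{
  size_t chars = 0, bytes = 0;

  assert (utf8 != NULL);
  for (; utf8[bytes] != '\0'; bytes++)
    if (!SKETCH_IS_CONT (utf8[bytes]))
      chars++;

  *utf8_len = chars;
  *byte_len = bytes;
}

/* Locate the code point at `offset`, using the two known lengths to
 * choose the cheapest strategy. */
static const char *
sketch_utf8_sized_offset_to_pointer (const char *utf8, size_t offset,
                                     size_t utf8_len, size_t byte_len)
{
  const char *p;

  assert (offset <= utf8_len);

  if (utf8_len == byte_len)        /* one byte per code point */
    return utf8 + offset;

  if (offset <= utf8_len / 2)
    {
      /* Closer to the start: walk forward over lead bytes. */
      for (p = utf8; offset > 0; offset--)
        {
          p++;
          while (SKETCH_IS_CONT (*p))
            p++;
        }
    }
  else
    {
      /* Closer to the end: walk backward from the terminating nul. */
      size_t back = utf8_len - offset;

      for (p = utf8 + byte_len; back > 0; back--)
        {
          p--;
          while (SKETCH_IS_CONT (*p))
            p--;
        }
    }
  return p;
}
```

With both lengths cached, a lookup in an all-ASCII string is constant time, and in the general case it walks over at most half of the code points instead of all of them.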


Thank you, Matthias, for your time and attention.

I am sincere in requesting your advice on how best to proceed.
