Re: G_UTF8String: Boxed Type Proposal



On 03/17/2016 07:23 PM, Matthias Clasen wrote:
> Sure, code point works too. Anyway, enough with the ontology, we're
> not a standards body....
>
> I still don't think that we need a utf8-string datatype.

I have questions, then.

Here are excerpts from the current master files:
"gstring.h"
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
};
...

"gstring.c"

...
/**
 * g_string_insert_len:
 * @string: a #GString
 * @pos: position in @string where insertion should
 *       happen, or -1 for at the end
 * @val: bytes to insert
 * @len: number of bytes of @val to insert
 *
 * Inserts @len bytes of @val into @string at @pos.
 * Because @len is provided, @val may contain embedded
 * nuls and need not be nul-terminated. If @pos is -1,
 * bytes are inserted at the end of the string.
 *
 * Since this function does not stop at nul bytes, it is
 * the caller's responsibility to ensure that @val has at
 * least @len addressable bytes.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_len (GString     *string,
                     gssize       pos,
                     const gchar *val,
                     gssize       len)
...
/**
 * g_string_insert_unichar:
 * @string: a #GString
 * @pos: the position at which to insert character, or -1
 *     to append at the end of the string
 * @wc: a Unicode character
 *
 * Converts a Unicode character into UTF-8, and insert it
 * into the string at the given position.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_unichar (GString  *string,
                         gssize    pos,
                         gunichar  wc)
...

1) Since GString handles insertion of both raw strings and gunichar values, it is safe to assume that the raw strings are treated as UTF-8. In that case, does the value of the argument `pos' refer to a C array index or to a UTF-8 offset? [I had to read the source code to find out.]

2) If the former is true (which it is), then the developer will need to call g_utf8_strlen() to determine whether there are multi-byte sequences to navigate and, if there are, g_utf8_offset_to_pointer() to locate the array index. Doesn't this increase processing demand?

3) Wouldn't it be helpful to keep track of how many code points ("characters") are stored in the GString, a number which may be less than the value of GString.len, without needing to call g_utf8_strlen() each time to find out?

4) Would my efforts be better spent preparing patches to "gstring.h" and "gstring.c", or proceeding as I am to introduce a parallel alternative?
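To make the distinction in questions (1) and (2) concrete, here is a minimal sketch in plain C with no GLib dependency. The helpers utf8_count_code_points() and utf8_offset_to_byte_index() are hypothetical stand-ins for the kind of work g_utf8_strlen() and g_utf8_offset_to_pointer() perform; they assume valid, nul-terminated UTF-8 and are not GLib's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points by counting every byte that is not a UTF-8
 * continuation byte (bit pattern 10xxxxxx).
 * Assumes valid, nul-terminated UTF-8. */
size_t
utf8_count_code_points (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* Translate a code point offset into a byte index.
 * Offsets past the end clamp to the byte length of the string. */
size_t
utf8_offset_to_byte_index (const char *s, size_t offset)
{
  size_t i = 0;

  assert (s != NULL);
  while (s[i] != '\0' && offset > 0)
    {
      i++;                                      /* lead byte */
      while (((unsigned char) s[i] & 0xC0) == 0x80)
        i++;                                    /* continuation bytes */
      offset--;
    }
  return i;
}
```

For the string "héllo" the byte length is 6 while the code point count is 5, and code point offset 2 (the first 'l') corresponds to byte index 3. That byte index is what a byte-based `pos' argument would need to receive.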

If the answer to (4) is to patch GString, then how about the following modifications?
Change "gstring.h":
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
  gsize utf8_len;
};
...

Add to "gstring.h":
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_truncate_utf8       (GString      *string,
                                       gsize         utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_len_utf8     (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_utf8         (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_c_utf8       (GString      *string,
                                       gssize        offset,
                                       gchar         c);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_unichar_utf8 (GString      *string,
                                       gssize        offset,
                                       gunichar      wc);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_utf8      (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_len_utf8  (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);
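
One way an offset-based insert with a tracked code point count might behave is sketched below in plain C. Utf8Buf, utf8_buf_init() and utf8_buf_insert() are hypothetical names for illustration, not proposed GLib API, and the sketch assumes that all input is valid, nul-terminated UTF-8. When the byte count and the code point count are equal the contents are pure ASCII, so the offset can be used directly as a byte index.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GString carrying the extra field. */
typedef struct
{
  char  *str;            /* nul-terminated UTF-8 bytes */
  size_t len;            /* length in bytes, excluding the nul */
  size_t allocated_len;  /* allocated bytes, including the nul */
  size_t utf8_len;       /* length in code points */
} Utf8Buf;

static int
is_continuation (unsigned char c)
{
  return (c & 0xC0) == 0x80;
}

/* Start with an empty, nul-terminated buffer. Returns 0 on success. */
int
utf8_buf_init (Utf8Buf *b)
{
  assert (b != NULL);
  b->str = malloc (16);
  if (b->str == NULL)
    return -1;
  b->str[0] = '\0';
  b->len = 0;
  b->allocated_len = 16;
  b->utf8_len = 0;
  return 0;
}

/* Insert `val` before the code point at `offset`; an offset at or
 * past the end appends. Returns 0 on success, -1 if out of memory. */
int
utf8_buf_insert (Utf8Buf *b, size_t offset, const char *val)
{
  size_t val_bytes = strlen (val);
  size_t val_points = 0;
  size_t pos, i;

  for (i = 0; i < val_bytes; i++)
    if (!is_continuation ((unsigned char) val[i]))
      val_points++;

  if (offset >= b->utf8_len)
    pos = b->len;                     /* append */
  else if (b->len == b->utf8_len)
    pos = offset;                     /* pure ASCII: offset is the index */
  else
    for (pos = 0; offset > 0; offset--)
      {
        pos++;                        /* lead byte */
        while (is_continuation ((unsigned char) b->str[pos]))
          pos++;
      }

  if (b->len + val_bytes + 1 > b->allocated_len)
    {
      size_t new_alloc = (b->len + val_bytes + 1) * 2;
      char *p = realloc (b->str, new_alloc);

      if (p == NULL)
        return -1;
      b->str = p;
      b->allocated_len = new_alloc;
    }

  /* Shift the tail (including the nul), then copy the new bytes in. */
  memmove (b->str + pos + val_bytes, b->str + pos, b->len - pos + 1);
  memcpy (b->str + pos, val, val_bytes);
  b->len += val_bytes;
  b->utf8_len += val_points;
  return 0;
}
```

Both counts are updated on every insert, so no later call needs to rescan the buffer to learn how many code points it holds. The cost is one extra field and a scan of each inserted value.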

Add to "utf8.c":
...
GLIB_AVAILABLE_IN_2_XX
void   g_utf8_measure                 (const gchar  *utf8,
                                       glong         max_len,
                                       gsize        *utf8_len,
                                       gsize        *byte_len,
                                       gboolean      validate);
GLIB_AVAILABLE_IN_2_XX
gchar* g_utf8_sized_offset_to_pointer (const gchar  *utf8,
                                       glong         offset,
                                       gsize         utf8_len,
                                       gsize         byte_len);
...

Note 1: The GString functions ending in *_utf8 would check whether the values of GString.len and GString.utf8_len are equal and, if they are, directly access the contained gchar array, thus dispensing with looking up a pointer from the offset.

Note 2: The function g_utf8_measure() iterates over the passed array once, simultaneously arriving at the values which would be returned by g_utf8_strlen() and strlen(), dispensing with the need to iterate over the array twice, as the current API demands. If `validate' is set to TRUE, then a private validating function is called. If `utf8' is known to be valid, then the user calls the function with `validate' set to FALSE, in which case a faster "skipping" private function is called.

Note 3: The function g_utf8_sized_offset_to_pointer() first compares `utf8_len' and `byte_len', reverting to simple pointer arithmetic if they are equal. If they are not, it compares `offset' and `utf8_len' to determine whether to call g_utf8_offset_to_pointer() from the beginning or the end of the array.
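
The single-pass measurement and the sized lookup described in these notes can be sketched in plain C as follows. This is only an illustration under stated assumptions: utf8_measure() and utf8_sized_offset_to_pointer() are hypothetical stand-ins without the g_ prefix, the sketch covers only the non-validating case (input already known to be valid, nul-terminated UTF-8), it omits the `max_len' parameter, and it walks the bytes itself instead of calling g_utf8_offset_to_pointer().

```c
#include <assert.h>
#include <stddef.h>

static int
is_continuation (unsigned char c)
{
  return (c & 0xC0) == 0x80;   /* bit pattern 10xxxxxx */
}

/* One pass over the string yields both the code point count and the
 * byte count. Either output pointer may be NULL. */
void
utf8_measure (const char *utf8, size_t *utf8_len, size_t *byte_len)
{
  size_t points = 0, bytes = 0;

  assert (utf8 != NULL);
  for (; utf8[bytes] != '\0'; bytes++)
    if (!is_continuation ((unsigned char) utf8[bytes]))
      points++;

  if (utf8_len != NULL)
    *utf8_len = points;
  if (byte_len != NULL)
    *byte_len = bytes;
}

/* Find the start of code point `offset`, given both lengths.
 * An offset at or past the end returns a pointer to the nul. */
const char *
utf8_sized_offset_to_pointer (const char *utf8,
                              size_t      offset,
                              size_t      utf8_len,
                              size_t      byte_len)
{
  const char *p;

  assert (utf8 != NULL);
  if (offset >= utf8_len)
    return utf8 + byte_len;
  if (utf8_len == byte_len)
    return utf8 + offset;            /* pure ASCII: pointer arithmetic */

  if (offset <= utf8_len / 2)
    {
      /* Closer to the start: walk forward. */
      for (p = utf8; offset > 0; offset--)
        {
          p++;
          while (is_continuation ((unsigned char) *p))
            p++;
        }
    }
  else
    {
      /* Closer to the end: walk backward from the nul. */
      size_t back = utf8_len - offset;

      for (p = utf8 + byte_len; back > 0; back--)
        {
          p--;
          while (is_continuation ((unsigned char) *p))
            p--;
        }
    }
  return p;
}
```

Walking from the nearer end bounds the scan at roughly half the code points in the worst case, and the ASCII comparison avoids the scan entirely. Both shortcuts rely on the caller supplying lengths that match the buffer.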

Thank you, Matthias, for your time and attention.

I am sincere in requesting your advice on how best to proceed.
