Re: Reducing unncessary string copying



2012/2/20, Enrico Weigelt <weigelt metux de>:
>
> Hi folks,
>
>
> I've observed in many C applications, often compile-time defined
> data like strings are copied unnecessarily.
>
> For example, if you pass names to certain objects (eg. opening a
> file, creating a window, etc) that will never change of disappear
> during the process' lifetime (eg. compiled-in), those strings
> dont need to be copied.
>

Static strings usually don't occupy much memory. For example, there is
approximatively 138 kilobytes of static strings in nautilus 2.32:
$ readelf -p '.rodata' /usr/bin/nautilus|sed 's/^[^]]*\]  //'|wc -c
138623

However, string space is probably wasted because MULTIPLE copies of
the same string (static or not) tend to be created.
Some string implementations are reference-counted, avoiding useless
copies, but, this adds some pointer indirection overhead and increase
the base memory structure size, and so, is not significantly better
for short strings. For long strings (hundred of bytes to kilobytes),
this is good, but, most GTK+ strings end up to being quite short.

> Now the big question becomes: how to decide this ?
>
> I'm currently experimenting with a new string reference datatype
> that has an extra bit which tells whether the string is static.
> Essentially a struct consisting of the char* pointer and an
> extra byte for the flag. That struct then will be used instead
> of the const char* pointers. A few inline functions handle the
> conversion from/to normal const char* and on-demand copying.
>

This adds some level of indirection to access the data bytes, and some
space overhead.
For example, the string "hello" is 6 bytes long (including the zero
terminator), with GLIBC's malloc, it may occupy 8 bytes (or maybe 16
bytes).
With your implementation, it would occupy twice as much memory.
Moreover, the binary code using strings would be larger of a few bytes too.

> Just some theoretical example:
>
> Some function
>
>     FOO* create_foo(const char* name)
>
> now becomes
>
>     FOO* create_foo(GCStr name)
>
> and the call now would be
>
>     create_foo(G_CSTR_STATIC("hello world"));
>
> in case we dont have an "static" string, but something with
> limited lifetime, it could look like this:
>
>     create_foo(G_CSTR_DYNAMIC(bar));
>
> Inside create_foo() we'll then replace g_strdup() by some
> G_CSTR_COPY() and g_free() by G_CSTR_FREE(). These functions
> will know whether the string has to be copied/free'd.

And so, once g_strdup() has been called, a new dynamic string is
created. Consequently, very few strings would be static.

>
>
> Let's just leave the question of API-compatibility aside,
> what do you think about this idea ?
>

In my opinion, the memory gains would be smaller than the loss, and a
significant performance overhead would incur.

Moreover, I can think of a better implementation (still not worth it,
in my opinion).
Storing the string as an array of bytes. The first byte of the array
would be the "static" bit, and the rest would be the string data. The
actual GStr* object would be a pointer to the second byte of the array
(i.e. the first data byte of the string).
This would make declaring static strings a bit trickier.

-- 
André Gillibert


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]