Copy On Write
Copy-on-write (COW) is an important optimization technique that is widely used in file systems, operating systems and data structures. The idea comes from lazy copying. When multiple objects, tasks or processes each want their own copy of the same resource, it is not necessary to create the copies up front. Instead, a single copy of the resource is shared among the owners, and the resource is copied only when one of the owners wants to modify it. COW works well when copying the resource is expensive or modification rarely happens.
A very successful example of COW is fork() in Linux. Conceptually, when a new process is created, fork() makes a copy of the parent's text, data, heap and stack segments. However, actually copying the parent's virtual memory pages into the new child process would be wasteful, because fork() is often followed by an immediate exec(), which replaces the process's program with a new one and reinitializes the process's memory. Most modern Linux implementations use COW to handle this case. After fork(), the child process and the parent process share the same physical memory pages. The kernel traps any attempt by either the parent or the child to modify one of these pages and makes a duplicate of the about-to-be-modified page. The new page copy is assigned to the faulting process, and the corresponding page-table entry for the child process is adjusted appropriately. From this point on, the parent and the child each have a private copy of the page, and modifications made by one process are invisible to the other.
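The visible effect of this behaviour can be checked from user space. The following is a minimal sketch, assuming a POSIX system; the function name parent_value_after_child_write is made up for this illustration. Note that a program can only observe the isolation between parent and child, not the page sharing itself, which is an internal matter of the kernel.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: returns the parent's view of `value` after a forked
// child has overwritten its own view of the same variable.
// Returns -1 if fork() or waitpid() fails.
int parent_value_after_child_write() {
    int value = 1;                  // visible to both processes after fork()
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                  // fork failed
    }
    if (pid == 0) {
        value = 2;                  // the child writes; conceptually this is where
                                    // the kernel gives the child a private page
        _exit(value == 2 ? 0 : 1);  // leave the child without running atexit handlers
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return value;                   // the child's write is not visible here
}
```

Calling parent_value_after_child_write() in the parent yields 1 when fork() succeeds, because the child's assignment only affects the child's copy.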
COW in String Implementation
COW has also been used to implement string classes, for two reasons:
- Copying a string is expensive: it involves a heap allocation and a memory copy.
- Strings are modified infrequently.
Because the actual data is shared among multiple string objects, the string class needs a reference count. As in the implementation of shared_ptr, the reference count itself must also be shared by all the string objects that share the data. When the first string is constructed, the reference count is initialized to 1. When the string is copied through the copy constructor or assignment, the reference count is incremented by 1. When a string is destroyed, the reference count is decremented by 1. When the reference count reaches 0, the data is freed from the heap because no string object holds it any longer.
COW happens when a mutating function such as insert(), replace() or the non-const operator[] is called on a string whose data is shared. A simple implementation looks like this:
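if (*_refcount > 1) {                  // the data is shared with other strings
    char* p = new char[_len + 1];      // allocate a private buffer
    memcpy(p, _data, _len + 1);        // copy the characters and the terminating '\0'
    --(*_refcount);                    // leave the shared buffer to the other owners
    _data = p;                         // switch to the private buffer
    _refcount = new size_t;            // the private buffer gets its own count
    (*_refcount) = 1;                  // this string is its only owner
}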
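The reference-count rules and the detach step above can be put together in one small class. The following is a minimal sketch, not the code of any real library: the class name CowString and the helpers release(), detach() and use_count() are hypothetical, and the class is deliberately neither thread safe nor exception safe.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical minimal copy-on-write string, for illustration only.
class CowString {
public:
    explicit CowString(const char* s)
        : _len(std::strlen(s)),
          _data(new char[_len + 1]),
          _refcount(new std::size_t(1)) {        // first owner: count starts at 1
        std::memcpy(_data, s, _len + 1);
    }

    CowString(const CowString& other)
        : _len(other._len), _data(other._data), _refcount(other._refcount) {
        ++*_refcount;                            // share the buffer, no allocation
    }

    CowString& operator=(const CowString& other) {
        if (this != &other) {
            ++*other._refcount;                  // take a share of the new buffer first
            release();                           // then give up the old one
            _len = other._len;
            _data = other._data;
            _refcount = other._refcount;
        }
        return *this;
    }

    ~CowString() { release(); }

    char operator[](std::size_t i) const { return _data[i]; }  // read: no copy
    char& operator[](std::size_t i) {                          // possible write
        detach();
        return _data[i];
    }

    const char* data() const { return _data; }
    std::size_t use_count() const { return *_refcount; }

private:
    // Drop one reference and free the buffer when the last owner is gone.
    void release() {
        if (--*_refcount == 0) {
            delete[] _data;
            delete _refcount;
        }
    }

    // Copy on write: make the buffer private before it is modified.
    void detach() {
        if (*_refcount > 1) {
            char* p = new char[_len + 1];
            std::memcpy(p, _data, _len + 1);
            --*_refcount;
            _data = p;
            _refcount = new std::size_t(1);
        }
    }

    std::size_t _len;
    char* _data;
    std::size_t* _refcount;
};
```

With this sketch, copying a CowString leaves data() identical in both objects, and the first write through the non-const operator[] gives the written object a buffer of its own. Because the non-const operator[] cannot tell a read from a write, it detaches even when the caller only reads through it.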
Testing Code
To test whether string is implemented with the copy-on-write mechanism, I wrote a simple test program. A string can be copied in only two ways: by assignment and by copy construction. In the test program, str2 is assigned from str1 and str3 is copy constructed from str2. The address of the data is printed before and after each modification. So what is the result? Let's build the test program with gcc 4.4 on Linux and with vc2012 on Windows and run it.
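#include <iostream>
#include <string>

int main()
{
    std::string str1 = "123456";
    std::string str2;
    str2 = str1;               // copy by assignment
    std::string str3(str2);    // copy by copy constructor
    std::cout << "str1's data address is " << static_cast<const void*>(str1.data()) << std::endl;
    std::cout << "str2's data address is " << static_cast<const void*>(str2.data()) << std::endl;
    std::cout << "str3's data address is " << static_cast<const void*>(str3.data()) << std::endl;
    str1[0] = '7';             // non-const operator[] on str1
    std::cout << "str1's data address is " << static_cast<const void*>(str1.data()) << std::endl;
    std::cout << "str2's data address is " << static_cast<const void*>(str2.data()) << std::endl;
    std::cout << "str3's data address is " << static_cast<const void*>(str3.data()) << std::endl;
    str2[1] = '8';             // non-const operator[] on str2
    std::cout << "str1's data address is " << static_cast<const void*>(str1.data()) << std::endl;
    std::cout << "str2's data address is " << static_cast<const void*>(str2.data()) << std::endl;
    std::cout << "str3's data address is " << static_cast<const void*>(str3.data()) << std::endl;
    return 0;
}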
Testing Result
gcc 4.4
str1's data address is 0x2497028
str2's data address is 0x2497028
str3's data address is 0x2497028
str1's data address is 0x2497058
str2's data address is 0x2497028
str3's data address is 0x2497028
str1's data address is 0x2497058
str2's data address is 0x2497088
str3's data address is 0x2497028
vc2012
str1's data address is 0040FCBC
str2's data address is 0040FC98
str3's data address is 0040FC74
str1's data address is 0040FCBC
str2's data address is 0040FC98
str3's data address is 0040FC74
str1's data address is 0040FCBC
str2's data address is 0040FC98
str3's data address is 0040FC74
The gcc results are very different from the vc2012 results. With gcc, the data addresses of str1, str2 and str3 are identical before any modification. str1's address changes when the non-const operator[] is called on it, and str2's address changes in the same way when str2 is modified, while str3 keeps the original buffer. This is exactly what copy-on-write does: memory is allocated and the data is copied only when a non-const function is called. With vc2012, the data addresses of the three strings already differ right after the assignment and the copy construction, which suggests that vc2012 does not use COW to implement string. To confirm this, let's look at the string source code in gcc and vc2012.
String Implementation
GCC(4.4)
// _Rep: string representation
//   Invariants:
//     1. String really contains _M_length + 1 characters: due to 21.3.4
//        must be kept null-terminated.
//     2. _M_capacity >= _M_length
//        Allocated memory is always (_M_capacity + 1) * sizeof(_CharT).
//     3. _M_refcount has three states:
//         -1: leaked, one reference, no ref-copies allowed, non-const.
//          0: one reference, non-const.
//        n>0: n + 1 references, operations require a lock, const.
//     4. All fields==0 is an empty string, given the extra storage
//        beyond-the-end for a null terminator; thus, the shared
//        empty string representation needs no constructor.
struct _Rep_base
{
  size_type    _M_length;
  size_type    _M_capacity;
  _Atomic_word _M_refcount;
};
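The _Rep_base structure has three member variables: length, capacity and reference count. The presence of a reference count confirms that gcc uses COW to implement string. Note the convention described in the comment: a count of 0 means one reference, and a count of n > 0 means n + 1 references.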
template<typename _CharT, typename _Traits, typename _Alloc>
  typename basic_string<_CharT, _Traits, _Alloc>::_Rep*
  basic_string<_CharT, _Traits, _Alloc>::_Rep::
  _S_create(size_type __capacity, size_type __old_capacity,
            const _Alloc& __alloc)
  {
    // _GLIBCXX_RESOLVE_LIB_DEFECTS
    // 83. String::npos vs. string::max_size()
    if (__capacity > _S_max_size)
      __throw_length_error(__N("basic_string::_S_create"));

    // The standard places no restriction on allocating more memory
    // than is strictly needed within this layer at the moment or as
    // requested by an explicit application call to reserve().
    // Many malloc implementations perform quite poorly when an
    // application attempts to allocate memory in a stepwise fashion
    // growing each allocation size by only 1 char. Additionally,
    // it makes little sense to allocate less linear memory than the
    // natural blocking size of the malloc implementation.
    // Unfortunately, we would need a somewhat low-level calculation
    // with tuned parameters to get this perfect for any particular
    // malloc implementation. Fortunately, generalizations about
    // common features seen among implementations seems to suffice.
    // __pagesize need not match the actual VM page size for good
    // results in practice, thus we pick a common value on the low
    // side. __malloc_header_size is an estimate of the amount of
    // overhead per memory allocation (in practice seen N * sizeof
    // (void*) where N is 0, 2 or 4). According to folklore,
    // picking this value on the high side is better than
    // low-balling it (especially when this algorithm is used with
    // malloc implementations that allocate memory blocks rounded up
    // to a size which is a power of 2).
    const size_type __pagesize = 4096;
    const size_type __malloc_header_size = 4 * sizeof(void*);

    // The below implements an exponential growth policy, necessary to
    // meet amortized linear time requirements of the library: see
    // http://gcc.gnu.org/ml/libstdc++/2001-07/msg00085.html.
    // It's active for allocations requiring an amount of memory above
    // system pagesize. This is consistent with the requirements of the
    // standard: http://gcc.gnu.org/ml/libstdc++/2001-07/msg00130.html
    if (__capacity > __old_capacity && __capacity < 2 * __old_capacity)
      __capacity = 2 * __old_capacity;

    // NB: Need an array of char_type[__capacity], plus a terminating
    // null char_type() element, plus enough for the _Rep data structure.
    // Whew. Seemingly so needy, yet so elemental.
    size_type __size = (__capacity + 1) * sizeof(_CharT) + sizeof(_Rep);

    const size_type __adj_size = __size + __malloc_header_size;
    if (__adj_size > __pagesize && __capacity > __old_capacity)
      {
        const size_type __extra = __pagesize - __adj_size % __pagesize;
        __capacity += __extra / sizeof(_CharT);
        // Never allocate a string bigger than _S_max_size.
        if (__capacity > _S_max_size)
          __capacity = _S_max_size;
        __size = (__capacity + 1) * sizeof(_CharT) + sizeof(_Rep);
      }

    // NB: Might throw, but no worries about a leak, mate: _Rep()
    // does not throw.
    void* __place = _Raw_bytes_alloc(__alloc).allocate(__size);
    _Rep *__p = new (__place) _Rep;
    __p->_M_capacity = __capacity;
    // ABI compatibility - 3.4.x set in _S_create both
    // _M_refcount and _M_length. All callers of _S_create
    // in basic_string.tcc then set just _M_length.
    // In 4.0.x and later both _M_refcount and _M_length
    // are initialized in the callers, unfortunately we can
    // have 3.4.x compiled code with _S_create callers inlined
    // calling 4.0.x+ _S_create.
    __p->_M_set_sharable();
    return __p;
  }
Where is the reference count stored? The answer is in the allocation code used by the constructor, shown above. __size is the size of the character data (including the terminating null character) plus the size of _Rep. A single buffer of __size bytes is allocated, and the _Rep is then constructed with placement new at the beginning of that buffer.
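To make the size arithmetic concrete, the computation of __size in _S_create can be replayed without allocating anything. This is only a sketch: the function name rep_alloc_size is made up, the _S_max_size checks are left out, and sizeof(_CharT), sizeof(_Rep) and the malloc header estimate are passed in as parameters because their real values depend on the platform.

```cpp
#include <cstddef>

// Hypothetical helper that mirrors the size arithmetic of _S_create.
// char_size stands for sizeof(_CharT), rep_size for sizeof(_Rep), and
// malloc_header_size for the 4 * sizeof(void*) estimate in the source.
std::size_t rep_alloc_size(std::size_t capacity, std::size_t old_capacity,
                           std::size_t char_size, std::size_t rep_size,
                           std::size_t malloc_header_size) {
    const std::size_t pagesize = 4096;

    // Exponential growth: a growing request is at least doubled.
    if (capacity > old_capacity && capacity < 2 * old_capacity)
        capacity = 2 * old_capacity;

    // Characters plus terminator, plus the _Rep header stored in front of them.
    std::size_t size = (capacity + 1) * char_size + rep_size;

    const std::size_t adj_size = size + malloc_header_size;
    if (adj_size > pagesize && capacity > old_capacity) {
        // Large growing requests are padded up to a whole number of pages.
        const std::size_t extra = pagesize - adj_size % pagesize;
        capacity += extra / char_size;
        size = (capacity + 1) * char_size + rep_size;
    }
    return size;
}
```

Assuming a 24-byte _Rep and a 32-byte header estimate (plausible values for a 64-bit build, but an assumption here), the six-character string from the test program needs rep_alloc_size(6, 0, 1, 24, 32) = 31 bytes, while a request for 5000 characters is padded to 8160 bytes so that the block plus the estimated malloc header fills exactly two pages.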
In the figure above, str1, str2 and str3 share the same data buffer. The beginning of the buffer stores the length, capacity and reference count, and the character data starts right after this representation. So why is the representation stored at the beginning of the buffer rather than at the end? The advantage is that the representation does not have to be moved when the capacity of the string increases, which improves performance.
_CharT*
_M_data() const
{ return _M_dataplus._M_p; }

_CharT*
_M_data(_CharT* __p)
{ return (_M_dataplus._M_p = __p); }

_Rep*
_M_rep() const
{ return &((reinterpret_cast<_Rep*> (_M_data()))[-1]); }
Because _Rep is stored immediately before the character data, _M_rep() reaches it by casting the data pointer to _Rep* and indexing it with -1.
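The layout and the negative index can be tried out in isolation. The following is a simplified sketch, assuming a hypothetical Rep struct and the made-up helpers create(), rep_of() and destroy(); it uses plain operator new instead of the allocator and ignores capacity growth.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical stand-in for _Rep_base: the bookkeeping stored before the characters.
struct Rep {
    std::size_t length;
    std::size_t capacity;
    int refcount;
};

// Allocates header and characters in one block and returns a pointer to the
// characters, which is the only pointer the string object would keep.
char* create(const char* s) {
    const std::size_t len = std::strlen(s);
    void* place = ::operator new(sizeof(Rep) + len + 1);
    Rep* rep = new (place) Rep;                     // header at the start of the block
    rep->length = len;
    rep->capacity = len;
    rep->refcount = 0;                              // gcc's convention: 0 means one reference
    char* data = reinterpret_cast<char*>(rep + 1);  // same idea as _M_refdata()
    std::memcpy(data, s, len + 1);
    return data;
}

// Recovers the header from the data pointer, in the same way as _M_rep().
Rep* rep_of(char* data) {
    return &reinterpret_cast<Rep*>(data)[-1];
}

// Frees the single block that holds both the header and the characters.
void destroy(char* data) {
    ::operator delete(rep_of(data));
}
```

For a pointer p returned by create("123456"), rep_of(p)->length is 6 and the header sits exactly sizeof(Rep) bytes before p. Keeping only the data pointer makes the string object as small as a single pointer.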
_CharT*
_M_refdata() throw()
{ return reinterpret_cast<_CharT*>(this + 1); }

_CharT*
_M_grab(const _Alloc& __alloc1, const _Alloc& __alloc2)
{
  return (!_M_is_leaked() && __alloc1 == __alloc2)
          ? _M_refcopy() : _M_clone(__alloc1);
}

_CharT*
_M_refcopy() throw()
{
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
  if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
    __gnu_cxx::__atomic_add_dispatch(&this->_M_refcount, 1);
  return _M_refdata();
} // XXX MT

void
_M_dispose(const _Alloc& __a)
{
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
  if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
    if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
                                               -1) <= 0)
      _M_destroy(__a);
} // XXX MT
The reference count is incremented when a string is copied and decremented when a string is destroyed. Because a count of 0 means one reference, _M_dispose frees the data when the value before the decrement is less than or equal to 0, that is, when the last reference goes away. The count is updated with atomic read-modify-write functions (__atomic_add_dispatch and __exchange_and_add_dispatch). Compared with protecting the count with a traditional, expensive mutex lock, these atomic operations perform much better.
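The same counting scheme can be sketched with the standard std::atomic type instead of the internal __gnu_cxx helpers. The class name SharedCount and its member functions are hypothetical, and the memory orderings are one reasonable choice rather than what libstdc++ uses internally.

```cpp
#include <atomic>

// Hypothetical counter that follows the convention of _M_refcount:
// the stored value is the number of owners minus one.
class SharedCount {
public:
    SharedCount() : _count(0) {}  // 0 means a single owner

    // Called when another string starts sharing the buffer (compare _M_refcopy).
    void add_ref() { _count.fetch_add(1, std::memory_order_relaxed); }

    // Called when a string lets go of the buffer (compare _M_dispose).
    // Returns true if the caller was the last owner and must free the buffer.
    bool release() {
        // fetch_sub returns the value held before the decrement.
        return _count.fetch_sub(1, std::memory_order_acq_rel) <= 0;
    }

    // Number of strings currently sharing the buffer.
    int owners() const { return _count.load(std::memory_order_acquire) + 1; }

private:
    std::atomic<int> _count;
};
```

Each call is a single atomic read-modify-write operation, so two threads that copy or destroy strings sharing one buffer cannot lose an update to the count. This protects only the count; it does not make concurrent writes to the same string object safe.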
VC2012
_Myt& operator=(const _Myt& _Right)
    { // assign _Right
    if (this != &_Right)
        { // different, assign it
#if _HAS_CPP0X
        if (this->_Getal() != _Right._Getal()
            && _Alty::propagate_on_container_copy_assignment::value)
            { // change allocator before copying
            _Tidy(true);
            this->_Change_alloc(_Right._Getal());
            }
#endif /* _HAS_CPP0X */
        assign(_Right);
        }
    return (*this);
    }

_Myt& assign(const _Myt& _Right, size_type _Roff, size_type _Count)
    { // assign _Right [_Roff, _Roff + _Count)
    if (_Right.size() < _Roff)
        _Xran(); // _Roff off end
    size_type _Num = _Right.size() - _Roff;
    if (_Count < _Num)
        _Num = _Count; // trim _Num to size
    if (this == &_Right)
        erase((size_type)(_Roff + _Num)), erase(0, _Roff); // substring
    else if (_Grow(_Num))
        { // make room and assign new stuff
        _Traits::copy(this->_Myptr(), _Right._Myptr() + _Roff, _Num);
        _Eos(_Num);
        }
    return (*this);
    }
The assignment function contains no reference count. _Grow(_Num) makes room in the buffer, where _Num is the length of the data, and _Traits::copy then copies the data.
static _Elem *__CLRCALL_OR_CDECL copy(_Elem *_First1, const _Elem *_First2, size_t _Count)
    { // copy [_First2, _First2 + _Count) to [_First1, ...)
    return (_Count == 0 ? _First1 : (_Elem *)_CSTD memcpy(_First1, _First2, _Count));
    }
The copy function simply calls memcpy() to copy the data, so every copy of a string in vc2012 is a deep copy.
More with COW
So why doesn't vc2012 use COW to implement string? Let's look at the pros and cons of COW.
Pros
- It reduces the latency of copying an owner of the resource.
- It avoids unnecessary resource allocation and copying. For example, fork() is often followed by exec(), so COW improves performance in Linux.
Cons
- The latency that COW tries to remove comes from allocating and copying the resource. If the data does have to be modified, that latency is not removed; it is only deferred until the modification. Depending on the situation, a slow modification may be worse than a slow copy, for instance when the modification is the more time-critical operation. A typical low-latency system allocates its memory at the beginning of the program to avoid allocating on the critical path. With COW, memory is allocated at run time when a shared resource is first modified, which may decrease the whole system's throughput and increase its latency.
- Thread safety. Is std::string thread safe? It depends. If multiple threads read and write the same string object, std::string is not thread safe, just like the other STL containers. Before C++11, the C++ standard said nothing about the thread safety of std::string, because thread safety implies synchronization, which costs performance. According to [3], locking or synchronization should be done by the code that owns and manipulates the string object. However, if each thread reads and writes its own separate string object, the program must be thread safe. COW makes separate string objects share the same data, and the reference count determines when to copy and when to delete that data. The reference count must therefore be synchronized to guarantee that strings are safe to use in a multithreaded environment.
Other String Implementations
References
[1] http://stackoverflow.com/questions/12520192/is-stdstring-refcounted-in-gcc-c11
[2] http://stackoverflow.com/questions/1466073/how-is-stdstring-implemented
[3] http://www.gotw.ca/publications/optimizations.htm
[4] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2534.html
[5] http://en.wikipedia.org/wiki/Copy-on-write
[6] http://blog.youkuaiyun.com/haoel/article/details/24058
[7] http://www.cnblogs.com/promise6522/archive/2012/03/22/2412686.html
[8] http://www.cnblogs.com/promise6522/archive/2012/06/05/2535530.html
[9] http://www.cnblogs.com/Solstice/archive/2012/03/17/2403335.html
[10] http://cloud.github.com/downloads/chenshuo/documents/CppPractice.pdf
[11] Scott Meyers, Effective STL, Item 15
[12] Michael Kerrisk, The Linux Programming Interface, Chapter 24