Segmentation Fault In Recurisve Structure With Smart Pointer

A singly linked list implemented with smart pointer, which has a reference to next, can easily cause Stack Overflow at its destruction because the destructor of the next, and the next, and the next, are called recursively.

Here's the code.

Code:

#include <boost/shared_ptr.hpp>
#include <iostream>
 
#define RECURSIVE
 
class IntSLLNode {
 
private:
  int index;
  boost::shared_ptr<IntSLLNode> next;
 
  IntSLLNode(int ix){
    index = ix;
  }
 
public:
  int hasNext(){return next != 0;}
  boost::shared_ptr<IntSLLNode>& moveNext(){return next;}
  int getIndex(){return index;}
  friend class IntSLList;
};
 
class IntSLList {
private:
  boost::shared_ptr<IntSLLNode> head, tail;
public:  
  ~IntSLList(){
#ifndef RECURSIVE
    while(head != tail){
      // replace
      head = head->next;
    }
#endif
  }
 
  void addNode(int index){
 
    boost::shared_ptr<IntSLLNode> n(new IntSLLNode(index));
    if(head == 0){
      head = tail =n;
    }
    else
    {
      tail->next = n;
      tail = n;
    }
 
  }
 
  boost::shared_ptr<IntSLLNode>& getHeadNode(){return head;}
};
 
int main(){
  IntSLList list;
  // the number to reproduce crush may vary, but seems to be stable
  // On my Core2Duo box, it is around 65000
  // Here I dare to use a value large enough
  for(int i=0; i<1000000; i++){
    list.addNode(i);
  }
  // crush at destruction
  std::cout << "CRUSH?" << std::endl;
}


This works by commenting out the line 4.

For your own smart pointer, the point is line 7 in the following code.

Code:

SmartPointer& operator=(const SmartPointer& src){
    if(this != &src){
      if(counter->decrement() == 0){
        delete counter;
        //src.pointee_ = 0; // NOT DESTRUCTIVE
        counter = src.counter;
        counter->increment(); // increment before deleting pointee_, which might have a reference to src. This results in a temporal no referencer, and triggers recursive destruction
        if(pointee_) delete pointee_;
        pointee_ = src.pointee_;
      }else{
        pointee_ = src.pointee_;
        //src.pointee_ = 0; // NOT DESTRUCTIVE
        counter = src.counter;
        counter->increment();
      }
    }
    return *this;
  }

If you call delete pointee_ before incrementing the counter of src, it triggers recursive destruction in the above case.

For a large graph cached in memory, it's good to have some kind of protection or grouping methods.

So far, so bad.

I found a couple of things. A couple of bad things.

1. shared_ptr is not thread safe * it was stack overflow by recursive destruction

Here's the code to reproduce. The reason follows.

SPIntSLList.h

Code:

#ifndef INT_SINGLY_LINKEDLIST
#define INT_SINGLY_LINKEDLIST
 
#include <boost/shared_ptr.hpp>
 
class SPIntSLLNode {
 
private:
  int index;
  long int payload[4];
  boost::shared_ptr<SPIntSLLNode> next;
 
  SPIntSLLNode(int ix){
    index = ix;
  }
 
public:
  int hasNext(){return next != 0;}
  boost::shared_ptr<SPIntSLLNode>& moveNext(){return next;}
  int getIndex(){return index;}
  friend class SPIntSLList;
};
 
class SPIntSLList {
private:
  boost::shared_ptr<SPIntSLLNode> head, tail;
public:  
  ~SPIntSLList();
 
  void addNode(int index);
  boost::shared_ptr<SPIntSLLNode>& getHeadNode(){return head;}
};
 
#endif

SPIntSLList.cpp

Code:

#include <iostream>
#include <time.h>
#include "SPIntSLList.h"
 
SPIntSLList::~SPIntSLList(){
  // no need iterative deletion
}
 
void SPIntSLList::addNode(int index){
  boost::shared_ptr<SPIntSLLNode> n(new SPIntSLLNode(index));
  if(head == 0){
    head = n;
    tail = n;
  }
  else
  {
    tail->next = n;
    tail = n;
  }
}
 
int main(){
 
  std::cout << "sizeof(SPIntSLLNode) = " << sizeof(SPIntSLLNode) << std::endl;
 
  clock_t start, end;
 
  std::cout << "creating huge large SPIntSLList" << std::endl;
  start = clock();
  SPIntSLList huge;
  std::cout << "creating huge SPIntSLListNode" << std::endl;
  // 157160: segmetation fault boarder when load was evaluated
  for(int i=0; i<1000000; i++){
    huge.addNode(i);
    // n, temporal PSP is destructed here
  }
  end = clock();
  std::cout << "creation time: " << (end-start) << " clocks(" << (double) (end-start)/CLOCKS_PER_SEC << " sec)" << std::endl;
 
  std::cout << "iterating over huge SPIntSLListNode" << std::endl;
  start = clock();
  // the following code calls copy constructor instead of assignment operator
  boost::shared_ptr<SPIntSLLNode> current = huge.getHeadNode();
  while(current->hasNext()){
    if(current->getIndex() % 10000 == 0)
      std::cout << "index:" << current->getIndex() << std::endl;
    current = current->moveNext();
  }
  end = clock();
  std::cout << "iteration time: " << (end-start) << " clocks(" << (double) (end-start)/CLOCKS_PER_SEC << " sec)" << std::endl;
  std::cout << "completed huge SPIntSLList" << std::endl;
}

Then, a wonderful gift.

Segmentation fault (core dumped)

And here's the stack trace.

Code:

Program terminated with signal 11, Segmentation fault.
#0  0x0000000000401227 in boost::detail::atomic_exchange_and_add (pw=Cannot access memory at address 0x7fff2ee2dff8
)
    at /home/takenori/workspace_native/Boost/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:36
36      {
(gdb) bt
#0  0x0000000000401227 in boost::detail::atomic_exchange_and_add (pw=Cannot access memory at address 0x7fff2ee2dff8
)
    at /home/takenori/workspace_native/Boost/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:36
#1  0x000000000040137b in boost::detail::sp_counted_base::release (
    this=0xd921130)
    at /home/takenori/workspace_native/Boost/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:143
#2  0x000000000040142b in boost::detail::shared_count::~shared_count (
    this=0xd9210c0, __in_chrg=<value optimized out>)
    at /home/takenori/workspace_native/Boost/boost/smart_ptr/detail/shared_count.hpp:217
#3  0x00000000004014be in boost::shared_ptr<SPIntSLLNode>::~shared_ptr (
    this=0xd9210b8, __in_chrg=<value optimized out>)
    at /home/takenori/workspace_native/Boost/boost/smart_ptr/shared_ptr.hpp:163

They have the same mistake as mine.

Code:

~shared_count() // nothrow
    {
       if( pi_ != 0 ) pi_->release();
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
        id_ = 0;
#endif
    }

After "pi_ != 0", pi is not guarranteed to stay valid. It is possible to delete after its evaluation.

The last two referencing threads can enter release().

Code:

void release() // nothrow
    {
        if( atomic_exchange_and_add( &use_count_, -1 ) == 1 )
        {
            dispose();
            weak_release();
        }
    }

Then, one comes late experiences Segmentation Fault on $use_count_.

2. Atomic types and operations are not as fast as RCU(read copy update)

I expected this(from the presentation by liburcu).

But the performance of lock-free reference counting smart pointer is as slow as mutex, on my dual core linux box.

The difference might be less significant because it's only dual core on 1 CPU. But I doubt it.

The test was like this.

Code:

#include "TsPersistentSmartPointer.h"
#include <iostream>
#include <time.h>
 
#define MAX_THREADS  100
#define ITERATION  1000
 
class Foo{
public:
  Foo(int id) : id_(id){std::cout << id_ << ":constructer called" << std::endl;}
  ~Foo(){std::cout << id_ << ":destructer called" << std::endl;}
 
  void SayHello(){
    std::cout << "Hello, this is " << id_ << std::endl;
  }
private:
  int id_;
};
 
void *copy(void *arg){
  // 1. copy, adding a reference
  TsPersistentSmartPointer<Foo> psp = *((TsPersistentSmartPointer<Foo>*) arg);
  TsPersistentSmartPointer<Foo> pspCopied = psp;
 
  // 2. removing a reference when exiting
}
 
int main(){
  
  TsPersistentSmartPointer<Foo> psp(new Foo(3));
  pthread_t threads[MAX_THREADS];
  int rt;
 
  clock_t start, end;
  std::cout << "Executing..." << std::endl;
  start = clock();
  for(int i=0; i<ITERATION; i++)
  {
    for(int t=0; t<MAX_THREADS; t++){
      rt = pthread_create(&threads[t], NULL, copy, (void*) &psp);
      if(rt != 0){
        std::cout << "Failed to create thread " << t << std::endl;
      }
    }
 
    for(int t=0; t<MAX_THREADS; t++){
      rt = pthread_join(threads[t], NULL);
      if(rt != 0){
        std::cout << "Failed to join thread " << t << std::endl;
      }
    }
  }
 
  end = clock();
  std::cout << "execution time: " << (end-start) << " clocks(" << (double) (end-start)/CLOCKS_PER_SEC << " sec)" << std::endl;
}

This is heavily read intensive test. But each read on reference counting SmartPointer requires incrementing the counter, and decrementing when exiting. So it is deadly heavy write intensive test. It would run as low as Mutex in the figure above.

liburcu is implemented with architecture specific assembly code. But what's the difference? That's what I ought to find...

Lock-free reference counting smart pointer

With GCC 4.4, I tested how atomic types and operations of c++0x work by implementing lock-free reference counting smart pointer.

TsPersistentSmartPointer.h

Code:

#ifndef PERSISTENT_SMART_POINTER
#define PERSISTENT_SMART_POINTER
 
#include <cstdatomic>
 
class Counter {
public:
  Counter():i(1){}
  ~Counter(){}
 
  void increment(){
    i.fetch_add( 1, std::memory_order_relaxed );
  }
  int decrement(){
    return i.fetch_sub( 1, std::memory_order_relaxed );
  }
  void reset(){
    i.store(0, std::memory_order_relaxed);
  }
private:
  std::atomic_uint i;
};
 
template <class T>
class TsPersistentSmartPointer {
public:
  explicit TsPersistentSmartPointer(T* pointee) : pointee_(pointee), counter(new Counter()){
  }
  TsPersistentSmartPointer() : pointee_(0), counter(new Counter()){
  }; // for null construction, has to be copied from somewhere else
  ~TsPersistentSmartPointer(){
    if(counter->decrement() == 0){
      delete counter;
      delete pointee_;
    }
  }
 
  T& operator*() const{
    return *pointee_;
  }
 
  T* operator->() const{
    return pointee_;
  }
 
  // assignment
  TsPersistentSmartPointer(const TsPersistentSmartPointer& src){
      pointee_ = src.pointee_;
      counter = src.counter;
      counter->increment();
      //src.pointee_ = 0; // NOT DESTRUCTIVE
  }
  TsPersistentSmartPointer& operator=(const TsPersistentSmartPointer& src){
    if(this != &src){
      if(counter->decrement() == 0){
        delete counter;
        if(pointee_) delete pointee_;
      }
      pointee_ = src.pointee_;
      //src.pointee_ = 0; // NOT DESTRUCTIVE
      counter = src.counter;
      counter->increment();
    }
    return *this;
  }
  TsPersistentSmartPointer& operator=(T* pointee){
    if(pointee_ != pointee){
      if(pointee_)
        delete pointee_;
      pointee_ = pointee;
      counter->reset();
    }
    return *this;
  }
 
  // equality
  inline friend bool operator==(const TsPersistentSmartPointer& psp, const T* pointee){
    return psp.pointee_ == pointee;
  }
  inline friend bool operator==(const T* pointee, const TsPersistentSmartPointer& psp){
    return pointee == psp.pointee_;
  }
  // inequality
  inline friend bool operator!=(const TsPersistentSmartPointer& psp, const T* pointee){
    return psp.pointee_ != pointee;
  }
  inline friend bool operator!=(const T* pointee, const TsPersistentSmartPointer& psp){
    return pointee != psp.pointee_;
  }
 
private:
  T* pointee_;
  Counter* counter;
};
 
#endif

Atomic operations are implemented inside Counter class. The point is the following part, the destructor of TsPersistentSmartPointer:

Code:

if(counter->decrement() == 0){
      delete counter;
      delete pointee_;
    }

At first, I did the following, which resulted in Segmentation Fault under a million of concurrent destructions.

Code:

counter->decrement();
    if(counter->get() == 0){
      delete counter;
      delete pointee_;
    }

The problem could happen when the last two enters concurrently.

Thread1: counter->decrement(); // i = 1
Thread2: counter->decrement(); // i = 0
Thread2: counter->get() == 0 // TRUE
Thread1: counter->get() == 0 // TRUE!
Thread1: delete counter;
Thread2: delete counter; // OUCH!

Here're some more code for testing.

TsPersistentSmartPointer.cpp

Code:

#include "TsPersistentSmartPointer.h"
#include <iostream>
 
class Foo{
public:
  Foo(int id) : id_(id){std::cout << id_ << ":constructer called" << std::endl;}
  ~Foo(){std::cout << id_ << ":destructer called" << std::endl;}
 
  void SayHello(){
    std::cout << "Hello, this is " << id_ << std::endl;
  }
private:
  int id_;
};
 
int main(){
 
  Foo f0(0);
  Foo *f1 = new Foo(1);
  Foo *f2 = new Foo(2); // memory leak
  Foo *f3 = new Foo(3);
  TsPersistentSmartPointer<Foo> psp(0);
 
  // copy
  psp = f3;
 
  // dereference and indirection
  psp->SayHello();
  (*psp).SayHello();
 
  // equality
  std::cout << "equality works? :" << ((psp == f3) & (f3 == psp)) << std::endl;
 
  // copy constructor
  TsPersistentSmartPointer<Foo> pspCopied(psp);
  // dereference and indirection
  pspCopied->SayHello();
  (*pspCopied).SayHello();
 
  // see how destructors are called
  delete f1;
 
  return 0;
}

TsPSPIntSLList.h

Code:

#ifndef INT_SINGLY_LINKEDLIST
#define INT_SINGLY_LINKEDLIST
 
#include "TsPersistentSmartPointer.h"
 
class TsPSPIntSLLNode {
 
private:
  int index;
  long int payload[4];
  TsPersistentSmartPointer<TsPSPIntSLLNode> next;
 
  TsPSPIntSLLNode(int ix){
    index = ix;
  }
 
public:
  int hasNext(){return next != 0;}
  TsPersistentSmartPointer<TsPSPIntSLLNode>& moveNext(){return next;}
  int getIndex(){return index;}
  friend class TsPSPIntSLList;
};
 
class TsPSPIntSLList {
private:
  TsPersistentSmartPointer<TsPSPIntSLLNode> head, tail;
public:  
  ~TsPSPIntSLList();
 
  void addNode(int index);
  TsPersistentSmartPointer<TsPSPIntSLLNode>& getHeadNode(){return head;}
};
 
#endif

TsPSPIntSLList.cpp

Code:

#include <iostream>
#include <time.h>
#include "TsPSPIntSLList.h"
 
TsPSPIntSLList::~TsPSPIntSLList(){
  // no need iterative deletion
}
 
void TsPSPIntSLList::addNode(int index){
  TsPersistentSmartPointer<TsPSPIntSLLNode> n(new TsPSPIntSLLNode(index));
  if(head == 0){
    head = n;
    tail = n;
  }
  else
  {
    tail->next = n;
    tail = n;
  }
}
 
int main(){
 
  std::cout << "sizeof(TsPSPIntSLLNode) = " << sizeof(TsPSPIntSLLNode) << std::endl;
 
  clock_t start, end;
 
  std::cout << "creating huge large TsPSPIntSLList" << std::endl;
  start = clock();
  TsPSPIntSLList huge;
  std::cout << "creating huge TsPSPIntSLListNode" << std::endl;
  // 157160: segmetation fault boarder when load was evaluated
  for(int i=0; i<1000000; i++){
    huge.addNode(i);
    // n, temporal PSP is destructed here
  }
  end = clock();
  std::cout << "creation time: " << (end-start) << " clocks(" << (double) (end-start)/CLOCKS_PER_SEC << " sec)" << std::endl;
 
  std::cout << "iterating over huge TsPSPIntSLListNode" << std::endl;
  start = clock();
  // the following code calls copy constructor instead of assignment operator
  TsPersistentSmartPointer<TsPSPIntSLLNode> current = huge.getHeadNode();
  while(current->hasNext()){
    if(current->getIndex() % 10000 == 0)
      std::cout << "index:" << current->getIndex() << std::endl;
    current = current->moveNext();
  }
  end = clock();
  std::cout << "iteration time: " << (end-start) << " clocks(" << (double) (end-start)/CLOCKS_PER_SEC << " sec)" << std::endl;
  std::cout << "completed huge TsPSPIntSLList" << std::endl;
 
  // followings are called automatically, but here explicitely for measurement
  //current.~TsPersistentSmartPointer();
  //huge.~TsPSPIntSLList();
 
}

Please note no code is required for the destructor of TsPersistentSmartPointer.

So what about reference linking?

No. It can't be lock free if with doubly linked list. Two links to prev and next have to be updated at the same time, otherwise one could see inconsistent stete.

Performance?

I have measured performance with clock() function defined in "time.h" as in TsPSPIntSLList.cpp.

Raw Pointer for single thread
Creation: 0.01 sec
Iteration: 0.02 sec

SmartPointer for single thread
Creation: 0.27 sec
Iteration: 0.05 sec

Lock-free SmartPointer
Creation: 0.45 sec
Iteration: 0.12 sec

In time, there's a significant overhead over plain raw pointer implementation.

Also, a couple of CPU cache statistics collected by vaigrind.

Raw Pointer for single thread
Instruction Read: 409,523,865
Data Read: 115,388,899
L1 Data Read Miss: 2,013,421
L2 Data Read Miss: 2,005,354
Data Write: 69,133,007
L1 Data Write Miss: 1,002,531
L2 Data Write Miss: 1,001,018

Smart Pointer for single thread
Instruction Read: 960,359,665
Data Read: 284,618,659
L1 Data Read Miss: 2,049,844
L2 Data Read Miss: 2,041,472
Data Write: 172,410,809
L1 Data Write Miss: 1,916,330
L2 Data Write Miss: 1,914,709

Lock-free Smart Pointer
Instruction Read: 1,161,535,161
Data Read: 317,391,770
L1 Data Read Miss: 2,013,687
L2 Data Read Miss: 2,005,198
Data Write: 235,134,100
L1 Data Write Miss: 2,002,377
L2 Data Write Miss: 2,001,012

(pthread Mutex SmartPointer)
Creation: 0.51 sec
Iteration: 0.14 sec
Instruction Read: 1,149,615,787
Data Read: 311,412,927
L1 Data Read Miss: 3,014,567
L2 Data Read Miss: 3,005,312
Data Write: 237,144,326
L1 Data Write Miss: 3,002,797
L2 Data Write Miss: 3,001,194

Please note L1 and L2 cache read miss stays around as total number increases.
(except mutex: mutex calls another million of malloc)

It is not free lunch. But for read intensive applications, how attractive it is!

A Solution To Von Neumann bottleneck

Here's a question.

Do you have a problem about memory performance that might have been observed as CPU or I/O bottleneck before?

In other words,

Where's your interests about performance bottleneck today? Memory?

Even like this:

Do you use any Virtualization technology?

If yes, you are likely suffering from Von Neumann bottleneck, which is described as follows in Wikipedia:

The separation between the CPU and memory leads to the von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory.

This is not something new, but is considered having been accelerated by increasing number of CPU cores and size of RAM, while having almost the same amount of cache and the same bus size between memory and CPU.

-

Intel released Xeon 7500 series(e.g. X7560) recently. It has NUMA architecture, which has respective dedicated memory bank and bus to each CPU. OS kernels have improved memory management system like SLAB to achieve better cache hit rate from CPU point's of view.

But, inherently, memory allocation is a responsibility of a programmer. A compiler can optimize instructions, but not data manipulated. Today's CPU can predict repeated pattern, but which is extremely simple like a jump.

So, I could say, the third factor, that has accelerated the bottleneck, is a programmer.

-

Object Database is a technology developed long time ago. And its benefit has been considered development efficiency through maximum utilization of Object Oriented Programming Language.

Recently, I found one possible huge benefit of Object Database that has not been addressed before.

ODBMS returns objects that are directly referenced by a user against a query. While returned values are copied to user objects with RDBMS. In this sense, ODBMS provides memory management system. While it is up to a programmer with RDBMS.

So, here's the potential that ODBMS can be a solution to Von Neumann bottleneck.

-

From CPU's point of view, cache is nothing but bringing data up to L1 by the time CPU process it.

L1 <- L2
(L2 <- L3)
L2 <- RAM
(TLB lookup)
(physical address <- virtual address)
(RAM <- I/O: disk or network)

So, I could say, ODBMS is a cache system.

This can explain up to 100x times faster performance on complex data structure like Graph.

Such a redefinition might unveil the true power of ODBMS.

ChunkIO(Clustered Reads/Writes)

We're just having a db4o global conference in Berlin.

Here's what I have suggested before and just desided to work on this year, ChunkIO(Clustered Reads/Writes).

- Who is the target?

device guys

- What is the purpose of chunkIO?

By providing a way to control the number of IOs,
to help users achieve reliable responses on serious operations.

Also, some internal operations like startup should be improved.

- Why is this required?

Performance is observed in many ways. Usually "Maximum" is often mentioned.
"Maximum" performance can be improved by bufferings. But "Minimum" are not.
So it is not considered reliable for device guys.
Then, a way to improve "Minimum" performance is must.
It is chunkIO that improves "Minimum" performance.

- How does it work?

"Objects updated in a transaction are written/read all at once"
"The behavior can be configured: NONE, CHUNK, etc"
"The behavior is pluggable"

This should boost performance considerably, but is about deep in the core, so must be quite hard to do. Yet, I know it worths a lot to work on it.

This project will be opened under CodeCommander.

I will notify here if it gets up. I will appreciate any feedbacks!

Free Blog Themes and Free Blog Templates