Connection Storm

The "Connection Storm" Story at Stubhub
Stubhub, back in 2004-2007, was running a 4-node RAC 9i database on 32-bit CPUs. To
ease the resource constraints, the 4-node RAC was partitioned: two nodes served OLTP,
while the other two were reserved for broker uploads, 24x7 call center transactions,
internal reporting, and the like. Deploying this application partitioning, with different
shared pool and data cache configurations for the OLTP and DSS sides, completely
eliminated I/O on the OLTP side and improved the user experience at the front end. To
work around the 32-bit Linux limit on the SGA, an external data cache was configured
on the OLTP side to borrow extra memory beyond the SGA while the in-SGA data cache
was trimmed; the DSS side did the opposite, cutting the shared pool down to a mere
200+ MB and allocating 2+ GB to the data cache.
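This asymmetric layout boils down to ordinary instance-specific spfile settings (the
"external data cache" on 32-bit Linux was presumably 9i's VLM / indirect data buffer
mechanism, configured separately via use_indirect_data_buffers). A minimal sketch,
assuming hypothetical instance names oltp1 and dss1; apart from the roughly 200 MB
shared pool and 2+ GB data cache on the DSS side, the sizes are illustrative:

    -- DSS side: shared pool cut to ~200 MB, 2+ GB handed to the buffer cache
    ALTER SYSTEM SET shared_pool_size = 208M  SCOPE = SPFILE SID = 'dss1';
    ALTER SYSTEM SET db_cache_size    = 2200M SCOPE = SPFILE SID = 'dss1';

    -- OLTP side: the opposite bias, with the in-SGA data cache trimmed
    ALTER SYSTEM SET shared_pool_size = 600M  SCOPE = SPFILE SID = 'oltp1';
    ALTER SYSTEM SET db_cache_size    = 800M  SCOPE = SPFILE SID = 'oltp1';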
The "connection storm" arrived at the turn of 2006-2007. Around December 2006, the
database was already having problems with lock-ups of SYS objects such as OBJ$ and
SEQ$. (Those lock-ups were easily resolved by identifying the offending sessions and
killing them.)
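A sketch of that kind of cleanup, assuming the usual route of mapping DML (TM) locks
on the dictionary tables back to their sessions; the exact queries used at the time are
not recorded:

    -- find sessions holding TM locks on OBJ$ / SEQ$ anywhere in the RAC
    SELECT s.inst_id, s.sid, s.serial#, s.username, l.lmode
      FROM gv$lock l
      JOIN gv$session s
        ON s.inst_id = l.inst_id AND s.sid = l.sid
     WHERE l.type = 'TM'
       AND l.id1 IN (SELECT object_id FROM dba_objects
                      WHERE owner = 'SYS'
                        AND object_name IN ('OBJ$', 'SEQ$'));

    -- then, connected to the owning instance, substitute the real numbers:
    ALTER SYSTEM KILL SESSION 'sid,serial#';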
At about the same time, maybe in January 2007, the app team launched a few more
application blades, which drove up the aggregate size of the connection pools hitting
the OLTP database nodes at the backend. Compounding the issue was a code release,
involving nine LOGISTICS_***-prefixed tables, that totally changed the behavior of
the database: the nine tables and their indexes came to generate 95-98% of the logical
reads of the entire database.
From day one, the Stubhub database had exhibited the "UNDO enqueue" as its most
serious lock, which traced back to the database having been created with a 16k block
size (larger blocks pack more rows together, presumably concentrating contention on
the undo they share). Whenever the nine LOGISTICS_***-prefixed tables were modified
by business users on the backend nodes, OLTP users had to fetch UNDO blocks across
the interconnect to build consistent images. With the nine tables and their indexes
generating 95-98% of the logical reads of the entire database, the "UNDO enqueue"
problem was exacerbated: a momentary freeze while resolving it could hang hundreds
of user sessions on the OLTP nodes. That hang produced a false "connection storm",
which people were easily confused by.
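One hedged starting point for spotting the condition, assuming 9i, where the undo
segment enqueue shows up as lock type 'US' and sessions queue under the generic
'enqueue' wait event:

    -- sessions holding or waiting for the undo segment (US) enqueue, RAC-wide
    SELECT inst_id, sid, type, id1, id2, lmode, request
      FROM gv$lock
     WHERE type = 'US'
     ORDER BY request DESC;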
Things became difficult at Stubhub when literally hundreds of new people jumped on
the bandwagon at the same time, on the rumor that the company was about to be sold.
After one to two months of debate and stress testing, the nine LOGISTICS_***-prefixed
tables and their indexes were moved to a separate 4k tablespace. Curiously, whether at
the original 16k or the new 4k block size, the logical reads for the nine tables remained
about the same, while response times improved by about 30-40% for some queries
involving them. Once the change was deployed, the frequency of false "connection
storms" dropped to about twice per week from more than twice per day.
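The move itself would have followed the standard 9i multiple-block-size recipe; a
minimal sketch with hypothetical names, since the real tables behind LOGISTICS_***
are not spelled out here:

    -- a 4k buffer pool must exist on every instance before a 4k tablespace can
    ALTER SYSTEM SET db_4k_cache_size = 256M SCOPE = SPFILE SID = '*';

    -- create the 4k tablespace, relocate each table, rebuild its indexes
    CREATE TABLESPACE logistics_4k
      DATAFILE '/u01/oradata/stub/logistics_4k_01.dbf' SIZE 4000M
      BLOCKSIZE 4K;

    ALTER TABLE logistics_example MOVE TABLESPACE logistics_4k;
    ALTER INDEX logistics_example_pk REBUILD TABLESPACE logistics_4k;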
The complication, however, was that around the time the nine LOGISTICS_***-prefixed
tables were moved to the 4k tablespace in March, the SWAT team (created to deal with
the false denial-of-service claim and the outages, but consisting mostly of non-database
staff with a purported tenure of half a year [till June 30, 2007]) had pushed for DCD
(dead connection detection) to be deployed ahead of the 4k tablespace move. So while
the "connection storm" lessened in intensity, the credit for the mitigation never got
ascribed to the 4k change.
From April 2007 onward, the DCD implementation began to take its toll on the database.
DCD, by its mechanism of pinging the Java connection pools, caused idle connections
to the OLTP database nodes to jump to 200-300 and more, from a previous level of
about 100-110 connections per OLTP node. In effect, DCD defeated the connection pool
mechanism: once spawned, connections never shrank away. The extra hundreds of idle
connections, besides eating away at memory, became deadly whenever the "UNDO
enqueue"-induced hang recurred. Once hundreds of idle connections turned active
within 1-2 seconds, the whole RAC went down. Previously, when the hang occurred, it
took about half a minute for new connections to be created and fired at the OLTP DB
nodes, which in a sense blunted the severity of the attack on the database.
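For reference, Oracle's DCD is switched on server-side by a single sqlnet.ora
parameter, the probe interval in minutes; those periodic probe packets are what kept
every pooled connection looking alive, matching the behavior described above (the
interval shown is illustrative):

    # server-side sqlnet.ora: probe each client connection every 10 minutes
    SQLNET.EXPIRE_TIME = 10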
Now, why did the RAC database fail to handle the surge of connections? The causes lay
both at the database level and at the OS level. In April, Oracle support pinpointed the
need to lower the stack_size parameter: while spotting the "UNDO enqueue" in some of
the systemstate dumps provided, they pointed out that the 9i cluster manager relied on
the OS-level stack size for handling simultaneous connections to the database. Since
the SWAT team took the now-infrequent "connection storm" as a DCD fait accompli, the
stack_size change suggested by Oracle was ignored. On July 1st, 2007, a new set of
SWAT managers and engineers came on board. After about three weeks of discussion,
with 20-30 people jamming the conference room, a decision was made to reverse the
DCD deployment and implement the stack_size change. Once the stack_size was
changed, the whole RAC database successfully withstood the "connection storms" that
still came maybe 1-2 times per week. The reason stack_size mattered so much was that,
under the former setting, the cluster manager could not handle more than 200+
concurrent connections, whereas under the new setting it could theoretically handle
1000-2000.
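The arithmetic behind those ceilings is what one would expect on 32-bit Linux: with
roughly 2-3 GB of usable address space per process and each connection thread
reserving its own stack, a default stack in the 8-10 MB range caps concurrency at a few
hundred threads, while a 1-2 MB stack pushes the ceiling into the low thousands,
consistent with the 200+ versus 1000-2000 figures. Exactly how the limit was applied is
not recorded here; one plausible rendering is the soft stack rlimit inherited by the
cluster manager at startup:

    # check the current soft per-thread stack limit (in KB)
    ulimit -S -s

    # hypothetical: shrink it to 2 MB before starting the 9i cluster manager,
    # so that more concurrent connection threads fit in the address space
    ulimit -S -s 2048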
Hence, when the "UNDO enqueue"-induced hang recurred, the application blades,
detecting the momentary freeze of the database, would spawn and throw hundreds of
new connections at the database in a gradual and orderly manner, and then shrink the
connection pool counts once the freeze was over. (The ultimate fix for the database
hang was to upgrade to a release that purportedly fixed the "UNDO enqueue".)