Code Reading Notes - MySQL Cluster

The partitioning code in MySQL Cluster is a totally different story from the MySQL Server partitioning plugin. Partitioning is handled natively by the NDB (Network DataBase) storage engine.

As mentioned in part 1 and part 2, MySQL Server still needs to parse the SQL and call handlers to do the storage work. But NDB is more of a standalone system with its own API than a simple plugin. The handler part shipped with MySQL Server is a bridge that translates handler API calls into NDB API calls.
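
To make the bridge idea concrete, here is a hedged sketch of roughly what a handler call such as ha_ndbcluster::write_row() has to turn into on the NDB API side. It is modelled on the public NdbApi examples, not copied from ha_ndbcluster.cc; the column names are placeholders and error handling is omitted.

#include <NdbApi.hpp>

// Hypothetical helper: insert one row through the NDB API, the way the
// handler ultimately does on behalf of MySQL Server. The real column
// mapping from TABLE/Field objects is omitted.
int insert_row_via_ndbapi(Ndb* ndb, const NdbDictionary::Table* table,
                          int key, int value)
{
  NdbTransaction* trans = ndb->startTransaction();
  if (trans == nullptr)
    return -1;

  NdbOperation* op = trans->getNdbOperation(table);
  op->insertTuple();                 // write_row becomes an insert operation
  op->equal("KEY", key);             // primary key column (placeholder name)
  op->setValue("VALUE", value);      // non-key column (placeholder name)

  int ret = trans->execute(NdbTransaction::Commit);  // ship it to the data nodes
  ndb->closeTransaction(trans);
  return ret;
}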

While reading this part of the code, I got a lot of help from Frazer, thanks a lot! For an overview of the NDB architecture, please refer to his blog. Here are some code snippets I found important, and some notes from the discussion on his blog post mentioned above. Part of this post is copied directly from that discussion; credit goes to Frazer.

This blog by Mikael Ronstrom is also great. Don't miss it.

The Handler Part

The code is in sql/ha_ndbcluster.cc. It creates a handlerton and fills in its function pointers. The initialization routine is ndbcluster_init, which calls ndbcluster_connect to join the cluster.
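
As a rough illustration (not the real code), the pattern ndbcluster_init follows is the usual MySQL plugin one: the init function is handed a handlerton and fills in the function pointers the server will call. Treat the bodies and details below as a simplified approximation of what ha_ndbcluster.cc actually does.

// Sketch of the handlerton wiring, following the common MySQL 5.x plugin pattern.
static handler* ndbcluster_create_handler(handlerton* hton, TABLE_SHARE* table,
                                          MEM_ROOT* mem_root)
{
  return new (mem_root) ha_ndbcluster(hton, table);  // hand MySQL a handler instance
}

static int ndbcluster_init(void* p)
{
  handlerton* hton = static_cast<handlerton*>(p);
  hton->create = ndbcluster_create_handler;   // table open/create goes through here
  // ... commit, rollback and DDL hooks are filled in the same way ...
  // then ndbcluster_connect() is called so mysqld joins the cluster as an API node
  return 0;
}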

As mentioned above, the handler is just an entry point into the NDB cluster. The main logic happens in the NDB software itself, not in the handler.

NDB

The code is in storage/ndb/src/kernel. NDB originates from Ericsson [ref], and it is a message-passing architecture. Functionality is packaged into blocks, and a virtual machine handles the runtime: message dispatching, scheduling, OS-related work, and more.

Blocks

For a complete list of kernel blocks, refer to NDB Internals here. Some important ones are

  • dbdict: data dictionary block, stores table schema metadata on disk
  • dbdih: partitioning management, stores table partitioning/fragmentation and redo recovery metadata on disk
  • dblqh: local query handler, stores operation redo logs on disk
  • dbtc: transaction coordinator
  • dbtup: tuple manager, manages physical storage
  • dbtux: ordered index block
  • ndbfs: implements an async filesystem abstraction for the other blocks, including optional compression.
  • trix: utility block for internal triggers and index build

The most important code is in storage/ndb/src/kernel/vm/SimulatedBlock, the base class for all the other blocks. Since this is message-passing software, managing message handlers is central. In particular:

  1. It has a theExecArray which maps a GlobalSignalNumber onto a callback function of type ExecFunction. This is used later in SimulatedBlock::executeFunction().
  2. It has an addRecSignalImpl() which adds an entry to theExecArray. Other blocks use it in their initialization phase to register message handlers, usually through the addRecSignal() wrapper.
  3. It defines a macro BLOCK_DEFINES which helps define a block by declaring a variety of default functions. This is mostly for backward compatibility; the code here is kind of messy.
  4. It contains other execute/callback-related code, and signal-sending code.

A typical block defines itself as a child class of SimulatedBlock and then, in its initialization routine, calls addRecSignal() several times to register its signal handlers. If other blocks are going to interact with it, it usually declares them as friend classes. I don't know the reason, but that's what the code says; probably because it was converted from PLEX into C++ by some translator. Refer to Dbdict::Dbdict().
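
A minimal sketch of the pattern (MyBlock, MY_BLOCK_NUMBER and GSN_MY_REQ are made-up names; this is not actual NDB source):

// A block derives from SimulatedBlock and registers its signal handlers
// in its constructor, the way Dbdict::Dbdict() does.
class MyBlock : public SimulatedBlock
{
public:
  MyBlock(Block_context& ctx)
    : SimulatedBlock(MY_BLOCK_NUMBER, ctx)    // MY_BLOCK_NUMBER is hypothetical
  {
    // Each addRecSignal() fills one slot of theExecArray, so that
    // SimulatedBlock::executeFunction() can dispatch by GlobalSignalNumber.
    addRecSignal(GSN_MY_REQ, &MyBlock::execMY_REQ);
  }

protected:
  // Handler invoked when a GSN_MY_REQ signal arrives for this block.
  void execMY_REQ(Signal* signal);
};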

Messages

Message names follow a few patterns:

  • REQ for request
  • CONF for confirm
  • REF for refuse
  • KEY for key related messages
  • FS for file-system related messages

I found the XXX_KEY_REQ messages to be particularly important in NDB.
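
For example, the key operation protocol follows the REQ/CONF/REF pattern (TCKEYREQ/TCKEYCONF/TCKEYREF, LQHKEYREQ/LQHKEYCONF/LQHKEYREF). Continuing the MyBlock sketch above, this is roughly how a block answers a REQ; the GSN_MY_* names, the signal layout, and the success test are all made up for illustration.

// Illustrative only: reply with GSN_MY_CONF on success, GSN_MY_REF on failure.
void MyBlock::execMY_REQ(Signal* signal)
{
  const Uint32 senderRef = signal->theData[0];     // assumed layout: sender reference first
  const bool ok = (signal->theData[1] == 0);       // made-up success condition

  signal->theData[0] = reference();                // tell the sender who replied
  if (ok)
    sendSignal(senderRef, GSN_MY_CONF, signal, 1, JBB);
  else
    sendSignal(senderRef, GSN_MY_REF, signal, 1, JBB);
}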

Thread and Emulator

Code is in storage/ndb/src/kernel/vm/ThreadConfig and Emulator. The former is the NDB daemon code. ThreadConfig::ipControlLoop() is the event loop where everything starts; it handles the scheduling of signal execution and input/output.

The Emulator code is the virtual machine emulator, where these global variables are defined. They are a good pointer to other important classes.

extern class  JobTable            globalJobTable;  
extern class  TimeQueue           globalTimeQueue;  
extern class  FastScheduler       globalScheduler;  
extern class  TransporterRegistry globalTransporterRegistry;  
extern struct GlobalData          globalData;

#ifdef VM_TRACE
extern class SignalLoggerManager globalSignalLoggers;  
#endif

#ifndef NO_EMULATED_JAM
  #define EMULATED_JAM_SIZE 1024
  #define JAM_MASK ((EMULATED_JAM_SIZE * 4) - 1)

  extern Uint8 theEmulatedJam[];
  extern Uint32 theEmulatedJamIndex;
  // last block entry, used in dumpJam() if jam contains no block entries
  extern Uint32 theEmulatedJamBlockNumber;
#else
  const Uint8 theEmulatedJam[]=0;
  const Uint32 theEmulatedJamIndex=0;
#endif
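
To give a feel for what ipControlLoop() does, here is a self-contained sketch of the shape of the loop: receive signals, execute queued jobs, deliver due timed signals, then send. All the types and method names below are stand-ins, not the real interfaces of the globals listed above.

#include <atomic>

// Stand-in types; the real loop drives globalTransporterRegistry,
// globalScheduler and globalTimeQueue.
struct Transporters { void pollReceive() {} void performSend() {} };
struct Scheduler    { void executeJobBuffers() {} };
struct Timers       { void deliverDueSignals() {} };

std::atomic<bool> shutdownRequested{false};

void controlLoopSketch(Transporters& transporters, Scheduler& scheduler, Timers& timers)
{
  while (!shutdownRequested.load()) {
    transporters.pollReceive();      // pull incoming signals into the job buffers
    scheduler.executeJobBuffers();   // run block exec functions for queued signals
    timers.deliverDueSignals();      // fire delayed signals whose time has come
    transporters.performSend();      // flush outgoing signals to other nodes
  }
}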

Partitioning

NDB natively supports [LINEAR] KEY hash partitioning. Online re-partitioning/reorganization is implemented [ref], which is pretty cool. User-defined partitioning is supported via the MySQL Server partitioning plugin.

Here are some points from the discussion.

  • All row access is via primary key (indexes and scans also internally go via primary key)

  • Where the fragment (partition) of the row is unknown, the transaction coordinator (TC) will apply the MD5 hash function to the relevant columns of the primary key (known as the distribution key, or partition key), to get a hash value. storage/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp::execTCKEYREQ() / tckeyreq050Lab() / hash() / execSCANTABREQ()

  • Then the TC will ask the Distribution Handler (DIH) to give it an ordered list of the replicas of the fragment that the hash value maps to.

  • For hashmap partitioning, DIH will use the hash to look up a fragment id in a table (the hashmap). For non-hashmap partitioning, DIH will take the hash modulo the number of fragments to get a fragment id (see the sketch after this list). DIH will return details of the fragment's replicas, with the current primary replica first in the list. storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp, execDIGETNODESREQ(), execDIH_SCAN_GET_NODES_REQ(). DIH will probably send several FS_XXX requests to NDBFS to read metadata.

  • TC then uses the fragment information to forward the request to the LQH block on the correct node and instance, usually the one holding the current primary replica. This signal will then be handled by e.g. storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp::execLQHKEYREQ(), execSCAN_FRAGREQ().
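
Here is a standalone sketch of the two fragment-selection strategies just described (the sketch referenced in the hashmap bullet above). The MD5 step is replaced by a stand-in hash; in DBTC the real hash is computed over the distribution-key columns, and real hashmaps are much larger than the toy one used here.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the MD5-based hash DBTC computes over the distribution key.
uint32_t distKeyHash(const std::string& distKey)
{
  return static_cast<uint32_t>(std::hash<std::string>{}(distKey));
}

// Hashmap partitioning: the hash indexes a small table (the hashmap) whose
// entries are fragment ids. Re-org only has to rewrite this table.
uint32_t fragmentViaHashMap(uint32_t hash, const std::vector<uint32_t>& hashMap)
{
  return hashMap[hash % hashMap.size()];
}

// Non-hashmap partitioning: fragment id is simply the hash modulo the
// number of fragments.
uint32_t fragmentViaModulo(uint32_t hash, uint32_t noOfFragments)
{
  return hash % noOfFragments;
}

int main()
{
  // Hypothetical table: 4 fragments, 8-entry hashmap.
  const std::vector<uint32_t> hashMap = {0, 1, 2, 3, 0, 1, 2, 3};
  const uint32_t h = distKeyHash("pk=42");
  std::cout << "hashmap lookup -> fragment " << fragmentViaHashMap(h, hashMap) << "\n"
            << "hash modulo    -> fragment " << fragmentViaModulo(h, 4) << "\n";
  return 0;
}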

Online Re-org

This post by Frazer is worth reading first. I copied some of the discussion between us and pasted it here.

The re-org operation itself is triggered via DICT as part of a schema transaction (an ALTER TABLE where the table's hashmap is changed). This part of the code can be found in Dbdict::altertable_parse(), where it compares tablePtr.p->hashMapObjectId.

Rows which must be moved from existing to new fragments are detected by looking up their distribution-key hash in both the old and the new hashmap. Where the two lookups differ, the row needs to be moved. You can read the series of Dbdict::altertable_* functions for the details.
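
A sketch of that test (not NDB source): look the row's distribution-key hash up in both hashmaps and compare.

#include <cstdint>
#include <vector>

// A row must be copied during re-org exactly when the old and new hashmaps
// send its distribution-key hash to different fragments.
bool rowMustMove(uint32_t distKeyHash,
                 const std::vector<uint32_t>& oldHashMap,
                 const std::vector<uint32_t>& newHashMap)
{
  const uint32_t oldFragment = oldHashMap[distKeyHash % oldHashMap.size()];
  const uint32_t newFragment = newHashMap[distKeyHash % newHashMap.size()];
  return oldFragment != newFragment;
}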

In DbtcMain, executeReorgTrigger() sends a TCKEYREQ, which Dbtc itself then handles in tckeyreq050Lab(). That function requests node info from DIH, uses two loops to choose the owning node (TownNode) and the backup node, and finally sends LQHKEYREQ(s) to LQH to do the dirty work.

The TRIX and DBUTIL blocks are used to scan the fragments of the table, and where a row which will be moved is found, it is inserted into the new fragment.

In the meantime, Cluster-internal row change triggers (DBTUP) are used to detect when user DML activity causes a row which will be moved to be changed. The trigger then causes the change to be propagated to the new fragment.

When the scan completes, the data-to-be-moved exists in both the new and old fragments. At this point, the schema transaction commit begins:

  • The new hashmap becomes the official hashmap - used for new requests
  • Moved rows in the old fragment become invisible to scans.
  • The system waits for the open-operation refcount on the old hashmap to drop to zero; this requires in-flight transactions using it to complete.
  • Once all operations using the old hashmap are complete, TRIX and DBUTIL begin a delete-scan of the existing fragment, removing the copied rows.
  • Once that delete-scan finishes, the re-org is complete.

As with most databases, many things are accomplished with internal triggers. storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp is a good place to look. Generally, triggers have either a node-local or a cluster-wide effect. Node-local triggers are used e.g. for online backup, ordered index maintenance etc. Cluster-wide (global) triggers are used for unique indexes, foreign keys (in 7.3+), and online re-org. Where a trigger is global, the transaction's TC is informed when the trigger fires, and the TC can undertake further distributed work based on the trigger firing.

Final Words

After reading the code and a variety of blog posts from Frazer, I think I now have a reasonable picture of the partitioning techniques in MySQL. The code is hard to read, but the rationale behind it is pretty clear. However, other partitioning methods exist, like the virtual nodes in Cassandra. I may come up with something later, after reading the Cassandra code.
