The partitioning code in MySQL Cluster is a totally different story from the MySQL Server partitioning plugin: partitioning is handled natively by the NDB (Network DataBase) storage engine.
As mentioned in part 1 and part 2, MySQL Server still needs to parse the SQL and call handlers to handle storage. But NDB is more of a standalone system with its own API than a simple plugin. The handler part shipped with MySQL Server is a bridge that translates handler API calls into NDB API calls.
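The bridge idea can be sketched with toy stand-in types. Note that FakeNdbApi and FakeNdbHandler are made up for illustration only; the real classes are the NDB API client classes on one side and ha_ndbcluster on the other:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in, not the real API: represents the NDB API client side,
// which turns calls into messages sent to the cluster.
struct FakeNdbApi {
  std::vector<std::string> sentOps;  // "messages" shipped to data nodes
  int insertTuple(const std::string& row) {
    sentOps.push_back("INSERT:" + row);
    return 0;
  }
};

// Stands in for ha_ndbcluster: each MySQL handler API call is translated
// into the corresponding NDB API call, nothing more.
struct FakeNdbHandler {
  FakeNdbApi* ndb;
  int write_row(const std::string& row) {  // handler API entry point
    return ndb->insertTuple(row);          // becomes an NDB API call
  }
};
```

The handler holds no storage logic of its own, which is the point the real code makes at much greater length.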
While reading this part of the code, I got a lot of help from Frazer, thanks a lot! For an overview of the NDB architecture, please refer to his blog. Here are some code snippets I found important, and some notes from the discussion on his blog post mentioned above. Part of this post is copied directly from that discussion. Credit goes to Frazer.
This blog by Mikael Ronstrom is also great. Don't miss it.
The Handler Part
The code is in sql/ha_ndbcluster.cc. It creates a handlerton and fills in the function pointers. The initialization routine is ndbcluster_init(), which issues ndbcluster_connect() to join the cluster.
As mentioned above, the handler is just an entry point to the NDB cluster. The main logic lives in the NDB software, not in the handler itself.
The code is in storage/ndb/src/kernel. NDB originates from Ericsson [ref], and it has a message-passing architecture. Functionality is packaged into blocks, and a virtual machine handles the runtime, e.g. message dispatching, scheduling, OS-related work, and more.
For a complete list of kernel blocks, refer to the NDB Internals documentation. Some important ones are:
- dbdict: data dictionary block, stores table schema metadata on disk
- dbdih: partitioning management, stores table partitioning/fragmentation and redo recovery metadata on disk
- dblqh: local query handler, stores operation redo logs on disk
- dbtc: transaction coordinator
- dbtup: tuple manager, manages physical storage
- dbtux: ordered indexes
- ndbfs: implements an async filesystem abstraction for the other blocks, including optional compression.
- trix: triggers
The most important code is in storage/ndb/src/kernel/vm/SimulatedBlock, which is the base class for the other blocks. Since this is message-passing software, managing message handlers is important.
- It has a theExecArray, which maps a GlobalSignalNumber onto a callback function of type ExecFunction. Signal dispatch goes through this array at runtime.
- It has an addRecSignalImpl(), which adds an entry to theExecArray. Other blocks use this in their initialization phase to register their message handlers.
- It defines a macro BLOCK_DEFINES, which helps define a block by generating a variety of default functions. But this is for backward compatibility. The code here is kind of messy.
- It contains other execute/callback-related code, and signal-sending code.
A typical block first defines itself as a child class of SimulatedBlock, and then calls addRecSignal() several times in its initialization routine to register its signal handlers. If other blocks are going to interact with it, it usually declares them as friend classes. I don't know the reason, but that's what the code says. Probably because this was converted from PLEX into C++ by some translator.
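The registration/dispatch pattern can be sketched roughly like this. Everything here is hypothetical and heavily simplified (GSN_DEMO_KEY_REQ and DemoBlock are made up; the real SimulatedBlock also deals with job buffers, jam tracing, and so on):

```cpp
#include <cassert>
#include <array>
#include <cstdint>

using GlobalSignalNumber = uint16_t;
struct Signal { uint32_t data[25]; };  // toy signal payload

class SimulatedBlockSketch {
public:
  using ExecFunction = void (SimulatedBlockSketch::*)(Signal*);

  // Dispatch an incoming signal to the handler registered for its GSN.
  void executeFunction(GlobalSignalNumber gsn, Signal* sig) {
    ExecFunction f = theExecArray[gsn];
    if (f != nullptr) (this->*f)(sig);
  }

protected:
  // Analogous to addRecSignalImpl(): register a handler for a GSN.
  void addRecSignal(GlobalSignalNumber gsn, ExecFunction f) {
    theExecArray[gsn] = f;
  }

private:
  std::array<ExecFunction, 1024> theExecArray{};  // GSN -> handler, all null
};

// A block subclasses the base and registers handlers in its init phase.
class DemoBlock : public SimulatedBlockSketch {
public:
  static constexpr GlobalSignalNumber GSN_DEMO_KEY_REQ = 42;  // made-up number
  int requestsSeen = 0;

  DemoBlock() {
    addRecSignal(GSN_DEMO_KEY_REQ,
                 static_cast<ExecFunction>(&DemoBlock::execDEMO_KEY_REQ));
  }
  void execDEMO_KEY_REQ(Signal*) { ++requestsSeen; }
};
```

The exec-function naming mirrors the convention you see all over the kernel blocks: one `execXXX()` per signal a block can receive.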
Messages have naming patterns:
- KEY for key-related messages
- FS for file-system related messages
XXX_KEY_REQ messages are usually very important in NDB.
Thread and Emulator
Code is in Emulator. The first part is the NDB daemon code. ThreadConfig::ipControlLoop() is the event loop where everything starts. It handles the scheduling of signal execution and input/output.
The Emulator code is the virtual machine emulator, where the global variables below are defined. This is a good pointer to other important classes.
```cpp
extern class JobTable globalJobTable;
extern class TimeQueue globalTimeQueue;
extern class FastScheduler globalScheduler;
extern class TransporterRegistry globalTransporterRegistry;
extern struct GlobalData globalData;

#ifdef VM_TRACE
extern class SignalLoggerManager globalSignalLoggers;
#endif

#ifndef NO_EMULATED_JAM
#define EMULATED_JAM_SIZE 1024
#define JAM_MASK ((EMULATED_JAM_SIZE * 4) - 1)
extern Uint8 theEmulatedJam;
extern Uint32 theEmulatedJamIndex;
// last block entry, used in dumpJam() if jam contains no block entries
extern Uint32 theEmulatedJamBlockNumber;
#else
const Uint8 theEmulatedJam = 0;
const Uint32 theEmulatedJamIndex = 0;
#endif
```
NDB natively supports [LINEAR] KEY hash partitioning. Online re-partitioning/reorg is implemented [ref], which is pretty cool. User-defined partitioning is supported via the MySQL Server partitioning plugin.
Here's some discussion.
All row access is via primary key (indexes and scans also internally go via primary key)
Where the fragment (partition) of the row is unknown, the transaction coordinator (TC) will apply the MD5 hash function to the relevant columns of the primary key (known as the distribution key, or partition key), to get a hash value.
Then the TC will ask the Distribution Handler (DIH) to give it an ordered list of the fragment replicas which store the hash value.
For hashmap partitioning, DIH will use the hash to look up a fragment id in a table (the hashmap). For non-hashmap partitioning, DIH will take the hash modulo the number of fragments to get a fragment id. DIH will return details of the fragment's replicas, with the current primary fragment first in the list.
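Those two rules can be sketched as follows. Note that fragmentByModulo and fragmentByHashmap are hypothetical helper names for illustration, not the real DIH functions, and the real hash is MD5 over the distribution key:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Non-hashmap partitioning: the fragment id is simply the distribution-key
// hash taken modulo the number of fragments.
uint32_t fragmentByModulo(uint32_t hashValue, uint32_t fragmentCount) {
  return hashValue % fragmentCount;
}

// Hashmap partitioning: the hash selects a bucket in a fixed-size map, and
// the map entry names the fragment. This indirection is what makes online
// reorg cheap: only map entries change, not the hash function.
uint32_t fragmentByHashmap(uint32_t hashValue,
                           const std::vector<uint32_t>& hashmap) {
  return hashmap[hashValue % hashmap.size()];
}
```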
The relevant handler is execDIH_SCAN_GET_NODES_REQ(). DIH will probably send several FS_XXX requests to NDBFS to read metadata. TC then uses the fragment information to forward the request to the LQH block on the correct node and instance (usually the current primary replica), where the corresponding exec function handles the signal.
This post by Frazer is worth reading first. I copied some of the discussion between us and pasted it here.
The re-org operation itself is triggered via DICT as part of a schema transaction (an ALTER TABLE where the table's hashmap is changed). This part of the code can be found in Dbdict::altertable_parse(), where the old and new table definitions are compared.
Rows which must be moved from existing to new fragments are detected by looking up their distribution-key hash in the old and new hashmaps. Where these differ, the row needs to be moved. You can read the Dbdict::altertable_* series of functions to see the details.
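The move-detection rule is simple enough to sketch directly. Note that rowMustMove is a made-up helper name; the real comparison is spread across the Dbdict/reorg code paths:

```cpp
#include <cstdint>
#include <vector>

// A row must move during reorg exactly when the old hashmap and the new
// hashmap assign its distribution-key hash to different fragments.
bool rowMustMove(uint32_t hashValue,
                 const std::vector<uint32_t>& oldMap,
                 const std::vector<uint32_t>& newMap) {
  uint32_t oldFrag = oldMap[hashValue % oldMap.size()];
  uint32_t newFrag = newMap[hashValue % newMap.size()];
  return oldFrag != newFrag;
}
```

Rows whose bucket maps to the same fragment in both hashmaps are untouched, which is why hashmap reorg moves far less data than naive rehashing.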
The reorg trigger calls executeReorgTrigger() to send a TCKEYREQ, and Dbtc then handles it in tckeyreq050Lab(), which asks DIH for node info, uses two loops to choose the owning node (TownNode) and the backup node, and finally sends LQHKEYREQ(s) to LQH to do the dirty work.
The TRIX and DBUTIL blocks are used to scan the fragments of the table, and where a row which will be moved is found, it is inserted into the new fragment.
In the meantime, cluster-internal row change triggers (DBTUP) are used to detect when user DML changes a row that will be moved. The trigger then causes the change to be propagated to the new fragment.
When the scan completes, the data-to-be-moved exists in both the new and old fragments. At this point, the schema transaction commit begins:
- The new hashmap becomes the official hashmap - used for new requests
- Moved rows in the old fragment become invisible to scans.
- The system waits for the open operation refcount on the old hashmap to drop to zero; this requires in-flight transactions to complete.
- Once all operations using the old hashmap are complete, TRIX and DBUTIL begin a delete-scan of the existing fragment, removing the copied rows.
- When that scan completes, the reorg is done.
As with most databases, many things are accomplished with internal triggers.
storage/ndb/src/kernel/dbtup/DbtupTrigger.cpp is a good place to look. Generally triggers either have a node-local or cluster-wide effect. Node-local triggers are used e.g. for online backup, ordered index maintenance etc. Cluster-wide (Global) triggers are used for Unique indexes, Foreign keys (in 7.3+), and online re-org. Where a trigger is global, a transaction's TC is informed when the trigger is fired, and the TC is able to undertake further distributed work based on the trigger firing.
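A minimal sketch of the node-local vs. cluster-wide split could look like this. All names here are made up for illustration; the real machinery lives in DbtupTrigger.cpp and the TC's fired-trigger handling:

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class TriggerScope { NodeLocal, Global };

struct Trigger {
  TriggerScope scope;
  std::string name;
};

// Toy stand-in for the tuple manager's trigger firing on a row change.
struct TupSketch {
  std::vector<Trigger> triggers;
  std::vector<std::string> localWork;  // handled entirely on this node
  std::vector<std::string> firedToTC;  // reported to the transaction coordinator

  void onRowChange() {
    for (const Trigger& t : triggers) {
      if (t.scope == TriggerScope::NodeLocal)
        localWork.push_back(t.name);   // e.g. ordered-index maintenance, backup
      else
        firedToTC.push_back(t.name);   // TC drives further distributed work,
                                       // e.g. unique index, FK, reorg copy
    }
  }
};
```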
After reading the code and a variety of blog posts from Frazer, I think I now have some knowledge of the partitioning techniques in MySQL. The code is hard to read, but the rationale behind it is pretty clear. Other partitioning methods exist, though, like the virtual nodes in Cassandra. I may come up with something later after reading the Cassandra code.