How can we deal with intersection tables that quickly grow very large?

Question

For example, we have table A, and table B which have a many-to-many relationship. An intersection table, Table C stores A.id and B.id along with a value that represents a relationship between the two. Or as a concrete example, imagine stackexchange which has a user account, a forum, and a karma score. Or, a student, a course, and a grade. If table A and B are very large, table C can and probably will grow monstrously large very quickly(in fact lets just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?

Could you post a schema and a few example tuples? – Alex W Nov 25 '12 at 02:39 — Alex W, Nov 25 '12 at 02:39

Branko Dimitrijevic · Accepted Answer · 2012-11-25T08:31:56.650

There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.

There are good reasons for using junction table versus comma-separated list (or similar), including:

Efficient querying (through indexing and clustering).
Enforcement of referential integrity.

When designing a junction table, ask the following questions:

Do I need to query in only one direction or both?¹
- If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
- If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
Is the "extra" data small?
- If yes, cluster the table and cover the extra data in the secondary index as necessary.²
- I no, don't cluster the table and don't cover the extra data in the secondary index.³
Are there any additional tables for which the junction table acts as a parent?
- If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.

In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):

CREATE TABLE JUNCTION_TABLE (
    PARENT_ID INT,
    CHILD_ID INT,
    EXTRA_DATA VARCHAR2(50),
    PRIMARY KEY (PARENT_ID, CHILD_ID),
    FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
    FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;

CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
    JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;

Considerations:

ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.
COMPRESS: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.
JUNCTION_TABLE_IE1, EXTRA_DATA: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.

Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.

Here is the equivalent MS SQL Server syntax (sans index compression):

CREATE TABLE JUNCTION_TABLE (
    PARENT_ID INT,
    CHILD_ID INT,
    EXTRA_DATA VARCHAR(50),
    PRIMARY KEY (PARENT_ID, CHILD_ID),
    FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
    FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);

CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
    JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);

Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.

¹ In other words, do you only need to get "children" of given "parent", or you might also need to get parents of given child.

² Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.

³ This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!

Thank you, this was a great answer, and I appreciate the links provided for follow-up reading. — Jaigus, Nov 25 '12 at 09:04

How can we deal with intersection tables that quickly grow very large?

1 Answers1

Linked