4-2 Lab: Cardinality And Targeted Data

The 4‑2 lab on cardinality and targeted data is a hands‑on exercise designed to help students grasp how the uniqueness of values in a column (cardinality) influences query performance and how focusing on a subset of records (targeted data) can improve both speed and relevance. By working through this lab, learners gain practical experience writing SQL statements, interpreting execution plans, and applying indexing strategies that are essential for real‑world database optimization.

Introduction

In relational databases, cardinality refers to the number of distinct values that appear in a column relative to the total number of rows. High-cardinality columns (e.g., primary keys, timestamps) contain many unique entries, while low-cardinality columns (e.g., gender, status flags) repeat the same values frequently. Understanding this concept is crucial because the optimizer uses cardinality estimates to decide whether to scan an index, perform a full table scan, or join tables in a particular order.

Targeted data, on the other hand, involves selecting only those rows that satisfy specific business criteria—such as customers from a particular region, orders placed in the last quarter, or products with inventory below a threshold. By narrowing the dataset early in a query, you reduce the amount of data the engine must process, which often translates into faster response times and lower resource consumption.

The 4‑2 lab combines these two ideas: you first measure the cardinality of various columns, then create targeted queries that apply that knowledge to improve performance.

Understanding Cardinality

What Is Cardinality?

Cardinality can be expressed as a ratio:

\[ \text{Cardinality} = \frac{\text{Number of Distinct Values}}{\text{Total Number of Rows}} \]

High cardinality (close to 1): many unique values → indexes are usually effective.
Low cardinality (close to 0): few unique values → indexes may be less useful unless combined with other columns.
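
As a minimal sketch of the ratio above, the following Python snippet computes it for two in-memory columns. The function name `cardinality_ratio` and the sample values are illustrative assumptions, not part of the lab materials.

```python
def cardinality_ratio(values):
    """Return distinct values divided by total rows (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample columns, invented for illustration.
order_ids = [101, 102, 103, 104, 105, 106, 107, 108]
status_flags = ["open", "closed", "open", "open", "closed", "open", "closed", "open"]

print(cardinality_ratio(order_ids))     # 1.0  (every value is unique)
print(cardinality_ratio(status_flags))  # 0.25 (2 distinct values in 8 rows)
```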

Why It Matters

Query Planning – The optimizer estimates how many rows will match a predicate. If it overestimates, it may choose a costly full table scan; if it underestimates, it might pick an index that actually returns too many rows, causing extra I/O.
Index Design – Columns with high cardinality are prime candidates for single‑column indexes. Low‑cardinality columns often benefit from bitmap indexes (in data warehouses) or composite indexes that pair them with high‑cardinality fields.
Partitioning Strategies – Range partitioning works best on columns with predictable, ordered values (e.g., dates); list partitioning suits columns with a small, known set of values (e.g., region).
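
To make the query-planning point concrete, here is a simplified sketch of how a row estimate can be derived from cardinality. It assumes values are uniformly distributed, an assumption that real optimizers refine with histograms and most-common-value statistics. The function name and the row counts are hypothetical.

```python
def estimated_rows(total_rows, distinct_values):
    """Uniform-distribution estimate of rows matching `column = constant`."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    return total_rows / distinct_values


total = 1_000_000  # hypothetical table size

# Unique key: about one matching row, so an index lookup is attractive.
print(estimated_rows(total, 1_000_000))  # 1.0

# Four-value status flag: a quarter of the table matches each value.
print(estimated_rows(total, 4))  # 250000.0
```

An estimate near one row tends to favor an index lookup, while an estimate covering a large share of the table tends to favor a scan.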

Understanding Targeted Data

Defining Targeted Data

Targeted data is the result of applying filter predicates (WHERE clauses) that isolate a meaningful subset of rows. The goal is to eliminate irrelevant data as early as possible in the query execution pipeline.

Benefits

Reduced I/O – Fewer pages read from disk.
Lower CPU Usage – Fewer rows to evaluate in joins, aggregates, or sorting.
Improved Concurrency – Less lock contention because transactions touch fewer rows.
Clearer Insights – Analysts focus on the data that matters for a specific business question.

Techniques

Predicate Push‑Down – Ensure WHERE clauses are applied before expensive operations like GROUP BY or JOIN.
Covering Indexes – Include all columns needed by the query in the index so the engine can satisfy the request without touching the base table.
Materialized Views – Pre‑aggregate or pre‑filter data for recurring targeted queries.
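
The covering-index technique can be sketched with Python's built-in `sqlite3` module. SQLite stands in here for whatever RDBMS the lab uses, and the table layout, index name, and generated rows are assumptions made for illustration. SQLite reports plans through `EXPLAIN QUERY PLAN`, whose wording differs from other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO Orders (customer_id, order_date, total) VALUES (?, ?, ?)",
    [(i % 50, f"2023-{(i % 12) + 1:02d}-15", i * 1.5) for i in range(1000)],
)

# Hypothetical index: it holds every column the query below reads.
conn.execute("CREATE INDEX idx_orders_date_total ON Orders (order_date, total)")

query = (
    "SELECT order_date, total FROM Orders "
    "WHERE order_date >= '2023-10-01' AND order_date < '2024-01-01'"
)

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail text.
covering_plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
print(covering_plan)
```

When the plan text mentions a covering index, the engine answered the query from the index alone without reading the base table.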

Lab Overview

The 4‑2 lab typically provides a sample schema (e.g., a sales database with tables Customers, Orders, OrderItems, and Products). Students then work through the following tasks:

Examine column statistics to determine cardinality.
Create baseline queries that retrieve large result sets.
Apply targeted filters to shrink the result set.
Add or modify indexes based on cardinality insights.
Compare execution plans and runtimes before and after optimization.

Step‑by‑Step Procedure

Step 1: Explore the Schema

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'sales_db';

Identify candidate columns for cardinality analysis (e.g., Customers.country, Orders.order_date, Products.category_id).

Step 2: Compute Cardinality

Run a query for each column of interest:

SELECT 
    COUNT(*) AS total_rows,
    COUNT(DISTINCT country) AS distinct_countries,
    COUNT(DISTINCT country) * 1.0 / COUNT(*) AS country_cardinality
FROM Customers;

Repeat for other columns. Record the results in a table for later reference.
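
Repeating the measurement by hand gets tedious, so one option is to loop over the columns from a script. The sketch below uses Python's `sqlite3` module with an in-memory copy of a Customers table. The helper name `column_cardinality` and the generated rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    "customer_id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT)"
)
countries = ["United States", "Canada", "Mexico", "Germany"]
conn.executemany(
    "INSERT INTO Customers (customer_name, country) VALUES (?, ?)",
    [(f"Customer {i}", countries[i % 4]) for i in range(200)],
)


def column_cardinality(conn, table, column):
    """Return (total_rows, distinct_values, ratio) for one column."""
    # Identifiers cannot be bound as parameters, so pass trusted names only.
    sql = f'SELECT COUNT(*), COUNT(DISTINCT "{column}") FROM "{table}"'
    total, distinct = conn.execute(sql).fetchone()
    ratio = distinct / total if total else 0.0
    return total, distinct, ratio


results = {
    col: column_cardinality(conn, "Customers", col)
    for col in ("customer_id", "customer_name", "country")
}
for col, (total, distinct, ratio) in results.items():
    print(f"{col:15} total={total} distinct={distinct} ratio={ratio:.3f}")
```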

Step 3: Build a Baseline Query

A typical baseline might retrieve all orders with their line items:

SELECT 
    o.order_id,
    o.order_date,
    c.customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM Orders o
JOIN Customers c ON o.customer_id = c.customer_id
JOIN OrderItems oi ON o.order_id = oi.order_id
JOIN Products p ON oi.product_id = p.product_id;

Capture the execution plan (EXPLAIN ANALYZE) and note the runtime.
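
If you also want to record runtimes from a script, a small timing wrapper is one way to do it. This is a sketch using Python's `sqlite3` and `time.perf_counter`; the helper `timed_query` and the generated Orders rows are hypothetical, and the timing includes fetching all rows into Python.

```python
import sqlite3
import time


def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO Orders (customer_id, order_date) VALUES (?, ?)",
    [(i % 100, f"2023-{(i % 12) + 1:02d}-01") for i in range(5000)],
)

baseline_rows, baseline_seconds = timed_query(conn, "SELECT * FROM Orders")
print(f"{len(baseline_rows)} rows in {baseline_seconds:.4f}s")
```

Single measurements are noisy, so running the query several times and comparing the typical value gives a fairer baseline.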

Step 4: Apply Targeted Filters

Suppose the business wants orders from the United States placed in Q4 2023. Modify the query:

SELECT 
    o.order_id,
    o.order_date,
    c.customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM Orders o
JOIN Customers c ON o.customer_id = c.customer_id
JOIN OrderItems oi ON o.order_id = oi.order_id
JOIN Products p ON oi.product_id = p.product_id
WHERE c.country = 'United States'
  -- Half-open range: also matches December 31 rows if order_date stores a time component.
  AND o.order_date >= '2023-10-01'
  AND o.order_date < '2024-01-01';

Run EXPLAIN ANALYZE again. Observe whether the planner now uses an index on Customers.country (if available) and a range scan on Orders.order_date.
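
The effect of the targeted filter on result size can be checked on a toy dataset. The sketch below builds the four tables in an in-memory SQLite database and counts the rows returned with and without the filter. All rows are invented for illustration, and the date filter is written as a half-open range, which is one way to express Q4 2023.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE OrderItems (order_id INTEGER, product_id INTEGER, quantity INTEGER, unit_price REAL);
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, product_name TEXT);

INSERT INTO Customers VALUES (1, 'Ada', 'United States'), (2, 'Ben', 'United States'),
                             (3, 'Cleo', 'Canada'), (4, 'Dev', 'Mexico');
INSERT INTO Orders VALUES (1, 1, '2023-11-05'), (2, 1, '2023-06-20'), (3, 2, '2023-12-31'),
                          (4, 3, '2023-10-15'), (5, 4, '2023-12-01'), (6, 2, '2024-01-02');
INSERT INTO Products VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO OrderItems VALUES (1, 1, 2, 9.99), (1, 2, 1, 24.50), (2, 1, 1, 9.99),
                              (3, 2, 3, 24.50), (4, 1, 5, 9.99), (5, 2, 1, 24.50),
                              (6, 1, 4, 9.99);
""")

JOINS = """
FROM Orders o
JOIN Customers c ON o.customer_id = c.customer_id
JOIN OrderItems oi ON o.order_id = oi.order_id
JOIN Products p ON oi.product_id = p.product_id
"""

# Baseline: every order line in the database.
baseline = conn.execute(f"SELECT o.order_id, p.product_name {JOINS}").fetchall()

# Targeted: United States customers, orders dated in Q4 2023.
targeted = conn.execute(
    f"SELECT o.order_id, p.product_name {JOINS}"
    "WHERE c.country = ? AND o.order_date >= ? AND o.order_date < ?",
    ("United States", "2023-10-01", "2024-01-01"),
).fetchall()

print(len(baseline), len(targeted))  # 7 3
```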

Step 5: Index Tuning Based on Cardinality

If Customers.country shows low cardinality (few distinct values), consider a bitmap index (if your RDBMS supports it) or a composite index (country, customer_id).
If Orders.order_date shows high cardinality (many distinct dates), a B-tree index is ideal for efficient range scans.

Create indexes designed for the most common query patterns:

CREATE INDEX idx_customers_country ON Customers(country);
CREATE INDEX idx_orders_order_date ON Orders(order_date);

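For composite filters, consider multi-column indexes. For example, suppose queries often filter by country and order_date together. Because country is stored in Customers, the filter reaches Orders through customer_id, so the composite index pairs that join key with the date:

CREATE INDEX idx_orders_country_date ON Orders(customer_id, order_date);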
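
To check that a composite index like this one is actually picked up, you can inspect the plan for a query that constrains both indexed columns. The sketch below does so in SQLite through Python's `sqlite3` module; the generated rows and the customer id 42 are assumptions, and other engines will word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO Orders (customer_id, order_date) VALUES (?, ?)",
    [(i % 100, f"2023-{(i % 12) + 1:02d}-{(i % 28) + 1:02d}") for i in range(5000)],
)
conn.execute(
    "CREATE INDEX idx_orders_country_date ON Orders (customer_id, order_date)"
)

# Equality on the leading column plus a range on the second column
# lets the engine seek directly into the composite index.
composite_plan = " ".join(
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT order_id FROM Orders "
        "WHERE customer_id = ? AND order_date >= ? AND order_date < ?",
        (42, "2023-10-01", "2024-01-01"),
    )
)
print(composite_plan)
```

Column order matters in this design: the equality-filtered column comes first so that the date range can be scanned within a single customer's entries.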

Step 6: Compare Execution Plans and Runtimes

After applying indexes, re-run the filtered query with EXPLAIN ANALYZE. You’ll likely see:

- Index scans replacing sequential scans.
- Bitmap index scans for low-cardinality columns.
- Reduced execution time (e.g., from 1.2s to 0.05s).

Example output (simplified):

Index Scan using idx_customers_country on Customers  (cost=0.42..18.
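
A before-and-after comparison can also be scripted. The following sketch captures the plan for the same query before and after creating the country index, using SQLite's `EXPLAIN QUERY PLAN` as a stand-in for `EXPLAIN ANALYZE`. The helper name `plan_for` and the generated customers are hypothetical, and the plan text will not match the lab's RDBMS output.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the readable plan text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    "customer_id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT)"
)
countries = ["United States", "Canada", "Mexico", "Germany"]
conn.executemany(
    "INSERT INTO Customers (customer_name, country) VALUES (?, ?)",
    [(f"Customer {i}", countries[i % 4]) for i in range(2000)],
)

query = "SELECT customer_name FROM Customers WHERE country = ?"
params = ("United States",)

plan_before = plan_for(conn, query, params)  # no index exists yet
conn.execute("CREATE INDEX idx_customers_country ON Customers (country)")
plan_after = plan_for(conn, query, params)   # index is now available

print("before:", plan_before)
print("after: ", plan_after)
```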

---

### Conclusion  

Database optimization is an iterative process. Start by understanding your schema and data distribution: cardinality tells you which columns benefit most from indexing. Baseline queries reveal bottlenecks, while targeted filters mimic real-world usage. By strategically adding indexes and comparing execution plans, you can dramatically improve query performance.

Remember:  
- High-cardinality columns (e.g., `order_id`) often need indexes for lookups.
- Low-cardinality columns (e.g., `country`) may benefit from bitmap indexes or composite strategies.
- Always measure before and after; tools like `EXPLAIN ANALYZE` are your best friends.

Apply these principles to your own databases, and you’ll turn sluggish queries into lightning-fast operations.
