|Creation date||Minimum version|
The purpose of normalization is to organize columns and tables of a relational database to reduce data redundancy. The Normalization extension enables you to specify the rules that determine whether duplicate rows exist in a table. According to your specifications, it produces tables with an optimum structure while preserving data integrity.
In the My Extensions panel, the Normalization extension prompts for several types of input as described in this section.
This is the input table to be normalized.
Dimension Table Names
A dimension table is one of two kinds of tables created by the Normalization extension. It is the result of deduplication. That is, the extension deduplicates rows containing identical values in the columns selected in the original table, creating a dimension table with only unique rows.
In the Dimension Table Names section, enter the name of each dimension table.
The dimension table name also forms a part of the name of the primary key column, unless you specify a primary key. For example, if you do not specify a primary key and the dimension table name is restaurant, the primary key column's name is restaurantId.
You can have multiple dimension tables. After entering one name, click Add another argument to specify another name.
For each dimension table, specify the columns to be included. Enter the column names in the Dimension's Columns section, where an argument is a list of columns to be included in the dimension table. Make sure that you specify the same number of arguments in this section as the dimension table names.
Only the values in the columns specified are considered when the extension performs the row deduplication.
For example, in a table with two rows and two column, suppose row 1 and row 2 have the same values in column 1 but not column 2, selecting only column 1 for the dimension table causes the table to contain only one row. Row 1 and row 2 in the original table are duplicates because column 2 is not taken into consideration.
The following list describes how to specify the columns:
- To select a column, click anywhere in the column in the table to be normalized. The column name is displayed in the list. You can click as many columns as desired. Their names are placed in the list, separated by a comma.
- To enter another list of columns for another dimension table, click Add another argument.
- The order of the arguments (that is, the lists of columns) must be the same as the order of the dimension table names. For example, suppose you have two dimension table names specified in this order: User and Business. The first list of columns should be for the User table, and the second list of columns should be for the Business table.
Each dimension table must have a primary key. The following list provides the guidelines for specifying the primary key:
- To select a column as the primary key, click anywhere in the column in the table to be normalized. The column name is displayed in the Column Keys section.
- The primary specified must be a column listed in the Dimension's Columns section for the corresponding dimension table.
- To enter a primary key for another table, click Add another argument.
- The order of the primary keys must be the same as the order of the dimension table names. For example, suppose you have two dimension table names specified in this order: User and Business. The first primary key should be for the User table, and the second primary key should be for the Business table.
- Each primary key must be a value that can uniquely identify a row. The extension fails if it detects that in the dimension table to be created, there would be duplicate values in the primary key column.
- If you do not specify a key for a dimension table, the extension creates a primary key for the table. The column name for the key is the dimension table name followed by the string Id. The generated column consists of integers starting from 1.
Fact Table Name
A fact table is one of two kinds of tables created by the Normalization extension. Its purpose is to preserve the data integrity in the original table.
The fact table name is optional. If you do not specify a name, the default name is inputTable-factTable, where inputTable is the name of the table being normalized.
The extension creates two types of tables.
The extension creates dimension tables according to your specifications (the table name and the columns). The columns are placed in each dimension table in the same order as they are specified in the Dimension's Columns section. Rows in a dimension table are unique.
If you do not specify a primary key, an additional column is created as the primary key populated with integers starting from 1.
In the data browser, the following derived fields are created:
- Count_of_tableName: tableName is the name of the dimension table. This field indicates the number of rows that are deduplicated. For example, in the dimension table named User, if the Count_of_User field for row 1 of User is 3, row 1 occurs in the original table 3 times. That is, there are 3 rows in the original table that contain the same values in the columns included in the dimension table.
- tableName_concat: This field is used internally by the extension.
The extension creates a fact table, which contains the following columns:
- primary key column of each dimension table
- columns from the original table (the table being normalized) that are not in any dimension tables
The number of rows in the fact table is always equal to the number of rows in the table being normalized. Columns not in the fact table but in the original table are retained in the data browser., which enables you to pull the columns to the fact table if desired. No data from the original table is lost during normalization and data integrity is preserved.
In this example, the Normalization extension is used to normalize a table named Reviews, which is shown in the following screenshot:
In the My Extensions panel, start the Normalization extension and enter the input as shown in the following screenshot:
In this example, three dimension tables are specified: User and Business.
For the User table, these columns are selected:
For the Business table, these columns are selected:
Because all columns are prefixed columns, the table prefix is included in the column names.
The primary key for the User table is UserId. No primary key is specified for the Business table.
The dimension table named User is shown in the following screenshot. No primary key column is created because the column UserId is the specified primary key.
The dimension table named Business is shown in the following screenshot. The BusinessId column is created as a primary key because no primary key is specified.
The fact table named factTable is created with primary key columns from the dimension tables and the Reviews::Review column from the original table, which is not included in either dimension table.