Creation date Minimum version
2017-9-22 1.2.2

Distinct

The Distinct extension is similar to the DISTINCT operator in SQL commands. The main purpose is to remove duplicate rows. Rows are considered duplicates if they have identical values for the specified columns. The values in columns not specified are disregarded when the extension determines whether the rows are duplicates. After the duplicates are removed, the remaining rows are distinct rows.

The Distinct extension consists of three functions.

Distinct function

This function creates a table with distinct rows and only the columns specified in your input. The function is equivalent to the following SQL command:

SELECT DISTINCT <Columns> FROM <triggerTable>;

Input

The following list describes the input fields:

  • Columns specifies the columns to be considered when the function determines whether rows are duplicates. For example, if you specify column 1 and column 2, two rows are considered non-distinct if they have the same values for column 1 and column 2, even if these rows have different values for column 3,
  • New Table Name specifies the name of the resultant table. If the name is not specified, the default name is tableName-GB where tableName is the name of the input table.

Output

The resultant table consists of the specified columns, and each row is unique. That is, if there are non-distinct rows in the input table, only one instance of each group of non-distinct rows is in the resultant table.

Example

The following screenshot shows the input table:

To create a table with only rows that are distinct when the reviews::Restaurant and reviews::Stars columns are considered, enter the input as shown in the following screenshot:

After you apply the extension, the following resultant table is created:

The resultant table contains only the distinct values for the reviews::Restaurant and reviews::Stars columns. The resultant table name is also modified according to your input. Because Le Papillon is the only restaurant to receive the same number (3) of stars twice, one row from the input table is removed. That is, only one instance of the Le Papillon row from the input table is in the resultant table.

Count Distinct function

This function returns an aggregate, which is equivalent to this SQL command:

SELECT COUNT(*) FROM (SELECT DISTINCT <Columns> FROM <triggerTable>);

The result is equal to the number of rows in the resultant table of the Distinct function.

Input

The following list describes the input fields:

  • Columns specifies the columns to be considered when the function determines whether rows are duplicates. For example, if you specify column 1 and column 2, two rows are considered non-distinct if they have the same values for column 1 and column 2, even if these rows have different values in column 3,
  • Aggregate Name saves the aggregate value under a name. If provided, Xcalar creates an aggregate with this name, whose value is the result of the extension. Specify a name if you plan to use the aggregate in a future operation.

Output

An aggregate representing the number of distinct rows that are present in the table with respect to the specified columns.

Example

Using the same input table as above, you can enter the same columns as specified for the Distinct function:

The aggregate shown in the following output is equal to the number of rows in the resultant table created by the Distinct function.

The output is also saved under the specified aggregate name (DistinctRatings), which you can display as you would any aggregates in the Tables panel.

Group By Count Distinct

This function creates a table with columns for groups of rows. The rows show the counts of distinct rows in each group. The function works the same as a Group By operation on columns in Group By Columns. The aggregate in the Group By is the result of the Count Distinct function performed on Count Distinct Columns.

This function is equivalent to this SQL command:

SELECT COUNT(*), <Group By Columns> FROM (SELECT DISTINCT <All Columns> FROM <triggerTable>)

GROUP BY <Group By Columns>;

All Columns in the SQL command refers to all unique columns present in the combination of Group By Columns and Count Distinct Columns.

Input

The following list describes the input fields:

  • Count Distinct Columns specifies the columns whose values are considered when the function counts the distinct rows in the group. (Groups are defined by Group By Columns.) The count is displayed in the column named by Resultant Column Name.
  • Group By Columns specifies the columns for which the function calculates the number of unique records with respect to Count Distinct Columns.
  • Resultant Column Name specifies the name of the additional column created by the extension. If the name is not specified, it defaults to DistinctCount.
  • New Table Name specifies the resultant table name. If the name is not specified, the table name is tableName-GBCD, where tableNameis the name of the input table.

Output

The resultant table shows the Group By Columns, and an aggregate column showing the count of unique records with respect to Count Distinct Columns. This aggregate is the same as the result of the Count Distinct function.

Example

In the following example, you can group the restaurants by Cuisine. Then for each group, you count the distinct rows for the reviews::Restaurant column. The number of unique rows in each group is displayed in a new column named numOfRestaurantsPerCuisine.

The function creates the following resultant table. Three groups of restaurants are shown in the Cuisine column (specified by Group by Columns). In each group, the extension counts the distinct rows for the columns specified by Count Distinct Columns.

The following list explains how the extension obtains the counts for the resultant table:

  • in the Thai group, there are 2 distinct rows. That is, the reviews::Restaurant column contains 2 unique restaurant names (Silk Elephant and Bangkok Thai).
  • In the French group, there are 2 distinct rows. That is, the reviews::Restaurant column contains 2 unique restaurant names (Crepes Parisienne and Le Papillon).
  • In the Pizza group, there is 1 distinct row. That is, the reviews:Restaurant column contains 1 unique restaurant names (Sliver).

Go to top