Channel: Tabular – SQLBI

Best Practices Using SUMMARIZE and ADDCOLUMNS

Most DAX users are probably familiar with the SQL query language. Because Tabular data modeling resembles relational data modeling, it is natural to expect that DAX can perform the same operations that SQL allows. However, in its current implementation DAX does not permit all the operations that you can do in SQL. Some limitations are caused by the lack of equivalent syntax; others depend on counterintuitive behavior of the xVelocity in-memory engine when extension columns are involved in a query.

NOTE: you can try all the queries in this article against the AdventureWorks Tabular Model, which you can download from Codeplex. All the outputs were produced with DaxStudio, our favorite free DAX editor.

Extension Columns

Extension columns are columns that you add to existing tables. You can obtain extension columns by using either ADDCOLUMNS or SUMMARIZE. For example, the following query adds a Year Production column to the rows returned from the Product table.

EVALUATE
ADDCOLUMNS(
    Product,
    "Year Production", YEAR( Product[Product Start Date] )
)

FIG01

You can also create an extension column by using SUMMARIZE. For example, you can count the number of products for each product category by using the following query (please note that such a query is not a best practice – you will see why later in this article).

EVALUATE
SUMMARIZE(
    Product,
    Product[Product Category Name],
    "Products", COUNTROWS( Product )
)

FIG02

In practice, an extension column is a calculated column created within the query.

Query Projection

In a SELECT statement in SQL, you can choose the columns projected in the result, whereas in DAX you can only add columns to a table by creating extension columns. The only workaround is to use SUMMARIZE to group the table by the columns you want in the output. As long as you do not need to see duplicated rows in the result, this solution has no particular side effects. For example, if you want to get just the list of product names and their corresponding production start date, you can write the following query.

EVALUATE
SUMMARIZE(
    Product,
    Product[Product Name],
    Product[Product Start Date]
)

FIG03

Whenever you can create an extended column by using either ADDCOLUMNS or SUMMARIZE, you should favor ADDCOLUMNS for performance reasons. For example, you can add the year of the production start date by using two techniques. First, you can just use SUMMARIZE.

EVALUATE
SUMMARIZE(
    Product,
    Product[Product Name],
    Product[Product Start Date],
    "Year Production", YEAR( Product[Product Start Date] )
)

Second, you can use ADDCOLUMNS to add the Year Production column to the SUMMARIZE result.

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        Product,
        Product[Product Name],
        Product[Product Start Date]
    ),
    "Year Production", YEAR( Product[Product Start Date] )
)

Both queries produce the same result.

FIG04

However, you should always favor the ADDCOLUMNS version. The rule of thumb is that you should never add extended columns by using SUMMARIZE, unless it is required because at least one of the following conditions exists:

  • You want to use ROLLUP over one or more grouping columns in order to obtain subtotals
  • You are using non-trivial table expressions in the extended column, as you will see in the “Filter Context in SUMMARIZE and ADDCOLUMNS” section later in this article
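
As a minimal sketch of the first condition, the following query keeps the extended column inside SUMMARIZE because it asks for a subtotal with ROLLUP. It reuses the Product table and the Products column from the earlier example; the result should contain one row per category plus one additional row, with a blank category name, that holds the count over all categories.

```dax
-- Sketch only: same table and column names as the earlier SUMMARIZE example.
EVALUATE
SUMMARIZE(
    Product,
    -- ROLLUP adds a subtotal row for the grouping column it wraps
    ROLLUP( Product[Product Category Name] ),
    "Products", COUNTROWS( Product )
)
```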

The best practice is that, whenever possible, instead of writing

SUMMARIZE( <table>, <group_by_column>, <column_name>, <expression> )

you should write:

ADDCOLUMNS(
    SUMMARIZE( <table>, <group_by_column> ),
    <column_name>, CALCULATE( <expression> )
)

The CALCULATE you can see in the best practices template above is not always required, but you need it whenever the <expression> contains an aggregation function. The reason is that ADDCOLUMNS operates in a row context that does not automatically propagate into a filter context, whereas the same <expression> within a SUMMARIZE is executed in a filter context corresponding to the values in the grouped columns. The previous examples used a scalar expression over a column that was included in the SUMMARIZE output, so the reference to the column value was valid within the row context. Now, consider the following query that you have already seen at the beginning of this article.

EVALUATE
SUMMARIZE(
    Product,
    Product[Product Category Name],
    "Products", COUNTROWS( Product )
)

If you rewrite this query by simply moving the Products extended column out of the SUMMARIZE into an ADDCOLUMNS function, you obtain the following query. It produces a wrong result, because it returns the number of rows in the entire Product table for each row of the result, instead of the number of products for each category.

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        Product,
        Product[Product Category Name]
    ),
    "Products", COUNTROWS( Product )
)

FIG05

In order to obtain the expected result, you have to wrap the expression for the Products extended column within a CALCULATE function. In this way, the row context for the Product Category Name is transformed into a filter context and the COUNTROWS function only considers the products belonging to the category of the current row.

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        Product,
        Product[Product Category Name]
    ),
    "Products", CALCULATE( COUNTROWS( Product ) )
)

FIG06

Thus, as a rule of thumb, wrap any expression for an extended column within a CALCULATE function whenever you move an extended column out of SUMMARIZE into an ADDCOLUMNS function.
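
A related sketch: a reference to a measure is evaluated as if it were wrapped in CALCULATE, so moving the aggregation into a local measure has the same effect as the explicit CALCULATE. The measure name Product Count below is hypothetical and is introduced only for this illustration; it assumes no measure or column with that name already exists in the model.

```dax
DEFINE
    -- Hypothetical local measure, used only for this example
    MEASURE Product[Product Count] = COUNTROWS( Product )
EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        Product,
        Product[Product Category Name]
    ),
    -- The measure reference converts the row context of the current
    -- category into a filter context, as an explicit CALCULATE would do
    "Products", [Product Count]
)
```

This query should return the same number of products per category as the version that uses CALCULATE( COUNTROWS( Product ) ).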

Grouping by Extension Columns

A counterintuitive limitation in DAX is that you can group by extension columns, but you cannot perform meaningful calculations while grouping by them. For example, consider an extended column added to the Internet Sales table that returns the range of unit prices obtained with a logarithmic expression. In practice, any sale with a unit price from 0.1 up to (but not including) 1 is grouped as 1, from 1 up to 10 is grouped as 10, from 10 up to 100 is grouped as 100, and so on.

EVALUATE
ADDCOLUMNS(
    'Internet Sales',
    "Price Level", POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
)
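
To see how the Price Level expression maps a price to its level, the following sketch applies the same expression to three literal prices instead of the Unit Price column. The prices 0.5, 3.99 and 24.99 and the column names are arbitrary values chosen for illustration, not data taken from the model.

```dax
EVALUATE
ROW(
    -- LOG10( 0.5 ) is about -0.30, INT rounds it down to -1, so 10 ^ ( 1 - 1 ) = 1
    "Level A", POWER( 10, 1 + INT( LOG10( 0.5 ) ) ),
    -- LOG10( 3.99 ) is about 0.60, INT rounds it down to 0, so 10 ^ ( 1 + 0 ) = 10
    "Level B", POWER( 10, 1 + INT( LOG10( 3.99 ) ) ),
    -- LOG10( 24.99 ) is about 1.40, INT rounds it down to 1, so 10 ^ ( 1 + 1 ) = 100
    "Level C", POWER( 10, 1 + INT( LOG10( 24.99 ) ) )
)
```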

You can group data by using the Price Level extension column in a SUMMARIZE expression, so that you can see which are the groups for the existing sales.

EVALUATE
SUMMARIZE(
    ADDCOLUMNS(
        'Internet Sales',
        "Price Level", POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
    ),
    [Price Level]
)
ORDER BY [Price Level]

FIG07

However, the extended columns that you can use in a SUMMARIZE expression are not part of the filter context. Thus, if you try to add an extended column to a SUMMARIZE expression that groups by Price Level, the expression cannot be grouped by Price Level and produces an unexpected result.

EVALUATE
SUMMARIZE(
    ADDCOLUMNS(
        'Internet Sales',
        "Price Level", POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
    ),
    [Price Level],
    "Total Sales", SUM( 'Internet Sales'[Sales Amount] )
)
ORDER BY [Price Level]

FIG08

The Total Sales extended column always contains the sum of Sales Amount for all the rows of the Internet Sales table, regardless of the Price Level. This is completely counterintuitive, because the result shows a different row for each level. The Price Level column behaves as if it did not belong to the Internet Sales table but to a separate table, unrelated to Internet Sales, so that its filter context does not propagate to Internet Sales.
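
One way to confirm this, sketched here under the assumption that no other filter is applied to the query, is to compute the grand total on its own and compare it with the Total Sales value repeated on every row of the previous result. The column name All Sales is arbitrary.

```dax
-- Grand total of Sales Amount over the whole table, with no grouping
EVALUATE
ROW( "All Sales", SUM( 'Internet Sales'[Sales Amount] ) )
```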

Note: in future versions of Analysis Services, the query you have just seen might produce warnings or errors instead of returning this unexpected result.

For this reason, combining CALCULATE and ADDCOLUMNS as in the following query produces the same result as the previous query, which is not what we would like to see.

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        ADDCOLUMNS(
            'Internet Sales',
            "Price Level", POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
        ),
        [Price Level]
    ),
    "Total Sales", CALCULATE( SUM( 'Internet Sales'[Sales Amount] ) )
)
ORDER BY [Price Level]

Since you do not have a relationship between two tables (Internet Sales and the “virtual” one for Price Level), you have to inject a filter condition within the CALCULATE expression, in order to only consider the rows in Internet Sales that have a price included within the level defined by Price Level. A simple way to do that is repeating the expression that calculates the Price Level in the filter expression, such as in the following query.

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        ADDCOLUMNS(
            'Internet Sales',
            "Price Level", POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
        ),
        [Price Level]
    ),
    "Total Sales",
        CALCULATE(
            SUM( 'Internet Sales'[Sales Amount] ),
            FILTER(
                'Internet Sales',
                [Price Level]
                    = POWER( 10, 1 + INT( LOG10( 'Internet Sales'[Unit Price] ) ) )
            )
        )
)
ORDER BY [Price Level]

In order to avoid the duplication of an expression, you can use the DEFINE MEASURE syntax.

DEFINE
    MEASURE 'Internet Sales'[Price Band]
        = POWER( 10, 1 + INT( LOG10( VALUES( 'Internet Sales'[Unit Price] ) ) ) )
EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        ADDCOLUMNS(
            'Internet Sales',
            "Price Level", [Price Band]
        ),
        [Price Level]
    ),
    "Total Sales",
        CALCULATE(
            SUM( 'Internet Sales'[Sales Amount] ),
            FILTER( 'Internet Sales', [Price Level] = [Price Band] )
        )
)
ORDER BY [Price Level]

Both previous queries return the expected result, showing the sum of Sales Amount for each price level.

FIG09

The final outcome is that you have to generate the proper filter context in any calculation based on the grouping of an extended column, because it does not affect the filter context of the table to which it has been added.

Measure’s Syntax Observations

You might wonder why we did not use the same Price Level name for both the local measure and the extended column names. The reason is that even if it is possible, it would make the query harder to read. In fact, you can try the previous query by using Price Level instead of Price Band as the name of the local measure, as follows.

DEFINE
    MEASURE 'Internet Sales'[Price Level]
        = POWER( 10, 1 + INT( LOG10( VALUES( 'Internet Sales'[Unit Price] ) ) ) )
EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        ADDCOLUMNS(
            'Internet Sales',
            "Price Level", [Price Level]
        ),
        [Price Level]
    ),
    "Total Sales",
        CALCULATE(
            SUM( 'Internet Sales'[Sales Amount] ),
            FILTER( 'Internet Sales', [Price Level] = [Price Level] )
        )
)
ORDER BY [Price Level]

However, the query written in this way does not work, because the condition [Price Level] = [Price Level] in the FILTER statement always returns true, producing a wrong result.

FIG10

In this case, the EARLIER statement would not help you. The problem is that, as a best practice, we usually refer to a measure without specifying the table name in which it is defined. The reason is that in a Tabular model a measure cannot have the same name as any column in any table of the data model. Omitting the table name makes a measure easily recognizable in a query, because we always use the table name to reference a column, even when this is not strictly required. However, when you define a local measure in a query, you can override any existing column.

In the previous example, the same name is used for both a local measure (defined with DEFINE MEASURE) and an extended column (created with ADDCOLUMNS). SUMMARIZE groups by the extended column, and within the FILTER statement the unqualified [Price Level] syntax also references the extended column, not the measure. An extended column does not belong to any table and can be referenced only by its column name without a table name, which is the same syntax that we use as a best practice to reference measures. Thus, in order to discriminate between the two, you have to reference the local measure by including its table name (Internet Sales). The following query returns the correct result.

DEFINE
    MEASURE 'Internet Sales'[Price Level]
        = POWER( 10, 1 + INT( LOG10( VALUES( 'Internet Sales'[Unit Price] ) ) ) )
EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        ADDCOLUMNS(
            'Internet Sales',
            "Price Level", [Price Level]
        ),
        [Price Level]
    ),
    "Total Sales",
        CALCULATE(
            SUM( 'Internet Sales'[Sales Amount] ),
            FILTER( 'Internet Sales', [Price Level] = 'Internet Sales'[Price Level] )
        )
)
ORDER BY [Price Level]

We strongly suggest that you do not give an extended column or a local measure any name already used by another measure or column.

Filter Context in SUMMARIZE and ADDCOLUMNS

When describing the pattern of creating extended columns with ADDCOLUMNS instead of SUMMARIZE, we mentioned that there are conditions in which you cannot do this replacement, because the result would not be correct. For example, when you apply filters over columns that are not included in the grouped columns and then calculate the extended column expression using data coming from related tables, the filter context will be different between SUMMARIZE and ADDCOLUMNS.

The following query returns, for each Product Category and Customer Education, the profit made by the top 2 customers of each product. Thus, for a given product, each education level might include 0, 1 or 2 of those customers:

EVALUATE
SUMMARIZE(
    GENERATE(
        Product,
        TOPN(
            2,
            Customer,
            CALCULATE( SUM( 'Internet Sales'[Sales Amount] ) )
        )
    ),
    Product[Product Category Name],
    Customer[Education],
    "Profit",
    SUM( 'Internet Sales'[Gross Profit] )
)
ORDER BY [Profit] DESC

FIG11

In this case, applying the pattern of moving the extended columns out of a SUMMARIZE into an ADDCOLUMNS does not work, because the GENERATE used as a parameter of the SUMMARIZE returns only a few products and customers, and the SUMMARIZE only considers the sales related to these combinations of products and customers. Consider the following query and its result (please note that the GENERATE statement is included within a CALCULATETABLE statement, so that it transforms the row context of the ADDCOLUMNS statement into a filter context for executing the GENERATE statement only for the products of the current category):

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        GENERATE(
            Product,
            TOPN(
                2,
                Customer,
                CALCULATE( SUM( 'Internet Sales'[Sales Amount] ) )
            )
        ),
        Product[Product Category Name],
        Customer[Education]
    ),
    "Profit",
    CALCULATE(
        SUM( 'Internet Sales'[Gross Profit] ),
        CALCULATETABLE(
            GENERATE(
                Product,
                TOPN(
                    2,
                    ALL( Customer[CustomerKey] ),
                    CALCULATE( SUM( 'Internet Sales'[Sales Amount] ) )
                )
            )
        )
    )
)
ORDER BY [Profit] DESC

FIG12

As you can see, the results are different: Profit is higher than in the initial result. The reason is that this query considers the top 2 customers within each customer Education for each product in the same category, whereas the original query considered the top 2 customers for each product. In the original query, when these two customers had different education, only a single customer for that product contributed to each row of the result.

If you wrap the SUMMARIZE in an ADDCOLUMNS, the extended columns created in ADDCOLUMNS work in a filter context defined by Product Category and Customer Education, considering many more sales than those used by the initial query. Thus, in order to generate the equivalent result by using ADDCOLUMNS, it is necessary to replicate the GENERATE operation in a CALCULATETABLE statement. However, because the calculation must relate to the Product Category and Customer Education included in the output, we also need to change the original GENERATE in order to remove the part of the filter context that might alter the calculation used by TOPN.

This is the equivalent DAX query using ADDCOLUMNS for generating the extended column:

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        GENERATE(
            Product,
            TOPN(
                2,
                Customer,
                CALCULATE( SUM( 'Internet Sales'[Sales Amount] ) )
            )
        ),
        Product[Product Category Name],
        Customer[Education]
    ),
    "Profit",
    CALCULATE(
        SUM( 'Internet Sales'[Gross Profit] ),
        CALCULATETABLE(
            GENERATE(
                Product,
                TOPN(
                    2,
                    ALL( Customer[CustomerKey] ),
                    CALCULATE(
                        SUM( 'Internet Sales'[Sales Amount] ),
                        ALL( Customer[Education] )
                    )
                )
            )
        )
    )
)
ORDER BY [Profit] DESC

FIG11

You should observe that the inner GENERATE uses the single column Customer[CustomerKey] instead of the Customer table, because it is necessary to interact with the external filter context in order to produce a correct result. A full explanation of this query would be longer and is out of the scope of this article. The conclusion is that extended columns in a SUMMARIZE expression should not be moved out to an ADDCOLUMNS if the table used in SUMMARIZE has particular filters and the extended column expression uses columns that are not part of the output. Even if you can create an equivalent ADDCOLUMNS query, the result is much more complex and there are no performance benefits in this refactoring. The much more complex query has exactly the same (not so good) performance as the SUMMARIZE one: both queries in this section require almost 20 seconds to run on Adventure Works 2012 Tabular.

