This article provides a short checklist of what you have to do to optimize the memory used by a data model in PowerPivot or in Analysis Services Tabular, including links to tools and resources that can help you in this task.
An interesting tutorial on this topic is available in the article Optimize memory consumption of a PowerPivot Data Model in Excel 2013, published by Microsoft. All these suggestions are also valid for Analysis Services Tabular.
Import Only Useful Columns
When you import a table into a Tabular model, you should import only the columns that are really useful for analysis, removing the unnecessary ones. Because data is stored column by column, each column has a cost, which is higher for columns with a high number of distinct values.
Columns that are usually useless and should not be imported are:
- Primary keys or identity columns of fact tables
  - They are not used for analysis and are not required as lookup keys for other tables.
  - Pay attention to calculated columns using CALCULATE in a fact table: a circular dependency error is raised if you define more than one calculated column containing a CALCULATE statement (which is implicit when you reference a measure in such an expression) in a table without a primary key; see the sketch after this list. However, such calculated columns are usually not a requirement for a fact table.
- Transaction identifiers in fact tables
  - If you need this information for drillthrough, consider splitting the identifier into multiple columns, as described in a later step.
- GUIDs and other columns used for database replication and/or auditing
  - Timestamps, date of last update, user of last update, and similar columns are usually not relevant for data analysis.
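As an example of the CALCULATE caveat mentioned above, consider the following sketch, in which the table, column, and measure names are hypothetical: two calculated columns in a fact table without a primary key, each performing a context transition, raise a circular dependency error as soon as both are defined.

-- Hypothetical Sales fact table with no primary key: each calculated column
-- below performs a context transition, so defining both of them raises
-- a circular dependency error.
Sales[Sum Quantity] = CALCULATE ( SUM ( Sales[Quantity] ) )
Sales[Margin] = [Total Margin]  -- measure reference: implicit CALCULATE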
Identify Critical Columns
Examine the size of the tables and columns imported in a database and identify the most critical columns. The dynamic management view $system.Discover_object_memory_usage provides information about the memory consumed by the VertiPaq engine.
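For example, connected to an Analysis Services Tabular instance from SQL Server Management Studio, a query such as the following (a minimal sketch of a DMV query, using the columns exposed by that view) returns the objects sorted by the non-shrinkable memory they consume:

SELECT *
FROM $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
ORDER BY OBJECT_MEMORY_NONSHRINKABLE DESC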
If you have a PowerPivot data model in an Excel workbook, you can use the macro written by Kasper de Jonge in his blog post What is eating up my memory the PowerPivot / Excel edition. The same analysis can be performed on an Analysis Services instance by using the sample Excel workbook described in the post What is using all that memory on my Analysis server instance, by the same author.
Normalize Columns to Lower the Count of Distinct Values
If a table contains denormalized columns, you should import only the minimum set of columns and obtain the derived values by using measures in the data model. For example, consider a Sales table with the following columns:
- Quantity
- Price
- Amount
The best practice is to store only the columns that minimize the number of distinct values. In this case, only Quantity and Price should be imported in the data model. The Amount value will be calculated with a measure such as:
Amount := SUMX ( Sales, Sales[Quantity] * Sales[Price] )
Please note that you should not create a calculated column for Amount, because otherwise you would still waste memory storing a column with a higher number of distinct values.
Split High Cardinality Columns
If a column has too many distinct values and cannot be removed (e.g. a transaction ID kept in a fact table for drillthrough), consider splitting the column into multiple parts. For example, always separate date and time into two columns, instead of keeping the original datetime. Each of the parts will have a smaller number of unique values, and the combined memory consumed will be smaller than that required by the original column.
VertiPaq compresses data at the column level, creating a dictionary for each column and storing for each row only the number of bits required to represent the index into the dictionary. A longer explanation of this technique is available in the article Optimizing High Cardinality Columns in VertiPaq on this site.
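As a sketch of this technique, assume a numeric TransactionID column of up to ten digits (all table and column names here are hypothetical). The split is best performed in the source query, so that the original high cardinality column is never imported, and the identifier can be rebuilt for drillthrough by a measure:

-- Split applied in the source query (arithmetic shown for reference):
--   TransactionHigh = integer part of TransactionID divided by 100,000
--   TransactionLow  = remainder of TransactionID divided by 100,000
-- Measure that rebuilds the identifier when a single transaction is selected:
Transaction ID :=
IF (
    HASONEVALUE ( Sales[TransactionHigh] ) && HASONEVALUE ( Sales[TransactionLow] ),
    VALUES ( Sales[TransactionHigh] ) * 100000 + VALUES ( Sales[TransactionLow] )
)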
Reduce Precision of Columns When Possible
Reducing the precision of a number helps lower the number of distinct values in that column. For example, you can round a number to an integer if the decimal part is not relevant for analysis. However, be aware that just changing the data type is not enough: the consumed memory is reduced only when the number of distinct values in the column is reduced.
For example, if you only need precision to the minute level, you can round a time value to the nearest minute, or at least to the nearest second if the column stores milliseconds.
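As a minimal sketch, assuming a hypothetical Sales[OrderTime] column that stores the time of each transaction, the rounding to the minute could be expressed as follows (ideally applied in the source query, so that the full-precision value is never imported):

-- A datetime is stored as a number of days, so multiplying by 1440
-- converts it into minutes; rounding and dividing back removes the seconds.
OrderTimeRounded = ROUND ( Sales[OrderTime] * 1440, 0 ) / 1440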