# Step: Calculate Statistics

This step lets playbooks run statistical analysis on data produced by earlier analysis. By choosing a dataset specification type and filling in its supporting fields, you can compute a number of descriptive statistics from a sample of persisted data.

• Dataset Key: The value that identifies the dataset maintained by Darklight. This value can be combined with FreeMarker Expressions. In the example above, this is the event's account name.
• Input Value: The package data value(s) to add for analysis. These are persisted and used to build the dataset. This field is not required for the step to run. This value can be combined with FreeMarker Expressions. In the example above, this is the download size of the event.
• Minimum Sample Size: The minimum number of data values that must be collected in the sample dataset before analysis can be done. Until the dataset reaches this size, packages are sent down the "-" (Minus) output of the step.
• Additive Method: How the Input Value is to be treated after being used for analysis:
  • None: Not added to the dataset. Useful when combining multiple Statistics steps so the second step doesn't duplicate the value.
  • Outlier Exclusive: Outliers will not be added to the dataset. Definition of an outlier
  • All Inclusive: Input Value will be added to the dataset after analysis
• Dataset Spec Type: Determines which data values from the sample dataset are collected for analysis:
  • All: All data values from the sample dataset will be used for analysis
  • Period: Uses the data values persisted within a specific period of time (uses Begin Time and End Time)
  • First Period: A period that begins when the first data value in the sample dataset was persisted and lasts for the given duration (uses Duration)
  • Last Period: A period of the given duration that ends when the latest data value was persisted (uses Duration)
  • First/Last N: Either the first n persisted data values of the sample set or the last n data values (uses N Data Values)
• Begin Time: The begin time used for the Period dataset spec type
• End Time: The end time used for the Period dataset spec type
• Duration: The length of time used by the First Period and Last Period dataset spec types. Enter the amount in the value field and select its unit of time from the adjacent dropdown
• N Data Values: The number of data values to get for the First N and Last N dataset spec types
• Purge Options: Which data, if any, to remove after analysis:
  • Don't purge: The data is not altered after analysis
  • Data used for analysis: New and retrieved data used for analysis is removed (purged) from the database for that key
  • Data not used for analysis: All the data that was not used for analysis is removed from the database for that key
• Statistic: The statistical values that can be calculated from the sample datasets.
  • Minimum, Maximum, Mean, Median, Mode, Standard Deviation, 1st Quartile (25th percentile), 3rd Quartile (75th percentile)
  • The box to the right of the statistical method is the name of the variable that the resulting statistic will be saved to in the package. In the example, the Mean calculation will be stored in a package variable called "average" and available to the next step as `${average}`.
  • Note that the Mode could result in multiple values, so it is output as a list (i.e., a single-dimensional array), referenced with FreeMarker like `${mode[0]}`, with 0 being the first item in the list.
• Add statistic: Click the + icon to add a new statistical method to calculate using all of the above options. Only one of each kind of statistic can be chosen per step. Make sure the variable names you are using for each statistic are unique.
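To make the Dataset Spec Type and Statistic options more concrete, here is a minimal Python sketch that picks values from a timestamped sample and computes each listed statistic with the standard library. The `select` and `describe` helpers, the `(timestamp, value)` layout, the inclusive period boundaries, the sample standard deviation, and Python's default quartile method are all assumptions made for illustration; this page does not document how Darklight stores values or which formulas it uses.

```python
from datetime import datetime, timedelta
import statistics

# Hypothetical sample: (persisted_at, value) pairs in insertion order.
start = datetime(2024, 1, 1)
sample = [(start + timedelta(hours=i), v)
          for i, v in enumerate([10, 12, 12, 15, 18, 18, 20, 40])]

def select(sample, spec, begin=None, end=None, duration=None, n=None):
    """Return the values chosen by a Dataset Spec Type (illustrative only)."""
    if spec == "all":
        rows = sample
    elif spec == "period":            # uses Begin Time and End Time
        rows = [r for r in sample if begin <= r[0] <= end]
    elif spec == "first_period":      # first persisted value + Duration
        first = sample[0][0]
        rows = [r for r in sample if r[0] <= first + duration]
    elif spec == "last_period":       # Duration before the latest value
        last = sample[-1][0]
        rows = [r for r in sample if r[0] >= last - duration]
    elif spec == "first_n":           # n is assumed to be positive
        rows = sample[:n]
    elif spec == "last_n":
        rows = sample[-n:]
    else:
        raise ValueError(f"unknown spec type: {spec}")
    return [value for _, value in rows]

def describe(values):
    """Compute the statistics listed above (formulas are assumptions)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),  # a list, like ${mode[0]}
        "standard_deviation": statistics.stdev(values),
        "first_quartile": q1,
        "third_quartile": q3,
    }

stats = describe(select(sample, "all"))
last_three = select(sample, "last_n", n=3)
```

Here `stats["mode"]` holds two values because 12 and 18 each appear twice, which is the situation the list-valued Mode output is meant to handle.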

Keep the size of each dataset in mind. The current repository cap is ten million data values across all sample sets, and each set is capped at fifty thousand values. Use the Purge Options to control the storage of values in the dataset, or the command-line options below for broader cleanup.
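The effect of each Purge Option on a single key can be pictured with a short Python sketch. The function name, the option strings, and the idea of tracking the analysed values by list index are hypothetical; they only model the behaviour described above and are not Darklight's implementation.

```python
def remaining_after_purge(stored, used_indexes, option):
    """Return the values left for one key after analysis (illustrative only).

    stored       -- all values persisted for the key, in insertion order
    used_indexes -- positions in `stored` that were used for analysis
    option       -- "dont_purge", "used", or "not_used" (hypothetical names)
    """
    if option == "dont_purge":
        return list(stored)
    if option == "used":        # purge the data used for analysis
        return [v for i, v in enumerate(stored) if i not in used_indexes]
    if option == "not_used":    # purge the data not used for analysis
        return [v for i, v in enumerate(stored) if i in used_indexes]
    raise ValueError(f"unknown purge option: {option}")

stored = [5, 7, 9, 11, 13]
used = {2, 3, 4}                # e.g. Last N with N = 3

kept_all = remaining_after_purge(stored, used, "dont_purge")
kept_unused = remaining_after_purge(stored, used, "used")
kept_used = remaining_after_purge(stored, used, "not_used")
```

With a Last N selection, purging the data not used for analysis keeps the key trimmed to the most recent N values, which is one way to stay under the per-set cap.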

### Usage Tips

If the Input Value yields data from the package, the step then determines whether that data will be used for analysis based on your selection for the Additive Method. Remember that the data is either used no matter what or disregarded if it is considered an outlier. If the data is to be used, it is added to the database before the statistic is calculated from the sample. This helps preserve the data if an error occurs after that point. Although there is no guarantee where in the process an error might occur, inserting the data early reduces the chance of losing it.
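The order of operations described here (decide, persist, then analyse) can be sketched in Python. This page does not define an outlier, so the 1.5 x IQR rule in `is_outlier` is an assumption, as are the function names, the method strings, and the use of a plain list in place of the database.

```python
import statistics

def is_outlier(value, existing):
    """Assumed rule: value lies outside Tukey's fences (1.5 x IQR)."""
    q1, _, q3 = statistics.quantiles(existing, n=4)
    spread = 1.5 * (q3 - q1)
    return value < q1 - spread or value > q3 + spread

def run_step(dataset, input_value, additive_method):
    """Persist the input first (if the method allows), then analyse."""
    if additive_method == "all_inclusive":
        keep = True
    elif additive_method == "outlier_exclusive":
        keep = not is_outlier(input_value, dataset)
    elif additive_method == "none":
        keep = False
    else:
        raise ValueError(f"unknown additive method: {additive_method}")

    if keep:
        dataset.append(input_value)   # stands in for the database write

    # Analysis happens only after the write, so the value survives
    # even if something fails from this point on.
    return statistics.mean(dataset)

downloads = [10, 11, 12, 13, 14, 15, 16, 17]
mean_without_outlier = run_step(downloads, 500, "outlier_exclusive")
```

In this run the value 500 falls outside the assumed fences, so it is not persisted and the mean is calculated from the original eight values.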

#### Minimum Sample and the 'N'

These two values can cause confusion when the N used for First/Last N is greater than the minimum sample size of the step, especially when the sample dataset key is new and still growing to a sufficient size. The minimum sample size sets the initial condition: whether the dataset holds enough values to yield accurate statistics. Even when the dataset is large enough to meet that minimum, the secondary condition of having the last/first N data values available may still not be met. N is not to be confused with "at most" the last/first N data values.
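The two conditions can be expressed as a pair of checks. The following Python sketch uses a hypothetical `ready_for_analysis` helper and treats N as an exact requirement, as described above; what the step does with a package when only the second condition fails is not stated here, so the sketch only reports whether both conditions hold.

```python
def ready_for_analysis(stored_count, minimum_sample_size, n=None):
    """Check both conditions for a First/Last N step (illustrative only)."""
    # Primary condition: the dataset has reached the minimum sample size.
    if stored_count < minimum_sample_size:
        return False
    # Secondary condition: N values must actually exist. N is an exact
    # requirement, not "at most N".
    if n is not None and stored_count < n:
        return False
    return True

# Minimum Sample Size = 10, Last N = 50, while the key is still growing:
too_small = ready_for_analysis(5, 10, n=50)     # below the minimum
still_short = ready_for_analysis(25, 10, n=50)  # minimum met, N not met
enough = ready_for_analysis(50, 10, n=50)       # both conditions met
```

Setting the minimum sample size equal to or greater than N avoids the middle case entirely.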

#### Command-Line Options

• `statdb` shows all of the keys and their counts
• `statfind -k keyname` shows all of the values stored for keyname (surround the keyname in quotes if it contains spaces)
  • (optional: add `-c n` to show just the first n records)
• `statclear -a` clears the whole database
• `statclear -k keyname` clears the data stored for keyname
• (Where is the command line?)