Frequency

Frequency analysis in sci-analysis is similar to Distribution analysis, but provides summary statistics and a bar chart of categorical data instead of numeric data. It provides the count, percent, and rank of the occurrence of each category in a given sequence.

Interpreting the Graphs

The only graph shown by the frequency analysis is a bar chart in which each bar represents a unique category in the data set. By default, the bar chart displays the frequency (count) of each category, but it can be configured to display the percent of each category instead.

import numpy as np
import scipy.stats as st
from sci_analysis import analyze

%matplotlib inline
np.random.seed(987654321)
pets = ['cat', 'dog', 'hamster', 'rabbit', 'bird']
sequence = [pets[np.random.randint(5)] for _ in range(200)]
analyze(sequence)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

Interpreting the Statistics

  • Total - The total number of data points in the data set.
  • Number of Groups - The number of unique categories in the data set.
  • Rank - The rank of each category, ordered from the most frequent to the least frequent.
  • Frequency - The number of occurrences of each categorical value in the data set.
  • Percent - The percent each category makes up of the entire data set.
  • Category - The unique categorical values in the data set.
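
The statistics described above can be reproduced by hand for a small data set. The following is a minimal sketch using only the standard library; it is not the sci-analysis implementation, and the function name frequency_table is hypothetical. It assumes that tied categories share a rank and that the next rank is not skipped, which matches the tied rows in the dropna example output later in this section.

```python
from collections import Counter


def frequency_table(values):
    """Return (rank, count, percent, category) rows, most frequent first.

    Hypothetical helper for illustration; not part of sci-analysis.
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    rank = 0
    previous_count = None
    # most_common() yields categories sorted by count, largest first.
    for category, count in counts.most_common():
        # Assumption: tied counts share a rank and no rank is skipped.
        if count != previous_count:
            rank += 1
            previous_count = count
        rows.append((rank, count, 100.0 * count / total, category))
    return rows


sample = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat', 'bird', 'hamster']
for rank, count, percent, category in frequency_table(sample):
    print(f"{rank:<6}{count:<6}{percent:<10.4f}{category}")
```

Here Total is the sum of all counts (8) and Number of Groups is the number of rows returned (4).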

Usage

analyze(sequence[, percent=False, vertical=True, grid=False, labels=True, dropna=False, order=None, title='Frequency', name='Categories', xname='Categories', yname=None, save_to=None])

Perform a Frequency analysis on sequence.

Parameters:
  • sequence (array-like) – The array-like object to analyze. It can be a list, tuple, numpy array or pandas Series of string values.
  • percent (bool) – Display the percent of each category on the bar chart if True, otherwise display the count of each category.
  • vertical (bool) – Display the bar chart with a vertical orientation if True.
  • grid (bool) – Add grid lines to the bar chart if True.
  • labels (bool) – Display count or percent labels on the bar chart for each group if True.
  • dropna (bool) – If True, missing values in sequence are excluded from the bar chart. If False, they are grouped together as their own category on the bar chart.
  • order (array-like) – Sets the order of the categories displayed on the bar chart according to the order of values in order.
  • title (str) – The title of the graph.
  • name (str) – The name of the data to show on the graph.
  • xname (str) – Alias for name.
  • yname (str) – The label of the y-axis of the bar chart. The default is “Percent” if percent is True, otherwise the default is “Count”.
  • save_to (str) – The path to save the graph to, if given as a string.

Argument Examples

sequence

A sequence of categorical values to be analyzed.

analyze(sequence)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

percent

Controls whether percents are displayed instead of counts on the bar chart. The default is False.

analyze(
    sequence,
    percent=True,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

vertical

Controls whether the bar chart is displayed in a vertical orientation or not. The default is True.

analyze(
    sequence,
    vertical=False,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

grid

Controls whether the grid is displayed on the bar chart or not. The default is False.

analyze(
    sequence,
    grid=True,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

labels

Controls whether the count or percent labels are displayed or not. The default is True.

analyze(
    sequence,
    labels=False,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  5


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             46             23.0000      bird
2             43             21.5000      hamster
3             41             20.5000      cat
4             36             18.0000      dog
5             34             17.0000      rabbit

dropna

Removes missing values from the bar chart if True. Otherwise, missing values are grouped together into a category called “nan”. The default is False.

# Convert 10 random values in sequence to NaN.
for _ in range(10):
    sequence[np.random.randint(200)] = np.nan
analyze(sequence)
Overall Statistics
------------------

Total            =  200
Number of Groups =  6


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             43             21.5000      bird
2             42             21.0000      hamster
3             39             19.5000      cat
4             33             16.5000      dog
4             33             16.5000      rabbit
5             10             5.0000        nan
analyze(
    sequence,
    dropna=True,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  6


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             43             21.5000      bird
2             42             21.0000      hamster
3             39             19.5000      cat
4             33             16.5000      dog
4             33             16.5000      rabbit
5             10             5.0000        nan
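
The effect of dropna on the bar chart categories can be illustrated without the library. The sketch below is an assumption about the behavior described above, not the sci-analysis source; chart_counts and is_missing are hypothetical names, and treating None as missing is an extra assumption.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None and float NaN (np.nan is a float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def chart_counts(values, dropna=False):
    """Count the categories that would appear as bars on the chart.

    Hypothetical helper for illustration; not part of sci-analysis.
    """
    counts = Counter()
    for value in values:
        if is_missing(value):
            if dropna:
                continue  # missing values are left off the chart
            counts['nan'] += 1  # missing values share one "nan" bar
        else:
            counts[value] += 1
    return dict(counts)


data = ['cat', float('nan'), 'dog', 'cat', float('nan')]
print(chart_counts(data))               # {'cat': 2, 'nan': 2, 'dog': 1}
print(chart_counts(data, dropna=True))  # {'cat': 2, 'dog': 1}
```

Note that in the output shown above, the printed statistics still list the “nan” group when dropna=True; the sketch models only which bars are drawn.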

order

A list of category names that sets the order for how categories are displayed on the bar chart. If sequence contains missing values, the category “nan” is shown first.

analyze(
    sequence,
    order=['rabbit', 'hamster', 'dog', 'cat', 'bird'],
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  6


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             43             21.5000      bird
2             42             21.0000      hamster
3             39             19.5000      cat
4             33             16.5000      dog
4             33             16.5000      rabbit
5             10             5.0000        nan

If there are categories in sequence that aren’t listed in order, they are reported as “nan” on the bar chart.

analyze(
    sequence,
    order=['bird', 'cat', 'dog'],
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  6


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             43             21.5000      bird
2             42             21.0000      hamster
3             39             19.5000      cat
4             33             16.5000      dog
4             33             16.5000      rabbit
5             10             5.0000        nan

Missing values can be dropped from the bar chart with dropna=True.

analyze(
    sequence,
    order=['bird', 'cat', 'dog'],
    dropna=True,
)
Overall Statistics
------------------

Total            =  200
Number of Groups =  6


Statistics
----------

Rank          Frequency     Percent       Category
--------------------------------------------------------
1             43             21.5000      bird
2             42             21.0000      hamster
3             39             19.5000      cat
4             33             16.5000      dog
4             33             16.5000      rabbit
5             10             5.0000        nan
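
The interaction between order and dropna described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: ordered_bars is a hypothetical name, and the sketch assumes that categories missing from order are counted together with missing values in a single “nan” bar, so that dropna=True removes both.

```python
import math
from collections import Counter


def ordered_bars(values, order, dropna=False):
    """Return (category, count) pairs in the order the bars would be drawn.

    Hypothetical helper for illustration; not part of sci-analysis.
    """
    listed = set(order)
    counts = Counter()
    for value in values:
        missing = isinstance(value, float) and math.isnan(value)
        if missing or value not in listed:
            # Assumption: unlisted categories are reported as "nan".
            counts['nan'] += 1
        else:
            counts[value] += 1
    # Bars follow the sequence given in `order`.
    bars = [(category, counts[category]) for category in order]
    # The "nan" bar, when kept, is shown first.
    if not dropna and counts['nan']:
        bars.insert(0, ('nan', counts['nan']))
    return bars


nan = float('nan')
data = ['bird', 'cat', 'dog', 'hamster', nan, 'cat']
print(ordered_bars(data, ['dog', 'cat', 'bird']))
print(ordered_bars(data, ['dog', 'cat', 'bird'], dropna=True))
```

In the first call, 'hamster' and the NaN value are combined into a leading “nan” bar with a count of 2; in the second call that bar is omitted.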