API

datasense.automation module

Automation functions

datasense.automation.fahrenheit_to_celsius_table(min_fahrenheit: int = 350, max_fahrenheit: int = 450, fahrenheit_increment: int = 5, rounding_increment: int = 5) str

Generates an HTML table of Fahrenheit to Celsius conversions.

Parameters:
  • min_fahrenheit (int = 350) – The minimum Fahrenheit temperature to include in the table.

  • max_fahrenheit (int = 450) – The maximum Fahrenheit temperature to include in the table.

  • fahrenheit_increment (int = 5) – The increment in Fahrenheit degrees between each row in the table.

  • rounding_increment (int = 5) – The increment of rounding in the ones place value.

Returns:

html_table – An HTML table of Fahrenheit to Celsius conversions.

Return type:

str
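The conversion behind each table row can be sketched in a few lines. This is a minimal sketch, not the library's implementation: the helper name `celsius_rounded` is hypothetical, and it assumes that `rounding_increment` means rounding the Celsius value to the nearest multiple of that increment.

```python
def celsius_rounded(fahrenheit: int, rounding_increment: int = 5) -> int:
    """Convert Fahrenheit to Celsius, rounded to a multiple of the increment."""
    celsius = (fahrenheit - 32) * 5 / 9
    # Assumption: round to the nearest multiple of rounding_increment.
    return rounding_increment * round(celsius / rounding_increment)


# One (Fahrenheit, Celsius) pair per row for the default range 350 to 450.
rows = [(f, celsius_rounded(f)) for f in range(350, 450 + 1, 5)]
```

With the default arguments this gives 21 rows, one for every 5 °F step.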

Example

>>> import datasense as ds
>>> output_url = 'fahrenheit_to_celsius.html'
>>> header_title = 'Fahrenheit to Celsius'
>>> header_id = 'fahrenheit-to-celsius'
>>> original_stdout = ds.html_begin(
...     output_url=output_url,
...     header_title=header_title,
...     header_id=header_id
... )
>>> table = ds.fahrenheit_to_celsius_table()
>>> print(table)
>>> ds.html_end(
...     original_stdout=original_stdout,
...     output_url=output_url
... )

datasense.automation.water_coffee_tea_milk(*, mugs_coffee: int = 0, cups_tea: int = 0, mugs_tea: int = 0, water_coffee_filter_mass: int = 150, water_tea_cup_mass: int = 400, water_tea_mug_mass: int = 300, water_coffee_mass: int = 220, milk_coffee_mass: int = 150, coffee_beans_mass: int = 20, time_1000_g_water: int = 340) list[int]

Calculate the mass of water and milk required for coffee mugs, tea cups, and tea mugs. All units are g.

Parameters:
  • mugs_coffee (int = 0) – The number of coffee mugs.

  • cups_tea (int = 0) – The number of tea cups.

  • mugs_tea (int = 0) – The number of tea mugs.

  • water_coffee_filter_mass (int = 150) – The mass of water to wet one coffee filter.

  • water_tea_cup_mass (int = 400) – The mass of water for a tea cup.

  • water_tea_mug_mass (int = 300) – The mass of water for a tea mug.

  • water_coffee_mass (int = 220) – The mass of water to wet the coffee grounds.

  • milk_coffee_mass (int = 150) – The mass of milk for one serving.

  • coffee_beans_mass (int = 20) – The mass of coffee beans for one serving.

  • time_1000_g_water (int = 340) – The time to boil 1000 g of water at setting 8 on a 2300 W induction element.

Returns:

A list of seven integers.

  • water: int

    The total amount of water to boil (g).

  • coffee_mug_water: int

    The amount of water for the coffee mugs (g).

  • coffee_filter_water: int

    The amount of water to wet the coffee filters (g).

  • tea_cup_water: int

    The amount of water for the tea cups (g).

  • tea_mug_water: int

    The amount of water for the tea mugs (g).

  • coffee_milk: int

    The mass of milk to foam (g).

  • coffee_mass: int

    The mass of coffee beans to grind (g).

Return type:

list[int]

Examples

>>> import datasense as ds
>>> ds.water_coffee_tea_milk(
...     mugs_coffee=1,
...     cups_tea=0,
...     mugs_tea=0
... )
(370, 220, 150, 0, 0, 150, 20, (0, 2, 5))
>>> coffee_mug_water, coffee_filter_water = [
...     ds.water_coffee_tea_milk(
...         mugs_coffee=1,
...         cups_tea=0,
...         mugs_tea=0
...     )[i] for i in [1, 2]
... ]
>>> print(coffee_mug_water, coffee_filter_water)
220 150
>>> all_coffee_water = ds.water_coffee_tea_milk(
...     mugs_coffee=1,
...     cups_tea=0,
...     mugs_tea=0
... )[0:3]
>>> print(all_coffee_water)
(370, 220, 150)
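The seven masses in the first example follow from simple arithmetic on the default parameters. The sketch below reproduces them. The function name `drink_masses` and the formulas are assumptions inferred from the parameter descriptions and the example output, not taken from the library source. In particular, it assumes that filter water scales with the number of coffee mugs.

```python
def drink_masses(
    mugs_coffee: int = 0,
    cups_tea: int = 0,
    mugs_tea: int = 0,
    water_coffee_filter_mass: int = 150,
    water_tea_cup_mass: int = 400,
    water_tea_mug_mass: int = 300,
    water_coffee_mass: int = 220,
    milk_coffee_mass: int = 150,
    coffee_beans_mass: int = 20,
) -> list[int]:
    """Return the seven masses (g) in the documented order."""
    coffee_mug_water = mugs_coffee * water_coffee_mass
    # Assumption: one wetted filter per coffee mug.
    coffee_filter_water = mugs_coffee * water_coffee_filter_mass
    tea_cup_water = cups_tea * water_tea_cup_mass
    tea_mug_water = mugs_tea * water_tea_mug_mass
    # Total water to boil is the sum of the four water components.
    water = (
        coffee_mug_water + coffee_filter_water + tea_cup_water + tea_mug_water
    )
    coffee_milk = mugs_coffee * milk_coffee_mass
    coffee_mass = mugs_coffee * coffee_beans_mass
    return [
        water, coffee_mug_water, coffee_filter_water,
        tea_cup_water, tea_mug_water, coffee_milk, coffee_mass,
    ]


masses = drink_masses(mugs_coffee=1)
```

The sketch covers only the seven masses listed under Returns.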

datasense.control_charts module

Shewhart control charts

Create X, mR, Xbar, R control charts. Invoke Shewhart rules 1, 2, 3, 4.

class datasense.control_charts.ControlChart(data: DataFrame)

Bases: ABC

abstract ax(fig: Figure = None) Axes

Matplotlib control chart plot

lcl

Calculate the lower control limit

mean

Calculate the average

sigma

Calculate the standard deviation appropriate to method used

sigmas

TODO

Ex:

cc = ControlChart(some_data)
cc.x - cc.mean * 3 == .X_chart.sigmas[-3]

ucl

Calculate the upper control limit

y

The y coordinates of the points on a plot of this chart
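For an individuals (X) chart, the attributes above are conventionally derived from the average moving range. The following is a minimal sketch, assuming a moving range of two points and the standard bias-correction constant d2 = 1.128. The function name `x_chart_limits` and the use of plain lists are illustrative, and the library may compute these values differently.

```python
from statistics import mean

D2_SUBGROUP_OF_TWO = 1.128  # standard d2 constant for a moving range of two


def x_chart_limits(values: list[float]) -> dict[str, float]:
    """Return mean, sigma, lcl, and ucl for an individuals chart."""
    # Moving range: absolute difference between successive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = mean(values)
    # Rational subgroup estimate of the standard deviation.
    sigma = mean(moving_ranges) / D2_SUBGROUP_OF_TWO
    return {
        "mean": centre,
        "sigma": sigma,
        "lcl": centre - 3 * sigma,
        "ucl": centre + 3 * sigma,
    }


limits = x_chart_limits([10, 12, 11, 13, 12])
```

The control limits sit three sigma either side of the mean, so `ucl - lcl` is always six sigma.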

class datasense.control_charts.R(data: DataFrame)

Bases: ControlChart

Range of a subgroup of values control chart (R)

ax(fig: Figure = None) Axes

Plots calculated ranges (y axis) versus the index of the dataframe (x axis)

Parameters:

fig (plt.Figure = None) – A matplotlib figure.

Returns:

axes – A matplotlib Axes.

Return type:

Axes

Examples

minimal R control chart

>>> import datasense.control_charts as cc
>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (8, 6)
>>> graph_name = 'graph_r.svg'
>>> X1 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X2 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X3 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X4 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data={
...         'X1': X1,
...         'X2': X2,
...         'X3': X3,
...         'X4': X4,
...     }
... )
>>> fig = plt.figure(figsize=figsize)
>>> r = cc.R(data=data)
>>> ax = r.ax(fig=fig)
>>> fig.savefig(fname=graph_name)

complete R control chart

>>> figsize = (8, 6)
>>> graph_name = 'graph_r.svg'
>>> colour = '#33bbee'
>>> X1 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X2 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X3 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X4 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data={
...         'X1': X1,
...         'X2': X2,
...         'X3': X3,
...         'X4': X4,
...     }
... )
>>> fig = plt.figure(figsize=figsize)
>>> r = cc.R(data=data)
>>> ax = r.ax(fig=fig)
>>> ax.axhline(
...     y=r.sigmas[+1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=r.sigmas[-1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=r.sigmas[+2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=r.sigmas[-2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> cc.draw_rule(r, ax, *cc.points_one(r), '1')
>>> ax.set_title(
...     label="R chart",
...     fontweight="bold"
... ) 
>>> ax.set_ylabel(ylabel="Y label") 
>>> ax.set_xlabel(xlabel="X label") 
>>> fig.savefig(fname=graph_name)

lcl

Lower control limit

mean

Average(R)

sigma

Sigma(R)

Standard deviation using rational subgroup estimator

ucl

Upper control limit

y

class datasense.control_charts.X(data: DataFrame, subgroup_size: int = 2)

Bases: ControlChart

Individual values control chart (X)

ax(fig: Figure = None) Axes

Plots individual values of the column of the dataframe (y axis) versus the index of the dataframe (x axis)

Parameters:

fig (plt.Figure = None) – A matplotlib figure.

Returns:

axes – A matplotlib Axes.

Return type:

Axes

Examples

minimal X control chart

>>> import datasense.control_charts as cc
>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (8, 6)
>>> graph_name = 'graph_x.svg'
>>> data = ds.random_data(
...     distribution='norm',
...     size=42,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data=data,
...     columns=['X']
... )
>>> fig = plt.figure(figsize=figsize)
>>> x = cc.X(data=data)
>>> ax = x.ax(fig=fig)
>>> fig.savefig(fname=graph_name)

complete X control chart

>>> figsize = (8, 6)
>>> graph_name = 'graph_x.svg'
>>> colour = '#33bbee'
>>> data = ds.random_data(
...     distribution='norm',
...     size=42,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data=data,
...     columns=['X']
... )
>>> fig = plt.figure(figsize=figsize)
>>> x = cc.X(data=data)
>>> ax = x.ax(fig=fig)
>>> ax.axhline(
...     y=x.sigmas[+1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=x.sigmas[-1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=x.sigmas[+2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=x.sigmas[-2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> cc.draw_rules(x, ax)
>>> ax.set_title(
...     label="X chart title",
...     fontweight="bold"
... ) 
>>> ax.set_ylabel(ylabel="X chart Y label") 
>>> ax.set_xlabel(xlabel="X chart X label") 
>>> fig.savefig(fname=graph_name)

lcl

Lower control limit

mean

Average(X)

sigma

Sigma(X)

Standard deviation using rational subgroup estimator

ucl

Upper control limit

y

class datasense.control_charts.Xbar(data: DataFrame)

Bases: ControlChart

Average of a subgroup of values control chart (Xbar)

ax(fig: Figure = None) Axes

Plots calculated averages (y axis) versus the index of the dataframe (x axis)

Parameters:

fig (plt.Figure = None) – A matplotlib figure.

Returns:

axes – A matplotlib Axes.

Return type:

Axes

Examples

minimal Xbar control chart

>>> import datasense.control_charts as cc
>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (8, 6)
>>> graph_name = 'graph_xbar.svg'
>>> colour = '#33bbee'
>>> X1 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X2 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X3 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X4 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data={
...         'X1': X1,
...         'X2': X2,
...         'X3': X3,
...         'X4': X4,
...     }
... )
>>> fig = plt.figure(figsize=figsize)
>>> xbar = cc.Xbar(data=data)
>>> ax = xbar.ax(fig=fig)
>>> fig.savefig(fname=graph_name)

complete Xbar control chart

>>> figsize = (8, 6)
>>> graph_name = 'graph_xbar.svg'
>>> colour = '#33bbee'
>>> X1 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X2 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X3 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> X4 = ds.random_data(
...     distribution='norm',
...     size=25,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data={
...         'X1': X1,
...         'X2': X2,
...         'X3': X3,
...         'X4': X4,
...     }
... )
>>> fig = plt.figure(figsize=figsize)
>>> xbar = cc.Xbar(data=data)
>>> ax = xbar.ax(fig=fig)
>>> ax.axhline(
...     y=xbar.sigmas[+1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=xbar.sigmas[-1],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=xbar.sigmas[+2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> ax.axhline(
...     y=xbar.sigmas[-2],
...     linestyle='--',
...     dashes=(5, 5),
...     color=colour,
...     alpha=0.5
... ) 
>>> cc.draw_rules(xbar, ax) 
>>> ax.set_title(
...     label="Xbar chart title",
...     fontweight="bold"
... ) 
>>> ax.set_ylabel(ylabel="Xbar chart Y label") 
>>> ax.set_xlabel(xlabel="Xbar chart X label") 
>>> fig.savefig(fname=graph_name)

lcl

Lower control limit

mean

Average(Xbar)

sigma

Sigma(Xbar)

Standard deviation using rational subgroup estimator

ucl

Upper control limit

y

datasense.control_charts.draw_rule(cc: ControlChart, ax: Axes, above: Series, below: Series, rule_name: str) None

Invokes one of the points_* rules to identify out-of-control points

Parameters:
  • cc (ControlChart) – The control chart object.

  • ax (axes.Axes) – The Axes object.

  • above (pd.Series) – The pandas Series for the points above a rule.

  • below (pd.Series) – The pandas Series for the points below a rule.

  • rule_name (str) – The name of the rule, for example '1'.

datasense.control_charts.draw_rules(cc: ControlChart, ax: Axes) None

Invokes all of the points_* rules to identify out-of-control points

Parameters:
  • cc (ControlChart) – The control chart object.

  • ax (axes.Axes) – The Axes object.

class datasense.control_charts.mR(data: DataFrame, subgroup_size: int = 2)

Bases: ControlChart

Moving range of individual values control chart (mR)

ax(fig: Figure = None) Axes

Plots calculated moving ranges (y axis) versus the index of the dataframe (x axis)

Parameters:

fig (plt.Figure = None) – A matplotlib figure.

Returns:

axes – A matplotlib Axes.

Return type:

Axes

Examples

minimal mR control chart

>>> import datasense.control_charts as cc
>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (8, 6)
>>> graph_name = 'graph_mr.svg'
>>> data = ds.random_data(
...     distribution='norm',
...     size=42,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data=data,
...     columns=['X']
... )
>>> fig = plt.figure(figsize=figsize)
>>> mr = cc.mR(data=data)
>>> ax = mr.ax(fig=fig)
>>> fig.savefig(fname=graph_name)

complete mR control chart

>>> figsize = (8, 6)
>>> graph_name = 'graph_mr.svg'
>>> data = ds.random_data(
...     distribution='norm',
...     size=42,
...     loc=69,
...     scale=13
... )
>>> data = pd.DataFrame(
...     data=data,
...     columns=['X']
... )
>>> fig = plt.figure(figsize=figsize)
>>> mr = cc.mR(data=data)
>>> ax = mr.ax(fig=fig)
>>> cc.draw_rule(mr, ax, *cc.points_one(mr), '1')
>>> ax.set_title(
...     label="mR chart title",
...     fontweight="bold"
... ) 
>>> ax.set_ylabel(ylabel="mR chart Y label") 
>>> ax.set_xlabel(xlabel="mR chart X label") 
>>> fig.savefig(fname=graph_name)

lcl

Lower control limit

mean

Average(mR)

sigma

Sigma(mR)

Standard deviation using rational subgroup estimator

ucl

Upper control limit

y

datasense.control_charts.points_four(cc: ControlChart) tuple[pandas.core.series.Series, pandas.core.series.Series]

Return out of control points as Series of only said points

Shewhart and Western Electric rule four. Nelson and Minitab rule two. Eight successive points all on the same side of the central line. This rule is used with the X and Xbar charts.

Parameters:

cc (ControlChart) – The control chart object.

Returns:

A tuple containing two elements, the data points that are out of control for rule four.

  • series_above: pd.Series

    The series of points above the control limit.

  • series_below: pd.Series

    The series of points below the control limit.

Return type:

tuple[pd.Series, pd.Series]
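The run test described for rule four can be sketched with a pair of counters. This is a minimal sketch using plain lists rather than pandas Series. The function name `rule_four_indices` is hypothetical, and it assumes that a point is flagged when it completes or extends a run of eight. The source does not state whether the library flags the whole run or only these points.

```python
def rule_four_indices(
    y: list[float], centre: float, run_length: int = 8
) -> tuple[list[int], list[int]]:
    """Return indices that complete a run of run_length points on one side."""
    above: list[int] = []
    below: list[int] = []
    run_above = run_below = 0
    for index, value in enumerate(y):
        # A run continues only while points stay on the same side.
        run_above = run_above + 1 if value > centre else 0
        run_below = run_below + 1 if value < centre else 0
        if run_above >= run_length:
            above.append(index)
        if run_below >= run_length:
            below.append(index)
    return above, below


above, below = rule_four_indices([1] * 9 + [-1] * 8, centre=0)
```

A point exactly on the central line resets both counters in this sketch.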

datasense.control_charts.points_one(cc: ControlChart) tuple[pandas.core.series.Series, pandas.core.series.Series]

Return out of control points as Series of only said points

Shewhart and Western Electric rule one. Nelson and Minitab rule one. One point outside the three-sigma limits. This rule is used with the X, mR, Xbar, and R charts.

Parameters:

cc (ControlChart) – The control chart object.

Returns:

A tuple containing two elements, the data points that are out of control for rule one.

  • series_above: pd.Series

    The series of points above the control limit.

  • series_below: pd.Series

    The series of points below the control limit.

Return type:

tuple[pd.Series, pd.Series]
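Rule one amounts to two comparisons against the control limits. The following is a minimal sketch with plain Python containers. The function name `rule_one_points` is hypothetical, and dictionaries keyed by position stand in for the pandas Series that the library returns.

```python
def rule_one_points(
    y: list[float], lcl: float, ucl: float
) -> tuple[dict[int, float], dict[int, float]]:
    """Return the points above ucl and below lcl, keyed by position."""
    above = {i: v for i, v in enumerate(y) if v > ucl}
    below = {i: v for i, v in enumerate(y) if v < lcl}
    return above, below


points_above, points_below = rule_one_points([5, 11, 4, -2, 6], lcl=0, ucl=10)
```

Points that lie exactly on a limit are treated as in control here, which is an assumption.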

datasense.control_charts.points_three(cc: ControlChart) tuple[pandas.core.series.Series, pandas.core.series.Series]

Return out of control points as Series of only said points

Shewhart and Western Electric rule three. Nelson and Minitab rule six. Four-out-of-five successive points on the same side of the central line and more than one sigma unit away from the central line. This rule is used with the X and Xbar charts.

Parameters:

cc (ControlChart) – The control chart object.

Returns:

A tuple containing two elements, the data points that are out of control for rule three.

  • series_above: pd.Series

    The series of points above the control limit.

  • series_below: pd.Series

    The series of points below the control limit.

Return type:

tuple[pd.Series, pd.Series]

datasense.control_charts.points_two(cc: ControlChart) tuple[pandas.core.series.Series, pandas.core.series.Series]

Return out of control points as Series of only said points

Shewhart and Western Electric rule two. Nelson and Minitab rule five. Two-out-of-three successive points on the same side of the central line and both are more than two sigma units away from the central line. This rule is used with the X and Xbar charts.

Parameters:

cc (ControlChart) – The control chart object.

Returns:

A tuple containing two elements, the data points that are out of control for rule two.

  • series_above: pd.Series

    The series of points above the control limit.

  • series_below: pd.Series

    The series of points below the control limit.

Return type:

tuple[pd.Series, pd.Series]

datasense.graphs module

Graphical analysis

Colours used are colour-blind friendly.

  • blue “#0077bb”

  • cyan “#33bbee”

  • teal “#009988”

  • orange “#ee7733”

  • red “#cc3311”

  • magenta “#ee3377”

  • grey “#bbbbbb”

datasense.graphs.dd_to_dms(dd: list[float]) list[tuple[int, int, float, str]]

Converts a list of decimal degrees (DD) to a list of tuples containing degrees, minutes, and seconds (DMS).

Parameters:

dd (list[float]) – A list of two floats representing decimal degrees (latitude, longitude).

Returns:

A list of tuples containing degrees, minutes, seconds, and hemisphere (DMS) for latitude and longitude.

Return type:

list[tuple[int, int, float, str]]

Examples

Ottawa Parliament

>>> import datasense as ds
>>> dd = [45.4250225, -75.6970594]
>>> dms = ds.dd_to_dms(dd=dd)
>>> dms
[(45, 25, 30.081, 'N'), (75, 41, 49.41384, 'W')]

Eiffel Tower

>>> dd = [48.858393, 2.257616]
>>> dms = ds.dd_to_dms(dd=dd)
>>> dms
[(48, 51, 30.2148, 'N'), (2, 15, 27.4176, 'E')]

Machu Picchu

>>> dd = [-13.163194, -72.547842]
>>> dms = ds.dd_to_dms(dd=dd)
>>> dms
[(13, 9, 47.4984, 'S'), (72, 32, 52.2312, 'W')]

Sydney Opera House

>>> dd = [-33.8567433, 151.1784306]
>>> dms = ds.dd_to_dms(dd=dd)
>>> dms
[(33, 51, 24.27588, 'S'), (151, 10, 42.35016, 'E')]

Notes

DMS. Latitude north of the equator is “N” and south of the equator is “S”. Longitude west of longitude 0 (Greenwich UK) is “W” and east is “E”.

DD. Latitude north of the equator is a positive float and south is negative. Longitude west of longitude 0 is negative and east is positive.
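The conversion described in these notes can be sketched without any dependencies. This is a minimal sketch, not the library's code. The function name `dd_to_dms_sketch` is hypothetical, and the seconds are left unrounded, so floating-point results may differ in the last digits from the outputs shown in the examples.

```python
def dd_to_dms_sketch(dd: list[float]) -> list[tuple[int, int, float, str]]:
    """Convert [latitude, longitude] in decimal degrees to DMS tuples."""
    result = []
    # Latitude uses N/S, longitude uses E/W; positive maps to N or E.
    for value, (positive, negative) in zip(dd, [("N", "S"), ("E", "W")]):
        magnitude = abs(value)
        degrees = int(magnitude)
        minutes_float = (magnitude - degrees) * 60
        minutes = int(minutes_float)
        seconds = (minutes_float - minutes) * 60
        hemisphere = positive if value >= 0 else negative
        result.append((degrees, minutes, seconds, hemisphere))
    return result


latitude, longitude = dd_to_dms_sketch([45.4250225, -75.6970594])
```

The sign of each decimal degree value is carried only by the hemisphere letter. The degrees, minutes, and seconds are always non-negative.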

datasense.graphs.despine(*, ax: Axes) None

Remove the top and right spines of a graph.

Parameters:

ax (axes.Axes) – A matplotlib Axes.

Example

>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ds.despine(ax=ax)

datasense.graphs.dms_to_dd(dms: list[tuple[int, int, float, str]]) tuple[float, float]

Converts a list of tuples containing degrees, minutes, and seconds (DMS) to decimal degrees (DD).

Parameters:

dms (list[tuple[int, int, float, str]]) – A list of tuples containing degrees, minutes, seconds, and hemisphere.

Returns:

A list of two floats containing two decimal degrees (DD) for latitude and longitude.

Return type:

list[float]

Examples

Ottawa Parliament

>>> import datasense as ds
>>> dms = [(45, 25, 30.081, 'N'), (75, 41, 49.41384, 'W')]
>>> dd = ds.dms_to_dd(dms=dms)
>>> dd
[45.4250225, -75.6970594]

Eiffel Tower

>>> dms = [(48, 51, 30.2148, 'N'), (2, 15, 27.4176, 'E')]
>>> dd = ds.dms_to_dd(dms=dms)
>>> dd
[48.858393, 2.257616]

Machu Picchu

>>> dms = [(13, 9, 47.4984, 'S'), (72, 32, 52.2312, 'W')]
>>> dd = ds.dms_to_dd(dms=dms)
>>> dd
[-13.163194, -72.547842]

Sydney Opera House

>>> dms = [(33, 51, 24.27588, 'S'), (151, 10, 42.35016, 'E')]
>>> dd = ds.dms_to_dd(dms=dms)
>>> dd
[-33.8567433, 151.1784306]

Notes

DMS. Latitude north of the equator is “N” and south of the equator is “S”. Longitude west of longitude 0 (Greenwich UK) is “W” and east is “E”.

DD. Latitude north of the equator is a positive float and south is negative. Longitude west of longitude 0 is negative and east is positive.
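The reverse conversion is a weighted sum with a sign taken from the hemisphere. The following is a minimal sketch under the sign convention in these notes. The function name `dms_to_dd_sketch` is hypothetical and the result is not rounded.

```python
def dms_to_dd_sketch(dms: list[tuple[int, int, float, str]]) -> list[float]:
    """Convert DMS tuples to decimal degrees."""
    dd = []
    for degrees, minutes, seconds, hemisphere in dms:
        value = degrees + minutes / 60 + seconds / 3600
        # South and west are negative in decimal degrees.
        dd.append(-value if hemisphere in ("S", "W") else value)
    return dd


dd = dms_to_dd_sketch([(13, 9, 47.4984, "S"), (72, 32, 52.2312, "W")])
```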

datasense.graphs.empirical_cdf(*, s: Series, figsize: tuple[float, float] = None, marker: str = '.', markersize: float = 4, colour: str = '#0077bb', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Create an empirical cumulative distribution function.

Parameters:
  • s (pd.Series) – The input series.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • marker (str = ".") – The type of plot point.

  • markersize (float = 4) – The size of the plot point (pt).

  • colour (str = colour_blue) – The colour of the plot point (hexadecimal triplet string).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> series_x = ds.random_data(
...     loc=69,
...     scale=13
... )
>>> fig, ax = ds.empirical_cdf(s=series_x)

Notes

scipy.stats.ecdf is available in SciPy 1.11.0 and later.
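The points of an empirical cumulative distribution function can be computed by sorting the data and assigning each value its cumulative proportion. This is a minimal sketch with plain lists. The function name `ecdf_points` is hypothetical and the sketch does not show how the library draws the plot.

```python
def ecdf_points(values: list[float]) -> list[tuple[float, float]]:
    """Return (x, cumulative proportion) pairs sorted by x."""
    ordered = sorted(values)
    n = len(ordered)
    # The k-th smallest value (1-based) has cumulative proportion k / n.
    return [(x, (rank + 1) / n) for rank, x in enumerate(ordered)]


points = ecdf_points([3, 1, 2, 4])
```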

datasense.graphs.format_dates(*, fig: Figure, ax: Axes, defaultfmt: str = '%Y-%m-%d') None

Format dates and ticks for plotting.

Parameters:
  • fig (plt.Figure) – A matplotlib figure.

  • ax (axes.Axes) – A matplotlib Axes.

  • defaultfmt (str = "%Y-%m-%d") – The date format string.

Example

>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ds.format_dates(
...     fig=fig,
...     ax=ax
... )

datasense.graphs.plot_barleft_lineright_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, barwidth: float = 10, colour1: str = '#0077bb', colour2: str = '#33bbee', linestyle1: str = '-', linestyle2: str = '-', marker2: str = 'o') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]

Bar plot of y1 left vertical axis versus X. Line plot of y2 right vertical axis versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have different units or scales, and you wish to see if they are correlated.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Fit a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot as bars on the left ordinate.

  • y2 (pd.Series) – The data to plot as a line on the right ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • barwidth (float = 10) – The width of the bars.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • linestyle1 (str = "-") – The style of the line for y1.

  • linestyle2 (str = "-") – The style of the line for y2.

  • marker2 (str = "o") – The type of plot point for y2.

Returns:

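A tuple of a matplotlib Figure and two Axes.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax1: axes.Axes

    The matplotlib Axes for y1 (left vertical axis).

  • ax2: axes.Axes

    The matplotlib Axes for y2 (right vertical axis).

Return type:

tuple[plt.Figure, axes.Axes, axes.Axes]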

Example

>>> import datasense as ds
>>> X = ds.datetime_data()
>>> y1 = ds.random_data()
>>> y2 = ds.random_data()
>>> figsize = (6, 4)
>>> fig, ax1, ax2 = ds.plot_barleft_lineright_x_y1_y2(
...     X=X,
...     y1=y1,
...     y2=y2,
...     figsize=figsize,
...     barwidth=20,
...     colour1="#cc3311",
...     colour2="#ee3377"
... )

datasense.graphs.plot_boxcox(*, s: Series | ndarray, la: int = -20, lb: int = 20, colour1: str = '#0077bb', colour2: str = '#33bbee', marker: str = '.', markersize: float = 4, ylabel: str = 'Correlation Coefficient', remove_spines: bool = True, lmbda: float | int | None = None, alpha: float = 0.05) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Box-Cox normality plot

Parameters:
  • s (pd.Series | np.ndarray) – The data series or NumPy array.

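  • la (int = -20) – The lower bound for the lmbda values to pass to boxcox for Box-Cox transformations. This is also the lower limit of the horizontal axis of the plot if that is generated.

  • lb (int = 20) – The upper bound for the lmbda values to pass to boxcox for Box-Cox transformations. This is also the upper limit of the horizontal axis of the plot if that is generated.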

  • colour1 (str = colour_blue) – The colour of the plot points.

  • colour2 (str = colour_cyan) – The colour of the lower and upper bound lines.

  • marker (str = ".") – The type of plot points.

  • markersize (float = 4) – The size of the plot points.

  • ylabel (str = "Correlation Coefficient") – The label of the y axis.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

  • lmbda (float | int | None = None) – If lmbda is None (default), find the value of lmbda that maximizes the log-likelihood function and return it as the second output argument. If lmbda is not None, do the transformation for that value.

  • alpha (float = 0.05) – If lmbda is None and alpha is not None (default), return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0. If lmbda is not None, alpha is ignored.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> from scipy import stats
>>> import datasense as ds
>>> s = stats.loggamma.rvs(5, size=500) + 5
>>> fig, ax = ds.plot_boxcox(s=s)

Notes

All values in the series must be > 0.
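The positivity requirement follows from the Box-Cox transformation itself, which uses a power of each value or its logarithm. The following is a minimal sketch of the standard one-parameter transformation. The function name `boxcox_transform` is hypothetical and it is not the routine the library calls.

```python
import math


def boxcox_transform(values: list[float], lmbda: float) -> list[float]:
    """Apply the one-parameter Box-Cox transformation to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        # The limit of (v**lmbda - 1) / lmbda as lmbda approaches 0.
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]


transformed = boxcox_transform([1.0, 2.0], lmbda=2)
```

A value of `lmbda = 1` only shifts the data by one, so it leaves the shape of the distribution unchanged.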

datasense.graphs.plot_boxplot(*, series: Series, notch: bool = True, showmeans: bool = None, figsize: tuple[float, float] = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Create a box-and-whisker plot with several elements:

  • minimum

  • first quartile

  • second quartile (median)

  • confidence interval of the second quartile

  • third quartile

  • maximum

  • outliers

Parameters:
  • series (pd.Series) – The input series.

  • notch (bool = True) – Boolean to show the confidence interval of the second quartile.

  • showmeans (bool = None) – Boolean to show average.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> series = ds.random_data()
>>> fig, ax = ds.plot_boxplot(series=series)
>>> ax.set_title(label="Box-and-whisker plot") 
>>> ax.set_xticks(ticks=[1], labels=["series"]) 
>>> ax.set_ylabel("y") 
>>> ds.despine(ax=ax)

datasense.graphs.plot_histogram(*, series: Series, number_bins: int = None, bin_range: tuple[int, int] = None, figsize: tuple[float, float] = None, bin_width: int = None, edgecolor: str = '#ffffff', linewidth: int = 1, bin_label_bool: bool = False, color: str = '#0077bb', remove_spines: bool = True, probability_density_function: bool = False, percentiles: tuple[float, float] = None, percentiles_colour: str = '#cc3311') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Parameters:
  • series (pd.Series) – The input series.

  • number_bins (int = None) – The number of equal-width bins in the range series.max() - series.min().

  • bin_range (tuple[int, int] = None) – The lower and upper range of the bins. If not provided, range is (series.min(), series.max()).

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • bin_width (int = None) – The width of the bin in the same units as the series.

  • edgecolor (str = colour_white) – The hexadecimal color value for the bar edges.

  • linewidth (int = 1) – The bar edges line width (point).

  • bin_label_bool (bool = False) – If True, label the bars with count and percentage of total.

  • color (str = colour_blue) – The color of the bar faces.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

  • probability_density_function (bool = False) – If True, a density parameter normalizes the bin heights so that the integral of the histogram is 1. The resulting histogram is an approximation of the probability density function.

  • percentiles (tuple[float, float] = [0.025, 0.975]) – The percentiles for plotting vertical lines on the histogram.

  • percentiles_colour (str = colour_red) – The colour of the vertical lines for the percentiles.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

Create a series of random floats, normal distribution, with the default parameters.

>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> s = ds.random_data()
>>> fig, ax = ds.plot_histogram(series=s)

Create a series of random integers, integer distribution, size = 113, min = 0, max = 13.

>>> s = ds.random_data(
...     distribution="randint",
...     size=113,
...     low=0,
...     high=14
... )
>>> fig, ax = ds.plot_histogram(series=s)

Create a series of random integers, integer distribution, size = 113, min = 0, max = 13. Set histogram parameters to control bin width.

>>> s = ds.random_data(
...     distribution="randint",
...     size=113,
...     low=0,
...     high=14
... )
>>> fig, ax = ds.plot_histogram(
...     series=s,
...     bin_width=1
... )

Create a series of random integers, integer distribution, size = 113, min = 0, max = 12. Set histogram parameters to control bin width and plotting range.

>>> s = ds.random_data(
...     distribution="randint",
...     size=113,
...     low=0,
...     high=13
... )
>>> fig, ax = ds.plot_histogram(
...     series=s,
...     bin_width=1,
...     bin_range=(0, 10)
... )

Create a series of random floats, size = 113, average = 69, standard deviation = 13. Set histogram parameters to control bin width and plotting range.

>>> s = ds.random_data(
...     distribution="norm",
...     size=113,
...     loc=69,
...     scale=13
... )
>>> fig, ax = ds.plot_histogram(
...     series=s,
...     bin_width=5,
...     bin_range=(30, 110)
... )
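The relationship between bin_width, bin_range, and the resulting bins can be sketched as follows. How the library derives its bins is not shown in this reference, so this is an assumption: equal-width edges starting at the lower bound, counted with the usual convention that every bin is half-open except the last, which includes the upper edge. The names `bin_edges` and `bin_counts` are hypothetical.

```python
def bin_edges(bin_range: tuple[int, int], bin_width: int) -> list[int]:
    """Return equal-width bin edges covering bin_range."""
    lower, upper = bin_range
    number_bins = -(-(upper - lower) // bin_width)  # ceiling division
    return [lower + i * bin_width for i in range(number_bins + 1)]


def bin_counts(values: list[float], edges: list[int]) -> list[int]:
    """Count values per bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = edges[i] <= value < edges[i + 1]
            if in_bin or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    return counts


edges = bin_edges(bin_range=(30, 110), bin_width=5)
counts = bin_counts([30, 34.9, 35, 110, 200], edges)
```

With bin_width=5 and bin_range=(30, 110) this yields 16 bins, and the value 200 falls outside the range and is not counted.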

Create a series of random floats, size = 113, average = 69, standard deviation = 13. Set histogram parameters to control bin width, plotting range, labels. Set colour of the bars. Plot the probability density function on top of the histogram.

>>> s = ds.random_data(
...     distribution="norm",
...     size=113,
...     loc=69,
...     scale=13
... )
>>> fig, ax = ds.plot_histogram(
...     series=s,
...     bin_width=5,
...     bin_range=(30, 110),
...     figsize=(10,8),
...     bin_label_bool=True,
...     color="#33bbee"
... )
>>> ax.set_xlabel(xlabel="X-axis label", labelpad=30) 
>>> plt.tight_layout()

datasense.graphs.plot_horizontal_bars(*, y: list[int] | list[float] | list[str], width: list[int] | list[float], height: float = 0.8, figsize: tuple[float, float] = None, edgecolor: str = '#ffffff', linewidth: int = 1, color: str = '#0077bb', left: datetime | int | float = None) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Parameters:
  • y (list[int] | list[float] | list[str],) – The y coordinates of the bars.

  • width (list[int] | list[float],) – The width(s) of the bars.

  • height (float = 0.8,) – The height of the bars.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • edgecolor (str = colour_white,) – The hexadecimal color value for the bar edges.

  • linewidth (int = 1,) – The bar edges line width (point).

  • color (str = colour_blue) – The color of the bar faces.

  • left (datetime | int | float = None) – The x coordinates of the left sides of the bars.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> y = ["Yes", "No"]
>>> width = [69, 31]
>>> fig, ax = ds.plot_horizontal_bars(
...     y=y,
...     width=width
... )
>>> y = ["Yes", "No"]
>>> width = [69, 31]
>>> fig, ax = ds.plot_horizontal_bars(
...     y=y,
...     width=width,
...     height=0.4
... )

Create Gantt chart

>>> import datetime
>>> import pandas as pd
>>> data = {
...     "start": ["2021-11-01", "2021-11-03", "2021-11-04", "2021-11-08"],
...     "end": ["2021-11-08", "2021-11-16", "2021-11-11", "2021-11-13"],
...     "task": ["task 1", "task 2", "task 3", "task 4"]
... }
>>> columns = ["task", "start", "end", "duration", "start_relative"]
>>> data_types = {
...     "start": "datetime64[ns]",
...     "end": "datetime64[ns]",
...     "task": "str"
... }
>>> df = (pd.DataFrame(data=data)).astype(dtype=data_types)
>>> df[columns[3]] = (df[columns[2]] - df[columns[1]]).dt.days + 1
>>> df = df.sort_values(
...     by=[columns[1]],
...     axis=0,
...     ascending=[True]
... )
>>> start = df[columns[1]].min()
>>> end = df[columns[2]].max()
>>> duration = (end - start).days + 1
>>> x_ticks = [x for x in range(duration + 1)]
>>> x_labels = [
...     f"{(start + datetime.timedelta(days=x)):%Y-%m-%d}"
...     for x in x_ticks
... ]
>>> df[columns[4]] = (df[columns[1]] - start).dt.days
>>> fig, ax = ds.plot_horizontal_bars(
...     y=df[columns[0]],
...     width=df[columns[3]],
...     left=df[columns[4]]
... )
>>> ax.invert_yaxis() 
>>> ax.set_xticks(ticks=x_ticks) 
>>> ax.set_xticklabels(labels=x_labels, rotation=45) 
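
The Gantt example above passes each task's inclusive duration as width and its offset from the earliest start date as left. The same arithmetic can be sketched with the standard library alone. The tasks dictionary and the variable names are illustrative and are not part of datasense.

```python
import datetime

# Task name -> (start date, end date), copied from the example above.
tasks = {
    "task 1": ("2021-11-01", "2021-11-08"),
    "task 2": ("2021-11-03", "2021-11-16"),
    "task 3": ("2021-11-04", "2021-11-11"),
    "task 4": ("2021-11-08", "2021-11-13"),
}

# Earliest start date across all tasks; bars are positioned relative to it.
project_start = min(
    datetime.date.fromisoformat(start) for start, _ in tasks.values()
)

rows = []
for name, (start, end) in tasks.items():
    start_date = datetime.date.fromisoformat(start)
    end_date = datetime.date.fromisoformat(end)
    duration = (end_date - start_date).days + 1  # counts both end days
    start_relative = (start_date - project_start).days
    rows.append((name, duration, start_relative))

for row in rows:
    print(row)
```
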
datasense.graphs.plot_line_line_line_x_y1_y2_y3(*, X: Series, y1: Series, y2: Series, y3: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, colour1: str = '#0077bb', colour2: str = '#33bbee', colour3: str = '#009988', labellegendy1: str = None, labellegendy2: str = None, labellegendy3: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Line plot of y1 versus X. Line plot of y2 versus X. Line plot of y3 versus X. Optional smoothing applied to y1, y2, y3.

This graph is useful if y1, y2, and y3 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.
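
The linearity constraint can be illustrated with the textbook natural cubic spline basis built from truncated cubic terms. This is only a sketch: the knot positions and the helper names truncated_cubic and natural_basis are assumptions for illustration, and datasense may construct its basis differently.

```python
def truncated_cubic(x: float, knot: float) -> float:
    """Return (x - knot)^3 when x > knot, otherwise 0."""
    return (x - knot) ** 3 if x > knot else 0.0


def natural_basis(x: float, knots: list[float]) -> list[float]:
    """Evaluate a natural cubic spline basis at x (hypothetical helper).

    With K knots the basis has K functions: 1, x, and K - 2 combinations of
    truncated cubics whose cubic and quadratic parts cancel beyond the
    last knot.
    """
    last = knots[-1]

    def d(knot: float) -> float:
        return (truncated_cubic(x, knot) - truncated_cubic(x, last)) / (last - knot)

    d_penultimate = d(knots[-2])
    return [1.0, x] + [d(knot) - d_penultimate for knot in knots[:-2]]


knots = [0.0, 1.0, 2.0, 3.0]  # illustrative knot positions
step = 0.5
for x in (-2.0, 1.5, 5.0):
    before = natural_basis(x - step, knots)
    here = natural_basis(x, knots)
    after = natural_basis(x + step, knots)
    # A second difference of zero means the basis function is linear around x.
    second_diff = [a - 2 * b + c for a, b, c in zip(before, here, after)]
    print(x, [round(value, 6) for value in second_diff])
```

The second differences vanish (up to floating-point error) for the points outside the knot range and are non-zero for the point inside it, which is the behaviour the paragraph above describes.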

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the y1 ordinate.

  • y2 (pd.Series) – The data to plot on the y2 ordinate.

  • y3 (pd.Series) – The data to plot on the y3 ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • colour3 (str = colour_teal) – The colour of the line for y3.

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

  • labellegendy3 (str = None) – The legend label of the line y3.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (6, 4)
>>> df = pd.DataFrame(data={
...     'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...     'y1': [
...         8000, 9000, 10000, 11000, 12000,
...         13000, 14000, 15000, 16000, 17000
...     ],
...     'y2': [
...         7630.59, 12091.24, 12610.42, 14382.62, 23275.12,
...         21676.23, 22264.38, 20776.82, 21384.69, 17041.38
...     ]
... }).sort_values(by=["x1"])
>>> x1 = df["x1"]
>>> y1 = df["y1"]
>>> y2 = df["y2"]
>>> (
...     fitted_model, predictions, confidence_interval_lower,
...     confidence_interval_upper, prediction_interval_lower,
...     prediction_interval_upper
... ) = ds.linear_regression(
...     X=x1,
...     y=y2
... )
>>> fig, ax = ds.plot_line_line_line_x_y1_y2_y3(
...     X=x1,
...     y1=y1,
...     y2=y2,
...     y3=predictions,
...     figsize=figsize,
...     labellegendy1="target",
...     labellegendy2="actual",
...     labellegendy3="predicted"
... )
datasense.graphs.plot_line_line_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker1: str = '.', marker2: str = '.', markersize1: int = 8, markersize2: int = 8, linestyle1: str = '-', linestyle2: str = '-', linewidth1: float = 1, linewidth2: float = 1, colour1: str = '#0077bb', colour2: str = '#33bbee', labellegendy1: str = None, labellegendy2: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Line plot of y1 versus X. Line plot of y2 versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the y1 ordinate.

  • y2 (pd.Series) – The data to plot on the y2 ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker1 (str = ".") – The type of plot point for y1.

  • marker2 (str = ".") – The type of plot point for y2.

  • markersize1 (int = 8) – The size of the plot point for y1.

  • markersize2 (int = 8) – The size of the plot point for y2.

  • linestyle1 (str = "-") – The style of the line for y1.

  • linestyle2 (str = "-") – The style of the line for y2.

  • linewidth1 (float = 1) – The width of the line for y1.

  • linewidth2 (float = 1) – The width of the line for y2.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> figsize = (6, 4)
>>> X = ds.datetime_data()
>>> y1 = ds.random_data()
>>> y2 = ds.random_data()
>>> fig, ax = ds.plot_line_line_x_y1_y2(
...     X=X,
...     y1=y1,
...     y2=y2,
...     figsize=figsize
... )
datasense.graphs.plot_line_line_y1_y2(*, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker1: str = '.', marker2: str = '.', markersize1: int = 8, markersize2: int = 8, linestyle1: str = '-', linestyle2: str = '-', linewidth1: float = 1, linewidth2: float = 1, colour1: str = '#0077bb', colour2: str = '#33bbee', labellegendy1: str = None, labellegendy2: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Line plot of y1 and y2.

Optional smoothing applied to y1 and y2. y1 and y2 are of the same length. y1 and y2 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • y1 (pd.Series) – The data to plot on the ordinate.

  • y2 (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker1 (str = ".") – The type of plot point for y1.

  • marker2 (str = ".") – The type of plot point for y2.

  • markersize1 (int = 8) – The size of the plot point for y1 (pt).

  • markersize2 (int = 8) – The size of the plot point for y2 (pt).

  • linestyle1 (str = "-") – The style of the line for y1.

  • linestyle2 (str = "-") – The style of the line for y2.

  • linewidth1 (float = 1) – The width of the line for y1.

  • linewidth2 (float = 1) – The width of the line for y2.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> series_y1 = ds.random_data()
>>> series_y2 = ds.random_data()
>>> fig, ax = ds.plot_line_line_y1_y2(
...     y1=series_y1,
...     y2=series_y2
... )
datasense.graphs.plot_line_x_y(*, X: Series, y: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker: str = '.', markersize: float = 8, linestyle: str = '-', linewidth: float = 1, colour: str = '#0077bb', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Line plot of y versus X. Optional smoothing applied to y.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker (str = ".") – The type of plot point.

  • markersize (float = 8) – The size of the plot point (pt).

  • linestyle (str = "-") – The style of the line joining the points.

  • linewidth (float = 1) – The width of the line joining the points.

  • colour (str = colour_blue) – The colour of the plot point (hexadecimal triplet string).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> X = ds.datetime_data()
>>> y = ds.random_data()
>>> fig, ax = ds.plot_line_x_y(
...     X=X,
...     y=y
... )
>>> X = ds.random_data(distribution="randint").sort_values()
>>> y = ds.random_data()
>>> fig, ax = ds.plot_line_x_y(
...     X=X,
...     y=y,
...     figsize=(8, 4.5),
...     marker="o",
...     markersize=8,
...     linestyle=":",
...     linewidth=5,
...     colour="#ee3377"
... )
>>> X = ds.random_data(distribution="uniform").sort_values()
>>> y = ds.random_data()
>>> fig, ax = ds.plot_line_x_y(
...     X=X,
...     y=y
... )
>>> X = ds.random_data().sort_values()
>>> y = ds.random_data()
>>> fig, ax = ds.plot_line_x_y(
...     X=X,
...     y=y
... )
datasense.graphs.plot_line_y(*, y: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker: str = '.', markersize: float = 8, linestyle: str = '-', colour: str = '#0077bb', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Line plot of y. Optional smoothing applied to y.

The abscissa is a series of integers 1 to the size of y.
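
As a small illustration of that default abscissa, the sketch below pairs each value of y with its 1-based position. The values of y are made up for the example, and this plain-Python version only mirrors the description above; it is not the function's implementation.

```python
y = [4.2, 3.9, 5.1, 4.7]  # illustrative values

# Integers 1, 2, ..., len(y) serve as the horizontal coordinates.
abscissa = list(range(1, len(y) + 1))
points = list(zip(abscissa, y))
print(points)
```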

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • y (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker (str = ".") – The type of plot point.

  • markersize (float = 8) – The size of the plot point (pt).

  • linestyle (str = "-") – The style for the line.

  • colour (str = colour_blue) – The colour of the plot point (hexadecimal triplet string).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> series_y = ds.random_data()
>>> fig, ax = ds.plot_line_y(y=series_y)
>>> fig, ax = ds.plot_line_y(
...     y=series_y,
...     figsize=(8, 4.5),
...     marker="o",
...     markersize=4,
...     colour=colour_orange
... )
datasense.graphs.plot_lineleft_lineright_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, colour1: str = '#0077bb', colour2: str = '#33bbee', linestyle1: str = '-', linestyle2: str = '-', marker1: str = '.', marker1size: float = 8, marker2: str = '.', marker2size: float = 8, labellegendy1: str = None, labellegendy2: str = None, xticklabels_rotation=None, defaultfmt='%Y-%m-%d') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]

Line plot of y1 left vertical axis versus X. Line plot of y2 right vertical axis versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have different units or scales, and you wish to see if they are correlated.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the ordinate.

  • y2 (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • linestyle1 (str = "-") – The style of the line for y1.

  • linestyle2 (str = "-") – The style of the line for y2.

  • marker1 (str = ".") – The type of plot point for y1.

  • marker1size (float = 8) – The size of the plot point for y1 (pt).

  • marker2 (str = ".") – The type of plot point for y2.

  • marker2size (float = 8) – The size of the plot point for y2 (pt).

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

Returns:

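A matplotlib Figure and two Axes.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax1: axes.Axes

    A matplotlib Axes for y1 (left vertical axis).

  • ax2: axes.Axes

    A matplotlib Axes for y2 (right vertical axis).

Return type:

tuple[plt.Figure, axes.Axes, axes.Axes]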

Examples

>>> import datasense as ds
>>> import pandas as pd
>>> figsize = (6, 4)
>>> df = pd.DataFrame(data={
...     "X": [
...         "2018-07-31", "2018-08-04", "2018-08-06", "2018-08-11",
...         "2018-08-12", "2018-08-15", "2018-08-16", "2018-08-17",
...         "2018-08-18", "2018-08-25", "2018-09-15"
...     ],
...     "y1": [10, 15, 30, 35, 40, 45, 40, 30, 35, 50, 75],
...     "y2": [20, 35, 20, 15, 30, 45, 50, 40, 45, 50, 65]
... })
>>> fig, ax1, ax2 = ds.plot_lineleft_lineright_x_y1_y2(
...     X=df["X"],
...     y1=df["y1"],
...     y2=df["y2"],
...     figsize=figsize
... )
>>> import pandas as pd
>>> figsize = (6, 4)
>>> df = pd.DataFrame(data={
...     "X": [
...         "2018-07-31", "2018-08-04", "2018-08-06", "2018-08-11",
...         "2018-08-12", "2018-08-15", "2018-08-16", "2018-08-17",
...         "2018-08-18", "2018-08-25", "2018-09-15"
...     ],
...     "y1": [10, 15, 30, 35, 40, 45, 40, 30, 35, 50, 75],
...     "y2": [20, 35, 20, 15, 30, 45, 50, 40, 45, 50, 65]
... })
>>> df["X"] = pd.to_datetime(df["X"])
>>> fig, ax1, ax2 = ds.plot_lineleft_lineright_x_y1_y2(
...     X=df["X"],
...     y1=df["y1"],
...     y2=df["y2"],
...     smoothing="natural_cubic_spline",
...     number_knots=5,
...     figsize=figsize
... )
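
The two-axis plot is a visual check for correlation between series on different scales. A numeric companion, which is not part of the function itself, is the Pearson correlation coefficient. The sketch below computes it for the example data above; the helper name pearson is hypothetical.

```python
import math

y1 = [10, 15, 30, 35, 40, 45, 40, 30, 35, 50, 75]
y2 = [20, 35, 20, 15, 30, 45, 50, 40, 45, 50, 65]


def pearson(a: list[float], b: list[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Sum of products of deviations from the means.
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    spread_a = math.sqrt(sum((p - mean_a) ** 2 for p in a))
    spread_b = math.sqrt(sum((q - mean_b) ** 2 for q in b))
    return covariance / (spread_a * spread_b)


r = pearson(y1, y2)
print(f"r = {r:.2f}")
```
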
datasense.graphs.plot_pareto(*, X: Series, y: Series, figsize: tuple[float, float] = None, width: float = 0.8, colour1: str = '#0077bb', colour2: str = '#33bbee', marker: str = '.', markersize: float = 8, linestyle: str = '-') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]
Parameters:
  • X (pd.Series) – The data to plot on the ordinate.

  • y (pd.Series) – The data to plot on the abscissa.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • width (float = 0.8) – The width of the bars (in).

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • marker (str = ".") – The type of plot point.

  • markersize (float = 8) – The size of the plot point (pt).

  • linestyle (str = "-") – The style of the line joining the points.

Returns:

A matplotlib Figure and two Axes.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax1: axes.Axes

    A matplotlib Axes.

  • ax2: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes, axes.Axes]

Example

>>> import datasense as ds
>>> import pandas as pd
>>> data = pd.DataFrame(
...     {
...         "ordinate": ["Mo", "Larry", "Curly", "Shemp", "Joe"],
...         "abscissa": [21, 2, 10, 4, 16]
...     }
... )
>>> fig, ax1, ax2 = ds.plot_pareto(
...     X=data["ordinate"],
...     y=data["abscissa"]
... )
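
A Pareto chart conventionally shows the categories sorted from largest to smallest together with a cumulative percentage line. Whether plot_pareto computes these quantities in exactly this way is an assumption; the sketch below only shows the conventional arithmetic for the example data, using plain Python.

```python
names = ["Mo", "Larry", "Curly", "Shemp", "Joe"]
counts = [21, 2, 10, 4, 16]

# Sort categories from the largest count to the smallest.
ordered = sorted(zip(names, counts), key=lambda pair: pair[1], reverse=True)

total = sum(counts)
running = 0
cumulative_percent = []
for name, count in ordered:
    running += count
    cumulative_percent.append(round(100 * running / total, 1))

print([name for name, _ in ordered])
print(cumulative_percent)
```
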
datasense.graphs.plot_pie(*, x: list[int] | list[float], labels: list[int] | list[float] | list[str], figsize: tuple[float, float] = None, startangle: float = 0, colors: list[str] = None, autopct: str = '%1.1f%%') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]
Parameters:
  • x (list[int] | list[float],) – The wedge sizes.

  • labels (list[int] | list[float] | list[str],) – The labels of the wedges.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • startangle (float = 0,) – The start angle of the pie, counterclockwise from the x axis.

  • colors (list[str] = None) – The color of the wedges.

  • autopct (str = "%1.1f%%") – Label the wedges with their numeric value. If None, no label.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> x = [69, 31]
>>> labels = ["Yes", "No"]
>>> fig, ax = ds.plot_pie(
...     x=x,
...     labels=labels
... )
>>> x = [69, 31]
>>> labels = ["Yes", "No"]
>>> fig, ax = ds.plot_pie(
...     x=x,
...     labels=labels,
...     startangle=90,
...     colors=[
...         colour_blue, colour_cyan, colour_teal, colour_orange,
...         colour_red, colour_magenta, colour_grey
...     ]
... )
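
The autopct argument is a format string that matplotlib applies to each wedge's percentage of the total. The sketch below reproduces that formatting in plain Python for the example data; the variable names are illustrative.

```python
x = [69, 31]
autopct = "%1.1f%%"

total = sum(x)
# Each wedge is labelled with its share of the total, formatted by autopct.
wedge_labels = [autopct % (100 * value / total) for value in x]
print(wedge_labels)  # ['69.0%', '31.0%']
```
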
datasense.graphs.plot_scatter_line_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, y1_marker: str = '.', y2_marker: str = '', colour1: str = '#0077bb', colour2: str = '#33bbee', labellegendy1: str = None, labellegendy2: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Scatter plot of y1 versus X. Line plot of y2 versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The series for the horizontal axis.

  • y1 (pd.Series) – The series for y1 to plot on the vertical axis.

  • y2 (pd.Series) – The series for y2 to plot on the vertical axis.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots to create.

  • y1_marker (str = ".") – The type of marker for y1.

  • y2_marker (str = "") – The type of marker for y2.

  • colour1 (str = colour_blue) – The colour of y1.

  • colour2 (str = colour_cyan) – The colour of y2.

  • labellegendy1 (str = None) – The legend for y1.

  • labellegendy2 (str = None) – The legend for y2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> number_knots = 100
>>> figsize = (6, 4)
>>> X = ds.random_data(distribution="uniform").sort_values()
>>> y = ds.random_data(distribution="norm")
>>> model = ds.natural_cubic_spline(
...     X=X,
...     y=y,
...     number_knots=number_knots
... )
>>> fig, ax = ds.plot_scatter_line_x_y1_y2(
...     X=X,
...     y1=y,
...     y2=model.predict(X),
...     figsize=figsize,
...     labellegendy2=f'number knots = {number_knots}'
... )
datasense.graphs.plot_scatter_scatter_x1_x2_y1_y2(*, X1: Series, X2: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker1: str = '.', marker2: str = '.', markersize1: int = 8, markersize2: int = 8, linestyle1: str = 'None', linestyle2: str = 'None', linewidth1: float = 1, linewidth2: float = 1, colour1: str = '#0077bb', colour2: str = '#33bbee', labellegendy1: str = None, labellegendy2: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Scatter plot of y1 versus X1. Scatter plot of y2 versus X2. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X1 (pd.Series) – The data to plot on the abscissa.

  • X2 (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the ordinate.

  • y2 (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker1 (str = ".") – The type of plot point for y1.

  • marker2 (str = ".") – The type of plot point for y2.

  • markersize1 (int = 8) – The size of the plot point for y1.

  • markersize2 (int = 8) – The size of the plot point for y2.

  • linestyle1 (str = "None") – The style of the line for y1.

  • linestyle2 (str = "None") – The style of the line for y2.

  • linewidth1 (float = 1) – The width of the line for y1.

  • linewidth2 (float = 1) – The width of the line for y2.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> series_x1 = ds.datetime_data()
>>> series_x2 = ds.datetime_data()
>>> series_y1 = ds.random_data()
>>> series_y2 = ds.random_data()
>>> fig, ax = ds.plot_scatter_scatter_x1_x2_y1_y2(
...     X1=series_x1,
...     X2=series_x2,
...     y1=series_y1,
...     y2=series_y2
... )
>>> fig, ax = ds.plot_scatter_scatter_x1_x2_y1_y2(
...     X1=series_x1,
...     X2=series_x2,
...     y1=series_y1,
...     y2=series_y2,
...     smoothing="natural_cubic_spline",
...     number_knots=7
... )
>>> series_x1 = ds.random_data(distribution="uniform").sort_values()
>>> series_x2 = ds.random_data(distribution="uniform").sort_values()
>>> series_y1 = ds.random_data()
>>> series_y2 = ds.random_data()
>>> fig, ax = ds.plot_scatter_scatter_x1_x2_y1_y2(
...     X1=series_x1,
...     X2=series_x2,
...     y1=series_y1,
...     y2=series_y2,
...     figsize=(8, 5),
...     marker1="o",
...     marker2="+",
...     markersize1=8,
...     markersize2=12,
...     colour1="red",
...     colour2="magenta",
...     labellegendy1="y1",
...     labellegendy2="y2"
... )
>>> ax.legend(frameon=False) 
>>> fig, ax = ds.plot_scatter_scatter_x1_x2_y1_y2(
...     X1=series_x1,
...     X2=series_x2,
...     y1=series_y1,
...     y2=series_y2,
...     figsize=(8, 5),
...     marker1="o",
...     marker2="+",
...     markersize1=8,
...     markersize2=12,
...     colour1="red",
...     colour2="magenta",
...     labellegendy1="y1",
...     labellegendy2="y2",
...     smoothing="natural_cubic_spline",
...     number_knots=7
... )
>>> ax.legend(frameon=False) 
datasense.graphs.plot_scatter_scatter_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker1: str = '.', marker2: str = '.', markersize1: int = 8, markersize2: int = 8, linestyle1: str = 'None', linestyle2: str = 'None', linewidth1: float = 1, linewidth2: float = 1, colour1: str = '#0077bb', colour2: str = '#33bbee', labellegendy1: str = None, labellegendy2: str = None, remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Scatter plot of y1 versus X. Scatter plot of y2 versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have the same units.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the ordinate.

  • y2 (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker1 (str = ".") – The type of plot point for y1.

  • marker2 (str = ".") – The type of plot point for y2.

  • markersize1 (int = 8) – The size of the plot point for y1.

  • markersize2 (int = 8) – The size of the plot point for y2.

  • linestyle1 (str = "None") – The style of the line for y1.

  • linestyle2 (str = "None") – The style of the line for y2.

  • linewidth1 (float = 1) – The width of the line for y1.

  • linewidth2 (float = 1) – The width of the line for y2.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • labellegendy1 (str = None) – The legend label of the line y1.

  • labellegendy2 (str = None) – The legend label of the line y2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> series_x = ds.datetime_data()
>>> series_y1 = ds.random_data()
>>> series_y2 = ds.random_data()
>>> fig, ax = ds.plot_scatter_scatter_x_y1_y2(
...     X=series_x,
...     y1=series_y1,
...     y2=series_y2
... )
>>> series_x = ds.random_data(distribution="uniform")
>>> fig, ax = ds.plot_scatter_scatter_x_y1_y2(
...     X=series_x,
...     y1=series_y1,
...     y2=series_y2,
...     figsize=(8, 5),
...     marker1="o",
...     marker2="+",
...     markersize1=8,
...     markersize2=12,
...     colour1=colour_red,
...     colour2=colour_magenta,
...     labellegendy1="y1",
...     labellegendy2="y2"
... )
>>> ax.legend(frameon=False) 
datasense.graphs.plot_scatter_x_y(*, X: Series, y: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker: str = '.', markersize: float = 4, colour: str = '#0077bb', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Scatter plot of y versus X. Optional smoothing applied to y.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Smoothing fits a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker (str = ".") – The type of plot point.

  • markersize (float = 4) – The size of the plot point (pt).

  • colour (str = colour_blue) – The colour of the plot point (hexadecimal triplet string).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> series_x = ds.datetime_data()
>>> series_y = ds.random_data()
>>> fig, ax = ds.plot_scatter_x_y(
...     X=series_x,
...     y=series_y
... )
>>> series_x = ds.random_data(distribution="randint").sort_values()
>>> fig, ax = ds.plot_scatter_x_y(
...     X=series_x,
...     y=series_y,
...     figsize=(8, 4.5),
...     marker="o",
...     markersize=8,
...     colour=colour_red
... )
>>> series_x = ds.random_data(distribution="uniform").sort_values()
>>> fig, ax = ds.plot_scatter_x_y(
...     X=series_x,
...     y=series_y
... )
>>> series_x = ds.random_data().sort_values()
>>> fig, ax = ds.plot_scatter_x_y(
...     X=series_x,
...     y=series_y
... )
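The smoothing constraint described for this function (a piecewise cubic that is linear outside the range of the knots) is the defining property of a natural cubic spline. The following is a minimal sketch of one textbook basis for such splines, the truncated power form. It is an illustration only and not necessarily how datasense builds its fit; the names natural_cubic_basis and knots are hypothetical.

```python
def natural_cubic_basis(x: float, knots: list[float]) -> list[float]:
    """Evaluate a natural cubic spline basis at x (truncated power form)."""
    last = knots[-1]

    def d(knot: float) -> float:
        # Scaled difference of truncated cubics; zero to the left of `knot`.
        cube = max(x - knot, 0.0) ** 3
        cube_last = max(x - last, 0.0) ** 3
        return (cube - cube_last) / (last - knot)

    # K knots give K basis functions: 1, x, and K - 2 cubic pieces.
    basis = [1.0, x]
    d_penultimate = d(knots[-2])
    for knot in knots[:-2]:
        basis.append(d(knot) - d_penultimate)
    return basis


knots = [0.0, 1.0, 2.0, 3.0]

# Beyond the last knot every basis function is a straight line,
# so second differences on an evenly spaced grid vanish.
step = 0.5
outside = [natural_cubic_basis(4.0 + i * step, knots) for i in range(3)]
second_differences = [
    outside[0][j] - 2 * outside[1][j] + outside[2][j]
    for j in range(len(outside[0]))
]
```

Any linear combination of these basis functions inherits the same behaviour, which is why a least-squares fit on this basis stays linear outside the knots.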
datasense.graphs.plot_scatter_y(*, y: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, marker: str = '.', markersize: float = 8, colour: str = '#0077bb', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Scatter plot of y. Optional smoothing applied to y.

The abscissa is a series of integers 1 to the size of y.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Fit a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • y (pd.Series) – The data to plot on the ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • marker (str = ".") – The type of plot point.

  • markersize (float = 8) – The size of the plot point (pt).

  • colour (str = colour_blue) – The colour of the plot point (hexadecimal triplet string).

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> series_y = ds.random_data()
>>> fig, ax = ds.plot_scatter_y(y=series_y)
>>> fig, ax = ds.plot_scatter_y(
...     y=series_y,
...     figsize=(8, 4.5),
...     marker="o",
...     markersize=4,
...     colour=colour_orange
... )
datasense.graphs.plot_scatterleft_scatterright_x_y1_y2(*, X: Series, y1: Series, y2: Series, figsize: tuple[float, float] = None, smoothing: str = None, number_knots: int = None, colour1: str = '#0077bb', colour2: str = '#33bbee', linestyle1: str = 'None', linestyle2: str = 'None') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]

Scatter plot of y1 left vertical axis versus X. Scatter plot of y2 right vertical axis versus X. Optional smoothing applied to y1, y2.

This graph is useful if y1 and y2 have different units or scales, and you wish to see if they are correlated.

If smoothing is applied, the series must not contain NaN, inf, or -inf. Fit a piecewise cubic function with the constraint that the fitted curve is linear outside the range of the knots. The fitted curve is continuously differentiable to the second order at all of the knots.

Parameters:
  • X (pd.Series) – The data to plot on the abscissa.

  • y1 (pd.Series) – The data to plot on the y1 ordinate.

  • y2 (pd.Series) – The data to plot on the y2 ordinate.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • smoothing (str = None) – The type of smoothing to apply. Options: “natural_cubic_spline”

  • number_knots (int = None) – The number of knots for natural cubic spline smoothing.

  • colour1 (str = colour_blue) – The colour of the line for y1.

  • colour2 (str = colour_cyan) – The colour of the line for y2.

  • linestyle1 (str = "None") – The style of the line for y1.

  • linestyle2 (str = "None") – The style of the line for y2.

Returns:

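A tuple of a matplotlib Figure and two Axes.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax1: axes.Axes

    A matplotlib Axes for y1 (left vertical axis).

  • ax2: axes.Axes

    A matplotlib Axes for y2 (right vertical axis).

Return type:

tuple[plt.Figure, axes.Axes, axes.Axes]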

Example

>>> import datasense as ds
>>> X = ds.random_data(distribution="randint").sort_values()
>>> y1 = ds.random_data(distribution="norm")
>>> y2 = ds.random_data(distribution="norm")
>>> fig, ax1, ax2 = ds.plot_scatterleft_scatterright_x_y1_y2(
...     X=X,
...     y1=y1,
...     y2=y2,
...     figsize=(6, 4),
...     linestyle2="-"
... )
datasense.graphs.plot_stacked_bars(*, x: list[int] | list[float] | list[str], height1: list[int] | list[float], label1: str = None, height2: list[int] | list[float] = None, label2: str = None, height3: list[int] | list[float] = None, label3: str = None, height4: list[int] | list[float] = None, label4: str = None, height5: list[int] | list[float] = None, label5: str = None, height6: list[int] | list[float] = None, label6: str = None, height7: list[int] | list[float] = None, label7: str = None, width: float = 0.8, figsize: tuple[float, float] = None, color: [list[str]] = ['#0077bb', '#33bbee', '#009988', '#ee7733', '#cc3311', '#ee3388', '#bbbbbb']) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Stacked vertical bar plot of up to seven levels per bar.

Parameters:
  • x (list[int] | list[float] | list[str]) – The x coordinates of the bars.

  • height1 (list[int] | list[float]) – The height of the level 1 bars.

  • label1 (str = None) – The label of the level 1 bars.

  • height2 (list[int] | list[float]) – The height of the level 2 bars.

  • label2 (str = None) – The label of the level 2 bars.

  • height3 (list[int] | list[float]) – The height of the level 3 bars.

  • label3 (str = None) – The label of the level 3 bars.

  • height4 (list[int] | list[float]) – The height of the level 4 bars.

  • label4 (str = None) – The label of the level 4 bars.

  • height5 (list[int] | list[float]) – The height of the level 5 bars.

  • label5 (str = None) – The label of the level 5 bars.

  • height6 (list[int] | list[float]) – The height of the level 6 bars.

  • label6 (str = None) – The label of the level 6 bars.

  • height7 (list[int] | list[float]) – The height of the level 7 bars.

  • label7 (str = None) – The label of the level 7 bars.

  • width (float = 0.8) – The width of the bars.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • color (list[str] = [colour_blue, colour_cyan, colour_teal, colour_orange, colour_red, colour_magenta, colour_grey]) – The color of the bar faces, up to seven levels.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> x = ["G1", "G2", "G3", "G4", "G5"]
>>> height1 = [20, 35, 30, 35, 27]
>>> label1 = "A"
>>> width = 0.35
>>> height2 = [25, 32, 34, 20, 25]
>>> label2 = "B"
>>> fig, ax = ds.plot_stacked_bars(
...     x=x,
...     height1=height1,
...     label1=label1,
...     height2=height2,
...     label2=label2
... )
>>> fig.legend(frameon=False, loc="upper right") 
>>> x = ["G1", "G2", "G3", "G4", "G5"]
>>> height1 = [20, 35, 30, 35, 27]
>>> label1 = "A"
>>> width = 0.35
>>> height2 = [25, 32, 34, 20, 25]
>>> label2 = "B"
>>> height3 = [30, 34, 23, 27, 32]
>>> label3 = "C"
>>> height4 = [30, 34, 23, 27, 32]
>>> label4 = "D"
>>> height5 = [30, 34, 23, 27, 32]
>>> label5 = "E"
>>> height6 = [30, 34, 23, 27, 32]
>>> label6 = "F"
>>> height7 = [30, 34, 23, 27, 32]
>>> label7 = "G"
>>> fig, ax = ds.plot_stacked_bars(
...     x=x,
...     height1=height1,
...     label1=label1,
...     width=width,
...     figsize=(9, 6),
...     height2=height2,
...     label2=label2,
...     height3=height3,
...     label3=label3,
...     height4=height4,
...     label4=label4,
...     height5=height5,
...     label5=label5,
...     height6=height6,
...     label6=label6,
...     height7=height7,
...     label7=label7,
... )
>>> fig.legend(frameon=False, loc="upper right") 
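Stacking means that each level starts where the levels beneath it end. The sketch below computes those starting heights for the first three levels of the example data using only the standard library. It illustrates the arithmetic and is not a description of the datasense internals; the names heights, add_levels, bottoms, and totals are hypothetical. With matplotlib, values like these are typically passed to the bottom argument of Axes.bar.

```python
from itertools import accumulate

heights = [
    [20, 35, 30, 35, 27],  # level 1 ("A")
    [25, 32, 34, 20, 25],  # level 2 ("B")
    [30, 34, 23, 27, 32],  # level 3 ("C")
]


def add_levels(lower: list[int], upper: list[int]) -> list[int]:
    """Add two levels bar by bar."""
    return [a + b for a, b in zip(lower, upper)]


zeros = [0] * len(heights[0])
# running[k] is the top of the first k levels, which is the bottom of level k + 1.
running = list(accumulate(heights, add_levels, initial=zeros))
bottoms = running[:-1]
totals = running[-1]
```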
datasense.graphs.plot_vertical_bars(*, x: list[int] | list[float] | list[str], height: list[int] | list[float], width: float = 0.8, figsize: tuple[float, float] = None, edgecolor: str = '#ffffff', linewidth: int = 1, color: str = '#0077bb') tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]
Parameters:
  • x (list[int] | list[float] | list[str]) – The x coordinates of the bars.

  • height (list[int] | list[float]) – The height(s) of the bars.

  • width (float = 0.8) – The width of the bars.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • edgecolor (str = colour_white) – The hexadecimal color value for the bar edges.

  • linewidth (int = 1) – The bar edges line width (pt).

  • color (str = colour_blue) – The color of the bar faces.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Examples

>>> import datasense as ds
>>> x = ["Yes", "No"]
>>> height = [69, 31]
>>> fig, ax = ds.plot_vertical_bars(
...     x=x,
...     height=height
... )
>>> x = ["Yes", "No"]
>>> height = [69, 31]
>>> fig, ax = ds.plot_vertical_bars(
...     x=x,
...     height=height,
...     width=0.4
... )
datasense.graphs.probability_plot(*, data: ~pandas.core.series.Series, figsize: tuple[float, float] = None, distribution: object = <scipy.stats._continuous_distns.norm_gen object>, fit: bool = True, plot: object = None, colour1: str = '#0077bb', colour2: str = '#33bbee', remove_spines: bool = True) tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Plot a probability plot of data against the quantiles of a specified theoretical distribution.

Parameters:
  • data (pd.Series) – A pandas Series.

  • figsize (tuple[float, float] = None) – The (width, height) of the figure (in, in).

  • distribution (object = norm) – Fit a normal distribution by default.

  • fit (bool = True) – Fit a least-squares regression line to the data if True.

  • plot (object = None) – If given, plot the quantiles and least-squares fit.

  • colour1 (str = colour_blue) – The colour of line 1.

  • colour2 (str = colour_cyan) – The colour of line 2.

  • remove_spines (bool = True) – If True, remove top and right spines of axes.

Returns:

A matplotlib Figure and Axes tuple.

  • fig: plt.Figure

    A matplotlib Figure.

  • ax: axes.Axes

    A matplotlib Axes.

Return type:

tuple[plt.Figure, axes.Axes]

Example

>>> import datasense as ds
>>> data = ds.random_data()
>>> fig, ax = ds.probability_plot(data=data)
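A probability plot pairs each sorted observation with the quantile that a theoretical distribution predicts for its rank. The sketch below computes such pairs for a standard normal distribution using only the standard library. It assumes the simple plotting position (i - 0.5) / n, which is an illustrative choice; scipy.stats.probplot uses a different estimate (Filliben's), so the numbers will not match it exactly. The function name normal_probability_points is hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(data: list[float]) -> list[tuple[float, float]]:
    """Return (theoretical quantile, ordered value) pairs for a normal plot."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()
    # Plotting position (i - 0.5) / n is an assumption made for this sketch.
    theoretical = [
        standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, ordered))


points = normal_probability_points([4.1, 5.0, 3.2, 6.3, 4.8])
```

If the data follow the chosen distribution, these points fall close to a straight line, which is what the least-squares fit in the plot helps to judge.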
datasense.graphs.qr_code(*, qr_code_string: str, qr_code_path: Path) None

Create a QR code and save as .svg and .png.

Parameters:
  • qr_code_string (str) – The text for the QR code.

  • qr_code_path (Path) – The path for the saved QR code files.

Example

>>> from pathlib import Path
>>> import datasense as ds
>>> code_string = "mystring"
>>> code_path = Path("str_of_path")
>>> ds.qr_code(qr_code_string=code_string, qr_code_path=code_path)
datasense.graphs.style_graph() None

Style graphs.

Fonts

On Linux these are stored in: /usr/lib/python3.10/site-packages/matplotlib/mpl-data/fonts/ttf/

Example

>>> import datasense as ds
>>> ds.style_graph()

References

https://matplotlib.org/stable/tutorials/introductory/customizing.html

datasense.graphs.waterfall(*, df: DataFrame, path_in: Path | str, xticklabels_rotation: float = 45, last_column: str = 'NET', ylim_min: float | None = None, ylim_max: float | None = None, positive_colour: str = 'green', negative_colour: str = 'red', first_bar_colour: str = 'blue', last_bar_colour: str = 'blue', grid_alpha: float = 0.2, graph_format: str = 'svg', title: str = 'Waterfall Chart') DataFrame

Create a waterfall chart, to understand the cumulative effect of sequentially introduced positive or negative values.

Parameters:
  • df (pd.DataFrame) – The DataFrame to convert to a waterfall DataFrame.

  • path_in (Path | str) – The path of the data file.

  • xticklabels_rotation (float = 45) – The angle to rotate the xticklabels.

  • last_column (str = "NET") – The name of the last column in the waterfall chart.

  • ylim_min (float | None = None) – The lower limit of the y axis.

  • ylim_max (float | None = None) – The upper limit of the y axis.

  • positive_colour (str = "green") – The colour of the positive bars.

  • negative_colour (str = "red") – The colour of the negative bars.

  • first_bar_colour (str = "blue") – The colour of the first bar.

  • last_bar_colour (str = "blue") – The colour of the last bar.

  • grid_alpha (float = 0.2) – The fraction of the full colour of the grid.

  • graph_format (str = "svg") – The output format of the graph.

  • title (str = "Waterfall Chart") – The title on the graph.

Returns:

df – The waterfall DataFrame.

Return type:

pd.DataFrame

Example

Budget waterfall chart

>>> import datasense as ds
>>> import pandas as pd
>>> # the df shown here is a proxy for waterfall_budget.xlsx
>>> df = pd.DataFrame(data={
...     'Categories': [
...         'Base', 'Inflation', 'Merit Raises',
...         'Market Wages', 'Volume', 'Fuel',
...         'Other', 'Compliance', 'Reorganization',
...         'Consolidations', 'Initiative Savings',
...         'Consultants'
...     ],
...     'Amount ($MM)': [
...         423.5, 11.7, 2.9, 1.1, 1.5, 0.1,
...         5.3, 1.1, -2.7, -23.3, -6.4, -8
...     ],
... })
>>> df = ds.waterfall(
...     df=df,
...     path_in="waterfall_budget.xlsx",
...     ylim_min=400,
...     ylim_max=455,
... )
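The cumulative effect that a waterfall chart shows can be worked out by hand: each bar starts at the running total of the bars before it, and the last bar is the net. The sketch below does this for the budget figures in the example, using only the standard library. The names running_totals, bar_bottoms, and net are hypothetical and are not columns that ds.waterfall is known to return.

```python
from itertools import accumulate

amounts = [
    423.5, 11.7, 2.9, 1.1, 1.5, 0.1,
    5.3, 1.1, -2.7, -23.3, -6.4, -8,
]

# Running total after each category is applied in sequence.
running_totals = list(accumulate(amounts))
# The first bar starts at zero; every later bar starts at the previous total.
bar_bottoms = [0.0] + running_totals[:-1]
# The final running total is the value of the closing "NET" bar.
net = running_totals[-1]
```

For these figures the net is about 406.8 ($MM), which is why the example sets the y axis limits to 400 and 455.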

datasense.html_ds module

HTML and report functions

datasense.html_ds.explore_functions(function: str) None

Explore functions using inspect.signature.

Parameters:

function (str) – Name of function to explore.

Examples

>>> import datasense as ds
>>> from sklearn.compose import make_column_transformer
>>> function_to_explore = make_column_transformer
>>> ds.explore_functions(function=function_to_explore) 
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.pipeline import make_pipeline
>>> functions = ["function_name_syntax", "function_name"]
>>> for function in functions:
...     ds.explore_functions(function=function) 
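For readers unfamiliar with inspect.signature, the sketch below shows what the standard library call reports for a small function. The function example_function is hypothetical and exists only for this demonstration; how ds.explore_functions formats its output is not shown here.

```python
import inspect


def example_function(a, b=2, *, c="x"):
    """A hypothetical function used only to demonstrate inspect.signature."""
    return a, b, c


signature = inspect.signature(example_function)
# The signature prints in the same form as it is written in source code.
signature_text = str(signature)
# Collect only the parameters that declare a default value.
defaults = {
    name: parameter.default
    for name, parameter in signature.parameters.items()
    if parameter.default is not inspect.Parameter.empty
}
```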
datasense.html_ds.html_begin(*, output_url: str = 'html_report.html', header_title: str = 'Report', header_id: str = 'report') IO[str]

Open a file to write html and set an html header.

Parameters:
  • output_url (str = 'html_report.html') – The file name for the html output.

  • header_title (str = 'Report') – The file title.

  • header_id (str = 'report') – The id for the header_title.

Returns:

original_stdout – A file object for the output of print().

Return type:

IO[str]

Examples

>>> import datasense as ds
>>> output_url = '../tests/my_html_file.html'
>>> original_stdout = ds.html_begin(output_url=output_url)
>>> header_title = 'My Report'
>>> header_id = 'my-report'
>>> original_stdout = ds.html_begin(
...     output_url=output_url,
...     header_title=header_title,
...     header_id=header_id
... )
datasense.html_ds.html_end(*, original_stdout: IO[str], output_url: str) None

Create an html footer, close an html file, and open an html file in a new tab in a web browser.

Parameters:
  • original_stdout (IO[str]) – The original stdout.

  • output_url (str) – The file name for the html output.

Example

>>> import datasense as ds
>>> output_url = '../tests/my_html_file.html'
>>> # see original_stdout example in def html_begin()
>>> original_stdout = ds.html_begin(
...     output_url="output_url.html",
...     header_title="header_title",
...     header_id="header-id"
... )
>>> ds.html_end(
...     original_stdout=original_stdout,
...     output_url="output_url.html"
... )
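The pairing of html_begin and html_end follows a common pattern: redirect print() to a file, write the body, then restore the original stdout. The sketch below shows that pattern with the standard library only. It is an assumption about the general approach, not the datasense implementation; begin_report and end_report are hypothetical names, and the browser-opening step is omitted.

```python
import sys
import tempfile
from pathlib import Path
from typing import IO


def begin_report(output_path: Path, title: str) -> IO[str]:
    """Redirect print() to output_path and return the previous stdout."""
    original_stdout = sys.stdout
    sys.stdout = open(output_path, "w", encoding="utf-8")
    print(f"<html><head><title>{title}</title></head><body>")
    return original_stdout


def end_report(original_stdout: IO[str]) -> None:
    """Write the footer, close the file, and restore the previous stdout."""
    print("</body></html>")
    sys.stdout.close()
    sys.stdout = original_stdout


report_path = Path(tempfile.mkdtemp()) / "sketch_report.html"
saved_stdout = begin_report(report_path, "Sketch")
print("<p>Body text written through print().</p>")
end_report(saved_stdout)
report_text = report_path.read_text(encoding="utf-8")
```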
datasense.html_ds.html_figure(*, file_name: Path | str, caption: str = None) None

Print an html tag for a figure.

Parameters:
  • file_name (Path | str) – The file name of the image.

  • caption (str = None) – The figure caption.

Examples

>>> import datasense as ds
>>> import matplotlib.pyplot as plt
>>> graph_file = 'my_graph_file.svg'
>>> figsize = (8, 6)
>>> fig = plt.figure(figsize=figsize)
>>> fig.savefig(graph_file)
>>> ds.html_figure(file_name=graph_file)
</pre><figure><img src="my_graph_file.svg" alt="my_graph_file.svg"/><figcaption>my_graph_file.svg</figcaption></figure><pre style="white-space: pre-wrap;">
>>> ds.html_figure(
...     file_name=graph_file,
...     caption='../tests/my graph file caption'
... )
</pre><figure><img src="my_graph_file.svg" alt="my_graph_file.svg"/><figcaption>../tests/my graph file caption</figcaption></figure><pre style="white-space: pre-wrap;">

datasense.html_ds.html_footer() None

Create an html footer.

Example

>>> import datasense as ds
>>> ds.html_footer() 
</body>
</html>
datasense.html_ds.html_header(*, header_title: str = 'Report', header_id: str = 'report') None

Create an html header.

Parameters:
  • header_title (str = 'Report') – The header title.

  • header_id (str = 'report') – The header ID.

Example

>>> import datasense as ds
>>> ds.html_header(
...     header_title="header title",
...     header_id="header-id"
... ) 
<!DOCTYPE html>
<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, user-scalable=yes"            name="viewport"/>
<style>@import url("support.css");</style>
<title>header title</title>
</head>
<body>
<h1 class="title" id="header-id">header title</h1>
datasense.html_ds.page_break() None

Create an html page break.

Example

>>> import datasense as ds
>>> ds.page_break() 
</pre>
<p style="page-break-after:always"></p>
<p style="page-break-before:always"></p>
<pre style="white-space: pre-wrap;">
datasense.html_ds.report_summary(*, start_time: float, stop_time: float, print_heading: bool = True, read_file_names: list[str] = None, save_file_names: list[str] = None, targets: list[str] = None, features: list[str] = None, number_knots: list[int] = None) None

Create a report summary.

Parameters:
  • start_time (float) – The start time.

  • stop_time (float) – The stop time.

  • print_heading (bool = True) – The boolean to print the heading for the report summary.

  • read_file_names (list[str] = None) – The list of file names read.

  • save_file_names (list[str] = None) – The list of file names saved.

  • targets (list[str] = None) – The list of target variables.

  • features (list[str] = None) – The list of feature variables.

  • number_knots (list[int] = None) – The number of spline knots.

Example

>>> import datasense as ds
>>> import time
>>> start_time = time.perf_counter()
>>> stop_time = time.perf_counter()
>>> ds.report_summary(
...     start_time=start_time,
...     stop_time=stop_time
... )
</pre>
<h1>Report summary</h1>
<pre style="white-space: pre-wrap;">
Execution time : 0.000 s
datasense.html_ds.script_summary(*, script_path: Path, action: str = 'run') None

Print script name and time of execution.

Parameters:
  • script_path (Path) – The path of the script file.

  • action (str = "run") – An action message: run, started at, finished at, etc.

Examples

>>> from pathlib import Path
>>> import datasense as ds
>>> ds.script_summary(script_path=Path(__file__)) 
>>> ds.script_summary(
...     script_path=Path(__file__),
...     action="started at"
... ) 
>>> ds.script_summary(
...     script_path=Path(__file__),
...     action="finished at"
... ) 
datasense.html_ds.sync_directories(*, sourcedir: str, targetdir: str, action: str = 'sync', twoway: bool = False, purge: bool = False, verbose: bool = True) None

Synchronize two directories.

Parameters:
  • sourcedir (str) – The source directory for syncing.

  • targetdir (str) – The target directory for syncing.

  • action (str = 'sync') – The syncing action. Options: diff, sync, update.

  • twoway (bool = False) – Update files from sourcedir to targetdir (False) or both (True).

  • purge (bool = False) – Delete files from targetdir.

  • verbose (bool = True) – Provide verbose output.

Example

>>> import datasense as ds
>>> local_docs = 'string_to_directory'
>>> sharepoint_docs = 'string_to_mapped_drive_of_sharepoint'
>>> ds.sync_directories(
...     sourcedir="../tests/sourcedir",
...     targetdir="../tests/targetdir",
...     action='sync',
...     twoway=False,
...     purge=False,
...     verbose=True
... ) 

datasense.msa module

Honest MSA reports on pandas DataFrames

class datasense.msa.MSA(df: DataFrame)

Bases: object

TODO

average_chart()

TODO

average_in_control() bool

TODO

average_out_of_control_reason() str

TODO

classification()

TODO

effective_resolution()

TODO

interpret()

Overall interpretation

interpret_tables()

General interpretation of tables

main_effects_chart_anome()

TODO

main_effects_in_control() bool

TODO

main_effects_out_of_control_reason() str

TODO

mean_ranges_chart_anomr()

TODO

mean_ranges_in_control() bool

TODO

mean_ranges_out_of_control_reason() str

TODO

msa_gauge_rr_results()

TODO

msa_results()

TODO

parallelism_chart()

TODO

range_chart()

TODO

range_in_control() bool

TODO

range_out_of_control_reason() str

TODO

report()

TODO

variance_components()

TODO

datasense.munging module

Data munging

datasense.munging.ask_directory_path(*, title: str = 'Select directory', initialdir: Path = None, print_bool: bool = False) Path

Ask user for directory.

Parameters:
  • title (str = 'Select directory') – The title of the dialog window.

  • initialdir (Path = None) – The directory in which the dialogue starts.

  • print_bool (bool = False) – A boolean. Print message if True.

Returns:

path – The path of the directory.

Return type:

Path

Example

>>> from tkinter import filedialog
>>> from pathlib import Path
>>> from tkinter import Tk
>>> import datasense as ds
>>> path = ds.ask_directory_path(title='your message') 
datasense.munging.ask_open_file_name_path(*, title: str, initialdir: Path | None = None, filetypes: list[tuple[str]] = [('xlsx files', '.xlsx .XLSX')]) Path

Ask user for the path of the file to open.

Parameters:
  • title (str) – The title of the dialog window.

  • initialdir (Path | None = None) – The directory in which the dialogue starts.

  • filetypes (list[tuple[str]] = [('xlsx files', '.xlsx .XLSX')]) – The file types to make visible.

Returns:

path – The path of the file to open.

Return type:

Path

Examples

>>> from tkinter import filedialog
>>> from pathlib import Path
>>> from tkinter import Tk
>>> import datasense as ds
>>> path = ds.ask_open_file_name_path(title='message') 
>>> path = ds.ask_open_file_name_path(
...     title='your message',
...     filetypes=[('csv files', '.csv .CSV')]
... ) 
datasense.munging.ask_save_as_file_name_path(*, title: str = 'Select file', initialdir: Path | None = None, filetypes: list[tuple[str]] = [('xlsx files', '.xlsx .XLSX')], print_bool: bool = True) Path

Ask user for the path of the file to save as.

Parameters:
  • title (str = 'Select file') – The title of the dialog window.

  • initialdir (Path | None = None) – The directory in which the dialogue starts.

  • filetypes (list[tuple[str]] = [('xlsx files', '.xlsx .XLSX')]) – The list of file types to show in the dialog.

  • print_bool (bool = True) – A boolean. Print message if True.

Returns:

path – The path of the file to save as.

Return type:

Path

Examples

>>> from tkinter import filedialog
>>> from pathlib import Path
>>> from tkinter import Tk
>>> import datasense as ds
>>> path = ds.ask_save_as_file_name_path(title='message') 
>>> path = ds.ask_save_as_file_name_path(
...     title='your message',
...     filetypes=[('csv files', '.csv .CSV')]
... ) 
datasense.munging.byte_size(*, num: int64, suffix: str = 'B') str

Convert bytes to requested units.

Parameters:
  • num (np.int64) – The input value.

  • suffix (str = 'B') – The units.

Returns:

memory_usage – The output value.

Return type:

str

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> print(
...     ds.byte_size(
...         num=df.memory_usage(index=True).sum()
...     )
... )
4.2 KiB
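The "KiB" in the output is a binary prefix: the value is divided by 1024 repeatedly until it drops below 1024. The sketch below shows one way to do that conversion. It is an illustration under that assumption, not the datasense source, and the name format_bytes is hypothetical.

```python
def format_bytes(num: float, suffix: str = "B") -> str:
    """Format a byte count with binary prefixes (Ki, Mi, Gi, ...)."""
    for prefix in ["", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"]:
        if abs(num) < 1024.0:
            return f"{num:3.1f} {prefix}{suffix}"
        # Move to the next prefix: one step is a factor of 1024.
        num /= 1024.0
    return f"{num:.1f} Yi{suffix}"


formatted = format_bytes(4300)
```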
datasense.munging.convert_csv_to_feather(paths_in: list[str] | Path, paths_out: list[str] | Path) None

Convert a list of csv files to feather files.

Parameters:
  • paths_in (list[str] | Path) – List of csv file names or paths.

  • paths_out (list[str] | Path) – List of feather file names or paths.

Note

paths_in and paths_out must be of the same length.

Example

One way to create paths_in.

>>> from pathlib import Path
>>> import datasense as ds
>>> extension_in = [".csv"]
>>> paths_in = ds.list_files(
...     directory=path_csv,
...     pattern_extension=extension_in
... ) 

One way to create paths_out.

>>> extension_out = ".feather"
>>> paths_out = [
...     Path(
...         directory_feather_files,
...         path_in.name
...     ).with_suffix(extension_out)
...     for path_in in paths_in
... ] 

Convert csv to feather.

>>> ds.convert_csv_to_feather(
...     paths_in=paths_in,
...     paths_out=paths_out
... ) 
datasense.munging.convert_seconds_to_hh_mm_ss(*, seconds: int = None) tuple[int, int, int]

Convert seconds to hours, minutes and seconds.

Parameters:

seconds (int = None) – Time in seconds

Returns:

A tuple containing hours, minutes, seconds.

  • hours: int

    An integer of hours.

  • minutes: int

    An integer of minutes.

  • seconds: int

    An integer of seconds.

Return type:

tuple[int, int, int]

Example

>>> import datasense as ds
>>> hours_minutes_seconds = ds.convert_seconds_to_hh_mm_ss(seconds=251)
>>> hours_minutes_seconds
(0, 4, 11)
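The conversion is two integer divisions with remainder. The sketch below reproduces the result above with divmod; the function name seconds_to_hh_mm_ss is hypothetical and is not the datasense implementation.

```python
def seconds_to_hh_mm_ss(seconds: int) -> tuple[int, int, int]:
    """Split a number of seconds into hours, minutes, and seconds."""
    # 60 seconds per minute, then 60 minutes per hour.
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, secs


result = seconds_to_hh_mm_ss(251)
```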
datasense.munging.copy_directory(*, sources: Path | str, destinations: Path | str, ignore_errors: bool = True) None

Delete destination directories (if present) and copy source directories to destination directories.

Parameters:
  • sources (Path | str) – The source directory name.

  • destinations (Path | str) – The destination directory name.

  • ignore_errors (bool = True) – Boolean to deal with errors.

Example

>>> import datasense as ds
>>> sources = ['source_directory']
>>> destinations = ['destination_directory']
>>> ds.copy_directory(
...     sources=sources,
...     destinations=destinations
... ) 
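The behaviour described for this function (delete the destination if present, then copy the source) can be sketched with shutil from the standard library. This is an illustration of the described behaviour, not the datasense source; the name replace_directory is hypothetical, and the sketch works in a temporary directory so that nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def replace_directory(source: Path, destination: Path) -> None:
    """Delete destination if it exists, then copy source to destination."""
    # ignore_errors=True makes a missing destination a no-op.
    shutil.rmtree(destination, ignore_errors=True)
    shutil.copytree(source, destination)


workspace = Path(tempfile.mkdtemp())
source = workspace / "source_directory"
destination = workspace / "destination_directory"
source.mkdir()
(source / "new.txt").write_text("new")
destination.mkdir()
(destination / "stale.txt").write_text("stale")

replace_directory(source, destination)
copied_names = sorted(path.name for path in destination.iterdir())
```

Files that existed only in the destination are lost, so the destination should be treated as disposable.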
datasense.munging.create_dataframe(*, size: int = 42, fraction_nan: float = 0.13) DataFrame

Create a Pandas DataFrame.

Parameters:
  • size (int = 42) – The number of rows to create.

  • fraction_nan (float = 0.13) – The fraction of the DataFrame rows to contain NaN.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()

Notes

  • a : float64

  • b : bool

  • bn : boolean (nullable)

  • c : category

  • cs : CategoricalDtype category

  • d : timedelta64[ns]

  • r : object

  • s : object

  • t : datetime64[ns]

  • u : datetime64[ns]

  • x : float64

  • y : int64

  • yn : Int64

  • z : float64

datasense.munging.create_dataframe_norm(*, row_count: int = 42, column_count: int = 13, loc: float = 69, scale: float = 13, random_state: int = None, column_names: list[str] = None) DataFrame

Create DataFrame of random normal data.

Parameters:
  • row_count (int = 42,) – The number of rows to create.

  • column_count (int = 13,) – The number of columns to create.

  • loc (float = 69,) – The mean of the data.

  • scale (float = 13) – The standard deviation of the data.

  • random_state (int = None) – The random number seed.

  • column_names (list[str]) – The column names.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.create_dataframe_norm()
>>> column_count = 100
>>> row_count = 1000
>>> column_names = [f'col{item}' for item in range(column_count)]
>>> df = ds.create_dataframe_norm(
...     row_count=row_count,
...     column_count=column_count,
...     loc=69,
...     scale=13,
...     random_state=42,
...     column_names=column_names
... )
datasense.munging.create_directory(*, directories: list[str], ignore_errors: bool = True) None

Create empty directories for a path.

  • Deletes existing directories, whether empty or non-empty.

  • Ignores errors such as no existing directories.

Parameters:
  • directories (list[str]) – The list of directories.

  • ignore_errors (bool = True) – Boolean to deal with errors.

Example

>>> import datasense as ds
>>> directory_list = ['directory_one', 'directory_two']
>>> ds.create_directory(directories=directory_list)
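Because existing directories are deleted first, the call always leaves empty directories behind. The sketch below shows that delete-then-create sequence with the standard library. It is an assumption about the approach rather than the datasense source; recreate_directories is a hypothetical name, and the sketch runs inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def recreate_directories(
    directories: list[Path], ignore_errors: bool = True
) -> None:
    """Delete each directory if it exists, then create it empty."""
    for directory in directories:
        # A directory that does not exist yet is skipped silently
        # when ignore_errors is True.
        shutil.rmtree(directory, ignore_errors=ignore_errors)
        Path(directory).mkdir(parents=True, exist_ok=True)


workspace = Path(tempfile.mkdtemp())
directory_one = workspace / "directory_one"
directory_two = workspace / "directory_two"
directory_one.mkdir()
(directory_one / "old.txt").write_text("old")

recreate_directories([directory_one, directory_two])
contents = [list(directory_one.iterdir()), list(directory_two.iterdir())]
```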
datasense.munging.dataframe_info(*, df: DataFrame, file_in: Path | str, unique_bool: bool = False) DataFrame

Describe a DataFrame.

  • Display count of rows (rows_in_count)

  • Display count of empty rows (rows_empty_count)

  • Display count of non-empty rows (rows_out_count)

  • Display count of columns (columns_in_count)

  • Display count of empty columns (columns_empty_count)

  • Display count of non-empty columns (columns_non_empty_count)

  • Display table of data type, empty cell count, and empty cell percentage for non-empty columns (calls def number_empty_cells_in_columns())

  • Display count and list of non-empty columns (columns_non_empty_count, columns_non_empty_list)

  • Display count and list of boolean columns (columns_bool_count, columns_bool_list)

  • Display count and list of category columns (columns_category_count, columns_category_list)

  • Display count and list of datetime columns (columns_datetime_count, columns_datetime_list)

  • Display count and list of float columns (columns_float_count, columns_float_list)

  • Display count and list of integer columns (columns_integer_count, columns_integer_list)

  • Display count and list of string columns (columns_object_count, columns_object_list)

  • Display count and list of timedelta columns (columns_timedelta_count, columns_timedelta_list)

  • Display count and list of empty columns (columns_empty_count, columns_empty_list)

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • file_in (Path | str) – The name of the file from which df was created.

  • unique_bool (bool = False) – Print unique values of a column if True.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.dataframe_info(
...     df=df,
...     file_in='df'
... ) 
>>> df = ds.dataframe_info(
...     df=df,
...     file_in='df',
...     unique_bool=True
... ) 
datasense.munging.delete_columns(*, df: DataFrame, columns: list[str]) DataFrame

Delete columns of a DataFrame using a list.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • columns (list[str]) – A list of column names.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.delete_columns(
...     df=df,
...     columns=columns
... ) 
datasense.munging.delete_directory(*, directories: list[str], ignore_errors: bool = True) None

Delete a list of directories. Existing directories are deleted whether they are empty or non-empty.

Parameters:
  • directories (list[str]) – The list of directories.

  • ignore_errors (bool = True) – Boolean to deal with errors.

Example

>>> import datasense as ds
>>> directory_list = ['directory_one', 'directory_two']
>>> ds.delete_directory(directories=directory_list)
datasense.munging.delete_empty_columns(*, df: DataFrame, list_empty_columns: list[str] | None = None) DataFrame

Delete empty columns.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • list_empty_columns (list[str] | None = None) – A list of empty columns to delete. The code does not check if these columns are empty, but assumes they are.

TODO: Check that the columns in list_empty_columns are empty.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.delete_empty_columns(df=df) 
>>> list_empty_columns = ["mixed", "nan_none"]
>>> df = ds.delete_empty_columns(
...    df=df,
...    list_empty_columns=list_empty_columns
... ) 

Notes

The following code also works, should dropna not work.

Delete columns where all elements are missing. df.loc[:, ~df.isna().all()]
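
The expression in the note can be checked on a small DataFrame. This sketch uses pandas directly rather than datasense; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "mixed": [np.nan, None, np.nan],
    "nan_none": [None, None, None],
})

# Boolean Series that is True for columns where every cell is missing.
all_missing = df.isna().all()
df_loc = df.loc[:, ~all_missing]

# dropna with how="all" along the columns gives the same result.
df_dropna = df.dropna(axis="columns", how="all")

print(list(df_loc.columns))
```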

datasense.munging.delete_empty_rows(df: DataFrame, list_columns: list[str] | None = None) DataFrame

Delete empty rows

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • list_columns (list[str] | None = None) – A list of columns to use to determine if row elements are empty.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.delete_empty_rows(df=df) 
>>> list_columns = ["column_x", "column_y", "column_z"]
>>> df = ds.delete_empty_rows(
...     df=df,
...     list_columns=list_columns
... ) 

Notes

The following code also works, should dropna not work.

Delete rows where all elements are missing in all columns. df.loc[~(df.shape[1] == df.isna().sum(axis=1)), :]

Delete rows where all elements are missing, in specific columns. df.dropna(how="all", subset=specific_columns)

Delete rows where all elements are missing, in specific columns. df.loc[~((df[specific_columns].isna().sum(axis=1)) == (len(specific_columns))), :]
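
The alternatives in the notes can be compared on a small DataFrame. This sketch uses pandas directly; the data and the choice of specific_columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_x": [1.0, np.nan, np.nan, 4.0],
    "column_y": [np.nan, np.nan, np.nan, 5.0],
    "column_z": ["a", None, "c", "d"],
})

# Drop rows where every column is missing.
all_columns = df.loc[~(df.shape[1] == df.isna().sum(axis=1)), :]

# Drop rows where every cell is missing within a subset of columns.
specific_columns = ["column_x", "column_y"]
subset_dropna = df.dropna(how="all", subset=specific_columns)
subset_loc = df.loc[
    ~(df[specific_columns].isna().sum(axis=1) == len(specific_columns)), :
]

print(list(all_columns.index))
print(list(subset_dropna.index))
```

Row 2 survives the first filter because column_z holds a value, but it is removed when only column_x and column_y are considered.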

datasense.munging.delete_list_files(*, files: list[pathlib.Path] | list[str]) None

Delete a list of files

Parameters:

files (list[Path] | list[str]) – The list of files to delete.

Example

>>> import datasense as ds
>>> ds.delete_list_files(
...     files=files,
... ) 
datasense.munging.delete_rows(*, df: DataFrame, delete_row_criteria: tuple[str, int] | tuple[str, float] | tuple[str, str]) DataFrame

Delete rows of a DataFrame based on a value in one column.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • delete_row_criteria (tuple[str, int] | tuple[str, float] | tuple[str, str]) – A tuple of column name and criteria for the entire cell.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.delete_rows(
...     df=df,
...     delete_row_criteria=('Batch Acceptance', 1)
... ) 
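
A rough pandas equivalent of the row filter is sketched below. It assumes that rows whose cell equals the criteria value are the ones removed; the data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Batch Acceptance": [1, 0, 1, 0],
    "Lot": ["A", "B", "C", "D"],
})

column, value = ("Batch Acceptance", 1)
# Keep only the rows whose cell in `column` differs from `value`.
df_kept = df.loc[df[column] != value]
print(df_kept["Lot"].tolist())
```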
datasense.munging.directory_file_print(*, directory: str | Path, text: str = 'Files in directory') None

Print the files in a path.

Parameters:
  • directory (str | Path) – The path of the files to print.

  • text (str = 'Files in directory') – The text to print.

Example

>>> import datasense as ds
>>> path = "path to a directory"
>>> text = 'your text'
>>> ds.directory_file_print(
...     directory=path,
...     text=text
... ) 
datasense.munging.feature_percent_empty(*, df: DataFrame, columns: list[str], threshold: float) list[str]

Remove features that have NaN > threshold.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • columns (list[str]) – The list of columns to evaluate.

  • threshold (float) – The percentage empty threshold value.

Returns:

list_columns – The list of columns below the threshold value.

Return type:

list[str]

Example

>>> import datasense as ds
>>> features = ds.feature_percent_empty(
...     df=data,
...     columns=features,
...     threshold=percent_empty_features
... ) 
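
The selection can be approximated in plain pandas as sketched below. Two assumptions are made here: the threshold is a percentage between 0 and 100, and a column is kept when its percentage of empty cells does not exceed the threshold. The data are invented.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0],
    "X2": [1.0, np.nan, 3.0, 4.0],
    "X3": [np.nan, np.nan, np.nan, 4.0],
})
features = ["X1", "X2", "X3"]
percent_empty_features = 50.0

# Percentage of missing cells in each candidate column.
percent_empty = data[features].isna().mean() * 100
# Keep the columns whose percentage does not exceed the threshold.
kept = [col for col in features if percent_empty[col] <= percent_empty_features]
print(kept)
```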
datasense.munging.file_size(path: Path | str) int

Determine the file size in bytes.

Parameters:

path (Path | str) – The path of the file.

Returns:

size – The file size in bytes.

Return type:

int

Example

>>> import datasense as ds
>>> path = "myfile.feather"
>>> size = ds.file_size(path=path) 
datasense.munging.find_bool_columns(*, df: DataFrame) list[str]

Create a list of boolean column names of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

columns_bool – A list of boolean column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_bool = ds.find_bool_columns(df=df)
>>> columns_bool
['b', 'bn']
datasense.munging.find_category_columns(*, df: DataFrame) list[str]

Create a list of category column names of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

columns_category – A list of category column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_category = ds.find_category_columns(df=df)
>>> columns_category
['c', 'cs']
datasense.munging.find_datetime_columns(*, df: DataFrame) list[str]

Find all datetime columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

datetime_columns – A list of datetime column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_datetime = ds.find_datetime_columns(df=df)
>>> columns_datetime
['t', 'u']
datasense.munging.find_float_columns(*, df: DataFrame) list[str]

Find all float columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

float_columns – A list of float column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_float = ds.find_float_columns(df=df)
>>> columns_float
['a', 'x', 'z']
datasense.munging.find_int_float_columns(*, df: DataFrame) list[str]

Find all integer and float columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

columns_int_float – A list of integer and float column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_int_float = ds.find_int_float_columns(df=df)
>>> columns_int_float
['a', 'i', 'x', 'y', 'yn', 'z']
datasense.munging.find_integer_columns(*, df: DataFrame) list[str]

Find all integer columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

integer_columns – A list of integer column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_int = ds.find_integer_columns(df=df)
>>> columns_int
['i', 'y', 'yn']
datasense.munging.find_object_columns(*, df: DataFrame) list[str]

Find all object columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

object_columns – A list of object column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_object = ds.find_object_columns(df=df)
>>> columns_object
['r', 's']
datasense.munging.find_timedelta_columns(*, df: DataFrame) list[str]

Find all timedelta columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

columns_timedelta – A list of timedelta column names.

Return type:

list[str]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> columns_timedelta = ds.find_timedelta_columns(df=df)
>>> columns_timedelta
['d']
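
The find_*_columns family above can be approximated with the dtype predicates in pandas.api.types. This is a sketch of the general technique, not the datasense implementation; columns_where is a hypothetical helper and the DataFrame is built by hand rather than with ds.create_dataframe().

```python
import pandas as pd
from pandas.api import types as pdt

df = pd.DataFrame({
    "b": [True, False, True],
    "c": pd.Categorical(["small", "large", "small"]),
    "d": pd.to_timedelta([1, 2, 3], unit="D"),
    "t": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "x": [1.5, 2.5, 3.5],
    "y": [1, 2, 3],
})


def columns_where(frame, predicate):
    """Hypothetical helper: names of columns whose dtype satisfies predicate."""
    return [col for col in frame.columns if predicate(frame[col].dtype)]


columns_bool = columns_where(df, pdt.is_bool_dtype)
columns_category = columns_where(
    df, lambda dtype: isinstance(dtype, pd.CategoricalDtype)
)
columns_datetime = columns_where(df, pdt.is_datetime64_any_dtype)
columns_float = columns_where(df, pdt.is_float_dtype)
columns_integer = columns_where(df, pdt.is_integer_dtype)
columns_timedelta = columns_where(df, pdt.is_timedelta64_dtype)

print(columns_bool, columns_category, columns_datetime)
print(columns_float, columns_integer, columns_timedelta)
```

The predicates are used instead of DataFrame.select_dtypes because is_integer_dtype does not treat timedelta columns as integers.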
datasense.munging.get_mtime(path: Path) float

Get the time of last modification of a Path object.

Parameters:

path (Path) – The path of the object.

Returns:

modified_time – The last modification time of a Path object (in seconds since epoch).

Return type:

float

Examples

>>> import datasense as ds
>>> from pathlib import Path
>>> path = Path('readme.md')
>>> modified_time = ds.get_mtime(path=path)
>>> modified_time
1714576968.9664862
>>> path = Path('~/documents/readme.md').expanduser()
>>> modified_time = ds.get_mtime(path=path)
>>> modified_time
1714576968.9664862
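
The values reported by the file_size and get_mtime entries correspond to fields of the standard library's stat result. The sketch below reads them with pathlib on a temporary file; it does not call datasense, and converting the timestamp to a datetime is an added illustration.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "readme.md"
    path.write_text("hello")
    stats = path.stat()
    size = stats.st_size            # file size in bytes
    modified_time = stats.st_mtime  # seconds since the epoch, as a float
    # Convert the epoch seconds to a timezone-aware datetime.
    modified = datetime.fromtimestamp(modified_time, tz=timezone.utc)

print(size, modified.isoformat())
```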
datasense.munging.list_change_case(*, list_dirty: list[str], case: str) list[str]

Change the case of items in a list.

Parameters:
  • list_dirty (list[str]) – The list of strings.

  • case (str) – The type of case to apply.

Returns:

list_clean – The list of strings with case applied.

Return type:

list[str]

Example

>>> import datasense as ds
>>> list_clean = ds.list_change_case(
...     list_dirty=list_dirty,
...     case='upper'
... ) 
datasense.munging.list_directories(*, path: str | Path, pattern_startswith: list[str] | tuple[str] | None = None) list[str]

Return a list of directories found within a path.

Parameters:
  • path (str | Path) – The path of the enclosing directory.

  • pattern_startswith (list[str] | tuple[str] | None = None) – The string for determining if a directory starts with this string.

Returns:

directory_list – A list of directories.

Return type:

list[str]

Examples

>>> import datasense as ds
>>> path = "path"
>>> directory_list = ds.list_directories(path=path) 
>>> path = "path"
>>> pattern_startswith = ["job aids"]
>>> directory_list = ds.list_directories(
...     path=path,
...     pattern_startswith=pattern_startswith
... ) 
>>> path = "path"
>>> pattern_startswith = ["job aids", "cheatsheet"]
>>> directory_list = ds.list_directories(
...     path=path,
...     pattern_startswith=pattern_startswith
... ) 
datasense.munging.list_files(*, directory: str | Path, pattern_startswith: list[str] | tuple[str] | None = None, pattern_extension: list[str] | tuple[str] | None = None) list[pathlib.Path]

Return a list of files within a directory.

Parameters:
  • directory (str | Path) – The path of the directory.

  • pattern_startswith (list[str] | tuple[str] | None = None) – The string for determining if a file starts with this string.

  • pattern_extension (list[str] | tuple[str] | None = None) – The file extensions to use for finding files in the path.

Returns:

files – A list of paths.

Return type:

list[Path]

Examples

>>> import datasense as ds
>>> path = "path"
>>> files = ds.list_files(directory=path)
>>> pattern_extension = [".html", ".HTML"]
>>> files = ds.list_files(
...     directory=path,
...     pattern_extension=pattern_extension
... ) 
>>> pattern_extension = [".html", ".HTML"]
>>> pattern_startswith = ["job_aid"]
>>> files = ds.list_files(
...     directory=path,
...     pattern_extension=pattern_extension,
...     pattern_startswith=pattern_startswith
... ) 
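
The two filters can be sketched with pathlib as follows. find_files is a hypothetical helper; whether the library searches subdirectories or sorts its result is not stated above, so this sketch looks only at the top level and sorts for reproducibility.

```python
import tempfile
from pathlib import Path


def find_files(directory, pattern_startswith=None, pattern_extension=None):
    """Hypothetical helper: list files filtered by name prefix and suffix."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if pattern_startswith:
        # str.startswith accepts a tuple of candidate prefixes.
        prefixes = tuple(pattern_startswith)
        files = [p for p in files if p.name.startswith(prefixes)]
    if pattern_extension:
        files = [p for p in files if p.suffix in pattern_extension]
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("job_aid_one.html", "job_aid_two.HTML", "notes.html", "data.csv"):
        (Path(tmp) / name).touch()
    found = find_files(
        tmp,
        pattern_startswith=["job_aid"],
        pattern_extension=[".html", ".HTML"],
    )
    names = [p.name for p in found]

print(names)
```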
datasense.munging.list_one_list_two_ops(*, list_one: list[str] | list[int] | list[float], list_two: list[str] | list[int] | list[float], action: str) list[str] | list[int] | list[float]

Create a list of items comparing two lists: - Items unique to list_one - Items unique to list_two - Items common to both lists (intersection) Duplicate items are removed.

Parameters:
  • list_one (list[str] | list[int] | list[float]) – A list of items.

  • list_two (list[str] | list[int] | list[float]) – A list of items.

  • action (str) – A string of either “list_one”, “list_two”, or “intersection”.

Returns:

list_result – The list of unique items.

Return type:

list[str] | list[int] | list[float]

Examples

>>> import datasense as ds
>>> list_one = [1, 2, 3, 4, 5, 6]
>>> list_two = [4, 5, 6, 7, 8, 9]
>>> list_one_unique = ds.list_one_list_two_ops(
...     list_one=list_one,
...     list_two=list_two,
...     action="list_one"
... )
>>> list_one_unique
[1, 2, 3]
>>> list_one = [1, 2, 3, 4, 5, 6]
>>> list_two = [4, 5, 6, 7, 8, 9]
>>> list_two_unique = ds.list_one_list_two_ops(
...     list_one=list_one,
...     list_two=list_two,
...     action="list_two"
... )
>>> list_two_unique
[7, 8, 9]
>>> list_one = [1, 2, 3, 4, 5, 6]
>>> list_two = [4, 5, 6, 7, 8, 9]
>>> list_intersection = ds.list_one_list_two_ops(
...     list_one=list_one,
...     list_two=list_two,
...     action="intersection"
... )
>>> list_intersection
[4, 5, 6]
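
The three actions map onto set difference and set intersection. The sketch below is one possible way to express them; compare_lists is a hypothetical helper, and sorting the result is a choice made here for reproducibility, not documented library behaviour.

```python
def compare_lists(list_one, list_two, action):
    """Hypothetical helper mirroring the three documented actions."""
    set_one, set_two = set(list_one), set(list_two)
    if action == "list_one":
        result = set_one - set_two
    elif action == "list_two":
        result = set_two - set_one
    elif action == "intersection":
        result = set_one & set_two
    else:
        raise ValueError(f"unknown action: {action!r}")
    # Sets are unordered, so sort for a reproducible result.
    return sorted(result)


list_one = [1, 2, 3, 4, 5, 6, 6]
list_two = [4, 5, 6, 7, 8, 9]
only_one = compare_lists(list_one, list_two, "list_one")
only_two = compare_lists(list_one, list_two, "list_two")
common = compare_lists(list_one, list_two, "intersection")
print(only_one, only_two, common)
```

Converting to sets is what removes the duplicate 6 in list_one.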
datasense.munging.listone_contains_all_listtwo_substrings(*, listone: list[str], listtwo: list[str]) list[str]

Return a list of items from listone that contain all of the substrings in listtwo.

Parameters:
  • listone (list[str]) – The list of items in which there are substrings to match from listtwo.

  • listtwo (list[str]) – The list of items that are substrings of the items in listone.

Returns:

matches – The list of items from listone that contain substrings of the items from listtwo.

Return type:

list[str]

Example

>>> import datasense as ds
>>> listone = ['prefix-2020-21-CMJG-suffix', 'bobs your uncle']
>>> listtwo = ['CMJG', '2020-21']
>>> matches = ds.listone_contains_all_listtwo_substrings(
...     listone=listone,
...     listtwo=listtwo
... )
>>> matches
['prefix-2020-21-CMJG-suffix']
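
The matching rule suggested by the function name, that an item qualifies only when it contains every substring, can be written as a list comprehension. This is a sketch of that reading, not the library's code; the third item is added here to show a partial match being rejected.

```python
listone = ['prefix-2020-21-CMJG-suffix', 'bobs your uncle', 'CMJG only']
listtwo = ['CMJG', '2020-21']

# Keep an item only if every substring in listtwo occurs in it.
matches = [
    item for item in listone
    if all(substring in item for substring in listtwo)
]
print(matches)
```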
datasense.munging.mask_outliers(df: DataFrame, mask: list[tuple[str, float, float]]) DataFrame

Mask outliers within a scikit-learn pipeline.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • mask (list[tuple[str, float, float]]) – The list of mask values.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

Create a transformer to be used in a scikit-learn pipeline.

>>> from sklearn.preprocessing import FunctionTransformer
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.impute import SimpleImputer
>>> import datasense as ds
>>> mask = [
...     ("X1", -10, 10),
...     ("X2", -25, 25),
...     ("X3", -5, 5),
...     ("X4", -7, 7),
...     ("X5", -3, 3),
...     ("X6", -2, 2),
...     ("X7", -13, 13),
...     ("X8", -8, 8),
...     ("X9", -9, 9),
...     ("X10", -10, 10),
...     ("X11", -9, 9),
...     ("X12", -16, 17),
...     ("X13", -20, 23)
... ]
>>> mask = FunctionTransformer(
...     ds.mask_outliers,
...     kw_args={"mask": mask}
... )
>>> imputer = SimpleImputer()
>>> imputer_pipeline = make_pipeline(mask, imputer)
>>> transformer = make_column_transformer(
...     (imputer_pipeline, features),
...     remainder="drop"
... ) 
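
The masking step can be pictured with the pandas sketch below. It assumes each tuple is (column name, lower limit, upper limit) and that values outside the limits become NaN so that a later imputer can fill them; both points are inferred from the example above, not confirmed. mask_outside_limits is a hypothetical stand-in.

```python
import pandas as pd


def mask_outside_limits(df, mask):
    """Hypothetical stand-in: set values outside [lower, upper] to NaN."""
    df = df.copy()
    for column, lower, upper in mask:
        inside = df[column].between(lower, upper)
        # Series.where keeps values where the condition holds, NaN elsewhere.
        df[column] = df[column].where(inside)
    return df


df = pd.DataFrame({
    "X1": [-12.0, 0.0, 9.5, 11.0],
    "X2": [-30.0, 5.0, 25.0, 26.0],
})
masked = mask_outside_limits(df, mask=[("X1", -10, 10), ("X2", -25, 25)])
print(masked.isna().sum().to_dict())
```

Series.between is inclusive at both ends by default, so a value equal to a limit is kept in this sketch.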
datasense.munging.number_empty_cells_in_columns(*, df: DataFrame) None

Create and print a table of data type, empty-cell count, and empty-cell percentage for non-empty columns of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Example

>>> import datasense as ds
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(data={
...     'X': [25.0, 24.0, 35.5, np.nan, 23.1],
...     'Y': [27, 24, np.nan, 23, np.nan],
...     'Z': ['a', 'b', np.nan, 'd', 'e']
... })
>>> ds.number_empty_cells_in_columns(df=df) 
Information about non-empty columns
 Column   Data type   Empty cell count   Empty cell %   Unique
-------- ----------- ------------------ -------------- --------
 X        float64                    1           20.0        4
 Y        float64                    2           40.0        3
 Z        object                     1           20.0        4
datasense.munging.optimize_columns(df: DataFrame, float_columns: list[str] = None, integer_columns: list[str] | None = None, datetime_columns: list[str] | None = None, object_columns: list[str] | None = None, fraction_categories: int | None = 0.5) DataFrame

Downcast the float, integer, and object columns of a DataFrame and cast its datetime columns.

Parameters:
  • df (pd.DataFrame) – The DataFrame.

  • float_columns (list[str] | None = None) – A list of float columns to downcast.

  • integer_columns (list[str] | None = None) – A list of integer columns to downcast.

  • datetime_columns (list[str] | None = None) – A list of datetime columns to cast.

  • object_columns (list[str] | None = None) – A list of object columns to downcast.

  • fraction_categories (int | None = 0.5) – The fraction of categories in an object column.

Returns:

df – The DataFrame with all columns downcast where possible or requested.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.optimize_columns(df=df) 

If using the default values, it is important to identify object columns that should be datetime columns in order to get the correct answer.

>>> df = ds.optimize_columns(
...     df=df,
...     datetime_columns=datetime_columns,
... )

>>> float_columns = ["column_A", "column_B"]
>>> integer_columns = ["column_C", "column_D"]
>>> object_columns = ["column_E", "column_F"]
>>> df = ds.optimize_columns(
...     df=df,
...     float_columns=float_columns,
...     integer_columns=integer_columns,
...     datetime_columns=datetime_columns,
...     object_columns=object_columns,
...     fraction_categories=0.2
... ) 
datasense.munging.optimize_datetime_columns(df: DataFrame, datetime_columns: list[str] = None) DataFrame

Cast object and datetime columns to pandas datetime. It does not reduce memory usage, but enables time-based operations.

Parameters:
  • df (pd.DataFrame) – The DataFrame that contains one or more datetime columns.

  • datetime_columns (list[str] | None = None) – A list of datetime columns to cast.

Returns:

df – The DataFrame with all datetime columns cast and other columns unchanged.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.optimize_datetime_columns(df=df)
>>> datetime_columns = ["column A", "column B"]
>>> df = ds.optimize_datetime_columns(
...     df=df,
...     datetime_columns=datetime_columns
... )
datasense.munging.optimize_float_columns(df: DataFrame, float_columns: list[str] = None) DataFrame

Downcast float columns

Parameters:
  • df (pd.DataFrame) – The DataFrame that contains one or more float columns.

  • float_columns (list[str] | None = None) – A list of float columns to downcast.

Returns:

df – The DataFrame with all float columns downcast and other columns unchanged.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.optimize_float_columns(df=df) 
>>> float_columns = ["column A", "column B"]
>>> df = ds.optimize_float_columns(
...     df=df,
...     float_columns=float_columns
... ) 
datasense.munging.optimize_integer_columns(df: DataFrame, integer_columns: list[str] | None = None) DataFrame

Downcast integer columns

Parameters:
  • df (pd.DataFrame) – The DataFrame that contains one or more integer columns.

  • integer_columns (list[str] | None = None) – A list of integer columns to downcast.

Returns:

df – The DataFrame with all integer columns downcast and other columns unchanged.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.optimize_integer_columns(df=df) 
>>> integer_columns = ["column A", "column B"]
>>> df = ds.optimize_integer_columns(
...     df=df,
...     integer_columns=integer_columns
... ) 
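
Downcasting of numeric columns can be illustrated with pandas.to_numeric and its downcast argument. Whether datasense uses this call internally is an assumption; the sketch only shows the effect on dtypes and memory.

```python
import pandas as pd

df = pd.DataFrame({
    "column A": [1, 2, 3],
    "column B": [1.5, 2.5, 3.5],
})
before = df.memory_usage(deep=True).sum()

# downcast picks the smallest integer or float dtype that holds the values.
df["column A"] = pd.to_numeric(df["column A"], downcast="integer")
df["column B"] = pd.to_numeric(df["column B"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes.astype(str).to_dict(), before, after)
```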
datasense.munging.optimize_object_columns(df: DataFrame, object_columns: list[str] | None = None, fraction_categories: int | None = 0.5) DataFrame

Downcast object columns

Parameters:
  • df (pd.DataFrame) – The DataFrame that contains one or more object columns.

  • object_columns (list[str] | None = None) – A list of object columns to downcast.

  • fraction_categories (int | None = 0.5) – The fraction of categories in an object column.

Returns:

df – The DataFrame with all object columns downcast and other columns unchanged.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> df = ds.optimize_object_columns(df=df)
>>> fraction_categories = 0.25
>>> df = ds.optimize_object_columns(
...     df=df,
...     fraction_categories=fraction_categories
... )
>>> object_columns = ["column A", "column B"]
>>> df = ds.optimize_object_columns(
...     df=df,
...     object_columns=object_columns
... )
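
One plausible reading of fraction_categories is sketched below: a column is converted to the category dtype when its share of distinct values is below the fraction. The comparison rule and its direction are assumptions made for this illustration, not documented behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "column A": ["red", "blue", "red", "red", "blue", "red"],
    "column B": ["u1", "u2", "u3", "u4", "u5", "u6"],
})
fraction_categories = 0.5

for column in ["column A", "column B"]:
    # Fraction of distinct values relative to the number of rows.
    fraction_unique = df[column].nunique() / len(df[column])
    if fraction_unique < fraction_categories:
        df[column] = df[column].astype("category")

print(df.dtypes.astype(str).to_dict())
```

In this sketch "column A" has two distinct values in six rows and is converted, while "column B" is all distinct and is left unchanged.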
datasense.munging.parameters_dict_replacement(*, file_name: Path, sheet_name: str, usecols: list[str]) dict[str, str]

Read Excel worksheet. Create dictionary of text replacement key, value pairs.

Parameters:
  • file_name (Path) – The path of the Excel file.

  • sheet_name (str) – The Excel worksheet.

  • usecols (list[str]) – The column names to read.

Returns:

text_replacement – A dictionary of text replacement key, value pairs.

Return type:

dict[str, str]

Example

>>> import datasense as ds
>>> from pathlib import Path
>>> path_parameters = Path('parameters.xlsx')
>>> usecols = ['old_text', 'new_text']
>>> sheet_name = 'text_replacement'
>>> replacement_dict = ds.parameters_dict_replacement(
...     file_name=path_parameters,
...     sheet_name=sheet_name,
...     usecols=usecols
... ) 
datasense.munging.parameters_text_replacement(*, file_name: Path, sheet_name: str, usecols: list[str], text_case: str = None) tuple[tuple[str, str]]

Read Excel worksheet. Create tuple of text replacement tuples.

Parameters:
  • file_name (Path) – The path of the Excel file.

  • sheet_name (str) – The Excel worksheet.

  • usecols (list[str]) – The column names to read.

  • text_case (str = None) – Change the case of all items: None, lower, upper.

Returns:

text_replacement – A tuple of text replacement tuples.

Return type:

tuple[tuple[str, str]]

Examples

>>> import datasense as ds
>>> from pathlib import Path
>>> path_parameters = Path("bcp_parameters.xlsx")
>>> usecols = ["old_text", "new_text"]
>>> sheet_name = "text_replacement"
>>> text_replacement = ds.parameters_text_replacement(
...     file_name=path_parameters,
...     sheet_name=sheet_name,
...     usecols=usecols
... )
>>> path_parameters = Path("bcp_parameters.xlsx")
>>> usecols = ["old_text", "new_text"]
>>> sheet_name = "text_replacement"
>>> text_replacement = ds.parameters_text_replacement(
...     file_name=path_parameters,
...     sheet_name=sheet_name,
...     usecols=usecols,
...     text_case="upper"
... )
>>> path_parameters = Path("bcp_parameters.xlsx")
>>> usecols = ["old_text", "new_text"]
>>> sheet_name = "text_replacement"
>>> text_replacement = ds.parameters_text_replacement(
...     file_name=path_parameters,
...     sheet_name=sheet_name,
...     usecols=usecols,
...     text_case="lower"
... )
datasense.munging.print_dictionary_by_key(*, dictionary_to_print: dict[str, list[str]], title: str = None) None

Print each key, value of a dictionary, one key per line.

Parameters:
  • dictionary_to_print (dict[str, list[str]]) – The dictionary to print.

  • title (str = None) – The title to print.

Example

>>> import datasense as ds
>>> ds.print_dictionary_by_key(dictionary_to_print=mydict) 
datasense.munging.print_list_by_item(*, list_to_print: list[str], title: str = None) None

Print each item of a list.

Parameters:
  • list_to_print (list[str]) – The list of strings to print.

  • title (str = None) – The title to print.

Example

>>> import datasense as ds
>>> ds.print_list_by_item(list_to_print=my_list_to_print) 
datasense.munging.process_columns(*, df: DataFrame) tuple[pandas.core.frame.DataFrame, int, int, int, list[str], list[str], list[str], int, list[str], int, list[str], int, list[str], int, list[str], int, list[str], int, list[str], int]

Return a DataFrame without empty columns and ensure all column labels are strings.

  • Create various counts of columns of a DataFrame.

  • Create count of columns (columns_in_count)

  • Create count and list of empty columns (columns_empty_count, columns_empty_list)

  • Create count and list of non-empty columns (columns_non_empty_count, columns_non_empty_list)

  • Delete empty columns

  • Create count and list of boolean columns (columns_bool_count, columns_bool_list)

  • Create count and list of category columns (columns_category_count, columns_category_list)

  • Create count and list of datetime columns (columns_datetime_count, columns_datetime_list)

  • Create count and list of float columns (columns_float_count, columns_float_list)

  • Create count and list of integer columns (columns_integer_count, columns_integer_list)

  • Create count and list of object columns (columns_object_count, columns_object_list)

  • Create count and list of timedelta columns (columns_timedelta_count, columns_timedelta_list)

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

Return a DataFrame without empty columns and ensure all column labels are strings.

  • df: pd.DataFrame

    The output DataFrame.

  • columns_in_count: int

    The count of columns.

  • columns_non_empty_count: int

    The count of non-empty columns.

  • columns_empty_count: int

    The count of empty columns.

  • columns_empty_list: list[str]

    The list of empty columns.

  • columns_non_empty_list: list[str]

    The list of non-empty columns.

  • columns_bool_list: list[str]

    The list of boolean columns.

  • columns_bool_count: int

    The count of boolean columns.

  • columns_float_list: list[str]

    The list of float columns.

  • columns_float_count: int

    The count of float columns.

  • columns_integer_list: list[str]

    The list of integer columns.

  • columns_integer_count: int

    The count of integer columns.

  • columns_datetime_list: list[str]

    The list of datetime columns.

  • columns_datetime_count: int

    The count of datetime columns.

  • columns_object_list: list[str]

    The list of object columns.

  • columns_object_count: int

    The count of object columns.

  • columns_category_list: list[str]

    The list of category columns.

  • columns_category_count: int

    The count of category columns.

  • columns_timedelta_list: list[str]

    The list of timedelta columns.

  • columns_timedelta_count: int

    The count of timedelta columns.

Return type:

tuple[pd.DataFrame, int, int, int, list[str], list[str], list[str], int, list[str], int, list[str], int, list[str], int, list[str], int, list[str], int, list[str], int]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> (
...     df, columns_in_count, columns_non_empty_count, columns_empty_count,
...     columns_empty_list, columns_non_empty_list, columns_bool_list,
...     columns_bool_count, columns_float_list, columns_float_count,
...     columns_integer_list, columns_integer_count,
...     columns_datetime_list, columns_datetime_count,
...     columns_object_list, columns_object_count, columns_category_list,
...     columns_category_count, columns_timedelta_list,
...     columns_timedelta_count
... ) = ds.process_columns(df=df)
columns_in_count       : 12
columns_non_empty_count: 12
columns_empty_count    : 0
columns_empty_list     : []
columns_non_empty_list :
    ['a', 'b', 'c', 'd', 'i', 'r', 's', 't', 'u', 'x', 'y', 'z']
columns_bool_list      : ['b']
columns_bool_count     : 1
columns_float_list     : ['a', 'i', 'x', 'z']
columns_float_count    : 4
columns_integer_list   : ['y']
columns_integer_count  : 1
columns_datetime_list  : ['t', 'u']
columns_datetime_count : 2
columns_object_list    : ['r', 's']
columns_object_count   : 2
columns_category_list  : ['c']
columns_category_count : 1
columns_timedelta_list : ['d']
columns_timedelta_count: 1
datasense.munging.process_rows(*, df: DataFrame) tuple[pandas.core.frame.DataFrame, int, int, int]

Return a DataFrame without duplicate rows.

Create various counts of rows of a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

A tuple of a DataFrame without duplicate rows, a count of the input rows, a count of the output rows, and a count of the empty rows.

  • dfpd.DataFrame

    The output DataFrame.

  • rows_in_countint

    The count of rows of the input DataFrame.

  • rows_out_countint

    The count of rows of the output DataFrame.

  • rows_empty_countint

    The count of empty rows of the input DataFrame.

Return type:

tuple[pd.DataFrame, int, int, int]

Example

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> df, rows_in_count, rows_out_count, rows_empty_count = (
...     ds.process_rows(df=df)
... )
rows_in_count   : 42
rows_out_count  : 42
rows_empty_count: 0
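
The three counts can be reproduced in plain pandas as sketched below. The sketch removes only duplicate rows and merely counts empty ones, following the description above; whether the library also drops empty rows is not stated, so that is left out. The data are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 2.0],
    "b": ["x", "x", None, "y"],
})

rows_in_count = len(df)
# Rows in which every cell is missing.
rows_empty_count = int(df.isna().all(axis="columns").sum())
# Remove exact duplicate rows, keeping the first occurrence.
df_out = df.drop_duplicates()
rows_out_count = len(df_out)

print(rows_in_count, rows_out_count, rows_empty_count)
```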
datasense.munging.quit_sap_excel() None

Several applications, Excel in particular, need to be closed; otherwise they may cause a function to crash.

Example

>>> import datasense as ds
>>> ds.quit_sap_excel()
datasense.munging.read_file(*, file_name: str | Path, header: int | list[int] | None = 0, skiprows: list[int] | None = None, column_names_dict: dict[str, str] = {}, index_columns: list[str] = [], usecols: list[str] | None = None, dtype: dict | None = None, converters: dict | None = None, parse_dates: list[str | int] | dict | bool = False, datetime_format: str | None = None, time_delta_columns: list[str] = [], category_columns: list[str] = [], integer_columns: list[str] = [], float_columns: list[str] = [], boolean_columns: list[str] = [], object_columns: list[str] = [], sort_columns: list[str] = [], sort_columns_bool: list[bool] = [], sheet_name: str = False, nrows: int | None = None, skip_blank_lines: bool = True, encoding: str = 'utf-8') DataFrame

Create a DataFrame from an external file.

  • read csv | read CSV

  • read ods | read ODS

  • read Excel: read xlsx | read XLSX | read xlsm | read XLSM

  • read feather

Parameters:
  • file_name (str | Path) – The name of the file to read.

  • header (int | list[int] | None = 0) – The row to use for the column labels. Use None if there is no header.

  • skiprows (list[int] | None = None) – The specific row indices to skip.

  • column_names_dict (dict[str, str] = {}) – The new column names to replace the old column names.

  • index_columns (list[str] = []) – The columns to use for the DataFrame index.

  • usecols (list[str] | None = None) – The columns to read.

  • dtype (dict | None = None) – A dictionary of column names and dtypes. NOTE: Nullable Boolean data type is experimental and does not work; use .astype() on df after created.

  • converters (dict | None = None) – Dictionary of functions for converting values in certain columns.

  • parse_dates (list[str | int] | dict | bool = False) – The columns to use to parse date and time.

  • date_format (str | dict = None) – If used in conjunction with parse_dates, will parse dates according to this format.

  • datetime_format (str | None = None) – The str to use for formatting date and time.

  • time_delta_columns (list[str] = []) – The columns to change to dtype timedelta.

  • category_columns (list[str] = []) – The columns to change to dtype category.

  • integer_columns (list[str] = []) – The columns to change to dtype integer.

  • float_columns (list[str] = []) – The columns to change to dtype float.

  • boolean_columns (list[str] = []) – The columns to change to dtype boolean.

  • object_columns (list[str] = []) – The columns to change to dtype object.

  • sort_columns (list[str] = []) – The columns on which to sort the DataFrame.

  • sort_columns_bool (list[bool] = []) – The booleans for sort_columns.

  • sheet_name (str = False) – The name of the worksheet in the workbook.

  • nrows (int | None = None) – The number of rows to read.

  • skip_blank_lines (bool = True) – If True, skip over blank lines rather than interpreting as NaN values.

  • encoding (str = "utf-8") – Encoding to use for UTF when reading.

Returns:

df – The DataFrame created from the external file.

Return type:

pd.DataFrame

Examples

Create a data file for the examples.

>>> import datasense as ds
>>> file_name = 'myfile.csv'
>>> df = ds.create_dataframe()
>>> df.columns
Index(['a', 'b', 'c', 'd', 'i', 'r', 's', 't', 'u', 'x', 'y', 'z'], dtype='object')
>>> df.dtypes
a            float64
b            boolean
c           category
d    timedelta64[ns]
i            float64
r             object
s             object
t     datetime64[ns]
u     datetime64[ns]
x            float64
y              Int64
z            float64
dtype: object
>>> ds.save_file(
...     df=df,
...     file_name=file_name
... )

Read a csv file. There is no guarantee the column dtypes will be correct. Only [a, i, s, x, z] have the correct dtypes.

>>> file_name = "myfile.csv"
>>> df = ds.read_file(file_name=file_name) 
>>> df.dtypes 
a    float64
b       bool
c     object
d     object
i    float64
r      int64
s     object
t     object
u     object
x    float64
y      int64
z    float64
dtype: object

Read a csv file. Ensure the dtypes of datetime columns.

>>> parse_dates = ['t', 'u']
>>> file_name = "myfile.csv"
>>> df = ds.read_file(
...     file_name=file_name,
...     parse_dates=parse_dates
... ) 
>>> df.dtypes
a           float64
b              bool
bn          boolean
c          category
cs         category
d   timedelta64[ns]
i             int64
r            object
s            object
t    datetime64[ns]
u    datetime64[ns]
x           float64
y             int64
yn            Int64
z           float64
dtype: object
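
The loss of dtypes on a CSV round trip, and its recovery with parse_dates and dtype, can be reproduced with pandas alone. The sketch below calls pandas.read_csv on an in-memory buffer rather than ds.read_file, so it illustrates the underlying behaviour only; the small DataFrame is invented.

```python
import io

import pandas as pd
from pandas.api import types as pdt

df = pd.DataFrame({
    "c": pd.Categorical(["small", "large"]),
    "t": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "x": [1.5, 2.5],
})
csv_text = df.to_csv(index=False)

# Without hints, the category and datetime columns come back as text.
plain = pd.read_csv(io.StringIO(csv_text))
# With hints, the intended dtypes are restored while reading.
typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["t"],
    dtype={"c": "category"},
)

print(plain.dtypes.astype(str).to_dict())
print(typed.dtypes.astype(str).to_dict())
```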

Read a csv file. Ensure the dtypes of columns; not timedelta, datetime.

>>> convert_dict = {
...     'a': 'float64',
...     'b': 'boolean',
...     'c': 'category',
...     'i': 'float64',
...     'r': 'str',
...     's': 'str',
...     'x': 'float64',
...     'y': 'Int64',
...     'z': 'float64'
... }
>>> df = ds.read_file(
...     file_name=file_name,
...     dtype=convert_dict
... ) 
>>> df.dtypes
a             float64
b                bool
bn            boolean
c            category
cs           category
d     timedelta64[ns]
i               int64
r              object
s              object
t      datetime64[ns]
u      datetime64[ns]
x             float64
y               int64
yn              Int64
z             float64
dtype: object

Read a csv file. Ensure the dtypes of columns. Rename the columns. Set index with another column. Convert float column to integer.

>>> column_names_dict = {
...     'a': 'A',
...     'b': 'B',
...     'c': 'C',
...     'd': 'D',
...     'i': 'I',
...     'r': 'R',
...     's': 'S',
...     't': 'T',
...     'u': 'U',
...     'y': 'Y',
...     'x': 'X',
...     'z': 'Z'
... }
>>> index_columns = ['Y']
>>> parse_dates = ['t', 'u']
>>> time_delta_columns = ['D']
>>> category_columns = ['C']
>>> integer_columns = ['A', 'I']
>>> float_columns = ['X']
>>> boolean_columns = ['R']
>>> object_columns = ['Z']
>>> sort_columns = ['I', 'A']
>>> sort_columns_bool = [True, False]
>>> df = ds.read_file(
...     file_name='myfile.csv',
...     column_names_dict=column_names_dict,
...     index_columns=index_columns,
...     parse_dates=parse_dates,
...     # date_format=date_format,
...     time_delta_columns=time_delta_columns,
...     category_columns=category_columns,
...     integer_columns=integer_columns,
...     float_columns=float_columns,
...     boolean_columns=boolean_columns,
...     object_columns=object_columns,
...     sort_columns=sort_columns,
...     sort_columns_bool=sort_columns_bool
... ) 
>>> data = ds.read_file(
...     file_name=file_name,
...     column_names_dict=column_names_dict,
...     index_columns=index_columns,
...     date_time_columns=date_time_columns,
...     # date_format=date_format,
...     parse_dates=date_time_columns,
...     time_delta_columns=time_delta_columns,
...     category_columns=category_columns,
...     integer_columns=integer_columns
... ) 

Read an ods file.

>>> file_name = 'myfile.ods'
>>> df = ds.create_dataframe()
>>> ds.save_file(
...     df=df,
...     file_name=file_name
... )
>>> parse_dates = ['t', 'u']
>>> df = ds.read_file(
...     file_name=file_name,
...     parse_dates=parse_dates
... ) 
>>> ds.dataframe_info(
...     df=df,
...     file_in=file_name
... ) 

Read an xlsx file.

>>> file_name = 'myfile.xlsx'
>>> sheet_name = 'raw_data'
>>> df = ds.read_file(
...     file_name=file_name,
...     sheet_name=sheet_name
... ) 
>>> ds.dataframe_info(
...     df=df,
...     file_in=file_name
... ) 

Read a feather file.

>>> from pathlib import Path
>>> file_to_read = 'myfeatherfile.feather'
>>> path = Path(file_to_read)
>>> df = ds.read_file(file_name=path)

Read a feather file with columns list.

>>> from pathlib import Path
>>> file_to_read = 'myfeatherfile.feather'
>>> usecols = ['col1', 'col2']
>>> path = Path(file_to_read)
>>> df = ds.read_file(
...     file_name=path,
...     usecols=usecols
... ) 

XLSB support was removed because Arch Linux does not support it. The following example is retained for historical purposes and in case Arch Linux supports it in the future.

Read an xlsb file.

>>> file_name = 'myfile.xlsb'
>>> sheet_name = 'raw_data'
>>> df = ds.read_file(
...     file_name=file_name,
...     sheet_name=sheet_name
... )
>>> ds.dataframe_info(
...     df=df,
...     file_in=file_name
... )

Notes

The parameter “date_format” will be made available as soon as Arch Linux updates pandas to version 2.xx.

datasense.munging.remove_punctuation(*, list_dirty: list[str]) list[str]

Remove punctuation from list items.

Parameters:

list_dirty (list[str]) – The list of items containing punctuation.

Returns:

list_clean – The list of items without punctuation.

Return type:

list[str]

Example

>>> import datasense as ds
>>> list_clean = ds.remove_punctuation(list_dirty=list_dirty)
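One way to picture the behaviour described above is a minimal standard-library sketch. The helper name `strip_punctuation` is hypothetical, and the sketch assumes "punctuation" means the ASCII characters in `string.punctuation`; datasense may define it differently.

```python
import string


# Hypothetical helper for illustration; not the datasense implementation.
def strip_punctuation(items: list[str]) -> list[str]:
    """Return a copy of items with ASCII punctuation characters removed."""
    # Build a translation table that deletes every punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return [item.translate(table) for item in items]


list_dirty = ["Hello, world!", "part-no. 42", "clean"]
list_clean = strip_punctuation(list_dirty)
print(list_clean)  # ['Hello world', 'partno 42', 'clean']
```

Whitespace and digits are left untouched; only characters listed in `string.punctuation` are dropped.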
datasense.munging.rename_all_columns(*, df: DataFrame, labels: list[str]) DataFrame

Rename all DataFrame columns.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • labels (list[str]) – The list of all column names.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.rename_all_columns(
...     df=df,
...     labels=labels
... ) 
datasense.munging.rename_directory(*, sources: list[str], destinations: list[str], ignore_errors: bool = True) None

Delete destination directories (if present) and rename source directories to the destination directories.

Parameters:
  • sources (list[str]) – The old directories.

  • destinations (list[str]) – The new directories.

  • ignore_errors (bool = True) – Boolean to deal with errors.

Example

>>> import datasense as ds
>>> sources = ['old_directory']
>>> destinations = ['new_directory']
>>> ds.rename_directory(sources=sources, destinations=destinations)
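The delete-then-rename sequence described above can be sketched with the standard library alone. The helper name `replace_directories` is hypothetical, and mapping `ignore_errors` onto `shutil.rmtree` is an assumption about how the flag might be used.

```python
import shutil
import tempfile
from pathlib import Path


# Hypothetical helper for illustration; not the datasense implementation.
def replace_directories(
    sources: list[str],
    destinations: list[str],
    ignore_errors: bool = True
) -> None:
    """Delete each destination if present, then rename its source to it."""
    for source, destination in zip(sources, destinations):
        # With ignore_errors=True a missing destination is silently skipped.
        shutil.rmtree(destination, ignore_errors=ignore_errors)
        Path(source).rename(destination)


# Demonstrate inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old_directory"
    new = Path(tmp) / "new_directory"
    old.mkdir()
    (old / "data.txt").write_text("kept")
    new.mkdir()
    (new / "stale.txt").write_text("discarded")
    replace_directories(sources=[str(old)], destinations=[str(new)])
    remaining = sorted(p.name for p in new.iterdir())
    old_exists = old.exists()

print(remaining, old_exists)  # ['data.txt'] False
```

Because the destination is deleted first, anything already in it is lost; only the contents of the source survive under the new name.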
datasense.munging.rename_some_columns(*, df: DataFrame, column_names_dict: dict[str, str]) DataFrame

Rename some columns with a dictionary.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • column_names_dict (dict[str, str]) – The dictionary of old:new column names.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.rename_some_columns(
...     df=df,
...     column_names_dict=column_names_dict
... ) 
datasense.munging.replace_column_values(*, s: Series, replace_dict: dict[str, str] | dict[int, int] | dict[float, float], regex: bool = False) Series

Replace values in a series using a dictionary.

Parameters:
  • s (pd.Series) – The input series.

  • replace_dict (dict[str, str] | dict[int, int] | dict[float, float]) – The dictionary of values to replace.

  • regex (bool = False) – Determines if the passed-in pattern is a regular expression.

Returns:

s – The output series.

Return type:

pd.Series

Example

>>> import datasense as ds
>>> s = ds.replace_column_values(
...     s=s,
...     replace_dict=replace_dict
... ) 
datasense.munging.replace_text_numbers(*, df: DataFrame, columns: list[str] | list[int] | list[float] | list[Pattern[str]], old: list[str] | list[int] | list[float] | list[Pattern[str]], new: list[int], regex: bool = True) DataFrame

Replace text or numbers with text or numbers.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • columns (list[str] | list[int] | list[float] | list[Pattern[str]]) – The list of columns for replacement.

  • old (list[str] | list[int] | list[float] | list[Pattern[str]]) – The list of items to replace.

  • new (list[int]) – The list of replacement items.

  • regex (bool = True) – Determines if the passed-in pattern is a regular expression.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Examples

>>> import datasense as ds
>>> list_y_1_n_5 = [
...     'Q01', 'Q02', 'Q03', 'Q04', 'Q05', 'Q06', 'Q10', 'Q17', 'Q18',
...     'Q19', 'Q20', 'Q21', 'Q23', 'Q24', 'Q25'
... ]
>>> list_y_5_n_1 = [
...     'Q07', 'Q11', 'Q12', 'Q13', 'Q15', 'Q16'
... ]
>>> data = ds.replace_text_numbers(
...     df=data,
...     columns=list_y_1_n_5,
...     old=['Yes', 'No'],
...     new=[1, 5],
...     regex=False
... ) 
>>> data = ds.replace_text_numbers(
...     df=data,
...     columns=list_y_5_n_1,
...     old=['Yes', 'No'],
...     new=[5, 1],
...     regex=False
... ) 
>>> data = ds.replace_text_numbers(
...     df=data,
...     columns=['Q23'],
...     old=[r' '],
...     new=[r' '],
...     regex=True
... ) 
>>> data = ds.replace_text_numbers(
...     df=data,
...     columns=['address_country'],
...     old=[
...         'AD', 'AE', 'AF', 'AG',
...         'AI', 'AL', 'AM', 'AN',
...         'AO', 'AQ', 'AR', 'AS',
...         'AT', 'AU', 'AW', 'AZ',
...     ],
...     new=[
...         'Andorra', 'Unit.Arab Emir.', 'Afghanistan', 'Antigua/Barbuda',
...         'Anguilla', 'Albania', 'Armenia', 'Niederl.Antill.',
...         'Angola', 'Antarctica', 'Argentina', 'Samoa,American',
...         'Austria', 'Australia', 'Aruba', 'Azerbaijan',
...     ],
...     regex=False
... ) 
datasense.munging.save_file(*, df: DataFrame | Series, file_name: str | Path, index: bool = False, index_label: str = None, sheet_name: str = 'sheet_001', encoding: str = 'utf-8') None

Save a DataFrame or Series to a file.

Parameters:
  • df (pd.DataFrame | pd.Series) – The DataFrame or Series to be saved to a file.

  • file_name (str | Path) – The name of the file to be saved.

  • index (bool = False) – If True, creates an index.

  • index_label (str = None) – The index label.

  • sheet_name (str = 'sheet_001') – The name of the worksheet in the workbook.

  • encoding (str = "utf-8") – Encoding to use for UTF when writing.

Examples

>>> import datasense as ds
>>> df = ds.create_dataframe()
>>> ds.save_file(
...     df=df,
...     file_name='x_y.csv'
... )
>>> ds.save_file(
...     df=df,
...     file_name='x_y.csv',
...     index=True
... )
>>> ds.save_file(
...     df=df,
...     file_name='x_y.xlsx'
... )
>>> ds.save_file(
...     df=df,
...     file_name='x_y.xlsx',
...     index=True,
...     sheet_name='sheet_one'
... )
>>> from pathlib import Path
>>> file_to_save = 'myfeatherfile.feather'
>>> path = Path(file_to_save)
>>> ds.save_file(
...     df=df,
...     file_name=path
... )
datasense.munging.series_memory_usage(s: Series, suffix: str = 'B') str

Determine the memory usage of a pandas Series.

Parameters:
  • s (pd.Series) – A pandas Series.

  • suffix (str = "B") – The units of the memory usage.

Returns:

memory_usage – A string with the value and units of memory usage.

Return type:

str

Example

>>> import datasense as ds
>>> memory_usage = ds.series_memory_usage(
...     s=s,
...     suffix="B"
... ) 
datasense.munging.series_replace_string(*, series: Series, find: str, replace: str, regex: bool = True) Series

Find and replace a string in a series.

Parameters:
  • series (pd.Series) – The input series of data.

  • find (str) – The string to find.

  • replace (str) – The replacement string.

  • regex (bool = True) – Determines if the passed-in pattern is a regular expression.

Returns:

series – The output series of data.

Return type:

pd.Series

Example

>>> import datasense as ds
>>> df[column] = ds.series_replace_string(
...     series=df[column],
...     find='find this text',
...     replace='replace with this text'
... ) 
datasense.munging.sort_rows(*, df: DataFrame, sort_columns: list[str], sort_columns_bool: list[bool], kind: str = 'mergesort') DataFrame

Sort a DataFrame for one or more columns.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • sort_columns (list[str]) – The sort columns.

  • sort_columns_bool (list[bool]) – The booleans for sort_columns: True = ascending, False = descending.

  • kind (str = 'mergesort') – The sort algorithm.

Returns:

df – The output DataFrame.

Return type:

pd.DataFrame

Example

>>> import datasense as ds
>>> df = ds.sort_rows(
...     df=df,
...     sort_columns=sort_columns,
...     sort_columns_bool=sort_columns_bool,
...     kind='mergesort'
... ) 
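The default `kind='mergesort'` is a stable algorithm, which matters when sorting on several columns with mixed directions. The following standard-library sketch shows the idea on a list of dictionaries. The helper name `sort_records` is hypothetical; it relies on Python's built-in `sorted`, which is also stable, and is not the datasense implementation.

```python
from operator import itemgetter


# Hypothetical helper for illustration; not the datasense implementation.
def sort_records(
    records: list[dict],
    sort_columns: list[str],
    sort_columns_bool: list[bool]
) -> list[dict]:
    """Sort records by several keys, each ascending (True) or descending."""
    result = list(records)
    # Sort by the least significant key first. Because sorted() is stable,
    # each later pass keeps the order set by earlier passes among ties.
    passes = list(zip(sort_columns, sort_columns_bool))
    for column, ascending in reversed(passes):
        result = sorted(result, key=itemgetter(column), reverse=not ascending)
    return result


records = [
    {"I": 2, "A": 1.0},
    {"I": 1, "A": 3.0},
    {"I": 1, "A": 5.0},
    {"I": 2, "A": 4.0},
]
ordered = sort_records(records, ["I", "A"], [True, False])
print([(r["I"], r["A"]) for r in ordered])
# [(1, 5.0), (1, 3.0), (2, 4.0), (2, 1.0)]
```

With an unstable algorithm the order of rows that tie on the primary key would not be guaranteed between passes.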

datasense.process_capability module

Process capability refers to the ability of a process to meet a performance standard (specification). A process is capable if:

  • Specifications are defined and attainable.

  • Measurements can be made sufficiently well.

  • Samples are representative.

  • Process variation is stable and predictable.

  • The process is on target with minimum dispersion.

datasense.process_capability.cp(average: float | int, std_devn: float | int, subgroup_size: int, number_subgroups: int, lower_spec: float | int, upper_spec: float | int, alpha: float = 0.05) tuple[float, float, float]

Cp compares the width of the process specification to the width of the process variation. It does not take into consideration the deviation from the average. It “assumes” the process is centred between the specification limits. The standard deviation estimate is taken from a range or moving range control chart.

Parameters:
  • average (float | int,) – The average of the process.

  • std_devn (float | int,) – The standard deviation of the process. It should be the “sample standard deviation”.

  • subgroup_size (int,) – This is the number of values in a control chart subgroup.

  • number_subgroups (int,) – This is the number of subgroups.

  • lower_spec (float | int,) – The lower specification value.

  • upper_spec (float | int,) – The upper specification value.

  • alpha (float = 0.05) – The alpha value for the confidence interval calculations. An alpha of 0.05 is used for a 95 % confidence interval.

Returns:

A tuple of the capability, the lower confidence bound, and the upper confidence bound.

  • capability (float) – The Cp process capability value.

  • lower_bound (float) – The lower value of the confidence interval for Cp.

  • upper_bound (float) – The upper value of the confidence interval for Cp.

Return type:

tuple[float, float, float]

Example

>>> import datasense as ds
>>> average = 0.11001
>>> std_devn = 0.89312
>>> subgroup_size = 1
>>> number_subgroups = 39
>>> lower_spec = -4
>>> upper_spec = 4
>>> alpha = 0.05
>>> result = ds.cp(
...     average=average,
...     std_devn=std_devn,
...     subgroup_size=subgroup_size,
...     number_subgroups=number_subgroups,
...     lower_spec=lower_spec,
...     upper_spec=upper_spec,
...     alpha=alpha
... )
>>> result
(1.4928938253911381, 1.141174267641542, 1.8439148118984439)
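The first value in the tuple above is consistent with the textbook Cp point estimate, the specification width divided by six standard deviations. The sketch below assumes that formula; the helper name `cp_point_estimate` is hypothetical and the confidence bounds are not reproduced.

```python
# Hypothetical helper for illustration; not the datasense implementation.
def cp_point_estimate(
    std_devn: float,
    lower_spec: float,
    upper_spec: float
) -> float:
    """Textbook Cp: specification width over six standard deviations."""
    return (upper_spec - lower_spec) / (6 * std_devn)


capability = cp_point_estimate(std_devn=0.89312, lower_spec=-4, upper_spec=4)
print(round(capability, 4))  # 1.4929
```

The process average does not appear in the formula, which is why Cp says nothing about how well the process is centred.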
datasense.process_capability.cpk(average: float | int, std_devn: float | int, subgroup_size: int, number_subgroups: int, lower_spec: float | int, upper_spec: float | int, alpha: float = 0.05, toler: float | int = 6) tuple[float, float, float, float, float]

Cpk compares the width of the process specification to the width of the process variation. It takes into consideration the deviation from the average. The standard deviation estimate is taken from a range or moving range control chart.

Parameters:
  • average (float | int,) – The average of the process.

  • std_devn (float | int,) – The standard deviation of the process. It should be the “sample standard deviation”.

  • subgroup_size (int,) – This is the number of values in a control chart subgroup.

  • number_subgroups (int,) – This is the number of subgroups.

  • lower_spec (float | int,) – The lower specification value.

  • upper_spec (float | int,) – The upper specification value.

  • alpha (float = 0.05) – The alpha value for the confidence interval calculations. An alpha of 0.05 is used for a 95 % confidence interval.

  • toler (float | int = 6) – The multiplier of the standard deviation tolerance.

Returns:

A tuple of the capability, the lower Cpk, the upper Cpk, the lower confidence bound, and the upper confidence bound.

  • capability (float) – The Cpk process capability value.

  • cpk_lower (float) – The Cpk value for left of the average.

  • cpk_upper (float) – The Cpk value for right of the average.

  • lower_bound (float) – The lower value of the confidence interval for Cpk.

  • upper_bound (float) – The upper value of the confidence interval for Cpk.

Return type:

tuple[float, float, float, float, float]

Example

>>> import datasense as ds
>>> average = 0.11001
>>> std_devn = 0.89312
>>> subgroup_size = 2
>>> number_subgroups = 39
>>> lower_spec = -4
>>> upper_spec = 4
>>> alpha = 0.05
>>> result = ds.cpk(
...     average=average,
...     std_devn=std_devn,
...     subgroup_size=subgroup_size,
...     number_subgroups=number_subgroups,
...     lower_spec=lower_spec,
...     upper_spec=upper_spec,
...     alpha=alpha,
...     toler=6,
... )
>>> result
(
    1.4518355129583185, 1.533952137823958, 1.4518355129583185,
    1.0928917337156085, 1.8107792922010284
)
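The first three values in the tuple above are consistent with the textbook one-sided capability ratios, each using three standard deviations, with Cpk taken as the smaller of the two. The sketch below assumes those formulas; the helper name `cpk_point_estimates` is hypothetical, and it ignores `toler` and the confidence bounds.

```python
# Hypothetical helper for illustration; not the datasense implementation.
def cpk_point_estimates(
    average: float,
    std_devn: float,
    lower_spec: float,
    upper_spec: float
) -> tuple[float, float, float]:
    """Return (Cpk, lower-side ratio, upper-side ratio)."""
    # Distance from the average to each specification limit,
    # expressed in units of three standard deviations.
    cpk_lower = (average - lower_spec) / (3 * std_devn)
    cpk_upper = (upper_spec - average) / (3 * std_devn)
    return min(cpk_lower, cpk_upper), cpk_lower, cpk_upper


capability, cpk_lower, cpk_upper = cpk_point_estimates(
    average=0.11001,
    std_devn=0.89312,
    lower_spec=-4,
    upper_spec=4
)
print(round(capability, 4), round(cpk_lower, 4), round(cpk_upper, 4))
# 1.4518 1.534 1.4518
```

Because the average sits slightly above the midpoint of the specification, the upper side is the tighter one and determines Cpk.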
datasense.process_capability.cpm(average: float | int, std_devn: float | int, sample_size: int, target: float | int, lower_spec: float | int, upper_spec: float | int, alpha: float = 0.05) tuple[float, float]

Ppk and Cpk calculate process capability with respect to the deviation from the average. If a process average is not equal to the specification target, the process capability is not as good as one would assume. Cpm calculates process capability with respect to both the deviation from the average and the deviation from the target. The Cpm formula is closely related to the Taguchi Loss Function. The standard deviation uses the “population standard deviation.”

Parameters:
  • average (float | int,) – The average of the process.

  • std_devn (float | int,) – The standard deviation of the process. It should be the “sample standard deviation”.

  • sample_size (int,) – This is the sample size for the data being analysed.

  • target (float) – It is the target value of the product stream.

  • lower_spec (float | int,) – The lower specification value.

  • upper_spec (float | int,) – The upper specification value.

  • alpha (float = 0.05) – The alpha value for the confidence interval calculations. An alpha of 0.05 is used for a 95 % confidence interval.

Returns:

A tuple of the capability and the lower confidence bound.

  • capability (float) – The Cpm process capability value.

  • lower_bound (float) – The lower value of the confidence interval for Cpm.

Return type:

tuple[float, float]

Example

>>> import datasense as ds
>>> average = 0.11001
>>> std_devn = 0.868663
>>> sample_size = 40
>>> target = 0
>>> lower_spec = -4
>>> upper_spec = 4
>>> alpha = 0.05
>>> result = ds.cpm(
...     average=average,
...     std_devn=std_devn,
...     sample_size=sample_size,
...     target=target,
...     lower_spec=lower_spec,
...     upper_spec=upper_spec,
...     alpha=alpha
... )
>>> result
(1.5227631097133512, 1.2396924251472865)
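The first value in the tuple above is consistent with the textbook Cpm point estimate, in which the standard deviation is inflated by the offset between the average and the target. The sketch below assumes that formula; the helper name `cpm_point_estimate` is hypothetical and the lower confidence bound is not reproduced.

```python
import math


# Hypothetical helper for illustration; not the datasense implementation.
def cpm_point_estimate(
    average: float,
    std_devn: float,
    target: float,
    lower_spec: float,
    upper_spec: float
) -> float:
    """Textbook Cpm: specification width over six target-adjusted sigmas."""
    # Combine process spread with the squared offset from the target.
    spread = math.sqrt(std_devn ** 2 + (average - target) ** 2)
    return (upper_spec - lower_spec) / (6 * spread)


capability = cpm_point_estimate(
    average=0.11001,
    std_devn=0.868663,
    target=0,
    lower_spec=-4,
    upper_spec=4
)
print(round(capability, 4))  # 1.5228
```

When the average equals the target, the offset term vanishes and the expression reduces to the specification width over six standard deviations.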
datasense.process_capability.pp(average: float | int, std_devn: float | int, sample_size: int, lower_spec: float | int, upper_spec: float | int, alpha: float = 0.05) tuple[float, float, float]

Pp compares the width of the process specification to the width of the process variation. It does not take into consideration the deviation from the average. It “assumes” the process is centred between the specification limits. The standard deviation uses the “sample standard deviation” formula.

Parameters:
  • average (float | int,) – The average of the process.

  • std_devn (float | int,) – The standard deviation of the process. It should be the “sample standard deviation”.

  • sample_size (int,) – This is the sample size for the data being analysed.

  • lower_spec (float | int,) – The lower specification value.

  • upper_spec (float | int,) – The upper specification value.

  • alpha (float = 0.05) – The alpha value for the confidence interval calculations. An alpha of 0.05 is used for a 95 % confidence interval.

Returns:

A tuple of the capability, the lower confidence bound, and the upper confidence bound.

  • capability (float) – The Pp process capability value.

  • lower_bound (float) – The lower value of the confidence interval for Pp.

  • upper_bound (float) – The upper value of the confidence interval for Pp.

Return type:

tuple[float, float, float]

Example

>>> import datasense as ds
>>> average = 0.11001
>>> std_devn = 0.868663
>>> sample_size = 40
>>> lower_spec = -4
>>> upper_spec = 4
>>> alpha = 0.05
>>> result = ds.pp(
...     average=average,
...     std_devn=std_devn,
...     sample_size=sample_size,
...     lower_spec=lower_spec,
...     upper_spec=upper_spec,
...     alpha=alpha
... )
>>> result
(1.5349258956964131, 1.1953921108301047, 1.873778000024199)
datasense.process_capability.ppk(average: float | int, std_devn: float | int, sample_size: int, lower_spec: float | int, upper_spec: float | int, alpha: float = 0.05, toler: float | int = 6) tuple[float, float, float, float, float]

Ppk compares the width of the process specification to the width of the process variation. It does take into consideration the deviation from the average. The standard deviation uses the “sample standard deviation” formula.

Parameters:
  • average (float | int,) – The average of the process.

  • std_devn (float | int,) – The standard deviation of the process. It should be the “sample standard deviation”.

  • sample_size (int,) – This is the sample size for the data being analysed.

  • lower_spec (float | int,) – The lower specification value.

  • upper_spec (float | int,) – The upper specification value.

  • alpha (float = 0.05) – The alpha value for the confidence interval calculations. An alpha of 0.05 is used for a 95 % confidence interval.

  • toler (float | int = 6) – The multiplier of the standard deviation tolerance.

Returns:

A tuple of the capability, the lower Ppk, the upper Ppk, the lower confidence bound, and the upper confidence bound.

  • capability (float) – The Ppk process capability value.

  • ppk_lower (float) – The Ppk value for left of the average.

  • ppk_upper (float) – The Ppk value for right of the average.

  • lower_bound (float) – The lower value of the confidence interval for Ppk.

  • upper_bound (float) – The upper value of the confidence interval for Ppk.

Return type:

tuple[float, float, float, float, float]

Example

>>> import datasense as ds
>>> average = 0.11001
>>> std_devn = 0.868663
>>> sample_size = 40
>>> lower_spec = -4
>>> upper_spec = 4
>>> alpha = 0.05
>>> result = ds.ppk(
...     average=average,
...     std_devn=std_devn,
...     sample_size=sample_size,
...     lower_spec=lower_spec,
...     upper_spec=upper_spec,
...     alpha=alpha,
...     toler=6
... )
>>> result
(
    1.4927115962500226, 1.5771401951428037, 1.4927115962500226,
    1.1457133294762083, 1.8397098630238369
)

datasense.pyxl module

openpyxl functions

datasense.pyxl.autofit_column_width(*, ws: Worksheet, extra_width: int) Worksheet

Autofit all columns in a worksheet.

Parameters:
  • ws (Worksheet) – The worksheet in which to autofit all columns.

  • extra_width (int) – An integer to add extra width so that the column edges are not flush.

Returns:

ws – The worksheet in which autofit was applied to all columns

Return type:

Worksheet

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.autofit_column_width(
...     ws=ws,
...     extra_width=7
... )
datasense.pyxl.cell_fill_down(*, ws: Worksheet, min_row: int, max_row: int, min_col: int, max_col: int) Worksheet

Fill empty cells with the value from the cell above.

Parameters:
  • ws (Worksheet) – The worksheet in which to fill down cells.

  • min_row (int) – The first row in the range to change.

  • max_row (int) – The last row in the range to change.

  • min_col (int) – The first column in the range to change.

  • max_col (int) – The last column in the range to change.

Returns:

ws – The worksheet in which cells were modified.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> fill_down_columns = ["col1", "col2", "col3"]
>>> for column in fill_down_columns:
...     ws = ds.cell_fill_down(
...         ws=ws,
...         min_row=2,
...         max_row=ws.max_row,
...         min_col=1,
...         max_col=3
...     )
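The fill-down behaviour can be pictured with a small standard-library sketch that works on a list of rows instead of a worksheet. The helper name `fill_down` is hypothetical, and treating only `None` as "empty" is an assumption; the datasense function may also handle other empty values.

```python
# Hypothetical helper for illustration; not the datasense implementation.
def fill_down(rows: list[list], columns: list[int]) -> list[list]:
    """Copy the value from the cell above into each empty (None) cell."""
    # Work on a copy so the caller's rows are left unchanged.
    filled = [list(row) for row in rows]
    for row_index in range(1, len(filled)):
        for column in columns:
            if filled[row_index][column] is None:
                filled[row_index][column] = filled[row_index - 1][column]
    return filled


rows = [
    ["north", "a", 1],
    [None, "b", 2],
    [None, None, 3],
    ["south", "c", 4],
]
filled = fill_down(rows, columns=[0, 1])
print(filled)
# [['north', 'a', 1], ['north', 'b', 2], ['north', 'b', 3], ['south', 'c', 4]]
```

Rows are processed top to bottom, so a run of consecutive empty cells all receive the last non-empty value above them.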
datasense.pyxl.cell_style(*, wb: Workbook, style_name: str = 'cell_style', font_name: str = 'Lucida Sans', font_size: int = 11, font_bold: bool = True, font_colour: str = '000000', horizontal_alignment: str = 'center', vertical_alignment: str = 'center', wrap_text: str | bool = None, fill_type: str | bool = 'solid', foreground_colour: str | bool = 'd9d9d9', border_style: str | bool = None, border_colour: str | bool = None, number_format: str | bool = None) NamedStyle

Define a cell style.

Parameters:
  • wb (Workbook) – The workbook in which to define the cell style.

  • style_name (str = 'cell_style') – The name for the cell style.

  • font_name (str = 'Lucida Sans') – The font name for the style.

  • font_size (int = 11) – The font size for the style.

  • font_bold (bool = True) – A boolean or string to apply bold style.

  • font_colour (str = '000000') – The string for the font colour.

  • horizontal_alignment (str = 'center') – The string for horizontal alignment.

  • vertical_alignment (str = 'center') – The string for vertical alignment.

  • wrap_text (str | bool = None) – A boolean or string to wrap text.

  • fill_type (str | bool = 'solid') – The string for the fill type.

  • foreground_colour (str | bool = 'd9d9d9') – The string for the foreground colour.

  • border_style (str | bool = None) – A boolean or string to apply a border.

  • border_colour (str | bool = None) – A boolean or string to apply a border colour.

  • number_format (str | bool = None) – A boolean or string to apply a number format.

Returns:

row_style – The named style.

Return type:

NamedStyle

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> red_cell_style = ds.cell_style(
...     wb=wb,
...     style_name='red_cell_style',
...     font_colour='ffffff',
...     foreground_colour='c00000'
... )
>>> wb.add_named_style(red_cell_style) 
>>> for cell in ['C1', 'D1', 'E1']:
...     ws[cell].style = red_cell_style 
datasense.pyxl.change_case_worksheet_columns(*, ws: Worksheet, min_col: int, max_col: int, min_row: int, max_row: int, case: str = 'upper') Worksheet

Change case for one or more worksheet columns.

Parameters:
  • ws (Worksheet) – The worksheet in which to change the case of column(s).

  • min_col (int) – The first column in the range to change.

  • max_col (int) – The last column in the range to change.

  • min_row (int) – The first row in the range to change.

  • max_row (int) – The last row in the range to change.

  • case (str = 'upper') – The case to change. Currently only upper or lower.

Returns:

ws – A worksheet from a workbook.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.change_case_worksheet_columns(
...     ws=ws,
...     min_col=4,
...     max_col=6,
...     min_row=1,
...     max_row=ws.max_row,
...     case='upper'
... )
datasense.pyxl.exit_script(*, original_stdout: IO[str], output_url: str) None

Exit from a script and complete the html file.

Parameters:
  • original_stdout (IO[str]) – The original stdout.

  • output_url (str) – The output url.

Example

>>> import datasense as ds
>>> output_url = '../tests/my_html_file.html'
>>> original_stdout = ds.html_begin(output_url=output_url)
>>> ds.exit_script(
...     original_stdout=original_stdout,
...     output_url=output_url
... ) 
datasense.pyxl.list_duplicate_worksheet_rows(*, ws: Worksheet, columns_to_ignore: list[int] = None) list[int]

Find duplicate rows in a worksheet.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • columns_to_ignore (list[int] = None) – A list of column numbers to not use in determining duplicate rows.

Returns:

duplicate_rows – A list of duplicate row numbers.

Return type:

list[int]

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> duplicate_rows = ds.list_duplicate_worksheet_rows(ws=ws)
>>> ws = ds.remove_worksheet_rows(
...     ws=ws,
...     rows_to_remove=duplicate_rows
... )
datasense.pyxl.list_empty_and_nan_worksheet_rows(*, ws: Worksheet, min_row: int) list[int]

Create list of row numbers of blank worksheet rows.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • min_row (int) – Start row for iteration.

Returns:

blank_rows – List of row numbers.

Return type:

list[int]

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> sheetname = "sheetname"
>>> ws = wb[sheetname] 
>>> blank_rows = ds.list_empty_and_nan_worksheet_rows(
...     ws=ws,
...     min_row=2
... ) 
datasense.pyxl.list_empty_except_nan_worksheet_rows(*, ws: Worksheet, min_row: int) list[int]

Create list of row numbers of empty worksheet rows, except those with np.nan.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • min_row (int) – Start row for iteration.

Returns:

empty_rows – List of row numbers.

Return type:

list[int]

Example

List the empty rows starting from row 2.

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> empty_rows = ds.list_empty_except_nan_worksheet_rows(
...     ws=ws,
...     min_row=2
... )

datasense.pyxl.list_nan_worksheet_rows(*, ws: Worksheet, min_row: int) list[int]

Create list of row numbers of blank worksheet rows.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • min_row (int) – Start row for iteration.

Returns:

blank_rows – List of row numbers.

Return type:

list[int]

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> sheetname = "sheetname"
>>> ws = wb[sheetname] 
>>> blank_rows = ds.list_nan_worksheet_rows(
...     ws=ws,
...     min_row=2
... )
datasense.pyxl.list_rows_with_content(*, ws: Worksheet, min_row: int, column: int, text: str) list[int]

List rows that contain specific text in a specified column.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • min_row (int) – Start row for iteration.

  • column (int) – The column to search.

  • text (str) – The text to search.

Returns:

A list of row numbers.

Return type:

list[int]

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> rows_with_text = ds.list_rows_with_content(
...     ws=ws,
...     min_row=2,
...     column=11,
...     text='ETA'
... )
datasense.pyxl.number_non_empty_rows(*, ws: Worksheet, column_number: int, start_row: int) int

Determine the number of non-empty rows for a single column.

Parameters:
  • ws (Worksheet) – The worksheet to analyze.

  • column_number (int) – The desired column number.

  • start_row (int) – The row at which to start evaluating cells.

Returns:

row_count – The number of non-empty rows.

Return type:

int

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> start_row = 2
>>> column_number = 1
>>> row_count = ds.number_non_empty_rows(
...     ws=ws,
...     column_number=column_number,
...     start_row=start_row,
... )
datasense.pyxl.read_workbook(*, filename: Path | str, data_only: bool = True) tuple[openpyxl.workbook.workbook.Workbook, list[str]]

Read a workbook, print the Path, and print the sheet names.

Parameters:
  • filename (Path | str) – The file containing the workbook.

  • data_only (bool = True) – If True, read values stored in the cells. If False, read formulae stored in the cells.

Returns:

A tuple of a Workbook and a list of sheet names.

  • wb (Workbook) – A workbook.

  • sheet_names (list[str]) – The sheet names in the workbook.

Return type:

tuple[Workbook, list[str]]

Examples

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> filename = "../tests/excel_file.xlsx"
>>> wb, sheet_names = ds.read_workbook(
...     filename=filename,
...     data_only=True
... ) 
datasense.pyxl.remove_empty_worksheet_rows(*, ws: Worksheet, empty_rows: list[int]) Worksheet

Delete empty worksheet rows.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • empty_rows (list[int]) – List of row numbers.

Returns:

ws – A worksheet from a workbook.

Return type:

Worksheet

Example

Remove empty rows found.

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.remove_empty_worksheet_rows(
...     ws=ws,
...     empty_rows=[5, 6, 7]
... )

datasense.pyxl.remove_worksheet_columns(*, ws: Worksheet, starting_column: int, number_of_columns: int) Worksheet

Remove worksheet columns.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • starting_column (int) – The first column to remove.

  • number_of_columns (int) – The number of columns to remove.

Returns:

ws – A worksheet from a workbook.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.remove_worksheet_columns(
...     ws=ws,
...     starting_column=13,
...     number_of_columns=3
... )
datasense.pyxl.remove_worksheet_rows(*, ws: Worksheet, rows_to_remove: list[int]) Worksheet

Remove worksheet rows.

Parameters:
  • ws (Worksheet) – A worksheet from a workbook.

  • rows_to_remove (list[int]) – The list of row numbers to remove.

Returns:

ws – A worksheet from a workbook.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.remove_worksheet_rows(
...     ws=ws,
...     rows_to_remove=[4, 5, 6]
... )
datasense.pyxl.replace_text(*, ws: Worksheet, column: int, text: tuple[tuple[str, str]]) Worksheet

Search and replace text in a cell.

Parameters:
  • ws (Worksheet) – The worksheet in which to search and replace text.

  • column (int) – The column number in which to search and replace text.

  • text (tuple[tuple[str, str]]) – The search and replace text.

Returns:

ws – The worksheet in which text was replaced.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.replace_text(
...     ws=ws,
...     column=13,
...     text=(('old_text', 'new_text'),)
... ) 
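
The signature types `text` as a tuple of (search, replace) pairs. The sketch below shows one way such pairs could be applied to the values of a column, using a plain list in place of worksheet cells. The helper `replace_in_cells` is hypothetical and only illustrates the shape of the argument; it is not the datasense implementation.

```python
def replace_in_cells(values: list, text: tuple[tuple[str, str], ...]) -> list:
    """Apply each (search, replace) pair, in order, to every string value."""
    result = []
    for value in values:
        # Non-string cells (numbers, empty cells) are left untouched.
        if isinstance(value, str):
            for search_item, replace_item in text:
                value = value.replace(search_item, replace_item)
        result.append(value)
    return result


column_values = ["old_text here", None, 42, "nothing to change"]
pairs = (("old_text", "new_text"), ("here", "there"))
updated = replace_in_cells(column_values, pairs)
print(updated)
```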
datasense.pyxl.unique_list_items(*, ws: Worksheet, row_of_labels: int, row_below_labels: int, column_name_varname: str, text_to_replace: list[str], text_to_remove: list[str]) tuple[list[int], list[int]]

Determine list of unique items in varname. Replace text.

TODO: This function does too many things. Break it up. Add detail to docstring.

Parameters:
  • ws (Worksheet) – The worksheet to analyze.

  • row_of_labels (int) – The row number of the labels in the worksheet.

  • row_below_labels (int) – The row number below the label row.

  • column_name_varname (str) – The label of a column.

  • text_to_replace (list[str]) – A list of the text to search for.

  • text_to_remove (list[str]) – A list of the text to remove.

Returns:

A tuple of a list of unique items and a list of column numbers.

  • varname: list[int]

    The list of unique items.

  • column_numbers: list[int]

    The list of column numbers.

Return type:

tuple[list[int], list[int]]

Example

>>> import datasense as ds
>>> wb = Workbook()
>>> ws = wb.active
>>> list_part_numbers, list_column_numbers = ds.unique_list_items(
...     ws=ws,
...     row_of_labels=row_of_labels,
...     row_below_labels=row_below_labels,
...     column_name_varname=column_name,
...     text_to_replace=text_to_replace,
...     text_to_remove=text_to_remove
... ) 
datasense.pyxl.validate_column_labels(*, ws: Worksheet, column_labels: list[str], first_column: int, last_column: int, row_of_labels: int, start_time: float = None, stop_time: float = None, original_stdout: IO[str] = None, output_url: str = None) Worksheet

Validate the labels of a worksheet against a desired list of labels.

Parameters:
  • ws (Worksheet) – The worksheet to analyze.

  • column_labels (list[str]) – The list of desired column labels.

  • first_column (int) – The first column of the label range in the worksheet.

  • last_column (int) – The last column of the label range in the worksheet.

  • row_of_labels (int) – The row number of the labels in the worksheet.

  • start_time (float = None) – The start time of the script.

  • stop_time (float = None) – The stop time of the script.

  • original_stdout (IO[str] = None) – The original stdout.

  • output_url (str = None) – The output url.

Returns:

ws – The worksheet to analyze.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> wb = Workbook()
>>> ws = wb.active
>>> ws = ds.validate_column_labels(
...     ws=ws,
...     column_labels=column_labels,
...     first_column=first_column,
...     last_column=last_column,
...     row_of_labels=row_of_labels,
...     start_time=start_time,
...     stop_time=time.time(),
...     original_stdout=original_stdout,
...     output_url=output_url
... ) 
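
The comparison at the heart of label validation can be sketched without a workbook. The helper `find_label_mismatches` below is hypothetical; it assumes the labels found in the worksheet have already been read into a list, and it says nothing about how datasense reports a failure.

```python
def find_label_mismatches(
    found_labels: list[str],
    column_labels: list[str]
) -> list[tuple]:
    """Return (position, found, expected) for each label that differs."""
    mismatches = []
    pairs = zip(found_labels, column_labels)
    for position, (found, expected) in enumerate(pairs, start=1):
        if found != expected:
            mismatches.append((position, found, expected))
    # A difference in the number of labels is also a validation failure.
    if len(found_labels) != len(column_labels):
        mismatches.append(("length", len(found_labels), len(column_labels)))
    return mismatches


found = ["part", "qty", "cost"]
desired = ["part", "quantity", "cost"]
problems = find_label_mismatches(found, desired)
print(problems)
```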
datasense.pyxl.validate_sheet_names(*, wb: Workbook, filename: Path | str, sheet_name: str, sheet_names: list[str], start_time: float, original_stdout: TextIOWrapper, output_url: str) Workbook
Parameters:
  • wb (Workbook) – A workbook.

  • filename (Path | str) – The file containing the workbook.

  • sheet_name (str) – A sheet name in the workbook.

  • sheet_names (list[str]) – The sheet names in the workbook.

  • start_time (float) – The start time of the script.

  • original_stdout (io.TextIOWrapper) – The buffered text stream for the html output.

  • output_url (str) – The html filename.

Returns:

wb – The workbook with a revised sheet name.

Return type:

Workbook

Example

>>> import time
>>> import datasense as ds
>>> start_time = time.perf_counter()
>>> stop_time = time.perf_counter()
>>> filename = "../tests/excel_file.xlsx"
>>> sheet_name = "sheet_01"
>>> sheet_names = ["sheet_01", "sheet_02"]
>>> output_url = "../tests/my_html_file.html"
>>> original_stdout = ds.html_begin(output_url=output_url)
>>> wb = Workbook()
>>> ws = wb.active
>>> wb = ds.validate_sheet_names(
...     wb=wb,
...     filename=filename,
...     sheet_name=sheet_name,
...     sheet_names=sheet_names,
...     start_time=start_time,
...     original_stdout=original_stdout,
...     output_url=output_url
... ) 
datasense.pyxl.write_dataframe_to_worksheet(*, ws: Worksheet, df: DataFrame, index: bool = False, header: bool = True) Worksheet

Write a dataframe to a worksheet.

Parameters:
  • ws (Worksheet) – The worksheet to which the dataframe will be written.

  • df (pd.DataFrame) – The dataframe of data.

  • index (bool = False) – Boolean to determine if dataframe index is written to worksheet.

  • header (bool = True) – Boolean to determine if dataframe header is written to worksheet.

Returns:

ws – The worksheet created.

Return type:

Worksheet

Example

>>> import datasense as ds
>>> wb = Workbook()
>>> ws = wb.active
>>> df = pd.DataFrame(data={
...     'X': [25.0, 24.0, 35.5, np.nan, 23.1],
...     'Y': [27, 24, np.nan, 23, np.nan],
...     'Z': ['a', 'b', np.nan, 'd', 'e']
... })
>>> ws = ds.write_dataframe_to_worksheet(
...     ws=ws,
...     df=df,
...     index=False,
...     header=True
... )

datasense.rgx module

Regular expressions

datasense.rgx.rgx_email_address(*, strings: list[str]) list[str]

Extract list of unique email addresses from a list of strings.

Parameters:

strings (list[str]) – A list of strings which may contain email addresses.

Returns:

matches – A list of strings containing the email addresses in the input list.

Return type:

list[str]

Examples

>>> import datasense as ds
>>> strings = [
...     "first email bob.smith@somemail.com send",
...     "second email bobsmith13@othermail.com received",
...     "third email tom@onemail.com and fourth mail frank@twomail.com"
... ]
>>> matches = ds.rgx_email_address(strings=strings)
>>> matches 
['bob.smith@somemail.com',
 'bobsmith13@othermail.com',
 'frank@twomail.com',
 'tom@onemail.com']

Open a file containing email addresses and extract the unique addresses.

>>> with open("../tests/mailbox.txt") as f:
...     data = f.read()
>>> strings = data.split("\n")
>>> matches = ds.rgx_email_address(strings=strings)
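
The behaviour described above (unique addresses, returned in sorted order) can be approximated with the standard `re` module. This is a sketch only: the pattern `EMAIL_PATTERN` and the helper `extract_emails` are assumptions made for illustration, and the pattern datasense actually uses may differ.

```python
import re

# A deliberately simple pattern (an assumption, not the library's own):
# local part, "@", domain, and a top-level domain of two or more letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(strings: list[str]) -> list[str]:
    """Return the sorted, unique email addresses found in the strings."""
    found = set()
    for string in strings:
        found.update(EMAIL_PATTERN.findall(string))
    return sorted(found)


strings = [
    "first email bob.smith@somemail.com send",
    "second email bobsmith13@othermail.com received",
    "third email tom@onemail.com and fourth mail frank@twomail.com",
    "repeat bob.smith@somemail.com",
]
emails = extract_emails(strings)
print(emails)
```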
datasense.rgx.rgx_url(*, strings: list[str]) list[str]

Extract list of unique URLs from a list of strings.

This is a work-in-progress as I discover more URLs to test.

Parameters:

strings (list[str]) – A list of strings which may contain URLs.

Returns:

matches – A list of strings containing URLs in the input list.

Return type:

list[str]

Example

>>> import datasense as ds
>>> strings = [
...     "https://www.google.com",
...     "https://www.wikipedia.org/",
...     "http://www.wikipedia.org/",
...     "one https://www.wikipedia.org/ and two http://www.wikipedia.org/",
...     "http://localhost:631/jobs/",
...     "https://en.wikipedia.org/wiki/Regular_expression#History",
...     "www.regexbuddy.com",
...     "http://www.regexbuddy.com/index.html",
...     "http://www.regexbuddy.com/index.html?source=library",
...     "Download RegexBuddy at http://www.regexbuddy.com/download.html",
...     "http://10.2.2.1.2/ttxx/txt/gg",
...     "file:///home/gilles/downloads/cheat_sheet_finance.pdf"
... ]
>>> matches = ds.rgx_url(strings=strings)
>>> matches 
['file:///home/gilles/downloads/cheat_sheet_finance.pdf',
 'http://10.2.2.1.2/ttxx/txt/gg',
 'http://localhost:631/jobs/',
 'http://www.regexbuddy.com/download.html',
 'http://www.regexbuddy.com/index.html',
 'http://www.regexbuddy.com/index.html?source=library',
 'http://www.wikipedia.org/',
 'https://en.wikipedia.org/wiki/Regular_expression#History',
 'https://www.google.com',
 'https://www.wikipedia.org/']
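
A rough equivalent for URLs with an explicit scheme can be written with the standard `re` module. The pattern `URL_PATTERN` and the helper `extract_urls` are assumptions for illustration, not the expression used by datasense. The sketch treats everything up to the next whitespace as part of the URL, so trailing punctuation in running text would be captured as well.

```python
import re

# Assumed pattern: an http, https, or file scheme followed by
# any run of non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?|file)://\S+")


def extract_urls(strings: list[str]) -> list[str]:
    """Return the sorted, unique URLs found in the strings."""
    found = set()
    for string in strings:
        found.update(URL_PATTERN.findall(string))
    return sorted(found)


strings = [
    "one https://www.wikipedia.org/ and two http://www.wikipedia.org/",
    "www.regexbuddy.com",
    "Download RegexBuddy at http://www.regexbuddy.com/download.html",
    "file:///home/gilles/downloads/cheat_sheet_finance.pdf",
]
urls = extract_urls(strings)
print(urls)
```

As in the example output above, a bare host name such as "www.regexbuddy.com" is not matched, because the sketch requires a scheme.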

datasense.sequel module

SQL functions

datasense.sequel.psycopg2_connection(params: dict) connect

Connect to the PostgreSQL database server.

Parameters:
  • database (str) – The name of the database.

  • user (str) – The name of the user.

  • password (str) – The password for the user.

  • host (str) – The host address.

  • port (int) – The port number.

Returns:

con – A connection to the PostgreSQL database server.

Return type:

psycopg2.connect

Example

>>> import datasense as ds
>>> connect_parameters = {
...     "database": "mydb",
...     "user": "postgres",
...     "password": "password",
...     "host": "localhost",
...     "port": 5432
... } 
>>> con = ds.psycopg2_connection(params=connect_parameters) 

datasense.stats module

Statistical analysis

  • Non-parametric statistical summary

  • Parametric statistical summary

  • Cubic spline smoothing for Y vs X, can handle missing values

  • Piecewise natural cubic spline helper

  • Generate random data of various distributions

  • Generate datetime data

  • Generate timedelta data

datasense.stats.cubic_spline(*, df: DataFrame, abscissa: str, ordinate: str) CubicSpline

Estimates the spline object for the abscissa and ordinate of a dataframe.

  • Requires that abscissa, ordinate be integer or float

  • Removes rows where there are missing values in abscissa and ordinate

  • Removes duplicate rows

  • Sorts the dataframe by abscissa in increasing order

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • abscissa (str) – The name of the abscissa column.

  • ordinate (str) – The name of the ordinate column.

Returns:

spline – A cubic spline.

Return type:

CubicSpline

Example

>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "abscissa": ds.random_data(
...             distribution="uniform",
...             random_state=42
...         ),
...         "ordinate": ds.random_data(
...             distribution="norm",
...             random_state=42
...         )
...     }
... ).sort_values(by=["abscissa"])
>>> spline = ds.cubic_spline(
...     df=df,
...     abscissa="abscissa",
...     ordinate="ordinate"
... ) 
    abscissa  ordinate
10  0.020584 -0.463418
29  0.046450 -0.291694
6   0.058084  1.579213
32  0.065052 -0.013497
37  0.097672 -1.959670
40  0.122038  0.738467
21  0.139494 -0.225776
5   0.155995 -0.234137
4   0.156019 -0.234153
31  0.170524  1.852278
14  0.181825 -1.724918
15  0.183405 -0.562288
26  0.199674 -1.150994
13  0.212339 -1.913280
19  0.291229 -1.412304
22  0.292145  0.067528
16  0.304242 -1.012831
36  0.304614  0.208864
23  0.366362 -1.424748
0   0.374540  0.496714
18  0.431945 -0.908024
39  0.440152  0.196861
24  0.456070 -0.544383
41  0.495177  0.171368
27  0.514234  0.375698
17  0.524756  0.314247
28  0.592415 -0.600639
3   0.598658  1.523030
8   0.601115 -0.469474
30  0.607545 -0.601707
20  0.611853  1.465649
38  0.684233 -1.328186
9   0.708073  0.542560
2   0.731994  0.647689
25  0.785176  0.110923
35  0.808397 -1.220844
12  0.832443  0.241962
7   0.866176  0.767435
33  0.948886 -1.057711
1   0.950714 -0.138264
34  0.965632  0.822545
11  0.969910 -0.465730
abscissa    float64
ordinate    float64
dtype: object
>>> df["predicted"] = spline(df["abscissa"])
>>> fig, ax = ds.plot_scatter_line_x_y1_y2(
...     X=df["abscissa"],
...     y1=df["ordinate"],
...     y2=df["predicted"]
... )
datasense.stats.datetime_data(*, start_year: str = None, start_month: str = None, start_day: str = None, start_hour: str = None, start_minute: str = None, start_second: str = None, end_year: str = None, end_month: str = None, end_day: str = None, end_hour: str = None, end_minute: str = None, end_second: str = None, time_delta_days: int = 41, time_delta_hours: int = 24) Series

Create a series of datetime data.

Parameters:
  • start_year (str = None) – The start year of the series.

  • start_month (str = None) – The start month of the series.

  • start_day (str = None) – The start day of the series.

  • start_hour (str = None) – The start hour of the series.

  • start_minute (str = None) – The start minute of the series.

  • start_second (str = None) – The start second of the series.

  • end_year (str = None) – The end year of the series.

  • end_month (str = None) – The end month of the series.

  • end_day (str = None) – The end day of the series.

  • end_hour (str = None) – The end hour of the series.

  • end_minute (str = None) – The end minute of the series.

  • end_second (str = None) – The end second of the series.

  • time_delta_days (int = 41) – The daily increment for the series.

  • time_delta_hours (int = 24) – The hourly increment for the series.

Returns:

series – The datetime series.

Return type:

pd.Series

Examples

>>> import datasense as ds
>>> # Create a default datetime series
>>> X = ds.datetime_data()
>>> # Create a datetime series of one month in increments of six hours
>>> X = ds.datetime_data(
...     start_year="2020",
...     start_month="01",
...     start_day="01",
...     start_hour="00",
...     start_minute="00",
...     start_second="00",
...     end_year="2020",
...     end_month="02",
...     end_day="01",
...     end_hour="00",
...     end_minute="00",
...     end_second="00",
...     time_delta_hours=6
... )
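
The "one month in increments of six hours" case can be illustrated with the standard `datetime` module. The helper `datetime_range` is hypothetical, and including the end point in the result is an assumption of this sketch; the reference does not state whether datasense includes it.

```python
from datetime import datetime, timedelta


def datetime_range(
    start: datetime,
    end: datetime,
    step: timedelta
) -> list[datetime]:
    """Return datetimes from start to end (inclusive) spaced by step."""
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values


stamps = datetime_range(
    start=datetime(2020, 1, 1),
    end=datetime(2020, 2, 1),
    step=timedelta(hours=6),
)
print(len(stamps), stamps[0], stamps[-1])
```

January has 31 days, so there are 31 × 4 = 124 six-hour steps and 125 values when both end points are kept.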
datasense.stats.linear_regression(*, X: Series, y: Series, print_model_summary: bool = False) tuple[statsmodels.regression.linear_model.OLS, pandas.core.series.Series, pandas.core.series.Series, pandas.core.series.Series, pandas.core.series.Series, pandas.core.series.Series]

Linear regression of one X series and one Y series. The variables are integers or floats. The X and y values must be sorted by X.

Parameters:
  • X (pd.Series) – The pandas Series of the independent data.

  • y (pd.Series) – The pandas Series of the dependent data.

  • print_model_summary (bool = False) – Print the model summary.

Returns:

A tuple of the fitted model, the predictions, the lower confidence bound, the upper confidence bound, the lower prediction bound, and the upper prediction bound.

  • fitted_model: sm.regression.linear_model.OLS

    The fitted model.

  • predictions: pd.Series

    The pandas Series with the predictions from the model.

  • confidence_interval_lower: pd.Series

    The lower bound of the confidence interval of the average.

  • confidence_interval_upper: pd.Series

    The upper bound of the confidence interval of the average.

  • prediction_interval_lower: pd.Series

    The lower bound of the prediction interval of the data.

  • prediction_interval_upper: pd.Series

    The upper bound of the prediction interval of the data.

Return type:

tuple[sm.regression.linear_model.OLS, pd.Series, pd.Series, pd.Series, pd.Series, pd.Series]

References

https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

Example

>>> import datasense as ds
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "abscissa": ds.random_data(
...             distribution="uniform",
...             random_state=42
...         ),
...         "ordinate": ds.random_data(
...             distribution="norm",
...             random_state=42
...         )
...     }
... ).sort_values(by=["abscissa"])
>>> X = df["abscissa"]
>>> y = df["ordinate"]
>>> (
...     fitted_model, predictions, confidence_interval_lower,
...     confidence_interval_upper, prediction_interval_lower,
...     prediction_interval_upper
... ) = ds.linear_regression(
...     X=X,
...     y=y
... )
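
For a single X series, the fitted line reduces to the closed-form least-squares estimates of intercept and slope. The sketch below computes them with the standard library only; the helper `least_squares_line` is hypothetical. datasense fits the model with statsmodels OLS (see References), which also provides the confidence and prediction intervals that this sketch omits.

```python
def least_squares_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
y_values = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = least_squares_line(x_values, y_values)
predictions = [intercept + slope * xi for xi in x_values]
print(intercept, slope)
```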
datasense.stats.natural_cubic_spline(*, X: Series, y: Series, number_knots: int, list_knots: list[int] = None) Pipeline

Piecewise natural cubic spline helper function.

If number_knots is given, the calculated knots are equally spaced between minval and maxval. The endpoints are not included as knots.

The X series must be in increasing order. The y series must not contain missing values.

Parameters:
  • X (pd.Series) – The data series of the abscissa.

  • y (pd.Series) – The data series of the ordinate.

  • number_knots (int) – The number of knots for the spline.

  • list_knots (list[int] = None) – A list of specific knots.

Returns:

p – The model object.

Return type:

Pipeline

Example

>>> import matplotlib.pyplot as plt
>>> import datasense as ds
>>> X = ds.random_data(distribution="uniform").sort_values()
>>> y = ds.random_data(distribution="norm")
>>> p = ds.natural_cubic_spline(
...     X=X,
...     y=y,
...     number_knots=10
... )
>>> fig, ax = ds.plot_scatter_line_x_y1_y2(
...     X=X,
...     y1=y,
...     y2=p.predict(X)
... )
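
The knot placement described above (equally spaced, end points excluded) can be sketched as follows. The helper `interior_knots` is hypothetical, and the exact spacing rule is an assumption: the interval is divided into number_knots + 1 equal parts and only the interior division points are kept.

```python
def interior_knots(
    minval: float,
    maxval: float,
    number_knots: int
) -> list[float]:
    """Return number_knots equally spaced knots strictly inside the range."""
    step = (maxval - minval) / (number_knots + 1)
    # i runs from 1 to number_knots, so minval and maxval are never returned.
    return [minval + step * i for i in range(1, number_knots + 1)]


knots = interior_knots(minval=0.0, maxval=10.0, number_knots=4)
print(knots)
```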
datasense.stats.nonparametric_summary(*, series: Series, alphap: float = 0.3333333333333333, betap: float = 0.3333333333333333, decimals: int = 3) Series

Calculate empirical quantiles for a series.

Parameters:
  • series (pd.Series) – The input series.

  • alphap (float = 1/3) – Plotting positions.

  • betap (float = 1/3) – Plotting positions.

  • decimals (int = 3) – The number of decimal places for rounding.

scipy.stats.mstats.mquantiles plotting positions:

  • R method 1, SAS method 3: not yet implemented in scipy.stats.mstats.mquantiles

  • R method 2, SAS method 5: not yet implemented in scipy.stats.mstats.mquantiles

  • R method 3, SAS method 2: not yet implemented in scipy.stats.mstats.mquantiles

  • R method 4, SAS method 1: alphap=0, betap=1

  • R method 5: alphap=0.5, betap=0.5

  • R method 6, SAS method 4, Minitab, SPSS: alphap=0, betap=0

  • R method 7, Splus 3.1, R default, pandas default, NumPy ‘linear’: alphap=1, betap=1

  • R method 8, NumPy ‘median_unbiased’: alphap=1/3, betap=1/3

  • R method 9: alphap=0.375, betap=0.375

  • Cunnane’s method, SciPy default: alphap=0.4, betap=0.4

  • APL method: alphap=0.35, betap=0.35

Returns:

A pd.Series containing:

  • lower outer fence: float

  • lower inner fence: float

  • lower quartile: float

  • median: float

  • upper quartile: float

  • upper inner fence: float

  • upper outer fence: float

  • interquartile range: float

  • inner outliers: list[float]

  • outer outliers: list[float]

  • minimum value: float

  • maximum value: float

  • count: int

Return type:

pd.Series

Examples

>>> import datasense as ds
>>> series = ds.random_data()
>>> series = ds.nonparametric_summary(series=series)
>>> series = ds.random_data()
>>> series = ds.nonparametric_summary(
...     series=series,
...     alphap=0,
...     betap=0
... )

Notes

The 1.57 used to calculate the confidence intervals was empirically determined. See: McGill, Robert, John W. Tukey, and Wayne A. Larsen (Feb. 1978). “Variations of Box Plots”. In: The American Statistician 32.1, pp. 12–16. doi: https://doi.org/10.2307/2683468. url: https://www.jstor.org/stable/2683468.
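
The effect of alphap and betap on an empirical quantile can be seen in a small pure-Python sketch of the plotting-position interpolation documented for scipy.stats.mstats.mquantiles. The helper `empirical_quantile` is hypothetical and handles one probability at a time; it is meant to show the arithmetic, not to replace the SciPy function.

```python
import math


def empirical_quantile(
    data: list[float],
    prob: float,
    alphap: float = 1 / 3,
    betap: float = 1 / 3
) -> float:
    """Return the quantile at prob using plotting positions alphap, betap."""
    x = sorted(data)
    n = len(x)
    if n == 0:
        raise ValueError("data must not be empty")
    if n == 1:
        return float(x[0])
    m = alphap + prob * (1.0 - alphap - betap)
    aleph = n * prob + m
    # k is the 1-based index of the lower order statistic, kept in 1..n-1.
    k = int(math.floor(min(max(aleph, 1), n - 1)))
    # gamma is the interpolation weight between x[k-1] and x[k] (0-based).
    gamma = min(max(aleph - k, 0.0), 1.0)
    return (1.0 - gamma) * x[k - 1] + gamma * x[k]


data = [1.0, 2.0, 3.0, 4.0]
q_r7 = empirical_quantile(data, 0.25, alphap=1, betap=1)
q_r6 = empirical_quantile(data, 0.25, alphap=0, betap=0)
print(q_r7, q_r6)
```

With the same four values, the lower quartile is 1.75 under R method 7 and 1.25 under R method 6, which shows why the choice of plotting positions should be reported alongside the quantiles.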

datasense.stats.one_sample_t(*, series: Series, hypothesized_value: int | float, alternative_hypothesis: str = 'two-sided', significance_level: float = 0.05, width: int = 7, decimals: int = 3) tuple[float, float, float, float, float, float, float, float, float, float, float]

One-sample t test.

  • Parametric statistics are calculated for the sample.

  • Non-parametric statistics are calculated for the sample.

  • The assumption for normality of each sample is evaluated.
    • Shapiro-Wilk, a parametric test

    • Anderson-Darling, a non-parametric test

    • Kolmogorov-Smirnov, a non-parametric test

Parameters:
  • series (pd.Series) – The Series of data, consisting of one column with a label y.

  • hypothesized_value (int | float) – The hypothesized value for the test.

  • alternative_hypothesis (str = "two-sided") – The alternative hypothesis for the t test: “two-sided”, the sample is different from the hypothesized value; “less”, the sample is less than the hypothesized value; “greater”, the sample is greater than the hypothesized value.

  • significance_level (float = 0.05) – The significance level for rejecting the null hypothesis.

  • width (int = 7) – The width for the formatted number.

  • decimals (int = 3) – The number of decimal places for the formatted number.

Returns:

A tuple containing eleven elements:

  • t_test_result.statistic: float

    The calculated t statistic for the hypothesis.

  • t_test_result.pvalue: float

    The calculated p value for the calculated t statistic.

  • power: float

    The power of the t test.

  • shapiro_wilk_test_statistic: float

    The Shapiro-Wilk test statistic.

  • shapiro_wilk_p_value: float

    The p value for the Shapiro-Wilk test statistic.

  • ad_test_statistic: float

    The Anderson-Darling test statistic.

  • ad_critical_values[2]: float

    The Anderson-Darling critical value at alpha = 0.05.

  • kolmogorov_smirnov_test_statistic: float

    The Kolmogorov-Smirnov test statistic.

  • kolmogorov_smirnov_test_pvalue: float

    The p value for the Kolmogorov-Smirnov test statistic.

Return type:

tuple[float, float, float, float, float, float, float, float, float, float, float]

Examples

Ho: the average of the sample is equal to the hypothesized value. Ha: the average of the sample is not equal to the hypothesized value.

>>> import datasense as ds
>>> series = ds.random_data(random_state=42)
>>> one_sample_t_test_result = ds.one_sample_t(
...     series=series,
...     hypothesized_value=0,
...     alternative_hypothesis="two-sided",
...     significance_level=0.05,
...     width=7
... ) 

Ho: the average of the sample is equal to the hypothesized value. Ha: the average of the sample is less than the hypothesized value.

>>> series = ds.random_data(random_state=42)
>>> one_sample_t_test_result = ds.one_sample_t(
...     series=series,
...     hypothesized_value=0,
...     alternative_hypothesis="less",
...     significance_level=0.05,
...     width=7
... ) 

Ho: the average of the sample is equal to the hypothesized value. Ha: the average of the sample is greater than the hypothesized value.

>>> series = ds.random_data(random_state=42)
>>> one_sample_t_test_result = ds.one_sample_t(
...     series=series,
...     hypothesized_value=0,
...     alternative_hypothesis="greater",
...     significance_level=0.05,
...     width=7
... ) 
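
The t statistic reported by a one-sample t test is the distance between the sample average and the hypothesized value, measured in standard errors. The sketch below computes it with the standard library; the helper `one_sample_t_statistic` is hypothetical, and it omits the p value, power, and normality checks that datasense reports.

```python
import math
import statistics


def one_sample_t_statistic(
    sample: list[float],
    hypothesized_value: float
) -> tuple[float, int]:
    """Return the t statistic and its degrees of freedom."""
    n = len(sample)
    average = statistics.fmean(sample)
    # Standard error of the average, from the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_statistic = (average - hypothesized_value) / standard_error
    return t_statistic, n - 1


t_statistic, degrees_of_freedom = one_sample_t_statistic(
    sample=[1.0, 2.0, 3.0, 4.0, 5.0],
    hypothesized_value=2.0,
)
print(t_statistic, degrees_of_freedom)
```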
datasense.stats.paired_t(*, series1: Series, series2: Series, hypothesized_value: int | float | bool = None, alternative_hypothesis: str = 'two-sided', significance_level: float = 0.05, width: int = 7, decimals: int = 3) tuple[float, float, float, float, float, float, float, float, float, float]

Paired t test for two dependent samples. It is a one-sample t test for the average of the pairwise differences.

The two samples of a paired t test arise from any circumstance in which each data point in one sample is uniquely matched to a data point in the second sample. Paired samples are also called dependent samples.

  • Parametric statistics are calculated for each sample.

  • Non-parametric statistics are calculated for each sample.

  • The assumption for normality of each sample is evaluated.
    • Shapiro-Wilk, a parametric test

    • Anderson-Darling, a non-parametric test

    • Kolmogorov-Smirnov, a non-parametric test

The paired t test has the following assumptions:

  • The data must be paired.

  • The differences must be independent of each other.

  • The differences follow a normal distribution.

Parameters:
  • series1 (pd.Series) – The first series of data, with a name.

  • series2 (pd.Series) – The second series of data, with a name.

  • hypothesized_value (int | float | bool = None) – The hypothesized value for the test.

  • alternative_hypothesis (str = "two-sided") – The alternative hypothesis for the paired t test: “two-sided”, the average of the differences is not equal to zero or some hypothesized value; “less”, the average of the differences is less than zero or some hypothesized value; “greater”, the average of the differences is greater than zero or some hypothesized value.

  • significance_level (float = 0.05) – The significance level for rejecting the null hypothesis.

  • width (int = 7) – The width for the formatted number.

  • decimals (int = 3) – The number of decimal places for the formatted number.

Returns:

A tuple containing ten elements:

  • t_test_statistic: float

    The calculated t statistic for the hypothesis.

  • t_test_p_value: float

    The calculated p value for the calculated t statistic.

  • shapiro_wilk_test_statistic: float

    The Shapiro-Wilk test statistic.

  • shapiro_wilk_p_value: float

    The p value for the Shapiro-Wilk test statistic.

  • ad_test_statistic: float

    The Anderson-Darling test statistic.

  • ad_critical_values[2]: float

    The Anderson-Darling critical value at alpha = 0.05.

  • kolmogorov_smirnov_test_statistic: float

    The Kolmogorov-Smirnov test statistic.

  • kolmogorov_smirnov_test_pvalue: float

    The p value for the Kolmogorov-Smirnov test statistic.

  • hypothesis_test_ci_lower_bound: float

    The lower value of the confidence interval of the average of the differences.

  • hypothesis_test_ci_upper_bound: float

    The upper value of the confidence interval of the average of the differences.

Return type:

tuple[float, float, float, float, float, float, float, float, float, float]

Examples

Ho: The population average of the differences equals zero. Ha: The population average of the differences does not equal zero.

>>> import datasense as ds
>>> series1 = ds.random_data(random_state=13)
>>> series2 = ds.random_data(random_state=69)
>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     significance_level=0.05,
...     alternative_hypothesis="two-sided"
... ) 

Ho: The population average of the differences equals zero. Ha: The population average of the differences is less than zero.

>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     hypothesized_value=0,
...     alternative_hypothesis="less",
...     significance_level=0.05
... ) 

Ho: The population average of the differences equals zero. Ha: The population average of the differences is greater than zero.

>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     hypothesized_value=0,
...     alternative_hypothesis="greater",
...     significance_level=0.05
... ) 

Ho: The population average of the differences equals d. Ha: The population average of the differences does not equal d.

>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     hypothesized_value=3,
...     alternative_hypothesis="two-sided",
...     significance_level=0.05,
... ) 

Ho: The population average of the differences equals d. Ha: The population average of the differences is less than d.

>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     hypothesized_value=3,
...     alternative_hypothesis="less",
...     significance_level=0.05
... ) 

Ho: The population average of the differences equals d. Ha: The population average of the differences is greater than d.

>>> paired_t_result = ds.paired_t(
...     series1=series1,
...     series2=series2,
...     hypothesized_value=3,
...     alternative_hypothesis="greater",
...     significance_level=0.05
... ) 
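
Because the paired t test is a one-sample t test on the pairwise differences, its statistic can be sketched in a few lines with the standard library. The helper `paired_t_statistic` is hypothetical; taking the differences as series1 minus series2 is an assumption of this sketch, and the p value and confidence interval are omitted.

```python
import math
import statistics


def paired_t_statistic(
    sample1: list[float],
    sample2: list[float],
    hypothesized_value: float = 0.0
) -> tuple[float, int]:
    """Return the t statistic of the differences and its degrees of freedom."""
    if len(sample1) != len(sample2):
        raise ValueError("paired samples must have the same length")
    # Each observation in sample1 is matched to one in sample2.
    differences = [a - b for a, b in zip(sample1, sample2)]
    n = len(differences)
    average = statistics.fmean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return (average - hypothesized_value) / standard_error, n - 1


paired_t, paired_df = paired_t_statistic(
    sample1=[12.0, 15.0, 10.0, 14.0],
    sample2=[10.0, 12.0, 9.0, 11.0],
)
print(paired_t, paired_df)
```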
datasense.stats.parametric_summary(*, series: Series, decimals: int = 3) Series

Return parametric statistics.

Parameters:
  • series (pd.Series) – The input series.

  • decimals (int = 3) – The number of decimal places for rounding.

Returns:

The output series containing:

  • n: sample size

  • min: minimum value

  • max: maximum value

  • ave: average

  • s: sample standard deviation

  • var: sample variance

Return type:

pd.Series

Example

>>> import datasense as ds
>>> series = ds.random_data()
>>> series = ds.parametric_summary(series=series)
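
The six statistics named above can be reproduced with the standard `statistics` module. The helper `summarize` is hypothetical and returns a dictionary rather than a pandas Series; the sample standard deviation and variance use the n - 1 denominator.

```python
import statistics


def summarize(sample: list[float], decimals: int = 3) -> dict[str, float]:
    """Return n, min, max, ave, s, and var, rounded to decimals places."""
    return {
        "n": len(sample),
        "min": round(min(sample), decimals),
        "max": round(max(sample), decimals),
        "ave": round(statistics.fmean(sample), decimals),
        "s": round(statistics.stdev(sample), decimals),
        "var": round(statistics.variance(sample), decimals),
    }


summary = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(summary)
```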
datasense.stats.random_data(*, distribution: str = 'norm', size: int = 42, loc: float = 0, scale: float = 1, low: int = 13, high: int = 70, strings: list[str] = ['female', 'male'], categories: list[str] = ['small', 'medium', 'large'], random_state: int = None, fraction_nan: float = 0.13, name: str = None) Series

Create a series of random items from a distribution.

Parameters:
  • distribution (str = "norm") – A scipy.stats distribution, the standard normal by default.

  • size (int = 42) – The number of rows to create.

  • loc (float = 0) – The center of a distribution.

  • scale (float = 1) – The spread of a distribution.

  • low (int = 13) – The low value (inclusive) for the integer distribution.

  • high (int = 70) – The high value (exclusive) for the integer distribution.

  • strings (list[str] = ["female", "male"]) – The list of strings for the distribution of strings.

  • categories (list[str] = ["small", "medium", "large"]) – The list of strings for the distribution of categories.

  • random_state (int = None) – The random number seed.

  • fraction_nan (float = 0.13) – The fraction of cells to be made np.nan.

  • name (str = None) – The name of the Series.

Returns:

series – A pandas series of random items.

Return type:

pd.Series

Examples

Generate a series of random floats, normal distribution, with the default parameters.

>>> import datasense as ds
>>> s = ds.random_data()

Generate a series of random floats, normal distribution, with the default parameters. Set random_state seed for repeatable sample.

>>> s = ds.random_data(random_state=42)

Create a series of random floats, normal distribution, with sample size = 113, mean = 69, standard deviation = 13.

>>> s = ds.random_data(
...     distribution="norm",
...     size=113,
...     loc=69,
...     scale=13
... )

Create series of random floats, standard uniform distribution, with the default parameters.

>>> s = ds.random_data(distribution="uniform")

Create series of random floats, standard uniform distribution, with the default parameters. Set random_state seed for repeatable sample.

>>> s = ds.random_data(
...     distribution="uniform",
...     random_state=42
... )

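Create series of random floats, uniform distribution, size = 113, loc = 13, scale = 70. Assuming loc and scale follow the scipy.stats.uniform convention, the values lie between loc = 13 and loc + scale = 83.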

>>> s = ds.random_data(
...     distribution="uniform",
...     size=113,
...     loc=13,
...     scale=70
... )

Create series of random integers, integer distribution, with the default parameters.

>>> s = ds.random_data(distribution="randint")

Create series of random nullable integers, integer distribution, with the default parameters.

>>> s = ds.random_data(distribution="randInt")

Create series of random integers, integer distribution, size = 113, min = 0, max = 1.

>>> s = ds.random_data(
...     distribution="randint",
...     size=113,
...     low=0,
...     high=2
... )

Create series of random integers, integer distribution, size = 113, min = 0, max = 1. Set random_state seed for repeatable sample.

>>> s = ds.random_data(
...     distribution="randint",
...     size=113,
...     low=0,
...     high=2,
...     random_state=42
... )

Create series of random strings from the default list.

>>> s = ds.random_data(distribution="strings")

Create series of random strings from a list of strings.

>>> s = ds.random_data(
...     distribution="strings",
...     size=113,
...     strings=["tom", "dick", "harry"]
... )

Create series of random strings from a list of strings. Set random_state seed for repeatable sample.

>>> s = ds.random_data(
...     distribution="strings",
...     size=113,
...     strings=["tom", "dick", "harry"],
...     random_state=42
... )

Create series of random booleans with the default parameters.

>>> s = ds.random_data(distribution="bool")

Create series of random nullable booleans with the default parameters.

>>> s = ds.random_data(distribution="boolean")

Create series of random booleans, size = 113.

>>> s = ds.random_data(
...     distribution="bool",
...     size=113
... )

Create series of random booleans, size = 113. Set random_state seed for repeatable sample.

>>> s = ds.random_data(
...     distribution="bool",
...     size=113,
...     random_state=42
... )

Create series of unordered categories.

>>> s = ds.random_data(distribution="category")

Create series of ordered categories.

>>> s = ds.random_data(distribution="categories")

Create series of ordered categories from a list of categories, size = 113.

>>> s = ds.random_data(
...     distribution="categories",
...     categories=["XS", "S", "M", "L", "XL"],
...     size=113
... )

Create series of ordered categories from a list of categories, size = 113. Set the random_state seed for a repeatable sample.

>>> s = ds.random_data(
...     distribution="categories",
...     categories=["XS", "S", "M", "L", "XL"],
...     size=113,
...     random_state=42
... )

Create series of timedelta64[ns].

>>> s = ds.random_data(
...     distribution="timedelta",
...     size=7
... )

Create series of datetime64[ns].

>>> s = ds.random_data(
...     distribution="datetime",
...     size=7
... )

Notes

Distribution dtypes returned for distribution options:

  • “uniform” float64

  • “bool” bool

  • “boolean” boolean (nullable)

  • “strings” str

  • “norm” float64

  • “randint” int64

  • “randInt” Int64 (nullable)

  • “category” category

  • “categories” category of type CategoricalDtype(ordered=True)
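
The difference between the plain and nullable dtypes listed above can be seen directly in pandas. The following is a minimal sketch that builds small series by hand rather than calling ds.random_data; the values are illustrative only and say nothing about what random_data generates.

```python
import pandas as pd

# NumPy-backed integers: missing values are not allowed.
plain_int = pd.Series([0, 1, 1, 0], dtype="int64")

# Nullable integers: a missing value becomes pd.NA and the dtype stays integer.
nullable_int = pd.Series([0, 1, None, 0], dtype="Int64")

# Nullable booleans follow the same pattern.
nullable_bool = pd.Series([True, None, False], dtype="boolean")

# Ordered categories support order-based operations such as max().
sizes = pd.Series(
    pd.Categorical(
        ["M", "XS", "L"],
        categories=["XS", "S", "M", "L", "XL"],
        ordered=True,
    )
)

print(plain_int.dtype)      # int64
print(nullable_int.dtype)   # Int64
print(nullable_bool.dtype)  # boolean
print(sizes.dtype.ordered)  # True
print(sizes.max())          # L
```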

datasense.stats.timedelta_data(*, time_delta_days: int = 41) Series

Create a series of timedelta data.

Parameters:

time_delta_days (int = 41) – The number of rows to create.

Returns:

series – The output series.

Return type:

pd.Series

Example

>>> import datasense as ds
>>> number_days_plus_one = 42
>>> series = ds.timedelta_data(time_delta_days=number_days_plus_one)
datasense.stats.two_sample_t(*, series1: Series, series2: Series, alternative_hypothesis: str = 'two-sided', significance_level: float = 0.05, width: int = 7, decimals: int = 3) tuple[float]

Two-sample t test.

  • Parametric statistics are calculated for each sample.

  • Non-parametric statistics are calculated for each sample.

  • The assumption of normality of each sample is evaluated.
    • Shapiro-Wilk, a parametric test

    • Anderson-Darling, a non-parametric test

  • The homogeneity of variance of the samples is evaluated.
    • Bartlett, a parametric test

    • Levene, a non-parametric test

Parameters:
  • series1 (pd.Series) – The first series of data, with a name.

  • series2 (pd.Series) – The second series of data, with a name.

  • alternative_hypothesis (str = "two-sided") – The alternative hypothesis for the t test.
    • “two-sided”: the sample averages are different.

    • “less”: the average of sample 1 is < the average of sample 2.

    • “greater”: the average of sample 1 is > the average of sample 2.

  • significance_level (float = 0.05) – The significance level for rejecting the null hypothesis.

  • width (int = 7) – The width for the formatted number.

  • decimals (int = 3) – The number of decimal places for the formatted number.

Returns:

A tuple containing fifteen elements.

  • t_test_statistic (float)

    The calculated t statistic for the hypothesis.

  • t_test_p_value (float)

    The calculated p value for the calculated t statistic.

  • power (float)

    The power of the t test.

  • swstat1 (float)

    The Shapiro-Wilk test statistic for sample 1.

  • swpvalue1 (float)

    The p value for the Shapiro-Wilk test statistic for sample 1.

  • swstat2 (float)

    The Shapiro-Wilk test statistic for sample 2.

  • swpvalue2 (float)

    The p value for the Shapiro-Wilk test statistic for sample 2.

  • bartlett_test_statistic (float)

    The Bartlett test statistic.

  • bartlett_p_value (float)

    The p value for the Bartlett test statistic.

  • ad_test_statistic_1 (float)

    The Anderson-Darling test statistic for sample 1.

  • ad_critical_value_1 (float)

    The Anderson-Darling critical value against which the test statistic for sample 1 is compared.

  • ad_test_statistic_2 (float)

    The Anderson-Darling test statistic for sample 2.

  • ad_critical_value_2 (float)

    The Anderson-Darling critical value against which the test statistic for sample 2 is compared.

  • hypothesis_test_ci_lower_bound (float)

    The lower bound of the confidence interval of the difference in the sample averages.

  • hypothesis_test_ci_upper_bound (float)

    The upper bound of the confidence interval of the difference in the sample averages.

Return type:

tuple[float]
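
A fifteen-element tuple is easy to mis-index, so it can help to give the positions names. The following is a minimal sketch, assuming the elements arrive in the order documented above; TwoSampleTResult is a hypothetical helper and is not part of datasense, and the numbers are placeholders rather than real test results.

```python
from typing import NamedTuple


class TwoSampleTResult(NamedTuple):
    """Hypothetical wrapper; field order follows the documented return order."""

    t_test_statistic: float
    t_test_p_value: float
    power: float
    swstat1: float
    swpvalue1: float
    swstat2: float
    swpvalue2: float
    bartlett_test_statistic: float
    bartlett_p_value: float
    ad_test_statistic_1: float
    ad_critical_value_1: float
    ad_test_statistic_2: float
    ad_critical_value_2: float
    hypothesis_test_ci_lower_bound: float
    hypothesis_test_ci_upper_bound: float


# Placeholder values standing in for the tuple returned by ds.two_sample_t(...).
raw_result = tuple(float(i) for i in range(15))

# Unpack the plain tuple into named fields.
result = TwoSampleTResult(*raw_result)

print(result.t_test_p_value)                  # 1.0
print(result.hypothesis_test_ci_upper_bound)  # 14.0
```

With a real call, raw_result would be replaced by the value returned from ds.two_sample_t.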

Examples

Ho: the average of sample one is equal to the average of sample two. Ha: the average of sample one is not equal to the average of sample two.

>>> import datasense as ds
>>> series1 = ds.random_data(random_state=13)
>>> series2 = ds.random_data(random_state=69)
>>> two_sample_t_test_result = ds.two_sample_t(
...     series1=series1,
...     series2=series2,
...     alternative_hypothesis="two-sided",
...     significance_level=0.05
... ) 

Ho: the average of sample one is equal to the average of sample two. Ha: the average of sample one is less than the average of sample two.

>>> series1 = ds.random_data(random_state=13)
>>> series2 = ds.random_data(random_state=69)
>>> two_sample_t_test_result = ds.two_sample_t(
...     series1=series1,
...     series2=series2,
...     alternative_hypothesis="less",
...     significance_level=0.05
... ) 

Ho: the average of sample one is equal to the average of sample two. Ha: the average of sample one is greater than the average of sample two.

>>> series1 = ds.random_data(random_state=13)
>>> series2 = ds.random_data(random_state=69)
>>> two_sample_t_test_result = ds.two_sample_t(
...     series1=series1,
...     series2=series2,
...     alternative_hypothesis="greater",
...     significance_level=0.05
... ) 
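
For readers who want to see what a two-sample t statistic measures, the following is a minimal standard-library sketch of the pooled-variance form. It is an assumption that this form is relevant here: the documentation above does not say whether two_sample_t pools the variances, and pooled_t_statistic is a hypothetical helper, not a datasense function.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample t test."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.fmean(sample1), statistics.fmean(sample2)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    dof = n1 + n2 - 2
    # Weight each sample variance by its own degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    # Standard error of the difference between the two sample averages.
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, dof


# Illustrative data only: two samples whose averages differ by 1.
t, dof = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(t, 6), dof)  # -1.0 8
```

A negative t indicates that the average of the first sample is below the average of the second, which is the direction tested by the “less” alternative.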

datasense.taguchi module

Taguchi Methods

datasense.taguchi.taguchi_loss_function(*, average: float | int, std_dev: float | int, target: float | int, cost: float | int, x: float | int) float

Calculate the average cost of use (ACU). It is also called the loss function.

Parameters:
  • average (float | int) – It is the average measurement value for the product stream. It is best to use a control chart (Xbar R | X mR) to estimate the average.

  • std_dev (float | int) – It is the standard deviation of the product stream. It is best to use a control chart (Xbar R | X mR) to estimate the standard deviation.

  • target (float | int) – It is the target value of the product stream.

  • cost (float | int) – It is the cost of scrap, rework, or repair for a single unit of product.

  • x (float | int) – It is the measurement value at which an item is scrapped, reworked, or repaired.

Returns:

acu – The average cost of use.

Return type:

float

Examples

Calculate ACU for an off-centred process with LS and US.

>>> import datasense as ds
>>> average = 4.66
>>> std_dev = 1.80
>>> target = 7.5
>>> cost = 0.25
>>> x = 15
>>> acu = ds.taguchi_loss_function(
...     average=average,
...     std_dev=std_dev,
...     target=target,
...     cost=cost,
...     x=x,
... )
>>> acu
0.05024711111111111

Calculate ACU for a centred process with LS and US.

>>> average = 7.5
>>> std_dev = 1.80
>>> target = 7.5
>>> cost = 0.25
>>> x = 15
>>> acu = ds.taguchi_loss_function(
...     average=average,
...     std_dev=std_dev,
...     target=target,
...     cost=cost,
...     x=x,
... )
>>> acu
0.014400000000000001
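
Both outputs above are consistent with the usual quadratic loss, ACU = k (s² + (average − target)²) with k = cost / (x − target)². The following is a minimal sketch of that formula; quadratic_loss_acu is a hypothetical name, and it is an assumption that taguchi_loss_function computes the value this way.

```python
def quadratic_loss_acu(average, std_dev, target, cost, x):
    """Average cost of use under a quadratic loss (hypothetical helper)."""
    # Loss coefficient: the loss equals `cost` for a unit that measures `x`.
    k = cost / (x - target) ** 2
    # Expected loss: variance plus the squared offset of the average from target.
    return k * (std_dev ** 2 + (average - target) ** 2)


# Same inputs as the two examples above.
off_centre = quadratic_loss_acu(
    average=4.66, std_dev=1.80, target=7.5, cost=0.25, x=15
)
centred = quadratic_loss_acu(
    average=7.5, std_dev=1.80, target=7.5, cost=0.25, x=15
)

print(round(off_centre, 6))  # 0.050247
print(round(centred, 6))     # 0.0144
```

The centred process costs less because the offset term vanishes and only the variance contributes to the loss.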