statistics

statistics.py

Extracts statistics from nanoindentation data:

- Postprocess located pop-ins
- Extract pop-in intervals
- Stress–strain transformation, following Kalidindi & Pathak (2008)
- Calculate pop-in statistics (load-depth and stress-strain)
- Calculate curve-level summary statistics (load-depth)

`calculate_curve_summary(df, start_col='start_idx', end_col='end_idx', time_col='Time (s)')`

Compute curve-level summary statistics about pop-in activity.

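This function reports the number of pop-ins, the total test duration, the time span from the start of the first pop-in to the end of the last one (`total_popin_duration`), the first and last pop-in times, and the average time between consecutive pop-in starts.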

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame that includes pop-in intervals.	required
`start_col`	`str`	Column name for the start index of each pop-in.	`'start_idx'`
`end_col`	`str`	Column name for the end index of each pop-in.	`'end_idx'`
`time_col`	`str`	Column name for time.	`'Time (s)'`

Returns:

Type	Description
`Series`	Summary metrics: pop-in count, total test duration, total pop-in duration, first and last pop-in times, and average interval between pop-ins.
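
The summary metrics can be illustrated with a minimal sketch that uses plain Python lists instead of a DataFrame. The helper name `summarize_popins` and its `(start_idx, end_idx)` tuple input are assumptions made for this illustration and are not part of the library. Note that the sketch returns `None` for the average gap when there is only one pop-in, whereas the pandas-based source listed below would be expected to yield NaN in that case.

```python
from statistics import mean


def summarize_popins(times, intervals):
    """Hypothetical helper: times is a list of timestamps in seconds,
    intervals is a list of (start_idx, end_idx) pairs into that list."""
    n = len(intervals)
    summary = {
        "n_popins": n,
        "total_test_duration": max(times) - min(times),
    }
    if n == 0:
        # Mirror the empty-case defaults of the documented function.
        summary.update(
            total_popin_duration=0.0,
            first_popin_time=None,
            last_popin_time=None,
            avg_time_between_popins=None,
        )
        return summary

    starts = [times[s] for s, _ in intervals]
    ends = [times[e] for _, e in intervals]
    # Gaps between consecutive pop-in start times.
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    summary.update(
        # Span from the first pop-in start to the last pop-in end.
        total_popin_duration=max(ends) - min(starts),
        first_popin_time=min(starts),
        last_popin_time=max(ends),
        avg_time_between_popins=mean(gaps) if gaps else None,
    )
    return summary


times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
result = summarize_popins(times, [(1, 2), (4, 5)])
```

With two pop-ins starting at 0.5 s and 2.0 s, the average gap is computed from start times only, so the length of each pop-in does not affect it.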

Source code in src/merrypopins/statistics.py

def calculate_curve_summary(
    df, start_col="start_idx", end_col="end_idx", time_col="Time (s)"
):
    """
    Compute curve-level summary statistics about pop-in activity.

    This function calculates the number of pop-ins, total pop-in duration, first and last pop-in times,
    and the average time between consecutive pop-ins.

    Args:
        df (pd.DataFrame): DataFrame that includes pop-in intervals.
        start_col, end_col (str): Column names for start and end indices of pop-ins.
        time_col (str): Column name for time.

    Returns:
        pd.Series: Summary metrics: count, total duration, first/last timing, average interval.
    """
    interval_rows = df.dropna(subset=[start_col, end_col]).copy().reset_index(drop=True)
    n_popins = len(interval_rows)
    if n_popins > 0:
        all_starts = (
            interval_rows[start_col].astype(int).apply(lambda idx: df.at[idx, time_col])
        )
        all_ends = (
            interval_rows[end_col].astype(int).apply(lambda idx: df.at[idx, time_col])
        )
        total_popin_duration = all_ends.max() - all_starts.min()
        avg_time_between = all_starts.diff().dropna().mean()
        first_popin_time = all_starts.min()
        last_popin_time = all_ends.max()
    else:
        total_popin_duration = 0.0
        avg_time_between = None
        first_popin_time = None
        last_popin_time = None

    return pd.Series(
        {
            "n_popins": n_popins,
            "total_test_duration": df[time_col].max() - df[time_col].min(),
            "total_popin_duration": total_popin_duration,
            "first_popin_time": first_popin_time,
            "last_popin_time": last_popin_time,
            "avg_time_between_popins": avg_time_between,
        }
    )

`calculate_popin_statistics(df, precursor_stats=True, temporal_stats=True, popin_shape_stats=True, time_col='Time (s)', load_col='Load (µN)', depth_col='Depth (nm)', start_col='start_idx', end_col='end_idx', before_window=0.5, after_window=0.5)`

Compute descriptive statistics for each detected pop-in.

This function calculates time-based, precursor-based, and shape-based features for each interval where a pop-in occurred (based on start and end index).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with indentation data and interval metadata.	required
`precursor_stats`	`bool`	Whether to calculate average dLoad and slope before the pop-in.	`True`
`temporal_stats`	`bool`	Whether to calculate duration and inter-event timing features.	`True`
`popin_shape_stats`	`bool`	Whether to compute shape-based features like velocity and curvature.	`True`
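`time_col`	`str`	Column name for time.	`'Time (s)'`
`load_col`	`str`	Column name for load.	`'Load (µN)'`
`depth_col`	`str`	Column name for depth.	`'Depth (nm)'`
`start_col`	`str`	Column name for the start index of pop-in intervals.	`'start_idx'`
`end_col`	`str`	Column name for the end index of pop-in intervals.	`'end_idx'`
`before_window`	`float`	Time window in seconds to use for context before the pop-in.	`0.5`
`after_window`	`float`	Time window in seconds to use for context after the pop-in.	`0.5`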

Returns:

Type	Description
`DataFrame`	Original DataFrame with per-pop-in statistics added (NaNs elsewhere).
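
The way `before_window` and `after_window` bound the data used for each pop-in can be sketched without pandas. The function `split_windows` below is a hypothetical stand-in written for this illustration; it only reproduces the window selection and does not compute any of the statistics themselves.

```python
def split_windows(times, start_idx, end_idx, before_window=0.5, after_window=0.5):
    """Hypothetical helper: return the row indices in the 'before' and
    'during' windows of one pop-in, given a list of timestamps in seconds."""
    start_time = times[start_idx]
    end_time = times[end_idx]
    # 'before': from before_window seconds ahead of the start, excluding the start.
    before = [
        i for i, t in enumerate(times)
        if start_time - before_window <= t < start_time
    ]
    # 'during': from the start to after_window seconds past the end, inclusive.
    during = [
        i for i, t in enumerate(times)
        if start_time <= t <= end_time + after_window
    ]
    return before, during


times = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
before, during = split_windows(times, start_idx=4, end_idx=5)
```

The 'before' window is half-open, so the pop-in start row belongs only to the 'during' window.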

Source code in src/merrypopins/statistics.py

def calculate_popin_statistics(
    df,
    precursor_stats=True,
    temporal_stats=True,
    popin_shape_stats=True,
    time_col="Time (s)",
    load_col="Load (µN)",
    depth_col="Depth (nm)",
    start_col="start_idx",
    end_col="end_idx",
    before_window=0.5,
    after_window=0.5,
):
    """
    Compute descriptive statistics for each detected pop-in.

    This function calculates time-based, precursor-based, and shape-based features
    for each interval where a pop-in occurred (based on start and end index).

    Args:
        df (pd.DataFrame): Input DataFrame with indentation data and interval metadata.
        precursor_stats (bool): Whether to calculate average dLoad and slope before the pop-in.
        temporal_stats (bool): Whether to calculate duration and inter-event timing features.
        popin_shape_stats (bool): Whether to compute shape-based features like velocity and curvature.
        time_col, load_col, depth_col (str): Column names for time, load, and depth.
        start_col, end_col (str): Column names for the start and end index of pop-in intervals.
        before_window, after_window (float): Time window in seconds to use for context before/after the pop-in.

    Returns:
        pd.DataFrame: Original DataFrame with per-pop-in statistics added (NaNs elsewhere).
    """
    df = df.copy()
    interval_rows = df.dropna(subset=[start_col, end_col]).copy().reset_index(drop=True)
    df["dLoad"] = df[load_col].diff() / df[time_col].diff()
    results = []

    for i, row in interval_rows.iterrows():
        start_idx = int(row[start_col])
        end_idx = int(row[end_col])
        start_time = df.at[start_idx, time_col]
        end_time = df.at[end_idx, time_col]

        # Ensure the 'before' window captures data before the pop-in
        before = df[
            (df[time_col] >= start_time - before_window) & (df[time_col] < start_time)
        ]

        # Ensure the 'during' window captures data during and after the pop-in
        during = df[
            (df[time_col] >= start_time) & (df[time_col] <= end_time + after_window)
        ]

        record = {"start_idx": start_idx, "end_idx": end_idx}

        if temporal_stats:
            record.update(
                _compute_temporal_stats(
                    start_time, end_time, interval_rows, i, df, time_col, start_col
                )
            )
        if precursor_stats:
            record.update(_compute_precursor_stats(before, time_col, load_col))
        if popin_shape_stats:
            record.update(
                _compute_shape_stats(
                    df, start_idx, end_idx, during, time_col, depth_col
                )
            )

        results.append(record)

    stats_df = pd.DataFrame(results)
    for col in stats_df.columns:
        if col not in [start_col, end_col]:
            df[col] = df[start_col].map(stats_df.set_index("start_idx")[col])

    logger.info(f"Computed pop-in statistics for {len(stats_df)} pop-ins")
    return df

`calculate_stress_strain(df, depth_col='Depth (nm)', load_col='Load (µN)', Reff_um=5.323, min_load_uN=2000, smooth_stress=True, smooth_window=11, smooth_polyorder=2, copy_popin_cols=True)`

Convert load–depth data to stress–strain using Kalidindi & Pathak (2008) formulas.

This function converts indentation load and depth measurements to stress and strain values using the Kalidindi & Pathak (2008) approach. It optionally copies pop-in markers from the input DataFrame, drops points below a minimum load, and can smooth the stress with a Savitzky-Golay filter. Stress is reported in MPa. The input must also contain a `Time (s)` column; a `ValueError` is raised if a required column is missing or if no points remain after load filtering. With the current setup, stress-strain data is accurate up to the yield point, after which it becomes increasingly inaccurate. To be expanded upon in a future version.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing the indentation data.	required
`depth_col`	`str`	Column name for the depth data.	`'Depth (nm)'`
`load_col`	`str`	Column name for the load data.	`'Load (µN)'`
`Reff_um`	`float`	Effective tip radius in microns.	`5.323`
`min_load_uN`	`float`	Minimum load threshold to filter out low-load points (in µN).	`2000`
`smooth_stress`	`bool`	Whether to apply smoothing to the stress signal.	`True`
`smooth_window`	`int`	Window size for the Savitzky-Golay filter.	`11`
`smooth_polyorder`	`int`	Polynomial order for the Savitzky-Golay filter.	`2`
`copy_popin_cols`	`bool`	Whether to copy pop-in markers from the input DataFrame.	`True`

Returns:

Type	Description
`DataFrame`	DataFrame with additional columns for stress, strain, and optionally pop-in markers.
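
The conversion itself reduces to three expressions, shown here for a single data point as a minimal sketch. The function name `indentation_stress_strain` is an assumption for this example; the formulas and unit conversions follow the source listing below, and the sketch assumes a positive depth (a zero depth would divide by zero here).

```python
import math


def indentation_stress_strain(depth_nm, load_uN, reff_um=5.323):
    """Hypothetical single-point version of the conversion.

    Returns (contact radius in m, strain, stress in MPa).
    """
    h_m = depth_nm * 1e-9      # depth: nm -> m
    p_n = load_uN * 1e-6       # load: µN -> N
    reff_m = reff_um * 1e-6    # effective tip radius: µm -> m

    a = math.sqrt(reff_m * h_m)                 # contact radius
    strain = h_m / (2.4 * a)                    # indentation strain
    stress_mpa = p_n / (math.pi * a**2) / 1e6   # indentation stress, Pa -> MPa
    return a, strain, stress_mpa


a, strain, stress = indentation_stress_strain(depth_nm=100.0, load_uN=5000.0)
```

Because the contact radius grows with the square root of depth, strain also scales with the square root of depth for a fixed effective radius.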

Source code in src/merrypopins/statistics.py

def calculate_stress_strain(
    df,
    depth_col="Depth (nm)",
    load_col="Load (µN)",
    Reff_um=5.323,
    min_load_uN=2000,
    smooth_stress=True,
    smooth_window=11,
    smooth_polyorder=2,
    copy_popin_cols=True,
):
    """
    Convert load–depth data to stress–strain using Kalidindi & Pathak (2008) formulas.

    This function converts indentation data from load and depth measurements to stress and strain values using
    the Kalidindi & Pathak (2008). approach. It optionally copies pop-in markers from the input DataFrame and filters
    data based on load. Additionally, stress can be smoothed using the Savitzky-Golay filter. With the current setup,
    stress-strain data is accurate up to the yield point, after which it becomes increasingly inaccurate. To be
    expanded upon in a future version.

    Args:
        df (pd.DataFrame): DataFrame containing the indentation data.
        depth_col (str): Column name for the depth data.
        load_col (str): Column name for the load data.
        Reff_um (float): Effective tip radius in microns.
        min_load_uN (float): Minimum load threshold to filter out low-load points (in µN).
        smooth_stress (bool): Whether to apply smoothing to the stress signal.
        smooth_window (int): Window size for the Savitzky-Golay filter.
        smooth_polyorder (int): Polynomial order for the Savitzky-Golay filter.
        copy_popin_cols (bool): Whether to copy pop-in markers from the input DataFrame.

    Returns:
        pd.DataFrame: DataFrame with additional columns for stress, strain, and optionally pop-in markers.
    """
    required_cols = [depth_col, load_col, "Time (s)"]
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required column(s): {missing_cols}")

    df = df.copy()

    # Calculate stress and strain before filtering
    h_m = df[depth_col] * 1e-9
    P_N = df[load_col] * 1e-6
    Reff_m = Reff_um * 1e-6

    a = np.sqrt(Reff_m * h_m)
    df["a_contact_m"] = a
    df["strain"] = h_m / (2.4 * a)
    df["stress"] = P_N / (np.pi * a**2) / 1e6  # MPa

    # Copy pop-in flags by index
    if copy_popin_cols:
        df["popin_start"] = False
        df["popin_end"] = False
        if "start_idx" in df.columns:
            df.loc[df["start_idx"].dropna().astype(int), "popin_start"] = True
        if "end_idx" in df.columns:
            df.loc[df["end_idx"].dropna().astype(int), "popin_end"] = True
        if "popin_selected" in df.columns:
            df["popin_selected"] = df["popin_selected"].fillna(False)

    # Filter by load
    df_filtered = df[df[load_col] >= min_load_uN].copy()

    if df_filtered.empty:
        raise ValueError("No data points remain after filtering by min_load_uN")

    # Apply smoothing if needed
    if smooth_stress and len(df_filtered) >= smooth_window:
        df_filtered["stress"] = savgol_filter(
            df_filtered["stress"], smooth_window, smooth_polyorder
        )

    logger.info(f"Computed stress–strain for {len(df_filtered)} points")
    return df_filtered

`calculate_stress_strain_statistics(df, start_col='start_idx', end_col='end_idx', time_col='Time (s)', stress_col='stress', strain_col='strain', before_window=0.5, precursor_stats=True, temporal_stats=True, shape_stats=True)`

Compute statistics for each pop-in in stress–strain space.

This function computes various statistics related to stress and strain for each detected pop-in event. It calculates features such as the jump in stress and strain, slope of the stress-strain curve, and temporal statistics.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Data with stress/strain and pop-in intervals.	required
`start_col`	`str`	Column marking the start index of each pop-in.	`'start_idx'`
`end_col`	`str`	Column marking the end index of each pop-in.	`'end_idx'`
`time_col`	`str`	Time column.	`'Time (s)'`
`stress_col`	`str`	Stress column.	`'stress'`
`strain_col`	`str`	Strain column.	`'strain'`
`before_window`	`float`	Time window to use for precursor features.	`0.5`
`precursor_stats`	`bool`	Whether to compute precursor statistics (e.g., slope).	`True`
`temporal_stats`	`bool`	Whether to compute temporal statistics (e.g., pop-in duration).	`True`
`shape_stats`	`bool`	Whether to compute shape-based statistics (e.g., velocity, curvature).	`True`

Returns:

Type	Description
`DataFrame`	DataFrame with per-pop-in stress/strain statistics added.

Source code in src/merrypopins/statistics.py

def calculate_stress_strain_statistics(
    df,
    start_col="start_idx",
    end_col="end_idx",
    time_col="Time (s)",
    stress_col="stress",
    strain_col="strain",
    before_window=0.5,
    precursor_stats=True,
    temporal_stats=True,
    shape_stats=True,
):
    """
    Compute statistics for each pop-in in stress–strain space.

    This function computes various statistics related to stress and strain for each detected pop-in event.
    It calculates features such as the jump in stress and strain, slope of the stress-strain curve, and temporal statistics.

    Args:
        df (pd.DataFrame): Data with stress/strain and pop-in intervals.
        start_col, end_col (str): Columns marking start and end indices of pop-ins.
        time_col (str): Time column.
        stress_col, strain_col (str): Stress and strain columns.
        before_window (float): Time window to use for precursor features.
        precursor_stats (bool): Whether to compute precursor statistics (e.g., slope).
        temporal_stats (bool): Whether to compute temporal statistics (e.g., pop-in duration).
        shape_stats (bool): Whether to compute shape-based statistics (e.g., velocity, curvature).

    Returns:
        pd.DataFrame: DataFrame with per-pop-in stress/strain statistics added.
    """
    df = df.copy()
    interval_rows = df.dropna(subset=[start_col, end_col]).copy().reset_index(drop=True)
    results = []

    for i, row in interval_rows.iterrows():
        start_idx = int(row[start_col])
        end_idx = int(row[end_col])
        start_time = df.at[start_idx, time_col]
        end_time = df.at[end_idx, time_col]

        during = df[(df[time_col] >= start_time) & (df[time_col] <= end_time)]
        before = df[
            (df[time_col] >= start_time - before_window) & (df[time_col] < start_time)
        ]

        record = {"start_idx": start_idx, "end_idx": end_idx}

        if shape_stats:
            record.update(
                _compute_stress_strain_jump_stats(
                    df, start_idx, end_idx, stress_col, strain_col
                )
            )
            record.update(
                _compute_stress_strain_shape_stats(
                    during, time_col, stress_col, strain_col
                )
            )

        if precursor_stats:
            record.update(
                _compute_stress_strain_precursor_stats(
                    before, time_col, stress_col, strain_col
                )
            )

        if temporal_stats:
            # Add temporal statistics such as pop-in duration and time between events
            record.update(
                _compute_temporal_stats(
                    start_time, end_time, interval_rows, i, df, time_col, start_col
                )
            )

        results.append(record)

    stats_df = pd.DataFrame(results)
    for col in stats_df.columns:
        if col not in [start_col, end_col]:
            df[col] = df[start_col].map(stats_df.set_index("start_idx")[col])

    logger.info(f"Computed stress–strain statistics for {len(stats_df)} pop-ins")
    return df

`default_statistics(df_locate, popin_flag_column='popin', before_window=0.5, after_window=0.5)`

Pipeline to compute pop-in statistics from raw located pop-ins.

This function extracts relevant columns, selects valid pop-in candidates based on local maxima, extracts intervals for each pop-in event, and calculates descriptive statistics for each interval.

Parameters:

Name	Type	Description	Default
`df_locate`	`DataFrame`	Input data containing pop-in candidate flags and indentation curve.	required
`popin_flag_column`	`str`	Column name indicating Boolean pop-in candidate (True/False).	`'popin'`
`before_window`	`float`	Time window (in seconds) to use for features before the pop-in event.	`0.5`
`after_window`	`float`	Time window (in seconds) to use for features after the pop-in event.	`0.5`

Returns:

Type	Description
`DataFrame`	DataFrame with annotated pop-in intervals and computed statistics (e.g., time, shape, precursor).

Source code in src/merrypopins/statistics.py

def default_statistics(
    df_locate, popin_flag_column="popin", before_window=0.5, after_window=0.5
):
    """
    Pipeline to compute pop-in statistics from raw located popins.

    This function extracts relevant columns, selects valid pop-in candidates based on local maxima,
    extracts intervals for each pop-in event, and calculates descriptive statistics for each interval.

    Args:
        df_locate (pd.DataFrame): Input data containing pop-in candidate flags and indentation curve.
        popin_flag_column (str): Column name indicating Boolean pop-in candidate (True/False).
        before_window (float): Time window (in seconds) to use for features before the pop-in event.
        after_window (float): Time window (in seconds) to use for features after the pop-in event.

    Returns:
        pd.DataFrame: DataFrame with annotated pop-in intervals and computed statistics (e.g., time, shape, precursor).
    """
    required_cols = ["Time (s)", "Load (µN)", "Depth (nm)", popin_flag_column]
    if "contact_point" in df_locate.columns:
        required_cols.append("contact_point")
    df_locate = df_locate[required_cols].copy()

    # Postprocess to select local maxima pop-ins
    df1 = postprocess_popins_local_max(df_locate, popin_flag_column=popin_flag_column)

    # Extract intervals for each pop-in
    df2 = extract_popin_intervals(df1)

    # Calculate statistics using before_window and after_window
    return calculate_popin_statistics(
        df2,
        time_col="Time (s)",
        before_window=before_window,  # Pass before_window to the function
        after_window=after_window,  # Pass after_window to the function
    )

`default_statistics_stress_strain(df_locate, popin_flag_column='popin', before_window=0.5, after_window=0.5, Reff_um=5.323, min_load_uN=2000, smooth_stress=True, stress_col='stress', strain_col='strain', time_col='Time (s)')`

Full pipeline: from raw data to stress–strain statistics.

This includes:

- Load–depth pop-in detection
- Interval extraction
- Stress–strain transformation
- Stress–strain statistics

Parameters:

Name	Type	Description	Default
`df_locate`	`DataFrame`	Raw indentation data with pop-in flag column.	required
`popin_flag_column`	`str`	Column with Boolean flags for pop-in candidates.	`'popin'`
`before_window`	`float`	Time window (in seconds) for computing precursor features.	`0.5`
`after_window`	`float`	Time window (in seconds) for computing shape-based features.	`0.5`
`Reff_um`	`float`	Effective tip radius in microns.	`5.323`
`min_load_uN`	`float`	Minimum load threshold for stress–strain conversion.	`2000`
`smooth_stress`	`bool`	Whether to smooth the stress signal.	`True`
`stress_col`	`str`	Column name for stress data.	`'stress'`
`strain_col`	`str`	Column name for strain data.	`'strain'`
`time_col`	`str`	Column name for time data.	`'Time (s)'`

Returns:

Type	Description
`DataFrame`	DataFrame with stress-strain statistics and pop-in intervals.

Source code in src/merrypopins/statistics.py

def default_statistics_stress_strain(
    df_locate,
    popin_flag_column="popin",
    before_window=0.5,
    after_window=0.5,
    Reff_um=5.323,
    min_load_uN=2000,
    smooth_stress=True,
    stress_col="stress",
    strain_col="strain",
    time_col="Time (s)",
):
    """
    Full pipeline: from raw data to stress–strain statistics.

    This includes:
    - Load–depth pop-in detection
    - Interval extraction
    - Stress–strain transformation
    - Stress–strain statistics

    Args:
        df_locate (pd.DataFrame): Raw indentation data with pop-in flag column.
        popin_flag_column (str): Column with Boolean flags for pop-in candidates.
        before_window (float): Time window (in seconds) for computing precursor features.
        after_window (float): Time window (in seconds) for computing shape-based features.
        Reff_um (float): Effective tip radius in microns.
        min_load_uN (float): Minimum load threshold for stress–strain conversion.
        smooth_stress (bool): Whether to smooth the stress signal.
        stress_col (str): Column name for stress data.
        strain_col (str): Column name for strain data.
        time_col (str): Column name for time data.

    Returns:
        pd.DataFrame: DataFrame with stress-strain statistics and pop-in intervals.
    """
    df_ld = default_statistics(
        df_locate,
        popin_flag_column=popin_flag_column,
        before_window=before_window,
        after_window=after_window,
    )

    df_stress = calculate_stress_strain(
        df_ld,
        Reff_um=Reff_um,
        min_load_uN=min_load_uN,
        smooth_stress=smooth_stress,
        copy_popin_cols=True,
    )

    df_stats = calculate_stress_strain_statistics(
        df_stress,
        start_col="start_idx",
        end_col="end_idx",
        time_col=time_col,
        stress_col=stress_col,
        strain_col=strain_col,
        before_window=before_window,
    )

    return df_stats

`extract_popin_intervals(df, popin_col='popin_selected', load_col='Load (µN)')`

Extract start and end indices for each pop-in event.

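For each flagged pop-in, this function identifies the start and end points based on the load curve. The start index is the flagged row itself, and the end index is the first later row where the load is at or above the load at the start. If the load never recovers to that level, the end index equals the start index.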

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with pop-in flags.	required
`popin_col`	`str`	The column with Boolean values indicating pop-in events.	`'popin_selected'`
`load_col`	`str`	The load column used to identify the recovery point.	`'Load (µN)'`

Returns:

Type	Description
`DataFrame`	DataFrame with added start and end index columns for each pop-in interval.
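
The recovery scan for a single pop-in can be sketched on a plain list of loads. The helper `find_interval_end` is a hypothetical name used only for this illustration; it follows the scanning logic in the source listing below.

```python
def find_interval_end(loads, start_idx):
    """Hypothetical helper: return the first index after start_idx where the
    load is at or above the load at start_idx, or start_idx if none exists."""
    load_start = loads[start_idx]
    for i in range(start_idx + 1, len(loads)):
        if loads[i] >= load_start:
            return i
    # The load never recovered, so the interval collapses to the start row.
    return start_idx


loads = [1.0, 2.0, 3.0, 2.5, 2.8, 3.1, 3.5]
end_idx = find_interval_end(loads, start_idx=2)
```

Here the load drops after index 2 and first returns to at least 3.0 at index 5, so the interval spans rows 2 to 5.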

Source code in src/merrypopins/statistics.py

def extract_popin_intervals(df, popin_col="popin_selected", load_col="Load (µN)"):
    """
    Extract start and end indices for each pop-in event.

    For each detected pop-in, this function identifies the start and end points based on the load curve.
    The start of a pop-in is where the load first increases, and the end is when the load returns to baseline.

    Args:
        df (pd.DataFrame): DataFrame with pop-in flags.
        popin_col (str): The column with Boolean values indicating pop-in events.
        load_col (str): The load column used to identify the recovery point.

    Returns:
        pd.DataFrame: DataFrame with added start and end index columns for each pop-in interval.
    """
    start_idx_col = [None] * len(df)
    end_idx_col = [None] * len(df)
    popin_indices = df.index[df[popin_col]].tolist()

    for start_idx in popin_indices:
        load_start = df.at[start_idx, load_col]
        end_idx = start_idx
        for i in range(start_idx + 1, len(df)):
            if df.at[i, load_col] >= load_start:
                end_idx = i
                break
        start_idx_col[start_idx] = start_idx
        end_idx_col[start_idx] = end_idx

    df = df.copy()
    df["start_idx"] = start_idx_col
    df["end_idx"] = end_idx_col
    return df

`postprocess_popins_local_max(df, popin_flag_column='popin', window=1)`

Select pop-ins that coincide with a local load maximum.

This function filters out pop-in candidates that do not sit at a local maximum of the load curve. A candidate is kept when its load is higher than both the load `window` rows before it and the load `window` rows after it. Only candidates that occur before the global maximum load are considered.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input indentation data with a pop-in flag column.	required
`popin_flag_column`	`str`	The column that marks pop-in candidates (True/False).	`'popin'`
`window`	`int`	Offset, in rows, to the neighbouring points that the current load is compared against.	`1`

Returns:

Type	Description
`DataFrame`	The original DataFrame with a new column indicating selected pop-ins.
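
A minimal sketch of the selection rule on plain lists follows. The function `select_local_max` and its list-based inputs are assumptions for this example; the comparison and the stop at the global maximum load follow the source listing below.

```python
def select_local_max(loads, flags, window=1):
    """Hypothetical helper: return indices of flagged points whose load exceeds
    the loads `window` rows before and after, up to the global maximum load."""
    max_load_idx = loads.index(max(loads))  # first occurrence of the peak load
    selected = []
    for idx in range(window, len(loads) - window):
        if idx >= max_load_idx:
            break  # ignore everything at or after the peak load
        if not flags[idx]:
            continue
        if loads[idx] > loads[idx - window] and loads[idx] > loads[idx + window]:
            selected.append(idx)
    return selected


loads = [1.0, 3.0, 2.0, 4.0, 5.0, 4.0]
flags = [False, True, False, True, False, False]
selected = select_local_max(loads, flags)
```

Index 1 is kept because its load exceeds both neighbours. Index 3 is flagged but rejected because the next point carries a higher load.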

Source code in src/merrypopins/statistics.py

def postprocess_popins_local_max(df, popin_flag_column="popin", window=1):
    """
    Select pop-ins that have a local load maxima.

    This function filters out pop-in events that do not represent local maxima in the load curve.
    A local maximum is defined as a point where the load is higher than the adjacent points
    within a sliding window.

    Args:
        df (pd.DataFrame): Input indentation data with a pop-in flag column.
        popin_flag_column (str): The column that marks pop-in candidates (True/False).
        window (int): The local window size to assess if the current load is a maximum.

    Returns:
        pd.DataFrame: The original DataFrame with a new column indicating selected pop-ins.
    """
    df = df.copy()
    max_load_idx = df["Load (µN)"].idxmax()
    popin_flags = df[popin_flag_column]
    selected_indices = []

    for idx in df.index[window:-window]:
        if idx >= max_load_idx:
            break
        if not popin_flags.loc[idx]:
            continue
        prev_load = df.at[idx - window, "Load (µN)"]
        curr_load = df.at[idx, "Load (µN)"]
        next_load = df.at[idx + window, "Load (µN)"]
        if curr_load > prev_load and curr_load > next_load:
            selected_indices.append(idx)

    df["popin_selected"] = False
    df.loc[selected_indices, "popin_selected"] = True
    logger.info(
        f"Filtered to {len(selected_indices)} local max pop-ins before max load"
    )
    return df