Operate DataFrame

class operate_dataframe.DataFrameOperator[source]

A class that provides various DataFrame operations such as merging, concatenation, splitting, and other utility functions for DataFrame manipulation.

static apply_function(df: DataFrame, columns: List[str], func: Callable, element_wise: bool = True) → DataFrame[source]

Apply a custom function to specified columns.

Parameters:

df (pd.DataFrame) – The input DataFrame.
columns (List[str]) – List of column names to apply the function to.
func (Callable) – The function to apply.
element_wise (bool, optional) – If True, apply function element-wise. If False, apply column-wise. Defaults to True.

Returns:

A DataFrame with the function applied to the specified columns.

Return type:

pd.DataFrame
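
The difference between the two modes can be sketched with plain pandas. This is an illustration only: it assumes element-wise application behaves like Series.map and column-wise application like DataFrame.apply, which the description above does not confirm, and the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]})
columns = ["a", "b"]

# Element-wise: the function receives one scalar value at a time.
element_wise = df.copy()
for col in columns:
    element_wise[col] = element_wise[col].map(lambda v: v * 2)

# Column-wise: the function receives a whole column (a pd.Series) at a time,
# so it can use column-level statistics such as the mean.
column_wise = df.copy()
column_wise[columns] = column_wise[columns].apply(lambda s: s - s.mean())

print(element_wise)
print(column_wise)
```

Columns not listed in columns ("c" here) are left untouched in both cases.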

static change_column_types(df: DataFrame, columns_types: Dict[str, str | type]) → DataFrame[source]

Change the data types of specified columns.

Parameters:

df (pd.DataFrame) – The input DataFrame.
columns_types (Dict[str, Union[str, type]]) – A dictionary mapping column names to target data types.

Returns:

A DataFrame with the specified column types changed.

Return type:

pd.DataFrame
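
A mapping of this shape matches what pandas' DataFrame.astype accepts. The sketch below calls pandas directly and assumes, without confirmation from the source, that change_column_types behaves the same way; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "3"], "score": [1, 2, 3]})

# Target types may be given as dtype strings or as Python types.
columns_types = {"id": "int64", "score": float}
converted = df.astype(columns_types)

print(converted.dtypes)
```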

static concat_dataframes(dfs: List[DataFrame], axis: int = 0, join: str = 'outer', ignore_index: bool = False, keys: List | None = None, levels: List | None = None, names: List[str] | None = None, verify_integrity: bool = False, sort: bool = False, copy: bool = True) → DataFrame[source]

Concatenate pandas objects along a particular axis.

Parameters:

dfs (List[pd.DataFrame]) – List of DataFrames to concatenate.
axis (int, optional) – The axis to concatenate along (0 for index, 1 for columns). Defaults to 0.
join (str, optional) – How to handle indexes on other axes (‘inner’, ‘outer’). Defaults to ‘outer’.
ignore_index (bool, optional) – If True, do not use the index values along the concatenation axis. Defaults to False.
keys (List, optional) – Sequence of keys to use to construct a hierarchical index. Defaults to None.
levels (List, optional) – Specific levels to use for the hierarchical index. Defaults to None.
names (List[str], optional) – Names for the levels in the resulting hierarchical index. Defaults to None.
verify_integrity (bool, optional) – Check whether the new concatenated axis contains duplicates. Defaults to False.
sort (bool, optional) – Sort non-concatenation axis if not aligned. Defaults to False.
copy (bool, optional) – If False, do not copy data unnecessarily. Defaults to True.

Returns:

The concatenated DataFrame.

Return type:

pd.DataFrame
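
The parameters mirror pandas.concat, so their effect can be shown with pandas itself. This is a sketch under the assumption that concat_dataframes forwards its arguments unchanged; the frames are invented for illustration.

```python
import pandas as pd

q1 = pd.DataFrame({"sales": [10, 20]})
q2 = pd.DataFrame({"sales": [30], "returns": [1]})

# join="outer" keeps the union of columns and fills the gaps with NaN.
stacked = pd.concat([q1, q2], axis=0, join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every frame.
common = pd.concat([q1, q2], axis=0, join="inner", ignore_index=True)

# keys adds an outer index level identifying the source frame.
labelled = pd.concat([q1, q2], keys=["q1", "q2"], names=["quarter", "row"])

print(stacked)
print(labelled)
```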

static drop_columns(df: DataFrame, columns: List[str | int]) → DataFrame[source]

Drop specified columns from the DataFrame by name or index position.

Parameters:

df (pd.DataFrame) – The input DataFrame.
columns (List[Union[str, int]]) – List of column names or index positions to drop.

Returns:

A DataFrame with the specified columns dropped.

Return type:

pd.DataFrame
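
One way to support both names and positions is to resolve integer positions to labels before dropping. The sketch below is a guess at the behavior, not the method's actual implementation; in particular, how integers are treated when the DataFrame itself has integer column labels is not stated in the source.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
columns = ["a", 2]  # drop column "a" by name and column "c" by position

# Resolve integer positions to labels first, then drop everything by label.
labels = [df.columns[c] if isinstance(c, int) else c for c in columns]
result = df.drop(columns=labels)

print(result.columns.tolist())
```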

static drop_duplicates(df: DataFrame, subset: List[str] | None = None, keep: str = 'first', inplace: bool = False, ignore_index: bool = False) → DataFrame[source]

Remove duplicate rows from the DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.
subset (List[str], optional) – Columns to consider when identifying duplicates.
keep (str, optional) – Which duplicate to keep: ‘first’ keeps the first occurrence, ‘last’ keeps the last occurrence, and False drops all duplicated rows. Defaults to ‘first’.
inplace (bool, optional) – If True, perform operation in-place. Defaults to False.
ignore_index (bool, optional) – If True, reset index after dropping duplicates. Defaults to False.

Returns:

The DataFrame with duplicates removed.

Return type:

pd.DataFrame
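
The effect of keep is easiest to see side by side. The example uses pandas' DataFrame.drop_duplicates directly, on the assumption that this method passes its arguments through; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob", "ann", "ann"], "visits": [1, 2, 3, 1]})

# Keep the first row seen for each user and renumber the index from 0.
first = df.drop_duplicates(subset=["user"], keep="first", ignore_index=True)

# Keep the last row seen for each user; original index labels are preserved.
last = df.drop_duplicates(subset=["user"], keep="last")

# keep=False drops every row whose user appears more than once.
none_kept = df.drop_duplicates(subset=["user"], keep=False)

print(first, last, none_kept, sep="\n\n")
```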

Fill missing values in the DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.
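value (Union[float, Dict[str, Union[float, str]]], optional) – Value used to fill missing entries: either a single scalar applied everywhere, or a dictionary mapping column names to per-column fill values.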
columns (List[str], optional) – Specific columns to fill missing values in.
method (str, optional) – Method to use for filling holes (‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None).
axis (int, optional) – Axis along which to fill missing values.
limit (int, optional) – Maximum number of consecutive NaNs to fill.

Returns:

A DataFrame with missing values filled.

Return type:

pd.DataFrame
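
The two filling strategies, by value and by method, can be sketched with pandas directly. This assumes the method wraps DataFrame.fillna or its equivalents, which the source does not show. Recent pandas releases deprecate fillna(method=...) in favor of ffill() and bfill(), so the sketch uses ffill for the method-based case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "temp": [20.0, np.nan, np.nan, 23.0],
        "city": ["Oslo", None, "Rome", None],
    }
)

# Fill by value: a dictionary gives each column its own replacement.
by_value = df.fillna(value={"temp": 0.0, "city": "unknown"})

# Fill by method on one column only: propagate the last valid value forward,
# but fill at most one consecutive NaN (limit=1).
forward = df.copy()
forward["temp"] = forward["temp"].ffill(limit=1)

print(by_value)
print(forward)
```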

static filter_rows(df: DataFrame, condition: str) → DataFrame[source]

Filter rows in the DataFrame based on a given condition.

Parameters:

df (pd.DataFrame) – The input DataFrame.
condition (str) – The condition to filter rows by (e.g., “age > 30”).

Returns:

A new DataFrame with filtered rows.

Return type:

pd.DataFrame
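
Condition strings such as “age > 30” follow the expression syntax of pandas' DataFrame.query. The sketch below calls query directly and assumes filter_rows evaluates the condition the same way; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [25, 35, 41],
        "city": ["Oslo", "Rome", "Oslo"],
    }
)

# A single comparison on one column.
older = df.query("age > 30")

# Conditions can be combined with "and" / "or"; string literals are quoted.
older_in_oslo = df.query("age > 30 and city == 'Oslo'")

print(older)
print(older_in_oslo)
```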

static groupby(df: DataFrame, by: str | List[str], agg_funcs: str | List[str] | Dict[str, str | List[str]]) → DataFrame[source]

Perform a group-by operation and apply aggregation functions.

Parameters:

df (pd.DataFrame) – The input DataFrame.
by (Union[str, List[str]]) – Column(s) to group by.
agg_funcs (Union[str, List[str], Dict[str, Union[str, List[str]]]]) – Aggregation function(s).

Returns:

A DataFrame with grouped and aggregated data.

Return type:

pd.DataFrame
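
The three accepted forms of agg_funcs correspond to the forms pandas' GroupBy.agg accepts. The following sketch uses pandas directly; whether this method resets the index or flattens the resulting column levels is not stated in the source, so the output shape shown here is that of plain pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["a", "a", "b"], "points": [1, 3, 5], "fouls": [2, 0, 1]}
)

# A single function name is applied to every non-grouping column.
single = df.groupby("team").agg("sum")

# A list of names yields one result column per (column, function) pair.
several = df.groupby("team")[["points"]].agg(["min", "max"])

# A dictionary selects functions per column.
per_column = df.groupby("team").agg({"points": "mean", "fouls": ["sum", "max"]})

print(single, several, per_column, sep="\n\n")
```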

static merge_dataframes(df1: DataFrame, df2: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_on: str | List[str] | None = None, right_on: str | List[str] | None = None, left_index: bool = False, right_index: bool = False, suffixes: Tuple[str, str] = ('_x', '_y'), indicator: bool = False, validate: str | None = None) → DataFrame[source]

Merge two DataFrames using database-style joins.

Parameters:

df1 (pd.DataFrame) – The first DataFrame.
df2 (pd.DataFrame) – The second DataFrame.
on (Union[str, List[str], None], optional) – Column or index level names to join on.
how (str, optional) – Type of merge to be performed (‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’). Defaults to ‘inner’.
left_on (Union[str, List[str], None], optional) – Column(s) from the left DataFrame to use as keys.
right_on (Union[str, List[str], None], optional) – Column(s) from the right DataFrame to use as keys.
left_index (bool, optional) – Use index from the left DataFrame as join key. Defaults to False.
right_index (bool, optional) – Use index from the right DataFrame as join key. Defaults to False.
suffixes (Tuple[str, str], optional) – Suffixes to apply to overlapping column names. Defaults to (‘_x’, ‘_y’).
indicator (bool, optional) – Adds a column ‘_merge’ with merge information. Defaults to False.
validate (str, optional) – If specified, checks that the merge is of the given type (e.g. ‘one_to_one’, ‘one_to_many’, ‘many_to_one’, ‘many_to_many’). Defaults to None.

Returns:

A merged DataFrame.

Return type:

pd.DataFrame
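
A left join with differently named key columns shows how left_on, right_on, indicator, and validate interact. The sketch calls pandas.merge directly, assuming merge_dataframes forwards these arguments unchanged; the tables are made up.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"id": [10, 20, 40], "name": ["Ann", "Bob", "Dee"]})

merged = pd.merge(
    orders,
    customers,
    how="left",               # keep every order, matched or not
    left_on="customer_id",    # key column in the left frame
    right_on="id",            # key column in the right frame
    indicator=True,           # adds the "_merge" column
    validate="many_to_one",   # raises if customer ids repeat on the right
)

print(merged)
```

Rows with no match on the right are marked left_only in the _merge column and carry NaN in the right-hand columns.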

Create a spreadsheet-style pivot table as a DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.
values (Union[str, List[str]], optional) – Column(s) to aggregate.
index (Union[str, List[str]], optional) – Keys to group by on the pivot table index.
columns (Union[str, List[str]], optional) – Keys to group by on the pivot table column.
aggfunc (Union[str, List[str], Dict[str, Union[str, List[str]]]], optional) – Aggregation function(s). Defaults to ‘mean’.
fill_value (Any, optional) – Value to replace missing values with.
margins (bool, optional) – If True, add an extra row and column containing subtotals and the grand total. Defaults to False.
dropna (bool, optional) – Do not include columns whose entries are all NaN. Defaults to True.
margins_name (str, optional) – Name of the row/column that will contain the totals. Defaults to ‘All’.
observed (bool, optional) – If True, only show observed values for categorical groupers. This only applies if any of the groupers are categoricals. Defaults to False.
sort (bool, optional) – Sort group keys. Defaults to True.

Returns:

The pivot table.

Return type:

pd.DataFrame
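
The parameters match those of pandas.pivot_table, so a small pandas example illustrates them. This is a sketch assuming the arguments are passed through as-is; the data and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "product": ["a", "b", "a", "a"],
        "sales": [10, 20, 30, 50],
    }
)

table = pd.pivot_table(
    df,
    values="sales",       # column to aggregate
    index="region",       # becomes the row labels
    columns="product",    # becomes the column labels
    aggfunc="sum",
    fill_value=0,         # south never sold product "b"
    margins=True,         # add the totals row and column
    margins_name="All",
)

print(table)
```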

static rename_columns(df: DataFrame, columns_dict: Dict[str, str]) → DataFrame[source]

Rename columns in the DataFrame based on a given dictionary.

Parameters:

df (pd.DataFrame) – The input DataFrame.
columns_dict (Dict[str, str]) – A dictionary mapping old column names to new ones.

Returns:

A DataFrame with renamed columns.

Return type:

pd.DataFrame
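
For illustration, the equivalent plain pandas call is shown below. It assumes rename_columns wraps DataFrame.rename(columns=...); the column names are invented.

```python
import pandas as pd

df = pd.DataFrame({"fname": ["Ann"], "yrs": [30], "city": ["Oslo"]})

# Keys are existing column names, values are the new names.
columns_dict = {"fname": "first_name", "yrs": "age"}
renamed = df.rename(columns=columns_dict)

print(renamed.columns.tolist())
```

Columns that do not appear in the dictionary keep their original names.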

Return a random sample of rows or columns from the DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame.
n (int, optional) – Number of items from axis to return.
frac (float, optional) – Fraction of axis items to return.
replace (bool, optional) – Sample with or without replacement. Defaults to False.
weights (Union[str, pd.Series], optional) – Weights for sampling.
random_state (int, optional) – Seed for the random number generator.
axis (int, optional) – Axis to sample. Defaults to 0.

Returns:

A random sample of the DataFrame.

Return type:

pd.DataFrame
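
The sketch below shows count-based and fraction-based sampling with pandas' DataFrame.sample, on the assumption that this method delegates to it. In pandas, n and frac are mutually exclusive, and a fixed random_state makes the draw reproducible.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Same seed, same rows: useful for reproducible experiments.
first_draw = df.sample(n=3, random_state=42)
second_draw = df.sample(n=3, random_state=42)

# frac selects a proportion of the rows instead of a fixed count.
half = df.sample(frac=0.5, random_state=0)

print(first_draw)
print(len(half))
```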

static sort_values(df: DataFrame, by: str | List[str], ascending: bool | List[bool] = True, inplace: bool = False, na_position: str = 'last') → DataFrame[source]

Sort the DataFrame by specified column(s).

Parameters:

df (pd.DataFrame) – The input DataFrame.
by (Union[str, List[str]]) – Column name(s) to sort by.
ascending (Union[bool, List[bool]], optional) – Sort ascending vs. descending. Defaults to True.
inplace (bool, optional) – If True, perform operation in-place. Defaults to False.
na_position (str, optional) – ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. Defaults to ‘last’.

Returns:

The sorted DataFrame.

Return type:

pd.DataFrame
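
Sort direction and NaN placement are independent settings, which the following pandas sketch illustrates. It assumes the method forwards to DataFrame.sort_values; the data is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"dept": ["b", "a", "a", "b"], "salary": [3.0, np.nan, 1.0, 2.0]}
)

# Descending by salary, with missing salaries placed at the top.
by_salary = df.sort_values(by="salary", ascending=False, na_position="first")

# One direction per sort key: dept ascending, then salary descending.
by_both = df.sort_values(
    by=["dept", "salary"], ascending=[True, False], na_position="first"
)

print(by_salary)
print(by_both)
```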

static split_by_missing_values(df: DataFrame) → Tuple[DataFrame, DataFrame][source]

Split the DataFrame into two DataFrames based on missing values.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

A tuple containing:

DataFrame with columns that contain missing values.
DataFrame with columns that do not have any missing values.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]
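
A column-wise split of this kind can be sketched in a few lines of pandas. This is one possible implementation consistent with the description above, not necessarily the one the class uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 2.0], "c": ["x", None]})

# Labels of the columns holding at least one missing value.
has_missing = df.columns[df.isna().any()]

with_missing = df[has_missing]
without_missing = df.drop(columns=has_missing)

print(with_missing.columns.tolist())
print(without_missing.columns.tolist())
```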

static split_dataframe(df: DataFrame, columns: List[str]) → Tuple[DataFrame, DataFrame][source]

Split a DataFrame into two DataFrames based on specified columns.

Parameters:

df (pd.DataFrame) – The input DataFrame.
columns (List[str]) – List of column names to separate.

Returns:

A tuple containing:

DataFrame with the specified columns.
DataFrame without the specified columns.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]
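
A typical use is separating a target column from the remaining features. The sketch below reproduces the described result with plain pandas and is an assumption about the behavior, not the method's actual code; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "feature": [0.5, 0.7], "label": [0, 1]})
columns = ["label"]

# First frame: only the requested columns. Second frame: everything else.
selected = df[columns]
remaining = df.drop(columns=columns)

print(selected.columns.tolist())
print(remaining.columns.tolist())
```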