Operate DataFrame
- class operate_dataframe.DataFrameOperator
A class that provides various DataFrame operations such as merging, concatenation, splitting, and other utility functions for DataFrame manipulation.
- static apply_function(df: DataFrame, columns: List[str], func: Callable, element_wise: bool = True) → DataFrame
Apply a custom function to specified columns.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
columns (List[str]) – List of column names to apply the function to.
func (Callable) – The function to apply.
element_wise (bool, optional) – If True, apply function element-wise. If False, apply column-wise. Defaults to True.
- Returns:
A DataFrame with the function applied to the specified columns.
- Return type:
pd.DataFrame
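The two modes can be sketched in plain pandas. This is only an illustration, assuming that element-wise mode calls `func` once per value and column-wise mode passes each whole column to `func`; the implementation of `apply_function` is not shown in this reference, and the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]})
columns = ["a", "b"]

# Element-wise (assumed behavior for element_wise=True):
# the function receives one value at a time.
element_wise = df.copy()
for col in columns:
    element_wise[col] = element_wise[col].map(lambda v: v * 2)

# Column-wise (assumed behavior for element_wise=False):
# the function receives a whole Series, so it can use column statistics.
column_wise = df.copy()
for col in columns:
    series = column_wise[col]
    column_wise[col] = series - series.mean()

print(element_wise)
print(column_wise)
```

Column `c` is not listed in `columns`, so it is left untouched in both results.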
- static change_column_types(df: DataFrame, columns_types: Dict[str, str | type]) → DataFrame
Change the data types of specified columns.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
columns_types (Dict[str, Union[str, type]]) – A dictionary mapping column names to target data types.
- Returns:
A DataFrame with the specified column types changed.
- Return type:
pd.DataFrame
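A mapping of this shape is what pandas' own `DataFrame.astype` accepts, so a plain-pandas sketch looks like the following. Whether `change_column_types` delegates to `astype` is an assumption, and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "3"], "score": [1, 2, 3]})

# Hypothetical mapping that mixes a dtype string with a Python type.
columns_types = {"id": "int64", "score": float}

converted = df.astype(columns_types)
print(converted.dtypes)
```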
- static concat_dataframes(dfs: List[DataFrame], axis: int = 0, join: str = 'outer', ignore_index: bool = False, keys: List | None = None, levels: List | None = None, names: List[str] | None = None, verify_integrity: bool = False, sort: bool = False, copy: bool = True) → DataFrame
Concatenate pandas objects along a particular axis.
- Parameters:
dfs (List[pd.DataFrame]) – List of DataFrames to concatenate.
axis (int, optional) – The axis to concatenate along (0 for index, 1 for columns). Defaults to 0.
join (str, optional) – How to handle indexes on other axes (‘inner’, ‘outer’). Defaults to ‘outer’.
ignore_index (bool, optional) – If True, do not use the index values along the concatenation axis. Defaults to False.
keys (List, optional) – Sequence of keys to use to construct a hierarchical index. Defaults to None.
levels (List, optional) – Specific levels to use for the hierarchical index. Defaults to None.
names (List[str], optional) – Names for the levels in the resulting hierarchical index. Defaults to None.
verify_integrity (bool, optional) – Check whether the new concatenated axis contains duplicates. Defaults to False.
sort (bool, optional) – Sort non-concatenation axis if not aligned. Defaults to False.
copy (bool, optional) – If False, do not copy data unnecessarily. Defaults to True.
- Returns:
The concatenated DataFrame.
- Return type:
pd.DataFrame
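The parameters mirror those of `pandas.concat`, so the effect of `join`, `ignore_index`, `keys`, and `names` can be illustrated with pandas directly. This is a sketch on made-up frames, assuming the method forwards its arguments unchanged.

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df_b = pd.DataFrame({"x": [5], "z": [6]})

# join="outer" keeps the union of columns (missing cells become NaN).
outer = pd.concat([df_a, df_b], axis=0, join="outer", ignore_index=True)

# join="inner" keeps only the columns shared by every frame.
inner = pd.concat([df_a, df_b], axis=0, join="inner", ignore_index=True)

# keys/names build a hierarchical index that records each row's origin.
keyed = pd.concat([df_a, df_b], keys=["a", "b"], names=["source", "row"])

print(outer)
print(inner)
print(keyed)
```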
- static drop_columns(df: DataFrame, columns: List[str | int]) → DataFrame
Drop specified columns from the DataFrame by name or index position.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
columns (List[Union[str, int]]) – List of column names or index positions to drop.
- Returns:
A DataFrame with the specified columns dropped.
- Return type:
pd.DataFrame
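One way to support both names and positions is to translate positions into labels first. The sketch below assumes that integers are zero-based positions rather than integer column labels; the reference does not state which convention `drop_columns` follows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# One column given by name, one by zero-based position (assumed convention).
columns = ["b", 3]

# Translate integer positions into column labels, then drop by label.
labels = [df.columns[c] if isinstance(c, int) else c for c in columns]
result = df.drop(columns=labels)

print(list(result.columns))
```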
- static drop_duplicates(df: DataFrame, subset: List[str] | None = None, keep: str = 'first', inplace: bool = False, ignore_index: bool = False) → DataFrame
Remove duplicate rows from the DataFrame.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
subset (List[str], optional) – Columns to consider when identifying duplicates. Defaults to None.
keep (str or bool, optional) – Which duplicate to keep: ‘first’ keeps the first occurrence, ‘last’ keeps the last, and False drops every duplicated row. Defaults to ‘first’.
inplace (bool, optional) – If True, perform operation in-place. Defaults to False.
ignore_index (bool, optional) – If True, reset index after dropping duplicates. Defaults to False.
- Returns:
The DataFrame with duplicates removed.
- Return type:
pd.DataFrame
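The interaction between `subset` and `keep` is easiest to see on a small example. This sketch uses pandas' `DataFrame.drop_duplicates`, whose parameters share these names; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["ann", "bob", "ann", "ann"],
    "city": ["Oslo", "Rome", "Oslo", "Bergen"],
})

# Compare on "user" only and keep the first occurrence of each user.
first = df.drop_duplicates(subset=["user"], keep="first", ignore_index=True)

# keep=False removes every row whose "user" value appears more than once.
none_kept = df.drop_duplicates(subset=["user"], keep=False)

# Without subset, only rows identical in all columns count as duplicates.
full_row = df.drop_duplicates()

print(first)
print(none_kept)
print(full_row)
```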
- static fill_missing(df: DataFrame, value: float | Dict[str, float | str] | None = 0, columns: List[str] | None = None, method: str | None = None, axis: int | None = None, limit: int | None = None) → DataFrame
Fill missing values in the DataFrame.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
value (Union[float, Dict[str, Union[float, str]]], optional) – Value used to fill missing entries: either a scalar or a dictionary mapping column names to fill values. Defaults to 0.
columns (List[str], optional) – Specific columns to fill missing values in. Defaults to None.
method (str, optional) – Method to use for filling missing entries (‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None). Defaults to None.
axis (int, optional) – Axis along which to fill missing values. Defaults to None.
limit (int, optional) – Maximum number of consecutive NaNs to fill. Defaults to None.
- Returns:
A DataFrame with missing values filled.
- Return type:
pd.DataFrame
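The three filling styles described above (a scalar, a per-column dictionary, and a directional method) can be sketched with pandas as follows. Restricting the fill to `columns` is shown the way it is assumed to work; the sketch uses `DataFrame.ffill` for the directional case because recent pandas versions deprecate the `method` argument of `fillna`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "city": ["Oslo", None, "Rome"],
})

# Scalar fill restricted to selected columns (assumed meaning of `columns`).
columns = ["age"]
scalar_fill = df.copy()
scalar_fill[columns] = scalar_fill[columns].fillna(0)

# Per-column fill values given as a dictionary.
dict_fill = df.fillna({"age": df["age"].mean(), "city": "unknown"})

# Forward fill, propagating at most one value into each gap.
forward = df.ffill(limit=1)

print(scalar_fill)
print(dict_fill)
print(forward)
```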
- static filter_rows(df: DataFrame, condition: str) → DataFrame
Filter rows in the DataFrame based on a given condition.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
condition (str) – The condition to filter rows by (e.g., “age > 30”).
- Returns:
A new DataFrame with filtered rows.
- Return type:
pd.DataFrame
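A string condition such as “age > 30” has the form that pandas' `DataFrame.query` evaluates. The sketch below assumes `filter_rows` behaves like `query`; this is not confirmed by the reference, and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [28, 35, 41]})

# Column names are referenced directly inside the expression string.
condition = "age > 30"
filtered = df.query(condition)

# Conditions can be combined with `and` / `or`.
combined = df.query("age > 30 and age < 40")

print(filtered)
print(combined)
```

The original `df` is left unchanged, since `query` returns a new DataFrame.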
- static groupby(df: DataFrame, by: str | List[str], agg_funcs: str | List[str] | Dict[str, str | List[str]]) → DataFrame
Perform a group-by operation and apply aggregation functions.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
by (Union[str, List[str]]) – Column(s) to group by.
agg_funcs (Union[str, List[str], Dict[str, Union[str, List[str]]]]) – Aggregation function(s).
- Returns:
A DataFrame with grouped and aggregated data.
- Return type:
pd.DataFrame
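The three accepted shapes of `agg_funcs` (a single name, a list of names, or a per-column dictionary) correspond to what pandas' `GroupBy.agg` accepts. The following sketch uses pandas directly on made-up data; whether the method resets the index of the result is not stated here, so the grouped index is left in place.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "points": [10, 20, 5],
    "fouls": [1, 3, 2],
})

# A single function name is applied to every non-grouping column.
single = df.groupby("team").agg("sum")

# A list of names produces one sub-column per function.
several = df.groupby("team").agg(["min", "max"])

# A dictionary chooses functions per column.
per_column = df.groupby("team").agg({"points": "mean", "fouls": ["min", "max"]})

print(single)
print(several)
print(per_column)
```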
- static merge_dataframes(df1: DataFrame, df2: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_on: str | List[str] | None = None, right_on: str | List[str] | None = None, left_index: bool = False, right_index: bool = False, suffixes: Tuple[str, str] = ('_x', '_y'), indicator: bool = False, validate: str | None = None) → DataFrame
Merge two DataFrames using database-style joins.
- Parameters:
df1 (pd.DataFrame) – The first DataFrame.
df2 (pd.DataFrame) – The second DataFrame.
on (Union[str, List[str], None], optional) – Column or index level names to join on.
how (str, optional) – Type of merge to be performed (‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’). Defaults to ‘inner’.
left_on (Union[str, List[str], None], optional) – Column(s) from the left DataFrame to use as keys.
right_on (Union[str, List[str], None], optional) – Column(s) from the right DataFrame to use as keys.
left_index (bool, optional) – Use index from the left DataFrame as join key. Defaults to False.
right_index (bool, optional) – Use index from the right DataFrame as join key. Defaults to False.
suffixes (Tuple[str, str], optional) – Suffixes to apply to overlapping column names. Defaults to (‘_x’, ‘_y’).
indicator (bool, optional) – Adds a column ‘_merge’ with merge information. Defaults to False.
validate (str, optional) – If specified, checks that the merge is of the given type (for example ‘one_to_one’ or ‘many_to_one’, as in pandas). Defaults to None.
- Returns:
A merged DataFrame.
- Return type:
pd.DataFrame
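The parameter list matches `pandas.merge`, so the effect of `how`, `indicator`, and `validate` can be shown with pandas itself. This is a sketch on hypothetical frames, assuming the method passes its arguments through.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# Inner join keeps only ids present in both frames;
# validate raises if either side has duplicate keys.
inner = pd.merge(left, right, on="id", how="inner", validate="one_to_one")

# Outer join keeps every id; the `_merge` column records each row's origin.
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

print(inner)
print(outer)
```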
- static pivot_table(df: DataFrame, values: str | List[str] | None = None, index: str | List[str] | None = None, columns: str | List[str] | None = None, aggfunc: str | List[str] | Dict[str, str | List[str]] = 'mean', fill_value: Any | None = None, margins: bool = False, dropna: bool = True, margins_name: str = 'All', observed: bool = False, sort: bool = True) → DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
values (Union[str, List[str]], optional) – Column(s) to aggregate.
index (Union[str, List[str]], optional) – Keys to group by on the pivot table index.
columns (Union[str, List[str]], optional) – Keys to group by on the pivot table column.
aggfunc (Union[str, List[str], Dict[str, Union[str, List[str]]]], optional) – Aggregation function(s). Defaults to ‘mean’.
fill_value (Any, optional) – Value to replace missing values with.
margins (bool, optional) – If True, add a row and a column containing subtotals and the grand total. Defaults to False.
dropna (bool, optional) – Do not include columns whose entries are all NaN. Defaults to True.
margins_name (str, optional) – Name of the row/column that will contain the totals. Defaults to ‘All’.
observed (bool, optional) – If True, show only observed values for categorical groupers. Applies only if any of the groupers are categoricals. Defaults to False.
sort (bool, optional) – Sort group keys. Defaults to True.
- Returns:
The pivot table.
- Return type:
pd.DataFrame
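How `values`, `index`, `columns`, `fill_value`, and `margins` combine is easier to see on a concrete table. The sketch below calls `pandas.pivot_table` on made-up sales data, assuming the method forwards these arguments as named.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["tea", "coffee", "tea", "tea"],
    "amount": [10, 20, 30, 50],
})

table = pd.pivot_table(
    sales,
    values="amount",      # column to aggregate
    index="region",       # one row per region
    columns="product",    # one column per product
    aggfunc="sum",
    fill_value=0,         # south sold no coffee, so that cell becomes 0
    margins=True,         # add totals
    margins_name="All",
)

print(table)
```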
- static rename_columns(df: DataFrame, columns_dict: Dict[str, str]) → DataFrame
Rename columns in the DataFrame based on a given dictionary.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
columns_dict (Dict[str, str]) – A dictionary mapping old column names to new ones.
- Returns:
A DataFrame with renamed columns.
- Return type:
pd.DataFrame
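As a brief illustration, the same old-to-new mapping can be passed to pandas' `DataFrame.rename`. That `rename_columns` wraps this call is an assumption, and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"fname": ["ann"], "yrs": [30], "city": ["Oslo"]})

# Keys are existing column names, values are the replacements.
columns_dict = {"fname": "first_name", "yrs": "age"}
renamed = df.rename(columns=columns_dict)

print(list(renamed.columns))
```

Columns that do not appear in the dictionary, such as `city` here, keep their names.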
- static sample_dataframe(df: DataFrame, n: int | None = None, frac: float | None = None, replace: bool = False, weights: str | Series | None = None, random_state: int | None = None, axis: int = 0) → DataFrame
Return a random sample of items from an axis of the DataFrame.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
n (int, optional) – Number of items from axis to return.
frac (float, optional) – Fraction of axis items to return.
replace (bool, optional) – Sample with or without replacement. Defaults to False.
weights (Union[str, pd.Series], optional) – Weights for sampling.
random_state (int, optional) – Seed for the random number generator.
axis (int, optional) – Axis to sample. Defaults to 0.
- Returns:
A random sample of the DataFrame.
- Return type:
pd.DataFrame
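The sampling options can be sketched with pandas' `DataFrame.sample`, which uses the same parameter names. In pandas, `n` and `frac` are alternatives and cannot both be passed; whether `sample_dataframe` enforces the same rule is not stated in this reference. The data and the weight column `w` are made up.

```python
import pandas as pd

df = pd.DataFrame({"value": range(10), "w": [1] * 9 + [91]})

# A fixed seed makes the sample reproducible.
fixed = df.sample(n=3, random_state=42)
again = df.sample(n=3, random_state=42)

# frac selects a proportion of the rows instead of an absolute count.
half = df.sample(frac=0.5, random_state=0)

# Weighted sampling with replacement; weights are read from column "w".
weighted = df.sample(n=5, replace=True, weights="w", random_state=0)

print(fixed)
print(half)
print(weighted)
```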
- static sort_values(df: DataFrame, by: str | List[str], ascending: bool | List[bool] = True, inplace: bool = False, na_position: str = 'last') → DataFrame
Sort the DataFrame by specified column(s).
- Parameters:
df (pd.DataFrame) – The input DataFrame.
by (Union[str, List[str]]) – Column name(s) to sort by.
ascending (Union[bool, List[bool]], optional) – Sort ascending vs. descending. Defaults to True.
inplace (bool, optional) – If True, perform operation in-place. Defaults to False.
na_position (str, optional) – ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. Defaults to ‘last’.
- Returns:
The sorted DataFrame.
- Return type:
pd.DataFrame
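Passing a list to `ascending` sets the direction per sort key, and `na_position` controls where missing values land. A small sketch with pandas' `DataFrame.sort_values` on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dept": ["b", "a", "a", "b"],
    "salary": [50.0, np.nan, 70.0, 60.0],
})

# Single key, descending; NaN goes last by default.
by_salary = df.sort_values(by="salary", ascending=False)

# Two keys: dept ascending, then salary descending within each dept,
# with missing salaries placed first.
ordered = df.sort_values(
    by=["dept", "salary"],
    ascending=[True, False],
    na_position="first",
)

print(by_salary)
print(ordered)
```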
- static split_by_missing_values(df: DataFrame) → Tuple[DataFrame, DataFrame]
Split the DataFrame into two DataFrames based on missing values.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
- Returns:
A tuple containing:
DataFrame with the columns that contain missing values.
DataFrame with the columns that contain no missing values.
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
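A column-wise split of this kind can be sketched with a boolean mask over the columns. This is one possible approach in plain pandas, not necessarily how `split_by_missing_values` is implemented; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],
    "b": [np.nan, 3.0],
    "c": ["x", None],
})

# True for every column that holds at least one missing value.
has_missing = df.isna().any()

with_missing = df.loc[:, has_missing]
without_missing = df.loc[:, ~has_missing]

print(list(with_missing.columns))
print(list(without_missing.columns))
```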
- static split_dataframe(df: DataFrame, columns: List[str]) → Tuple[DataFrame, DataFrame]
Split a DataFrame into two DataFrames based on specified columns.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
columns (List[str]) – List of column names to separate.
- Returns:
A tuple containing:
DataFrame with the specified columns.
DataFrame without the specified columns.
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
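The split described above amounts to selecting the listed columns and dropping them from the remainder. A minimal plain-pandas sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"], "age": [30, 40]})
columns = ["name", "age"]

# First frame: only the requested columns.
selected = df[columns]

# Second frame: everything else.
remaining = df.drop(columns=columns)

print(list(selected.columns))
print(list(remaining.columns))
```

Both frames keep the original row index, so they can be recombined later with a column-wise concatenation.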