Extracting Data

Before we begin, we need to import both pandas and numpy as well as reading the titanic data from a csv file. We will store the dataset in a variable called raw_df and copy it to the variable df before each example transformation.

import pandas as pd
import numpy as np

raw_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv')
df = raw_df.copy()
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Transformation #1

After inspecting the data, we will begin our transformations. The first steps will be adding PassengerId to the index and then converting Sex and Cabin to the category datatype.

df = raw_df.copy()
df = df.set_index('PassengerId')
df['Sex'] = df.Sex.astype('category')
df['Cabin'] = df.Cabin.astype('category')

The first big difference in chaining is that we need to rap our transformation in (). The next difference is using a new method called assign to create or change existing columns.

df = raw_df.copy()
df = (
    df
    .set_index('PassengerId')
    .assign(Sex=df.Sex.astype('category'),
            Cabin=df.Cabin.astype('category'))
)

df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris female 22.0 1 0 A/5 21171 7.2500 C85 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 NaN C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C123 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) male 35.0 1 0 113803 53.1000 NaN S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Transformation #2

Next, we want to extract the Title from the Name column and convert uncommon titles to either Ms, Mrs, Mr, or Rare.

Note: New code is added under ## NEW SECTION ##
df = raw_df.copy()
df = df.set_index('PassengerId')
df['Sex'] = df.Sex.astype('category')
df['Cabin'] = df.Cabin.astype('category')

## NEW SECTION ##
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.')

df['Title'] = np.select(
    condlist=(df.Title.isin(['Mlle', 'Ms']),
              df.Title.isin(['Mme', 'Mrs']),
              df.Title.isin(['Mr'])),
    choicelist=('Miss', 'Mrs', 'Mr'),
    default='Rare')

We are still using assign to create the new columns but there are items of note. We are unable to have the same column name within a single assign method so we need to create a second assign method. Also, because Title does not exist in the original dataframe we have to reference the current object with lambda.

df = raw_df.copy()
df = (
    df
    .set_index('PassengerId')
    .assign(Sex=df.Sex.astype('category'),
            Cabin=df.Cabin.astype('category'),

            ## NEW SECTION ##
            Title=df.Name.str.extract('([A-Za-z]+)\.'))
    .assign(Title=lambda df:
            np.select(
                condlist=(df.Title.isin(['Mlle', 'Ms']),
                          df.Title.isin(['Mme', 'Mrs']),
                          df.Title.isin(['Mr'])),
                choicelist=('Miss', 'Mrs', 'Mr'),
                default='Rare'))
)

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Rare
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S Mr

Transformation #3

The next three transformations will be to bin Age and Category, convert them to a category datatype, and drop columns we will not use in our machine learning model.

df = raw_df.copy()
df = df.set_index('PassengerId')
df['Sex'] = df.Sex.astype('category')
df['Cabin'] = df.Cabin.astype('category')
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.')

df['Title'] = np.select(
    condlist=(df.Title.isin(['Mlle', 'Ms']),
              df.Title.isin(['Mme', 'Mrs']),
              df.Title.isin(['Mr'])),
    choicelist=('Miss', 'Mrs', 'Mr'),
    default='Rare')

## NEW SECTION ##
df['Age_bin'] = pd.cut(
    df['Age'],
    bins=[0, 12, 20, 40, 120],
    labels=['Children', 'Teenage', 'Adult', 'Elder'])

df['Fare_bin'] = pd.cut(
    df['Fare'],
    bins=[0, 7.91, 14.45, 31, 120],
    labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

df['Age_bin'] = df.Age_bin.astype('category')
df['Fare_bin'] = df.Fare_bin.astype('category')
df = df.drop(['Age', 'Fare', 'Name', 'Ticket'], axis=1)

We are continuing to use assign to create our columns. Note how we have to introduce a third assign method to categorize Age_bin and Fare_bin into categories since those columns are already referenced in the previous assign.

df = raw_df.copy()
df = (
    df
    .set_index('PassengerId')
    .assign(
        Title=df.Name.str.extract('([A-Za-z]+)\.')
    )
    .assign(Title=lambda df:
            np.select(
                condlist=(df.Title.isin(['Mlle', 'Ms']),
                          df.Title.isin(['Mme', 'Mrs']),
                          df.Title.isin(['Mr'])),
                choicelist=('Miss', 'Mrs', 'Mr'),
                default='Rare'),

            ## NEW SECTION ##
            Age_bin=pd.cut(
                df['Age'],
                bins=[0, 12, 20, 40, 120],
                labels=['Children', 'Teenage', 'Adult', 'Elder']),

            Fare_bin=pd.cut(
                df['Fare'],
                bins=[0, 7.91, 14.45, 31, 120],
                labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])
            )
    .assign(
        Age_bin=lambda df: df.Age_bin.astype('category'),
        Fare_bin=lambda df: df.Fare_bin.astype('category')
    )
    .drop(['Age', 'Fare', 'Name', 'Ticket'], axis=1)
)

df.head()
Survived Pclass Sex SibSp Parch Cabin Embarked Title Age_bin Fare_bin
PassengerId
1 0 3 male 1 0 NaN S Mrs Adult high_fare
2 1 1 female 1 0 C85 C Rare Adult median_fare
3 1 3 female 0 0 NaN S Mrs Adult high_fare
4 1 1 female 1 0 C123 S Mr Adult median_fare
5 0 3 male 0 0 NaN S Mr NaN median_fare

Transformation #4

To demonstrate how to filter a dataframe while chaining we will filter for only non survivors.

Note: A non survivor is labeled as 0 in the Survived column
df = raw_df.copy()
df = df.set_index('PassengerId')
df['Sex'] = df.Sex.astype('category')
df['Cabin'] = df.Cabin.astype('category')
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.')

df['Title'] = np.select(
    condlist=(df.Title.isin(['Mlle', 'Ms']),
              df.Title.isin(['Mme', 'Mrs']),
              df.Title.isin(['Mr'])),
    choicelist=('Miss', 'Mrs', 'Mr'),
    default='Rare')

df['Age_bin'] = pd.cut(
    df['Age'],
    bins=[0, 12, 20, 40, 120],
    labels=['Children', 'Teenage', 'Adult', 'Elder'])

df['Fare_bin'] = pd.cut(
    df['Fare'],
    bins=[0, 7.91, 14.45, 31, 120],
    labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

df['Age_bin'] = df.Age_bin.astype('category')
df['Fare_bin'] = df.Fare_bin.astype('category')
df = df.drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)

## NEW SECTION ##
df = df[df.Survived.eq(0)]  # or df.Survived == 0
df = raw_df.copy()
df = (
    df
    .set_index('PassengerId')
    .assign(
        Title=df.Name.str.extract('([A-Za-z]+)\.')
    )
    .assign(Title=lambda df:
            np.select(
                condlist=(df.Title.isin(['Mlle', 'Ms']),
                          df.Title.isin(['Mme', 'Mrs']),
                          df.Title.isin(['Mr'])),
                choicelist=('Miss', 'Mrs', 'Mr'),
                default='Rare'),

            Age_bin=pd.cut(
                df['Age'],
                bins=[0, 12, 20, 40, 120],
                labels=['Children', 'Teenage', 'Adult', 'Elder']),

            Fare_bin=pd.cut(
                df['Fare'],
                bins=[0, 7.91, 14.45, 31, 120],
                labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])
            )
    .assign(
        Age_bin=lambda df: df.Age_bin.astype('category'),
        Fare_bin=lambda df: df.Fare_bin.astype('category')
    )
    .drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)

    ## NEW SECTION ##
    .query('Survived == 0') # .loc/iloc is also an option

)

df.head()
Survived Pclass Sex SibSp Parch Embarked Title Age_bin Fare_bin
PassengerId
1 0 3 male 1 0 S Mrs Adult high_fare
5 0 3 male 0 0 S Mr NaN median_fare
6 0 3 male 0 0 Q Mr Elder high_fare
7 0 1 male 0 0 S Rare Children Average_fare
8 0 3 male 3 1 S Mrs Adult median_fare

Transformation #5

The final step in our transformation is to convert our categorical data into dummy variables.

df = raw_df.copy()
df = df.set_index('PassengerId')
df['Cabin'] = df.Cabin.astype('category')
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.')

df['Title'] = np.select(
    condlist=(df.Title.isin(['Mlle', 'Ms']),
              df.Title.isin(['Mme', 'Mrs']),
              df.Title.isin(['Mr'])),
    choicelist=('Miss', 'Mrs', 'Mr'),
    default='Rare')

df['Age_bin'] = pd.cut(
    df['Age'],
    bins=[0, 12, 20, 40, 120],
    labels=['Children', 'Teenage', 'Adult', 'Elder'])

df['Fare_bin'] = pd.cut(
    df['Fare'],
    bins=[0, 7.91, 14.45, 31, 120],
    labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

df['Age_bin'] = df.Age_bin.astype('category')
df['Fare_bin'] = df.Fare_bin.astype('category')
df = df.drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)
df = df[df.Survived.eq(0)]  # or df.Survived == 0

## NEW SECTION ##
df = pd.get_dummies(
    df, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"])

The pipe method is used when you have a function that takes your dataframe as an argument such as pd.get_dummies.

df = raw_df.copy()
df = (
    df
    .set_index('PassengerId')
    .assign(
        Title=df.Name.str.extract('([A-Za-z]+)\.')
    )
    .assign(Title=lambda df:
            np.select(
                condlist=(df.Title.isin(['Mlle', 'Ms']),
                          df.Title.isin(['Mme', 'Mrs']),
                          df.Title.isin(['Mr'])),
                choicelist=('Miss', 'Mrs', 'Mr'),
                default='Rare'),

            Age_bin=pd.cut(
                df['Age'],
                bins=[0, 12, 20, 40, 120],
                labels=['Children', 'Teenage', 'Adult', 'Elder']),

            Fare_bin=pd.cut(
                df['Fare'],
                bins=[0, 7.91, 14.45, 31, 120],
                labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])
            )
    .assign(
        Age_bin=lambda df: df.Age_bin.astype('category'),
        Fare_bin=lambda df: df.Fare_bin.astype('category')
    )
    .drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)
    .query('Survived == 0')

    ## NEW SECTION ##
    .pipe(lambda df: pd.get_dummies(df, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"]))

)

df.head()
Survived Pclass SibSp Parch Sex_female Sex_male Title_Miss Title_Mr Title_Mrs Title_Rare ... Age_bin_Teenage Age_bin_Adult Age_bin_Elder Embarked_C Embarked_Q Embarked_S Fare_bin_Low_fare Fare_bin_median_fare Fare_bin_Average_fare Fare_bin_high_fare
PassengerId
1 0 3 1 0 0 1 0 0 1 0 ... 0 1 0 0 0 1 0 0 0 1
5 0 3 0 0 0 1 0 1 0 0 ... 0 0 0 0 0 1 0 1 0 0
6 0 3 0 0 0 1 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 1
7 0 1 0 0 0 1 0 0 0 1 ... 0 0 0 0 0 1 0 0 1 0
8 0 3 3 1 0 1 0 0 1 0 ... 0 1 0 0 0 1 0 1 0 0

5 rows × 21 columns

Transformation Final

Finally, we wrap the transformation into a function that can be used to conveniently apply to the data.

def variable_transform(raw_df):
    df = raw_df.copy()
    
    df = df.set_index('PassengerId')
    df['Cabin'] = df.Cabin.astype('category')
    df['Title'] = df.Name.str.extract('([A-Za-z]+)\.')

    df['Title'] = np.select(
        condlist=(df.Title.isin(['Mlle', 'Ms']),
                df.Title.isin(['Mme', 'Mrs']),
                df.Title.isin(['Mr'])),
        choicelist=('Miss', 'Mrs', 'Mr'),
        default='Rare')

    df['Age_bin'] = pd.cut(
        df['Age'],
        bins=[0, 12, 20, 40, 120],
        labels=['Children', 'Teenage', 'Adult', 'Elder'])

    df['Fare_bin'] = pd.cut(
        df['Fare'],
        bins=[0, 7.91, 14.45, 31, 120],
        labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

    df['Age_bin'] = df.Age_bin.astype('category')
    df['Fare_bin'] = df.Fare_bin.astype('category')
    df = df.drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)
    df = df[df.Survived.eq(0)]  # or df.Survived == 0

    df = pd.get_dummies(
        df, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"])

    return df


def chain_transform(raw_df):
    df = raw_df.copy()

    df = (
        df
        .set_index('PassengerId')
        .assign(
            Title=df.Name.str.extract('([A-Za-z]+)\.')
        )
        .assign(Title=lambda df:
                np.select(
                    condlist=(df.Title.isin(['Mlle', 'Ms']),
                              df.Title.isin(['Mme', 'Mrs']),
                              df.Title.isin(['Mr'])),
                    choicelist=('Miss', 'Mrs', 'Mr'),
                    default='Rare'),

                Age_bin=pd.cut(
                    df['Age'],
                    bins=[0, 12, 20, 40, 120],
                    labels=['Children', 'Teenage', 'Adult', 'Elder']),

                Fare_bin=pd.cut(
                    df['Fare'],
                    bins=[0, 7.91, 14.45, 31, 120],
                    labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])
                )
        .assign(
            Age_bin=lambda df: df.Age_bin.astype('category'),
            Fare_bin=lambda df: df.Fare_bin.astype('category')
        )
        .drop(['Age', 'Fare', 'Name', 'Ticket', 'Cabin'], axis=1)
        .query('Survived == 0')
        .pipe(lambda df: pd.get_dummies(df, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"]))
    )

    return df

Performance

Below I tested the execution time for both variable assignment and chaining. The variable assignment in this example was 15% faster.

import timeit


def wrapper(func, *args, **kwargs):
    # https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped


variable = wrapper(variable_transform, raw_df)
chain = wrapper(chain_transform, raw_df)

print(f'Variables took {timeit.timeit(variable, number=100):.2f} seconds')
print(f'Chaining took {timeit.timeit(chain, number=100):.2f} seconds')
Variables took 1.56 seconds
Chaining took 1.90 seconds

Summary

Chaining can take some time to get used to but I do agree it's worth the effort. It makes it very clear what each step does in the transformation without needing to write a single code comment. Let me know your thoughts on whether or not you prefer variable assignment or chaining.