I spent an entire month answering pandas related StackOverflow question earning over 1,000 reputation. Below are some of the more useful methods outside the basics which can used to solve common occuring data transformation problems.

StackOverFlow Profile

import pandas as pd
import numpy as np

np.where

A faster way to implement an if else condition

df = pd.DataFrame(['Dog', 'Cat'] * 2, columns=['Adult'])

df

df['Baby'] = np.where(df['Adult'].eq('Dog'), 'Puppy', 'Kitten')

df

np.select

An alternative to np.where when you have multiple conditions

df = pd.DataFrame({"color"  : ["red", "green", "blue", "black", "white", "purple"]})

df

conditions = [df.color.isin(['red', 'green', 'blue']), df.color.eq('black') | df.color.eq('white')]

# create a list of choices, if both conditions are True then the first choice is choosen
choices = ['rgb', 'b&w']

df['category'] = np.select(condlist=conditions, choicelist=choices, default=df.color)

df

def func(df):
    if df.color in ['red', 'green', 'blue']:
        return 'rgb'
    elif df.color == 'black' or df.color == 'white':
        return 'b&w'
    else:
        return df.color

df['category'] = df.apply(func, axis=1)

df

Mask / Where

A quick way to return a value when a condition is (not) met.

https://stackoverflow.com/questions/51982417/pandas-mask-where-methods-versus-numpy-np-where

df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('id', 'sample'))

df

df.mask(df['id'].eq('2'), 'mask')

df.where(df['id'].eq('2'), 'where')

Align dummy columns between train and test sets

In machine learning, your train and test sets may become misaligned when creating dummy columns. This will result in an error.

https://stackoverflow.com/questions/58136267/check-reference-list-in-pandas-column-using-numpy-vectorization

df = pd.DataFrame(['A', 'B', 'C', 'Z', 'A', 'C', 'D', 'E', 'X', 'Y'], columns=['Letter'])

df

train = pd.get_dummies(df.iloc[:4])
train

test = pd.get_dummies(df.iloc[4:])

test

test = test.reindex(columns=train.columns, fill_value=0)
test

Understanding Groupby

Grouping your data can solve a lot of transformation challenges. It's a good idea to understand how it works.

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve' ,'Harry', 'Harry'],
    'Zip Code': [55555, 55555, 44444, 55555, 66666]
})

df

group = df.groupby('Name')

for key, dataframe in group:
    print(f'{key} \n {dataframe} \n')

David 
     Name  Zip Code
0  David     55555
1  David     55555 

Harry 
     Name  Zip Code
3  Harry     55555
4  Harry     66666 

Steve 
     Name  Zip Code
2  Steve     44444

group.apply(lambda x: type(x))

Name
David    <class 'pandas.core.frame.DataFrame'>
Harry    <class 'pandas.core.frame.DataFrame'>
Steve    <class 'pandas.core.frame.DataFrame'>
dtype: object

Groupby Filtering

If you want to select subsets of data within each group you can use filter.

https://stackoverflow.com/questions/58453245/pandas-match-and-unmatch-on-differnt-columns-in-python

df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve' ,'Harry', 'Harry'],
    'Zip Code': [55555, 55555, 44444, 55555, 66666]
})

df

df.groupby('Name').nunique()

df.groupby('Name').filter(lambda x: x['Zip Code'].nunique() > 1)

# alternative using transform
df[(df.groupby('Name').transform('nunique') > 1).values]

Groupby Transform

Returns an aggregated result of the same size of the original dataframe

df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve' ,'Harry', 'Harry'],
    'Count': [10, 15, 20, 25, 5]
})

df

df.groupby('Name').sum()

df['Total'] = df.groupby('Name').transform('sum')

df

Convert an Uneven Dictonary or List to a Dataframe

https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths

d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )

d

{'A': array([1, 2]), 'B': array([1, 2, 3, 4])}

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))

mylist = [[1,2,3], [4,5], 6]

mylist

[[1, 2, 3], [4, 5], 6]

pd.DataFrame([ pd.Series(value) for value in mylist ])

Convert Uneven List of Tuples to a Dataframe

https://stackoverflow.com/questions/58450965/list-of-lists-conversion-to-pandas-dataframe

data = [
    [('category', 'evaluation'), ('polarity', 'pos'), ('strength', '1'), ('type', 'good')], 
    [('category', 'intensifier'), ('type', 'shifter')],
    [('category', 'evaluation'), ('polarity', 'pos'), ('strength', '2'), ('type', 'good')]
]

[dict(e) for e in data]

[{'category': 'evaluation',
  'polarity': 'pos',
  'strength': '1',
  'type': 'good'},
 {'category': 'intensifier', 'type': 'shifter'},
 {'category': 'evaluation',
  'polarity': 'pos',
  'strength': '2',
  'type': 'good'}]

df = pd.DataFrame(data=[dict(e) for e in data])

df

Decay ffill

A vectorized implementation of a custom forward fill function.

https://stackoverflow.com/questions/58404949/how-to-speed-up-a-decay-ffill-function-for-pandas-series

s = pd.Series([0,0, 10, 15, 0,0,25,0,0,20])

s

0     0
1     0
2    10
3    15
4     0
5     0
6    25
7     0
8     0
9    20
dtype: int64

decay = 0.8

(s.mask(s.eq(0)).ffill() * decay ** s.groupby(s.ne(0).cumsum()).cumcount()).fillna(0)

0     0.0
1     0.0
2    10.0
3    15.0
4    12.0
5     9.6
6    25.0
7    20.0
8    16.0
9    20.0
dtype: float64

Create Columns in a Loop

https://stackoverflow.com/questions/58364225/grouping-by-and-applying-lambda-with-condition-for-the-first-row-pandas

df = pd.DataFrame(np.arange(1,6), columns=['number'])

df

for num in df.number:
    df[f'number x {num}'] = num * df.number
    
df

Explode

Transform List Like Elements into individual rows

(only v.25 and above)

pd.__version__

'0.25.1'

df = pd.DataFrame([[['A', 'B', 'C']], [['D', 'E', 'F']]], columns=['lists'])

df

df.explode('lists')

Expand Dataframe based on a Date Range

Expand Dataframe to range of Dates and fill values

df = pd.DataFrame({
 'Date': ['2018-01-01', '2018-01-03'],
 'Product': ['A', 'B']
 }
)

df['Date'] = pd.to_datetime(df['Date'])

df

df = df.set_index('Date')

df

date_range  = pd.date_range(start=df.index.min(), end=pd.DateOffset(days=1) + df.index.max())

date_range

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')

df.reindex(date_range).ffill()

Wide to Long

Similar to melt but can handle prefixed column names

https://stackoverflow.com/questions/58135102/pivot-tables-on-python

df = pd.DataFrame(
    {'date': ['1/1/11', '2/1/11', '3/1/11'],
     'online_won': [9, 1, 10],
     'retail_won': [10, 2, 8],
     'outbound_won': [11, 13, 14],
     'online_leads': [12, 15, 17],
     'retail_leads': [14.0, np.nan, np.nan],
     'outbound_leads': [np.nan, np.nan, np.nan]}
)

df

df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns ]

df.columns

Index(['date', 'won_online', 'won_retail', 'won_outbound', 'leads_online',
       'leads_retail', 'leads_outbound'],
      dtype='object')

df = pd.wide_to_long(df, ['won','leads'], 'date', 'source', sep='_', suffix='\w+')

df

Rolling Mean

s = pd.Series(np.arange(1,10))

s

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int32

s.expanding().mean()

0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
5    3.5
6    4.0
7    4.5
8    5.0
dtype: float64

s.cumsum() / (s.index.values + 1)

0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
5    3.5
6    4.0
7    4.5
8    5.0
dtype: float64

Rank Multiple Columns

https://stackoverflow.com/questions/58136881/compute-rank-average-for-multiple-columns-manually

df = pd.DataFrame(data={'String':['a','a','a','a','b'],'Integer':[1,2,3,3,1]})

df

df = df.sort_values(['String','Integer'], ascending=[True, False])

df

df['rank'] = np.arange(len(df)) + 1

df

df['rank'] = df.groupby(['String', 'Integer'])['rank'].transform('mean')

df

df['rank'] = df.groupby(['String', 'Integer'], sort=False).ngroup().rank()

df

Select Dates Ignoring Year

Slice Dataframe between two dates ignoring years

https://stackoverflow.com/questions/58117659/select-rows-in-pandas-dataframe-between-dates-regardless-of-year

df = pd.DataFrame({'date': pd.date_range('2016-09-01','2019-09-01', freq='D')})

df.head()

s = df.date.dt.strftime('%m%d').astype(int)

s.head()

0    901
1    902
2    903
3    904
4    905
Name: date, dtype: int32

df[s.between(316,318)]

Split a Dataframe np.split

df = pd.DataFrame({'A':np.arange(10), 'B':np.arange(10)})

df

df1, df2 = np.split(df, 2)

df1

df2

pd.concat([df1,df2.reset_index(drop=True)], axis=1)

Apply function with multiple arguments

https://stackoverflow.com/questions/58089770/using-apply-function-dataframe

import math


x = [-0.75853, -0.75853, -0.75853, -0.75852]
y = [-0.63435, -0.63434, -0.63435, -0.63436]
z = [-0.10488, -0.10490, -0.10492, -0.10495]
w = [-0.10597, -0.10597, -0.10597, -0.10596]

df = pd.DataFrame([x, y, z, w], columns=['x', 'y', 'z', 'w'])

df

def roll(qw, qx, qy, qz, var=10):
    # x-axis rotation
    sinr_cosp = +2.0 * (qw * qx + qy + qz)
    cosr_cosp = +1.0 - 2.0 * (qx * qx + qy * qy)
    roll = math.atan2(sinr_cosp, cosr_cosp)
    
    return roll * var

df.apply(lambda x : roll(x['w'], x['x'], x['y'], x['z'], 20), axis=1)

0   -43.509430
1   -38.182063
2    -7.883262
3    -7.957697
dtype: float64

def roll(df, var=10):
    # x-axis rotation
    sinr_cosp = +2.0 * (df.w * df.x + df.y + df.z)
    cosr_cosp = +1.0 - 2.0 * (df.x * df.x + df.y * df.y)
    roll = math.atan2(sinr_cosp, cosr_cosp)
    return roll * var

df.apply(roll, args=(2,), axis=1)

0   -4.350943
1   -3.818206
2   -0.788326
3   -0.795770
dtype: float64

Custom Styling

https://stackoverflow.com/questions/58087857/applying-pandas-styles-to-arbitrary-non-product-subsets-of-a-dataframe

df = pd.DataFrame(data={'A': [0, 1, np.nan], 'B': [.5, np.nan, 0], 'C': [np.nan, 1, 1]})

df

from matplotlib.cm import get_cmap
cmap = get_cmap('PuBu')

# update with low-high option
def threshold(x,low=0,high=1,mid=0.5):
    # nan cell
    if np.isnan(x): return ''

    # non-nan cell
    x = (x-low)/(high-low)
    background = f'background-color: rgba{cmap (x, bytes=True)}'
    text_color = f'color: white' if x > mid else ''
    return background+';'+text_color

# apply the style
df.style.applymap(threshold, low=-1, high=1, mid=0.3)

Reverse an Array

https://stackoverflow.com/questions/6771428/most-efficient-way-to-reverse-a-numpy-array

np.arange(1,11)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

np.arange(1,11)[::-1]

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

X = np.arange(1,11).reshape(2,5)

X

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

np.fliplr(X)

array([[ 5,  4,  3,  2,  1],
       [10,  9,  8,  7,  6]])

Symmetric difference

Values not common to both

https://stackoverflow.com/questions/45845965/python-compare-two-columns-of-features-return-values-which-are-not-common-to

df1 = pd.DataFrame({'a':['a', 'b', 'c', 'x', 'd', 'l', 'z']})

df1

df2 = pd.DataFrame({'b': ['b', 'a', 'd', 'c', 'y']})

df2

pd.Index.symmetric_difference(pd.Index(df1.a), pd.Index(df2.b)).values

array(['l', 'x', 'y', 'z'], dtype=object)

np.setdiff1d(np.union1d(df1.a, df2.b), np.intersect1d(df1.a, df2.b))

array(['l', 'x', 'y', 'z'], dtype=object)

Transpose a 3D array

https://stackoverflow.com/questions/32034237/how-does-numpys-transpose-method-permute-the-axes-of-an-array

two_d = np.arange(16).reshape(8,2)

print(two_d.shape)
print(two_d.transpose().shape)

(8, 2)
(2, 8)

three_d = np.arange(20).reshape((1, 4, 5))

print(three_d.shape)
print(three_d.transpose(1,0,2).shape)
print(three_d.transpose(2,1,0).shape)
print(three_d.transpose(0,1,2).shape)

(1, 4, 5)
(4, 1, 5)
(5, 4, 1)
(1, 4, 5)

literal eval

Convert a string representation of a list into a list

https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list

from ast import literal_eval

df = pd.DataFrame({'list':['[1,2,3]', '[5,6,7]']})

df

df.list.map(lambda x: type(x))

0    <class 'str'>
1    <class 'str'>
Name: list, dtype: object

df.list.map(lambda x: type(literal_eval(x)))

0    <class 'list'>
1    <class 'list'>
Name: list, dtype: object

Difference between Nan and None

https://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none

str accessor

s = pd.Series(['ABC','DEF'])

s

0    ABC
1    DEF
dtype: object

s.str.contains('A')

0     True
1    False
dtype: bool

s.str.contains('A|E')  # A or E

0    True
1    True
dtype: bool

s.str[::-1]

0    CBA
1    FED
dtype: object

s.str[:2]

0    AB
1    DE
dtype: object

Regular Expression

https://stackoverflow.com/questions/4736/learning-regular-expressions

Cartesian Product

df_A = pd.DataFrame({'A':[1,2,3]})
df_B = pd.DataFrame({'B':['A','B','C']})

df_A

df_B

df_A.assign(key=1).merge(df_B.assign(key=1), on='key').drop('key', axis=1)

to_frame

Maintain a Dataframe Object

df = pd.DataFrame({'A':[1,2,3]})

df

df.A

0    1
1    2
2    3
Name: A, dtype: int64

print(type(df.A))

<class 'pandas.core.series.Series'>

df.A.to_frame()

type(df.A.to_frame())

pandas.core.frame.DataFrame

Date Manipulation

dates = pd.Series(pd.to_datetime(['1/1/2016', '12/15/2017', '10/31/2019']))

dates

0   2016-01-01
1   2017-12-15
2   2019-10-31
dtype: datetime64[ns]

dates + pd.DateOffset(day=1)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py:837: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex
  PerformanceWarning,

0   2016-01-01
1   2017-12-01
2   2019-10-01
dtype: datetime64[ns]

dates + pd.offsets.MonthEnd(0)

0   2016-01-31
1   2017-12-31
2   2019-10-31
dtype: datetime64[ns]

dates.dt.month

0     1
1    12
2    10
dtype: int64

dates.dt.day_name()

0      Friday
1      Friday
2    Thursday
dtype: object

Total Seconds

Calculate total seconds between to datetimes

(pd.to_datetime('1/5/2019 00:01:35') - pd.to_datetime('1/1/2019 00:05:00')).seconds

86195

(pd.to_datetime('1/5/2019 00:01:35') - pd.to_datetime('1/1/2019 00:05:00')).total_seconds()

345395.0

	Letter_A	Letter_B	Letter_C	Letter_Z
0	1	0	0	0
1	0	1	0	0
2	0	0	1	0
3	0	0	0	1

	Letter_A	Letter_C	Letter_D	Letter_E	Letter_X	Letter_Y
4	1	0	0	0	0	0
5	0	1	0	0	0	0
6	0	0	1	0	0	0
7	0	0	0	1	0	0
8	0	0	0	0	1	0
9	0	0	0	0	0	1

	Letter_A	Letter_C
4	1	0
5	0	1
6	0	0
7	0	0
8	0	0
9	0	0

	date
0	2016-09-01
1	2016-09-02
2	2016-09-03
3	2016-09-04
4	2016-09-05

	date
196	2017-03-16
197	2017-03-17
198	2017-03-18
561	2018-03-16
562	2018-03-17
563	2018-03-18
926	2019-03-16
927	2019-03-17
928	2019-03-18

	id	sample
0	where	where
1	where	where
2	where	where
3	2	40
4	2	40
5	2	40
6	where	where
7	where	where
8	where	where

	category	polarity	strength	type
0	evaluation	pos	1	good
1	intensifier	NaN	NaN	shifter
2	evaluation	pos	2	good

	date	online_won	retail_won	outbound_won	online_leads	retail_leads	outbound_leads
0	1/1/11	9	10	11	12	14.0	NaN
1	2/1/11	1	2	13	15	NaN	NaN
2	3/1/11	10	8	14	17	NaN	NaN

		won	leads
date	source
1/1/11	online	9	12.0
2/1/11	online	1	15.0
3/1/11	online	10	17.0
1/1/11	retail	10	14.0
2/1/11	retail	2	NaN
3/1/11	retail	8	NaN
1/1/11	outbound	11	NaN
2/1/11	outbound	13	NaN
3/1/11	outbound	14	NaN

	x	y	z	w
0	-0.75853	-0.75853	-0.75853	-0.75852
1	-0.63435	-0.63434	-0.63435	-0.63436
2	-0.10488	-0.10490	-0.10492	-0.10495
3	-0.10597	-0.10597	-0.10597	-0.10596

	Adult
0	Dog
1	Cat
2	Dog
3	Cat

	Adult	Baby
0	Dog	Puppy
1	Cat	Kitten
2	Dog	Puppy
3	Cat	Kitten

	color
0	red
1	green
2	blue
3	black
4	white
5	purple

	color	category
0	red	rgb
1	green	rgb
2	blue	rgb
3	black	b&w
4	white	b&w
5	purple	purple