Pandas Tips and Tricks
Tips and tricks learned from answering StackOverflow questions
- np.where
- np.select
- Mask / Where
- Align dummy columns between train and test sets
- Understanding Groupby
- Groupby Filtering
- Groupby Transform
- Convert an Uneven Dictionary or List to a Dataframe
- Convert Uneven List of Tuples to a Dataframe
- Decay ffill
- Create Columns in a Loop
- Explode
- Expand Dataframe based on a Date Range
- Wide to Long
- Rolling Mean
- Rank Multiple Columns
- Select Dates Ignoring Year
- Split a Dataframe np.split
- Apply function with multiple arguments
- Custom Styling
- Reverse an Array
- Symmetric difference
- Transpose a 3D array
- literal eval
- Difference between NaN and None
- str accessor
- Regular Expression
- Cartesian Product
- to_frame
- Date Manipulation
- Total Seconds
I spent an entire month answering pandas-related StackOverflow questions, earning over 1,000 reputation. Below are some of the more useful methods beyond the basics that can be used to solve commonly occurring data transformation problems.
import pandas as pd
import numpy as np
np.where
Pick one of two values element-wise based on a condition.
df = pd.DataFrame(['Dog', 'Cat'] * 2, columns=['Adult'])
df
df['Baby'] = np.where(df['Adult'].eq('Dog'), 'Puppy', 'Kitten')
df
np.select
Like np.where, but for multiple conditions; the first condition that matches wins.
df = pd.DataFrame({"color" : ["red", "green", "blue", "black", "white", "purple"]})
df
conditions = [df.color.isin(['red', 'green', 'blue']), df.color.eq('black') | df.color.eq('white')]
# create a list of choices; if both conditions are True, the first choice is chosen
choices = ['rgb', 'b&w']
df['category'] = np.select(condlist=conditions, choicelist=choices, default=df.color)
df
def func(df):
    if df.color in ['red', 'green', 'blue']:
        return 'rgb'
    elif df.color == 'black' or df.color == 'white':
        return 'b&w'
    else:
        return df.color
df['category'] = df.apply(func, axis=1)
df
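One detail worth remembering: np.select evaluates the conditions in order and returns the choice for the first one that is True. A small sketch with deliberately overlapping conditions (hypothetical data):

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 15, 25])
# the conditions overlap: anything greater than 10 is also greater than 0,
# but np.select returns the choice for the FIRST condition that is True
result = np.select([s > 10, s > 0], ['big', 'small'], default='other')
# 5 -> 'small', 15 -> 'big', 25 -> 'big'
```

So ordering the conditions from most to least specific matters.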
Mask / Where
A quick way to return a value when a condition is (not) met.
https://stackoverflow.com/questions/51982417/pandas-mask-where-methods-versus-numpy-np-where
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('id', 'sample'))
df
df.mask(df['id'].eq('2'), 'mask')
df.where(df['id'].eq('2'), 'where')
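The two methods are mirror images of each other, and np.where is their positional cousin. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# where KEEPS values where the condition holds and replaces the rest
kept = s.where(s > 2, 0)      # [0, 0, 3, 4]
# mask REPLACES values where the condition holds and keeps the rest
masked = s.mask(s > 2, 0)     # [1, 2, 0, 0]
# np.where takes condition, value-if-true, value-if-false
arr = np.where(s > 2, s, 0)   # [0, 0, 3, 4]
```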
Align dummy columns between train and test sets
Make sure the test set ends up with exactly the same dummy columns as the train set.
df = pd.DataFrame(['A', 'B', 'C', 'Z', 'A', 'C', 'D', 'E', 'X', 'Y'], columns=['Letter'])
df
train = pd.get_dummies(df.iloc[:4])
train
test = pd.get_dummies(df.iloc[4:])
test
test = test.reindex(columns=train.columns, fill_value=0)
test
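An alternative worth knowing is DataFrame.align, which lines up two frames in one call. A sketch with tiny hand-made dummy frames:

```python
import pandas as pd

train = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})
test = pd.DataFrame({'A': [1, 1], 'C': [0, 1]})
# align both frames on the train columns (join='left', axis=1);
# dummy columns missing from test are filled with 0
train_aligned, test_aligned = train.align(test, join='left', axis=1, fill_value=0)
```

Note this drops test-only columns (like 'C' here), which is usually what you want for a model trained on the train columns.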
Understanding Groupby
Grouping your data can solve a lot of transformation challenges. It's a good idea to understand how it works.
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve', 'Harry', 'Harry'],
    'Zip Code': [55555, 55555, 44444, 55555, 66666]
})
df
group = df.groupby('Name')
for key, dataframe in group:
    print(f'{key} \n {dataframe} \n')
group.apply(lambda x: type(x))
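Since the groupby object is just a mapping of keys to sub-dataframes, you can also count the groups or pull a single one back out; a short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve', 'Harry', 'Harry'],
    'Zip Code': [55555, 55555, 44444, 55555, 66666]
})
group = df.groupby('Name')
# number of distinct groups
n = group.ngroups
# pull a single group back out as a regular DataFrame
david = group.get_group('David')
```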
Groupby Filtering
If you want to select subsets of data within each group you can use filter.
https://stackoverflow.com/questions/58453245/pandas-match-and-unmatch-on-differnt-columns-in-python
df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve', 'Harry', 'Harry'],
    'Zip Code': [55555, 55555, 44444, 55555, 66666]
})
df
df.groupby('Name').nunique()
df.groupby('Name').filter(lambda x: x['Zip Code'].nunique() > 1)
# alternative using transform
df[(df.groupby('Name').transform('nunique') > 1).values]
Groupby Transform
Transform returns a result with the same length as the original dataframe, so a group aggregate can be broadcast back to every row.
df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve', 'Harry', 'Harry'],
    'Count': [10, 15, 20, 25, 5]
})
df
df.groupby('Name').sum()
df['Total'] = df.groupby('Name').transform('sum')
df
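Because transform keeps the original shape, the broadcast aggregate can be combined element-wise with the source column. A common sketch: each row's share of its group total.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['David', 'David', 'Steve', 'Harry', 'Harry'],
    'Count': [10, 15, 20, 25, 5]
})
# transform('sum') broadcasts each group's total back to its rows,
# so dividing gives every row's share of its group
df['Share'] = df['Count'] / df.groupby('Name')['Count'].transform('sum')
```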
Convert an Uneven Dictionary or List to a Dataframe
Uneven lengths can be padded with NaN by converting each value to a Series first.
d = dict(A=np.array([1, 2]), B=np.array([1, 2, 3, 4]))
d
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
mylist = [[1,2,3], [4,5], 6]
mylist
pd.DataFrame([ pd.Series(value) for value in mylist ])
Convert Uneven List of Tuples to a Dataframe
https://stackoverflow.com/questions/58450965/list-of-lists-conversion-to-pandas-dataframe
data = [
    [('category', 'evaluation'), ('polarity', 'pos'), ('strength', '1'), ('type', 'good')],
    [('category', 'intensifier'), ('type', 'shifter')],
    [('category', 'evaluation'), ('polarity', 'pos'), ('strength', '2'), ('type', 'good')]
]
[dict(e) for e in data]
df = pd.DataFrame(data=[dict(e) for e in data])
df
Decay ffill
Forward fill the last non-zero value while decaying it at every step.
s = pd.Series([0, 0, 10, 15, 0, 0, 25, 0, 0, 20])
s
decay = 0.8
(s.mask(s.eq(0)).ffill() * decay ** s.groupby(s.ne(0).cumsum()).cumcount()).fillna(0)
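That one-liner packs four ideas together; breaking it into named steps makes each one visible:

```python
import pandas as pd

s = pd.Series([0, 0, 10, 15, 0, 0, 25, 0, 0, 20])
decay = 0.8
# step 1: hide the zeros so ffill can carry the last real value forward
last_value = s.mask(s.eq(0)).ffill()
# step 2: start a new run id at every non-zero value
run_id = s.ne(0).cumsum()
# step 3: distance of each row from the start of its run
steps = s.groupby(run_id).cumcount()
# step 4: decay the carried value by that distance; leading NaNs become 0
result = (last_value * decay ** steps).fillna(0)
```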
Create Columns in a Loop
df = pd.DataFrame(np.arange(1,6), columns=['number'])
df
for num in df.number:
    df[f'number x {num}'] = num * df.number
df
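Inserting columns one at a time inside a loop can be slow on wide frames; one alternative sketch is to build all the new columns in a dictionary and concatenate once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 6), columns=['number'])
# build every new column up front, then attach them in a single concat
new_cols = pd.DataFrame({f'number x {num}': num * df['number'] for num in df['number']})
df = pd.concat([df, new_cols], axis=1)
```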
Explode
Turn each element of a list-like cell into its own row (added in pandas 0.25).
pd.__version__
df = pd.DataFrame([[['A', 'B', 'C']], [['D', 'E', 'F']]], columns=['lists'])
df
df.explode('lists')
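explode also pairs naturally with str.split when the lists start out as delimited strings; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'tags': ['A,B,C', 'D,E']})
# split the delimited strings into real lists, then explode one element per row
out = df.assign(tags=df['tags'].str.split(',')).explode('tags')
```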
Expand Dataframe based on a Date Range
df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-03'],
    'Product': ['A', 'B']
})
df['Date'] = pd.to_datetime(df['Date'])
df
df = df.set_index('Date')
df
date_range = pd.date_range(start=df.index.min(), end=pd.DateOffset(days=1) + df.index.max())
date_range
df.reindex(date_range).ffill()
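When you only need the days between the first and last dates (no extra day past the max, as the DateOffset above adds), resample is a shorter route; a sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B']},
                  index=pd.to_datetime(['2018-01-01', '2018-01-03']))
# resample fills in every day between the first and last index values
daily = df.resample('D').ffill()
```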
Wide to Long
Similar to melt, but it can handle prefixed column names.
https://stackoverflow.com/questions/58135102/pivot-tables-on-python
df = pd.DataFrame({
    'date': ['1/1/11', '2/1/11', '3/1/11'],
    'online_won': [9, 1, 10],
    'retail_won': [10, 2, 8],
    'outbound_won': [11, 13, 14],
    'online_leads': [12, 15, 17],
    'retail_leads': [14.0, np.nan, np.nan],
    'outbound_leads': [np.nan, np.nan, np.nan]
})
df
df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns ]
df.columns
df = pd.wide_to_long(df, ['won', 'leads'], 'date', 'source', sep='_', suffix=r'\w+')
df
Rolling Mean
An expanding mean averages everything seen so far; it is equivalent to a cumulative sum divided by the running count.
s = pd.Series(np.arange(1,10))
s
s.expanding().mean()
s.cumsum() / (s.index.values + 1)
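For contrast with the ever-growing expanding window, a rolling mean uses a fixed-size window; a short sketch on the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 10))
# fixed window of 3: the first two positions have fewer than 3 values
# and therefore come back as NaN
rolled = s.rolling(window=3).mean()
```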
Rank Multiple Columns
df = pd.DataFrame(data={'String':['a','a','a','a','b'],'Integer':[1,2,3,3,1]})
df
df = df.sort_values(['String','Integer'], ascending=[True, False])
df
df['rank'] = np.arange(len(df)) + 1
df
df['rank'] = df.groupby(['String', 'Integer'])['rank'].transform('mean')
df
df['rank'] = df.groupby(['String', 'Integer'], sort=False).ngroup().rank()
df
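Series.rank on its own covers the common tie-handling variants; a quick sketch with a made-up Series showing how each method treats the tied 20s:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])
avg = s.rank().tolist()                  # ties share the average of their ranks
dense = s.rank(method='dense').tolist()  # no gaps in the ranks after ties
first = s.rank(method='first').tolist()  # ties broken by position
```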
Select Dates Ignoring Year
df = pd.DataFrame({'date': pd.date_range('2016-09-01','2019-09-01', freq='D')})
df.head()
s = df.date.dt.strftime('%m%d').astype(int)
s.head()
df[s.between(316,318)]
Split a Dataframe np.split
df = pd.DataFrame({'A':np.arange(10), 'B':np.arange(10)})
df
df1, df2 = np.split(df, 2)
df1
df2
pd.concat([df1,df2.reset_index(drop=True)], axis=1)
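np.split raises when the pieces would be unequal; np.array_split relaxes that by letting the earlier pieces absorb the leftovers. A sketch splitting 10 rows three ways:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10)})
# 10 rows into 3 parts: array_split spreads the remainder over the first parts
parts = np.array_split(df, 3)
sizes = [len(p) for p in parts]
```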
Apply function with multiple arguments
https://stackoverflow.com/questions/58089770/using-apply-function-dataframe
import math
x = [-0.75853, -0.75853, -0.75853, -0.75852]
y = [-0.63435, -0.63434, -0.63435, -0.63436]
z = [-0.10488, -0.10490, -0.10492, -0.10495]
w = [-0.10597, -0.10597, -0.10597, -0.10596]
df = pd.DataFrame([x, y, z, w], columns=['x', 'y', 'z', 'w'])
df
def roll(qw, qx, qy, qz, var=10):
    # x-axis rotation
    sinr_cosp = +2.0 * (qw * qx + qy * qz)
    cosr_cosp = +1.0 - 2.0 * (qx * qx + qy * qy)
    roll = math.atan2(sinr_cosp, cosr_cosp)
    return roll * var
df.apply(lambda x : roll(x['w'], x['x'], x['y'], x['z'], 20), axis=1)
def roll(df, var=10):
    # x-axis rotation
    sinr_cosp = +2.0 * (df.w * df.x + df.y * df.z)
    cosr_cosp = +1.0 - 2.0 * (df.x * df.x + df.y * df.y)
    roll = math.atan2(sinr_cosp, cosr_cosp)
    return roll * var
df.apply(roll, args=(2,), axis=1)
Custom Styling
df = pd.DataFrame(data={'A': [0, 1, np.nan], 'B': [.5, np.nan, 0], 'C': [np.nan, 1, 1]})
df
from matplotlib.cm import get_cmap
cmap = get_cmap('PuBu')
# update with low-high option
def threshold(x, low=0, high=1, mid=0.5):
    # nan cell
    if np.isnan(x):
        return ''
    # non-nan cell
    x = (x - low) / (high - low)
    background = f'background-color: rgba{cmap(x, bytes=True)}'
    text_color = 'color: white' if x > mid else ''
    return background + ';' + text_color
# apply the style
df.style.applymap(threshold, low=-1, high=1, mid=0.3)
Reverse an Array
np.arange(1,11)
np.arange(1,11)[::-1]
X = np.arange(1,11).reshape(2,5)
X
np.fliplr(X)
Symmetric difference
Elements that appear in either of two sets, but not in both.
df1 = pd.DataFrame({'a':['a', 'b', 'c', 'x', 'd', 'l', 'z']})
df1
df2 = pd.DataFrame({'b': ['b', 'a', 'd', 'c', 'y']})
df2
pd.Index.symmetric_difference(pd.Index(df1.a), pd.Index(df2.b)).values
np.setdiff1d(np.union1d(df1.a, df2.b), np.intersect1d(df1.a, df2.b))
Transpose a 3D array
two_d = np.arange(16).reshape(8,2)
print(two_d.shape)
print(two_d.transpose().shape)
three_d = np.arange(20).reshape((1, 4, 5))
print(three_d.shape)
print(three_d.transpose(1,0,2).shape)
print(three_d.transpose(2,1,0).shape)
print(three_d.transpose(0,1,2).shape)
literal eval
Safely evaluate a string containing a Python literal.
from ast import literal_eval
df = pd.DataFrame({'list':['[1,2,3]', '[5,6,7]']})
df
df.list.map(lambda x: type(x))
df.list.map(lambda x: type(literal_eval(x)))
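The usual follow-up is to write the parsed values back into the column, at which point list-aware operations work again; a small sketch:

```python
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'list': ['[1,2,3]', '[5,6,7]']})
# parse the string representations into real Python lists
df['list'] = df['list'].map(literal_eval)
# with real lists in place, element-wise list operations work
lengths = df['list'].map(len)
```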
Difference between NaN and None
https://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
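The short version: None is a Python singleton object, NaN is a float, and NaN is not even equal to itself, yet pandas treats both as missing. A minimal sketch:

```python
import numpy as np
import pandas as pd

# None is a Python object; np.nan is a float
kind_none = type(None).__name__   # 'NoneType'
kind_nan = type(np.nan).__name__  # 'float'
# NaN is not equal to itself, which is why == comparisons miss it
nan_eq = (np.nan == np.nan)       # False
# pandas treats both as missing values
missing = pd.isna(pd.Series([np.nan, None])).tolist()
```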
str accessor
Vectorized string methods on a Series.
s = pd.Series(['ABC','DEF'])
s
s.str.contains('A')
s.str.contains('A|E') # A or E
s.str[::-1]
s.str[:2]
Regular Expression
https://stackoverflow.com/questions/4736/learning-regular-expressions
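Regular expressions combine with the str accessor through methods like str.extract; a sketch with made-up strings, where named capture groups become column names:

```python
import pandas as pd

s = pd.Series(['item 1: $5.00', 'item 2: $10.50'])
# named groups (?P<name>...) turn into columns of the result
out = s.str.extract(r'item (?P<id>\d+): \$(?P<price>\d+\.\d+)')
```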
Cartesian Product
Every combination of the rows from two dataframes.
df_A = pd.DataFrame({'A':[1,2,3]})
df_B = pd.DataFrame({'B':['A','B','C']})
df_A
df_B
df_A.assign(key=1).merge(df_B.assign(key=1), on='key').drop('key', axis=1)
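Newer pandas (1.2+) supports the cross join directly, so the dummy key trick is no longer necessary there:

```python
import pandas as pd

df_A = pd.DataFrame({'A': [1, 2, 3]})
df_B = pd.DataFrame({'B': ['A', 'B', 'C']})
# how='cross' pairs every row of df_A with every row of df_B
result = df_A.merge(df_B, how='cross')
```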
to_frame
Convert a Series into a single-column DataFrame.
df = pd.DataFrame({'A':[1,2,3]})
df
df.A
print(type(df.A))
df.A.to_frame()
type(df.A.to_frame())
Date Manipulation
dates = pd.Series(pd.to_datetime(['1/1/2016', '12/15/2017', '10/31/2019']))
dates
dates + pd.DateOffset(day=1)
dates + pd.offsets.MonthEnd(0)
dates.dt.month
dates.dt.day_name()
Total Seconds
(pd.to_datetime('1/5/2019 00:01:35') - pd.to_datetime('1/1/2019 00:05:00')).seconds
(pd.to_datetime('1/5/2019 00:01:35') - pd.to_datetime('1/1/2019 00:05:00')).total_seconds()
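The difference between the two matters as soon as the delta spans more than a day: .seconds is only the seconds component of the timedelta, while .total_seconds() converts the whole thing.

```python
import pandas as pd

delta = pd.to_datetime('1/5/2019 00:01:35') - pd.to_datetime('1/1/2019 00:05:00')
# delta is 3 days 23:56:35
# .seconds is just the 23:56:35 component; the 3 days are dropped
component = delta.seconds         # 86195
# .total_seconds() includes the days
total = delta.total_seconds()     # 345395.0
```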