Python Data Operations 3: Filtering
(Using the numpy and pandas packages imported in section one.)
This third section contains:
Conditional Selection
pd.Series and Operators
Basic Filters
Missing Values
Identify Nulls
Filter Nulls
Fill in Nulls
Remove Nulls
Unique Values
# create test dataframe
test_df = pd.DataFrame([
['A3', 0, -1, 0, 'si'],
['B1', 1, None, 0, 'no'],
['B3', 4, None, 0, 'no'],
['B3', 5, 1, 0, 'si'],
['A1', 4, 0, None, None],
['A3', 1, 2, 1, 'si'],
['C2', 4, 1, 1, 'no']],
columns=['A', 'B', 'C', 'D', 'E'],
index=[f'R{i}' for i in range(7)]
)
test_df
     A  B    C    D     E
R0  A3  0 -1.0  0.0    si
R1  B1  1  NaN  0.0    no
R2  B3  4  NaN  0.0    no
R3  B3  5  1.0  0.0    si
R4  A1  4  0.0  NaN  None
R5  A3  1  2.0  1.0    si
R6  C2  4  1.0  1.0    no
Conditional Selection
In pandas, conditional selection means filtering records according to some criterion.
The syntax is df[filter], where filter is a sequence of boolean values of the same length
as the table. Rows whose value is True are kept and rows whose value is False are
dropped.
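As a minimal sketch of the df[filter] syntax on its own, the snippet below uses a small hypothetical table (demo_df and mask are names introduced here for illustration, not part of this section's data). It also shows what happens when the boolean sequence does not have one entry per row.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration
demo_df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# One boolean per row: True keeps the row, False drops it
mask = [True, False, True]
selected = demo_df[mask]
print(selected)

# A boolean sequence with the wrong length is rejected with an error
try:
    demo_df[[True, False]]
    length_error = False
except (ValueError, IndexError):
    length_error = True
print(length_error)
```

Only rows 'a' and 'c' survive the first selection, and the two-element sequence fails because the table has three rows.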
# create simple filter
filter = [True, True, False, False, True, True, False]
filter
[True, True, False, False, True, True, False]
# apply simple filter
# .iloc and .loc give the same result for a plain list of booleans
test_df.iloc[filter, :]
test_df.loc[filter, :]
     A  B    C    D     E
R0  A3  0 -1.0  0.0    si
R1  B1  1  NaN  0.0    no
R4  A1  4  0.0  NaN  None
R5  A3  1  2.0  1.0    si
# select columns with a list of column names
columns = ['A', 'B', 'C']
# apply columns filter
test_df.loc[:, columns]
test_df[columns]  # equivalent to the previous line
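The row filter and the column list can also be combined in a single .loc call. The sketch below rebuilds the section's test_df so that it runs on its own. The name row_mask is introduced here in place of filter, which is an assumption of this example made only to avoid shadowing Python's built-in filter function.

```python
import pandas as pd

# Same test dataframe as at the top of this section
test_df = pd.DataFrame([
    ['A3', 0, -1, 0, 'si'],
    ['B1', 1, None, 0, 'no'],
    ['B3', 4, None, 0, 'no'],
    ['B3', 5, 1, 0, 'si'],
    ['A1', 4, 0, None, None],
    ['A3', 1, 2, 1, 'si'],
    ['C2', 4, 1, 1, 'no']],
    columns=['A', 'B', 'C', 'D', 'E'],
    index=[f'R{i}' for i in range(7)]
)

# Boolean list for the rows, list of names for the columns
row_mask = [True, True, False, False, True, True, False]
columns = ['A', 'B', 'C']

# Rows are filtered by the mask and columns by name in one step
subset = test_df.loc[row_mask, columns]
print(subset)
```

The result keeps rows R0, R1, R4 and R5 and only columns A, B and C.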