Yes it can be googled, or gpt'ed, but I should know this by heart.
Both Series and Dataframes handle arithmetic operations much like numpy arrays do (broadcasting etc.), as long as their types allow it.
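A minimal sketch of what that looks like in practice. The values and the names `s_plus_one`, `s_doubled`, `centered` and `roots` are made up for illustration. One difference from numpy worth remembering: two pandas objects are aligned on their labels before the operation is applied.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# A scalar is broadcast to every element, as with a numpy array.
s_plus_one = s + 1

# Two Series are combined element by element, aligned on their index labels.
s_doubled = s + s

# df.mean() is a Series indexed by column name; it is broadcast across the rows.
centered = df - df.mean()

# numpy functions can be applied to pandas objects directly.
roots = np.sqrt(df["a"])

print(s_plus_one.tolist())     # [2, 3, 4]
print(centered["a"].tolist())  # [-1.0, 0.0, 1.0]
```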
Series
One-dimensional labelled arrays that can hold values of different types. A Series corresponds to one column of a dataframe.
Access
Please access the elements at positions 1 to 4 of a pandas series {python}s
?
{python}result = s[1:5]
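To keep position and label apart, here is a small sketch with a Series whose index is not the default one. The string labels and values are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60], index=list("abcdef"))

by_position = s.iloc[1:5]   # positions 1, 2, 3, 4; the end is exclusive
by_label = s.loc["b":"e"]   # labels "b" through "e"; the end is inclusive

print(by_position.tolist())  # [20, 30, 40, 50]
print(by_label.tolist())     # [20, 30, 40, 50]
```

Spelling out `.iloc` or `.loc` removes any doubt about which of the two is meant.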
Please convert a pandas series {python}s
to:
- A list
- A numpy array
- A dict
- A string
- A dataframe
?
s.to_list()
s.to_numpy()
s.to_dict()
s.to_string()
s.to_frame(name=COLUMNNAME)
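A quick sketch of what each conversion keeps and drops, assuming a small Series with made-up labels. The column name "points" is only an example.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["x", "y", "z"], name="score")

as_list = s.to_list()                 # plain Python list; the index is dropped
as_array = s.to_numpy()               # numpy array of the values; the index is dropped
as_dict = s.to_dict()                 # index labels become the dict keys
as_text = s.to_string()               # printable text representation
as_frame = s.to_frame(name="points")  # one-column dataframe with the same index

print(as_list)   # [1, 2, 3]
print(as_dict)   # {'x': 1, 'y': 2, 'z': 3}
```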
Dataframes
Creation
Please create a dataframe from the following dictionary
{python}data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
?
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
Please create a dataframe from a list of lists. Name the columns accordingly.
{python}data = [['Alex',10],['Bob',12],['Clarke',13]]
?
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data, columns=["Name", "Age"])
Please create a dataframe from a list of dictionaries. Notice how not all features are present in the first dictionary. What if you want to specify the features/columns?
{python}data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
?
With all columns:
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data) # the first dict has no 'c', so its row gets NaN in column 'c'
Selecting only columns a and b, disregarding c:
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, columns=['a', 'b']) # column 'c' will not be created
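A side effect worth checking, as a small sketch: the missing value also changes the column's dtype. The names `df_all` and `df_ab` are only for illustration.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

df_all = pd.DataFrame(data)
print(pd.isna(df_all.loc[0, 'c']))  # True
print(df_all['c'].dtype)            # float64, because NaN is a float

df_ab = pd.DataFrame(data, columns=['a', 'b'])
print(df_ab.columns.tolist())       # ['a', 'b']
```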
Access
Each row of a dataframe usually represents a datapoint, each column a feature. To access the rows, we use the index.
An index is created by default. It runs from 0 to n-1, where n is the number of rows.
Please access row 50 of a dataframe {python}df
?
{python}df.iloc[49]  # positions start at 0, so the 50th row sits at position 49
Please access rows 1 to 3 of dataframe {python}df
?
{python}print(df.iloc[1:4])
Please access the (row) index of a dataframe.
?
{python}df.index
Please access the columns of a dataframe.
?
{python}df.columns
# an Index object, usually holding strings (the column names)
Please rename the dataframe columns {python}column1
to {python}age
?
{python}df = df.rename(columns={"column1":"age"})
Please select rows 1 to 5 and the columns "column1", "column3" from the dataframe {python}df
?
print(df.loc[1:5, ["column1", "column3"]]) # .loc slices by label and includes the end, so row 5 is returned as well
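The difference between `.iloc` and `.loc` is easiest to see when labels and positions do not coincide. A sketch, assuming a made-up dataframe whose index starts at 100:

```python
import pandas as pd

df = pd.DataFrame(
    {"column1": range(10), "column3": range(10, 20)},
    index=range(100, 110),  # labels 100..109, so labels and positions differ
)

by_position = df.iloc[1:4]                          # positions 1, 2, 3 (end exclusive)
by_label = df.loc[101:103, ["column1", "column3"]]  # labels 101, 102, 103 (end inclusive)

print(by_position.index.tolist())  # [101, 102, 103]
print(by_label.index.tolist())     # [101, 102, 103]
```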
Filtering
Please filter a dataframe: select only the rows where the age is between 30 and 40
?
{python}filtered_df = df[(df['Age'] > 30) & (df['Age'] < 40)]
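Whether "between" includes 30 and 40 matters. A sketch with made-up data comparing the strict comparison with `Series.between`, which includes both ends by default. The parentheses around each comparison are needed because `&` binds more tightly than `>` and `<`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jack", "Steve", "Ricky"], "Age": [28, 34, 30, 42]})

exclusive = df[(df["Age"] > 30) & (df["Age"] < 40)]
inclusive = df[df["Age"].between(30, 40)]  # both ends are included by default

print(exclusive["Name"].tolist())  # ['Jack']
print(inclusive["Name"].tolist())  # ['Jack', 'Steve']
```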
Deletion
Please drop row 43 from dataframe {python}df
. What if you want to drop multiple rows? Please show it with a list and with slicing.
?
{python}df.drop(43)
{python}df.drop([1,2,3,4,43])
{python}df.drop(df.index[1:32])
Please delete columns {python}"column1", "column3"
from {python}df
?
df = df.drop(columns=["column1", "column3"])
# or use inplace
df.drop(columns=["column1", "column3"], inplace=True) # returns None
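Easy to forget: without `inplace=True`, `drop` leaves the original untouched and returns a new dataframe. A small sketch with made-up data; the names `dropped`, `sliced` and `no_col` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"column1": range(5), "column3": range(5)})

dropped = df.drop([1, 2])        # new dataframe without the rows labelled 1 and 2
print(len(df), len(dropped))     # 5 3

sliced = df.drop(df.index[1:4])  # df.index[1:4] picks the labels at positions 1..3
print(sliced.index.tolist())     # [0, 4]

no_col = df.drop(columns=["column3"])
print(no_col.columns.tolist())   # ['column1']
```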
Insertion
Please add a row to a dataframe. Assume that the dataframe index is numerical from 0 to n-1.
import pandas as pd
X = {"column1": [1,2,3,4,5], "column2": [6,7,8,9,10]}
df = pd.DataFrame(X)
# add a row to the dataframe here
?
new_row = {'column1':4, 'column2':98}
df.loc[len(df)] = new_row
Please insert column "C" into an existing dataframe {python}df
?
{python}df['C'] = [7, 9, 19]
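A sketch that puts both insertions together, with made-up values. Here the new row is given as a list in column order rather than as a dict. The column "D" is a hypothetical extra to show that a scalar works too.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [4, 5, 6]})

# Append a row at the next integer label (assumes the index is 0..n-1).
df.loc[len(df)] = [4, 98]

# A new column needs exactly one value per row; there are four rows now.
df["C"] = [7, 9, 19, 21]

# A scalar is broadcast to every row.
df["D"] = 0

print(df.shape)  # (4, 4)
```

If the list for the new column has the wrong length, pandas raises a ValueError.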
Modification
What does the {python}df.apply(...)
function do and why use it?
?
{python}df = df.apply(func, axis=0, ...)
It calls func once per column (axis=0) or once per row (axis=1) and returns the combined result as a new object. The dataframe itself only changes if you assign the result back. apply is essentially a loop over the columns or rows, so unless your func is vectorized, I would avoid it and use a built-in operation instead.
Example: subtract the mean of each column:
def f(col):
    mean = col.mean()
    return col - mean
df = df.apply(f) # axis = 0, so the func f is applied to each column
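For this particular example, apply is not needed at all. A sketch with made-up values comparing it with the plain broadcast version:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

def f(col):
    return col - col.mean()

via_apply = df.apply(f)      # f is called once per column (axis=0)
vectorized = df - df.mean()  # same result without apply

print(via_apply.equals(vectorized))  # True
print(df["a"].tolist())              # [1.0, 2.0, 3.0]; df itself is unchanged
```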
Please change the value in row 7, column "column3", from 4 to 23.
?
{python}df.loc[7, "column3"] = 23
Other
Please concatenate two dataframes: {python}df1, df2
. Assume that the indexes will overlap and can be ignored
?
{python}combined_df = pd.concat([df1, df2], ignore_index=True)
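What `ignore_index` changes, as a small sketch with two made-up dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([df1, df2])                           # original labels are kept
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh index 0..n-1

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With duplicate labels, `kept.loc[0]` returns two rows, which is rarely what you want.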
I/O tools with pandas
You have a csv file. Please read it and convert it to a dataframe
?
{python}df = pd.read_csv(filepath)
You have json data. Please convert it to a dataframe.
?
{python}df = pd.read_json(filepath)
Please save a dataframe to a csv file
?
{python}df.to_csv(filepath)
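A round-trip sketch. It uses an in-memory `io.StringIO` buffer in place of a real file path, purely so the example runs without touching the disk. By default `to_csv` also writes the row index as the first column; `index=False` leaves it out.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Jack"], "age": [28, 34]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row index out of the output
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())  # ['name', 'age']
```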
You have an excel file, please read it and convert it to a dataframe
?
{python}pd.read_excel(filepath)
You have an excel file, please read {python}"sheet2"
from it.
?
{python}pd.read_excel(filepath, sheet_name="sheet2")
Please use IPython to display a dataframe in a nice way.
?
from IPython.display import display
display(df)
Practical tips
Let's say you have multiple files with data. One column of the data is the id. This ID is unique. It is possible to combine data into one dataframe using that id.
It is then a good idea to use that "id" column as an index.
Please set an "id" column as the index of a dataframe {python}df
.
?
df.index = df["id"]
df = df.drop(columns=["id"])
display(df)
Or shorter:
df.set_index(keys=["id"], inplace=True) # here we could set more than just one column.
Note that the parameter is called keys and not columns (which would seem more consistent with other library methods). The reason is that it also accepts external keys, such as a list, an array or a pandas series of the same length as the dataframe, even though I don't see a reason to ever do this.
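A sketch of the combining idea from the start of this section, assuming two made-up tables (`people`, `ages`) that share a unique "id" column. Once "id" is the index, `join` matches rows by id.

```python
import pandas as pd

# Two hypothetical tables that share a unique "id" column.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Tom", "Jack", "Steve"]})
ages = pd.DataFrame({"id": [3, 1, 2], "age": [29, 28, 34]})

people = people.set_index("id")
ages = ages.set_index("id")

# join aligns rows on the index labels, so the row order in `ages` does not matter.
combined = people.join(ages)
print(combined.loc[3, "age"])  # 29
```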
Styling
Everything about how the dataframe is visualised: stylistic changes that take effect when calling {python}display(df)
Please make sure that {python}display(df)
shows all rows of a dataframe in an interactive python environment. Please use a context manager.
?
# best practice is to use a context manager
with pd.option_context("display.max_rows", None):
    display(df) # will show all rows
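Why the context manager is the safer choice, as a small sketch: the option is only changed inside the `with` block and restored afterwards. The variable names are only for illustration.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")  # None means "no limit"

after = pd.get_option("display.max_rows")

print(inside)           # None
print(after == before)  # True
```

Using `pd.set_option` instead would change the setting for the rest of the session.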
Please give a dataframe {python}df
a title
?
{python}display(df.style.set_caption("dataframe title"))
Careful: this call returns a Styler object, not a dataframe. You can still display it, however.
Please highlight the maximum value in each column of a dataframe {python}df
?
{python}display(df.style.highlight_max(axis = 0))
Please modify the style of one column of the dataframe {python}df
. Use the following values:
{
"background-color": "#00BFA5",
"color": "#000000",
}
?
style_params = {"background-color": "#00BFA5", "color": "#000000"}
display(df.style.set_properties(subset=["Col1"], **style_params))
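Why unpack a dict with **? {python}set_properties takes arbitrary keyword arguments and turns each one into a CSS property-value pair. A name like background-color contains a hyphen, so it cannot be written as a Python keyword argument directly. Unpacking a dict gets around that.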
Please apply the following css style {python}"background-color: #00BFA5; color: #000000"
to all cells in a dataframe column, if their values are above 2
?
def highlight_values(val):
    style = "background-color: #00BFA5; color: #000000" if val > 2 else ''
    return style
display(
    df.style.apply(lambda col: col.map(highlight_values))
)