- Load the Dataset
import pandas as pd
dataset = pd.read_csv("filename")
- Understand the variables inside of it
If available, read the data dictionary; it is usually published on the website that hosts the data.
- Get a feeling for the dataset
print(dataset.shape)
dataset.info()
dataset.info() prints information about the dataset, listing the column types and the number of non-null values.
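If the counts that dataset.info() prints are needed as values rather than as printed text, something along these lines may help. This is only a minimal sketch; the toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
dataset = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Bern", None, "Basel", "Bern"],
})

# Column types, as also listed by dataset.info()
print(dataset.dtypes)

# Number of non-null values per column
non_null_counts = dataset.notna().sum()
print(non_null_counts)

# Number of missing values per column
missing_counts = dataset.isna().sum()
print(missing_counts)
```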
Display the first 5 rows:
print(dataset.head())
Get an overview of the minimum, maximum and mean of each numeric column, as well as the 25th, 50th and 75th percentiles:
print(dataset.describe())
or if we want it for a single column:
print(dataset["columnName"].describe())
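To make the describe() output easier to picture, here is a small sketch on made-up data. The column names "height_cm" and "label" are hypothetical; the point is that describe() only summarises numeric columns unless told otherwise, and that single statistics can be read out of the result.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
dataset = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "label": ["a", "b", "c", "d", "e"],
})

# By default, describe() summarises numeric columns only
summary = dataset.describe()
print(summary)

# Individual statistics can be looked up by row label and column name
print(summary.loc["mean", "height_cm"])
print(summary.loc["50%", "height_cm"])

# include="all" also summarises non-numeric columns (count, unique, top, freq)
print(dataset.describe(include="all"))
```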
- Remove any columns we might not need
Maybe we have already identified some columns that are obviously useless?
# drop() returns a new DataFrame, so assign the result back
dataset = dataset.drop(columns=["uselesscolumn1", "uselesscolumn2"])
We could also decide to temporarily set aside the columns whose type differs from the rest (for example, text columns among numeric ones) to simplify the first steps, and bring them back at a later date.
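One possible way to split off columns by type is select_dtypes. The sketch below assumes we want to keep the numeric columns for now; the toy data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
dataset = pd.DataFrame({
    "price": [9.5, 12.0, 7.25],
    "quantity": [3, 1, 4],
    "product": ["apple", "pear", "plum"],
})

# Keep only the numeric columns for the first exploration steps
numeric_part = dataset.select_dtypes(include="number")

# Keep the remaining columns around so they can be used again later
other_part = dataset.select_dtypes(exclude="number")

print(numeric_part.columns.tolist())
print(other_part.columns.tolist())
```

Both calls return new DataFrames, so the original dataset stays untouched.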
Speedup tricks
Sample the data
# or: sampled_df = dataset.sample(n=number_of_samples)
sampled_df = dataset.sample(frac=0.25)
See the Bootstrap method for more information.
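As a rough sketch of how the Bootstrap idea relates to sample(): bootstrap resampling draws rows with replacement, typically as many rows as the original dataset has, and repeats this many times to see how much a statistic varies. The column name "value", the number of resamples and the use of random_state for reproducibility are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical toy data; the column name is an assumption for illustration.
dataset = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12, 14, 16]})

n_resamples = 200  # arbitrary choice for this sketch
bootstrap_means = []
for i in range(n_resamples):
    # frac=1.0 with replace=True draws as many rows as the original,
    # allowing the same row to be picked more than once
    resample = dataset.sample(frac=1.0, replace=True, random_state=i)
    bootstrap_means.append(resample["value"].mean())

bootstrap_means = pd.Series(bootstrap_means)

# Spread of the mean across the resamples
print(bootstrap_means.mean())
print(bootstrap_means.quantile([0.025, 0.975]))
```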
Plotting
Maybe we want to plot our data/columns?
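A minimal plotting sketch, assuming matplotlib is installed (pandas uses it for its plotting methods). The toy data and column names are made up; the backend line is only there so the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
dataset = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 180, 190],
    "weight_kg": [50, 58, 63, 70, 82, 95],
})

# One histogram per numeric column
hist_axes = dataset.hist(bins=5, figsize=(8, 3))

# Scatter plot of one column against another
scatter_ax = dataset.plot.scatter(x="height_cm", y="weight_kg")

# In an interactive session, plt.show() would display the figures;
# plt.savefig("some_name.png") would write the current figure to disk.
```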