1. Load the Dataset
import pandas as pd

dataset = pd.read_csv("filename")
  1. Understand the variables inside of it

If a data dictionary is available, read it; it should be published on the website that hosts the data.

  1. Get a feeling for the dataset
print(dataset.shape)
dataset.info()

dataset.info() prints a summary of the dataset, listing each column's type and its number of non-null values.

Look at the first 5 entries:

print(dataset.head())

Get an overview of each numeric column: count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles.

print(dataset.describe())

or if we want it for a single column:

print(dataset["columnName"].describe())
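The calls above mostly summarise numeric columns. As a minimal sketch, assuming a small made-up DataFrame (the column names "age" and "city" are hypothetical), missing values and non-numeric columns could be inspected like this:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
dataset = pd.DataFrame({
    "age": [23, None, 41, 35],
    "city": ["Bern", "Oslo", None, "Oslo"],
})

# Number of missing values per column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# describe() skips non-numeric columns by default;
# include="all" also reports count, unique, top and freq for them.
print(dataset.describe(include="all"))

# Frequency of each value in a single (categorical) column.
city_counts = dataset["city"].value_counts()
print(city_counts)
```

Columns with many missing values are often candidates for the removal step that follows.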
  1. Remove any columns we might not need

Maybe we have already identified some columns that are obviously useless?

# drop() returns a new DataFrame, so assign the result back
dataset = dataset.drop(columns=["uselesscolumn1", "uselesscolumn2"])

We could also set aside the columns whose type differs from the rest (for example, text columns in a mostly numeric dataset) to simplify the first steps, and bring them back later.
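One possible way to split columns by type is select_dtypes. The sketch below assumes the "different" columns are the non-numeric ones; the toy data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
dataset = pd.DataFrame({
    "age": [23, 35, 41],
    "income": [32000.0, 54000.5, 61000.0],
    "city": ["Bern", "Oslo", "Lyon"],
})

# Keep only the numeric columns for the first pass.
numeric_part = dataset.select_dtypes(include="number")

# Keep the remaining columns aside so they can be used again later.
other_part = dataset.select_dtypes(exclude="number")

print(numeric_part.columns.tolist())
print(other_part.columns.tolist())
```

Both parts keep the original index, so they can be joined back together later with numeric_part.join(other_part).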

Speedup tricks

Sample the data

# or: sampled_df = dataset.sample(n=numbersamples)
sampled_df = dataset.sample(frac=0.25)

See the Bootstrap method for more information.
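Note that sample() draws without replacement by default, whereas the bootstrap resamples with replacement. A minimal sketch of a bootstrap estimate for a column mean, assuming made-up data and an arbitrary choice of 1000 resamples:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
dataset = pd.DataFrame({"value": [2.0, 4.0, 6.0, 8.0, 10.0]})

n_resamples = 1000  # assumption: chosen only for illustration

# Each resample has the same size as the original (frac=1.0) and is
# drawn with replacement; the seed makes the run reproducible.
boot_means = pd.Series([
    dataset["value"].sample(frac=1.0, replace=True, random_state=seed).mean()
    for seed in range(n_resamples)
])

# Rough 95% interval for the mean, taken from the resampled means.
lower, upper = boot_means.quantile([0.025, 0.975])
print(f"mean: {dataset['value'].mean():.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```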

Plotting

Maybe we want to plot our data/columns?
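As a rough starting point, pandas can plot directly from a DataFrame when matplotlib is installed. The data and column names below are hypothetical; this is a sketch of one option, not the only way to plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
dataset = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "income": [32000, 54000, 61000, 40000, 75000, 68000],
})

# One histogram per numeric column.
axes = dataset.hist(bins=5, figsize=(8, 3))
plt.tight_layout()

# Scatter plot of one column against another.
ax = dataset.plot.scatter(x="age", y="income")
```

In a script, call plt.show() afterwards to display the figures; in a notebook they are usually rendered automatically.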