The Iris dataset is one of the easiest and most straightforward datasets to use.
We can represent the Iris dataset as a matrix in the following format, with one row per flower and one column per feature.
| | Petal length | Petal width | Sepal length | Sepal width |
| Flower-1 | | | | |
| Flower-2 | | | | |
| : | | | | |
| Flower-n | | | | |
However, more often than not, a dataset also contains labels or output values.
Dataset D is mathematically expressed as
$$D = \{(x_i, y_i)\}_{i=1}^{n}$$
Furthermore, most labeled datasets contain well-defined class labels. In the case of the Iris dataset, the class labels are the flower names.
Here, each $x_i$ is a vector of four real-valued measurements (one per column of the matrix), whereas $y_i$ is a value from the set of flower names.
| | Petal length | Petal width | Sepal length | Sepal width | Flower type |
| Flower-1 | | | | | Virginica |
| Flower-2 | | | | | Setosa |
| : | | | | | |
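As a minimal sketch of this representation, the snippet below stores a few flowers as a feature matrix `X`, a label list `y`, and the paired dataset `D`. The six rows are a small sample of measurements from the standard Iris data, and the names `FEATURES`, `X`, `y`, and `D` are chosen here for illustration only.

```python
# Column order follows the table: petal length, petal width,
# sepal length, sepal width (all in centimetres).
FEATURES = ["petal_length", "petal_width", "sepal_length", "sepal_width"]

# Feature matrix: one row per flower, one column per feature.
# Only six sample rows are listed; the full dataset has many more.
X = [
    [1.4, 0.2, 5.1, 3.5],
    [1.4, 0.2, 4.9, 3.0],
    [4.7, 1.4, 7.0, 3.2],
    [4.5, 1.5, 6.4, 3.2],
    [6.0, 2.5, 6.3, 3.3],
    [5.1, 1.9, 5.8, 2.7],
]

# Class labels: one flower name per row of X.
y = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"]

# D = {(x_i, y_i)} for i = 1..n, stored as a list of (features, label) pairs.
D = list(zip(X, y))

n = len(D)          # number of flowers in this sample
d = len(FEATURES)   # number of features per flower
print(f"n = {n} flowers, d = {d} features")
for x_i, y_i in D:
    print(x_i, "->", y_i)
```

Each `x_i` is a list of four real numbers, while each `y_i` comes from a small fixed set of names, which is what makes this a classification dataset.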
Using pair plots, we can show how well the features differentiate between the flower types.
In this analysis, we use this dataset to introduce readers to exploratory data analysis.
Plotting the features of the different species against each other helps in finding relations that are useful for distinguishing between these flowers.
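A pair plot draws one scatter plot for every pair of features, colored by species. The sketch below first lists those pairs, then shows one possible way to draw the plot. It assumes the seaborn library and its bundled `iris` sample dataset, whose column names match the ones used in this post. The original analysis may have used a different tool, and the helper names `pair_plot_panels` and `draw_pair_plot` are hypothetical.

```python
from itertools import combinations

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def pair_plot_panels(features):
    """Return the unique feature pairs that a pair plot compares."""
    return list(combinations(features, 2))


def draw_pair_plot():
    """Draw the pair plot (assumes seaborn and matplotlib are installed).

    seaborn.load_dataset fetches the sample data on first use, so this
    also assumes network access or a local seaborn data cache.
    """
    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = sns.load_dataset("iris")
    # hue="species" colors each point by its flower type.
    sns.pairplot(iris, hue="species")
    plt.show()


panels = pair_plot_panels(FEATURES)
print(len(panels), "unique feature pairs:")
for x_name, y_name in panels:
    print(f"  {x_name} vs {y_name}")
```

Calling `draw_pair_plot()` opens the figure. With four features there are six unique pairs to inspect, which is small enough to read at a glance.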
Some of the observations we made were:
- Using the sepal_length and sepal_width features, we can distinguish Setosa flowers from the others.
- Separating Versicolor from Virginica is much harder as they have considerable overlap.
- petal_length and petal_width are the most useful features to identify various flower types.
- While Setosa can be easily identified (linearly separable), Virginica and Versicolor have some overlap (almost linearly separable).
We can find "lines" and "if-else" conditions to build a simple model that classifies the flower types.
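Such a model could look like the sketch below. The two thresholds are illustrative assumptions of the kind one might read off a pair plot, not values taken from this analysis. Because Versicolor and Virginica overlap, a rule this simple will misclassify some flowers near the boundary.

```python
# Hypothetical cut-off values (in cm), chosen for illustration only.
SETOSA_MAX_PETAL_LENGTH = 2.5      # Setosa petals are much shorter
VERSICOLOR_MAX_PETAL_WIDTH = 1.75  # rough Versicolor/Virginica boundary


def classify(petal_length, petal_width):
    """Classify a flower with two if-else rules on the petal features."""
    if petal_length < SETOSA_MAX_PETAL_LENGTH:
        return "Setosa"
    if petal_width < VERSICOLOR_MAX_PETAL_WIDTH:
        return "Versicolor"
    return "Virginica"


# Three sample flowers: (petal_length, petal_width, known flower type).
samples = [
    (1.4, 0.2, "Setosa"),
    (4.7, 1.4, "Versicolor"),
    (6.0, 2.5, "Virginica"),
]
for petal_length, petal_width, expected in samples:
    predicted = classify(petal_length, petal_width)
    print(f"{petal_length}, {petal_width} -> {predicted} (known: {expected})")
```

Each `if` corresponds to one "line" in the pair plot: the first separates Setosa from the rest, and the second approximately separates Versicolor from Virginica.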