Commit 3abcc61

authored
Update README.md
1 parent 586b07b commit 3abcc61

1 file changed

+88
-108
lines changed

README.md

Lines changed: 88 additions & 108 deletions
@@ -31,138 +31,118 @@ most of these books share the following steps (checklist):
3131
## 2-1 Real world Application Vs Competitions
3232
<img src="http://s9.picofile.com/file/8339956300/reallife.png" height="400" width="300" />
3333
<a id="3"></a> <br>
34-
# 3- Problem Definition
34+
## 3- Problem Definition
3535
I think one of the most important things when you start a new machine learning project is defining your problem; that means you should understand the business problem (**Problem Formalization**).
3636

3737
Problem Definition has four steps, which are illustrated in the picture below:
38-
<img src="http://s8.picofile.com/file/8338227734/ProblemDefination.png">
39-
<a id="4"></a> <br>
40-
### 3-1 Problem Feature
41-
we will use the classic Iris data set. This dataset contains information about three different types of Iris flowers:
42-
43-
* Iris Versicolor
44-
* Iris Virginica
45-
* Iris Setosa
46-
47-
The data set contains measurements of four variables :
48-
49-
* sepal length
50-
* sepal width
51-
* petal length
52-
* petal width
53-
54-
The Iris data set has a number of interesting features:
55-
56-
1. One of the classes (Iris Setosa) is linearly separable from the other two. However, the other two classes are not linearly separable.
57-
58-
2. There is some overlap between the Versicolor and Virginica classes, so it is unlikely to achieve a perfect classification rate.
59-
60-
3. There is some redundancy in the four input variables, so it is possible to achieve a good solution with only three of them, or even (with difficulty) from two, but the precise choice of best variables is not obvious.
61-
62-
**Why am I using this dataset:**
63-
64-
1- This is a good project because it is so well understood.
65-
66-
2- Attributes are numeric so you have to figure out how to load and handle data.
67-
68-
3- It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
69-
70-
4- It is a multi-class classification problem (multi-nominal) that may require some specialized handling.
71-
72-
5- It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
73-
74-
6- All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.[5]
75-
76-
7- we can define problem as clustering(unsupervised algorithm) project too.
77-
<a id="5"></a> <br>
38+
<img src="http://s8.picofile.com/file/8344103134/Problem_Definition2.png" width=400 height=400>
39+
## 3-1 Problem Feature
40+
The sinking of the Titanic is one of the most infamous shipwrecks in history. **On April 15, 1912**, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing **1502 out of 2224** passengers and crew. That's why it is nicknamed DieTanic. It is an unforgettable disaster.
41+
42+
It cost about $7.5 million to build the Titanic, and it sank after the collision. The Titanic dataset is a very good dataset for beginners to start a journey in data science and participate in Kaggle competitions.
43+
44+
We will use the classic Titanic data set. This dataset contains information about **11 different variables**:
45+
<img src="http://s9.picofile.com/file/8340453092/Titanic_feature.png" height="500" width="500">
46+
47+
1. Survival
48+
1. Pclass
49+
1. Name
50+
1. Sex
51+
1. Age
52+
1. SibSp
53+
1. Parch
54+
1. Ticket
55+
1. Fare
56+
1. Cabin
57+
1. Embarked
58+
59+
> <font color="red"><b>Note :</b></font>
60+
You must answer the following question:
61+
How does your company expect to use and benefit from your model?
7862
### 3-2 Aim
79-
The aim is to classify iris flowers among three species (setosa, versicolor or virginica) from measurements of length and width of sepals and petals
63+
It is your job to predict if a **passenger** survived the sinking of the Titanic or not. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
8064
<a id="6"></a> <br>
8165
### 3-3 Variables
82-
The variables are :
83-
**sepal_length**: Sepal length, in centimeters, used as input.
84-
**sepal_width**: Sepal width, in centimeters, used as input.
85-
**petal_length**: Petal length, in centimeters, used as input.
86-
**petal_width**: Petal width, in centimeters, used as input.
87-
**setosa**: Iris setosa, true or false, used as target.
88-
**versicolour**: Iris versicolour, true or false, used as target.
89-
**virginica**: Iris virginica, true or false, used as target.
90-
91-
**<< Note >>**
92-
> You must answer the following question:
93-
How does your company expact to use and benfit from your model.
94-
<a id="7"></a> <br>
95-
# 4- Inputs & Outputs
96-
<a id="8"></a> <br>
66+
67+
1. **Age** :
68+
1. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
69+
70+
1. **SibSp** :
71+
1. The dataset defines family relations in this way...
72+
73+
a. Sibling = brother, sister, stepbrother, stepsister
74+
75+
b. Spouse = husband, wife (mistresses and fiancés were ignored)
76+
77+
1. **Parch**:
78+
1. The dataset defines family relations in this way...
79+
80+
a. Parent = mother, father
81+
82+
b. Child = daughter, son, stepdaughter, stepson
83+
84+
c. Some children travelled only with a nanny, therefore parch=0 for them.
85+
86+
1. **Pclass** :
87+
* A proxy for socio-economic status (SES)
88+
* 1st = Upper
89+
* 2nd = Middle
90+
* 3rd = Lower
91+
1. **Embarked** :
92+
* nominal datatype
93+
1. **Name**:
94+
* nominal datatype. It could be used in feature engineering to derive the gender from the title.
95+
1. **Sex**:
96+
* nominal datatype
97+
1. **Ticket**:
98+
* has no impact on the outcome variable, so it will be excluded from analysis
99+
1. **Cabin**:
100+
* is a nominal datatype that can be used in feature engineering
101+
1. **Fare**:
102+
* Indicating the fare
103+
1. **PassengerID**:
104+
* has no impact on the outcome variable, so it will be excluded from analysis
105+
1. **Survival**:
106+
* **[dependent variable](http://www.dailysmarty.com/posts/difference-between-independent-and-dependent-variables-in-machine-learning)** , 0 or 1
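
The note on **Name** above can be illustrated with a minimal sketch. It assumes names follow the pattern `Surname, Title. Given names` (for example `Braund, Mr. Owen Harris`); the helper names `extract_title` and `TITLE_TO_SEX` are hypothetical and not part of the original kernel, and the mapping covers only a few common titles.

```python
import re

# Assumption: names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

# Hypothetical, deliberately incomplete mapping from title to gender.
TITLE_TO_SEX = {"Mr": "male", "Master": "male", "Mrs": "female", "Miss": "female"}


def extract_title(name):
    """Return the title found between the comma and the first period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


def sex_from_name(name):
    """Guess gender from the title; returns None for unknown titles."""
    return TITLE_TO_SEX.get(extract_title(name))


title = extract_title("Braund, Mr. Owen Harris")
guessed_sex = sex_from_name("Heikkinen, Miss. Laina")
```

Titles outside the mapping return `None`, so any real use would still need the **Sex** column as the source of truth.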
107+
## 4- Inputs & Outputs
108+
<a id="41"></a> <br>
97109
### 4-1 Inputs
98-
**Iris** is a very popular **classification** and **clustering** problem in machine learning and it is such as "Hello world" program when you start learning a new programming language. then I decided to apply Iris on 20 machine learning method on it.
99-
The Iris flower data set or Fisher's Iris data set is a **multivariate data set** introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers in three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
100-
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
101-
102-
As a result, **iris dataset is used as the input of all algorithms**.
103-
![iris](https://image.ibb.co/gbH3ue/iris.png)
104-
[image source](https://rpubs.com/wjholst/322258)
105-
<a id="9"></a> <br>
110+
What are our inputs for this problem?
111+
1. train.csv
112+
1. test.csv
113+
<a id="42"></a> <br>
106114
### 4-2 Outputs
107-
the outputs for our algorithms totally depend on the type of classification or clustering algorithms.
108-
the outputs can be the number of clusters or predict for new input.
109-
110-
**setosa**: Iris setosa, true or false, used as target.
111-
**versicolour**: Iris versicolour, true or false, used as target.
112-
**virginica**: Iris virginica, true or false, used as a target.
113-
<a id="10"></a> <br>
114-
# 5-Installation
115-
#### Windows:
116-
* Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is also available for Linux and Mac.
117-
* Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial distribution with full SciPy stack for Windows, Linux and Mac.
118-
* Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)
119-
#### Linux
120-
Package managers of respective Linux distributions are used to install one or more packages in SciPy stack.
121-
122-
For Ubuntu Users:
123-
sudo apt-get install python-numpy python-scipy python-matplotlibipythonipythonnotebook
124-
python-pandas python-sympy python-nose
125-
<a id="11"></a> <br>
126-
## 5-1 Jupyter notebook
127-
I strongly recommend installing **Python** and **Jupyter** using the **[Anaconda Distribution](https://www.anaconda.com/download/)**, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
128-
129-
First, download Anaconda. We recommend downloading Anaconda’s latest Python 3 version.
130-
131-
Second, install the version of Anaconda which you downloaded, following the instructions on the download page.
132-
133-
Congratulations, you have installed Jupyter Notebook! To run the notebook, run the following command at the Terminal (Mac/Linux) or Command Prompt (Windows):
134-
<a id="15"></a> <br>
135-
## 5-5 Loading Packages
115+
1. Your score is the percentage of passengers you correctly predict. This is known simply as "**accuracy**".
136116

137-
In this kernel we are using the following packages:
138117

139-
<img src="http://s8.picofile.com/file/8338227868/packages.png">
140-
141-
Now we import all of them
142-
143-
<a id="16"></a> <br>
118+
The output should have exactly **2 columns**:
119+
120+
1. PassengerId (sorted in any order)
121+
1. Survived (contains your binary predictions: 1 for survived, 0 for deceased)
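
As a rough sketch of the two points above, the snippet below computes accuracy as the fraction of correct predictions and writes a two-column submission. The function names `accuracy` and `write_submission`, the sample passenger IDs, and the sample predictions are all made up for illustration; the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def write_submission(passenger_ids, predictions, out_file):
    """Write exactly two columns: PassengerId and Survived (0 or 1)."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for pid, pred in zip(passenger_ids, predictions):
        writer.writerow([pid, int(pred)])


# Illustrative values only, not real predictions.
buffer = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```

To produce a real file, the same function can be called with a handle from `open("submission.csv", "w", newline="")`; that file name is an assumption.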
122+
## 5- Loading Packages
123+
In this kernel we are using the following packages:
124+
<img src="http://s8.picofile.com/file/8338227868/packages.png" width=400 height=400>
144125
# 6- Exploratory Data Analysis(EDA)
145126
In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.
146127

147128
* Which variables suggest interesting relationships?
148129
* Which observations are unusual?
130+
* Analysis of the features!
149131

150-
By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. then We will review analytical and statistical operations:
132+
By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both **insightful** and **beautiful**. Then we will review analytical and statistical operations:
151133

152134
* 6-1 Data Collection
153135
* 6-2 Visualization
154136
* 6-3 Data Preprocessing
155137
* 6-4 Data Cleaning
156138
<img src="http://s9.picofile.com/file/8338476134/EDA.png">
157-
<a id="17"></a> <br>
139+
140+
><font color="red"><b>Note:</b></font>
141+
You can change the order of the above steps.
158142
## 6-1 Data Collection
159143
**Data collection** is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypothesis and evaluate outcomes of the particular collection.[techopedia]
160-
161-
**Iris dataset** consists of 3 different types of irises’ (Setosa, Versicolour, and Virginica) petal and sepal length, stored in a 150x4 numpy.ndarray
162-
163-
The rows being the samples and the columns being: Sepal Length, Sepal Width, Petal Length and Petal Width.[6]
164-
165-
<a id="18"></a> <br>
144+
<br>
145+
I start data collection by loading the training and testing datasets into pandas DataFrames.
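
A minimal sketch of that loading step, assuming pandas is installed. The helper name `load_titanic` is hypothetical, and the tiny in-memory CSV samples stand in for `train.csv` and `test.csv` so the snippet runs without the real files.

```python
import io

import pandas as pd


def load_titanic(train_source, test_source):
    """Read the training and testing data into two separate DataFrames."""
    train = pd.read_csv(train_source)
    test = pd.read_csv(test_source)
    return train, test


# Made-up sample rows; only a subset of the real columns is shown.
sample_train = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
sample_test = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "892,3,male,34.5\n"
)

train_df, test_df = load_titanic(sample_train, sample_test)
print(train_df.shape, test_df.shape)
```

With the real data, the same call would take file paths instead, for example `load_titanic("train.csv", "test.csv")`, assuming both files sit in the working directory.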
166146
## 6-2 Visualization
167147
**Data visualization** is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
168148
