
In this project I analyzed data from roughly 5,000 patients and explored some of the factors that could potentially cause strokes. I also used logistic regression to predict whether an individual is likely to have a stroke.
Step 1
Finding the data
The data for this project was found on Kaggle.
Step 2
Understanding and cleaning the dataset

After importing the dataset into Python, it is important to understand its structure, remove irrelevant columns, and check for any oddities.
We first check the shape of our dataset to understand the number of rows and columns in it.
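A minimal sketch of these first steps, assuming pandas and the usual Kaggle filename (the exact filename is an assumption):

```python
import pandas as pd

# Load the stroke dataset (filename assumed from the Kaggle download)
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# .shape returns a (rows, columns) tuple
print(df.shape)
```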

We have 5110 rows and 11 columns in our dataset.
We then remove the id column because it is only an identifier: it provides no relevant information and has no relation to whether a person gets a stroke.
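One way to do this in pandas:

```python
# Drop the identifier column; it carries no predictive information
df = df.drop(columns=["id"])
```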

Once the irrelevant columns are dropped, we check each remaining column for anomalies.

In the dataset, we can see that age is stored as a float even though months are not specified, so we convert the age column from float to int.
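One way to do the conversion (assuming age has no missing values at this point):

```python
# Ages are recorded in whole years, so store them as integers
df["age"] = df["age"].astype(int)
```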

While checking the unique values in the gender column, I discovered that the column has 3 unique values, namely ['Male', 'Female', 'Others'], and that the 'Others' category contains only 1 patient. That category was therefore excluded from the analysis in order to build a slightly more accurate prediction model.
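A sketch of the check and the filter (the exact label 'Others' is taken from the write-up above):

```python
# Inspect the gender categories, then drop the single 'Others' record
print(df["gender"].unique())
df = df[df["gender"] != "Others"]
```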

We now check if there are any NA values in the dataset and drop all rows that contain NA values.
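One way to check and drop them:

```python
# Count missing values per column, then drop incomplete rows
print(df.isna().sum())
df = df.dropna()
print(df.shape)
```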

Once the rows containing NA values are dropped, we are left with 4908 rows, which means we dropped 202 rows from our dataset.
Step 3
Categorizing data in columns
Categorizing the data helps in obtaining more accurate results when training our prediction model.

Here we look at the unique values in the work_type column, which show us the kind of work done by patients on a day-to-day basis (a code sketch follows the list below).
Private = Works in Private Sector
Self-employed = Have their own business
Govt_job = Works for the government
Children = The dataset doesn't provide any explanation for this category. Therefore it is important to analyze this category first before categorizing our data in the column.
Never_worked = People that are unemployed and have never worked before.
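A quick sketch of both checks (column values as named in the list above):

```python
# Unique work categories in the dataset
print(df["work_type"].unique())

# Inspect the ages of patients whose work_type is 'children'
children = df[df["work_type"] == "children"]
print(children["age"].max())
```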

In the code above we analyze the 'children' category in the work_type column. From the name itself we can interpret 2 possibilities:
1) They are children, which means they are unemployed.
2) They are parents or caretakers looking after children.

Looking at their ages, we see that all of them are below 18, which confirms the first possibility: they are children and therefore unemployed.
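The chart discussed next can be reproduced with a crosstab plot; a sketch assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

# Grouped bar chart: number of people per work_type, split by stroke outcome
pd.crosstab(df["work_type"], df["stroke"]).plot(kind="bar")
plt.ylabel("Number of people")
plt.show()
```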

The grouped bar chart above shows the number of people in each work_type, grouped by whether they had a stroke, where 0 = 'No' and 1 = 'Yes'.
Because most of the people who had a stroke fall into the employed categories, it is better to simplify work type into 'employed' and 'unemployed' for better accuracy in our prediction model.
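A sketch of the recoding (the mapping follows the grouping summarized below):

```python
# Collapse the five work categories into two broader groups
employment_map = {
    "Private": "Employed",
    "Self-employed": "Employed",
    "Govt_job": "Employed",
    "children": "Unemployed",
    "Never_worked": "Unemployed",
}
df["work_type"] = df["work_type"].map(employment_map)
```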

Above we have grouped the values in the work_type column into 'Employed' or 'Unemployed', where
1) 'Private', 'Self-employed', 'Govt_job' = 'Employed'
2) 'children', 'Never_worked' = 'Unemployed'
Step 4
Extracting insights
The main reason why we clean and organize data is to extract valuable and accurate insights from it.
First we will check correlations between various attributes in our dataset.
From the diagram below we can see that there is a positive correlation between bmi and avg_glucose_level: the higher the bmi, the higher the avg_glucose_level tends to be, and vice versa.
The red points indicate that the person has had a stroke and the blue points indicate the person has never had a stroke.
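A sketch of how such a plot can be produced (the color choices are assumptions):

```python
import matplotlib.pyplot as plt

# Scatter of bmi vs avg_glucose_level, colored by stroke outcome
colors = df["stroke"].map({0: "blue", 1: "red"})
plt.scatter(df["bmi"], df["avg_glucose_level"], c=colors, alpha=0.5)
plt.xlabel("bmi")
plt.ylabel("avg_glucose_level")
plt.show()
```

Swapping bmi for age on the x-axis produces the two age plots discussed next.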

Below we can see that there is a positive correlation between age and avg_glucose_level: the older a person is, the higher their avg_glucose_level tends to be.
We can also see that the majority of people who had a stroke were above 40 years of age, represented by the red points.

In the diagram we can see a positive correlation between age and bmi, indicating that the older the person, the higher the bmi tends to be.

For our next analysis we check whether being married increases a person's chances of getting a stroke.
We first group our data into 2 categories, married and unmarried, and then further split each category by whether or not the person had a stroke.
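A sketch of both steps using pd.crosstab, which can also compute the within-group percentages directly:

```python
# Step 1: counts of stroke outcomes within each marital-status group
print(pd.crosstab(df["ever_married"], df["stroke"]))

# Step 2: percentage of each outcome within each marital-status group
print(pd.crosstab(df["ever_married"], df["stroke"], normalize="index") * 100)
```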

In the first step above we grouped our data, and in the second step we calculated the percentage for each group.
Looking at our data in a bar chart, we can see that 5.81% of married people had a stroke, while only 1.35% of unmarried people did, a difference of roughly four times. However, this alone does not show that marriage causes strokes: married people tend to be older, for example, and we have already seen that stroke risk rises with age. More data and analysis would be needed to ascertain whether marriage itself is a contributing factor.
Some of the reasons which may contribute to a higher stroke rate among married individuals could be:
- Pressure to provide resources for the family
- Issues in marital life
The list goes on, but in-depth research is needed to get to the root cause. Therefore we cannot conclusively state that marriage causes strokes until we have further research and more data.

Now let's see if smoking increases the chances of getting a stroke.
We will follow the same steps as we did earlier and categorize and then group our data based on smoking habits.
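The same pattern, sketched for smoking status:

```python
# Step 1: counts of stroke outcomes within each smoking-status group
print(pd.crosstab(df["smoking_status"], df["stroke"]))

# Step 2: percentage of stroke cases within each smoking-status group
print(pd.crosstab(df["smoking_status"], df["stroke"], normalize="index") * 100)
```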

After grouping our data in the first step, we then take the percentage for each group.

In the chart below, we can see that 6.82% of people who formerly smoked had a stroke and 5.29% of people who currently smoke had a stroke, whereas only 4.54% of people who never smoked had a stroke.
From this we can say that smoking could be one of the factors that contribute to having a stroke.

Now let's see in which age groups the majority of people get a stroke. For this I have created age ranges, also known as bins.
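A sketch using pd.cut with the bin edges discussed below:

```python
# Right-inclusive age bins: (0, 15], (15, 30], ..., (75, 90]
bins = [0, 15, 30, 45, 60, 75, 90]
df["age_group"] = pd.cut(df["age"], bins=bins)

# Number of stroke cases in each age range
print(df.groupby("age_group")["stroke"].sum())
```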

As you can see, as the age range increases, the number of people having a stroke also increases. I used binning to create the age ranges. When creating bins, the right edge of each range is included, so bins like [0, 15, 30, 45, 60, 75, 90] produce the intervals (0, 15], (15, 30], (30, 45], (45, 60], (60, 75], (75, 90]. For example, the 0-15 range includes 15, while the 15-30 range excludes 15.
Step 5
Prediction Model
We will now create a logistic regression model which will help us predict if a person will get a stroke or not.
First we have to convert all the categorical attributes into boolean or numerical format.
Below we have converted 4 columns into boolean form: 'gender', 'ever_married', 'work_type', and 'Residence_type'.
A boolean is a data type with only two possible values (True or False, Yes or No).
We cannot convert smoking_status into boolean because it has more than 2 outcomes: 'Smoke', 'Never Smoked', 'Formerly Smoked'.
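A sketch of the encoding; the exact label strings (e.g. 'Male'/'Female', 'Urban'/'Rural') are assumptions based on the Kaggle dataset:

```python
# Encode each two-category column as 1/0
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df["ever_married"] = df["ever_married"].map({"Yes": 1, "No": 0})
df["work_type"] = df["work_type"].map({"Employed": 1, "Unemployed": 0})
df["Residence_type"] = df["Residence_type"].map({"Urban": 1, "Rural": 0})
```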

Below we have converted the smoking status into numerical format.
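A sketch, with an assumed ordering for the three categories (the exact strings follow the Kaggle data):

```python
# Encode the three smoking categories as numbers (mapping assumed)
df["smoking_status"] = df["smoking_status"].map(
    {"never smoked": 0, "formerly smoked": 1, "smokes": 2}
)
```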

Now all our attributes are in numerical format.

We now divide our data into 2 parts:
- X = the attributes based on which the outcome will be predicted, also known as the independent variables
- y = the target variable we want to predict, also known as the dependent variable

In our model we want to predict whether a person gets a stroke, so stroke is our y (dependent) variable and all the remaining attributes form our X (independent) variables.
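In pandas this split is one line each:

```python
# Features (X) are every column except the target; the target (y) is stroke
X = df.drop(columns=["stroke"])
y = df["stroke"]
```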

Once we have divided our data into X and y variables, we further split them into training and test datasets.
The training dataset is what the model learns from; it is then evaluated by predicting outcomes on the unseen test dataset.
Below we have divided our data into X_train, X_test, y_train, and y_test, with 60% of the dataset used for training the model and 40% used for testing.
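A sketch using scikit-learn's train_test_split (the random_state is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 60/40 split as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
```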

In the code below, we create our logistic regression model and train it by providing the X_train and y_train variables we created earlier.
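A sketch (max_iter raised to help the solver converge; that setting is an assumption, not from the original):

```python
from sklearn.linear_model import LogisticRegression

# Create the model and fit it on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```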

Once the model is trained, it can predict the outcomes for the test dataset. Below we pass in our X_test dataset; the predictions are stored in the variable y_pred.
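For example:

```python
# Predict stroke outcomes for the held-out test set
y_pred = model.predict(X_test)
```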

We have now finished training and testing our model. Let's see how well it predicted by using a confusion matrix, which compares the actual test results (y_test) with the predictions (y_pred) made by our model.
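A sketch using scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```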

Above we can see that of the 2945 test subjects, our model correctly predicted that 2815 people did not have a stroke and correctly identified 1 person who did. However, it failed to identify 129 people who actually had a stroke, predicting no stroke for them; this is a typical result when one class (no stroke) vastly outnumbers the other. To make the model more accurate, it is important to clean and categorize our data further and to train the model on more data.
Below are some classification metrics which help us understand our model's performance better.
Precision = the ratio of true positives to the sum of true and false positives.
Recall = the ratio of true positives to the sum of true positives and false negatives.
F1 Score = the weighted harmonic mean of precision and recall. The closer the F1 score is to 1.0, the better the expected performance of the model.
Support = the number of actual occurrences of each class in the dataset. It doesn't vary between models; it simply provides context for interpreting the other metrics.
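These metrics all come from scikit-learn's classification report:

```python
from sklearn.metrics import classification_report

# Precision, recall, F1 score, and support for each class
print(classification_report(y_test, y_pred))
```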
