We will install ggpubr
package using install.packages("ggpubr")
as we need one of its function to merge/arrange more than one ggplots.
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'DT' was built under R version 3.5.3
The Behavioral Risk Factor Surveillance System (BRFSS) is an ongoing surveillance system (hence an observational study) designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. Surveillance data on risk behaviors are collected through monthly telephone interviews (both landline telephone, from a randomly selected adult in a household and cellular telephone, from an adult who participates by using a cellular telephone and resides in a private residence or college housing) for all 50 states, the District of Columbia, Puerto Rico, Guam and the US Virgin Islands. All 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually, and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually one to three months).
Data collection is done with a standardized questionnaire following the BRFSS protocol to ensure a quality report.
The BRFSS uses two samples: one for landline telephone respondents and one for cellular telephone respondents. Since landline telephones are often shared among persons living within a residence, household sampling is used in the landline sample to collect information on the number of adults living within a residence and then select randomly from all eligible adults. Cellular telephone respondents are weighted as single adult households.
The method used for the landline telephone sampling is called Disproportionate stratified sampling (DSS). It draws known telephone household numbers from two strata (a large proportion of target numbers, the high-density block and a set that contains a smaller proportion of target numbers, the medium-density block). The BRFSS samples landline telephone numbers based on sub-state geographic regions. Regional sampling is used to target data collection to geographic subpopulations (such as residents within a public health district).
The cellular telephone sample is randomly generated from a sampling frame of the confirmed cellular area code and prefix combinations. Cellular telephone respondents are randomly selected with each having an equal probability of selection. From 2013, cellular telephone stratification was conducted by the BRFSS, although geographic specificity is less reliable for cellular telephone numbers than for landline numbers.
Also, appropriate weighting methods (design weighting and iterative proportional fitting) are used in an attempt to remove bias in the sample.
The BRFSS 2013 data is as a result of the retrospective observational study (as the data recorded are as a result of events that have taken place) and not an experimental study, hence no random assignment and as a result, we cannot make causal conclusions from the data (we can only associate). BRFSS uses random sampling as explained above; hence the data is generalizable. We are simply saying that since there is no random assignment, only random sampling, there is no causal relationship, only association and the data is generalizable.
This data is generalizable to the non-institutionalized adult population, aged 18 years or older, who reside in the US (N.B: In this project, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico).
Research Question 1: We want to find out the health status of those who sleep 7-9hrs per day and meet the aerobic exercise recommendation and consume at least one or more fruits and vegetable each per day (Case 1). What percentage of adults who meet the above reported good or better health?
Interest: To ensure overall good health, the National Sleep Foundation recommends at least 7 hours of sleep for adults from age 18 years (a recommended range of 7 to 9 hours of sleep per day) and the American Heart Association recommends at least 150minutes per week of moderate exercise or 75 minutes per week of vigorous exercise (or a combination of moderate and vigorous activity), to improve overall cardiovascular health. Also for good health, fruits and vegetables should be the main part of a meal.
Research Question 2: We want to know the percentage of the population that was ever diagnosed with a heart attack; what percentage of the diagnosed population is above or below 65 years of age, we want to know their sex and also race. We will also do that for those that (were told they) ever had a stroke.
Interest: Heart disease and stroke are the world’s two leading causes of death, and for all Americans, heart disease is the No. 1 killer and stroke is also a leading cause of death.
Research question 3: We want to compare the health status of several cases (cases 1 to 5); find out if those who fulfill each case were ever diagnosed with a heart attack. We will also compare the percentage that was ever diagnosed with stroke for each case.
Interest: Heart diseases can lead to a heart attack and stroke, which can be prevented or treated with healthy lifestyle choices. I am personally interested in preventive health practices, so knowing the effects of different lifestyle choices can aid us in our quest of living healthy/ensuring a healthy lifestyle.
Research Questions Breakdown
We are interested in 5 cases of adults in the sample population:
1. Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation per week and consume at least one or more fruits and vegetable each per day.
2. Those who sleep 7-9hrs per day and those who do any physical activity or exercise and consume at least one or more fruits and vegetable each per day.
3.Those who sleep 7-9hrs per day and those who do not do any physical activity or exercise but consume at least one or more fruits and vegetable each per day.
4. Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation per week but consume less than one fruits and vegetable each per day.
5. Those who sleep 7-9hrs per day and who do not do any physical activity or exercise and consume less than one fruits and vegetable each per day.
6. Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation per week.
7. Those who meet the aerobic exercise recommendation per week and consume at least one or more fruits and vegetable each per day. We will look at their health based on their sleep time (group by sleepbreaks), then plot a segmented bar plot. With this, we will answer the question: do those who sleep less than 7-9 hours or more report better health considering that they met the aerobic exercise recommendation and consumed at least one or more fruits and vegetable each per day?
For cases (1) to (5), we will look at the percentage of the population that undergo each, then the percentage of each that have good or better health/fair or poor health.
For only case 1, we will analyze the age group that had the most reported good or better health and the health status of case 1 adult within each month.
We will look at the percentage of the general sample population that had good or better health, visualize their health status based on months to know which month the population reported the most good or better health. We will analyze the following and more:
1. Number of healthy days (good physical and mental health) and active days a typical adult has.
2. The typical sleep time of an adult in the sample population.
In doing the above, we will answer the following questions and more:
1. What is the probability that an adult in the sample population (who gave a response) have ever had(been told/diagnosed with) a stroke? We will be considering only those who gave a yes or no response to the question.
2. What is the probability that an adult age 65 and above reported good or better health given that the adult fulfilled case 1.
3. What is the chance that a Black or African American who met case 1 was ever diagnosed with a heart attack?
4. What percentage of the population who met case 1 and reported that (s)he had good or better health was also ever diagnosed with a heart attack?
In this analysis, we will exclude all missing results, those who did not give a response or did not know (all analysis excludes NAs). The results of the analysis are generalized to only non-institutionalized adult population, aged 18 years or older in the States (all areas participating in BRFSS)- see part 1 for details. In this analysis, the term “adult” refers to “non-institutionalized adult population, aged 18 years and above in the States”.
Note the term “we” used throughout this analysis simply refers to “me (/I) explaining the concept to my audience.”
Before we begin our analysis, we will select the variables that are of importance to us and create a new dataframe from brfss2013 called “cbrfss” (this is after we have gone through the code book and calculated variables to know the variables we are interested in and their types/structures, keeping in mind our research questions.
Also, we create a new variable sleepbreaks from sleptim1 to group the amount of time adults sleep into ten groups and add to the data frame, “cbrfss”.
sleepbreaks<-cut(cbrfss$sleptim1,breaks = c(0,3,5,7,10,13,16,25,451),labels = c("0-2","3-4","5-6","7-9","10-12","13-15","16-24","25+"),right = FALSE)
cbrfss<-cbrfss%>%mutate(sleepbreaks)
table(cbrfss$sleepbreaks,useNA = "ifany")
##
## 0-2 3-4 5-6 7-9 10-12 13-15 16-24 25+ <NA>
## 1305 17757 139633 307371 16610 1013 697 2 7387
Or you could combine both codes as below:
cbrfss<-cbrfss%>%mutate(sleepbreaks=cut(sleptim1,breaks = c(0,3,5,7,10,13,16,25,451),labels = c(“0-2”,“3-4”,“5-6”,“7-9”,“10-12”,“13-15”,“16-24”,“25+”),right = FALSE))
Use summary(cbrfss) to view the data summary and str(cbrfss) to view the structure of the data. Knowing the values that make up the variables (the type and structure) will ease our analysis. In order not to take up pages, we will not include the result of summary(cbrfss) in our report.
## 'data.frame': 491775 obs. of 19 variables:
## $ X_frtlt1 : Factor w/ 2 levels "Consumed fruit one or more times per day",..: 1 2 2 2 2 1 1 2 1 2 ...
## $ X_veglt1 : Factor w/ 2 levels "Consumed vegetables one or more times per day",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ X_age65yr : Factor w/ 2 levels "Age 18 to 64",..: 1 1 1 1 2 1 1 1 1 2 ...
## $ X_age_g : Factor w/ 6 levels "Age 18 to 24",..: 5 4 5 5 6 4 3 5 4 6 ...
## $ X_ageg5yr : Factor w/ 13 levels "Age 18 to 24",..: 9 7 8 9 10 6 4 9 7 10 ...
## $ fmonth : Factor w/ 12 levels "January","February",..: 1 1 1 1 2 3 3 3 4 4 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ X_rfhlth : Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
## $ physhlth : int 30 0 3 2 10 0 1 5 0 0 ...
## $ menthlth : int 29 0 2 0 2 0 15 0 0 0 ...
## $ poorhlth : int 30 NA 0 0 0 NA 0 10 NA NA ...
## $ sleptim1 : int NA 6 9 8 6 8 7 6 8 8 ...
## $ X_totinda : Factor w/ 2 levels "Had physical activity or exercise",..: 2 1 2 1 2 1 1 1 1 1 ...
## $ X_paindx1 : Factor w/ 2 levels "Met aerobic recommendations",..: 2 2 2 2 2 2 1 2 1 1 ...
## $ cvdinfr4 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ cvdstrk3 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ X_mrace1 : Factor w/ 7 levels "White","Black or African American",..: 2 1 1 1 1 2 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ sleepbreaks: Factor w/ 8 levels "0-2","3-4","5-6",..: NA 3 4 4 3 4 4 3 4 4 ...
Before we delve into our research questions, we want to understand the health status of the population as stated earlier (see Part 2 for details)
To get the distribution and percentage of the sample population’ health in terms of those that responded if they had Good or Better OR Fair or Poor Health, we tabulate “X_rfhlth” and plot a bar chart.
## perc_hg_b perc_hf_p
## 1 0.8066972 0.1933028
We filter NAs by using !is.na and plot a bar chart of the sample population health status.
80.67% of the sample population have “good or better health”. It can be attributed to good health care, reduced unemployment rates, good general standard of living as expected in First World Countries.
We can tabulate the percentage of the health Status of the sample population, in decimal, across age groups
## # A tibble: 13 x 3
## X_ageg5yr perc_hg_b perc_hf_p
## <fct> <dbl> <dbl>
## 1 Age 18 to 24 0.920 0.0802
## 2 Age 25 to 29 0.909 0.0908
## 3 Age 30 to 34 0.897 0.103
## 4 Age 35 to 39 0.883 0.117
## 5 Age 40 to 44 0.860 0.140
## 6 Age 45 to 49 0.829 0.171
## 7 Age 50 to 54 0.799 0.201
## 8 Age 55 to 59 0.777 0.223
## 9 Age 60 to 64 0.774 0.226
## 10 Age 65 to 69 0.780 0.220
## 11 Age 70 to 74 0.768 0.232
## 12 Age 75 to 79 0.734 0.266
## 13 Age 80 or older 0.706 0.294
To visualize the above result, we plot a bar chart.To do this, we first create a subset of the data frame “percent_rfhlth” and mutate to include the percentage of health status within age groups, in decimal; then plot.
percent_rfhlth<-cbrfss%>%filter(!is.na(X_ageg5yr),!is.na(X_rfhlth))%>%group_by(X_ageg5yr)%>%count(X_rfhlth)%>%mutate(percentage=n/sum(n))
ggplot(percent_rfhlth,aes(x=X_ageg5yr,y=percentage,fill=X_rfhlth))+geom_bar(stat= "identity",position = position_dodge(width = 0.9),width =1)+geom_text(aes(label=round(percentage,2)),vjust=-0.25)+theme(axis.text.x = element_text(angle = 45,hjust = 1))+ggtitle("Plot of Health Status of the Sample Population within Age Groups")+theme(plot.title = element_text(hjust = 0.3))
We can use geom_bar (with stat=“identity”), or we plot the bar chart with geom_col in order to indicate the variable in the y-axis. We can decide to flip the plot using “coord_flip” to get a better view.
From the chart, it can be seen that the ratio of “good or better health” to “fair or poor health” reduces as the age of the adult increases, hence we can say that the older one gets, the more likely his/her health deteriorates (having fair or poor health)- slight negative association. Though there are other confounding variables, e.g., healthy lifestyles like regular check-ups,exercise, sleep duration, fruits, and vegetable consumption, we can say younger adults generally have good or better health, due to age.
We want to check if the report of fair or poor health is predominant in a particular season. We will use the variable fmonth (with the assumption that it represents the actual month of the event). We will find the percentage of each health status, in decimal, across each month.
## # A tibble: 12 x 3
## fmonth perc_hg_b perc_hf_p
## <fct> <dbl> <dbl>
## 1 January 0.815 0.185
## 2 February 0.812 0.188
## 3 March 0.807 0.193
## 4 April 0.805 0.195
## 5 May 0.801 0.199
## 6 June 0.802 0.198
## 7 July 0.807 0.193
## 8 August 0.806 0.194
## 9 September 0.803 0.197
## 10 October 0.811 0.189
## 11 November 0.806 0.194
## 12 December 0.802 0.198
To visualize the above result, we plot a segmented (dodge) bar chart
percent_mrfhlth<-cbrfss%>%filter(!is.na(fmonth),!is.na(X_rfhlth))%>%group_by(fmonth)%>%count(X_rfhlth)%>%mutate(percentage=n/sum(n))
ggplot(percent_mrfhlth,aes(x=fmonth,y=percentage,fill=X_rfhlth))+geom_bar(stat= "identity",position = position_dodge(width = 0.9),width =1)+geom_text(aes(label=round(percentage,2)),vjust=-0.25)+theme(axis.text.x = element_text(angle = 45,hjust = 1))+ggtitle("Plot of Health Status of the Sample Population within Months")+theme(plot.title = element_text(hjust = 0.5))
From the bar chart and tabular representation above, we can see that the health status of the sample population across months is approximately equal (~81%); so the population health status is not dependent on the month of the year.
Number of healthy (good physical and mental health) and active days a typical adult has
Create new variables: gphyshlth (number of days physical health is good), gmenthlth (number of days mental health is good), activedays(number of activedays, assuming it as the number of days you were able to do your usual activities, due to good physical and mental health) and add to the dataframe
View the first six values of the added variable
## gphyshlth gmenthlth activedays
## 1 0 1 0
## 2 30 30 NA
## 3 27 28 30
## 4 28 30 30
## 5 20 28 30
## 6 30 30 NA
Plot histogram to determine the shape of the sample population’ healthy (physical and mental) and active days
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the above plots, we can see that the number of healthy (good physical and good mental health) and active days are left-skewed.
We can use the function summary() to find the summary statistics of healthy and active days
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -30.00 27.00 30.00 25.65 30.00 30.00 10957
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -4970.00 28.00 30.00 26.62 30.00 30.00 8627
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -6970.0 25.0 30.0 24.7 30.0 30.0 243153
The distributions of gphyshlth, gmenthlth, and activedays of the sampled population are left skewed, so a typical good physical health, good mental health, active days for an adult in the sample population is 30 days each (the median value).
The typical sleep time of an adult in the sample population
Get the distribution of sleptim1 and summary statistics to know the duration a typical adult in the sample population sleeps per day. Use function: summary() Or
cbrfss%>%filter(!is.na(sleptim1))%>%summarise(median(sleptim1),mean(sleptim1),IQR(sleptim1),sd(sleptim1)) Or
cbrfss%>%summarise(median(sleptim1,na.rm = TRUE),mean(sleptim1,na.rm = TRUE),IQR(sleptim1,na.rm = TRUE),sd(sleptim1,na.rm = TRUE))
Plot a histogram to determine the shape of the sleep time of an adult in the sample population. Note that non-finite values are removed by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7387 rows containing non-finite values (stat_bin).
The distribution of sleep time is right skewed, so for robust statistics, we find the median and IQR (to determine variability) of the data. We can also include the mean and standard deviation in our summary.
## median(sleptim1) mean(sleptim1) IQR(sleptim1) sd(sleptim1)
## 1 7 7.052099 2 1.60411
A typical adult sleeps 7hours a day.
We can decide to see the distribution of sleep time within age groups by plotting a bar chart of “X_age_g and sleepbreaks” or we can plot a side by side box plot using the variable “sleptim1”.
From the chart, we can see that the typical sleep period across age groups is 7-9hours.
This plot confirms the typical sleep time of 7 hours for adults across all age groups.
Research Question 1
Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation and consume at least one or more fruits and vegetable each per day (tsleepexfrvg)
We will create a new variable “tsleepexfrvg” for case 1 and add to new dataframe “scbrfss” (created from and same as “cbrfss”- the essence of this is to easily get hold of your original dataframe in a case of wrong computation). First, we create a subset of data frame “cbrfss” and mutate case 1.
sleepexfvg<-cbrfss%>%select(sleepbreaks,X_paindx1,X_frtlt1,X_veglt1)%>%mutate(sleep_ex_fr_vg=ifelse(sleepbreaks=="7-9"&X_paindx1=="Met aerobic recommendations"&X_veglt1=="Consumed vegetables one or more times per day"&X_frtlt1=="Consumed fruit one or more times per day","True","False"))
table(sleepexfvg$sleep_ex_fr_vg,useNA = "ifany")
##
## False True <NA>
## 364838 94477 32460
Save the variable “sleep_ex_fr_vg” as “tsleepexfrvg” and add to the dataframe scbrfss
Count and Percentage of the sample population for Case 1
## count_pop_1 perc_pop_1
## 1 94477 0.2056911
Age Group that does Case 1
Out of 20.57% of the sample population that fulfill Case 1,from the bar chart above, adults of ages 50 and above (highest count- age 60 to 64) are more conscious of their health as they have a higher count compared to those below 50 in ensuring that they sleep 7 to 9hours per day and meet the aerobic exercise recommendations and eat at least one or more fruit and vegetable each per day. It could also be that there are “more” older adults in the population hence the higher count of them fulfilling case 1; we can only confirm this if we consider the number of those who fulfilled case 1 to those who did not within each group.
age_case1<-scbrfss%>%filter(!is.na(tsleepexfrvg),!is.na(X_ageg5yr))%>%group_by(X_ageg5yr)%>%count(tsleepexfrvg)%>%mutate(percentage=n/sum(n))
age_case1%>%filter(tsleepexfrvg=="True")
## # A tibble: 13 x 4
## # Groups: X_ageg5yr [13]
## X_ageg5yr tsleepexfrvg n percentage
## <fct> <chr> <int> <dbl>
## 1 Age 18 to 24 True 3967 0.157
## 2 Age 25 to 29 True 3580 0.168
## 3 Age 30 to 34 True 4303 0.170
## 4 Age 35 to 39 True 4858 0.184
## 5 Age 40 to 44 True 5542 0.187
## 6 Age 45 to 49 True 6292 0.183
## 7 Age 50 to 54 True 8776 0.197
## 8 Age 55 to 59 True 10322 0.207
## 9 Age 60 to 64 True 11198 0.221
## 10 Age 65 to 69 True 11112 0.236
## 11 Age 70 to 74 True 9148 0.246
## 12 Age 75 to 79 True 7018 0.251
## 13 Age 80 or older True 8361 0.226
From the above result, we can confirm that older adults fulfill case 1 more; with age 75 to 79 years actually having the highest percentage (25.06% of adults age 75 to 79 fulfill case 1 while 74.94% of adults do not) and age 18 to 24, the lowest (15.71% of adults age 18 to 24 fulfill case 1 while 84.29% do not). This can be seen in the light that older adults are retired and they have more time to rest and exercise compared to younger adults.
Percentage of case 1 that have good or better health/fair or poor health
## perc_hg_b perc_hf_p
## 1 0.9221906 0.07780937
92.22% of adults who fulfill case 1 have good or better health, which is a pretty good percentage. We can say that if at least 50% (as against 20.57%) of adults fulfill case 1, the general population will have a better health status than 80.67%.
In order to know the distribution of health status of adults across age groups, we plot a chart Showing Health Status of Case 1 for different Age Groups.
percent_rfhlth_case1<-scbrfss%>%filter(tsleepexfrvg=="True",!is.na(X_ageg5yr),!is.na(X_rfhlth))%>%group_by(X_ageg5yr)%>%count(X_rfhlth)%>%mutate(percentage=n/sum(n))
ggplot(percent_rfhlth_case1,aes(x=X_ageg5yr,y=percentage,fill=X_rfhlth))+geom_bar(stat= "identity",position = position_dodge(width = 0.9),width =1)+coord_flip()+geom_text(aes(label=round(percentage,2)),vjust=-0.25)+ggtitle("Chart Showing Health Status of Case 1 within Age Groups")+theme(plot.title = element_text(hjust = 0.5))
The result across age groups is quite similar. This explains the idea that adults who sleep 7-9hours per day, meet the aerobic exercise recommendations and eat at least one or more fruit and vegetable each per day are sure of good or better health irrespective of their age (84-96% of the time). The slight disparity between the age group could be attributed to the idea that aged adults are more prone to weakness/illness compared to younger adults, due to their age; so it is expected that younger adults are healthier than aged adults as seen in the plot of population health status across age groups.
Question: What is the probability that an adult age 65 and above reported having good or better health given that the adult fulfills case 1? Ans: 0.88 (88.81%)
## # A tibble: 2 x 4
## # Groups: X_age65yr [1]
## X_age65yr X_rfhlth n probability
## <fct> <fct> <int> <dbl>
## 1 Age 65 or older Good or Better Health 31549 0.888
## 2 Age 65 or older Fair or Poor Health 3975 0.112
We can check if the health status of those that fulfill case 1 varies by month, by plotting a bar chart of case 1’ health status across months (we use the file month).
percent_mrfhlth_case1<-scbrfss%>%filter(tsleepexfrvg=="True",!is.na(fmonth),!is.na(X_rfhlth))%>%group_by(fmonth)%>%count(X_rfhlth)%>%mutate(percentage=n/sum(n))
ggplot(percent_mrfhlth_case1,aes(x=fmonth,y=percentage,fill=X_rfhlth))+geom_bar(stat= "identity",position = position_dodge(width = 0.9),width =1)+coord_flip()+geom_text(aes(label=round(percentage,2)),vjust=-0.25)+ggtitle("Chart Showing Health Status of Case 1 within Months")+theme(plot.title = element_text(hjust = 0.5))
We can see that health status of adults that fulfill case 1 is not dependent on the season of the year, same as that of the sample population because each month, the percentage of good or better health to fair or poor health is almost same.
We want to find out what number of healthy (physical, mental) and active days a typical adult that fulfills case 1 has. We determine the shape of the distribution by plotting a histogram for each.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Then we analyze the center of the distribution (For robust statistics for a left skewed distribution, we concentrate on the median and IQR of the data). We will find the mean and standard deviation of healthy days though. (N.B: the result of the first and second analysis for gphyshlth are same but with different codes)
## median(gphyshlth) mean(gphyshlth) IQR(gphyshlth) sd(gphyshlth)
## 1 30 27.82958 1 5.995038
## median(gphyshlth, na.rm = TRUE) mean(gphyshlth, na.rm = TRUE)
## 1 30 27.82958
## IQR(gphyshlth, na.rm = TRUE) sd(gphyshlth, na.rm = TRUE)
## 1 1 5.995038
## median(gmenthlth) mean(gmenthlth) IQR(gmenthlth) sd(gmenthlth)
## 1 30 28.36929 0 5.014422
## median(activedays) mean(activedays) IQR(activedays) sd(activedays)
## 1 30 27.34904 2 6.266497
The distributions of gphyshlth,gmenthlth and activedays of Case 1 of the sampled population are left skewed, so a typical adult has 30 days each of good physical health, good mental health, active days (same as a typical adult of the sampled population)
Those who sleep 7-9hrs per day and those who do any physical activity or exercise during the past 30 days other than their regular jobs and consume at least one or more fruits and vegetable each per day (tsleepanyexfrvg).
We create a new data frame (sleepanyexfvg) and new variable (sleep_anyex_fr_vg) for Case 2
sleepanyexfvg<-scbrfss%>%select(sleepbreaks,X_totinda,X_frtlt1,X_veglt1)%>%mutate(sleep_anyex_fr_vg=ifelse(sleepbreaks=="7-9"&X_totinda=="Had physical activity or exercise"&X_frtlt1=="Consumed fruit one or more times per day"&X_veglt1=="Consumed vegetables one or more times per day","True","False"))
table(sleepanyexfvg$sleep_anyex_fr_vg,useNA = "ifany")
##
## False True <NA>
## 335932 134971 20872
Add the variable “sleep_anyex_fr_vg” to the dataframe scbrfss and save as “tsleepanyexfrvg”
Count and Percentage of the sampled population for Case 2
## count_pop_2 perc_pop_2
## 1 134971 0.2866217
The adults that fulfill case 2 are 8.05% more than case 1.
We want to find out the percentage of case 2 that have good or better health/fair or poor health
## perc_hg_b perc_hf_p
## 1 0.9122638 0.08773624
The adults that fulfill case 1 have 0.99% more good or better health than case 2 adults. So we can infer that if one is able to sleep 7-9hrs per day and consume at least one or more fruits and vegetable each per day, there is no much difference in the outcome of health status if the aerobic exercise recommendation is not met provided he/she does any physical activity or exercise during the past 30 days other than their regular jobs.
Those who sleep 7-9hrs per day and those who do not do any physical activity or exercise but consume at least one or more fruits and vegetable each per day (tsleepnoextfrvg).
We create a new data frame (sleepnoextfvg) and new variable (sleep_noex_tfr_vg) for Case 3
sleepnoextfvg<-scbrfss%>%select(sleepbreaks,X_totinda,X_frtlt1,X_veglt1)%>%mutate(sleep_noex_tfr_vg=ifelse(sleepbreaks=="7-9"&X_totinda=="No physical activity or exercise in last 30 days"&X_frtlt1=="Consumed fruit one or more times per day"&X_veglt1=="Consumed vegetables one or more times per day","True","False"))
table(sleepnoextfvg$sleep_noex_tfr_vg,useNA = "ifany")
##
## False True <NA>
## 442298 29598 19879
Add the variable “sleep_noex_tfr_vg” to the dataframe scbrfss and save as “tsleepnoextfrvg”
Count and Percentage of the sampled population for Case 3
## count_pop_3 perc_pop_3
## 1 29598 0.06272145
Only 6.28% of adults in the sample population fulfill case 3.
Let us find out the percentage of case 3 that have good or better health/fair or poor health
## perc_hg_b perc_hf_p
## 1 0.7524709 0.2475291
A very low percentage of adults fulfill case 3; but from the health status result, we can deduce the importance of participating in any exercise (either meet the aerobic exercise recommendation or take part in any physical activity or exercise other than work) as only 75.25% of case 3 adults have good or better health.
Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation but consume less than one fruits and vegetable each per day (tsleepexlfvg).
We create a new data frame (sleepexlfvg) and new variable (sleep_ex_lfr_vg) for Case 4
sleepexlfvg<-scbrfss%>%select(sleepbreaks,X_paindx1,X_frtlt1,X_veglt1)%>%mutate(sleep_ex_lfr_vg=ifelse(sleepbreaks=="7-9"&X_paindx1=="Met aerobic recommendations"&X_frtlt1=="Consumed fruit less than one time per day"&X_veglt1=="Consumed vegetables less than one time per day","True","False"))
table(sleepexlfvg$sleep_ex_lfr_vg,useNA = "ifany")
##
## False True <NA>
## 459776 10547 21452
Add the variable “sleep_ex_lfr_vg” to the dataframe scbrfss and save as “tsleepexlfvg”
Count and Percentage of the sample population for Case 4
## count_pop_4 perc_pop_4
## 1 10547 0.02242501
Only 2.24% of adults in the sample population fulfill case 4.
Let us find out the percentage of case 4 that have good or better health/fair or poor health
## perc_hg_b perc_hf_p
## 1 0.8643761 0.1356239
A very low percentage of adults fulfill case 4, but from the health status result, we can deduce the importance of participating in meeting the aerobic exercise recommendation. The percentage of good or better health is approximately 6% less than case 1 (difference in cases being fruits and vegetable consumption, each, at least one or more times per day) and about 11% more than those who did not do any physical activity or exercise and they consumed fruits and vegetable each at least one or more times per day. This enacts the importance of exercise, irrespective of having a healthy diet when it pertains to the health status of an adult.
Those who sleep 7-9hrs per day and who do not do any physical activity or exercise and consume less than one fruits and vegetable each per day (tsleepexlfvg).
We create a new data frame (a subset of scbrfss), “sleepnoexlfvg” and add new variable “sleep_noex_lfr_vg” for Case 5
sleepnoexlfvg<-scbrfss%>%select(sleepbreaks,X_totinda,X_frtlt1,X_veglt1)%>%mutate(sleep_noex_lfr_vg=ifelse(sleepbreaks=="7-9"&X_totinda=="No physical activity or exercise in last 30 days"&X_frtlt1=="Consumed fruit less than one time per day"&X_veglt1=="Consumed vegetables less than one time per day","True","False"))
table(sleepnoexlfvg$sleep_noex_lfr_vg,useNA = "ifany")
##
## False True <NA>
## 459338 12929 19508
Add the variable “sleep_noex_lfr_vg” to the dataframe scbrfss and save as “tsleepexlfvg”
Count and Percentage of the sample population for Case 5
## count_pop_5 perc_pop_5
## 1 12929 0.02737646
Only 2.73% of adults in the sample population fulfill case 5.
We want to find out the percentage of case 5 that have good or better health/fair or poor health
## perc_hg_b perc_hf_p
## 1 0.7135971 0.2864029
A very low percentage of adults fulfill case 5, but from the health status result, we can see that sleeping 7-9hours a day without exercising and also consuming less than one fruit and vegetable each per day equates to fair or poor health 28.64% of the time. Those who consume at least one or more fruit each day without exercising (see case 3) have approximately 4% less chance of having fair and poor health.
Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation per week (tsleepex).
We create a new data frame (sleepex), a subset of “scbrfss” and add new variable (sleep_ex) for Case 6
sleepex<-scbrfss%>%select(sleepbreaks,X_paindx1)%>%mutate(sleep_ex=ifelse(sleepbreaks=="7-9"&X_paindx1=="Met aerobic recommendations","True","False"))
table(sleepex$sleep_ex,useNA = "ifany")
##
## False True <NA>
## 302521 144433 44821
Add the variable “sleep_ex” to the dataframe scbrfss and rename as “tsleepex”
Count and Percentage of the sample population for Case 6
## count_pop_6 perc_pop_6
## 1 144433 0.3231496
32.31% of adults in the sample population fulfill case 6. Some of the 144,433 adults who fulfill case 6 fulfill cases 1 and 4.
Let us find out the percentage of case 6 that have good or better health/fair or poor health.
## perc_hg_b perc_hf_p
## 1 0.9100034 0.0899966
91% of adults who meet the aerobic exercise recommendation and sleep 7-9 hours per day have good or better health.
Those who meet the aerobic exercise recommendation per week and consume at least one or more fruits and vegetable each per day (texfrvg)
We create a new data frame (exfvg) and new variable (ex_fr_vg) for Case 7
exfvg<-scbrfss%>%select(X_paindx1,X_frtlt1,X_veglt1)%>%mutate(ex_fr_vg=ifelse(X_paindx1=="Met aerobic recommendations"&X_veglt1=="Consumed vegetables one or more times per day"&X_frtlt1=="Consumed fruit one or more times per day","True","False"))
table(exfvg$ex_fr_vg,useNA = "ifany")
##
## False True <NA>
## 306778 136395 48602
Add the variable “sleep_ex_fr_vg” to the dataframe scbrfss and rename as “texfrvg”
Question: Do those who sleep less than 7-9 hours or more report better health considering that they met the aerobic exercise recommendation and consumed at least one or more fruits and vegetable each per day?
To answer this, we plot a bar chart to show the health status (X_rfhlth) for different sleep periods (sleepbreaks)
percent_srfhlth_case7<-scbrfss%>%filter(texfrvg=="True",!is.na(sleepbreaks),!is.na(X_rfhlth))%>%group_by(sleepbreaks)%>%count(X_rfhlth)%>%mutate(percentage=n/sum(n))
ggplot(percent_srfhlth_case7,aes(x=sleepbreaks,y=percentage,fill=X_rfhlth))+geom_bar(stat= "identity",position = "dodge")+geom_text(aes(label=round(percentage,2),vjust=-0.5))+ggtitle("Chart Showing Health Status of Case 7 for different Sleep Period")+theme(plot.title = element_text(hjust = 0.5))
From the chart, we can see that those who sleep 7-9hours per day seems to have good or better health (92%) given case 7 is met compared to those who sleep less or more. However, those who sleep 5-6hours per day seconds that (86% of adults reported good or better health).
Research Question 2
Question: What is the probability that an adult in the sample population (who gave a response) have ever (been told they) had a heart attack? Remember we are only considering those who gave a yes or no response to the question.
## # A tibble: 2 x 3
## cvdinfr4 n `n/sum(n)`
## <fct> <int> <dbl>
## 1 Yes 29284 0.0599
## 2 No 459904 0.940
5.99% of a total of 489,188 adults have ever (been told they) had a heart attack.
We can also go further to see the age group of those (5.99%) that were diagnosed with a heart attack.
age_hattack<-scbrfss%>%filter(!is.na(X_ageg5yr),cvdinfr4=="Yes")%>%group_by(X_ageg5yr)%>%count(cvdinfr4)%>%ungroup %>%mutate(percentage=n/sum(n))
age_hattack
## # A tibble: 13 x 4
## X_ageg5yr cvdinfr4 n percentage
## <fct> <fct> <int> <dbl>
## 1 Age 18 to 24 Yes 113 0.00389
## 2 Age 25 to 29 Yes 114 0.00392
## 3 Age 30 to 34 Yes 174 0.00598
## 4 Age 35 to 39 Yes 263 0.00904
## 5 Age 40 to 44 Yes 477 0.0164
## 6 Age 45 to 49 Yes 885 0.0304
## 7 Age 50 to 54 Yes 1802 0.0620
## 8 Age 55 to 59 Yes 2657 0.0914
## 9 Age 60 to 64 Yes 3798 0.131
## 10 Age 65 to 69 Yes 4561 0.157
## 11 Age 70 to 74 Yes 4306 0.148
## 12 Age 75 to 79 Yes 3822 0.131
## 13 Age 80 or older Yes 6113 0.210
From the result above, we can see that the proportion of adults who had ever been diagnosed with a heart attack is highest for age group 80 or older; meaning 21% of the adults who ever (been told) had a heart attack are age 80 or older. We can infer that the chance of a heart attack increases as an adult’s age increases, though other factors like a general way of living (e.g. exercise, food consumption, amount of time one rests), genetics play a role.
We will look at some demographics (are they more of males or females, what race and are they older than 65 or younger) for those who had a heart attack.
age65yr_hattack<-scbrfss%>%filter(!is.na(X_age65yr),cvdinfr4=="Yes")%>%group_by(X_age65yr)%>%count(cvdinfr4)%>%ungroup %>%mutate(percentage=n/sum(n))
ggplot(age65yr_hattack,aes(x=X_age65yr,y=percentage))+geom_bar(stat="identity")+geom_text(aes(label=round(percentage,2),vjust=-0.5))+ggtitle(label="Plot of Diagnosed Heart Attack for Ages below and above 65")+theme(plot.title = element_text(hjust=0.5))
65% of the diagnosed adults that had a heart attack were 65 years and above. We can generalize that older adults of age 65 and above have a higher chance of a heart attack compared to those below 65 years.
Sex of adults ever diagnosed with a heart attack in the sample population.
sex_hattack<-scbrfss%>%filter(!is.na(sex),cvdinfr4=="Yes")%>%group_by(sex)%>%count(cvdinfr4)%>%ungroup %>%mutate(percentage=n/sum(n))
ggplot(sex_hattack,aes(x=sex,y=percentage))+geom_bar(stat="identity")+geom_text(aes(label=round(percentage,2),vjust=-0.5))+ggtitle(label="Proportion of Adults ever Diagnosed with a Heart Attack")+theme(plot.title = element_text(hjust=0.5))
Out of the 29,284 adults that were diagnosed with a heart attack, 55% was male and 45% female. A question that may arise from this result which will require further observation, experiment and/or analysis are: are male adults more prone to a heart attack compared to females?
We also want to see if heart attack diagnosis is dependent on races.
race_hattack<-scbrfss%>%filter(!is.na(X_mrace1),cvdinfr4=="Yes")%>%group_by(X_mrace1)%>%count(cvdinfr4)%>%ungroup %>%mutate(percentage=n/sum(n))
ggplot(race_hattack,aes(x=X_mrace1,y=percentage))+geom_bar(stat="identity")+geom_text(aes(label=round(percentage,2),vjust=-0.3))+ggtitle(label="Distribution of the Sample Population' Heart Attack Diagnosis across Races")+theme(plot.title = element_text(hjust=0.5))+scale_x_discrete(label= function(x)lapply(strwrap(x,width = 15,simplify = FALSE),paste,collapse="\n"))#the last code (scale_x_discrete......) is for wrapping the text of the races
84% of adults ever diagnosed with a heart attack are white. Black or African American ever diagnosed are 8%. So we could generalize that whites are more susceptible to a heart attack but we might want to find out if the percentage of whites are extremely more than other races because that could be a reason for the above results (if the population is predominantly white, then definitely more whites will be subjected to a heart attack).
## # A tibble: 7 x 3
## X_mrace1 n percentage
## <fct> <int> <dbl>
## 1 White 400421 0.828
## 2 Black or African American 41221 0.0853
## 3 American Indian or Alaskan Native 9087 0.0188
## 4 Asian 9850 0.0204
## 5 Native Hawaiian or other Pacific Islander 1931 0.00399
## 6 Other race only 10663 0.0221
## 7 Multiracial 10227 0.0212
## Warning: Factor `X_mrace1` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 1 x 2
## X_mrace1 n
## <fct> <int>
## 1 <NA> 8375
82% of the sample population that gave a valid response is whites (N.B- only 8375 NAs), that is a pretty high percentage compared to other races. We can generalize that whites make up way more than half of the population, and thus, they are more likely to have a heart attack.
We can also visualize the likelihood of each race having a heart attack by plotting the percentage of those (each race) ever diagnosed with a heart attack.
race_Hattack<-scbrfss%>%filter(!is.na(cvdinfr4),!is.na(X_mrace1))%>%group_by(X_mrace1)%>%count(cvdinfr4)%>%mutate(percentage=n/sum(n))
race_Hattack%>%filter(cvdinfr4=="No")
## # A tibble: 7 x 4
## # Groups: X_mrace1 [7]
## X_mrace1 cvdinfr4 n percentage
## <fct> <fct> <int> <dbl>
## 1 White No 374386 0.939
## 2 Black or African American No 38598 0.944
## 3 American Indian or Alaskan Native No 8278 0.920
## 4 Asian No 9533 0.975
## 5 Native Hawaiian or other Pacific Islander No 1844 0.965
## 6 Other race only No 10044 0.950
## 7 Multiracial No 9418 0.928
## # A tibble: 7 x 4
## # Groups: X_mrace1 [7]
## X_mrace1 cvdinfr4 n percentage
## <fct> <fct> <int> <dbl>
## 1 White Yes 24244 0.0608
## 2 Black or African American Yes 2294 0.0561
## 3 American Indian or Alaskan Native Yes 719 0.0799
## 4 Asian Yes 241 0.0247
## 5 Native Hawaiian or other Pacific Islander Yes 67 0.0351
## 6 Other race only Yes 532 0.0503
## 7 Multiracial Yes 735 0.0724
Visualizing the likelihood of each race having a heart attack, we have
race_Hattack_plot<-race_Hattack%>%filter(cvdinfr4=="Yes")
ggplot(race_Hattack_plot,aes(x=X_mrace1,y=percentage))+geom_bar(stat="identity")+scale_x_discrete(label= function(x)lapply(strwrap(x,width = 15,simplify = FALSE),paste,collapse="\n"))
American Indian or Alaskan Native diagnosed with a heart attack with respect to all American Indian or Alaskan Native are highest (7.99% i.e 719 out of 8997 American Indian or Alaskan) and Asians have the least (2.46% i.e 241 out of 9774 Asians) diagnosis within their race. We can generalise that 1 in every about 40 Asians are likely to be diagnosed with a heart attack in the States and about 1 in every 12 American Indian or Alaskan Native are likely to be diagnosed with a heart attack in the States.
Question: What is the probability that an American Indian or Alaskan Native who was diagnosed with a heart attack is below the age of 65?
## # A tibble: 2 x 4
## X_age65yr cvdinfr4 n percentage
## <fct> <fct> <int> <dbl>
## 1 Age 18 to 64 Yes 357 0.499
## 2 Age 65 or older Yes 359 0.501
In every 1 in 12 American Indian or Alaskan Native diagnosed with a heart attack, the likelihood of him/her being younger than 65 years is approximately 0.50.
Question:What is the probability that an Asian who was diagnosed with a heart attack is below the age of 65?
## # A tibble: 2 x 4
## X_age65yr cvdinfr4 n percentage
## <fct> <fct> <int> <dbl>
## 1 Age 18 to 64 Yes 130 0.546
## 2 Age 65 or older Yes 108 0.454
In every 1 in 40 Asians diagnosed with a heart attack, the likelihood of him/her being younger than 65 years is 0.55.
Question: What is the probability that an adult in the sample population (who gave a response) have ever (been told they had/diagnosed with) a stroke? Remember we are only considering those who gave a yes or no response to the question. We can decide to use the function summary() to know only the count.
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 20391 0.0416
## 2 No 469917 0.958
4.16% of a total of 490,308 adults have ever (been told they) had a stroke.
We will look at some demographics (are they more of males or females, what race and are they older than 65 or younger) of those who had a heart attack.
age65yr_stroke<-scbrfss%>%filter(!is.na(X_age65yr),cvdstrk3=="Yes")%>%group_by(X_age65yr)%>%count(cvdstrk3)%>%ungroup %>%mutate(percentage=n/sum(n))
ggplot(age65yr_stroke,aes(x=X_age65yr,y=percentage))+geom_bar(stat="identity")+geom_text(aes(label=round(percentage,2),vjust=-0.5))+ggtitle(label="Plot of Diagnosed Stroke for Ages below and above 65")+theme(plot.title = element_text(hjust=0.5))
62% of the diagnosed adults that had stroke were 65 years and above (3% less than those diagnosed with a heart attack). Adults of age 65 and above have a higher chance of a heart attack and also stroke . Further analysis in the future can be centered on if those who were diagnosed with a heart attack also had a stroke or vice versa.
Sex of adults in the sample population ever diagnosed with stroke
sex_stroke<-scbrfss%>%filter(!is.na(sex),cvdstrk3=="Yes")%>%group_by(sex)%>%count(cvdstrk3)%>%ungroup %>%mutate(percentage=n/sum(n))
sex_stroke
## # A tibble: 2 x 4
## sex cvdstrk3 n percentage
## <fct> <fct> <int> <dbl>
## 1 Male Yes 7966 0.391
## 2 Female Yes 12424 0.609
Out of the 20,390 adults ever diagnosed with a stroke, 39% was male and 61% female, unlike heart attack that had 55% male diagnosis (Female had stroke more, compared to male). We can infer that females are more likely to have a stroke than a heart attack compared to male. A question which comes to mind is: why are female adults more prone to stroke diagnosis compared to male (this will require further observational/experimental study and analysis)
We also want to see if stroke attack diagnosis is dependent on races.
First, we look at the diagnosed stroke cases across races.
race_stroke<-scbrfss%>%filter(!is.na(X_mrace1),cvdstrk3=="Yes")%>%group_by(X_mrace1)%>%count(cvdstrk3)%>%ungroup %>%mutate(percentage=n/sum(n))
ggplot(race_stroke,aes(x=X_mrace1,y=percentage))+geom_bar(stat="identity")+geom_text(aes(label=round(percentage,2),vjust=-0.3))+ggtitle(label="Distribution of the Sample Population' Stroke Diagnosis across Races")+theme(plot.title = element_text(hjust=0.5))+scale_x_discrete(label= function(x)lapply(strwrap(x,width = 15,simplify = FALSE),paste,collapse="\n"))#the last code (scale_x_discrete......) is for wrapping the text of the races
Since the population is predominantly white, adults diagnosed with stroke are more of white. Though, comparing the results of heart attack and stroke, we can say that Whites are 5% more prone to a heart attack diagnosis than stroke diagnosis while Black or African Americans are 4% more prone to a stroke diagnosis compared to heart attack diagnosis.
We will visualize the likelihood of each race having a stroke by plotting the percentage of those (each race) ever diagnosed with a stroke.
Race_Stroke<-scbrfss%>%filter(!is.na(cvdstrk3),!is.na(X_mrace1))%>%group_by(X_mrace1)%>%count(cvdstrk3)%>%mutate(percentage=n/sum(n))
Race_Stroke%>%filter(cvdstrk3=="No")
## # A tibble: 7 x 4
## # Groups: X_mrace1 [7]
## X_mrace1 cvdstrk3 n percentage
## <fct> <fct> <int> <dbl>
## 1 White No 383377 0.960
## 2 Black or African American No 38657 0.940
## 3 American Indian or Alaskan Native No 8520 0.940
## 4 Asian No 9631 0.982
## 5 Native Hawaiian or other Pacific Islander No 1864 0.968
## 6 Other race only No 10273 0.968
## 7 Multiracial No 9607 0.943
## # A tibble: 7 x 4
## # Groups: X_mrace1 [7]
## X_mrace1 cvdstrk3 n percentage
## <fct> <fct> <int> <dbl>
## 1 White Yes 15945 0.0399
## 2 Black or African American Yes 2455 0.0597
## 3 American Indian or Alaskan Native Yes 540 0.0596
## 4 Asian Yes 172 0.0175
## 5 Native Hawaiian or other Pacific Islander Yes 61 0.0317
## 6 Other race only Yes 340 0.0320
## 7 Multiracial Yes 583 0.0572
Visualizing the likelihood of the above result for only those ever diagnosed with a stroke, we have
Race_Stroke_plot<-Race_Stroke%>%filter(cvdstrk3=="Yes")
ggplot(Race_Stroke_plot,aes(x=X_mrace1,y=percentage))+geom_bar(stat="identity")+scale_x_discrete(label= function(x)lapply(strwrap(x,width = 15,simplify = FALSE),paste,collapse="\n"))
American Indian or Alaskan Native and Black or African American (both) diagnosed with a stroke with respect to each of their race are equally highest (about 6% each- 540 out of 9060 American Indian or Alaskan Native and 2455 out of 41,113 Black or African American) and Asians have the least (1.75% i.e 172 out of 9803 Asians) diagnosis within their race. We can generalize that 1 in about every 57 Asians is likely to be diagnosed with a stroke in the States, 1 in about every 17 American Indian or Alaskan is likely to be diagnosed with stroke and about 1 in about every 17 Black or African American is likely to be diagnosed with stroke in the States.
Research Question 3. N.B: Cases 1 to 5 for Health Status have been introduced earlier- see the “Research Question 1” section.
We will look at the count and percentage of adults that (was ever told they) had a heart attack, also stroke for each case to see if there is any considerable difference between cases.
Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation per week and consume at least one or more fruits and vegetable each per day (tsleepexfrvg)
Before we delve into the analysis of case 1’ heart attack and stroke diagnosis, let us get a further understanding of case 1 adults. We will look at their demographics: age again (this time below and above 65), sex and visualize only the race of adults who met case 1.
Age of those who fulfilled case 1
## # A tibble: 2 x 4
## X_age65yr tsleepexfrvg n percentage
## <fct> <chr> <int> <dbl>
## 1 Age 18 to 64 True 58837 0.623
## 2 Age 65 or older True 35639 0.377
62.28% of adults who fulfil case 1 are age 65 and above.
Sex of those who fulfilled case 1
## # A tibble: 2 x 4
## sex tsleepexfrvg n percentage
## <fct> <chr> <int> <dbl>
## 1 Male True 35870 0.380
## 2 Female True 58607 0.620
More females make up adults who fulfill case 1 (62% female).
Race of those who fulfilled case 1
race_case1<-scbrfss%>%filter(tsleepexfrvg=="True",!is.na(X_mrace1))%>%group_by(X_mrace1)%>%count(tsleepexfrvg)%>%ungroup()%>%mutate(percentage=n/sum(n))
race_case1
## # A tibble: 7 x 4
## X_mrace1 tsleepexfrvg n percentage
## <fct> <chr> <int> <dbl>
## 1 White True 83359 0.893
## 2 Black or African American True 4006 0.0429
## 3 American Indian or Alaskan Native True 1178 0.0126
## 4 Asian True 1569 0.0168
## 5 Native Hawaiian or other Pacific Islander True 251 0.00269
## 6 Other race only True 1371 0.0147
## 7 Multiracial True 1600 0.0171
89.31% of adults who fulfill case 1 are white as expected since the population of adults in the States is predominantly white.
Since we now have a good understanding of our case 1 adults, we will go ahead with case 1’ heart attack and stroke diagnosis analysis.
We want to know the proportion of adults that fulfilled case 1 and was diagnosed with a heart attack.
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 4080 0.0433
## 2 No 90161 0.957
Also, let us get the proportion of Case 1 adults that had a stroke
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 2480 0.0263
## 2 No 91829 0.974
The percentage of adults who fulfilled case 1, that had a heart attack is about 4.32% and more than those diagnosed with a stroke (2.63%).
Question: What percentage of the population who met case 1 and reported that (s)he had good or better health was also diagnosed with a heart attack?
## prop_hattack
## 1 0.03331143
3.33% of adults who met case 1 and reported that (s)he had good or better health was also diagnosed with a heart attack.
Question: What is the chance that a Black or African American who met case 1 was ever diagnosed with a heart attack? Ans: 0.035
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 141 0.0353
## 2 No 3850 0.965
## Warning: Factor `cvdinfr4` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 1 x 2
## cvdinfr4 n
## <fct> <int>
## 1 <NA> 15
Note that the sum total of Black or African Americans who fulfil case 1 is 4006 but we excluded the NAs (15) of “cvdinfr4” in our analysis. 3.53% of Black or African Americans who met case 1 were once diagnosed with a heart attack
Those who sleep 7-9hrs per day and those who do any physical activity or exercise during the past 30 days other than their regular jobs and consume at least one or more fruits and vegetable each per day (tsleepanyexfrvg)
We want to know the proportion of adults that fulfilled case 2 and were diagnosed with a heart attack.
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 5690 0.0423
## 2 No 128876 0.958
Also, let us get the proportion of Case 2 adults that had a stroke
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 3603 0.0267
## 2 No 131109 0.973
The percentage of adults who fulfilled case 2, that had a heart attack is about 4.23% and more than those diagnosed with a stroke (2.67%).
Those who sleep 7-9hrs per day and those who do not do any physical activity or exercise but consume at least one or more fruits and vegetable each per day (tsleepnoextfrvg)
We want to know the proportion of adults that fulfilled case 3 and were ever diagnosed with a heart attack.
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 2286 0.0777
## 2 No 27151 0.922
Also, let us get the proportion of Case 3 adults that ever had a stroke
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 1622 0.0549
## 2 No 27903 0.945
The percentage of adults who fulfilled case 3, that ever had a heart attack is about 7.76% and more than those diagnosed with a stroke (5.49%). We can see the effect of doing exercise from the above result; though the adults slept 7-9 hours per day and consumed fruits and vegetable at least one or more times per day each, those who were diagnosed with a heart attack and also stroke are at least 1.8 times more than those diagnosed in cases 2 and 3.
Those who sleep 7-9hrs per day and meet the aerobic exercise recommendation but consume less than one fruit and vegetable each per day (tsleepexlfvg)
We want to know the proportion of adults that fulfilled case 4 and were ever diagnosed with a heart attack.
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 563 0.0536
## 2 No 9940 0.946
Also, let us get the proportion of Case 4 adults that ever had a stroke
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 351 0.0334
## 2 No 10172 0.967
The percentage of adults who fulfilled case 4, that had a heart attack is about 5.36% and more than those diagnosed with a stroke (3.34%).
Those who sleep 7-9hrs per day and who do not do any physical activity or exercise and consume less than one fruits and vegetable each per day (tsleepnoexlfvg).
We want to know the proportion of adults that fulfilled case 5 and were ever diagnosed with a heart attack.
## # A tibble: 2 x 3
## cvdinfr4 n percentage
## <fct> <int> <dbl>
## 1 Yes 929 0.0723
## 2 No 11925 0.928
Also, let us get the proportion of Case 5 adults that ever had a stroke
## # A tibble: 2 x 3
## cvdstrk3 n percentage
## <fct> <int> <dbl>
## 1 Yes 687 0.0533
## 2 No 12210 0.947
The percentage of adults who fulfilled case 5, that ever had a heart attack is about 7.22% and more than those diagnosed with stroke (5.32%).
We will tabulate the results of cases 1 to 5 then visualize to know the case that has more diagnosed heart attack and also stroke. Due to the scope of my learning thus far, we will input the results above manually then create a table and then plot.
We will do that by creating a dataframe “Summary_Cases_HAttack_Stroke” that has 3 variables: Cases-Cases 1 to 5, Percentage_HAttack- percentage of adults in each case ever diagnosed with a heart attack and also the Percentage_Stroke- percentage of adults in each case ever diagnosed with a stroke.
Cases<-c("Case 1","Case 2","Case 3","Case 4","Case 5")
Percentage_HAttack<-c(4.33,4.23,7.76,5.36,7.23)
Percentage_Stroke<-c(2.63,2.67,5.49,3.34,5.32)
Summary_Cases_HAttack_Stroke<-data.frame(Cases,Percentage_HAttack,Percentage_Stroke)
Summary_Cases_HAttack_Stroke
## Cases Percentage_HAttack Percentage_Stroke
## 1 Case 1 4.33 2.63
## 2 Case 2 4.23 2.67
## 3 Case 3 7.76 5.49
## 4 Case 4 5.36 3.34
## 5 Case 5 7.23 5.32
We will visualize the above results by plotting a bar chart for the cases
plotH<-ggplot(Summary_Cases_HAttack_Stroke,aes(x=Cases,y=Percentage_HAttack))+geom_bar(stat = "identity")+ggtitle("% of Adults that had a Heart Attack")
plotS<-ggplot(Summary_Cases_HAttack_Stroke,aes(x=Cases,y=Percentage_Stroke))+geom_bar(stat = "identity")+ggtitle("Percentage of Adults that had Stroke")
ggarrange(plotH,plotS,ncol = 2,nrow = 1)
We can see that Cases 1 and 2 have a lower percentage of adults ever diagnosed with a heart attack and stroke. Case 4 has a slightly higher percentage of adults diagnosed with a heart attack and also stroke (the difference with respect to cases 1 and 2 is the amount of fruits and vegetable consumed). Case 3 and Case 5 have more percentage of adults diagnosed with a heart attack and also a stroke.
Conclusion: Meeting the recommended aerobic exercise or doing any physical activity or exercise and consuming at least one or more fruits and vegetable each per day can be associated with reduced chances of a heart attack and also stroke, though other factors like genetics, race, age, sex, play a vital role.
We will tabulate the results of cases 1 to 5 then visualize to know the case that has good or better health compared to the others.
We will do that by creating a dataframe “Summary_Cases_Results” that has 5 variables: Cases-Cases 1 to 5, Count- number of adults in the sample population that fulfills each case, Percentage_of Population- percentage of adults in the sample population that fulfills each case, lastly percentage of adults in the sample population that have good or better health and fair or poor health for each case
Cases<-c("Case 1","Case 2","Case 3","Case 4","Case 5")
Count<-c(94477,134971,29598,10547,12929)
Percentage_of_Population<-c(20.57,28.66,6.27,2.22,2.74)
Good_or_Better_Health<-c(92.22,91.23,75.25,86.44,71.36)
Fair_or_Poor_Health<-c(7.78,8.77,24.75,13.56,28.64)
Summary_Cases_Results<-data.frame(Cases,Count,Percentage_of_Population,Good_or_Better_Health,Fair_or_Poor_Health)
datatable(Summary_Cases_Results)
We will visualise the above results by plotting a bar chart for the cases
plot1<-ggplot(Summary_Cases_Results,aes(x=Cases,y=Count))+geom_bar(stat = "identity")+ggtitle("Number of Adults for each Case")
plot2<-ggplot(Summary_Cases_Results,aes(x=Cases,y=Percentage_of_Population))+geom_bar(stat = "identity")+ggtitle("Percentage of Adults for each Case")
plot3<-ggplot(Summary_Cases_Results,aes(x=Cases,y=Good_or_Better_Health))+geom_bar(stat = "identity")+ggtitle("% of Adults with Good or Better Health")
plot4<-ggplot(Summary_Cases_Results,aes(x=Cases,y=Fair_or_Poor_Health))+geom_bar(stat = "identity")+ggtitle("% of Adults with Fair or Poor Health")
ggarrange(plot1,plot2,plot3,plot4,ncol = 2,nrow = 2)
Adults who met cases 1 and 2 (highest is case 2) are more than those who met cases 3, 4 and 5 each. A good number of the population (49.23%) have a healthy standard of living: sleep 7-9hours per day, meet the recommended aerobic exercise/do any physical activity or exercise and consume at least one or more fruits and vegetable each per day. We can see that more than 90% each of those who met case 1 and also case 2 have good or better health. For those (case 4) who consumed less fruit and vegetable compared to cases 1 and 2 but sleep and exercise like cases 1 and 2, 86.44% of them reported good or better health. Though exercising regularly and sleeping 7-9 hours per day is good, it is also important to consume at least one or more fruits and vegetable each day.
Conclusion: We can generalize that those who sleep 7-9hours per day, meet the recommended aerobic exercise/do any physical activity or exercise and consume at least one or more fruits and vegetable each per day have “good or better health”, lower chances of a heart attack and also stroke than those who do not.
N.B: The terms “proportion and percentage, in decimal”, used in this analysis mean the same thing. Proportion sum equates to 1, and the percentage (%) sum equates to 100. The results for the code used in calculating percentage are expressed in decimal.
For this analysis, I used contents from the following websites as a guide:
4. https://www.sleepfoundation.org
6. https://www.health.harvard.edu
9. https://bradleyboehmke.github.io
13. https://stats.idre.ucla.edu/r
14. https://www.storybench.org
15. https://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
16 https://rstudio-pubs-static.s3.amazonaws.com/3364_d1a578f521174152b46b19d0c83cbe7e.html
I also used the book “OpenIntro Statistics,” Third Edition, by David M Diez, Christopher D Barr, and Mine Cetinkaya-Rundel and definitely knowledge gained from the course “Introduction to Probability and Data,” by Duke University on Coursera.