ASSOCIATION RULES ANALYSIS On Student Alcohol Consumption Dataset
Introduction
The aim of this report is to identify relationships between students’ social factors and their likelihood of consuming alcohol. Alcohol consumption among secondary school students has been a persistent problem. Alcohol is in fact the most commonly abused drug among youth in the USA. According to the CDC, individuals aged 12 to 21 made approximately 119,000 emergency room visits for injuries and conditions related to alcohol in 2013. Excessive alcohol consumption among minors may lead to academic failure, physical problems, death from alcohol poisoning, and unwanted sexual activity.
We will use association rule mining to analyze the Student Alcohol Consumption dataset. The method uses the Apriori algorithm to identify frequent patterns and find associations between attributes in a set. The algorithm generates rules based on statistical metrics such as support and confidence. Support is the proportion of observations in the dataset that contain both the antecedent and the consequent of a rule. Confidence is the proportion of observations containing the antecedent that also contain the consequent, that is, the estimated conditional probability of the consequent given the antecedent. We hope the Apriori method will help us derive the combinations of factors that may lead to alcohol consumption among students.
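To make the support and confidence definitions concrete, here is a minimal sketch in base R. The data frame `toy` and its values are hypothetical and are not taken from the student dataset; lift, which is used later in the report, is included for completeness.

```r
# Toy data (hypothetical): each row is one observation,
# each column a yes/no item.
toy <- data.frame(
  activities = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  low_dalc   = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

n    <- nrow(toy)
both <- sum(toy$activities & toy$low_dalc)

# Rule under study: {activities} => {low_dalc}
support    <- both / n                            # share of rows with both items
confidence <- both / sum(toy$activities)          # share of antecedent rows that also have the consequent
lift       <- confidence / (sum(toy$low_dalc) / n)  # confidence relative to the consequent's base rate
```

A lift above 1 means the consequent is more frequent among rows matching the antecedent than in the data overall.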
Data Description
The Student Alcohol Consumption dataset analyzed in this report using R was taken from the Machine Learning Repository of the University of California, Irvine (UCI). It is a multivariate dataset collected from a survey of high school students, with a mix of categorical and numerical variables. We used the str() function to check the data structure and identify the types of variables present. The output in figure 1 reveals that the dataset contains 33 attributes and 395 cases representing students. The variable names and types are also displayed in figure 1. The variables include age, sex, family size, parents’ education, daily alcohol consumption, health, absences and grades. A full description of the variables is listed in the appendix section. We can also note that there is no identification key in the output, so we will not need to drop a column from the data.
Figure 1: Student Alcohol Consumption Dataset Structure
Next, we use the summary() function to understand the data distribution. The output in figure 2 shows six descriptive statistics for the numeric variables in the dataset: the minimum value, maximum value, 1st and 3rd quartiles, median, and mean. Additionally, we can note that there are no missing values in the dataset; however, given the nature of the analysis being performed, we will need to transform all numerical variables to categorical ones in order to run the Apriori method. We can also observe that most numerical variables have a small range. The attribute age, for example, ranges between 15 and 22, and the daily alcohol consumption (Dalc) ranges between 1 and 5. The absences variable, in contrast, has a large range, from 0 to 75. This variable will be helpful in determining whether alcohol use is associated with absences.
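The missing-value and range checks described above can also be done directly rather than read off the summary() output. The sketch below uses a small hypothetical data frame `toy` in place of the real data; with the actual dataset, `toy` would be replaced by `alcohol`.

```r
# Hypothetical stand-in for a few numeric columns of the survey data.
toy <- data.frame(
  age      = c(15, 16, 17, 18, 22),
  Dalc     = c(1, 1, 2, 3, 5),
  absences = c(0, 2, 4, 10, 75)
)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column <- colSums(is.na(toy))

# Minimum (row 1) and maximum (row 2) of every numeric column.
ranges <- sapply(toy[sapply(toy, is.numeric)], range)
```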
Figure 2: Descriptive Statistics for All Variables
Since we are interested in how alcohol consumption relates to the other variables, it is worth looking at the distribution of daily alcohol consumption (Dalc). We can see from the histogram in figure 3 that daily alcohol consumption is skewed to the right, with more than 75% of students reporting a low daily consumption.
Figure 3: Student Daily Alcohol Consumption
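The share of students at each consumption level can also be checked numerically instead of being read off the histogram. This is a minimal sketch with hypothetical values; with the real data, `dalc` would be `alcohol$Dalc` before it is converted to a factor.

```r
# Hypothetical Dalc values on the 1-5 scale.
dalc <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 5)

# Proportion of observations at each level.
shares <- prop.table(table(dalc))

# Combined share of the two lowest levels (1 = very low, 2 = low).
low_share <- sum(shares[c("1", "2")])
```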
Data Preprocessing
As noted above, in order to use the Apriori method, we need to convert all the numerical variables to discrete (factor) variables. For variables with a small range, like age and mother’s education (Medu), we use the factor() function to convert them to categorical variables. Because this is a survey dataset, most variables allow only a limited set of values, favoring scaled responses (low/high, bad/good, etc.). For variables with a relatively large range (G1, G2, and G3), we use interval discretization, and for the absences variable we use fixed discretization with manually chosen break points. The commands for discretization are included in the appendix below. We also added labels to the study time, family relationship (famrel), free time, going out, daily alcohol consumption, weekend alcohol consumption and health variables to make the resulting rules easier to read. Figure 4 below displays the summary of all variables in the dataset after the conversion of numerical variables to categorical variables.
Figure 4: Summary of the Discretized Outputs
As we can see in figure 4, all variables are now categorical, and we are ready to run the association rules method.
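The two kinds of conversion described in this section can be sketched in base R as follows. The values are hypothetical, and base R cut() is used here only as a stand-in for the interval method of arules' discretize(); the two may label or place interval boundaries slightly differently.

```r
# Hypothetical values: studytime is a 1-4 code, G3 a 0-20 grade.
studytime <- c(1, 2, 2, 3, 4)
g3        <- c(0, 5, 10, 14, 20)

# Small-range variable: turn the codes into labelled categories.
studytime_f <- factor(studytime, levels = 1:4,
                      labels = c("<2hrs", "2 to 5 hrs", "5 to 10 hrs", "over 10 hrs"))

# Larger-range variable: five equal-width intervals.
g3_f <- cut(g3, breaks = 5, include.lowest = TRUE)
```

Passing `levels = 1:4` explicitly keeps the labels aligned with the codes even if one code happens to be absent from the data.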
Association Rules Analysis
The main goal of the analysis is to generate association rules linked to alcohol consumption. In the previous steps, we labeled the daily consumption of alcohol as very low, low, medium, high and very high. We generated two separate sets of rules with Dalc as the consequent: one for very low or low consumption and another for medium, high or very high consumption. We then compared the results of the two runs.
For the run where Dalc is very low/low, we used a minimum support of 0.2, a minimum confidence level of 0.8, and a minimum length of 2. The result for this run is displayed in figure 5 below.
Figure 5: Apriori Output for Dalc = very low / low on the Right-Hand Side
The above run took 122 items as the method input and returned 1,404 rules.
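The item count comes from how a data frame of factors is turned into transactions: each variable=level combination becomes one item, so the number of items is, assuming every level occurs in the data, the sum of the number of levels across all columns. A minimal sketch with a hypothetical three-column data frame:

```r
# Hypothetical data frame in which every column is a factor.
toy <- data.frame(
  sex      = factor(c("F", "M", "F", "M", "F")),
  internet = factor(c("yes", "yes", "no", "yes", "no")),
  Dalc     = factor(c("very low", "low", "medium", "high", "very high"))
)

# One item per variable=level pair.
items_per_variable <- sapply(toy, nlevels)
n_items <- sum(items_per_variable)
```

Applying the same count to the discretized `alcohol` data frame should give a figure comparable to the one reported by apriori().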
Next, we ran some commands to sort the rules by lift and eliminate redundant rules, that is, rules that merely add items to a more general rule ranked at least as high by lift. The lift measure helps us evaluate the strength of a rule by indicating how much more often the antecedent and the consequent occur together than would be expected if they were independent. Figure 6 below displays the summary of the rules after the pruning step. The number of rules decreased from 1,404 to 362, and Figure 7 displays the top 5 strongest rules, arranged by lift value.
Figure 6: Remaining Rules After Pruning Redundant Rules
Figure 7: Top 5 Association rules by lift Value
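The pruning logic can be illustrated in base R without any rule-mining package. In this sketch the rules are hypothetical, are represented simply as sets of items (left- and right-hand side together), and are assumed to be already sorted by decreasing lift.

```r
# Hypothetical rules, already sorted by decreasing lift.
rules <- list(
  c("activities=yes", "Dalc=very low"),
  c("activities=yes", "sex=F", "Dalc=very low"),
  c("internet=yes", "Dalc=very low"),
  c("internet=yes", "activities=yes", "Dalc=very low")
)

n <- length(rules)
# is_sub[i, j] is TRUE when every item of rule i also appears in rule j.
is_sub <- outer(seq_len(n), seq_len(n),
                Vectorize(function(i, j) all(rules[[i]] %in% rules[[j]])))
# Keep only pairs where rule i ranks above rule j (strict upper triangle).
is_sub[lower.tri(is_sub, diag = TRUE)] <- FALSE
# Rule j is redundant if any higher-ranked rule is contained in it.
redundant <- colSums(is_sub) >= 1
pruned <- rules[!redundant]
```

Here the second and fourth rules are dropped because each only adds an item to the first rule, which ranks higher.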
The top five rules, each with a confidence of 1, suggest that students who take part in extracurricular activities and drink little over the weekend also have a low daily alcohol intake.
A set of 362 rules is still quite large. Increasing the minimum support value to 0.35 reduced the number of rules to 13, which makes the rule set easier to review and reduces the work the algorithm has to do.
We then drew a parallel coordinates plot of 12 rules to better highlight them.
Figure 8: Parallel Coordinates Plot for 12 rules
Figure 8 shows that female students attending the GP school who have internet access and are not in a relationship are likely to have a very low daily alcohol intake.
Next, we ran the algorithm to generate the rules leading to a medium/high/very high daily alcohol intake, with the same parameters as the first run (a minimum support of 0.2, a minimum confidence level of 0.8, and a minimum length of 2). However, the algorithm returned zero rules. So we adjusted the parameters as follows: minimum support = 0.05, minimum confidence = 0.1 and minimum length = 2. The algorithm then generated 3 non-redundant rules, as seen in figure 9 below.
Figure 9: Apriori Output for Medium Daily Alcohol Intake in the RHS
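One plausible reason the first attempt returned no rules, offered here as an interpretation rather than something tested in this report, is that the support of a rule can never exceed the support of its consequent. Writing A for the antecedent and B for the consequent:

```latex
\[
\operatorname{supp}(A \Rightarrow B) \;=\; P(A \cap B) \;\le\; P(B) \;=\; \operatorname{supp}(B)
\]
```

If fewer than 20% of students fall into a given category such as Dalc = medium, no rule with that category on the right-hand side can reach a minimum support of 0.2, whatever the antecedent. The same consequent being rare also makes a confidence of 0.8 hard to reach.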
Figure 10 displays the 3 generated rules, which can be visualized in figure 11.
Figure 10: 3 Rules Generated for Medium Daily Alcohol Intake in RHS
Figure 11: Parallel Coordinates Plot for the 3 rules With Dalc in RHS
The rules indicate that moderate daily alcohol consumption is associated with male students who receive no extra educational support (schoolsup = no).
Conclusion
Association rules are useful in identifying frequent patterns in the data. They can scan a large database and provide many “if this, then that” rules. The method was effective in answering our research question; however, multiple parameter settings must be tested to generate the desired rules. We derived from the analysis that students with low alcohol consumption are mostly single females who have internet access. The dataset used in this analysis can be analyzed in various other ways. A similar analysis could evaluate students’ academic performance based on other attributes in the data, or how alcohol may influence a student’s grades or class attendance.
References
Centers for Disease Control and Prevention (CDC). Alcohol-Related Disease Impact (ARDI). Atlanta, GA: CDC.
https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
Github Code:
Appendix
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
- school — student’s school (binary: ‘GP’ — Gabriel Pereira or ‘MS’ — Mousinho da Silveira)
- sex — student’s sex (binary: ‘F’ — female or ‘M’ — male)
- age — student’s age (numeric: from 15 to 22)
- address — student’s home address type (binary: ‘U’ — urban or ‘R’ — rural)
- famsize — family size (binary: ‘LE3’ — less or equal to 3 or ‘GT3’ — greater than 3)
- Pstatus — parent’s cohabitation status (binary: ‘T’ — living together or ‘A’ — apart)
- Medu — mother’s education (numeric: 0 — none, 1 — primary education (4th grade), 2 — 5th to 9th grade, 3 — secondary education or 4 — higher education)
- Fedu — father’s education (numeric: 0 — none, 1 — primary education (4th grade), 2 — 5th to 9th grade, 3 — secondary education or 4 — higher education)
- Mjob — mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob — father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason — reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian — student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime — home to school travel time (numeric: 1 — <15 min., 2 — 15 to 30 min., 3 — 30 min. to 1 hour, or 4 — >1 hour)
- studytime — weekly study time (numeric: 1 — <2 hours, 2 — 2 to 5 hours, 3 — 5 to 10 hours, or 4 — >10 hours)
- failures — number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup — extra educational support (binary: yes or no)
- famsup — family educational support (binary: yes or no)
- paid — extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities — extra-curricular activities (binary: yes or no)
- nursery — attended nursery school (binary: yes or no)
- higher — wants to take higher education (binary: yes or no)
- internet — Internet access at home (binary: yes or no)
- romantic — with a romantic relationship (binary: yes or no)
- famrel — quality of family relationships (numeric: from 1 — very bad to 5 — excellent)
- freetime — free time after school (numeric: from 1 — very low to 5 — very high)
- goout — going out with friends (numeric: from 1 — very low to 5 — very high)
- Dalc — workday alcohol consumption (numeric: from 1 — very low to 5 — very high)
- Walc — weekend alcohol consumption (numeric: from 1 — very low to 5 — very high)
- health — current health status (numeric: from 1 — very bad to 5 — very good)
- absences — number of school absences (numeric: from 0 to 93)
**Copied directly from https://www.kaggle.com/uciml/student-alcohol-consumption/home
R Code for the analysis
#load libraries
library("arules")
library("arulesViz")
#Load the Student Alcohol consumption data (strings are read in as factors)
alcohol<-read.csv(file="student-mat.csv", header=TRUE, sep=";", as.is = FALSE)
#run the str and summary commands to acquaint ourselves with the data
str(alcohol)
summary(alcohol)
#graph and review daily alcohol consumption
hist(alcohol$Dalc)
#Discretize numerical variables and add labels to make the rules easier to read
alcohol$age<-factor(alcohol$age)
alcohol$Medu <-factor(alcohol$Medu)
alcohol$Fedu <-factor(alcohol$Fedu)
alcohol$traveltime <-factor(alcohol$traveltime)
alcohol$studytime <-factor(alcohol$studytime, labels = c("<2hrs", "2 to 5 hrs", "5 to 10 hrs", "over 10 hrs"))
alcohol$famrel <-factor(alcohol$famrel, labels = c("very bad", "bad", "fair", "good", "very good"))
alcohol$failures <-factor(alcohol$failures)
alcohol$freetime <-factor(alcohol$freetime, labels = c("very low", "low", "medium", "high", "very high"))
alcohol$goout <-factor(alcohol$goout, labels = c("very low", "low", "medium", "high", "very high"))
alcohol$Dalc <-factor(alcohol$Dalc, labels = c("very low", "low", "medium", "high", "very high"))
alcohol$Walc <-factor(alcohol$Walc, labels = c("very low", "low", "medium", "high", "very high"))
alcohol$health <-factor(alcohol$health, labels = c("very bad", "bad", "fair", "good", "very good"))
#Grades: five equal-width intervals; absences: manually chosen break points
alcohol$G1<-discretize(alcohol$G1, method="interval", breaks=5)
alcohol$G2<-discretize(alcohol$G2, method="interval", breaks=5)
alcohol$G3<-discretize(alcohol$G3, method="interval", breaks=5)
alcohol$absences<-discretize(alcohol$absences, method="fixed", breaks=c(0, 5, 10, 15, 75))
# Structure and summary of the discretized data
str(alcohol)
summary(alcohol)
#Run apriori method rules to get rules for low daily alcohol use (min-supp 0.2, as in the first run described in the report)
alclow<-apriori(alcohol, parameter= list(supp=0.2, conf=0.8, minlen=2), appearance=list(rhs=c("Dalc=very low", "Dalc=low"), default="lhs"))
alclow
#remove the redundant rules and display the remaining rules
rules.sorted <- sort(alclow, by="lift")
inspect(rules.sorted)
# sparse=FALSE returns a dense logical matrix so the indexing below works
subset.matrix <- is.subset(rules.sorted, rules.sorted, sparse=FALSE)
# keep only comparisons against higher-ranked rules
subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- FALSE
# a rule is redundant if a higher-ranked rule is contained in it
redundant <- colSums(subset.matrix, na.rm=TRUE) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]
inspect(rules.pruned)
summary(rules.pruned)
# preview the top 5 rules by lift
inspect(head(sort(rules.pruned, by="lift"), n=5))
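# reduce the number of rules by raising the min-supp to 0.35
alclow<-apriori(alcohol, parameter= list(supp=0.35, conf=0.8, minlen=2), appearance=list(rhs=c("Dalc=very low", "Dalc=low"), default="lhs"))
alclow
# sort and prune the new rule set before plotting (assumed to mirror the pruning step above)
rules.sorted <- sort(alclow, by="lift")
subset.matrix <- is.subset(rules.sorted, rules.sorted, sparse=FALSE)
subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- FALSE
rules.pruned <- rules.sorted[colSums(subset.matrix, na.rm=TRUE) < 1]
#Graph the data
plot(rules.pruned, method="paracoord", control=list(reorder=TRUE))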
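#Run apriori method rules to get rules for medium, high and very high daily alcohol use
alchigh<-apriori(alcohol, parameter= list(supp=0.05, conf=0.1, minlen=2), appearance=list(rhs=c("Dalc=very high", "Dalc=high", "Dalc=medium"), default="lhs"))
alchigh
# remove redundant rules to obtain rules.pruned1 (assumed to mirror the pruning step used for the low-consumption rules)
rules.sorted1 <- sort(alchigh, by="lift")
subset.matrix1 <- is.subset(rules.sorted1, rules.sorted1, sparse=FALSE)
subset.matrix1[lower.tri(subset.matrix1, diag=TRUE)] <- FALSE
rules.pruned1 <- rules.sorted1[colSums(subset.matrix1, na.rm=TRUE) < 1]
# view the 3 generated rules by lift
inspect(head(sort(rules.pruned1, by="lift")))
#Graph the data
plot(rules.pruned1, method="paracoord", control=list(reorder=TRUE))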
# End Script