Article Shares Prediction

Shyam Gadhwala & Kamlesh Pandey

Library

library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(caret)
library(glmnet)

Introdcution

We are living in a digital world where people are more concerned about the digital footprint and people often consider the like and shares they get on a post that they publish online as an important metric. We have several social media websites and print media to share our articles. Online print media platform like Mashable publishes thousand of online media everyday and it is important for them get a more user engagement from the the article they post online. In this data set we are trying to predict the shares using that data that would have several potential influencing factors, in other words, predictors, such as text sentiment polarity, rate of negative words, rate of positive words, to name some.

Why this analysis is important ? From this predictive model, Mashable can potentially predict the number of shares they can receive based on the type of article they are publishing online.

Data

The data set summarizes a heterogeneous set of features about articles published by in a period of two years.The data set has 39644 entries and 61 feature columns. The project is aimed to subset the original data set based on type of data channel (one of the six type) and then to predict the number of shares.

For this part of the project we are using LIFESTYLE data channel for training and building a predictive modeling and extending same models to other five data channels.

newspopData <- read_csv('OnlineNewsPopularity.csv')

# select specified data channel from params and drop other data channel columns
newspopData <- newspopData %>% filter(newspopData[params$data_channel] == 1) %>%
  select( -starts_with('data_channel_is_'), -url)


newspopData
## # A tibble: 2,099 × 54
##    timedelta n_tokens…¹ n_tok…² n_uni…³ n_non…⁴
##        <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
##  1       731          8     960   0.418    1.00
##  2       731         10     187   0.667    1.00
##  3       731         11     103   0.689    1.00
##  4       731         10     243   0.619    1.00
##  5       731          8     204   0.586    1.00
##  6       731         11     315   0.551    1.00
##  7       731         10    1190   0.409    1.00
##  8       731          6     374   0.641    1.00
##  9       730         12     499   0.513    1.00
## 10       729         11     223   0.662    1.00
## # … with 2,089 more rows, 49 more variables:
## #   n_non_stop_unique_tokens <dbl>,
## #   num_hrefs <dbl>, num_self_hrefs <dbl>,
## #   num_imgs <dbl>, num_videos <dbl>,
## #   average_token_length <dbl>,
## #   num_keywords <dbl>, kw_min_min <dbl>,
## #   kw_max_min <dbl>, kw_avg_min <dbl>, …

Exploratory Data Analysis

summary(newspopData$shares)
##    Min. 1st Qu.  Median    Mean 3rd Qu. 
##      28    1100    1700    3682    3250 
##    Max. 
##  208300

Dividing the popularity of the shares based on quantiles to classify them as to how popular they are. The 1st quarter percentile would be popularity index 4, next quarter be index 3, so on and so forth. Higher the number of shares of that article, higher the popularity index.

pShares <- newspopData$shares
pShares <- sort(pShares)

quantileValues <- quantile(pShares, c(.25, .5, .75)) 

newspopData <- newspopData %>% mutate(article_popularity_index = if_else(shares<quantileValues[1], "4",
                                                                  if_else(shares<quantileValues[2], "3",
                                                                  if_else(shares<quantileValues[3],"2", "1"))))

newspopData$article_popularity_index <- as_factor(newspopData$article_popularity_index)

Starting EDA with the visualization of the target variable shares as a function of the most obvious factor, number of words in the article:

newspopData$scaledShare <- scale(newspopData$shares, center = T, scale = T)

#calculating correlation index
corrIndex <- round(cor(newspopData$scaledShare, newspopData$n_tokens_content),2)

# plotting the shares as a function of words in an article
ggplot(newspopData, aes(x= n_tokens_content, y = scaledShare)) +
  geom_point()+
  labs(subtitle = 'Word Count v/s Number of Shares',
      y = 'Number of Share (Scaled)',
      x = 'Number of Words in article', 
      title = toupper(str_replace_all(params$data_channel, "_", " ")),
      caption = 'Source: News Popularity Dataset') + 
      geom_text(x = max(newspopData$n_tokens_content)/2, 
                y = max(newspopData$scaledShare)/2, size = 4, 
                label = paste0("Correlation coefficient = ", corrIndex), color = 'red')

We can inspect trend of Number of shares as a function of Number of words in the article. If the points show an upward trend, then the article with high number of words are shared more. However, if we see a negative trend then we can estimate that with increasing number of words in the article, number of shares decreases. This trend can be investigated further with the correlation coefficientm which in this case is 0.07.

Now, creating a new variable here that help eliminates the dummy variable for each day of the week. This new variable will take the value of each day of the week that corresponds to the dummy variable’s value:

newspopData <- newspopData %>% select(-scaledShare)
newspopData <- newspopData %>% mutate(day = if_else(weekday_is_monday==   1, "Monday", 
                                            if_else(weekday_is_tuesday==  1, "Tuesday",
                                            if_else(weekday_is_wednesday==1, "Wednesday",
                                            if_else(weekday_is_thursday== 1, "Thursday",
                                            if_else(weekday_is_friday==   1, "Friday",
                                            if_else(weekday_is_saturday== 1,"Saturday",
                                            if_else(weekday_is_sunday==   1, "Sunday", 
                                                    "-"))))))))

newspopData$day <- as_factor(newspopData$day)

Some statistics based on the variables is as shown:

knitr::kable(newspopData %>% group_by(day) %>% summarize(Count_of_Articles = n()))
day Count_of_Articles
Monday 322
Tuesday 334
Wednesday 388
Thursday 358
Friday 305
Saturday 182
Sunday 210
knitr::kable(newspopData %>% group_by(day, article_popularity_index) %>% summarize(Count_of_Articles = n()))
day article_popularity_index Count_of_Articles
Monday 4 84
Monday 2 73
Monday 1 85
Monday 3 80
Tuesday 4 87
Tuesday 2 71
Tuesday 1 77
Tuesday 3 99
Wednesday 4 109
Wednesday 2 103
Wednesday 1 82
Wednesday 3 94
Thursday 4 90
Thursday 2 81
Thursday 1 90
Thursday 3 97
Friday 4 79
Friday 2 74
Friday 1 69
Friday 3 83
Saturday 4 9
Saturday 2 64
Saturday 1 57
Saturday 3 52
Sunday 4 17
Sunday 2 70
Sunday 1 65
Sunday 3 58

This table shows the shares of articles on each day of the week. For tech based articles, we can expect a rise in shares in mid week or Fridays or after any tech has been launched. For entertainment and lifestyle articles, post-weekend period would be the most active period of sharing. While world, social media and business would not have a definitive trend.

ggplot(newspopData, aes(x = day, y = shares/1000000)) +
  geom_bar(stat = 'identity', width = 0.5, fill = 'tomato3')+
  labs(subtitle = 'Number of Shares (Million) Vs Day of Week',
       caption = 'Source : News Popularity Dataset',
       y = 'Total Share count in Million',
       x = 'Day of the Week',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),) +
       theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

From this bar chart we can visualize the shares trend across the week. The users engagement with the type of data Chanel (tech, entertainment, politics) will be different across the week. Users may be more inclined toward sharing lifestyle and entertainment news during weekend and prefer less to share technology/science related news during same time.

ggplot(newspopData, aes(x = global_rate_positive_words, y = global_sentiment_polarity)) +
       geom_point(aes(col = day, shape = article_popularity_index)) + 
       geom_smooth(aes(col = day), method = 'lm', se = F) + 
       scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
       labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
       x = 'Global Positive Word Rate',
       y = 'Global Sentiment Polarity',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       color = 'Days',
       size = 'Shares per Million') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

ggplot(newspopData, aes(x = global_rate_negative_words, y = global_sentiment_polarity)) + 
      geom_point(aes (col = day, shape = article_popularity_index)) + 
      geom_smooth(aes(col = day), method = 'lm', se = F) + 
      scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
      labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
       x = 'Global Negative Word Rate',
       y = 'Global Sentiment Polarity',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       color = 'Days',
       size = 'Shares per Million') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))  

The above plots estimate the general sentiments of users based on positive and negative words in the content. Ideal expectations would be: an article with more positive words has a positive sentiment index and more shares, and vice versa. Exceptions may be found in peculiar cases.

Plotting the number of shares based on the number of images and videos that an article has, based on what day of the week it is:

ggplot(newspopData, aes(x=day, y =   shares/1000000)) + 
  geom_bar(aes(fill = as_factor(num_imgs)), stat="identity", position="dodge") + 
  scale_fill_discrete(name = 'Number of Images \n in Article') +
  labs(subtitle = 'Number of Images in article v/s number of shares',
       x = 'Number of Images',
       y = 'Number of shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset") + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

This plot shows the number of shares (in millions) based on number of images included in each articles for each day of the week. The general trend might see a increase in sharing of articles over Fridays and weekends when there is more leisure time. Moreover, tech based articles can also see a rise in share if any new technology is released in middle of the week. Entertainment and Lifestyles articles can see a rise in share if any events happened over the weekend which is the case most of the times. World and social media Articles should not have a trend, as events from all over the world keep on happening throughout the week and updates are posted on social media non stop. Business articles would be more trending during working days.

in addition to that, not all articles would be oriented towards images. For example, tech and business based articles readers would not care about the number of images included, but only the content itself, while for entertainment, world, lifestyle articles, images are one of the major factors in capturing user’s attention, and thus, shares.

ggplot(newspopData, aes(x =  num_keywords, y =  shares/1000000)) + 
  geom_bar(aes(fill = day), stat="identity") + 
  scale_x_discrete(limits=c(2:10)) +
  labs(subtitle = 'Shares based on number of keywords',
       x = 'Number of Keywords',
       y = 'Shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       fill = 'Days') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

This graph represents the shares trend of any article based on the number of keywords. A general speculated trend would be: the higher the number of keywords, the higher it is descriptive about its content, and hence the number of shares would also be higher. However. this might not hold true for business based articles where short and crisp key words can also do the trick. Otherwise, the trend of shares should be positive with respect to number of keywords.

Now, the title sentiment polarity is one of the first things that users read while making a decision about reading an article let alone sharing it. Hence it is important to see how is the title sentiment polarity, is it negative or positive. Here, we are dividing the polarities into 4 sections; Highly positive (Sentiment Polarity between 0.5 and 1), positive (Sentiment Polarity between 0 and 0.5), negative ((Sentiment Polarity between -0.5 and 0)) and highly negative (Sentiment Polarity between -1 and -0.5).

Creating the new variable to classify the sentiments into above category:

pols = c("Extremely Positive", "Positive", "Negative", "Extremely Negative")

newspopData <- newspopData %>% mutate(title_Sentiment_Class = 
                                        if_else(title_sentiment_polarity > 0.5, pols[1],
                                        if_else(title_sentiment_polarity > 0, pols[2],
                                        if_else(title_sentiment_polarity > -0.5, pols[3], pols[4]))))


newspopData$title_Sentiment_Class <- as_factor(newspopData$title_Sentiment_Class)

Statistics about the articles that are in each of these categories:

knitr::kable(newspopData %>% group_by(title_Sentiment_Class) %>% summarize(Count_of_articles = n()))
title_Sentiment_Class Count_of_articles
Negative 1259
Extremely Negative 58
Positive 639
Extremely Positive 143
knitr::kable(newspopData %>% group_by(title_Sentiment_Class, article_popularity_index) %>% summarize(Count_of_articles = n()))
title_Sentiment_Class article_popularity_index Count_of_articles
Negative 4 292
Negative 2 320
Negative 1 320
Negative 3 327
Extremely Negative 4 12
Extremely Negative 2 7
Extremely Negative 1 20
Extremely Negative 3 19
Positive 4 138
Positive 2 169
Positive 1 152
Positive 3 180
Extremely Positive 4 33
Extremely Positive 2 40
Extremely Positive 1 33
Extremely Positive 3 37

The tables above represents the number of articles classified into each title sentiment category, and based on their popularity index.

knitr::kable(newspopData %>% group_by(title_Sentiment_Class, day) %>% summarize(Count_of_articles = n()))
title_Sentiment_Class day Count_of_articles
Negative Monday 189
Negative Tuesday 201
Negative Wednesday 247
Negative Thursday 216
Negative Friday 199
Negative Saturday 91
Negative Sunday 116
Extremely Negative Monday 11
Extremely Negative Tuesday 10
Extremely Negative Wednesday 12
Extremely Negative Thursday 12
Extremely Negative Friday 6
Extremely Negative Saturday 2
Extremely Negative Sunday 5
Positive Monday 105
Positive Tuesday 100
Positive Wednesday 109
Positive Thursday 105
Positive Friday 80
Positive Saturday 67
Positive Sunday 73
Extremely Positive Monday 17
Extremely Positive Tuesday 23
Extremely Positive Wednesday 20
Extremely Positive Thursday 25
Extremely Positive Friday 20
Extremely Positive Saturday 22
Extremely Positive Sunday 16

This table further shows the shares of different title sentimental types of articles for each day of the week.

newspopData <- newspopData %>% mutate(content_Sentiment_Class = 
                                        if_else( global_sentiment_polarity > 0.5, pols[1],
                                        if_else( global_sentiment_polarity > 0, pols[2],
                                        if_else( global_sentiment_polarity > -0.5, pols[3], pols[4]))))


newspopData$content_Sentiment_Class <- as_factor(newspopData$content_Sentiment_Class)

# average article length 
knitr::kable(newspopData %>%
  group_by(content_Sentiment_Class, day)%>%
    select(day, content_Sentiment_Class, n_tokens_content) %>%
    summarise(Average_Content_Length = round(sum(n_tokens_content)/n())))
content_Sentiment_Class day Average_Content_Length
Positive Monday 591
Positive Tuesday 603
Positive Wednesday 668
Positive Thursday 634
Positive Friday 655
Positive Saturday 683
Positive Sunday 657
Negative Monday 354
Negative Tuesday 324
Negative Wednesday 344
Negative Thursday 221
Negative Friday 292
Negative Saturday 259
Negative Sunday 197
Extremely Positive Wednesday 399
Extremely Positive Thursday 161

We further want to extend our analysis based on the average number of tokens on an article published during the weekdays and weekend.

ggplot(newspopData, aes(x=title_Sentiment_Class, y=shares/1000000)) +
  geom_bar(aes(fill = day), stat="identity", position="dodge") +
  labs(subtitle = 'Shares based on title sentiment and day of the week',
       x = 'Title Sentiment Category',
       y = 'Shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       fill = 'Days') +
      theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1)) + 
       geom_text(x = 3, y = 0.6, size = 3, 
                label = paste("Title Sentiment Score range", "Extremely Positive = 0.5 <--> 1", 
                              'Positive = 0 <--> 0.5 ', 
                              'Negative = -0.5 <--> 0', 
                              'Extremely Negative = -1 <--> -0.5', 
                              sep = '\n' ), color = 'black')

The above graph represents the shares of articles (in millions) based on title sentiment category for each day of the week. Some of the trends that can be thought of here is that entertainment and lifestyle articles can have a shares of both extremes ends of classifications, mostly in the starting of the week after any events that might have happened, or any other controversy that their target celebrity would be involved in.

For tech, a high number of positive ends shares can be expected, while social media, business, world can have a mix of all the types of sentimental categorical shares, depending on the latest news.

ggplot(newspopData, aes(x= title_sentiment_polarity)) +
  geom_density(aes(fill=title_Sentiment_Class), alpha = 0.7) +
  facet_grid(. ~day)+
  scale_fill_discrete(name = "Title Sentiment") +
  labs(subtitle = 'Distribution of articles on title sentiment and day of the week',
       x = 'Title Sentiment Polarity',
       y = 'Density',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset") + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

The plot above shows the density of distribution of the shares of each type of title sentimental category for each day of the week, in other words, how much density is covered by each type of sentimental titled article for each day. Again, the trend would be the same for each type of the data, depending on various situations that drive these articles, and its publications.

Modelling

Firstly, we are checking for the skewness in the data set for the response variable that is “shares”. For that we are plotting the distribution for the data of that variable below:

ggplot(newspopData) + geom_density(aes(x = shares))+ 
  labs(title = "Shares Predictor Distribution in Dataset")

From the plot above showing distribution of shares, we can see that the distribution is left-skewed. To mitigate this, we are taking a log transform of that variable.

ggplot(newspopData) + geom_density(aes(x = log(shares)))+
  labs(title = "Logarithmic transformation of shares distribution")

As seen from the plot above, after doing the log transformation, shares distribution can be estimated as a normal distribution, which will be better for the models that we would be fitting for this project. Hence, for all the models that are built in this project, the target response variable is the log transform (“logShares”) of the original shares variable.

set.seed(42)

# Removing mutated variables from the data set
lmData <- newspopData %>% select(-day, -title_Sentiment_Class, -content_Sentiment_Class, -article_popularity_index)

lmData$logShares <- log(lmData$shares)

# train test split
index <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)

# Removing the shares variable because we are using logShares as response
trainDf <- lmData[index, ] %>% select(-shares)
testDf  <- lmData[-index, ] %>% select(-shares)

Variable Selection

For variable selection we used p-test values to consider significant coefficients from the linear model fit. For a confidence interval of 95%, the significant predictors are used to fit a Linear and a Lasso regression model. For the significant parameter selection, our NULL and Alternate hypothesis is explained below:
H~0: All coefficients have zero slope
H~a: At-least one predictor variable has a non-zero slope

modelLmVar <- lm(logShares~. ,data = trainDf)

# getting p-values 
pVal <- data.frame(summary(modelLmVar)$coefficients[,4])
pVal$row_names <- row.names(pVal)

# checking if slope of intercept is significant (i.e. <<< 0.05)
pVal <- pVal %>% filter(pVal[,1] < 0.05)
cols = c()

# remove intercept term and store all significant intercepts in "cols" variable 
if (pVal[,2][1] == "(Intercept)"){
  cols = pVal[,2][-1]
}else{
  cols = pVal[,2]
}
cols
##  [1] "n_non_stop_words"         
##  [2] "n_non_stop_unique_tokens" 
##  [3] "num_imgs"                 
##  [4] "kw_max_avg"               
##  [5] "kw_avg_avg"               
##  [6] "weekday_is_tuesday"       
##  [7] "weekday_is_wednesday"     
##  [8] "weekday_is_thursday"      
##  [9] "weekday_is_friday"        
## [10] "global_sentiment_polarity"
## [11] "min_positive_polarity"

Multiple Linear Regression Model

Linear regression is a regression model that uses a straight line to describe a relationship between two or more predictor and a response variable. Out of the 54 predictor variables in the data set, the predictors that were deemed significant from the step above are used in this model, with a response variable of logShares.

trControl <- trainControl(method = 'repeatedcv', number = 10)

# model training
modelLm <- train(logShares ~ .,
              data = trainDf %>% 
              select(all_of(cols), logShares),
              method = "lm",
              preProcess = c('center', 'scale'),
              trControl = trControl)

title = toupper(str_replace_all(params$data_channel, "_", " "))

# variable importance for linear model
plot(varImp(modelLm), 
     main = paste("Variable importance Plot for ", "\n", title))

lmPredict <- predict(modelLm, newdata = testDf %>% select(all_of(cols), logShares))
lmRes <- postResample(lmPredict, obs = testDf$logShares)


# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf %>% select(all_of(cols)))[2]
adjR2Linear <-  1-(((1-lmRes[2]^2)*(n-1))/(n-p-1))  
 
lmRes
##      RMSE  Rsquared       MAE 
## 0.9173554 0.0321371 0.7096754

Lasso Regression

LASSO or the Least Absolute Shrinkage and Selection Operator is a special type of linear model that uses penalty based shrinkage. Shrinkage here means converging data values to a central point, or where the absolute magnitude of the coefficients are shrunk to a 0 based on L1 regularization penalty method which adds a penalty factor(lambda) for every misclassified or wrongly predicted values.

LASSO generally works well with models having less variable parameters. From the above process, we have narrowed down a few most influencing and statistically significant variables which we will be using for this model as well.

The caret package is used to train the model here using the lasso method. The prediction metrics are then derived using the postResample function. The same predictor and response variable from the model above are used here as well.

set.seed(43)

# train
fitControl <- trainControl(method = "repeatedcv",number = 5)
fitLASSO <- train(logShares ~ . ,
                  data=trainDf,
                  method = "lasso", 
                  trControl = fitControl,
                  preProcess = c("center", "scale")
                )


predLASSO <- predict(fitLASSO, newdata = testDf)

coeff <- predict(fitLASSO$finalModel, type = "coef", mode = "fraction",
                 s = fitLASSO$bestTune$fraction, testDf)


lassoRes <- postResample(predLASSO, obs = testDf$logShares)


# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf)[2]
adjR2Lasso <-  1-(((1-lassoRes[2]^2)*(n-1))/(n-p-1))

lassoRes
##      RMSE  Rsquared       MAE 
## 0.9047553 0.0560580 0.6982046

Random Forest

Before building the random forest model, seeing at the data, the data set is high dimensional, hence, it would be easier to select the “best” subsets of variables for random forest ensemble tree learning method to make the process more efficient. More details about Random Forest is given before the model is formulated.

Best Variable Selection: Principal Component Analysis

Since the data set includes a large number of variables, it becomes important to select the best subsets of the data that defines the data set and variations in the data set for number of shares. To get these variables, here we are conducting Principal Component Analysis (PCA). PCA standardizes the variables, and represent new variables/PCs that are a function of the original variables from the data set.

Again, to keep the models separate, we are again splitting the data into train and test data set, and removing the categorical variables that were added previously in the data. After that the PCs are formulated.

set.seed(200)

newspopData$logShares <- log(newspopData$shares)

trainIndex <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)

trainData <- newspopData[trainIndex, ]
testData <- newspopData[-trainIndex, ]

Creating new PCs, and plotting the cumulative variance graph:

# Removing shares, logShares and mutated variables to form PCs
PCs <- prcomp(trainData %>% 
                select(-shares, -logShares, -day, 
                       -title_Sentiment_Class, 
                       -content_Sentiment_Class,
                       -article_popularity_index), 
                        center=TRUE, scale=TRUE)

# cumulative variance
var = cumsum(PCs$sdev**2/sum(PCs$sdev**2))

show(plot(var, xlab = "Principal Component", 
          ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b'))
## NULL
pt <- min(which(var > 0.80))

points(pt, var[pt], col="red", cex=2, pch=20)
text(8, 0.8, col="red", paste0(round(100*var[pt],2), "% variance \nis explained by ", pt, " \nnumber of PC(s)"))

From the above graph, we can see that at least 80% variance is explained by 21 PCs, and hence for the subsequent steps, 21 PCs will be used.

Now, we are creating the training data for Random Forest Ensemble model.

# Creating training data PCs without the logShares variables
treeTrainData <- predict(PCs, newdata = trainData %>% select(-shares, -logShares)) %>% 
  as_tibble() %>% 
  select(PC1:paste0("PC", pt))

treeTrainData$logShares <- trainData$logShares

# This the final training data that will be used to train random forest model
treeTrainData
## # A tibble: 1,472 × 22
##      PC1   PC2     PC3     PC4    PC5    PC6
##    <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
##  1 -1.84 2.74  -3.64    3.46    1.35  -2.01 
##  2 -4.38 1.88   0.722  -0.832   0.735  0.133
##  3 -4.53 0.714  0.0971 -1.52    0.585  0.170
##  4 -4.56 1.58   0.150  -0.357   0.772  0.200
##  5 -2.63 4.61  -1.88   -0.565   0.512 -1.21 
##  6 -3.25 1.39  -1.01    0.0392  1.88   1.52 
##  7 -1.92 2.23  -3.54    3.96    1.47  -2.40 
##  8 -4.45 1.13   0.568  -0.391   0.582 -0.550
##  9 -4.55 0.610 -0.851   1.20    0.452  1.00 
## 10 -5.22 0.958  0.473   0.516  -0.692  0.311
## # … with 1,462 more rows, and 16 more
## #   variables: PC7 <dbl>, PC8 <dbl>,
## #   PC9 <dbl>, PC10 <dbl>, PC11 <dbl>,
## #   PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
## #   PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
## #   PC18 <dbl>, PC19 <dbl>, PC20 <dbl>,
## #   PC21 <dbl>, logShares <dbl>

Random Forest Model:

Random Forest model is a tree based ensemble learning method where the training data set is sampled into smaller data sets, and some variables from the data set are included in the smaller sampled data. A tree is fitted to that data. This method is repeated several times, and finally a model is finalized that will be an average of all these sampled tree models. Since Random Forest takes only a subset of variables in any given sampling, it creates really strong model to predict on with unknown variables since the sampled data is able to explain variance much better with fewer variables.

We are using train() function from caret package to train the Random Forest model using the rf method. Pre-processing is done by centering and scaling the data, and a 3 fold cross validation is also used. (The value 3 is selected to have some computation ease). Lastly, the model is tried for a number of values for the tuning parameter “mtry”, out of which the best mtry value is selected.

Training the model:

set.seed(400)
trctrl <- trainControl(method = "repeatedcv", number = 3)
randomForest <- train(logShares ~., data = treeTrainData, 
                      method = "rf",
                      trControl=trctrl,
                      preProcess = c("center", "scale"),
                      tuneGrid = expand.grid(mtry = 1:as.integer(pt/3)))

randomForest
## Random Forest 
## 
## 1472 samples
##   21 predictor
## 
## Pre-processing: centered (21), scaled (21) 
## Resampling: Cross-Validated (3 fold, repeated 1 times) 
## Summary of sample sizes: 981, 983, 980 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared    MAE      
##   1     0.9153664  0.02243316  0.7056881
##   2     0.9159739  0.02430438  0.7074691
##   3     0.9162671  0.02617734  0.7074963
##   4     0.9188450  0.02376155  0.7103767
##   5     0.9190373  0.02434655  0.7108327
##   6     0.9199683  0.02441688  0.7110105
##   7     0.9191692  0.02664557  0.7119071
## 
## RMSE was used to select the optimal
##  model using the smallest value.
## The final value used for the model was mtry
##  = 1.

Testing the model on the test data set to see the metrics of evaluation. The predict() function is used to create both PC based testing data, and to predict the logShares using the Random Forest model. Further, the postResample() function is used to extract metrics such as RMSE, R-Squared, and MAE, represented below.

# Creating test data PCs
treeTestData <- predict(PCs, newdata = testData %>% select(-shares, -logShares)) %>% 
                         as_tibble() %>% 
                         select(PC1:paste0("PC", pt))

# Completing the test data
treeTestData$logShares <- testData$logShares

rfPredict <- predict(randomForest, newdata = treeTestData)
randomFRes <- postResample(rfPredict, obs = treeTestData$logShares)


# Adjusted R-square
n <- dim(treeTestData)[1]
p <- dim(treeTestData)[2] - 1
adjR2Rf <-  1-(((1-randomFRes[2]^2)*(n-1))/(n-p-1))
 
randomFRes
##       RMSE   Rsquared        MAE 
## 0.96673542 0.03525394 0.72522972

Boosted Tree Model

Boosting is another tree based approach for improving the predictions resulting from a decision tree. Like other tree models, boosting can be applied for classification and regression. Boosting works in sequentially: each tree is grown using information from previously grown trees. In this project, we are using gbm method within caret train function.

An additional input tunegrid also provided to train function, which will find best hyper parameter combinations of number of trees, and depth.

Here, all the available parameters are used as predictor variables, and logShares is the target response variable.

# Training the model
boostedTree <- train(logShares ~., data = trainData %>% select(-shares), method = "gbm",
                     trControl = trainControl(method = "repeatedcv", number = 3),
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(n.trees = c(25, 50, 100, 150, 200),
                                            interaction.depth = 1:4,
                                            shrinkage = 0.1,
                                            n.minobsinnode = 10),
                     verbose = FALSE)

# Best combination of tuning hyperparameters
print(boostedTree$bestTune)
##    n.trees interaction.depth shrinkage
## 12      50                 3       0.1
##    n.minobsinnode
## 12             10
boostedTPredict <- predict(boostedTree, newdata = testData %>% select(-shares))

boostedTRes <- postResample(boostedTPredict, obs = (testData %>% select(-shares))$logShares)

# plot
plot(boostedTree, xlab = "Max Tree Depth ", 
     ylab = "Root Mean Square Error (Found through CV)", 
     main = "All Predictors - Minimizing RMSE")

# Adjusted R-square
n <- dim(testData)[1]
p <- dim(testData)[2] - 1
adjR2Boosted <-  1-(((1-boostedTRes[2]^2)*(n-1))/(n-p-1))
 

boostedTRes
##      RMSE  Rsquared       MAE 
## 0.4657117 0.7848593 0.2731229

From the plot we can visualize the best combinations of parameters that were used to find minimum RMSE value using Boosted tree algorithm, which can also be seen by using (boostedTree$bestTune) function.

Model Comparision

model_types = c("Linear", "Ensemble")

modelComparisonDf = tibble(model = c("Linear Model"), model_type = c(model_types[1]), RMSE = c(round(lmRes[[1]],3)), R_Squared = c(round(lmRes[[2]], 4)), Adj_R2 = c(round(adjR2Linear,4)) )

modelComparisonDf <- rbind(modelComparisonDf, c("LASSO", model_types[1], round(lassoRes[[1]],3), round(lassoRes[[2]],4), round(adjR2Lasso,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Random Forest", model_types[2], round(randomFRes[[1]],3), round(randomFRes[[2]],4), round(adjR2Rf,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Boosted Trees", model_types[2], round(boostedTRes[[1]],3), round(boostedTRes[[2]],4), round(adjR2Boosted,4)))

knitr::kable(modelComparisonDf)
model model_type RMSE R_Squared Adj_R2
Linear Model Linear 0.917 0.0321 -0.0168
LASSO Linear 0.905 0.0561 -0.091
Random Forest Ensemble 0.967 0.0353 -0.0334
Boosted Trees Ensemble 0.466 0.7849 0.5768
bestLinearModel = ""
bestLinearRMSE = ""
bestLinearR2 = ""

bestEnsembleModel = ""
bestEnsembleRMSE = ""
bestEnsembleR2 = ""

bestOverallModel = ""
bestOverallRMSE = ""
bestOverallR2 = ""

lineardf <- modelComparisonDf %>% filter(model_type == "Linear")

ensembleDf <- modelComparisonDf %>% filter(model_type == "Ensemble")

if (length(unique(lineardf$R_Squared)) > 1){
  bestLinearModel <- lineardf[which.min(lineardf$RMSE),][[1]]
  bestLinearRMSE <- lineardf[which.min(lineardf$RMSE),][[3]]
  bestLinearR2 <- lineardf[which.min(lineardf$RMSE),][[4]]
} else{
  bestLinearModel <- lineardf[which.max(lineardf$Adj_R2),][[1]]
  bestLinearRMSE <- lineardf[which.max(lineardf$Adj_R2),][[3]]
  bestLinearR2 <- lineardf[which.max(lineardf$Adj_R2),][[4]]
}


if (length(unique(ensembleDf$R_Squared)) > 1){
  bestEnsembleModel <- ensembleDf[which.min(ensembleDf$RMSE),][[1]]
  bestEnsembleRMSE <- ensembleDf[which.min(ensembleDf$RMSE),][[3]]
  bestEnsembleR2 <- ensembleDf[which.min(ensembleDf$RMSE),][[4]]
} else{
  bestEnsembleModel <- ensembleDf[which.max(ensembleDf$Adj_R2),][[1]]
  bestEnsembleRMSE <- ensembleDf[which.max(ensembleDf$Adj_R2),][[3]]
  bestEnsembleR2 <- ensembleDf[which.max(ensembleDf$Adj_R2),][[4]]
}


if (length(unique(modelComparisonDf$R_Squared)) > 1){
  bestOverallModel <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[1]]
  bestOverallRMSE <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[3]]
  bestOverallR2 <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[4]]
} else{
  bestOverallModel <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[1]]
  bestOverallRMSE <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[3]]
  bestOverallR2 <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[4]]
}

Conclusion

Selection of best model is based on comparing the lowest prediction RMSE value across all models.

For the data set for LIFESTYLE:

The best linear model is LASSO with RMSE value of 0.905, and R Squared value of 0.0561.

The best ensemble method is Boosted Trees with RMSE value of 0.466, and R Squared value of 0.7849.

The best overall model is Boosted Trees with RMSE value of 0.466, and R Squared value of 0.7849.