We are living in a digital world where people are more concerned about the digital footprint and people often consider the like and shares they get on a post that they publish online as an important metric. We have several social media websites and print media to share our articles. Online print media platform like Mashable publishes thousand of online media everyday and it is important for them get a more user engagement from the the article they post online. In this data set we are trying to predict the shares using that data that would have several potential influencing factors, in other words, predictors, such as text sentiment polarity, rate of negative words, rate of positive words, to name some.

Why this analysis is important ? From this predictive model, Mashable can potentially predict the number of shares they can receive based on the type of article they are publishing online.

Data

The data set summarizes a heterogeneous set of features about articles published by in a period of two years.The data set has 39644 entries and 61 feature columns. The project is aimed to subset the original data set based on type of data channel (one of the six type) and then to predict the number of shares.

For this part of the project we are using LIFESTYLE data channel for training and building a predictive modeling and extending same models to other five data channels.

newspopData <- read_csv('OnlineNewsPopularity.csv')

# select specified data channel from params and drop other data channel columns
newspopData <- newspopData %>% filter(newspopData[params$data_channel] == 1) %>%
  select( -starts_with('data_channel_is_'), -url)


newspopData

Exploratory Data Analysis

summary(newspopData$shares)

Dividing the popularity of the shares based on quantiles to classify them as to how popular they are. The 1st quarter percentile would be popularity index 4, next quarter be index 3, so on and so forth. Higher the number of shares of that article, higher the popularity index.

pShares <- newspopData$shares
pShares <- sort(pShares)

quantileValues <- quantile(pShares, c(.25, .5, .75)) 

newspopData <- newspopData %>% mutate(article_popularity_index = if_else(shares<quantileValues[1], "4",
                                                                  if_else(shares<quantileValues[2], "3",
                                                                  if_else(shares<quantileValues[3],"2", "1"))))

newspopData$article_popularity_index <- as_factor(newspopData$article_popularity_index)

Starting EDA with the visualization of the target variable shares as a function of the most obvious factor, number of words in the article:

newspopData$scaledShare <- scale(newspopData$shares, center = T, scale = T)

#calculating correlation index
corrIndex <- round(cor(newspopData$scaledShare, newspopData$n_tokens_content),2)

# plotting the shares as a function of words in an article
ggplot(newspopData, aes(x= n_tokens_content, y = scaledShare)) +
  geom_point()+
  labs(subtitle = 'Word Count v/s Number of Shares',
      y = 'Number of Share (Scaled)',
      x = 'Number of Words in article', 
      title = toupper(str_replace_all(params$data_channel, "_", " ")),
      caption = 'Source: News Popularity Dataset') + 
      geom_text(x = max(newspopData$n_tokens_content)/2, 
                y = max(newspopData$scaledShare)/2, size = 4, 
                label = paste0("Correlation coefficient = ", corrIndex), color = 'red')

We can inspect trend of Number of shares as a function of Number of words in the article. If the points show an upward trend, then the article with high number of words are shared more. However, if we see a negative trend then we can estimate that with increasing number of words in the article, number of shares decreases. This trend can be investigated further with the correlation coefficientm which in this case is 0.07.

Now, creating a new variable here that help eliminates the dummy variable for each day of the week. This new variable will take the value of each day of the week that corresponds to the dummy variable’s value:

newspopData <- newspopData %>% select(-scaledShare)
newspopData <- newspopData %>% mutate(day = if_else(weekday_is_monday==   1, "Monday", 
                                            if_else(weekday_is_tuesday==  1, "Tuesday",
                                            if_else(weekday_is_wednesday==1, "Wednesday",
                                            if_else(weekday_is_thursday== 1, "Thursday",
                                            if_else(weekday_is_friday==   1, "Friday",
                                            if_else(weekday_is_saturday== 1,"Saturday",
                                            if_else(weekday_is_sunday==   1, "Sunday", 
                                                    "-"))))))))

newspopData$day <- as_factor(newspopData$day)

knitr::kable(newspopData %>% group_by(day) %>% summarize(Count_of_Articles = n()))

day	Count_of_Articles
Monday	322
Tuesday	334
Wednesday	388
Thursday	358
Friday	305
Saturday	182
Sunday	210

knitr::kable(newspopData %>% group_by(day, article_popularity_index) %>% summarize(Count_of_Articles = n()))

This table shows the shares of articles on each day of the week. For tech based articles, we can expect a rise in shares in mid week or Fridays or after any tech has been launched. For entertainment and lifestyle articles, post-weekend period would be the most active period of sharing. While world, social media and business would not have a definitive trend.

day	article_popularity_index	Count_of_Articles
Monday	4	84
Monday	2	73
Monday	1	85
Monday	3	80
Tuesday	4	87
Tuesday	2	71
Tuesday	1	77
Tuesday	3	99
Wednesday	4	109
Wednesday	2	103
Wednesday	1	82
Wednesday	3	94
Thursday	4	90
Thursday	2	81
Thursday	1	90
Thursday	3	97
Friday	4	79
Friday	2	74
Friday	1	69
Friday	3	83
Saturday	4	9
Saturday	2	64
Saturday	1	57
Saturday	3	52
Sunday	4	17
Sunday	2	70
Sunday	1	65
Sunday	3	58

ggplot(newspopData, aes(x = day, y = shares/1000000)) +
  geom_bar(stat = 'identity', width = 0.5, fill = 'tomato3')+
  labs(subtitle = 'Number of Shares (Million) Vs Day of Week',
       caption = 'Source : News Popularity Dataset',
       y = 'Total Share count in Million',
       x = 'Day of the Week',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),) +
       theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

From this bar chart we can visualize the shares trend across the week. The users engagement with the type of data Chanel (tech, entertainment, politics) will be different across the week. Users may be more inclined toward sharing lifestyle and entertainment news during weekend and prefer less to share technology/science related news during same time.

ggplot(newspopData, aes(x = global_rate_positive_words, y = global_sentiment_polarity)) +
       geom_point(aes(col = day, shape = article_popularity_index)) + 
       geom_smooth(aes(col = day), method = 'lm', se = F) + 
       scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
       labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
       x = 'Global Positive Word Rate',
       y = 'Global Sentiment Polarity',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       color = 'Days',
       size = 'Shares per Million') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

ggplot(newspopData, aes(x = global_rate_negative_words, y = global_sentiment_polarity)) + 
      geom_point(aes (col = day, shape = article_popularity_index)) + 
      geom_smooth(aes(col = day), method = 'lm', se = F) + 
      scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
      labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
       x = 'Global Negative Word Rate',
       y = 'Global Sentiment Polarity',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       color = 'Days',
       size = 'Shares per Million') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

The above plots estimate the general sentiments of users based on positive and negative words in the content. Ideal expectations would be: an article with more positive words has a positive sentiment index and more shares, and vice versa. Exceptions may be found in peculiar cases.

Plotting the number of shares based on the number of images and videos that an article has, based on what day of the week it is:

ggplot(newspopData, aes(x=day, y =   shares/1000000)) + 
  geom_bar(aes(fill = as_factor(num_imgs)), stat="identity", position="dodge") + 
  scale_fill_discrete(name = 'Number of Images \n in Article') +
  labs(subtitle = 'Number of Images in article v/s number of shares',
       x = 'Number of Images',
       y = 'Number of shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset") + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

This plot shows the number of shares (in millions) based on number of images included in each articles for each day of the week. The general trend might see a increase in sharing of articles over Fridays and weekends when there is more leisure time. Moreover, tech based articles can also see a rise in share if any new technology is released in middle of the week. Entertainment and Lifestyles articles can see a rise in share if any events happened over the weekend which is the case most of the times. World and social media Articles should not have a trend, as events from all over the world keep on happening throughout the week and updates are posted on social media non stop. Business articles would be more trending during working days.

in addition to that, not all articles would be oriented towards images. For example, tech and business based articles readers would not care about the number of images included, but only the content itself, while for entertainment, world, lifestyle articles, images are one of the major factors in capturing user’s attention, and thus, shares.

ggplot(newspopData, aes(x =  num_keywords, y =  shares/1000000)) + 
  geom_bar(aes(fill = day), stat="identity") + 
  scale_x_discrete(limits=c(2:10)) +
  labs(subtitle = 'Shares based on number of keywords',
       x = 'Number of Keywords',
       y = 'Shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       fill = 'Days') + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

This graph represents the shares trend of any article based on the number of keywords. A general speculated trend would be: the higher the number of keywords, the higher it is descriptive about its content, and hence the number of shares would also be higher. However. this might not hold true for business based articles where short and crisp key words can also do the trick. Otherwise, the trend of shares should be positive with respect to number of keywords.

Now, the title sentiment polarity is one of the first things that users read while making a decision about reading an article let alone sharing it. Hence it is important to see how is the title sentiment polarity, is it negative or positive. Here, we are dividing the polarities into 4 sections; Highly positive (Sentiment Polarity between 0.5 and 1), positive (Sentiment Polarity between 0 and 0.5), negative ((Sentiment Polarity between -0.5 and 0)) and highly negative (Sentiment Polarity between -1 and -0.5).

pols = c("Extremely Positive", "Positive", "Negative", "Extremely Negative")

newspopData <- newspopData %>% mutate(title_Sentiment_Class = 
                                        if_else(title_sentiment_polarity > 0.5, pols[1],
                                        if_else(title_sentiment_polarity > 0, pols[2],
                                        if_else(title_sentiment_polarity > -0.5, pols[3], pols[4]))))


newspopData$title_Sentiment_Class <- as_factor(newspopData$title_Sentiment_Class)

knitr::kable(newspopData %>% group_by(title_Sentiment_Class) %>% summarize(Count_of_articles = n()))

title_Sentiment_Class	Count_of_articles
Negative	1259
Extremely Negative	58
Positive	639
Extremely Positive	143

knitr::kable(newspopData %>% group_by(title_Sentiment_Class, article_popularity_index) %>% summarize(Count_of_articles = n()))

The tables above represents the number of articles classified into each title sentiment category, and based on their popularity index.

title_Sentiment_Class	article_popularity_index	Count_of_articles
Negative	4	292
Negative	2	320
Negative	1	320
Negative	3	327
Extremely Negative	4	12
Extremely Negative	2	7
Extremely Negative	1	20
Extremely Negative	3	19
Positive	4	138
Positive	2	169
Positive	1	152
Positive	3	180
Extremely Positive	4	33
Extremely Positive	2	40
Extremely Positive	1	33
Extremely Positive	3	37

knitr::kable(newspopData %>% group_by(title_Sentiment_Class, day) %>% summarize(Count_of_articles = n()))

This table further shows the shares of different title sentimental types of articles for each day of the week.

title_Sentiment_Class	day	Count_of_articles
Negative	Monday	189
Negative	Tuesday	201
Negative	Wednesday	247
Negative	Thursday	216
Negative	Friday	199
Negative	Saturday	91
Negative	Sunday	116
Extremely Negative	Monday	11
Extremely Negative	Tuesday	10
Extremely Negative	Wednesday	12
Extremely Negative	Thursday	12
Extremely Negative	Friday	6
Extremely Negative	Saturday	2
Extremely Negative	Sunday	5
Positive	Monday	105
Positive	Tuesday	100
Positive	Wednesday	109
Positive	Thursday	105
Positive	Friday	80
Positive	Saturday	67
Positive	Sunday	73
Extremely Positive	Monday	17
Extremely Positive	Tuesday	23
Extremely Positive	Wednesday	20
Extremely Positive	Thursday	25
Extremely Positive	Friday	20
Extremely Positive	Saturday	22
Extremely Positive	Sunday	16

newspopData <- newspopData %>% mutate(content_Sentiment_Class = 
                                        if_else( global_sentiment_polarity > 0.5, pols[1],
                                        if_else( global_sentiment_polarity > 0, pols[2],
                                        if_else( global_sentiment_polarity > -0.5, pols[3], pols[4]))))


newspopData$content_Sentiment_Class <- as_factor(newspopData$content_Sentiment_Class)

# average article length 
knitr::kable(newspopData %>%
  group_by(content_Sentiment_Class, day)%>%
    select(day, content_Sentiment_Class, n_tokens_content) %>%
    summarise(Average_Content_Length = round(sum(n_tokens_content)/n())))

We further want to extend our analysis based on the average number of tokens on an article published during the weekdays and weekend.

content_Sentiment_Class	day	Average_Content_Length
Positive	Monday	591
Positive	Tuesday	603
Positive	Wednesday	668
Positive	Thursday	634
Positive	Friday	655
Positive	Saturday	683
Positive	Sunday	657
Negative	Monday	354
Negative	Tuesday	324
Negative	Wednesday	344
Negative	Thursday	221
Negative	Friday	292
Negative	Saturday	259
Negative	Sunday	197
Extremely Positive	Wednesday	399
Extremely Positive	Thursday	161

ggplot(newspopData, aes(x=title_Sentiment_Class, y=shares/1000000)) +
  geom_bar(aes(fill = day), stat="identity", position="dodge") +
  labs(subtitle = 'Shares based on title sentiment and day of the week',
       x = 'Title Sentiment Category',
       y = 'Shares (in millions)',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset",
       fill = 'Days') +
      theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1)) + 
       geom_text(x = 3, y = 0.6, size = 3, 
                label = paste("Title Sentiment Score range", "Extremely Positive = 0.5 <--> 1", 
                              'Positive = 0 <--> 0.5 ', 
                              'Negative = -0.5 <--> 0', 
                              'Extremely Negative = -1 <--> -0.5', 
                              sep = '\n' ), color = 'black')

The above graph represents the shares of articles (in millions) based on title sentiment category for each day of the week. Some of the trends that can be thought of here is that entertainment and lifestyle articles can have a shares of both extremes ends of classifications, mostly in the starting of the week after any events that might have happened, or any other controversy that their target celebrity would be involved in.

For tech, a high number of positive ends shares can be expected, while social media, business, world can have a mix of all the types of sentimental categorical shares, depending on the latest news.

ggplot(newspopData, aes(x= title_sentiment_polarity)) +
  geom_density(aes(fill=title_Sentiment_Class), alpha = 0.7) +
  facet_grid(. ~day)+
  scale_fill_discrete(name = "Title Sentiment") +
  labs(subtitle = 'Distribution of articles on title sentiment and day of the week',
       x = 'Title Sentiment Polarity',
       y = 'Density',
       title = toupper(str_replace_all(params$data_channel, "_", " ")),
       caption = "Source : News Popularity Dataset") + 
       theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))

The plot above shows the density of distribution of the shares of each type of title sentimental category for each day of the week, in other words, how much density is covered by each type of sentimental titled article for each day. Again, the trend would be the same for each type of the data, depending on various situations that drive these articles, and its publications.

Modelling

Firstly, we are checking for the skewness in the data set for the response variable that is “shares”. For that we are plotting the distribution for the data of that variable below:

ggplot(newspopData) + geom_density(aes(x = shares))+ 
  labs(title = "Shares Predictor Distribution in Dataset")

From the plot above showing distribution of shares, we can see that the distribution is left-skewed. To mitigate this, we are taking a log transform of that variable.

ggplot(newspopData) + geom_density(aes(x = log(shares)))+
  labs(title = "Logarithmic transformation of shares distribution")

As seen from the plot above, after doing the log transformation, shares distribution can be estimated as a normal distribution, which will be better for the models that we would be fitting for this project. Hence, for all the models that are built in this project, the target response variable is the log transform (“logShares”) of the original shares variable.

set.seed(42)

# Removing mutated variables from the data set
lmData <- newspopData %>% select(-day, -title_Sentiment_Class, -content_Sentiment_Class, -article_popularity_index)

lmData$logShares <- log(lmData$shares)

# train test split
index <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)

# Removing the shares variable because we are using logShares as response
trainDf <- lmData[index, ] %>% select(-shares)
testDf  <- lmData[-index, ] %>% select(-shares)

Variable Selection

For variable selection we used p-test values to consider significant coefficients from the linear model fit. For a confidence interval of 95%, the significant predictors are used to fit a Linear and a Lasso regression model. For the significant parameter selection, our NULL and Alternate hypothesis is explained below:
H~0: All coefficients have zero slope
H~a: At-least one predictor variable has a non-zero slope

modelLmVar <- lm(logShares~. ,data = trainDf)

# getting p-values 
pVal <- data.frame(summary(modelLmVar)$coefficients[,4])
pVal$row_names <- row.names(pVal)

# checking if slope of intercept is significant (i.e. <<< 0.05)
pVal <- pVal %>% filter(pVal[,1] < 0.05)
cols = c()

# remove intercept term and store all significant intercepts in "cols" variable 
if (pVal[,2][1] == "(Intercept)"){
  cols = pVal[,2][-1]
}else{
  cols = pVal[,2]
}
cols

Multiple Linear Regression Model

Linear regression is a regression model that uses a straight line to describe a relationship between two or more predictor and a response variable. Out of the 54 predictor variables in the data set, the predictors that were deemed significant from the step above are used in this model, with a response variable of logShares.

trControl <- trainControl(method = 'repeatedcv', number = 10)

# model training
modelLm <- train(logShares ~ .,
              data = trainDf %>% 
              select(all_of(cols), logShares),
              method = "lm",
              preProcess = c('center', 'scale'),
              trControl = trControl)

title = toupper(str_replace_all(params$data_channel, "_", " "))

# variable importance for linear model
plot(varImp(modelLm), 
     main = paste("Variable importance Plot for ", "\n", title))

lmPredict <- predict(modelLm, newdata = testDf %>% select(all_of(cols), logShares))
lmRes <- postResample(lmPredict, obs = testDf$logShares)


# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf %>% select(all_of(cols)))[2]
adjR2Linear <-  1-(((1-lmRes[2]^2)*(n-1))/(n-p-1))  
 
lmRes

Lasso Regression

LASSO or the Least Absolute Shrinkage and Selection Operator is a special type of linear model that uses penalty based shrinkage. Shrinkage here means converging data values to a central point, or where the absolute magnitude of the coefficients are shrunk to a 0 based on L1 regularization penalty method which adds a penalty factor(lambda) for every misclassified or wrongly predicted values.

LASSO generally works well with models having less variable parameters. From the above process, we have narrowed down a few most influencing and statistically significant variables which we will be using for this model as well.

The caret package is used to train the model here using the lasso method. The prediction metrics are then derived using the postResample function. The same predictor and response variable from the model above are used here as well.

set.seed(43)

# train
fitControl <- trainControl(method = "repeatedcv",number = 5)
fitLASSO <- train(logShares ~ . ,
                  data=trainDf,
                  method = "lasso", 
                  trControl = fitControl,
                  preProcess = c("center", "scale")
                )


predLASSO <- predict(fitLASSO, newdata = testDf)

coeff <- predict(fitLASSO$finalModel, type = "coef", mode = "fraction",
                 s = fitLASSO$bestTune$fraction, testDf)


lassoRes <- postResample(predLASSO, obs = testDf$logShares)


# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf)[2]
adjR2Lasso <-  1-(((1-lassoRes[2]^2)*(n-1))/(n-p-1))

lassoRes

Random Forest

Before building the random forest model, seeing at the data, the data set is high dimensional, hence, it would be easier to select the “best” subsets of variables for random forest ensemble tree learning method to make the process more efficient. More details about Random Forest is given before the model is formulated.

Best Variable Selection: Principal Component Analysis

Since the data set includes a large number of variables, it becomes important to select the best subsets of the data that defines the data set and variations in the data set for number of shares. To get these variables, here we are conducting Principal Component Analysis (PCA). PCA standardizes the variables, and represent new variables/PCs that are a function of the original variables from the data set.

Again, to keep the models separate, we are again splitting the data into train and test data set, and removing the categorical variables that were added previously in the data. After that the PCs are formulated.

set.seed(200)

newspopData$logShares <- log(newspopData$shares)

trainIndex <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)

trainData <- newspopData[trainIndex, ]
testData <- newspopData[-trainIndex, ]

# Removing shares, logShares and mutated variables to form PCs
PCs <- prcomp(trainData %>% 
                select(-shares, -logShares, -day, 
                       -title_Sentiment_Class, 
                       -content_Sentiment_Class,
                       -article_popularity_index), 
                        center=TRUE, scale=TRUE)

# cumulative variance
var = cumsum(PCs$sdev**2/sum(PCs$sdev**2))

show(plot(var, xlab = "Principal Component", 
          ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b'))

pt <- min(which(var > 0.80))

points(pt, var[pt], col="red", cex=2, pch=20)
text(8, 0.8, col="red", paste0(round(100*var[pt],2), "% variance \nis explained by ", pt, " \nnumber of PC(s)"))

From the above graph, we can see that at least 80% variance is explained by 21 PCs, and hence for the subsequent steps, 21 PCs will be used.

# Creating training data PCs without the logShares variables
treeTrainData <- predict(PCs, newdata = trainData %>% select(-shares, -logShares)) %>% 
  as_tibble() %>% 
  select(PC1:paste0("PC", pt))

treeTrainData$logShares <- trainData$logShares

# This the final training data that will be used to train random forest model
treeTrainData

Random Forest Model:

Random Forest model is a tree based ensemble learning method where the training data set is sampled into smaller data sets, and some variables from the data set are included in the smaller sampled data. A tree is fitted to that data. This method is repeated several times, and finally a model is finalized that will be an average of all these sampled tree models. Since Random Forest takes only a subset of variables in any given sampling, it creates really strong model to predict on with unknown variables since the sampled data is able to explain variance much better with fewer variables.

We are using train() function from caret package to train the Random Forest model using the rf method. Pre-processing is done by centering and scaling the data, and a 3 fold cross validation is also used. (The value 3 is selected to have some computation ease). Lastly, the model is tried for a number of values for the tuning parameter “mtry”, out of which the best mtry value is selected.

set.seed(400)
trctrl <- trainControl(method = "repeatedcv", number = 3)
randomForest <- train(logShares ~., data = treeTrainData, 
                      method = "rf",
                      trControl=trctrl,
                      preProcess = c("center", "scale"),
                      tuneGrid = expand.grid(mtry = 1:as.integer(pt/3)))

randomForest

Testing the model on the test data set to see the metrics of evaluation. The predict() function is used to create both PC based testing data, and to predict the logShares using the Random Forest model. Further, the postResample() function is used to extract metrics such as RMSE, R-Squared, and MAE, represented below.

# Creating test data PCs
treeTestData <- predict(PCs, newdata = testData %>% select(-shares, -logShares)) %>% 
                         as_tibble() %>% 
                         select(PC1:paste0("PC", pt))

# Completing the test data
treeTestData$logShares <- testData$logShares

rfPredict <- predict(randomForest, newdata = treeTestData)
randomFRes <- postResample(rfPredict, obs = treeTestData$logShares)


# Adjusted R-square
n <- dim(treeTestData)[1]
p <- dim(treeTestData)[2] - 1
adjR2Rf <-  1-(((1-randomFRes[2]^2)*(n-1))/(n-p-1))
 
randomFRes

Boosted Tree Model

Boosting is another tree based approach for improving the predictions resulting from a decision tree. Like other tree models, boosting can be applied for classification and regression. Boosting works in sequentially: each tree is grown using information from previously grown trees. In this project, we are using gbm method within caret train function.

An additional input tunegrid also provided to train function, which will find best hyper parameter combinations of number of trees, and depth.

Here, all the available parameters are used as predictor variables, and logShares is the target response variable.

# Training the model
boostedTree <- train(logShares ~., data = trainData %>% select(-shares), method = "gbm",
                     trControl = trainControl(method = "repeatedcv", number = 3),
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(n.trees = c(25, 50, 100, 150, 200),
                                            interaction.depth = 1:4,
                                            shrinkage = 0.1,
                                            n.minobsinnode = 10),
                     verbose = FALSE)

# Best combination of tuning hyperparameters
print(boostedTree$bestTune)

boostedTPredict <- predict(boostedTree, newdata = testData %>% select(-shares))

boostedTRes <- postResample(boostedTPredict, obs = (testData %>% select(-shares))$logShares)

# plot
plot(boostedTree, xlab = "Max Tree Depth ", 
     ylab = "Root Mean Square Error (Found through CV)", 
     main = "All Predictors - Minimizing RMSE")

# Adjusted R-square
n <- dim(testData)[1]
p <- dim(testData)[2] - 1
adjR2Boosted <-  1-(((1-boostedTRes[2]^2)*(n-1))/(n-p-1))
 

boostedTRes

From the plot we can visualize the best combinations of parameters that were used to find minimum RMSE value using Boosted tree algorithm, which can also be seen by using (boostedTree$bestTune) function.

Model Comparision

model_types = c("Linear", "Ensemble")

modelComparisonDf = tibble(model = c("Linear Model"), model_type = c(model_types[1]), RMSE = c(round(lmRes[[1]],3)), R_Squared = c(round(lmRes[[2]], 4)), Adj_R2 = c(round(adjR2Linear,4)) )

modelComparisonDf <- rbind(modelComparisonDf, c("LASSO", model_types[1], round(lassoRes[[1]],3), round(lassoRes[[2]],4), round(adjR2Lasso,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Random Forest", model_types[2], round(randomFRes[[1]],3), round(randomFRes[[2]],4), round(adjR2Rf,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Boosted Trees", model_types[2], round(boostedTRes[[1]],3), round(boostedTRes[[2]],4), round(adjR2Boosted,4)))

knitr::kable(modelComparisonDf)

model	model_type	RMSE	R_Squared	Adj_R2
Linear Model	Linear	0.917	0.0321	-0.0168
LASSO	Linear	0.905	0.0561	-0.091
Random Forest	Ensemble	0.967	0.0353	-0.0334
Boosted Trees	Ensemble	0.466	0.7849	0.5768

bestLinearModel = ""
bestLinearRMSE = ""
bestLinearR2 = ""

bestEnsembleModel = ""
bestEnsembleRMSE = ""
bestEnsembleR2 = ""

bestOverallModel = ""
bestOverallRMSE = ""
bestOverallR2 = ""

lineardf <- modelComparisonDf %>% filter(model_type == "Linear")

ensembleDf <- modelComparisonDf %>% filter(model_type == "Ensemble")

if (length(unique(lineardf$R_Squared)) > 1){
  bestLinearModel <- lineardf[which.min(lineardf$RMSE),][[1]]
  bestLinearRMSE <- lineardf[which.min(lineardf$RMSE),][[3]]
  bestLinearR2 <- lineardf[which.min(lineardf$RMSE),][[4]]
} else{
  bestLinearModel <- lineardf[which.max(lineardf$Adj_R2),][[1]]
  bestLinearRMSE <- lineardf[which.max(lineardf$Adj_R2),][[3]]
  bestLinearR2 <- lineardf[which.max(lineardf$Adj_R2),][[4]]
}


if (length(unique(ensembleDf$R_Squared)) > 1){
  bestEnsembleModel <- ensembleDf[which.min(ensembleDf$RMSE),][[1]]
  bestEnsembleRMSE <- ensembleDf[which.min(ensembleDf$RMSE),][[3]]
  bestEnsembleR2 <- ensembleDf[which.min(ensembleDf$RMSE),][[4]]
} else{
  bestEnsembleModel <- ensembleDf[which.max(ensembleDf$Adj_R2),][[1]]
  bestEnsembleRMSE <- ensembleDf[which.max(ensembleDf$Adj_R2),][[3]]
  bestEnsembleR2 <- ensembleDf[which.max(ensembleDf$Adj_R2),][[4]]
}


if (length(unique(modelComparisonDf$R_Squared)) > 1){
  bestOverallModel <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[1]]
  bestOverallRMSE <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[3]]
  bestOverallR2 <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[4]]
} else{
  bestOverallModel <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[1]]
  bestOverallRMSE <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[3]]
  bestOverallR2 <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[4]]
}

Conclusion

Selection of best model is based on comparing the lowest prediction RMSE value across all models.

The best linear model is LASSO with RMSE value of 0.905, and R Squared value of 0.0561.

The best ensemble method is Boosted Trees with RMSE value of 0.466, and R Squared value of 0.7849.

The best overall model is Boosted Trees with RMSE value of 0.466, and R Squared value of 0.7849.

Article Shares Prediction

Library

Introdcution