Shyam Gadhwala & Kamlesh Pandey
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(caret)
library(glmnet)
We are living in a digital world where people are more concerned about the digital footprint and people often consider the like and shares they get on a post that they publish online as an important metric. We have several social media websites and print media to share our articles. Online print media platform like Mashable publishes thousand of online media everyday and it is important for them get a more user engagement from the the article they post online. In this data set we are trying to predict the shares using that data that would have several potential influencing factors, in other words, predictors, such as text sentiment polarity, rate of negative words, rate of positive words, to name some.
Why this analysis is important ? From this predictive model, Mashable can potentially predict the number of shares they can receive based on the type of article they are publishing online.
The data set summarizes a heterogeneous set of features about articles published by in a period of two years.The data set has 39644 entries and 61 feature columns. The project is aimed to subset the original data set based on type of data channel (one of the six type) and then to predict the number of shares.
For this part of the project we are using SOCMED data channel for training and building a predictive modeling and extending same models to other five data channels.
<- read_csv('OnlineNewsPopularity.csv')
newspopData
# select specified data channel from params and drop other data channel columns
<- newspopData %>% filter(newspopData[params$data_channel] == 1) %>%
newspopData select( -starts_with('data_channel_is_'), -url)
newspopData
## # A tibble: 2,323 × 54
## timedelta n_tokens…¹ n_tok…² n_uni…³ n_non…⁴
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 731 8 257 0.568 1.00
## 2 731 8 218 0.663 1.00
## 3 731 9 1226 0.410 1.00
## 4 731 10 1121 0.451 1.00
## 5 729 9 168 0.778 1.00
## 6 729 9 100 0.760 1.00
## 7 729 10 1596 0.420 1.00
## 8 728 7 518 0.486 1.00
## 9 727 8 358 0.503 1.00
## 10 727 6 358 0.622 1.00
## # … with 2,313 more rows, 49 more variables:
## # n_non_stop_unique_tokens <dbl>,
## # num_hrefs <dbl>, num_self_hrefs <dbl>,
## # num_imgs <dbl>, num_videos <dbl>,
## # average_token_length <dbl>,
## # num_keywords <dbl>, kw_min_min <dbl>,
## # kw_max_min <dbl>, kw_avg_min <dbl>, …
summary(newspopData$shares)
## Min. 1st Qu. Median Mean 3rd Qu.
## 5 1400 2100 3629 3800
## Max.
## 122800
Dividing the popularity of the shares based on quantiles to classify them as to how popular they are. The 1st quarter percentile would be popularity index 4, next quarter be index 3, so on and so forth. Higher the number of shares of that article, higher the popularity index.
<- newspopData$shares
pShares <- sort(pShares)
pShares
<- quantile(pShares, c(.25, .5, .75))
quantileValues
<- newspopData %>% mutate(article_popularity_index = if_else(shares<quantileValues[1], "4",
newspopData if_else(shares<quantileValues[2], "3",
if_else(shares<quantileValues[3],"2", "1"))))
$article_popularity_index <- as_factor(newspopData$article_popularity_index) newspopData
Starting EDA with the visualization of the target variable shares as a function of the most obvious factor, number of words in the article:
$scaledShare <- scale(newspopData$shares, center = T, scale = T)
newspopData
#calculating correlation index
<- round(cor(newspopData$scaledShare, newspopData$n_tokens_content),2)
corrIndex
# plotting the shares as a function of words in an article
ggplot(newspopData, aes(x= n_tokens_content, y = scaledShare)) +
geom_point()+
labs(subtitle = 'Word Count v/s Number of Shares',
y = 'Number of Share (Scaled)',
x = 'Number of Words in article',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = 'Source: News Popularity Dataset') +
geom_text(x = max(newspopData$n_tokens_content)/2,
y = max(newspopData$scaledShare)/2, size = 4,
label = paste0("Correlation coefficient = ", corrIndex), color = 'red')
We can inspect trend of Number of shares as a function of Number of words in the article. If the points show an upward trend, then the article with high number of words are shared more. However, if we see a negative trend then we can estimate that with increasing number of words in the article, number of shares decreases. This trend can be investigated further with the correlation coefficientm which in this case is 0.05.
Now, creating a new variable here that help eliminates the dummy variable for each day of the week. This new variable will take the value of each day of the week that corresponds to the dummy variable’s value:
<- newspopData %>% select(-scaledShare)
newspopData <- newspopData %>% mutate(day = if_else(weekday_is_monday== 1, "Monday",
newspopData if_else(weekday_is_tuesday== 1, "Tuesday",
if_else(weekday_is_wednesday==1, "Wednesday",
if_else(weekday_is_thursday== 1, "Thursday",
if_else(weekday_is_friday== 1, "Friday",
if_else(weekday_is_saturday== 1,"Saturday",
if_else(weekday_is_sunday== 1, "Sunday",
"-"))))))))
$day <- as_factor(newspopData$day) newspopData
Some statistics based on the variables is as shown:
::kable(newspopData %>% group_by(day) %>% summarize(Count_of_Articles = n())) knitr
day | Count_of_Articles |
---|---|
Monday | 337 |
Wednesday | 416 |
Thursday | 463 |
Friday | 332 |
Saturday | 180 |
Sunday | 137 |
Tuesday | 458 |
::kable(newspopData %>% group_by(day, article_popularity_index) %>% summarize(Count_of_Articles = n())) knitr
day | article_popularity_index | Count_of_Articles |
---|---|---|
Monday | 2 | 84 |
Monday | 4 | 84 |
Monday | 1 | 99 |
Monday | 3 | 70 |
Wednesday | 2 | 110 |
Wednesday | 4 | 108 |
Wednesday | 1 | 100 |
Wednesday | 3 | 98 |
Thursday | 2 | 124 |
Thursday | 4 | 123 |
Thursday | 1 | 105 |
Thursday | 3 | 111 |
Friday | 2 | 90 |
Friday | 4 | 77 |
Friday | 1 | 88 |
Friday | 3 | 77 |
Saturday | 2 | 61 |
Saturday | 4 | 26 |
Saturday | 1 | 44 |
Saturday | 3 | 49 |
Sunday | 2 | 38 |
Sunday | 4 | 17 |
Sunday | 1 | 46 |
Sunday | 3 | 36 |
Tuesday | 2 | 108 |
Tuesday | 4 | 128 |
Tuesday | 1 | 109 |
Tuesday | 3 | 113 |
This table shows the shares of articles on each day of the week. For tech based articles, we can expect a rise in shares in mid week or Fridays or after any tech has been launched. For entertainment and lifestyle articles, post-weekend period would be the most active period of sharing. While world, social media and business would not have a definitive trend.
ggplot(newspopData, aes(x = day, y = shares/1000000)) +
geom_bar(stat = 'identity', width = 0.5, fill = 'tomato3')+
labs(subtitle = 'Number of Shares (Million) Vs Day of Week',
caption = 'Source : News Popularity Dataset',
y = 'Total Share count in Million',
x = 'Day of the Week',
title = toupper(str_replace_all(params$data_channel, "_", " ")),) +
theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
From this bar chart we can visualize the shares trend across the week. The users engagement with the type of data Chanel (tech, entertainment, politics) will be different across the week. Users may be more inclined toward sharing lifestyle and entertainment news during weekend and prefer less to share technology/science related news during same time.
ggplot(newspopData, aes(x = global_rate_positive_words, y = global_sentiment_polarity)) +
geom_point(aes(col = day, shape = article_popularity_index)) +
geom_smooth(aes(col = day), method = 'lm', se = F) +
scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
x = 'Global Positive Word Rate',
y = 'Global Sentiment Polarity',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
color = 'Days',
size = 'Shares per Million') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
ggplot(newspopData, aes(x = global_rate_negative_words, y = global_sentiment_polarity)) +
geom_point(aes (col = day, shape = article_popularity_index)) +
geom_smooth(aes(col = day), method = 'lm', se = F) +
scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
x = 'Global Negative Word Rate',
y = 'Global Sentiment Polarity',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
color = 'Days',
size = 'Shares per Million') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
The above plots estimate the general sentiments of users based on positive and negative words in the content. Ideal expectations would be: an article with more positive words has a positive sentiment index and more shares, and vice versa. Exceptions may be found in peculiar cases.
Plotting the number of shares based on the number of images and videos that an article has, based on what day of the week it is:
ggplot(newspopData, aes(x=day, y = shares/1000000)) +
geom_bar(aes(fill = as_factor(num_imgs)), stat="identity", position="dodge") +
scale_fill_discrete(name = 'Number of Images \n in Article') +
labs(subtitle = 'Number of Images in article v/s number of shares',
x = 'Number of Images',
y = 'Number of shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset") +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
This plot shows the number of shares (in millions) based on number of images included in each articles for each day of the week. The general trend might see a increase in sharing of articles over Fridays and weekends when there is more leisure time. Moreover, tech based articles can also see a rise in share if any new technology is released in middle of the week. Entertainment and Lifestyles articles can see a rise in share if any events happened over the weekend which is the case most of the times. World and social media Articles should not have a trend, as events from all over the world keep on happening throughout the week and updates are posted on social media non stop. Business articles would be more trending during working days.
in addition to that, not all articles would be oriented towards images. For example, tech and business based articles readers would not care about the number of images included, but only the content itself, while for entertainment, world, lifestyle articles, images are one of the major factors in capturing user’s attention, and thus, shares.
ggplot(newspopData, aes(x = num_keywords, y = shares/1000000)) +
geom_bar(aes(fill = day), stat="identity") +
scale_x_discrete(limits=c(2:10)) +
labs(subtitle = 'Shares based on number of keywords',
x = 'Number of Keywords',
y = 'Shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
fill = 'Days') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
This graph represents the shares trend of any article based on the number of keywords. A general speculated trend would be: the higher the number of keywords, the higher it is descriptive about its content, and hence the number of shares would also be higher. However. this might not hold true for business based articles where short and crisp key words can also do the trick. Otherwise, the trend of shares should be positive with respect to number of keywords.
Now, the title sentiment polarity is one of the first things that users read while making a decision about reading an article let alone sharing it. Hence it is important to see how is the title sentiment polarity, is it negative or positive. Here, we are dividing the polarities into 4 sections; Highly positive (Sentiment Polarity between 0.5 and 1), positive (Sentiment Polarity between 0 and 0.5), negative ((Sentiment Polarity between -0.5 and 0)) and highly negative (Sentiment Polarity between -1 and -0.5).
Creating the new variable to classify the sentiments into above category:
= c("Extremely Positive", "Positive", "Negative", "Extremely Negative")
pols
<- newspopData %>% mutate(title_Sentiment_Class =
newspopData if_else(title_sentiment_polarity > 0.5, pols[1],
if_else(title_sentiment_polarity > 0, pols[2],
if_else(title_sentiment_polarity > -0.5, pols[3], pols[4]))))
$title_Sentiment_Class <- as_factor(newspopData$title_Sentiment_Class) newspopData
Statistics about the articles that are in each of these categories:
::kable(newspopData %>% group_by(title_Sentiment_Class) %>% summarize(Count_of_articles = n())) knitr
title_Sentiment_Class | Count_of_articles |
---|---|
Extremely Negative | 53 |
Positive | 787 |
Negative | 1346 |
Extremely Positive | 137 |
::kable(newspopData %>% group_by(title_Sentiment_Class, article_popularity_index) %>% summarize(Count_of_articles = n())) knitr
title_Sentiment_Class | article_popularity_index | Count_of_articles |
---|---|---|
Extremely Negative | 2 | 10 |
Extremely Negative | 4 | 11 |
Extremely Negative | 1 | 21 |
Extremely Negative | 3 | 11 |
Positive | 2 | 206 |
Positive | 4 | 178 |
Positive | 1 | 216 |
Positive | 3 | 187 |
Negative | 2 | 370 |
Negative | 4 | 340 |
Negative | 1 | 316 |
Negative | 3 | 320 |
Extremely Positive | 2 | 29 |
Extremely Positive | 4 | 34 |
Extremely Positive | 1 | 38 |
Extremely Positive | 3 | 36 |
The tables above represents the number of articles classified into each title sentiment category, and based on their popularity index.
::kable(newspopData %>% group_by(title_Sentiment_Class, day) %>% summarize(Count_of_articles = n())) knitr
title_Sentiment_Class | day | Count_of_articles |
---|---|---|
Extremely Negative | Monday | 7 |
Extremely Negative | Wednesday | 10 |
Extremely Negative | Thursday | 14 |
Extremely Negative | Friday | 9 |
Extremely Negative | Saturday | 3 |
Extremely Negative | Sunday | 3 |
Extremely Negative | Tuesday | 7 |
Positive | Monday | 108 |
Positive | Wednesday | 146 |
Positive | Thursday | 140 |
Positive | Friday | 106 |
Positive | Saturday | 66 |
Positive | Sunday | 55 |
Positive | Tuesday | 166 |
Negative | Monday | 202 |
Negative | Wednesday | 246 |
Negative | Thursday | 284 |
Negative | Friday | 192 |
Negative | Saturday | 98 |
Negative | Sunday | 65 |
Negative | Tuesday | 259 |
Extremely Positive | Monday | 20 |
Extremely Positive | Wednesday | 14 |
Extremely Positive | Thursday | 25 |
Extremely Positive | Friday | 25 |
Extremely Positive | Saturday | 13 |
Extremely Positive | Sunday | 14 |
Extremely Positive | Tuesday | 26 |
This table further shows the shares of different title sentimental types of articles for each day of the week.
<- newspopData %>% mutate(content_Sentiment_Class =
newspopData if_else( global_sentiment_polarity > 0.5, pols[1],
if_else( global_sentiment_polarity > 0, pols[2],
if_else( global_sentiment_polarity > -0.5, pols[3], pols[4]))))
$content_Sentiment_Class <- as_factor(newspopData$content_Sentiment_Class)
newspopData
# average article length
::kable(newspopData %>%
knitrgroup_by(content_Sentiment_Class, day)%>%
select(day, content_Sentiment_Class, n_tokens_content) %>%
summarise(Average_Content_Length = round(sum(n_tokens_content)/n())))
content_Sentiment_Class | day | Average_Content_Length |
---|---|---|
Positive | Monday | 548 |
Positive | Wednesday | 554 |
Positive | Thursday | 644 |
Positive | Friday | 548 |
Positive | Saturday | 803 |
Positive | Sunday | 703 |
Positive | Tuesday | 698 |
Negative | Monday | 241 |
Negative | Wednesday | 322 |
Negative | Thursday | 175 |
Negative | Friday | 304 |
Negative | Saturday | 271 |
Negative | Sunday | 203 |
Negative | Tuesday | 489 |
Extremely Positive | Wednesday | 34 |
Extremely Positive | Friday | 158 |
Extremely Positive | Sunday | 181 |
Extremely Positive | Tuesday | 211 |
We further want to extend our analysis based on the average number of tokens on an article published during the weekdays and weekend.
ggplot(newspopData, aes(x=title_Sentiment_Class, y=shares/1000000)) +
geom_bar(aes(fill = day), stat="identity", position="dodge") +
labs(subtitle = 'Shares based on title sentiment and day of the week',
x = 'Title Sentiment Category',
y = 'Shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
fill = 'Days') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1)) +
geom_text(x = 3, y = 0.6, size = 3,
label = paste("Title Sentiment Score range", "Extremely Positive = 0.5 <--> 1",
'Positive = 0 <--> 0.5 ',
'Negative = -0.5 <--> 0',
'Extremely Negative = -1 <--> -0.5',
sep = '\n' ), color = 'black')
The above graph represents the shares of articles (in millions) based on title sentiment category for each day of the week. Some of the trends that can be thought of here is that entertainment and lifestyle articles can have a shares of both extremes ends of classifications, mostly in the starting of the week after any events that might have happened, or any other controversy that their target celebrity would be involved in.
For tech, a high number of positive ends shares can be expected, while social media, business, world can have a mix of all the types of sentimental categorical shares, depending on the latest news.
ggplot(newspopData, aes(x= title_sentiment_polarity)) +
geom_density(aes(fill=title_Sentiment_Class), alpha = 0.7) +
facet_grid(. ~day)+
scale_fill_discrete(name = "Title Sentiment") +
labs(subtitle = 'Distribution of articles on title sentiment and day of the week',
x = 'Title Sentiment Polarity',
y = 'Density',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset") +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
The plot above shows the density of distribution of the shares of each type of title sentimental category for each day of the week, in other words, how much density is covered by each type of sentimental titled article for each day. Again, the trend would be the same for each type of the data, depending on various situations that drive these articles, and its publications.
Firstly, we are checking for the skewness in the data set for the response variable that is “shares”. For that we are plotting the distribution for the data of that variable below:
ggplot(newspopData) + geom_density(aes(x = shares))+
labs(title = "Shares Predictor Distribution in Dataset")
From the plot above showing distribution of shares, we can see that the distribution is left-skewed. To mitigate this, we are taking a log transform of that variable.
ggplot(newspopData) + geom_density(aes(x = log(shares)))+
labs(title = "Logarithmic transformation of shares distribution")
As seen from the plot above, after doing the log transformation, shares distribution can be estimated as a normal distribution, which will be better for the models that we would be fitting for this project. Hence, for all the models that are built in this project, the target response variable is the log transform (“logShares”) of the original shares variable.
set.seed(42)
# Removing mutated variables from the data set
<- newspopData %>% select(-day, -title_Sentiment_Class, -content_Sentiment_Class, -article_popularity_index)
lmData
$logShares <- log(lmData$shares)
lmData
# train test split
<- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)
index
# Removing the shares variable because we are using logShares as response
<- lmData[index, ] %>% select(-shares)
trainDf <- lmData[-index, ] %>% select(-shares) testDf
For variable selection we used p-test values to consider significant
coefficients from the linear model fit. For a confidence interval of
95%, the significant predictors are used to fit a Linear and a Lasso
regression model. For the significant parameter selection, our NULL and
Alternate hypothesis is explained below:
H~0: All coefficients have zero slope
H~a: At-least one predictor variable has a non-zero slope
<- lm(logShares~. ,data = trainDf)
modelLmVar
# getting p-values
<- data.frame(summary(modelLmVar)$coefficients[,4])
pVal $row_names <- row.names(pVal)
pVal
# checking if slope of intercept is significant (i.e. <<< 0.05)
<- pVal %>% filter(pVal[,1] < 0.05)
pVal = c()
cols
# remove intercept term and store all significant intercepts in "cols" variable
if (pVal[,2][1] == "(Intercept)"){
= pVal[,2][-1]
cols else{
}= pVal[,2]
cols
} cols
## [1] "timedelta"
## [2] "n_non_stop_words"
## [3] "n_non_stop_unique_tokens"
## [4] "num_hrefs"
## [5] "num_videos"
## [6] "num_keywords"
## [7] "kw_min_max"
## [8] "kw_max_avg"
## [9] "kw_avg_avg"
## [10] "weekday_is_tuesday"
## [11] "weekday_is_wednesday"
## [12] "weekday_is_thursday"
## [13] "weekday_is_saturday"
## [14] "LDA_00"
## [15] "LDA_01"
## [16] "global_subjectivity"
## [17] "min_positive_polarity"
## [18] "max_positive_polarity"
Linear regression is a regression model that uses a straight line to describe a relationship between two or more predictor and a response variable. Out of the 54 predictor variables in the data set, the predictors that were deemed significant from the step above are used in this model, with a response variable of logShares.
<- trainControl(method = 'repeatedcv', number = 10)
trControl
# model training
<- train(logShares ~ .,
modelLm data = trainDf %>%
select(all_of(cols), logShares),
method = "lm",
preProcess = c('center', 'scale'),
trControl = trControl)
= toupper(str_replace_all(params$data_channel, "_", " "))
title
# variable importance for linear model
plot(varImp(modelLm),
main = paste("Variable importance Plot for ", "\n", title))
<- predict(modelLm, newdata = testDf %>% select(all_of(cols), logShares))
lmPredict <- postResample(lmPredict, obs = testDf$logShares)
lmRes
# Adjusted R-square
<- dim(testDf)[1]
n <- dim(testDf %>% select(all_of(cols)))[2]
p <- 1-(((1-lmRes[2]^2)*(n-1))/(n-p-1))
adjR2Linear
lmRes
## RMSE Rsquared MAE
## 0.7482074 0.1393485 0.5600929
LASSO or the Least Absolute Shrinkage and Selection Operator is a special type of linear model that uses penalty based shrinkage. Shrinkage here means converging data values to a central point, or where the absolute magnitude of the coefficients are shrunk to a 0 based on L1 regularization penalty method which adds a penalty factor(lambda) for every misclassified or wrongly predicted values.
LASSO generally works well with models having less variable parameters. From the above process, we have narrowed down a few most influencing and statistically significant variables which we will be using for this model as well.
The caret package is used to train the model here using the lasso method. The prediction metrics are then derived using the postResample function. The same predictor and response variable from the model above are used here as well.
set.seed(43)
# train
<- trainControl(method = "repeatedcv",number = 5)
fitControl <- train(logShares ~ . ,
fitLASSO data=trainDf,
method = "lasso",
trControl = fitControl,
preProcess = c("center", "scale")
)
<- predict(fitLASSO, newdata = testDf)
predLASSO
<- predict(fitLASSO$finalModel, type = "coef", mode = "fraction",
coeff s = fitLASSO$bestTune$fraction, testDf)
<- postResample(predLASSO, obs = testDf$logShares)
lassoRes
# Adjusted R-square
<- dim(testDf)[1]
n <- dim(testDf)[2]
p <- 1-(((1-lassoRes[2]^2)*(n-1))/(n-p-1))
adjR2Lasso
lassoRes
## RMSE Rsquared MAE
## 1.234437e+04 1.424295e-03 4.687996e+02
Before building the random forest model, seeing at the data, the data set is high dimensional, hence, it would be easier to select the “best” subsets of variables for random forest ensemble tree learning method to make the process more efficient. More details about Random Forest is given before the model is formulated.
Since the data set includes a large number of variables, it becomes important to select the best subsets of the data that defines the data set and variations in the data set for number of shares. To get these variables, here we are conducting Principal Component Analysis (PCA). PCA standardizes the variables, and represent new variables/PCs that are a function of the original variables from the data set.
Again, to keep the models separate, we are again splitting the data into train and test data set, and removing the categorical variables that were added previously in the data. After that the PCs are formulated.
set.seed(200)
$logShares <- log(newspopData$shares)
newspopData
<- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)
trainIndex
<- newspopData[trainIndex, ]
trainData <- newspopData[-trainIndex, ] testData
Creating new PCs, and plotting the cumulative variance graph:
# Removing shares, logShares and mutated variables to form PCs
<- prcomp(trainData %>%
PCs select(-shares, -logShares, -day,
-title_Sentiment_Class,
-content_Sentiment_Class,
-article_popularity_index),
center=TRUE, scale=TRUE)
# cumulative variance
= cumsum(PCs$sdev**2/sum(PCs$sdev**2))
var
show(plot(var, xlab = "Principal Component",
ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b'))
## NULL
<- min(which(var > 0.80))
pt
points(pt, var[pt], col="red", cex=2, pch=20)
text(8, 0.8, col="red", paste0(round(100*var[pt],2), "% variance \nis explained by ", pt, " \nnumber of PC(s)"))
From the above graph, we can see that at least 80% variance is explained by 20 PCs, and hence for the subsequent steps, 20 PCs will be used.
Now, we are creating the training data for Random Forest Ensemble model.
# Creating training data PCs without the logShares variables
<- predict(PCs, newdata = trainData %>% select(-shares, -logShares)) %>%
treeTrainData as_tibble() %>%
select(PC1:paste0("PC", pt))
$logShares <- trainData$logShares
treeTrainData
# This the final training data that will be used to train random forest model
treeTrainData
## # A tibble: 1,628 × 21
## PC1 PC2 PC3 PC4 PC5 PC6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.41 1.36 2.46 1.04 2.26 -0.895
## 2 -1.67 -0.402 2.67 1.05 1.98 -0.328
## 3 -0.376 -1.42 -0.589 3.84 5.73 -2.09
## 4 1.74 -1.19 5.13 -2.32 1.52 -1.33
## 5 -2.50 0.189 3.06 -0.497 2.49 0.439
## 6 -1.65 -2.05 1.78 -0.347 3.44 0.458
## 7 -0.810 -1.29 2.15 0.662 3.78 -1.18
## 8 1.46 -1.10 3.96 -1.39 2.64 -0.0528
## 9 -1.49 -1.05 2.36 -1.09 2.78 -0.163
## 10 -3.55 -0.375 1.09 0.277 3.32 -1.94
## # … with 1,618 more rows, and 15 more
## # variables: PC7 <dbl>, PC8 <dbl>,
## # PC9 <dbl>, PC10 <dbl>, PC11 <dbl>,
## # PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
## # PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
## # PC18 <dbl>, PC19 <dbl>, PC20 <dbl>,
## # logShares <dbl>
Random Forest model is a tree based ensemble learning method where the training data set is sampled into smaller data sets, and some variables from the data set are included in the smaller sampled data. A tree is fitted to that data. This method is repeated several times, and finally a model is finalized that will be an average of all these sampled tree models. Since Random Forest takes only a subset of variables in any given sampling, it creates really strong model to predict on with unknown variables since the sampled data is able to explain variance much better with fewer variables.
We are using train() function from caret package to train the Random Forest model using the rf method. Pre-processing is done by centering and scaling the data, and a 3 fold cross validation is also used. (The value 3 is selected to have some computation ease). Lastly, the model is tried for a number of values for the tuning parameter “mtry”, out of which the best mtry value is selected.
Training the model:
set.seed(400)
<- trainControl(method = "repeatedcv", number = 3)
trctrl <- train(logShares ~., data = treeTrainData,
randomForest method = "rf",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(mtry = 1:as.integer(pt/3)))
randomForest
## Random Forest
##
## 1628 samples
## 20 predictor
##
## Pre-processing: centered (20), scaled (20)
## Resampling: Cross-Validated (3 fold, repeated 1 times)
## Summary of sample sizes: 1085, 1086, 1085
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.7971803 0.1532479 0.5880551
## 2 0.7921464 0.1530724 0.5823241
## 3 0.7932806 0.1478411 0.5836838
## 4 0.7930320 0.1470004 0.5825096
## 5 0.7919961 0.1499486 0.5801403
## 6 0.7937275 0.1462340 0.5817156
##
## RMSE was used to select the optimal
## model using the smallest value.
## The final value used for the model was mtry
## = 5.
Testing the model on the test data set to see the metrics of evaluation. The predict() function is used to create both PC based testing data, and to predict the logShares using the Random Forest model. Further, the postResample() function is used to extract metrics such as RMSE, R-Squared, and MAE, represented below.
# Creating test data PCs
<- predict(PCs, newdata = testData %>% select(-shares, -logShares)) %>%
treeTestData as_tibble() %>%
select(PC1:paste0("PC", pt))
# Completing the test data
$logShares <- testData$logShares
treeTestData
<- predict(randomForest, newdata = treeTestData)
rfPredict <- postResample(rfPredict, obs = treeTestData$logShares)
randomFRes
# Adjusted R-square
<- dim(treeTestData)[1]
n <- dim(treeTestData)[2] - 1
p <- 1-(((1-randomFRes[2]^2)*(n-1))/(n-p-1))
adjR2Rf
randomFRes
## RMSE Rsquared MAE
## 0.7510496 0.1193722 0.5849793
Boosting is another tree based approach for improving the predictions resulting from a decision tree. Like other tree models, boosting can be applied for classification and regression. Boosting works in sequentially: each tree is grown using information from previously grown trees. In this project, we are using gbm method within caret train function.
An additional input tunegrid also provided to train function, which will find best hyper parameter combinations of number of trees, and depth.
Here, all the available parameters are used as predictor variables, and logShares is the target response variable.
# Training the model
<- train(logShares ~., data = trainData %>% select(-shares), method = "gbm",
boostedTree trControl = trainControl(method = "repeatedcv", number = 3),
preProcess = c("center", "scale"),
tuneGrid = expand.grid(n.trees = c(25, 50, 100, 150, 200),
interaction.depth = 1:4,
shrinkage = 0.1,
n.minobsinnode = 10),
verbose = FALSE)
# Best combination of tuning hyperparameters
print(boostedTree$bestTune)
## n.trees interaction.depth shrinkage
## 12 50 3 0.1
## n.minobsinnode
## 12 10
<- predict(boostedTree, newdata = testData %>% select(-shares))
boostedTPredict
<- postResample(boostedTPredict, obs = (testData %>% select(-shares))$logShares)
boostedTRes
# plot
plot(boostedTree, xlab = "Max Tree Depth ",
ylab = "Root Mean Square Error (Found through CV)",
main = "All Predictors - Minimizing RMSE")
# Adjusted R-square
<- dim(testData)[1]
n <- dim(testData)[2] - 1
p <- 1-(((1-boostedTRes[2]^2)*(n-1))/(n-p-1))
adjR2Boosted
boostedTRes
## RMSE Rsquared MAE
## 0.3270764 0.8336740 0.2235659
From the plot we can visualize the best combinations of parameters that were used to find minimum RMSE value using Boosted tree algorithm, which can also be seen by using (boostedTree$bestTune) function.
= c("Linear", "Ensemble")
model_types
= tibble(model = c("Linear Model"), model_type = c(model_types[1]), RMSE = c(round(lmRes[[1]],3)), R_Squared = c(round(lmRes[[2]], 4)), Adj_R2 = c(round(adjR2Linear,4)) )
modelComparisonDf
<- rbind(modelComparisonDf, c("LASSO", model_types[1], round(lassoRes[[1]],3), round(lassoRes[[2]],4), round(adjR2Lasso,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Random Forest", model_types[2], round(randomFRes[[1]],3), round(randomFRes[[2]],4), round(adjR2Rf,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Boosted Trees", model_types[2], round(boostedTRes[[1]],3), round(boostedTRes[[2]],4), round(adjR2Boosted,4)))
modelComparisonDf
::kable(modelComparisonDf) knitr
model | model_type | RMSE | R_Squared | Adj_R2 |
---|---|---|---|---|
Linear Model | Linear | 0.748 | 0.1393 | -0.0067 |
LASSO | Linear | 12344.371 | 0.0014 | -0.0844 |
Random Forest | Ensemble | 0.751 | 0.1194 | -0.015 |
Boosted Trees | Ensemble | 0.327 | 0.8337 | 0.6672 |
= ""
bestLinearModel = ""
bestLinearRMSE = ""
bestLinearR2
= ""
bestEnsembleModel = ""
bestEnsembleRMSE = ""
bestEnsembleR2
= ""
bestOverallModel = ""
bestOverallRMSE = ""
bestOverallR2
<- modelComparisonDf %>% filter(model_type == "Linear")
lineardf
<- modelComparisonDf %>% filter(model_type == "Ensemble")
ensembleDf
if (length(unique(lineardf$R_Squared)) > 1){
<- lineardf[which.min(lineardf$RMSE),][[1]]
bestLinearModel <- lineardf[which.min(lineardf$RMSE),][[3]]
bestLinearRMSE <- lineardf[which.min(lineardf$RMSE),][[4]]
bestLinearR2 else{
} <- lineardf[which.max(lineardf$Adj_R2),][[1]]
bestLinearModel <- lineardf[which.max(lineardf$Adj_R2),][[3]]
bestLinearRMSE <- lineardf[which.max(lineardf$Adj_R2),][[4]]
bestLinearR2
}
if (length(unique(ensembleDf$R_Squared)) > 1){
<- ensembleDf[which.min(ensembleDf$RMSE),][[1]]
bestEnsembleModel <- ensembleDf[which.min(ensembleDf$RMSE),][[3]]
bestEnsembleRMSE <- ensembleDf[which.min(ensembleDf$RMSE),][[4]]
bestEnsembleR2 else{
} <- ensembleDf[which.max(ensembleDf$Adj_R2),][[1]]
bestEnsembleModel <- ensembleDf[which.max(ensembleDf$Adj_R2),][[3]]
bestEnsembleRMSE <- ensembleDf[which.max(ensembleDf$Adj_R2),][[4]]
bestEnsembleR2
}
if (length(unique(modelComparisonDf$R_Squared)) > 1){
<- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[1]]
bestOverallModel <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[3]]
bestOverallRMSE <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[4]]
bestOverallR2 else{
} <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[1]]
bestOverallModel <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[3]]
bestOverallRMSE <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[4]]
bestOverallR2 }
Selection of best model is based on comparing the lowest prediction RMSE value across all models.
For the data set for SOCMED:
The best linear model is Linear Model with RMSE value of 0.748, and R Squared value of 0.1393.
The best ensemble method is Boosted Trees with RMSE value of 0.327, and R Squared value of 0.8337.
The best overall model is Boosted Trees with RMSE value of 0.327, and R Squared value of 0.8337.