News Sentiment Analysis (Part I)

There are a number of studies in the finance literature that point to the media as a predictive indicator of stock market movement. While earlier sentiment studies such as Tetlock (2007), “Giving Content to Investor Sentiment: The Role of Media in the Stock Market”, focus on the U.S. market, a more recent paper by Fraiberger et al (2018)[paper] suggests that the media effect may be prevalent in other markets as well.

Sentiment analysis is one of the most studied topics in machine learning, with an increasing number of open datasets available for training. The two papers, Tetlock (2007) and Fraiberger et al (2018), make use of a word-count-based sentiment index. Namely, an article is labelled positive/negative depending on whether it has more positive or more negative words. Although the classification algorithm sounds intuitive, the model isn’t trained against labelled data, so its accuracy cannot be tested directly. In this post, I will look at how closely the word-count-based algorithm performs compared to an ML model trained on a labelled yet not directly related dataset, and whether sentiment classification by either of the two algorithms has predictive power on the Korean market.
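To make the word-count idea concrete, here is a minimal sketch of such a classifier. The two tiny word lists and the function name `word_count_sentiment` are made up for illustration (the actual analysis uses the Loughran-McDonald lists), and the score is the same normalized difference, (positive - negative) / total words, that is computed per article later in this post.

```python
import re

# Tiny illustrative word lists (hypothetical; the post itself uses the
# Loughran-McDonald sentiment word lists instead)
POSITIVE = {"gain", "strong", "improve", "profit"}
NEGATIVE = {"loss", "weak", "decline", "risk"}

def word_count_sentiment(text):
    """Return (score, label) for a piece of text.

    score = (#positive words - #negative words) / #words,
    and the label follows the sign of the score.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0, "neutral"
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    score = (pos - neg) / len(words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

# Two positive words ("strong", "gain") against one negative word ("risk")
example = word_count_sentiment("Exports show strong gain despite currency risk")
print(example)
```

Nothing here is fitted to data: the label depends entirely on which words happen to be in the lists, which is why the accuracy of this kind of classifier is hard to assess.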

Let’s start with the dataset. Fraiberger et al (2018) use news articles published on Reuters, obtained from Factiva. As I am not currently affiliated with a research institution, I don’t have access to Factiva. Instead, I opt for directly scraping the Reuters news articles associated with “Asian Market”. The python code used for scraping Reuters articles can be found here. The oldest articles I could access date back to April 10th, 2018.

As with Fraiberger et al (2018), I use the Loughran-McDonald Sentiment Word List, which can be found here.

The code below uses the 9,231 articles (2 misplaced articles from 2016 excluded) published between 2018-04-10 and 2020-03-31. As with Fraiberger et al (2018), I look at both the local effect (articles related to Korea) and the global effect (all articles). Due to the size of the full article set, I cannot include the raw articles here, but the completed sentiment.csv file can be found here.

#import necessary libraries
import os 
from bs4 import BeautifulSoup
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import datetime, pytz

os.chdir("local directory")

#Load word dictionary (sets make the membership checks below fast)
pos_word_list = set(w.lower() for w in pd.read_excel("LoughranMcDonald_SentimentWordLists_2018.xlsx", sheet_name = "Positive", header = None)[0])
neg_word_list = set(w.lower() for w in pd.read_excel("LoughranMcDonald_SentimentWordLists_2018.xlsx", sheet_name = "Negative", header = None)[0])
with open("StopWords_Generic.txt", "r") as f:
    stop_words = set(w.lower() for w in f.read().split("\n"))

files = os.listdir("asian_market")

#initialize lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

#Parse article raw files
article_list =[]
for i,file in enumerate(files):
    with open(f"asian_market/{file}", "rb") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    #check whether Korea is in keywords
    keywords = soup.find('meta', attrs = {'name':'analyticsAttributes.keywords'})['content']
    korea_ind = 1 if re.search("south *korea", keywords.lower()) else 0
    
    #Extract article time from metadata
    time_raw = soup.find('meta', attrs = {'name':'analyticsAttributes.articleDate'})['content']
    article_time_utc = datetime.datetime.strptime(time_raw, "%Y-%m-%dT%H:%M:%S%z")
    article_time = article_time_utc.astimezone(pytz.timezone("Asia/Seoul"))
    
    #Extract article text
    main_article = soup.find("div", attrs = {'class':"StandardArticle_container"})
    body = main_article.find('div', attrs= {'class':'StandardArticleBody_body'})
    
    #Pre-process article text
    words = [w.lower() for w in word_tokenize(body.text)] #tokenize article into lower case words
    words = [w for w in words if not re.match(r'-*[0-9.]+', w)] #remove tokens that start with a number
    words = [w for w in words if not re.match(r'^[^a-zA-Z]+$', w)] #remove tokens without any letter
    words = [wordnet_lemmatizer.lemmatize(w) for w in words] #lemmatize the remaining words once
    words = [w for w in words if w not in stop_words] #remove stopwords
    words = [w for w in words if not re.match('^.$', w)] #remove single characters
    
    #Divide into positive/negative words
    pos_words, neg_words, neut_words = [], [], []
    for word in words:
        if word in pos_word_list:
            pos_words.append(word)
        elif word in neg_word_list:
            neg_words.append(word)
        else:
            neut_words.append(word)
    #Log article date and sentiment value for each article
    if not words: #skip articles with no usable words to avoid division by zero
        continue
    row_dt = {'time':article_time, 'korea':korea_ind,'sentiment':(len(pos_words)-len(neg_words))/len(words)}
    article_list.append(row_dt)
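
One detail in the loop above that is easy to miss is the conversion of the article timestamp to Seoul time before it is later turned into a date. A minimal sketch of why this matters is below. It assumes a fixed UTC+9 offset as a stand-in for `pytz.timezone("Asia/Seoul")` (Korea did not observe daylight saving during the sample period), and the helper name `seoul_date` and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+9 offset used as a stand-in for pytz.timezone("Asia/Seoul")
# (assumption: no daylight saving time in Korea over the sample period)
KST = timezone(timedelta(hours=9), name="KST")

def seoul_date(time_raw):
    """Parse a timestamp with a UTC offset and return the Seoul calendar date."""
    parsed = datetime.strptime(time_raw, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.astimezone(KST).strftime("%Y-%m-%d")

# Hypothetical timestamps: 22:30 UTC is already 07:30 on the next day in Seoul,
# while 10:00 UTC is 19:00 on the same day
late_utc = seoul_date("2019-05-02T22:30:00+0000")
midday_utc = seoul_date("2019-05-02T10:00:00+0000")
print(late_utc, midday_utc)
```

Without the conversion, an article published late in the UTC day would be assigned to the previous Korean calendar day, which would shift it by one day relative to KOSPI 200 trading dates.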

Now that a sentiment value has been assigned to each article, I can aggregate at the daily level to see how sentiment changes over time.

#Articles into Data Frame
df = pd.DataFrame(article_list)

#Aggregate articles into daily level
df['date'] = df['time'].apply(lambda x: x.strftime("%Y-%m-%d"))
daily_sentiment = df.groupby("date")['sentiment'].mean().reset_index(name = "global")
daily_volume = df.groupby("date")['sentiment'].count().reset_index(name = "global_volume")
korea_sentiment = df.loc[df.korea == 1].groupby("date")['sentiment'].mean().reset_index(name = "korea")

sentiment = pd.merge(daily_sentiment, korea_sentiment, how = "outer", on = "date")
sentiment = pd.merge(sentiment, daily_volume, how = "outer", on = "date")
sentiment = sentiment.fillna(value = 0) #For days without Korean articles, set sentiment as 0

sentiment = sentiment.loc[sentiment.date>='2018-01-01'] #remove the 2 misplaced articles from 2016 (outliers)

sentiment.to_csv("sentiment.csv", index = False) #Save sentiment data as csv

Below are the overall dynamics of global and local sentiment. Global and local sentiment appear to be roughly 33% correlated.

import numpy as np #needed for np.corrcoef below
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=sentiment.date,y=sentiment['global'],name="global"))
fig.add_trace(go.Scatter(x=sentiment.date,y=sentiment.korea,name="korea"))
fig.update_layout(
    title="Global vs Local Sentiment",
)
fig.show()
print("Sentiment Correlation",round(np.corrcoef(sentiment.korea, sentiment['global'])[0][1],5))

For the predictive analysis, I will use R. The KOSPI 200 data I use can be downloaded from here

library(data.table)
library(lmtest) #Granger Causality
library(sandwich) #Newey West

#Import Sentiment and KOSPI200 data
sentiment = fread("sentiment.csv")
kospi200 = fread("kospi200.csv")

#Compute KOSPI200 gross return (today's close divided by the previous close)
kospi200[,ret:=cls_prc/shift(cls_prc,1)]

#Merge the return and sentiment data (weekend sentiments ignored)
dt = merge(sentiment, kospi200, by = "date")

Since I am looking at the predictive power of global/local sentiment on the KOSPI200 return, a natural test to start with is the Granger causality test, which tests whether lagged values of an explanatory variable (global/local sentiment) add predictive power beyond lagged values of the dependent variable (KOSPI200 return). Below is the result of the Granger causality test for global sentiment. I used 3 daily lags, but the result doesn’t materially change when different lag lengths are used. As can be seen, global sentiment doesn’t appear to have any predictive power on the KOSPI 200 return (p = 0.45).

#Granger causality of global sentiment on KOSPI200 return
grangertest(dt$global,dt$ret, order = 3)

## Granger causality test
## 
## Model 1: dt$ret ~ Lags(dt$ret, 1:3) + Lags(dt$global, 1:3)
## Model 2: dt$ret ~ Lags(dt$ret, 1:3)
##   Res.Df Df      F Pr(>F)
## 1    475                 
## 2    478 -3 0.8823 0.4501

The result is different for local sentiment. Sentiment in Korea-related articles does appear to have predictive power (p = 0.035).

#Granger causality of local (Korea) sentiment on KOSPI200 return
grangertest(dt$korea,dt$ret, order = 3)

## Granger causality test
## 
## Model 1: dt$ret ~ Lags(dt$ret, 1:3) + Lags(dt$korea, 1:3)
## Model 2: dt$ret ~ Lags(dt$ret, 1:3)
##   Res.Df Df      F  Pr(>F)  
## 1    475                    
## 2    478 -3 2.8995 0.03465 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
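
For readers who want to see what the test computes, below is a rough numpy sketch of the F statistic behind a Granger causality test: fit the return on its own lags (restricted model), fit it again with the lags of the other series added (unrestricted model), and compare the residual sums of squares. The function names and the simulated data are my own for illustration; this is not the `lmtest` implementation, and the p-value (a comparison against an F distribution with `order` and `df_resid` degrees of freedom) is left out.

```python
import numpy as np

def lag_matrix(series, order):
    """Columns are the series lagged by 1..order, aligned with series[order:]."""
    n = len(series)
    return np.column_stack([series[order - l : n - l] for l in range(1, order + 1)])

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f(x, y, order=3):
    """F statistic for the null that lags of x do not help predict y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    target = y[order:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lag_matrix(y, order)])       # own lags only
    unrestricted = np.hstack([restricted, lag_matrix(x, order)])  # plus lags of x
    rss_r, rss_u = rss(target, restricted), rss(target, unrestricted)
    df_resid = len(target) - unrestricted.shape[1]
    f_stat = ((rss_r - rss_u) / order) / (rss_u / df_resid)
    return f_stat, df_resid

# Simulated data (not the KOSPI200 series): y depends on yesterday's x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(size=500)
y = np.empty(500)
y[0] = noise[0]
y[1:] = 0.5 * x[:-1] + noise[1:]

f_stat, df_resid = granger_f(x, y, order=3)
f_reverse, _ = granger_f(y, x, order=3)
print(f_stat, df_resid, f_reverse)
```

The residual degrees of freedom follow the same logic as the `Res.Df` column above: usable observations minus the number of estimated coefficients (one intercept plus `2 * order` lags).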

To look at the combined effect, I can estimate a simple OLS regression. As stock returns tend to be serially correlated, I will report Newey-West standard errors and use them to compute the t values.

$$R_t = \beta_0 + \sum_{l=1}^L\beta_{1l}\text{G.Sentiment}_{t-l} + \sum_{l=1}^L\beta_{2l}\text{L.Sentiment}_{t-l} +\sum_{l=1}^L\beta_{3l}R_{t-l}+ \varepsilon_{t}$$

#Compute lagged variables
for (l in 1:3){
  dt[,paste("l",l,"global",sep = ""):=shift(global,l)]
  dt[,paste("l",l,"korea",sep = ""):=shift(korea, l)]
  dt[,paste("l",l,"ret",sep = ""):=shift(ret, l)]
}

fit = lm(ret~l1global+l2global+l3global+l1korea+l2korea+l3korea+l1ret+l2ret+l3ret, data = dt)
coeftest(fit, vcov = NeweyWest(fit, verbose =T))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  0.671456   0.138135  4.8609 1.594e-06 ***
## l1global     0.045587   0.057269  0.7960  0.426425    
## l2global    -0.055105   0.078117 -0.7054  0.480896    
## l3global     0.063124   0.057766  1.0928  0.275055    
## l1korea     -0.092066   0.036934 -2.4928  0.013017 *  
## l2korea     -0.029937   0.032581 -0.9188  0.358650    
## l3korea     -0.022215   0.037345 -0.5949  0.552226    
## l1ret       -0.022999   0.112887 -0.2037  0.838648    
## l2ret        0.237033   0.081921  2.8934  0.003987 ** 
## l3ret        0.113193   0.059430  1.9047  0.057433 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
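
As a rough illustration of what a Newey-West correction does, here is a numpy sketch of OLS with a heteroskedasticity and autocorrelation consistent covariance using a Bartlett kernel and a fixed number of lags. This is a simplified version under my own assumptions: the function name and simulated data are made up, and as far as I know `sandwich::NeweyWest` defaults to automatic lag selection and prewhitening, so this sketch would not reproduce the table above exactly.

```python
import numpy as np

def ols_newey_west(X, y, lags):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    scores = X * resid[:, None]          # x_t * e_t for every observation
    meat = scores.T @ scores             # lag-0 term (White estimator)
    for l in range(1, lags + 1):
        weight = 1.0 - l / (lags + 1.0)  # Bartlett weights decline linearly
        gamma = scores[l:].T @ scores[:-l]
        meat += weight * (gamma + gamma.T)
    bread = np.linalg.inv(X.T @ X)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Simulated data (not the KOSPI200 series): errors follow an AR(1) process
rng = np.random.default_rng(1)
n = 400
regressor = rng.normal(size=n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.6 * errors[t - 1] + rng.normal()
y = 1.0 + 0.5 * regressor + errors
X = np.column_stack([np.ones(n), regressor])

beta, se_hac = ols_newey_west(X, y, lags=4)
_, se_white = ols_newey_west(X, y, lags=0)  # lags=0 ignores autocorrelation
print(beta, se_hac, se_white)
```

The coefficients are the same as plain OLS; only the standard errors change. With positively autocorrelated errors, ignoring the lagged terms tends to understate the uncertainty of the intercept, which is the reason for using the correction in the regression above.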

Consistent with Fraiberger et al (2018), local news sentiment does appear to have contrarian predictive power on the stock market. The first lag of Korea-related sentiment has a negative coefficient that is statistically significant at the 5% level, so an increase in sentiment on Korea-related news is followed by a lower return on the next trading day. This is after controlling for global sentiment (which doesn’t have predictive power) and past returns. In the next post, I will test how the classification based on a trained model performs.

Tags: python  NLP  keras  webscraping 
