News Sentiment Analysis (Part 2)

In the previous post, I replicated Fraiberger et al. (2018) for the Korean market and showed that Korea-related Reuters news has predictive power for the next day’s KOSPI 200 index return when sentiment is measured with a word frequency method. In this post, I will use deep learning based supervised learning to classify the same Reuters headlines into positive and negative news. The benefit of the supervised learning approach is that, assuming a relevant training set is available, more sophisticated relationships among words can be modelled to arrive at the final sentiment. Unfortunately, no publicly available training set exists for financial news yet, so I use a training set from a different domain, which is likely to lower out-of-sample accuracy.

For this analysis, I will use the large movie reviews dataset, which can be found here. The dataset contains 50,000 movie reviews from IMDB, labelled based on the star ratings that reviewers gave along with their reviews. The order of analysis will be 1) preprocess the movie reviews and divide them into train / test sets, 2) train a model on the train set, 3) confirm that the accuracy holds on the test set, 4) deploy the model on the Reuters headlines, 5) inspect a sample of classified Reuters headlines.

For the first step, I will preprocess the data and then divide it into train / test sets. Let’s first load the data and inspect a few rows.

import pandas as pd

movie_reviews = pd.read_csv("IMDB Dataset.csv")
movie_reviews.head(10)
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
5 Probably my all-time favorite movie, a story o... positive
6 I sure would like to see a resurrection of a u... positive
7 This show was an amazing, fresh & innovative i... negative
8 Encouraged by the positive comments about this... negative
9 If you like original gut wrenching laughter yo... positive

As can be seen above, the reviews contain HTML tags and punctuation that are irrelevant to sentiment analysis. For pre-processing, I will remove the HTML tags, remove numbers and punctuation, and remove single-letter words and repeated spaces. Upper/lower case will be taken care of at the tokenizing step.

import re

def preprocess_text(sen):
    # Remove html tags
    sentence = re.sub(r'<[^>]+>','', sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # remove one letter words
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # remove multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

#Apply the preprocessing to every review
X = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    X.append(preprocess_text(sen))

#Change label into 1 - Positive, 0 - Negative
y = movie_reviews['sentiment']
y = list(map(lambda x: 1 if x=="positive" else 0, y))
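To see what the four regex steps do in practice, here is a small worked example. The function body repeats the steps of preprocess_text so the snippet runs on its own, and the sample text is made up for illustration.

```python
import re

def preprocess_text(sen):
    # Same four regex steps as the function above
    sentence = re.sub(r'<[^>]+>', '', sen)               # drop HTML tags
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)        # non-letters become spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # drop isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse repeated whitespace
    return sentence

# Hypothetical review text, made up for illustration
sample = 'A wonderful little production. <br /><br />I gave it 10/10!'
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

Two details are worth noticing in the result: a single letter at the very start of the text survives because the pattern requires whitespace before it, and a trailing space can remain. Neither matters much here, since the tokenizer splits on whitespace.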

Now that the reviews have been pre-processed, I will divide them into a training set (80%) and a testing set (20%).

from sklearn.model_selection import train_test_split

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

The machine learning model used in this analysis is a deep learning model, that is, a multi-layered neural network. Neural networks take numeric vectors as input, so a series of steps is needed to convert the movie reviews into numbers. The first step is to tokenize the reviews. I will use the 5,000 most frequent words in the training set as the dictionary.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#Initialize tokenizer and set word threshold to 5,000
tokenizer = Tokenizer(num_words=5000)

#Fit the tokenizer on the train set, then convert train reviews to integer sequences
tokenizer.fit_on_texts(X_train_raw)
X_train = tokenizer.texts_to_sequences(X_train_raw)

#Vocab size will be used for defining embedding matrix size
vocab_size = len(tokenizer.word_index) + 1

#Last 200 words will be used in sentiment analysis
maxlen = 200
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)

#Preview of X_train
array([[ 217,    9,    1, ...,  198,  345, 3812],
       [  15,   46,   59, ...,   82,   99,    6],
       [ 128, 1307,  108, ...,    0,    0,    0],
       [ 778,    8,   24, ...,    0,    0,    0],
       [   8,  347,   10, ...,    0,    0,    0],
       [   8,    5,  525, ...,    0,    0,    0]])

After the tokenizing step, each review is converted into an integer vector of length 200. For reviews with more than 200 words, the last 200 words are kept, and for reviews with fewer than 200 words, 0s are padded at the end. While the tokenized reviews are now numeric, each integer is only a word index and carries no information about the word’s meaning. For the next step, I will use an embedding matrix.
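The tokenizing and padding steps can be illustrated with a simplified pure-Python sketch. This is not the Keras implementation (which also strips punctuation and differs in details); the helper names and the toy reviews are my own, chosen only to show the idea of frequency-ranked indices, dropping out-of-dictionary words, keeping the last maxlen tokens and padding with zeros at the end.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: i + 1 for i, word in enumerate(ranked)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index and drop words outside the dictionary."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post_truncate_pre(seq, maxlen):
    """Keep the last `maxlen` tokens; if shorter, append zeros at the end."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

# Toy reviews, made up for illustration
texts = ["good movie good good movie", "bad movie", "good plot"]
word_index = build_word_index(texts, num_words=2)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_post_truncate_pre(s, maxlen=3) for s in sequences]
print(word_index)
print(padded)
```

With a dictionary of two words and maxlen of 3, the first review is cut down to its last three tokens, while the other two are padded with zeros on the right.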

Embeddings in the NLP domain refer to low-dimensional vector representations of words. I will use 100-dimensional vectors, meaning each word will be transformed into a vector of 100 real numbers. Hence, the resulting transformation converts a review to a matrix of numbers with dimension $200\times 100$.

For mapping each word to an embedding vector, I will use Global Vectors for Word Representation (GloVe). The version I use can be found here. GloVe provides embedding vectors trained on Wikipedia and Gigaword for a large number of English words.

import numpy as np

#Read glove file into embeddings dictionary
embeddings_dictionary = dict()
with open('glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        #Each line holds a word followed by its vector components
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

#Transform embeddings dictionary into a matrix with one row per word in the tokenizer index
#(words that are missing from GloVe keep an all-zero row)
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
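The lookup logic above can be checked on toy data. The sketch below uses made-up 3-dimensional vectors in the same text format as the GloVe file and plain Python lists instead of NumPy arrays; the words, values and the toy word index are all hypothetical.

```python
import io

# Hypothetical 3-dimensional vectors in the GloVe text format: "word v1 v2 v3"
fake_glove = io.StringIO(
    "good 0.5 -0.1 0.3\n"
    "movie 0.2 0.4 -0.6\n"
)

embeddings = {}
for line in fake_glove:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

dim = 3
word_index = {"good": 1, "movie": 2, "zzzunknown": 3}  # toy tokenizer index
vocab_size = len(word_index) + 1                       # +1 for the padding index 0

# Row i holds the vector of the word with index i; unknown words and padding stay zero
embedding_matrix = [[0.0] * dim for _ in range(vocab_size)]
for word, index in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

# An integer sequence of length 4 becomes a 4 x 3 matrix
sequence = [1, 2, 3, 0]
embedded = [embedding_matrix[i] for i in sequence]
print(embedded)
```

In the post the same lookup turns a padded review of length 200 into a $200\times 100$ matrix. Note that the vector components can be negative, and that padding and words unknown to GloVe both map to the zero vector.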

Now I am ready to design the neural network structure. The network used in this post is a sequential model of Long Short-Term Memory (LSTM) and convolution layers. LSTM layers are used in deep learning when history matters in determining the significance of an input, whereas convolution layers are used when relative position matters. The exact specification I use can be seen in the figure below.

For training, I run 10 epochs and keep 20% of the training set as a validation set.
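As far as I understand the Keras documentation, the validation split holds out the last fraction of the samples passed to fit, before any shuffling. The sketch below mimics that behaviour on a toy list; the helper name is my own and this is only an illustration, not the library code.

```python
def split_for_validation(samples, validation_split):
    """Hold out the final fraction of the samples, without shuffling."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))
train_part, val_part = split_for_validation(samples, validation_split=0.2)
print(train_part, val_part)
```

With the 80% training portion of the 50,000 reviews, this leaves 32,000 reviews for fitting and 8,000 for validation.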

from keras.models import Sequential
from keras.layers import LSTM, Conv1D, GlobalMaxPooling1D
from keras.layers.core import Dense
from keras.layers.embeddings import Embedding

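#Build Sequential Model
model = Sequential()
#Embedding weights are taken as given
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen , trainable=False)
model.add(embedding_layer)
#Add LSTM Layer (128 units is an assumed size; return_sequences=True so Conv1D receives a sequence)
model.add(LSTM(128, return_sequences=True))
#Add Conv1D Layer (Convolution Layer)
model.add(Conv1D(128, 5, activation='relu'))
#Pool over the sequence so the output layer receives one vector per review
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

#Train for 10 epochs, holding out 20% of the train set for validation
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=0, validation_split=0.2)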
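To make the convolution step less of a black box, here is a minimal pure-Python sketch of what a single Conv1D filter followed by global max pooling computes on a one-channel signal. The signal and kernel values are made up; the real layer works on 100 input channels and learns 128 such filters.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding (cross-correlation)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Replace negative activations by zero."""
    return [max(0.0, v) for v in values]

signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
kernel = [1.0, 0.0, -1.0]

feature_map = relu(conv1d_valid(signal, kernel))  # one value per window position
pooled = max(feature_map)                         # global max pooling keeps the strongest response
print(feature_map, pooled)
```

A signal of length 7 and a kernel of size 3 give 5 window positions. In the same way, a sequence of length 200 and a kernel of size 5 give 196 positions per filter, and pooling reduces them to a single number per filter.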

Training can take some time, and given the randomness of stochastic gradient descent and the train/test split, the trained weights and tokenizer will be slightly different on each run. I include my trained weights and tokenizer here. For measuring accuracy, I will use the saved weights and tokenizer.

import pickle

with open("trained_model.pick", "rb") as f:
    params = pickle.load(f)

#some parameters are commented out because they are unchanged
tokenizer = params['tokenizer']
#maxlen = params['maxlen']
#embedding_matrix = params['embedding_matrix']
weights = params['weights']
#vocab_size = params['vocab_size']

#Set pretrained weights
model.set_weights(weights)

#Based on stored tokenizer, tokenize test reviews and pad it
X_test = tokenizer.texts_to_sequences(X_test_raw)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

score = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy on Testing Set", score[1])
Accuracy on Testing Set 0.8816
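For reference, the accuracy reported above is just the share of reviews whose predicted probability falls on the correct side of a cut-off. A minimal sketch, assuming the usual 0.5 cut-off and using made-up probabilities and labels:

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

probs = [0.91, 0.12, 0.55, 0.40]   # hypothetical sigmoid outputs
labels = [1, 0, 0, 1]              # hypothetical true labels
acc = accuracy(probs, labels)
print(acc)
```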

The trained model scores roughly 88% on the testing set, which is quite good for this dataset. Now it is time to deploy the trained model on the Reuters news. Here are the article bodies and titles extracted from Reuters news in the previous post.

articles = pd.read_csv("reuter_articles.csv")
articles.head(10)
body date korea title
0 * Shanghai shares add 0.4% , blue-chips up 0.7... 2020-02-14 0 China stocks end higher, post first weekly gai...
1 * Shanghai shares, blue-chips both add 2.3% * ... 2020-02-17 0 China stocks recoup virus losses as Beijing st...
2 * Shanghai shares add 1.8%; blue-chips up 2.3%... 2020-02-20 0 China stocks find footing on rate cut, policy ...
3 * Shanghai shares add 0.3%, blue-chips up 0.1%... 2020-02-21 0 Shanghai stocks seal best week in 10 months on...
4 SHANGHAI, Feb 24 (Reuters) - China’s main stoc... 2020-02-24 0 China main indexes fall as coronavirus spreads...
5 SHANGHAI, Feb 26 (Reuters) - China stocks ende... 2020-02-26 0 China stocks slide as global coronavirus fears...
6 * CSI300 +0.3%, Shanghai Composite +0.1% * Chi... 2020-02-27 0 China stocks rise on fewer virus deaths, stimu...
7 * Shanghai shares up 3.2%, largest gain since ... 2020-03-02 0 China stocks rebound as dismal data fuels stim...
8 SHANGHAI, March 4 (Reuters) - China stocks set... 2020-03-04 0 China stocks end higher as Fed rate cut bolste...
9 * CSI300 up 2.2%, hits highest since Jan. 14 *... 2020-03-05 0 China blue-chips rise to 7-week high on policy...

Let’s look at the sentiment based on the last 200 words of each news article body.

#Preprocess article body
X = []
sentences = list(articles['body'])
for sen in sentences:
    X.append(preprocess_text(sen))

#Based on stored tokenizer, tokenize article bodies and pad them
X_article = tokenizer.texts_to_sequences(X)
X_article = pad_sequences(X_article, padding='post', maxlen=maxlen)

sentiment_pred = model.predict(X_article, verbose = 0)

Rather than testing the predictive power of the resulting sentiment on KOSPI 200 returns, I will compare the sentiment classification from supervised learning with the classification from the word frequency based method.

#Sentiment from Supervised Learning
articles['sentiment_supervised'] = sentiment_pred

#Sentiment from Frequency Based
articles['sentiment_word_freq'] = pd.read_csv("freq_based_sentiment.csv")['sentiment']

print("Correlation is",np.corrcoef(articles['sentiment_supervised'],articles['sentiment_word_freq'])[0,1])
Correlation is 0.20378549809070076

As expected, the correlation between the two sentiment measures is not high, at about 0.20. Now let’s inspect the relative performance.
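The number reported by np.corrcoef is the Pearson correlation coefficient. As a reminder of what it measures, here is a pure-Python version applied to two short made-up score lists; the function name and data are my own.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

supervised = [0.9, 0.2, 0.7, 0.4]     # hypothetical probabilities
word_freq = [1.0, -0.5, 0.3, 0.1]     # hypothetical word frequency scores
r = pearson(supervised, word_freq)
print(r)
```

The coefficient is unaffected by shifting or rescaling either series, so it does not matter that one measure is centred at 0.5 and the other at 0.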

Below are the 20 articles for which the word frequency score exceeds the supervised score by the most. In the word frequency measure, 0 is assumed to be the divider between positive and negative news, whereas in the supervised learning method, 0.5 is assumed to be the divider. As expected, the word frequency method appears to perform better, since the training set wasn’t finance related.

#Normalize both sentiment measures
articles['N_sentiment_word_freq'] = articles['sentiment_word_freq']/np.std(articles['sentiment_word_freq'])

articles['N_sentiment_supervised'] = (articles['sentiment_supervised'] - 0.5)/\
                                     np.std(articles['sentiment_supervised'])

#Display 20 headlines with the highest difference
pd.options.display.max_colwidth =200
articles['sentiment_diff'] = articles['N_sentiment_word_freq'] - articles['N_sentiment_supervised']
articles.sort_values("sentiment_diff", ascending=False)[['title','N_sentiment_word_freq','N_sentiment_supervised']].head(20)
title N_sentiment_word_freq N_sentiment_supervised
8345 Singapore MAS to require investors to report short positions on SGX stocks 2.018885 -1.344420
1550 UPDATE 1-Singapore MAS to require investors to report short positions on SGX stocks 1.334613 -1.545239
139 Hong Kong shares end higher on trade deal hopes, up for 6th week 1.102909 -1.571239
2010 SE Asia Stocks-Rise on Wall Street rebound; Philippines, Singapore lead gains 1.280798 -1.382014
400 Indian shares end higher; Bharti Infratel top gainer 1.418026 -1.237163
478 China's securities regulator says has not changed rules for IPO review 1.353571 -1.241889
6945 S.Korea stocks hit near 9-month high ahead of Phase 1 trade deal 1.280798 -1.291429
1883 San Miguel says confident investors will buy stake in food unit 0.934229 -1.613568
2797 SE Asia Stocks-Rise on deadline extension hopes; Philippines leads gains 1.240773 -1.276552
5551 JGBs dip as equity gains dim demand for safe-haven debt 1.063520 -1.398065
235 Indian shares end higher; lenders gain on Yes Bank profit beat 2.010366 -0.442756
438 China studying reform to outbound QDII investment scheme 0.696574 -1.654075
975 U.S. says China reneging on trade commitments, talks continue 1.044862 -1.296420
1219 Nikkei rises, financials lead gains on higher U.S. yields 2.134663 -0.188313
410 Indian shares end higher; Zee Entertainment top gainer on NSE index 3.379127 1.064884
8170 MIDEAST STOCKS-Saudi shares flat on banks, petrochemicals; other markets up 0.853865 -1.443667
3554 RPT-WRAPUP 2-China shares, yuan rise on hopes for last-minute trade deal 0.987134 -1.296672
418 India shares end mostly flat; metal, PSU banks gain 0.721904 -1.537553
8594 HK shares end higher on China policy boost, trade talk hopes 1.155047 -1.094102
1931 Nikkei has best gain in 5 weeks, insurers rise and US tariffs shrugged off 0.751509 -1.473133
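The normalization behind the table can be sketched without pandas. Each measure is shifted by its positive/negative divider and scaled by its standard deviation, so that both are on a comparable scale while the sign still says positive or negative. The scores below are made up, and I use the population standard deviation, which is also what np.std computes by default when given an array.

```python
import statistics

def normalize(scores, divider):
    """Shift by the positive/negative divider and scale by the population standard deviation."""
    scale = statistics.pstdev(scores)
    return [(s - divider) / scale for s in scores]

word_freq = [-0.02, 0.00, 0.01, 0.03]    # hypothetical scores, divider at 0
supervised = [0.10, 0.45, 0.60, 0.95]    # hypothetical probabilities, divider at 0.5

n_word_freq = normalize(word_freq, divider=0.0)
n_supervised = normalize(supervised, divider=0.5)

# Large positive values mean the word frequency method is far more positive than the model
diff = [w - s for w, s in zip(n_word_freq, n_supervised)]
print(diff)
```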

This post outlined a supervised learning approach to sentiment analysis. Due to the lack of publicly available relevant training data, this approach has not been explored very deeply in the financial literature. Just as the IMDB reviews dataset was labelled using review ratings, realized stock returns may be a way to label these news articles and turn them into a training set.

Tags: python  NLP  keras 