News Sentiment Analysis (Part 2)
In the previous post, I replicated Fraiberger et al. (2018) for the Korean market and showed that Reuters news related to Korea has predictive power for the next day's KOSPI 200 index return when sentiment is measured with a word frequency method. In this post, I will use deep-learning-based supervised learning to classify the same Reuters headlines into positive and negative news. The benefit of a supervised learning approach is that, assuming a relevant training set is available, more sophisticated relationships among words can be modelled to arrive at the final sentiment. Unfortunately, no publicly available training set for financial news exists yet, so I use a training set from a different domain, which is likely to lower out-of-sample accuracy.
For this analysis, I will use the Large Movie Review Dataset, which can be found here. The dataset contains 50,000 movie reviews from IMDB, labelled based on the star ratings that reviewers gave along with their reviews. The analysis proceeds as follows: 1) preprocess the movie reviews and divide them into a train / test set, 2) train an algorithm on the train set, 3) confirm that the accuracy holds on the test set, 4) deploy the algorithm on the Reuters headlines, 5) inspect a sample of classified Reuters headlines.
For the first step, I will preprocess the data and then divide it into a train / test set. Let's first load the data and inspect a few rows.
import pandas as pd

movie_reviews = pd.read_csv("IMDB Dataset.csv")
movie_reviews.head(10)
| | review | sentiment |
| --- | --- | --- |
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |
As can be seen above, the reviews contain HTML tags and punctuation that are irrelevant for sentiment analysis. For pre-processing, I will remove the HTML tags, remove numbers and punctuation, and remove multiple spaces and single-letter words. Upper/lower case will be taken care of at the tokenizing step.
import re

def preprocess_text(sen):
    # Remove HTML tags
    sentence = re.sub(r'<[^>]+>', '', sen)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Remove one-letter words
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

X = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    X.append(preprocess_text(sen))

# Change label into 1 - positive, 0 - negative
y = movie_reviews['sentiment']
y = list(map(lambda x: 1 if x == "positive" else 0, y))
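For instance, applying the function to a small made-up snippet (the example string is mine, not from the dataset) shows the tag, the digit, and the one-letter word being stripped:

# Hypothetical example string, not taken from the dataset
sample = "<br />This is a 1 great movie!"
print(preprocess_text(sample))
# -> 'This is great movie ' (tag, digit, and one-letter word removed)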
Now that the reviews have been pre-processed, I will divide them into a training set (80%) and a testing set (20%).
from sklearn.model_selection import train_test_split

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
The machine learning model used in this analysis is a deep learning model, that is, a multi-layered neural network. Neural networks take numeric vectors as input, so a series of steps is needed to convert the movie reviews into numbers. The first step is to tokenize the reviews. I will use the 5,000 most frequent words in the training set as the dictionary.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Initialize tokenizer and cap the vocabulary at the 5,000 most frequent words
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train_raw)

# Tokenize reviews in the training set (the test set is tokenized later
# with the saved tokenizer)
X_train = tokenizer.texts_to_sequences(X_train_raw)

# Vocab size will be used for defining the embedding matrix size
vocab_size = len(tokenizer.word_index) + 1

# The last 200 words of each review will be used in the sentiment analysis
maxlen = 200
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)

# Preview of X_train
X_train
array([[ 217,    9,    1, ...,  198,  345, 3812],
       [  15,   46,   59, ...,   82,   99,    6],
       [ 128, 1307,  108, ...,    0,    0,    0],
       ...,
       [ 778,    8,   24, ...,    0,    0,    0],
       [   8,  347,   10, ...,    0,    0,    0],
       [   8,    5,  525, ...,    0,    0,    0]])
After the tokenizing step, each review is converted into an integer vector of length 200. For reviews with more than 200 words, the last 200 words are kept, and for reviews with fewer than 200 words, 0s are padded at the end.
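As a quick illustration (the sentence below is my own, and the exact integer ids depend on the fitted tokenizer), a short text maps to a few word indices and is then padded to the same length:

# My own example sentence; the actual ids depend on the fitted tokenizer
sample_seq = tokenizer.texts_to_sequences(["this movie was great fun"])
sample_seq = pad_sequences(sample_seq, padding='post', maxlen=maxlen)
print(sample_seq.shape)  # (1, 200): a few word ids followed by trailing zeros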
While the tokenized reviews are now numeric, the raw word indices carry no information about word meaning, so the next step maps each index to an embedding vector. Embeddings in the NLP domain refer to low-dimensional vector representations of words. I will use 100-dimensional vectors, meaning each word is transformed into a real-valued vector of length 100. Hence, the resulting transformation converts each review into a matrix of numbers with dimension $200\times 100$.
For mapping each word to an embedding vector, I will use Global Vectors for Word Representation (GloVe). The version I use can be found here. GloVe provides embedding vectors trained on Wikipedia and Gigaword for a large number of English words.
import numpy as np

# Read the GloVe file into an embeddings dictionary
embeddings_dictionary = dict()
glove_file = open('glove.6B.100d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()

# Build the embedding matrix for the tokenizer's vocabulary
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
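One sanity check worth running here (my own addition, not part of the original pipeline) is how much of the vocabulary GloVe actually covers; rows of the embedding matrix left at zero correspond to words without a pretrained vector:

# Count vocabulary words that received a pretrained vector (non-zero rows)
covered = np.count_nonzero(np.abs(embedding_matrix).sum(axis=1))
print(covered, "of", vocab_size, "vocabulary words have GloVe vectors")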
Now I am ready to design the neural network structure. The network used in this post is a sequential model of Long Short-Term Memory (LSTM) and convolution layers. LSTM layers are used in deep learning when history matters in determining the significance of an input, whereas convolution layers are used when relative position matters. The exact specification I use can be seen in the figure below.
For training, I run 10 epochs and keep 20% of the training set as a validation set.
from keras.models import Sequential
from keras.layers import LSTM, Conv1D, GlobalMaxPooling1D
from keras.layers.core import Dense
from keras.layers.embeddings import Embedding

# Build sequential model
model = Sequential()

# Embedding weights are taken as given (GloVe vectors, not trainable)
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix],
                            input_length=maxlen, trainable=False)
model.add(embedding_layer)

# Add LSTM layer
model.add(LSTM(128, return_sequences=True))

# Add 1D convolution layer followed by global max pooling
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, batch_size=128, epochs=10,
          verbose=0, validation_split=0.2)
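To double-check that the stack matches the figure, `model.summary()` prints each layer with its output shape and parameter count:

# Print the layer stack with output shapes and parameter counts
model.summary()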
Training can take some time, and given the randomness of stochastic gradient descent and the train/test split, the trained weights and tokenizer will differ slightly between runs. I include my trained weights and tokenizer here. For measuring accuracy on the test set, I will use the saved weights and tokenizer.
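For reference, a minimal sketch (my own, not from the original post) of how such a pickle could be produced after training; the key names mirror those read back in the loading code below:

import pickle

# Bundle the tokenizer and trained weights; keys mirror the loading code below
params = {
    'tokenizer': tokenizer,
    'maxlen': maxlen,
    'embedding_matrix': embedding_matrix,
    'vocab_size': vocab_size,
    'weights': model.get_weights(),
}
with open("trained_model.pick", "wb") as f:
    pickle.dump(params, f)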
import pickle

with open("trained_model.pick", "rb") as f:
    params = pickle.load(f)

# Some parameters are commented out because they are unchanged from above
tokenizer = params['tokenizer']
#maxlen = params['maxlen']
#embedding_matrix = params['embedding_matrix']
weights = params['weights']
#vocab_size = params['vocab_size']

# Set pretrained weights
model.set_weights(weights)

# Based on the stored tokenizer, tokenize the test reviews and pad them
X_test = tokenizer.texts_to_sequences(X_test_raw)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

# evaluate returns [loss, accuracy]; keep the accuracy
score = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy on Testing Set", score[1])
Accuracy on Testing Set 0.8816
The trained model scores roughly 88% on the testing set, which is quite good on this dataset. Now it is time to deploy the trained model on the Reuters news. Here are the article bodies and titles extracted from Reuters news in the previous post.
articles = pd.read_csv("reuter_articles.csv")
articles.head(10)
| | body | date | | title |
| --- | --- | --- | --- | --- |
| 0 | * Shanghai shares add 0.4% , blue-chips up 0.7... | 2020-02-14 | 0 | China stocks end higher, post first weekly gai... |
| 1 | * Shanghai shares, blue-chips both add 2.3% * ... | 2020-02-17 | 0 | China stocks recoup virus losses as Beijing st... |
| 2 | * Shanghai shares add 1.8%; blue-chips up 2.3%... | 2020-02-20 | 0 | China stocks find footing on rate cut, policy ... |
| 3 | * Shanghai shares add 0.3%, blue-chips up 0.1%... | 2020-02-21 | 0 | Shanghai stocks seal best week in 10 months on... |
| 4 | SHANGHAI, Feb 24 (Reuters) - China’s main stoc... | 2020-02-24 | 0 | China main indexes fall as coronavirus spreads... |
| 5 | SHANGHAI, Feb 26 (Reuters) - China stocks ende... | 2020-02-26 | 0 | China stocks slide as global coronavirus fears... |
| 6 | * CSI300 +0.3%, Shanghai Composite +0.1% * Chi... | 2020-02-27 | 0 | China stocks rise on fewer virus deaths, stimu... |
| 7 | * Shanghai shares up 3.2%, largest gain since ... | 2020-03-02 | 0 | China stocks rebound as dismal data fuels stim... |
| 8 | SHANGHAI, March 4 (Reuters) - China stocks set... | 2020-03-04 | 0 | China stocks end higher as Fed rate cut bolste... |
| 9 | * CSI300 up 2.2%, hits highest since Jan. 14 *... | 2020-03-05 | 0 | China blue-chips rise to 7-week high on policy... |
Let's look at the sentiment based on the last 200 words of each news article body.
# Preprocess article bodies
X = []
sentences = list(articles['body'])
for sen in sentences:
    X.append(preprocess_text(sen))

# Based on the stored tokenizer, tokenize the article bodies and pad them
X_article = tokenizer.texts_to_sequences(X)
X_article = pad_sequences(X_article, padding='post', maxlen=maxlen)
sentiment_pred = model.predict(X_article, verbose=0)
Rather than testing the predictive power of the resulting sentiment on KOSPI 200 returns, I will compare the sentiment classification from supervised learning to the classification from the word-frequency-based method.
# Sentiment from supervised learning (flatten the (n, 1) prediction array)
articles['sentiment_supervised'] = sentiment_pred.flatten()

# Sentiment from the frequency-based method
articles['sentiment_word_freq'] = pd.read_csv("freq_based_sentiment.csv")['sentiment']

# Correlation
print("Correlation is",
      np.corrcoef(articles['sentiment_supervised'],
                  articles['sentiment_word_freq'])[0, 1])
Correlation is 0.20378549809070076
As expected, the correlation between the two sentiment measures is not great, at roughly 0.2. Now let's inspect the relative performance.
Below are the 20 articles with the most divergent sentiment scores. For the word frequency measure, 0 is taken to be the divider between positive and negative news, whereas for the supervised learning method, 0.5 is taken to be the divider. As expected, the word frequency method appears to perform better, presumably because the supervised model's training set was not finance related.
# Normalize both sentiment measures
articles['N_sentiment_word_freq'] = articles['sentiment_word_freq'] / \
    np.std(articles['sentiment_word_freq'])
articles['N_sentiment_supervised'] = (articles['sentiment_supervised'] - 0.5) / \
    np.std(articles['sentiment_supervised'])

# Display the 20 headlines with the highest difference
pd.options.display.max_colwidth = 200
articles['sentiment_diff'] = articles['N_sentiment_word_freq'] - \
    articles['N_sentiment_supervised']
articles.sort_values("sentiment_diff", ascending=False)[
    ['title', 'N_sentiment_word_freq', 'N_sentiment_supervised']].head(20)
| | title | N_sentiment_word_freq | N_sentiment_supervised |
| --- | --- | --- | --- |
| 8345 | Singapore MAS to require investors to report short positions on SGX stocks | 2.018885 | -1.344420 |
| 1550 | UPDATE 1-Singapore MAS to require investors to report short positions on SGX stocks | 1.334613 | -1.545239 |
| 139 | Hong Kong shares end higher on trade deal hopes, up for 6th week | 1.102909 | -1.571239 |
| 2010 | SE Asia Stocks-Rise on Wall Street rebound; Philippines, Singapore lead gains | 1.280798 | -1.382014 |
| 400 | Indian shares end higher; Bharti Infratel top gainer | 1.418026 | -1.237163 |
| 478 | China's securities regulator says has not changed rules for IPO review | 1.353571 | -1.241889 |
| 6945 | S.Korea stocks hit near 9-month high ahead of Phase 1 trade deal | 1.280798 | -1.291429 |
| 1883 | San Miguel says confident investors will buy stake in food unit | 0.934229 | -1.613568 |
| 2797 | SE Asia Stocks-Rise on deadline extension hopes; Philippines leads gains | 1.240773 | -1.276552 |
| 5551 | JGBs dip as equity gains dim demand for safe-haven debt | 1.063520 | -1.398065 |
| 235 | Indian shares end higher; lenders gain on Yes Bank profit beat | 2.010366 | -0.442756 |
| 438 | China studying reform to outbound QDII investment scheme | 0.696574 | -1.654075 |
| 975 | U.S. says China reneging on trade commitments, talks continue | 1.044862 | -1.296420 |
| 1219 | Nikkei rises, financials lead gains on higher U.S. yields | 2.134663 | -0.188313 |
| 410 | Indian shares end higher; Zee Entertainment top gainer on NSE index | 3.379127 | 1.064884 |
| 8170 | MIDEAST STOCKS-Saudi shares flat on banks, petrochemicals; other markets up | 0.853865 | -1.443667 |
| 3554 | RPT-WRAPUP 2-China shares, yuan rise on hopes for last-minute trade deal | 0.987134 | -1.296672 |
| 418 | India shares end mostly flat; metal, PSU banks gain | 0.721904 | -1.537553 |
| 8594 | HK shares end higher on China policy boost, trade talk hopes | 1.155047 | -1.094102 |
| 1931 | Nikkei has best gain in 5 weeks, insurers rise and US tariffs shrugged off | 0.751509 | -1.473133 |
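Beyond eyeballing the largest disagreements, a quick way to quantify overall agreement (my own addition) is to binarize both measures at their respective dividers and compute the share of articles where they coincide:

# Binarize at each measure's divider (0 and 0.5) and compute agreement
freq_pos = articles['sentiment_word_freq'] > 0
supervised_pos = articles['sentiment_supervised'] > 0.5
print("Agreement rate:", (freq_pos == supervised_pos).mean())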
This post outlined a supervised learning approach to sentiment analysis. Due to the lack of publicly available relevant training data, this approach has not been explored very deeply in the finance literature. As the IMDB reviews dataset was created from review ratings, perhaps realized stock returns could be used in a similar way to label news articles and turn them into a training set.
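As a rough sketch of that idea (the returns file and the column names 'date' and 'next_day_return' are my own hypothetical placeholders, not artifacts from this post), one could label each article by the sign of the next trading day's index return and reuse the same training pipeline:

# Hypothetical sketch: file and column names are assumptions, not real inputs
returns = pd.read_csv("kospi_returns.csv", parse_dates=['date'])
articles['date'] = pd.to_datetime(articles['date'])
labelled = articles.merge(returns, on='date', how='inner')
# Positive next-day return -> 1, mirroring the star-rating rule used for IMDB
labelled['label'] = (labelled['next_day_return'] > 0).astype(int)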