Topic Relevance Among Korean Stocks

The interaction between media and the stock market has received a lot of attention in the finance literature. In this post, I will use Latent Dirichlet Allocation (LDA) to isolate different topics in Korean news articles and examine how they interact with stock market activity. I will be looking at news articles related to Kia Motors (A000270). The choice of stock is rather arbitrary: I chose it because it has the smallest stockcode among the 35 stocks I looked at in the reversion strategy post.

The dataset consists of 8,118 posts extracted from Paxnet between January 2018 and March 19th, 2020. I won't go into detail about the code used to scrape the news articles, but it is as follows.

import requests
import time
import os
from bs4 import BeautifulSoup

stockcode = 'A000270'
os.makedirs(f"articles/{stockcode}", exist_ok=True) #Make sure the output folder exists
collected_article = []
page_no = 1
cont = True
while cont:
    req = requests.get(f'http://www.paxnet.co.kr/news/{stockcode[1:]}/stock?currentPageNo={page_no}&stockCode={stockcode[1:]}')
    soup = BeautifulSoup(req.content.decode('utf8'),"html.parser")
    article_list = soup.find('div',attrs={'class':'board-thumbnail'}).findAll('li')
    for i,article in enumerate(article_list[:20]):
        link = article.find('dl',attrs={'class':'text'}).find('dt').find('a')['href'].split("¤")[0]
        articleId = link.split("articleId=")[1]
        if int(articleId[:4])<2018: #Stop once we reach articles from before 2018
            cont = False
            break
        if articleId in collected_article: #As new posts come up, articles shift down the list, so skip duplicates
            continue
        else:
            collected_article.append(articleId)
            while True: #Sometimes there is a connection error, so retry on failure
                try:
                    req = requests.get('http://www.paxnet.co.kr'+link)
                except Exception as e:
                    print(e)
                    time.sleep(1) #Back off briefly before retrying
                    continue
                break
            soup = BeautifulSoup(req.content,"html.parser")
            body = soup.find('div', attrs = {'class':'span_article_content'})
            with open(f"articles/{stockcode}/{articleId}.txt", "w", encoding = "utf8", errors="ignore") as f:
                f.write(body.text)
        time.sleep(1)
    page_no+=1
    print(stockcode, page_no)


Applying LDA requires that the articles be tokenized and lemmatized. While there are a few options for processing Korean text, I decided to use soynlp because it is strong at extracting nouns even when there are spacing errors. Korean can be tricky to tokenize because its spacing rules are complicated. soynlp uses an L-R graph to compute the probability of each word being a noun. To build a list of potential nouns widely used in news articles related to Kia Motors, I use the NewsNounExtractor function, one of three noun extractor functions provided by soynlp.

from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import NewsNounExtractor

folder = 'A000270'
#Combine all news articles into one file
with open("combined_article.txt", "w", encoding = "utf8", errors = "ignore") as f:
    articles = os.listdir(f"articles/{folder}")
    for article in articles:
        with open(f"articles/{folder}/{article}", "r", encoding = "utf8") as f1:
            f.write(f1.read())
    
sentences = DoublespaceLineCorpus("combined_article.txt", iter_sent=True)
noun_extractor = NewsNounExtractor(verbose=False)
nouns = noun_extractor.train_extract(sentences)
for i,[k,v] in enumerate(nouns.items()):
    print(k,v)
    if i>=2:
        break
겨냥했다. NewsNounScore(score=0, frequency=4, feature_proportion=0, eojeol_proportion=1.0, n_positive_feature=0, unique_positive_feature_proportion=0)
충족했다. NewsNounScore(score=0, frequency=5, feature_proportion=0, eojeol_proportion=1.0, n_positive_feature=0, unique_positive_feature_proportion=0)
채용한다. NewsNounScore(score=0, frequency=4, feature_proportion=0, eojeol_proportion=1.0, n_positive_feature=0, unique_positive_feature_proportion=0)

As can be seen above, the NewsNounExtractor function returns candidate nouns along with their frequency and a score indicating how likely each is to be a noun. Using these scores, I apply the LTokenizer function to tokenize each news article into words.
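
To make the score-driven split more concrete, here is a minimal sketch on a toy sentence with made-up scores (the words and values are purely illustrative, not output of the extractor above): LTokenizer splits each space-delimited unit into an L part (the noun candidate) and an R part (particles and endings), preferring the L part with the highest score.

from soynlp.tokenizer import LTokenizer

toy_scores = {'기아': 1.0, '신차': 0.9} #hypothetical scores, for illustration only
toy_tokenizer = LTokenizer(scores=toy_scores)
#flatten=False keeps each (L, R) split, e.g. ('기아', '자동차가') given these scores
print(toy_tokenizer.tokenize('기아자동차가 신차를 공개했다', flatten=False))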

from soynlp.tokenizer import LTokenizer
import re

noun_scores = {noun:score.score for noun, score in nouns.items() if score.frequency>=10}

tokenizer = LTokenizer(scores=noun_scores)
parsed_articles = []
dates = []
folder = 'A000270'
articles = os.listdir(f"articles/{folder}")
for article in articles:
    with open(f"articles/{folder}/{article}", "r", encoding = "utf8") as f1:
        article_content = f1.read()
    dates.append(article[:8]) #Keep track of the article's date
    article_content = re.sub(r'\(.*\)', '', article_content) #remove text in parentheses
    article_content = re.sub(r'\[.*\]', '', article_content) #remove text in brackets
    article_content = re.sub(r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?" \
                             r"^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", ' ', article_content) #remove email addresses
    article_content = re.sub(r'''[!?=,▶`/\+◆※\-▲△◇\(\)\[\]\.■…<>·"'‘’“”]''', ' ', article_content) #remove special characters
    article_words = tokenizer.tokenize(article_content, flatten = False)
    article_words = [a[0] for a in article_words if not re.search('[0-9]+', a[0])] #remove words containing numbers
    article_words = [a for a in article_words if len(a)>=2] #remove single-character words
    article_words = [a for a in article_words if "@" not in a] #remove any remaining tokens containing '@'
    parsed_articles.append(article_words)

Let’s look at some of the most widely used words in the articles.

#Words that show up in the largest number of articles
import collections
counter = collections.Counter([word for parsed_article in parsed_articles for word in set(parsed_article)])
print(counter.most_common(100))
[('기아', 4887), ('현대', 3682), ('있다', 2872), ('무단', 2728), ('지난', 2726), ('자동', 2120), ('판매', 1908), ('기자', 1875), ('있는', 1839), ('올해', 1719), ('통해', 1718), ('국내', 1711), ('것으로', 1692), ('이라고', 1636), ('밝혔다', 1628), ('제목', 1599), ('관계자는', 1585), ('말했다', 1488), ('대한', 1487), ('금지', 1414), ('재배포', 1384), ('대비', 1353), ('배포', 1325), ('위해', 1282), ('글로벌', 1266), ('차량', 1257), ('저작권자', 1233), ('확대', 1224), ('대해', 1217), ('이번', 1197), ('주요', 1187), ('함께', 1170), ('등을', 1169), ('계획', 1158), ('이를', 1157), ('따르면', 1130), ('전재', 1116), ('가능', 1108), ('시장', 1081), ('미국', 1073), ('모델', 1073), ('위한', 1058), ('이상', 1050), ('매수', 1044), ('이후', 1010), ('실적', 1009), ('전년', 991), ('고객', 983), ('각각', 976), ('현대자동', 958), ('파이', 958), ('특히', 957), ('적용', 954), ('관련', 949), ('신차', 939), ('거래', 926), ('한국', 919), ('생산', 897), ('상위', 897), ('경쟁', 890), ('다양한', 886), ('이날', 873), ('최근', 864), ('이어', 843), ('전기', 819), ('따라', 818), ('기존', 807), ('경우', 791), ('중국', 784), ('기업', 783), ('분석', 781), ('세계', 778), ('예정이다', 764), ('완성차', 753), ('같은', 751), ('기술', 749), ('서울', 744), ('대상', 744), ('책임', 739), ('시장에서', 738), ('브랜드', 738), ('업체', 735), ('했다', 734), ('영업이', 734), ('개발', 727), ('현재', 721), ('SUV', 720), ('가장', 713), ('매도', 710), ('이라며', 708), ('한편', 695), ('프로', 694), ('규모', 692), ('업계', 688), ('오는', 677), ('최대', 676), ('때문', 664), ('가운데', 654), ('서비스', 646), ('영향', 645)]

One common practice is to remove words that occur too frequently, as they may not be useful for detecting topics. In my case, I set the threshold at 2,000 articles.

stop_words = []
for item in counter.items():
    if item[1]>=2000:
        stop_words.append(item[0])

Furthermore, although soynlp does a decent job of detecting nouns, it is not perfect, so I apply a few additional manual adjustments to the tokenized documents.

new_parsed_articles = []
for parsed_article in parsed_articles:
    new_parsed_article = []
    for word in parsed_article:
        if word in stop_words:
            continue
        word = re.sub('으로.*','',word) #strip '으로...' endings left attached to nouns
        word = re.sub('하면.*','하다',word) #normalize verb forms ending in '하면...' to '하다'
        word = re.sub('하는.*','하다',word) #normalize verb forms ending in '하는...' to '하다'
        if len(word)<2:
            continue
        if word[-1] in ['가','는','을','를','인','이']: #drop trailing particles
            word = word[:-1]
        if len(word)>=2:
            new_parsed_article.append(word)
    new_parsed_articles.append(new_parsed_article)

Now that the articles have been tokenized and standardized, it is time to run the LDA analysis. LDA assumes that each document is a mixture of topics, and that each topic is a set of words together with their probabilities of occurring in a document that contains the topic. I will use gensim, the most widely used Python package for LDA.

One parameter choice in LDA is the number of topics to extract. One way to decide is to choose the number of topics that maximizes the coherence score, a measure of semantic similarity between the high-scoring words in a topic. U_Mass and C_v appear to be the most widely used metrics; I use the C_v coherence score and choose 12 topics, which yields the highest value.

import gensim

coherence = []
dictionary = gensim.corpora.Dictionary(new_parsed_articles)
bow_corpus = [dictionary.doc2bow(doc) for doc in new_parsed_articles]
for num_topics in range(2,15):
    lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                    num_topics =num_topics,
                                    passes=8,
                                    id2word = dictionary,                                    
                                    workers = 3)
    
    coherence_model_lda = gensim.models.CoherenceModel(model = lda_model, texts = new_parsed_articles, dictionary = dictionary, coherence = 'c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    coherence.append([num_topics, coherence_lda])
    print(num_topics, coherence_lda)
2 0.27002251985248243
3 0.4390758044871117
4 0.4517799530237788
5 0.4222227640122373
6 0.4282767418436646
7 0.48615301373038206
8 0.4185972709529318
9 0.4637912875710659
10 0.49536560709460586
11 0.48576005419268503
12 0.5443914190341478
13 0.46588194353704615
14 0.4742900305739503
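
As a small convenience snippet (not part of the original run), the best-scoring number of topics can be read off the coherence list programmatically:

#Pick the topic count with the highest C_v coherence from the list built above
best_num_topics, best_score = max(coherence, key=lambda x: x[1])
print(best_num_topics, best_score) #12 and roughly 0.544 in the run shown above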

Below are the top 5 words for each of the 12 topics produced by the LDA analysis.

import gensim
num_topics = 12
lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                num_topics =num_topics,
                                passes=8,
                                id2word = dictionary,                                    
                                workers = 3)

for i,topic in lda_model.show_topics(formatted=True, num_topics=12, num_words=5):
    print(str(i)+": "+ topic)
0: 0.023*"판매" + 0.014*"전기" + 0.009*"친환경차" + 0.008*"올해" + 0.008*"글로벌"
1: 0.014*"차량" + 0.013*"전기" + 0.010*"개발" + 0.008*"고객" + 0.008*"기술"
2: 0.016*"고객" + 0.012*"서비스" + 0.012*"차량" + 0.011*"프로" + 0.009*"혜택"
3: 0.011*"업종" + 0.007*"실적" + 0.005*"시장" + 0.005*"코스" + 0.004*"미국"
4: 0.014*"제목" + 0.014*"업종" + 0.011*"상위" + 0.009*"순매수" + 0.009*"매수"
5: 0.019*"모델" + 0.017*"디자" + 0.016*"SUV" + 0.009*"적용" + 0.009*"신형"
6: 0.010*"협력사" + 0.009*"기술" + 0.008*"자율" + 0.007*"올해" + 0.005*"지배"
7: 0.012*"코스" + 0.011*"종목" + 0.009*"삼성" + 0.008*"거래" + 0.007*"지수"
8: 0.012*"인도" + 0.006*"수석" + 0.006*"인사" + 0.005*"미래" + 0.005*"승진"
9: 0.035*"판매" + 0.014*"전년" + 0.013*"대비" + 0.011*"실적" + 0.010*"영업"
10: 0.026*"공시" + 0.010*"중국" + 0.009*"규모" + 0.006*"영업" + 0.006*"사전계약"
11: 0.012*"생산" + 0.010*"공장" + 0.006*"부품" + 0.006*"노조" + 0.005*"한국"
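
Since LDA treats each article as a mixture of topics, the fitted model can also be queried for an individual article's topic distribution. A minimal sketch using the model and corpus from above:

#Topic mixture of the first article; returns (topic_id, probability) pairs
#for topics above the given probability threshold
print(lda_model.get_document_topics(bow_corpus[0], minimum_probability=0.05))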

To see which of these topics are most relevant to stock market activity, I follow the approach used by Hisano et al. (2016), “High Quality Topic Extraction from Business News Explains Abnormal Financial Market Volatility”. Namely, I define daily topic volume as the number of times a word tagged with a topic appears in that day's news articles.

import datetime
import pandas as pd

#Extract the top 5 words from each topic
topics=lda_model.show_topics(num_topics=12, num_words=5,formatted=False)
topic_words = [[wd[0] for wd in tp[1]] for tp in topics]

#Compute daily topic volume
topic_volume = []
for t_words in topic_words:
    daily_volume = {d:0 for d in set(dates)}
    for date, parsed_article in zip(dates, parsed_articles):
        for word in set(parsed_article):
            if word in t_words:
                daily_volume[date]+=1
    topic_volume.append(daily_volume)

#Arrange daily topic volume in data frame
for i,tv in enumerate(topic_volume):
    dt = []
    for date, count in tv.items():
        dt.append({'date':datetime.datetime.strptime(date,'%Y%m%d').date(), f'topic{i}':count})
    df = pd.DataFrame(dt) if i == 0 else pd.merge(df, pd.DataFrame(dt), on = "date")
df = df.sort_values('date')
print(df.head(5))
           date  topic0  topic1  topic2  topic3  topic4  topic5  topic6  \
44   2018-01-02      89      22      15      28       1      15      41   
373  2018-01-03      13       2       1       7       2       4       5   
560  2018-01-04      13       8       2       6       0       6       6   
354  2018-01-05       9       5       3       5       0       2       2   
319  2018-01-06       0       0       0       1       0       0       0   

     topic7  topic8  topic9  topic10  topic11  
44        5      14      66       14       18  
373       1       0       9        5       11  
560       0       1       7        2        4  
354       0       9       6        0        1  
319       0       1       0        0        0  

Now let's see whether any of the topics co-move with market trading volume (the number of contracts traded daily).

import MySQLdb as sql
con = sql.connect(host="ai.bond.co.kr", user = "kis", database="kisdb", charset = "utf8")               
cursor = con.cursor()

cursor.execute("select tradedate, trd_qty from stockprices where gicode = '000270'")        
res = cursor.fetchall()
res = [[r[0], float(r[1].replace(",",""))] for r in res]
df2 = pd.merge(df, pd.DataFrame(res, columns = ["date","volume"]), on = "date")
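
The trading volume above comes from a private database, so the query will not run as-is for other readers. Any other source of daily trading volume can be merged in the same way; here is a minimal sketch assuming a hypothetical local CSV with date and volume columns (the file name and column names are assumptions).

import pandas as pd

#Hypothetical alternative to the database query: a local CSV with 'date' and
#'volume' columns (file name and column names are assumptions)
volume_df = pd.read_csv("kia_volume.csv", parse_dates=["date"])
volume_df["date"] = volume_df["date"].dt.date #match the datetime.date values in df
df2 = pd.merge(df, volume_df, on="date")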

I will use a cross-validated lasso to keep only the topics with the highest out-of-sample predictive power. Since this is time series data, I use TimeSeriesSplit rather than a simple cross-validation split.
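
For intuition on what TimeSeriesSplit does differently from an ordinary K-fold split, here is a small illustration on toy data (not part of the analysis): the training window only grows forward in time and each validation block comes strictly after it, so no future observations leak into the fit.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

#Toy data: 10 time-ordered samples split into 3 forward-chaining folds
toy_X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy_X):
    print("train:", train_idx, "test:", test_idx)
#train: [0 1 2 3] test: [4 5]
#train: [0 1 2 3 4 5] test: [6 7]
#train: [0 1 2 3 4 5 6 7] test: [8 9]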

from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error


X = df2[[f'topic{i}' for i in range(12)]]
y = df2['volume']

dates_test, X_train, X_test, y_train, y_test = df2['date'][400:],X[:400], X[400:], y[:400],y[400:]

lassocv = LassoCV(alphas=None, cv=TimeSeriesSplit(5), max_iter=100000, normalize=True) #note: normalize was removed in scikit-learn 1.2; scale features manually on newer versions
lassocv.fit(X_train, y_train)
print("mse = ",mean_squared_error(y_test, lassocv.predict(X_test)))
print("best model coefficients:")
print(pd.Series(lassocv.coef_, index=X.columns))
mse =  257660.72621387482
best model coefficients:
topic0      0.00000
topic1      0.00000
topic2      0.00000
topic3      0.00000
topic4     20.37094
topic5      0.00000
topic6      0.00000
topic7      0.00000
topic8      0.00000
topic9      0.00000
topic10     0.00000
topic11     0.00000
dtype: float64

Unfortunately, only topic 4 appears to be relevant to market activity. Topic 4 is about Kia Motors' stock market trading activity itself, so it seems Korean stock-related news articles are not very helpful for understanding Korean stock market activity. Nevertheless, here is a plot of predicted trading volume compared with actual trading volume.

import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=dates_test,y=y_test,name="Actual"))
fig.add_trace(go.Scatter(x=dates_test,y=lassocv.predict(X_test),name="Predicted"))
fig.show()

Tags: python  NLP  LDA  lasso  webscraping 
