Building a Stock Price Database 1 (Web Scraping)

In the next few posts, I will build a database of stock prices, a core ingredient of any financial analysis. Since stock market prices are public data, there are a number of sources to obtain them from: free financial portals, brokerage APIs, paid vendors such as FNGuide, and the Korea Exchange (KRX) itself.

I will use web scraping to collect stock price data and store it in a MySQL database, with the Korea Exchange (KRX) Market Data site as the data source. Aside from being an original source of stock price data, the KRX website has the advantage over the other sources of being both free and comprehensive. Free portals and brokerage APIs require that I already have a list of stock IDs before collecting data, and historical prices of delisted companies are not available. FNGuide, aside from being a paid service, merges ETF data with stock data, so disentangling stock prices from ETF data takes some work.

As stated previously, the source website for web scraping is the KRX website. I will focus on screen number 81004 in this exercise. Web scraping is largely guess-and-check work in an attempt to reverse-engineer a website: it involves simulating browser requests to the server. The most helpful tool in web scraping exercises is Chrome's developer tools, which can be opened by pressing the F12 key.

Here is a screenshot of screen 81004.

Pressing F12 after the page loads and then clicking the search button yields this screen.

The entries shown in the developer tools window are the requests made to the server to enable the search function, along with their details. Looking at them, I can infer that the first entry, GenerateOTP.jspx, generates a one-time password to be used in the request for the actual data, MKD99000001.jspx. One helpful tool in reverse engineering is a curl-to-requests converter. By right-clicking on a request of interest, I can copy it as cURL, which contains all the information about the request to the server; the converter then turns the cURL command into an executable Python script. Below is the Python script emulating the GenerateOTP.jspx request.

import requests

cookies = {
    '__utma': '139639017.1319634283.1581578713.1581578713.1581578713.1',
    '__utmc': '139639017',
    '__utmz': '139639017.1581578713.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
    'SCOUTER': 'x5s51u8uuea6ke',
    'schStocksHistory': 'removed since not necessary',
    'JSESSIONID': '1CB74516EF4A83F07207737ED1929A3E.58tomcat1',
    '__utmt': '1',
    '__utmb': '70557324.6.10.1581584516',
}

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'Referer': 'http://marketdata.krx.co.kr/mdi',
    'Accept-Language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7',
}

params = (
    ('bld', 'MKD/13/1302/13020101/mkd13020101'),
    ('name', 'form'),
    ('_', '1581738049755'),
)

response = requests.get('http://marketdata.krx.co.kr/contents/COM/GenerateOTP.jspx', headers=headers, params=params, cookies=cookies, verify=False)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('http://marketdata.krx.co.kr/contents/COM/GenerateOTP.jspx?bld=MKD%2F13%2F1302%2F13020101%2Fmkd13020101&name=form&_=1581738049755', headers=headers, cookies=cookies, verify=False)
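From this flow, my reading is that the response body of GenerateOTP.jspx is the one-time token itself, which then gets passed as the code field of the data request. A minimal sketch (the variable name otp_code is mine):

otp_code = response.text # opaque one-time token returned as the plain-text response body
# otp_code is what goes into the 'code' field of the MKD99000001.jspx request below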

Playing around with the parameters, I can deduce that the cookies are optional, while the headers and params are required. Below is the request for MKD99000001.jspx; I've removed the cookies since they're not necessary. After trying different combinations of requests, I find that market_gubun is STK for KOSPI and KSQ for KOSDAQ. The response of MKD99000001.jspx is JSON, so it can be parsed with the json package.

import requests
import json

headers = {
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://marketdata.krx.co.kr',
    'Referer': 'http://marketdata.krx.co.kr/mdi',
    'Accept-Language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7',
}

data = {
  'market_gubun': 'KSQ', # STK for KOSPI, KSQ for KOSDAQ
  'sect_tp_cd': 'ALL',
  'schdate': '20200214', # trade date in YYYYMMDD format
  'pagePath': '/contents/MKD/13/1302/13020101/MKD13020101.jsp',
  'code': 't6Z77r6SW62aLBwg1RhRVD/lqxrUElM6p2vnxibMnyi1s5W6XhjU2Aar9u7nxLo2RkmXdfe8sPy1PN6QT9lLy8JnN36zTKKVW2ShPKTzD2A/D/Exdk350oQ7mq7A7Eu09YRXOdH4fnAwGqUNV5MO/w==', # one-time token from GenerateOTP.jspx
  'curPage': '1'
}

response = requests.post('http://marketdata.krx.co.kr/contents/MKD/99/MKD99000001.jspx', headers=headers, data=data)
dt = json.loads(response.content)['시가총액 상하위'] # the '시가총액 상하위' (market cap ranking) key holds the list of result rows
print(dt[0])
{'rn': '1', 'isu_cd': '000250', 'kor_shrt_isu_nm': '삼천당제약', 'isu_cur_pr': '36,000', 'fluc_tp_cd': '1', 'prv_dd_cmpr': '500', 'updn_rate': '1.4', 'opnprc': '35,600', 'hgprc': '36,150', 'lwprc': '35,050', 'isu_tr_vl': '197,104', 'isu_tr_amt': '7,043,808,250', 'cur_pr_tot_amt': '795,809,268,000', 'tot_amt_per': '0.32', 'lst_stk_vl': '22,105,813', 'totCnt': '1408'}
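The numeric fields come back as strings with thousands separators (e.g. '36,000'), so they have to be stripped before they can be stored as integers. A tiny helper along these lines would do it (my own sketch; the scraping script further below applies the same replace inline):

def to_int(value):
    # '36,000' -> 36000
    return int(value.replace(",", ""))

print(to_int(dt[0]['isu_cur_pr'])) # 36000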

Since the goal of this post is to create a database of stock prices, I will create a MySQL table called stockprices. The SQL command for table creation is shown below.

CREATE TABLE `echo_blog`.`stockprices` (
  `isu_cd` CHAR(6) NOT NULL,
  `tradedate` DATE NOT NULL,
  `open_prc` int(11) NOT NULL,
  `high_prc` int(11) NOT NULL,
  `low_prc` int(11) NOT NULL,
  `cls_prc` int(11) NOT NULL,
  `list_stock_vol` bigint(20) NOT NULL,
  `exchange` varchar(10) NOT NULL,
  PRIMARY KEY (`isu_cd`, `tradedate`),
  INDEX `exchange_idx` (`exchange` ASC),
  INDEX `tradedate_idx` (`tradedate` ASC));

The primary key is the combination of isu_cd, the unique stock ID, and tradedate. I've also added indexes on exchange and tradedate for faster retrieval. Aside from the open, high, low, and close prices, I've included the number of listed shares, which will help with value-weighted averaging (a sketch of such a query is shown below).
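For example, once the table is populated, the listed share count lets me compute the total market capitalization per exchange on a given date, which is the denominator of any value weight. A minimal sketch, assuming placeholder connection details:

import MySQLdb as sql

con = sql.connect(host="localhost", database="echo_blog", user="user_id", password="user_pd", charset="utf8") # placeholder credentials
cursor = con.cursor()

# total market cap per exchange on one trade date;
# each stock's value weight is cls_prc * list_stock_vol divided by this total
cursor.execute(
    "select exchange, sum(cls_prc * list_stock_vol) as total_mkt_cap "
    "from stockprices where tradedate = %s group by exchange",
    ["2020-02-14"])
print(cursor.fetchall())
con.close()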

Now that the database is ready, stock price data can be inserted. The scraping code is below.

import requests
import json
import time
import datetime
import MySQLdb as sql

headers = {
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://marketdata.krx.co.kr',
    'Referer': 'http://marketdata.krx.co.kr/mdi',
    'Accept-Language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7',
}

#OTP removed
data = {
  'market_gubun': 'KSQ',
  'sect_tp_cd': 'ALL',
  'schdate': '20200214',
  'pagePath': '/contents/MKD/13/1302/13020101/MKD13020101.jsp',
  'code': 'OTP removed',
  'curPage':5
}

start_date = datetime.datetime(2009,12,30) #Collection Start Date
end_date = datetime.datetime(2020,2,14) #Collection End Date

#Connect to database (id/pw removed)
con = sql.connect(host="eric-cho.com", port=3308, database="echo_blog", user="user_id", password="user_pd", charset="utf8")
cursor = con.cursor() 


trade_date = start_date
while trade_date<=end_date: #iterate through tradedates
    data['schdate'] = trade_date.strftime("%Y%m%d")
    for exchange in ['STK','KSQ']: #iterate through exchanges
        temp_dt = []
        data['market_gubun'] = exchange   
        for i in range(1, 30): #iterate through pages
            data['curPage'] = i
            response = requests.post('http://marketdata.krx.co.kr/contents/MKD/99/MKD99000001.jspx', headers=headers, data=data)
            dt = json.loads(response.content)['시가총액 상하위']
            temp_dt+=dt
            if len(dt)==0: #when this tradedate/exchange is finished, print a summary and move on
                n_items = len(set([td['isu_cd'][:5] for td in temp_dt])) #unique items, ignoring the last character of the 6-digit code
                print(trade_date, exchange, "page", i-1, "is last.", n_items, "items")
                break
            for d in dt: #insert data into database
                cursor.execute("insert into stockprices values (%s,%s,%s,%s,%s,%s,%s,%s)",[d['isu_cd'],trade_date, d['opnprc'].replace(",",""),d['hgprc'].replace(",",""), d['lwprc'].replace(",",""), d['isu_cur_pr'].replace(",",""),d['lst_stk_vl'].replace(",",""), exchange])
            con.commit()
            time.sleep(2) #delay requests so as not to overload the host server

    trade_date = trade_date+datetime.timedelta(1)
con.close()
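Before the con.close() call at the end of the script, a quick sanity check (my own addition) confirms what was actually loaded:

cursor.execute("select count(*), min(tradedate), max(tradedate), count(distinct isu_cd) from stockprices")
print(cursor.fetchone()) # (row count, first trade date, last trade date, number of distinct stock IDs)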

In this post, I've discussed scraping historical stock prices for the Korean market. In the next post, I will discuss setting up a cron job so that the stock price database is updated daily.

Tags: python  mysql  webscraping 
