Identifying and fixing the slow read speed of the steem API

tpdns90321 (50) in #kr • 7 years ago (edited)

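Last night I set up my development environment and ran @clayop's pseudocode as a test, and it was painfully slow. Seeing that the time library was imported, I tracked down every sleep call and commented it out, but it was still far too slow to get through the target of 10,000 blocks.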

The code in question:

import requests
import json
import time

url = "https://api.steemit.com"
block_num = 20000000
voter_list = {}

while True:
    sub = json.dumps({"jsonrpc": "2.0", "id": 0, "method": "call", "params": [0, "get_ops_in_block", [block_num, True]]})
    res = requests.post(url, data=sub).json()['result']
    for i in res:
        if i['op'][0] == "author_reward":
            author = i['op'][1]['author']
            permlink = i['op'][1]['permlink']
            c_sub = json.dumps({"jsonrpc": "2.0", "id": 0, "method": "call", "params": [0, "get_content", [author, permlink]]})
            c_res = requests.post(url, data=c_sub).json()["result"]
            for v in c_res['active_votes']:
                voter = v['voter']
                weight = int(v['percent']) # 10000 = 100% vote
                rshares = int(v['rshares']) + 1 # To avoid zero-rshares cases caused by the rshares offset, a small amount is added
                # voter_list structure
                # {voter1: {author1: [weight, rshares], author2: [weight, rshares]}}
                if voter not in voter_list:
                    voter_list[voter] = {}
                if author not in voter_list[voter]:
                    voter_list[voter][author] = [0, 0]
                voter_list[voter][author][0] += weight
                voter_list[voter][author][1] += rshares
                time.sleep(0.5)
    block_num += 1
    print(block_num)
    time.sleep(0.5)
    if block_num == 20010000:
        break

# Voter SP
u_res = requests.get("https://steemit.com/@"+author+".json").json()['user']
votersp = u_res['vesting_shares']

# Total rshares
rsh_stats = {}
# Total voting weight -> We can divide it by number of days then obtain average full vote casts per day
wgh_stats = {}
# Self-voting
sv_stats = {}
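# Inverse Simpson (the loop below accumulates the sum of squared shares, i.e. the Simpson index; take 1/invs for its inverse)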
invs_stats = {}
for v in voter_list:
    rsh = 0
    wgh = 0
    sv = 0
    invs = 0
    for a in voter_list[v]:
        rsh += voter_list[v][a][1]
        wgh += voter_list[v][a][0]
        if a == v: # self-vote: the author is the voter
            sv += voter_list[v][a][1]
    if rsh == 0:
        # Downvotes can cancel the total out; skip this voter to avoid dividing by zero
        print(v)
        continue
    for a in voter_list[v]:
        invs += (voter_list[v][a][1]/rsh)**2
    rsh_stats[v] = rsh
    wgh_stats[v] = wgh
    sv_stats[v] = sv/rsh*100 # 100 = Full self-vote
    invs_stats[v] = invs
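
As a small worked example of the concentration measure computed above: the sum of squared shares is the Simpson index, and its reciprocal is the inverse Simpson index. The sketch below is not part of @clayop's code; the function name `simpson_indices` and the sample numbers are made up for illustration.

```python
def simpson_indices(rshares_by_author):
    """Return (simpson, inverse_simpson) for a mapping of author -> rshares."""
    total = sum(rshares_by_author.values())
    if total == 0:
        raise ValueError("total rshares is zero; shares are undefined")
    # Simpson index: sum of the squared share each author receives
    simpson = sum((r / total) ** 2 for r in rshares_by_author.values())
    return simpson, 1 / simpson

# Hypothetical voter who splits rshares evenly across four authors
even = simpson_indices({"a": 25, "b": 25, "c": 25, "d": 25})
# Hypothetical voter who puts everything on a single author
single = simpson_indices({"a": 100})

print(even)    # (0.25, 4.0)
print(single)  # (1.0, 1.0)
```

An inverse value close to 1 means the votes are concentrated on one author; a larger value means they are spread more evenly.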

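I needed to find out whether the algorithm or the I/O was the problem, and I also wanted the source to be easier to work with from now on, so I decided to write my own small library.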

api.py

import requests
import json

url = "https://api.steemit.com"

# Fetch the operations contained in the block with the given number
def get_block(block_num):
    req = json.dumps({"jsonrpc": "2.0", "id": 0, "method": "call", "params": [0, "get_ops_in_block", [block_num, True]]})
    res = requests.post(url, data=req).json()['result']
    return res
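
The request body has the same shape for both `get_ops_in_block` and `get_content` in the code above, so it could be factored into one helper. This is only a sketch using the standard library; `build_call`, `request_id`, and the sample author and permlink are hypothetical names, not part of any Steem library.

```python
import json

def build_call(method, params, request_id=0):
    """Serialize a JSON-RPC 2.0 request in the shape used by the code above."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "call",
        "params": [0, method, params],
    })

# The two requests that appear in the original script
ops_request = build_call("get_ops_in_block", [20000000, True])
content_request = build_call("get_content", ["some-author", "some-permlink"])
print(ops_request)
```

The returned string can be passed as `data=` to `requests.post`, exactly as `get_block` does.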

test.py

import api

first_block = 20000000
READING = 100

blocks = {}

for block_num in range(first_block, first_block+READING):
    blocks.update({block_num: api.get_block(block_num)})

for i, b in blocks.items():
    print(i)
    for ops in b:
        if ops["op"][0] == "author_reward":
            print(ops["op"][1]["author"])

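I simply coded it to print the IDs of the authors receiving rewards, along with the number of the block that contains the operation.
Number of blocks read: 100.
The result: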

Getting this result from code whose algorithm is so simple means the problem is 100% I/O speed.
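
One way to back up that conclusion with numbers is to time the fetch and the processing separately. The sketch below assumes nothing about the real API: `fake_fetch` is a stand-in that sleeps instead of calling `api.get_block`, and `timed` and `extract_authors` are hypothetical helper names.

```python
import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def fake_fetch(block_num):
    # Stand-in for api.get_block: sleeps to imitate network latency
    time.sleep(0.05)
    return [{"op": ["author_reward", {"author": "someone"}]}]

def extract_authors(ops):
    # The same filtering that test.py does on each block
    return [o["op"][1]["author"] for o in ops if o["op"][0] == "author_reward"]

ops, fetch_time = timed(fake_fetch, 20000000)
authors, parse_time = timed(extract_authors, ops)
print(f"fetch: {fetch_time:.4f}s, parse: {parse_time:.6f}s")
```

Swapping `fake_fetch` for the real `api.get_block` would show how much of each iteration is spent waiting on the network.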

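If you watch or listen to enough Python talks, you know the answer to "what do you do when I/O is slow?" is concurrency or asynchronous I/O. I dug up a slide deck I had seen a long time ago, and sure enough, my memory was right.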

https://www.slideshare.net/deview/2d4python
If Python is your main language, be sure to read it at least once. You will learn a tremendous amount from it.

Using gevent's monkey patch, which is covered in that deck, you can get a huge I/O performance improvement with only about 10 lines of code.

In api.py, add these two lines at the very top:

import gevent.monkey
gevent.monkey.patch_all()

In test.py, change the block-collection part:

threads = []

for block_num in range(first_block, first_block+READING):
    threads.append(
        gevent.spawn(lambda num : blocks.update({num: api.get_block(num)}), block_num))
gevent.joinall(threads)

Then import the gevent module in test.py (import gevent).
The result:

The output order is jumbled, but you can see that the 99th block is there as well, and the speed differs by more than 40x.
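
If the jumbled order matters, one option is to keep the requests concurrent but collect the results in input order. The sketch below is a different technique from the gevent version above: it uses the standard library's `ThreadPoolExecutor`, whose `map()` returns results in the order of its inputs. `fake_get_block` is a stand-in for `api.get_block` that sleeps instead of hitting the network, and the worker count of 10 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get_block(block_num):
    # Stand-in for api.get_block: sleeps to imitate network latency
    time.sleep(0.05)
    return [{"op": ["author_reward", {"author": f"author-{block_num}"}]}]

first_block = 20000000
READING = 20
block_nums = range(first_block, first_block + READING)

with ThreadPoolExecutor(max_workers=10) as pool:
    # map() yields results in the order of block_nums,
    # even though the calls overlap in time
    results = list(pool.map(fake_get_block, block_nums))

blocks = dict(zip(block_nums, results))
for num in blocks:
    print(num, blocks[num][0]["op"][1]["author"])
```

With the gevent version, iterating over `sorted(blocks)` instead of `blocks.items()` when printing gives ordered output as well.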