An Attempt at Saving Hive Posts to Elasticsearch

aafeng (71) in #cn • 5 years ago

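I have never been very satisfied with Hive's search. At work I have used a number of frameworks, and in their search features facets and keyword highlighting are more or less standard. Hive has neither. I understand why: the data is read from the chain, the chain data does not contain the information these features need, so they are not easy to implement. But if the on-chain data were saved to Solr or Elasticsearch, wouldn't that make better search possible? The steps below record how I saved Hive post information to Elasticsearch. This is only a simple experiment; a lot of work remains before it could become a real, working website.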

Installing Elasticsearch

Download Elasticsearch:

https://www.elastic.co/cn/downloads/elasticsearch

Extract the archive:

tar xvf elasticsearch-7.7.0-linux-x86_64.tar.gz

Run it:

cd elasticsearch-7.7.0/bin 
./elasticsearch

Test it locally with curl:

curl localhost:9200

The output should look similar to:

{
  "name" : "YOUR_SERVER_NAME",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "liHA066AQJeE91lv8lLqig",
  "version" : {
    "number" : "7.7.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
    "build_date" : "2020-05-12T02:01:37.602180Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
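The same check can be scripted. The following is a minimal sketch, assuming the node answers with JSON shaped like the output above; the helper name `parse_es_version` is hypothetical and only the Python standard library is used.

```python
import json


def parse_es_version(response_text):
    """Return the Elasticsearch version as a tuple of ints, e.g. (7, 7, 0)."""
    info = json.loads(response_text)
    number = info["version"]["number"]
    # Drop any suffix such as "-SNAPSHOT" and keep the numeric parts.
    core = number.split("-")[0]
    return tuple(int(part) for part in core.split("."))


# Sample response, trimmed from the curl output shown above.
sample = (
    '{"name": "YOUR_SERVER_NAME", '
    '"version": {"number": "7.7.0"}, '
    '"tagline": "You Know, for Search"}'
)
print(parse_es_version(sample))  # prints (7, 7, 0)
```

Against a running node, the response text could be fetched with `urllib.request.urlopen("http://localhost:9200").read()` and passed to the same function.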

Installing Kibana

Kibana is not strictly required in order to use Elasticsearch, but with it the data stored in Elasticsearch can be visualized.

Download it from the official site: https://www.elastic.co/cn/downloads/kibana

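After extracting the archive, go to the config directory and edit kibana.yml so that Kibana accepts connections from other machines: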

server.host: "0.0.0.0"

Go to the bin directory and run:

./kibana

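Kibana's web interface is then available in the browser (port 5601 by default).

Import the sample eCommerce data set for testing.

Open Dev Tools and run a simple query; the sample data is returned.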

Key Concepts in Elasticsearch

Before inserting data into Elasticsearch, here is how its key concepts compare with those of a traditional relational database:

Elasticsearch (ES) => Relational database (RDB)

Index => Database
Document => Row
Field => Column
Mapping => Schema
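To make the comparison concrete, here is a small sketch of what a document and a mapping look like when written as Python dictionaries. The field names and types are illustrative assumptions, not the mapping used later in this post (which relies on Elasticsearch's dynamic mapping).

```python
# A document is a JSON object; its keys are fields
# (comparable to a row and its columns).
document = {
    "author": "aafeng",
    "title": "An example title",  # illustrative value
    "category": "cn",             # illustrative value
}

# A mapping declares the type of each field (comparable to a schema).
# "text" fields are analyzed for full-text search;
# "keyword" fields are matched exactly.
mapping = {
    "properties": {
        "author": {"type": "keyword"},
        "title": {"type": "text"},
        "category": {"type": "keyword"},
    }
}

# Check that every field of the document is declared in the mapping.
missing = [field for field in document if field not in mapping["properties"]]
print(missing)  # prints []
```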

Installing and Testing the Python elasticsearch Module

First install the elasticsearch module:

pip install elasticsearch

Start the Python interpreter and enter the following commands in order to create a new index named "hive-posts-index":

>>> from datetime import datetime
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> es.indices.create(index='hive-posts-index', ignore=400)
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'hive-posts-index'}

The new index is now visible in Kibana Dev Tools.

Test adding a record:

es.index(index="hive-posts-index", id=1, body={"any": "data", "timestamp": datetime.now()})

Read the record back:

>>> es.get(index="hive-posts-index", id=1)
{'_index': 'hive-posts-index', '_type': '_doc', '_id': '1', '_version': 1, '_seq_no': 0, '_primary_term': 1, 'found': True, '_source': {'any': 'data', 'timestamp': '2020-06-02T15:52:40.759337'}}
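The stored document itself sits under the `_source` key of that response; the other keys are metadata. A minimal sketch of pulling it out, where `extract_source` is a hypothetical helper and the response dictionary is copied from the output above:

```python
def extract_source(response):
    """Return the stored document from a get-style response, or None."""
    # Defensive check: the Python client normally raises an exception
    # for a missing id, but a raw response carries "found": False.
    if not response.get("found"):
        return None
    return response["_source"]


# Response copied from the es.get() output shown above.
response = {
    "_index": "hive-posts-index",
    "_type": "_doc",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": True,
    "_source": {"any": "data", "timestamp": "2020-06-02T15:52:40.759337"},
}

doc = extract_source(response)
print(doc["any"])  # prints data
```

Note that the timestamp comes back as an ISO 8601 string, not as a `datetime` object.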

Inserting Hive Posts into Elasticsearch

I wrote a simple program that saves the title, body, and category of my five most recent posts to ES, using the permlink as the document id in ES:

from datetime import datetime

from beem import Steem
from beem.account import Account
from elasticsearch import Elasticsearch

# Connect to the local Elasticsearch node (localhost:9200 by default).
es = Elasticsearch()

# beem's Steem class is pointed at a Hive API node here.
# The documented keyword argument is `node`.
hive    = Steem(node='https://api.hive.blog')
account = Account('aafeng', steem_instance=hive)
posts   = account.get_blog(start_entry_id=0, limit=5)

for post in posts:
    permlink = post.permlink
    # Using the permlink as the id means that re-running the script
    # overwrites existing documents instead of creating duplicates.
    es.index(index="hive-posts-index", id=permlink, body={
        "author":    post.author,
        "title":     post.title,
        "body":      post.body,
        "category":  post.category,
        "permlink":  permlink,
        "timestamp": datetime.now(),  # time of indexing, not of posting
    })
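For more than a handful of posts, one request per document becomes slow, and the client's bulk helper is the usual alternative. The sketch below only builds the bulk actions; `build_actions` is a hypothetical helper, the posts are assumed to be dict-like, and the sample values are made up so the example runs without a chain connection.

```python
from datetime import datetime


def build_actions(posts, index="hive-posts-index"):
    """Yield one bulk action per post, using the permlink as document id."""
    for post in posts:
        yield {
            "_index": index,
            "_id": post["permlink"],
            "_source": {
                "author": post["author"],
                "title": post["title"],
                "body": post["body"],
                "category": post["category"],
                "permlink": post["permlink"],
                # Time of indexing, stored as an ISO 8601 string.
                "timestamp": datetime.now().isoformat(),
            },
        }


# Plain dicts stand in for the posts returned by beem (made-up values).
sample_posts = [
    {"author": "aafeng", "title": "First", "body": "Hello",
     "category": "cn", "permlink": "first-post"},
    {"author": "aafeng", "title": "Second", "body": "World",
     "category": "cn", "permlink": "second-post"},
]

actions = list(build_actions(sample_posts))
print(len(actions), actions[0]["_id"])  # prints 2 first-post
```

With a live connection, the actions would be sent with `helpers.bulk(es, build_actions(posts))` after `from elasticsearch import helpers`.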

Query it in Kibana:

GET hive-posts-index/_search?q=*:*

The output shows that my posts have been saved to ES.

curl can also be used to verify that the data has been saved to ES:

curl http://localhost:9200/hive-posts-index/_search\?size=5

ES can now be used for all kinds of queries, for example:

GET hive-posts-index/_search?q=title:必有
GET hive-posts-index/_search?q=category:hive-105017
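The two features mentioned at the start, facets and keyword highlighting, correspond to aggregations and the `highlight` option in the Elasticsearch query DSL. The following is a sketch of such a request body, not something tested in this experiment: `build_search_body` is a hypothetical helper, and the `category.keyword` field assumes the default dynamic mapping, which adds a `keyword` sub-field to string fields.

```python
def build_search_body(keyword, size=5):
    """Build a search body with title highlighting and a category facet."""
    return {
        "size": size,
        # Full-text match on the title field.
        "query": {"match": {"title": keyword}},
        # Ask ES to return the matched title fragments with markup.
        "highlight": {"fields": {"title": {}}},
        # Count matching posts per category (the "facet").
        "aggs": {
            "by_category": {"terms": {"field": "category.keyword"}}
        },
    }


body = build_search_body("hive")
print(sorted(body))  # prints ['aggs', 'highlight', 'query', 'size']
```

With the Python client used above, the body would be passed as `es.search(index="hive-posts-index", body=body)`.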

#cn-reader #cn-curation #cn-programming #python #elasticsearch
