I.T. Spices The LINUX Way
Python In The Shell: The STEEMIT Ecosystem – Post #111
SCRAPING ALL BLOGS USING PYTHON – THE INITIAL PREPARATIONS
Please refer to Post #110 for the complete python script and the intro of this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way
In this post we will discuss the initial preparations that let our Python script do its job of gathering a lot of data about the blogs of a certain user. We can gather only a few blogs, or we can gather all of them.
Lines 1 to 19 show how we prepare our Python script:
1 #!/usr/bin/python3.6
2
3 ###MODULES
4 import sys, os
5 import shutil
6 import requests
7 import re
8 from bs4 import BeautifulSoup
9 from steem import Steem
10 from steem.post import Post
11
12 ###MAKE TEMP DIR
13 tempdir = '/dev/shm/steemblogs'
14 shutil.rmtree(tempdir, ignore_errors=True)
15 os.mkdir(tempdir)
16
17 ###OPEN TEMP LOGS FILE FOR THE LOGS
18 flogs = open(tempdir + '/templogs', 'a+')
19
Line 1 tells the Linux OS that this is a Python script, to be run by version 3.6 in particular.
Lines 3 to 10 import all the Python modules that we need to perform this task (nothing is downloaded here; the modules must already be installed).
The sys and os modules let Python work with the interpreter and the operating system, so we can go back and forth between Linux shell tasks and Python itself.
The shutil module lets us manipulate files and folders the Python way; of course we can also use the Linux shell for this, but I like to illustrate both ways here.
The requests module lets Python fetch pages over HTTP and HTTPS, which is how websites are served; this is like browsing without using our hands and mouse.
The re module provides regular expressions for searching and manipulating text strings; as in any program, we have to read our automated results when making decisions, so this module will prove very useful.
The BeautifulSoup module is Python’s magnificent way of handling the text from websites: it parses the HTML so we can search it almost as easily as ordinary strings of text; in essence, this module will simplify our examination of the blog as posted on a certain URL.
The steem and steem.post modules are Steem’s Python modules for whatever we want to query from the STEEM blockchain.
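To give a taste of what the re module can do, here is a minimal sketch that splits a blog URL into its parts. The function name split_blog_url and the sample URL are made up for illustration only; they are not part of the script from Post #110.

```python
import re

# Pattern for URLs shaped like https://steemit.com/<tag>/@<user>/<permlink>
BLOG_URL = re.compile(r"^https://steemit\.com/([^/]+)/@([^/]+)/([^/]+)$")

def split_blog_url(url):
    """Return (tag, user, permlink) for a matching URL, or None otherwise."""
    match = BLOG_URL.match(url)
    if match is None:
        return None
    return match.groups()

# Hypothetical URL, used only to show the idea
print(split_blog_url("https://steemit.com/blockchain/@someuser/some-post-title"))
```

One pattern is enough here to check the shape of the URL and to pull out the pieces we care about in a single step.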
Line 13 sets the path of the temporary folder used every time this script does its job; not only is this a very clean way of dealing with results, but it also keeps our program streamlined, as all files pertaining to it can be seen in only one folder.
Line 14 deletes any temporary folder left by a previous run; because of ignore_errors=True, it does not matter whether the folder is present or not; new routine, clean slate, better handling of data.
Line 15 creates the new folder for the present script run, so we can expect a clean folder every time a new instance of this Python script runs.
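The clean-slate idea of lines 13 to 15 can be sketched as a small reusable function. The name fresh_dir and the demo folder are assumptions for illustration; the demo uses the system temporary location instead of /dev/shm so that it can run on any machine.

```python
import os
import shutil
import tempfile

def fresh_dir(path):
    """Delete path if it exists, then create it again as an empty folder."""
    shutil.rmtree(path, ignore_errors=True)  # no error if the folder is missing
    os.mkdir(path)
    return path

# Hypothetical demo folder; the actual script uses /dev/shm/steemblogs
demo = os.path.join(tempfile.gettempdir(), "steemblogs_demo")

fresh_dir(demo)
with open(os.path.join(demo, "stale.txt"), "w") as leftover:
    leftover.write("from a previous run")

fresh_dir(demo)            # the second run wipes the leftover file
print(os.listdir(demo))    # prints []
```

Calling the function twice shows the point of the routine: whatever a previous run left behind is gone before the new run starts.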
Line 18 opens a file in append mode ('a+') in preparation for the log results, so the script can write its results to this log file as well as display them on the monitor screen.
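One possible way to send each result to both the screen and the log file is a small helper like the one below. This is only a sketch: the function name log and the demo file path are assumptions, and the script from Post #110 may do this differently.

```python
import os
import tempfile

def log(message, logfile):
    """Show a message on the screen and append it to the open log file."""
    print(message)
    logfile.write(message + "\n")
    logfile.flush()  # make sure the line reaches the file right away

# Hypothetical log path; the actual script uses tempdir + '/templogs'
demo_path = os.path.join(tempfile.gettempdir(), "templogs_demo")

with open(demo_path, "a+") as flogs:
    log("script started", flogs)
```

Because the file is opened with 'a+', every call adds a new line at the end of the file instead of overwriting what is already there.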
The next post will discuss the ways we will use to acquire the full URL of a blog.
“A Good Recipe Is Like A Good Program; It Needs A Good Preparation…….”