Scrapy - parse steemit post commenters
Recently @Inber started a contest to celebrate reaching 700 followers. The idea of the competition is that every participant must comment on the post with a keyword (I'm in). At the end, a winner will be chosen at random from the participants. Let's formulate the task. We need to obtain all root comments on the post, check each one for the keyword, and generate the list of participants. To obtain all root comments on the post we will create a Scrapy spider using XPath. Please review my previous articles to find out what Scrapy is, how to install it, and how to create an XPath query.
OK, let's start the container created in one of the previous posts.
$ cd ~/Scrapy/
$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy -it scrapy /bin/bash
-it means that we start the container interactively and can run commands inside it. Go to the shared volume directory (/scrapy):
# cd /scrapy
Now we should create an empty Scrapy project and move to the project directory:
# scrapy startproject ContestParticipants
# cd ContestParticipants/
Our next task is to create the base spider that we will use for crawling:
# scrapy genspider commenters steemit.com
All spiders of the project are placed in ContestParticipants/ContestParticipants/spiders. Replace the generated commenters.py with the following text (be aware that there is no editor in the container, so you should edit the file from your base OS in ~/Scrapy/scrapy-data/ContestParticipants/ContestParticipants/spiders):
import scrapy


class SteemitSpider(scrapy.Spider):
    name = 'commenters'

    def start_requests(self):
        # The contest post whose comments we want to collect
        search_url = "https://steemit.com/art/@inber/700-followers-celebration-a-contest-for-my-followers-inside-get-a-free-drawing-from-me"
        yield scrapy.Request(search_url, self.parse)

    def parse(self, response):
        # One node per root comment; replies are not matched by this class value
        for comment in response.xpath(".//*[@class='hentry Comment root']/div[2]"):
            yield {
                # Both queries below are relative to the current comment node
                'author': comment.xpath('div[1]/span/span/span[1]/span[1]/strong/text()').extract_first(),
                'text': comment.xpath('div[2]/div/div/descendant::*/text()').extract(),
            }
In this spider we see three XPath queries. Let me explain what they do.
The expression in for comment in response.xpath(".//*[@class='hentry Comment root']/div[2]") extracts all root comments from the page and splits them into one node per comment.
The next two queries are evaluated relative to each comment and extract the author name and the text of the comment.
You can find more information about spider anatomy in the Scrapy documentation.
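To see how these three queries fit together, here is a minimal sketch that runs the same kind of queries against a tiny hand-made page. It is not Scrapy itself: it uses Python's standard xml.etree.ElementTree, which understands only a subset of XPath, so the text() and descendant::* steps are replaced with .text and a .//* search. The markup and the names alice, bob and carol are invented for illustration; the real Steemit page is much larger and may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (an assumption, not the real page).
PAGE = """
<body>
  <div class="hentry Comment root">
    <div>avatar</div>
    <div>
      <div><span><span><span><span><strong>alice</strong></span></span></span></span></div>
      <div><div><div><p>I'm in!</p></div></div></div>
    </div>
    <div class="hentry Comment reply">
      <div>avatar</div>
      <div>
        <div><span><span><span><span><strong>bob</strong></span></span></span></span></div>
        <div><div><div><p>Good luck!</p></div></div></div>
      </div>
    </div>
  </div>
  <div class="hentry Comment root">
    <div>avatar</div>
    <div>
      <div><span><span><span><span><strong>carol</strong></span></span></span></span></div>
      <div><div><div><p>Congrats!</p><p>Nice work</p></div></div></div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

comments = []
# Query 1: one node per root comment (the reply has a different class value).
for comment in root.findall(".//*[@class='hentry Comment root']/div[2]"):
    # Query 2: author name, relative to the current comment node.
    author = comment.find("div[1]/span/span/span[1]/span[1]/strong")
    # Query 3: text of every element below the comment body
    # (tail text after nested tags is ignored in this sketch).
    text = [
        el.text
        for body in comment.findall("div[2]/div/div")
        for el in body.findall(".//*")
        if el.text
    ]
    comments.append({
        "author": author.text if author is not None else None,
        "text": text,
    })

print(comments)
```

Only alice and carol are returned, because bob's reply does not carry the root class, and carol's text comes back as a list of fragments, just like the 'text' field of the spider.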
The last step is to edit the Dockerfile so that the container runs this code. We have already put the spider into the shared volume (~/Scrapy/scrapy-data, mounted at /scrapy in the container), so now we must add this line to the end of the Dockerfile:
CMD scrapy runspider /scrapy/ContestParticipants/ContestParticipants/spiders/commenters.py -o /scrapy/commenters.json
Now rebuild the image and run the container:
$ cd ~/Scrapy
$ sudo docker build -t content_participants_image .
$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy content_participants_image
The result will be placed in ~/Scrapy/scrapy-data/commenters.json and will look like this:
[
{"author": "gmuxx", "text": ["Fantastic! Well done ", "@inber", " for reaching such a great milestone!", "I'm in - ", "of course I'm in", "...you have an R2D2 picture up for grabs! I am the biggest Star Wars fan on Steemit doncha know? :-D"]},
{"author": "sneakgeekz", "text": ["I'm in- These are so freaking cool. I love your style, the strokes and the use of colors. Congratulations on the milestone. I just signed up about 2 weeks ago and post like these are very encouraging. Thank you for the random giveaway."]},
{"author": "jangaladesigns", "text": ["Congrats Inber!! :)"]},
{"author": "kiaraantonoviche", "text": ["awesome Tardis!"]},
{"author": "jae5086", "text": ["I'm in, so the hardest thing is, which do I choose when I win. I guess it has to be R2. Also, I can use a ", "sound tattoo", " of, well, whatever it is he's saying. I'll just use a clip from the movies.", "Another thing I'm interested in is the learning curve between drawing free hand, and your new device (I forget the specific name, clique or something). My sister has remarkable skills as an artist. She can paint and draw freehand so well, but 15 years of various kinds of abuse has left her...well, she's not good. I'd love to maybe get her something like what you have that may compel her to do more with herself instead of just float through life never utilizing her talents, but I'm afraid she would not commit the time to learn it."]},
{"author": "jamhuery", "text": ["Congratulation ", "@iber", ", I'm in"]},
{"author": "ilaypipe", "text": ["I'm in!"]},
{"author": "animal-shelter", "text": ["I'm in hurry to write this comment;)"]},
{"author": "speckofdust", "text": ["congratulations ", "@inber", "!!! you are the best"]},
{"author": "varunrayas", "text": ["I'm In"]},
{"author": "torem-di-torem", "text": ["I'm in another universe when I look at your works:) Congratulations!"]},
{"author": "nezaigor", "text": ["Good"]},
{"author": "freedomnation", "text": ["Congrats!"]},
{"author": "ertinfagor", "text": ["I'm in :)"]},
{"author": "acromott", "text": ["Iam in!", "\nCongratz on hitting 700!!", "\nI am looking forward to being there with ya! ;)"]},
{"author": "colormesensei", "text": ["I'm in a league of my own!"]},
{"author": "sashin", "text": ["It's been an awesome journey following you thus far. I wish you all the success in the future and may your following continue to grow."]},
{"author": "ryivhnn", "text": ["Congratulations on 7 centuries XD", "(I'll let someone else have a chance at winning, I just wanted to express my happiness for you ;)"]}
]
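The spider only collects the comments, so the keyword check and the draw still have to be done on this file. Here is a rough sketch of one way to do it, not the contest's actual procedure. The function names load_comments, find_participants and pick_winner are my own, the regular expression is just one possible reading of the "I'm in" rule, and the sample records are copied from the output above.

```python
import json
import random
import re

# "I'm in" in any letter case, with a straight, curly (\u2019) or backtick apostrophe.
KEYWORD = re.compile(r"\bi['\u2019`]m in\b", re.IGNORECASE)


def load_comments(path):
    """Read the JSON file written by the spider (the path is up to you)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_participants(comments):
    """Return authors whose comment contains the keyword, without duplicates."""
    participants = []
    for comment in comments:
        author = comment.get("author")
        # 'text' is a list of fragments, so join it before searching.
        text = "".join(comment.get("text") or [])
        if author and KEYWORD.search(text) and author not in participants:
            participants.append(author)
    return participants


def pick_winner(participants, seed=None):
    """Pick one participant at random; a fixed seed makes the draw repeatable."""
    return random.Random(seed).choice(participants)


# A few records copied from the output above.
sample = [
    {"author": "gmuxx", "text": ["Fantastic! Well done ", "@inber", " for reaching such a great milestone!", "I'm in - ", "of course I'm in", "...you have an R2D2 picture up for grabs! I am the biggest Star Wars fan on Steemit doncha know? :-D"]},
    {"author": "jangaladesigns", "text": ["Congrats Inber!! :)"]},
    {"author": "animal-shelter", "text": ["I'm in hurry to write this comment;)"]},
    {"author": "varunrayas", "text": ["I'm In"]},
    {"author": "acromott", "text": ["Iam in!", "\nCongratz on hitting 700!!", "\nI am looking forward to being there with ya! ;)"]},
]

participants = find_participants(sample)
print(participants)
print(pick_winner(participants))
```

Note that a plain text match cannot judge intent: it counts "I'm in hurry to write this comment;)" as an entry and misses "Iam in!", so the final list probably still needs a quick look by a human.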
That is all I wanted to show in this post. In the next post we will create a GitHub repository and add some functionality to this project. If you have questions, please feel free to ask them in the comments below.
Thank You! :)