A way to parse Steemit (scraping a dynamically generated frontend)

in #steemit, 7 years ago

Hello! 

Today's story is about scraping a dynamically generated frontend. Modern web technologies run part of the code on the client (browser) side. This makes websites more flexible, reduces server-side load, and allows content to be loaded dynamically.

If you have heard of ReactJS, Angular, Ember, or Backbone, these are all tools for building dynamically generated frontends. Steemit, for example, is written in ReactJS. But what is an advantage for users is a disadvantage for scrapers. In Steemit's case, only a few articles of a user's blog are loaded initially; to see more, you must scroll down the page. An action that is simple for a user is not so simple for a spider.

To deal with this problem, scraping frameworks interact with browser automation tools. The best known tool of this kind is Selenium, whose primary purpose is automated testing of web pages. But this method needs an active browser. An alternative is Splash. Splash is a browser without a GUI, packaged in a Docker container and controlled through an HTTP API. Just as important, Scrapy has a plugin for Splash. Here is an example from the Scrapy-Splash documentation that shows how to integrate Splash:


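import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder name; Scrapy requires every spider to have one
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        # Route every start URL through Splash and wait 0.5 s for scripts to run.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
        pass  # placeholder so the example is valid Python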
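
For a spider like this to actually send its requests through Splash, the plugin also has to be enabled in the project settings. The fragment below is a minimal sketch of such a settings.py, following the setup described in the Scrapy-Splash README. The Splash address is an assumption (a container running locally on the default port 8050); adjust it to wherever Splash is running.

```python
# settings.py (sketch): enabling scrapy-splash in a Scrapy project.

# Assumption: Splash runs locally on its default port.
SPLASH_URL = "http://localhost:8050"

# Middlewares that turn SplashRequest objects into calls to the Splash HTTP API.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves space by not storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

Splash itself is typically started with something like `docker run -p 8050:8050 scrapinghub/splash` before the spider is run.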

Looks great. 
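
Since Splash is controlled through an HTTP API, it can also be tried without Scrapy at all. The sketch below builds a request to the render.html endpoint (the same one the comment in the spider mentions) using only the standard library. The helper names build_render_url and fetch_rendered_html are made up for this illustration, and the Splash address is again assumed to be a local container on port 8050.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base, target_url, wait=0.5):
    """Return the Splash render.html URL that renders target_url.

    `wait` is the number of seconds Splash waits after the page loads,
    giving client-side scripts time to run.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_base, target_url, wait=0.5):
    """Ask a running Splash instance for the browser-processed HTML."""
    with urlopen(build_render_url(splash_base, target_url, wait)) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (needs a running Splash instance, so it is left commented out):
# html = fetch_rendered_html("http://localhost:8050", "http://example.com")
```

This is roughly what the plugin does for each SplashRequest; using the plugin keeps the scheduling, deduplication, and callbacks inside Scrapy.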

In the next article we will try to parse Steemit using Scrapy-Splash.
