Shock! Python has ranked the first programming language?(Scratch from O'Reilly)

in #deeplearning7 years ago

Self-intro: I am a graduate student at an unnamed institution in China :) The main focus is Computer vision using deep learning. I will update some of notes about deep learning at steemit. Hope that like-minded friends follow me, and we discuss and support with each other. Thanks for reading! @whytin

Python has ranked the first programming language?(who know, I don't know)
So, let's explore it!

Finding the most popular category of books in O'Reilly

Overview

Refer to "Data Science From Scratch", I want to explore which category of books is the largest quantity, and I will recommend you to start studying which kind of books when you are still confused with your future.

programming_book.png

book.png

You can fork the project by Github:
Github: http://github.com/whytin/book_scratch

Preparation

Tool

  • BeautifulSoup4(a python library designed for dissecting a doucument into a parse tree, we can extract what we need esaily);
    Refer to : http://www.crummy.com/software/BeautifulSoup/
  • htmll5lib(a popular Python parser to handle the HTML format);
  • requests(make a HTTP request)

Environment

  • Linux Mint 18.1 (Unlimited)
  • Python 2.7
  • Sqlite3

Foundation

  • Python
  • HTML
  • Matplotlib
  • SQL

Start

Scratch admited

Before you start the project, make sure your target is open to scratch.
Like O'Reilly: http://oreilly.com/terms/
Glance over the page I have not found some issues with banning the scratch.
Then we look over the robots.txt file. http://shop.oreilly.com/robots.txt
We found that :

Crawl-delay: 30
Request-rate: 1/30

It means that we should delay 30s between two requests.

Parsing the page

If you know well with HTML, it is easy for you to find out the tags.
First, you can select category of data through Browse Subjects

Second, use the developer tools.
Paste_Image.png

It is wise to use the button of Select an element in the page to inspect it ,and then find out the tag
We can extract the title, authors, date, isbn, price of the book.
Do yourself , you will fall in curiousity.

Coding

from bs4 import BeautifulSoup
import requests
# Making a request of url and send to BeautifulSoup parsing with html5lib.
url = "http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page=1"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
tds = soup('td', 'thumbtext')

We found book's title involved the a tag of

, and extract it.

titles = [td.find("div", "thumbheader").a.text for td in tds]

And we can build the function of book_info()

# In order to extract the book information like title, authors, isbn, date, price. Return a dict.
def book_info(td):
    title = td.find("div", "thumbheader").a.text
    authors = td.find('div', 'AuthorName').text
    isbn_link = td.find("div", "thumbheader").a.get("href")
    isbn = re.match("/product/(.*)\.do", isbn_link).group(1)
    date = td.find("span", "directorydate").text.strip()
    price = td('span', 'pricelabel')[0].find('span', 'price')
    
    return {
            "tilte": title,
            "authors": authors,
             "isbn": isbn,
             "date":date,
             "price":price  }

Scratching:

from bs4 import BeautifulSoup
import requests
import re
from time import sleep
base_url = "http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page="
books=[]
NUM_PAGES = 44

for page_num in range(1, NUM_PAGES + 1):
    url = baseurl + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')
    for td in soup('td', 'thumbtext'):
        books.append(book_info(td))
    sleep(30)

Visualization:

import matplotlib as plt
def get_year(book):
    return int(book["date"].split()[1])

# Counter(): dict subclass for counting hashable objects
years_counts = Counter(get_year(book) for book in books if get_year(book) <= 2016)
years = sorted(years_counts)
book_counts = [year_counts[year] for year in years]
plt.plot(years, book_counts)
plt.show()

Summary

It is the brief induction of usage of python scratching, using BeautifulSoup and Matplotlib. You can also scratching Amazon website or whatever you want to obtain. Remember you are risk in data scratching, square up your behavior in Internet.

Thanks for reading!

Sort:  

Congratulations @whytin! You have completed some achievement on Steemit and have been rewarded with new badge(s) :

You published your First Post
You made your First Vote
You made your First Comment
You got a First Vote
Award for the number of upvotes received

Click on any badge to view your own Board of Honor on SteemitBoard.
For more information about SteemitBoard, click here

If you no longer want to receive notifications, reply to this comment with the word STOP

By upvoting this notification, you can help all Steemit users. Learn how here!