Google Trends Analysis: Determining Correlation of Two Search Terms


Hi all,

Today I finally got around to creating a front end for a program that I wrote a few months ago.  I'll first post some instructions on how to use it and then I'll briefly discuss how it works for those of you that are interested.

First you're going to want to head on over to Google Trends.  Type in the first search term you want to compare and hit enter.  Now you will see the search trend for the term you entered.  Notice that it lets you filter searches by location, time, category, and search type, so play around with those if you want.  Next, click on Compare and enter the search term you want to compare the first one to.  Hit enter again and you'll see the graph now has two lines, one for each search term.

![Gtrends.png]()

It bothered me that the trends tool allows you to filter by so much but doesn't provide the correlation coefficient.  This is where my new web app comes in handy.  Copy the trends URL from the page where you just compared two search terms, then head on over to my app at http://deelawn.ninja/gtrend/.  Paste the URL into the input box, click the Go! button, and wait for a bit as it retrieves the correlation coefficient for the search terms you chose.

Quick note: if you've just tried this you probably noticed that it's pretty slow.  This is due to the time it takes to execute the JavaScript and render the page, so I'm not sure there is a way to make it any faster until Google decides to release an API for it.

So what does this correlation coefficient mean?  The coefficient ranges from -1 to 1, with 1 being a perfectly positive correlation and -1 being a perfectly negative correlation.  A perfectly positive correlation means that the two search terms move in lockstep: their popularity rises and falls at the same time and in proportion.  A perfectly negative correlation means that as the popularity of one term falls, the other rises proportionally.  A correlation of zero indicates that there is no linear relationship between the two terms, though they could still be related in some other way.  In short, the closer the correlation coefficient is to either -1 or 1, the more strongly correlated the search terms are.
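To make those three cases concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python.  I'm assuming Pearson here because it's the most common definition; the interest values are made up for illustration and the function name `pearson` is just mine.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each point from its series mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly interest scores (not real Google Trends data)
term_a = [10, 20, 30, 40, 50]
same_shape = [15, 25, 35, 45, 55]   # rises exactly when term_a rises
mirror = [50, 40, 30, 20, 10]       # falls exactly when term_a rises
unrelated = [30, 10, 50, 10, 30]    # no linear trend relative to term_a

print(pearson(term_a, same_shape))  # perfectly positive
print(pearson(term_a, mirror))      # perfectly negative
print(pearson(term_a, unrelated))   # no linear relationship
```

Real search data will almost never land exactly on -1, 0, or 1; these series were picked so the arithmetic comes out clean.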

How does it work?  The program I wrote collects the data points for each of the search terms.  It then normalizes the data and feeds it into the pandas Python library to compute the correlation coefficient.
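A rough sketch of that normalize-then-correlate step might look like the following.  This is not the actual PyGTrend code: the min-max scaling, the column names, and the numbers are all assumptions for illustration, and `Series.corr` is used with its default Pearson method.

```python
import pandas as pd

def min_max_normalize(series):
    """Rescale a series to the range [0, 1] (assumes it is not constant)."""
    span = series.max() - series.min()
    return (series - series.min()) / span

# Hypothetical data points for two search terms
raw = pd.DataFrame({
    "term_a": [12, 30, 45, 60, 80, 100],
    "term_b": [5, 9, 20, 22, 31, 40],
})

# apply() runs the function on each column in turn
normalized = raw.apply(min_max_normalize)

# Series.corr defaults to the Pearson method
r = normalized["term_a"].corr(normalized["term_b"])
print(round(r, 3))
```

One thing worth knowing: Pearson correlation doesn't change when a series is shifted or scaled by a positive factor, so this kind of normalization makes the two series easier to plot together but gives the same coefficient as the raw values.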

Check out my GitHub if you want your Python programs to be able to interact with Google Trends: https://github.com/deelawn/PyGTrend

Now I'm going to ramble a bit...

I originally wrote the core of the backend about four months ago because I was curious to know the correlation of a few different search terms.  I started by looking at the source code for the trends graph.  I wrote a script to parse out all the data points and graphed them, but the graph I produced didn't look the same as the graph in my browser.  Weird.  Then I realized that all the X values were correct but the Y values were all inverted, so I inverted them and ran my script again to produce a matching graph.
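For anyone who hits the same inverted-graph problem, here is a small sketch of the flip.  It assumes the Y values were pixel-style coordinates, where y grows downward from the top of the chart; that is a common convention for on-screen graphics, but it's a guess about the cause here.  The `height` parameter and the sample points are hypothetical.

```python
def flip_y(points, height=None):
    """Convert (x, y) points from a y-grows-downward system to y-grows-upward.

    If height is not given, the largest y value is used as the chart height.
    """
    if not points:
        return []
    if height is None:
        height = max(y for _, y in points)
    return [(x, height - y) for x, y in points]

# Hypothetical pixel coordinates: the line climbs on screen as y shrinks
pixel_points = [(0, 180), (1, 120), (2, 20)]
print(flip_y(pixel_points, height=200))  # [(0, 20), (1, 80), (2, 180)]
```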

The next part was parsing the HTML.  I had recently finished a personal project for which I used the Beautiful Soup library to parse Wikipedia articles, so I decided to put it to use again to parse the trends data.  I started examining the structure and coding it up.  Then it was time for a quick test.  Run.  No data returned.  Wut?  After some thought I decided that Google probably wasn't serving me the content because it knew I wasn't a browser.  Okay, easy enough, I'll just send all the necessary headers with my GET request to trick Google into thinking I'm its darling, yet gluttonous son, Chrome.  Google isn't tricked so easily, or so I thought...

It turned out that the real reason I wasn't able to get the data I needed was that the JavaScript that runs when the page loads is responsible for creating all of the content I was after.  Hmmmm... I had never run into this problem before so I did some research.  I happened upon Selenium and PhantomJS.  I was able to use PhantomJS to execute the JavaScript as my browser normally would and collect the data.  Even though it's slow, it still gets the job done.
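Here is a minimal sketch of the "let a real browser engine run the JavaScript first" idea.  It is not my original code: PhantomJS development has since been suspended, so this version swaps in Selenium with headless Chrome, which needs the third-party `selenium` package and a local Chrome install.  The helper names `fetch_rendered_html` and `make_headless_chrome` are made up for this example.

```python
def fetch_rendered_html(driver, url):
    """Load url in a browser driver and return the DOM after scripts have run.

    driver is any object with get(), page_source, and quit(), such as a
    Selenium WebDriver. The driver is always closed, even if loading fails.
    """
    try:
        driver.get(url)
        # A page that loads its data asynchronously may need an explicit
        # wait here before page_source contains the content you want.
        return driver.page_source
    finally:
        driver.quit()

def make_headless_chrome():
    """Build a headless Chrome driver (assumes selenium and Chrome are installed)."""
    from selenium import webdriver  # imported lazily: third-party dependency

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

Calling `fetch_rendered_html(make_headless_chrome(), url)` returns the page as the browser sees it after rendering, and that string can then be handed to Beautiful Soup as usual.  Starting a browser for every request is what makes this approach slow.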

The remaining tasks to complete this weren't too difficult; it was really just figuring out how the URLs need to be built and doing a little research into the pandas library to compute the correlation coefficient.  Today was pretty painful though, as I have practically no front-end dev experience.  But I'm glad I did it.  Not everyone wants to use something that needs to be run from the command line.
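On the URL side, here is a small sketch of pulling the two search terms back out of a pasted compare URL using only the standard library.  The URL layout is an assumption: it supposes the terms sit in a comma-separated `q` query parameter, which may not match what Google Trends used when I wrote the app, and `search_terms_from_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms_from_url(url):
    """Return the search terms found in the q parameter of a trends-style URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)        # also decodes %2C, %20, and so on
    raw = query.get("q", [""])[0]
    # Assumes terms are comma-separated and contain no commas themselves
    return [term.strip() for term in raw.split(",") if term.strip()]

# Hypothetical example URL
example = "https://trends.google.com/trends/explore?geo=US&q=python,java"
print(search_terms_from_url(example))  # ['python', 'java']
```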

Let me know if you have any questions, find any bugs, or have any suggestions for improvement.

Thanks for reading,
deelawn