Scrape Instagram Data: Python Practices and Tools
Imagine that you need data from Instagram to understand trends, analyze engagement, or gather insights for marketing strategies—but scraping Instagram isn't exactly straightforward. With its anti-bot measures and login requirements, getting that data can feel like navigating a maze. But don't worry; there’s an approach that can save you time and effort. Let’s dive into how you can efficiently scrape Instagram data using Python.
Set Up Your Tools
Before diving into the code, make sure you have the necessary Python libraries installed:
pip install requests python-box
- Requests: The workhorse for making HTTP requests.
- Python-box: Makes dealing with complex JSON data easier by converting it into Python objects that you can access using dot notation.
Now, let's break this down into digestible chunks: sending API requests, parsing the data, using proxies, and simplifying the JSON handling with Box. This is where the magic happens.
Step 1: Build the API Request
Instagram hides much of its data behind complex front-end security, but the backend? That's a different story. Instagram’s backend API allows us to access detailed profile information without needing to authenticate. Here's how to get that information.
Explanation:
- Headers: Instagram can detect bot-like activity by analyzing request headers. By mimicking a real browser—and sending the x-ig-app-id header that identifies Instagram’s own web app—we make the request look like ordinary web traffic.
- API Endpoint: The URL
https://i.instagram.com/api/v1/users/web_profile_info/?username={username}
is your goldmine. It returns everything you need about a public profile, from follower counts to bio details.
Here’s how you can set it up in Python:
import requests
# Headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
# Replace this with the username you want to scrape
username = 'testtest'
# Send the request to Instagram's backend
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json() # Parse the response into a JSON object
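Before trusting response.json(), it’s worth guarding against non-200 replies—Instagram answers rate-limited or blocked requests with error codes, and calling .json() on an error page raises a confusing exception. A minimal sketch (the helper name is ours, not part of any library):

```python
def parse_profile_response(response):
    """Parse the profile payload, failing loudly on non-200 replies."""
    if response.status_code != 200:
        raise RuntimeError(
            f"Instagram returned HTTP {response.status_code}; "
            "the profile may not exist, or you may be rate-limited."
        )
    return response.json()

# Usage: response_json = parse_profile_response(response)
```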
Step 2: Bypass Rate-Limiting with Proxies
Instagram isn’t a fan of repeated requests from the same IP address. So, if you’re scraping on a large scale, proxies are your best friend. They rotate your requests through different IPs, reducing the chances of detection.
Setting Up Proxies:
proxies = {
    # Most proxy providers expect the http:// scheme even for HTTPS traffic
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
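For larger jobs a single proxy isn’t enough. One simple pattern is to rotate through a pool with itertools.cycle—the addresses below are placeholders for whatever your proxy provider gives you:

```python
from itertools import cycle

# Placeholder pool; substitute your provider's proxy endpoints
proxy_pool = cycle([
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
    "http://user:pass@203.0.113.12:8080",
])

def next_proxies():
    """Build a requests-style proxies dict from the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Each call rotates to the next IP:
# requests.get(url, headers=headers, proxies=next_proxies())
```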
Step 3: Parsing JSON with Ease Using Box
Instagram’s API returns complex, nested JSON data. Navigating this with traditional dictionary access can be a pain. Enter Box: This library turns JSON into an object that you can access with simple dot notation, making data extraction a breeze.
Using Box for Simplicity:
from box import Box
response_json = Box(response.json()) # Convert the response into a Box object
# Extract profile data
user_data = {
    'full_name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'biography': response_json.data.user.biography,
    'profile_pic_url': response_json.data.user.profile_pic_url_hd,
}
Step 4: Scrape Videos and Timeline Data
Once you have profile data, it's time to scrape Instagram posts and videos. The data includes view counts, likes, comments, and even video durations.
Here’s how to extract the timeline data:
# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'video_url': element.node.video_url,
        'view_count': element.node.video_view_count,
    }
    profile_video_data.append(video_data)
# Extract timeline media (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'media_url': element.node.display_url,
        'like_count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
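With the media list in hand, you can already compute simple engagement numbers. A sketch using hypothetical records in the same shape as those collected above:

```python
# Hypothetical records in the shape produced by the timeline loop
timeline_sample = [
    {"id": "1", "media_url": "https://example.com/a.jpg", "like_count": 120},
    {"id": "2", "media_url": "https://example.com/b.jpg", "like_count": 80},
    {"id": "3", "media_url": "https://example.com/c.jpg", "like_count": 100},
]

total_likes = sum(post["like_count"] for post in timeline_sample)
avg_likes = total_likes / len(timeline_sample)
print(f"{len(timeline_sample)} posts, {total_likes} total likes, {avg_likes:.0f} avg")
```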
Step 5: Writing Data to JSON Files
Once you've gathered the data, it’s time to store it. Python’s json module lets you easily write the data to files, ready for further analysis.
import json
# Save user data to JSON
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data to JSON
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

# Save timeline media data to JSON
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
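One detail worth knowing when writing these files: json.dump escapes non-ASCII characters by default, which turns emoji-laden bios into \uXXXX sequences. Passing ensure_ascii=False keeps the output human-readable—a small sketch with made-up profile data:

```python
import json

# Names and bios often contain emoji or accented characters;
# ensure_ascii=False keeps them readable instead of \uXXXX escapes
profile = {"full_name": "Müller ✨", "biography": "café"}

text = json.dumps(profile, indent=4, ensure_ascii=False)
```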
Full Code Example
Now that you have all the building blocks, here’s the complete script that scrapes Instagram user profile data, video data, and timeline media, using proxies and handling the complexities of the data format:
import requests
from box import Box
import json
# Define headers and proxies as before...
# Send the request and parse the response
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
response_json = Box(response.json())
# Extract user data, videos, and timeline media as shown earlier...
# Save the extracted data to JSON files
Final Thoughts
Scraping Instagram data with Python is a powerful skill that can help you gather valuable insights—whether you’re tracking user engagement, understanding influencer activity, or analyzing trends. Remember, though, to comply with Instagram’s terms of service. Always scrape responsibly and ensure that your efforts align with their policies.