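Instagram Scraping Options

Compiled: 2026-02-26
Type: Technical research

Background

The knowledge-base system can currently fetch:
- ✅ Threads posts (using Playwright)
- ✅ Ordinary web pages (using web_fetch)

It cannot yet fetch:
- ❌ Instagram Reels and post content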

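Challenges of Scraping Instagram

Limits of the Official API

Using the official Instagram API means:
- ❌ A Facebook Business Account is required
- ❌ The app must pass review (takes several days)
- ❌ OAuth 2.0 must be implemented (complex)
- ❌ Strict rate limits apply (200 calls/hour on the basic tier)
- ❌ Only connected accounts can be accessed
- ❌ The API changes often

Conclusion: for simple content fetching, the official API is more complex than the task warrants.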
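To put the 200 calls/hour figure in perspective, the pacing arithmetic can be sketched as follows. The helper name `min_interval_seconds` is hypothetical and only illustrates the calculation; it is not part of any Instagram SDK.

```python
def min_interval_seconds(calls_per_hour: int) -> float:
    """Smallest average gap between calls that stays within an hourly quota."""
    if calls_per_hour <= 0:
        raise ValueError("calls_per_hour must be positive")
    return 3600 / calls_per_hour


print(min_interval_seconds(200))  # 18.0
```

Under that quota, a client would need to average at least 18 seconds between calls.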

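Challenges of Web Scraping

Instagram requires login to view full content
Content is loaded dynamically (JavaScript rendering)
Anti-scraping mechanisms are in place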

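Five Candidate Solutions

Option 1: Instaloader ⭐⭐⭐⭐⭐ (recommended)

Overview:
- Mature Python package
- Purpose-built for downloading Instagram photos, videos, and metadata
- Open source and actively maintained

Features:
- ✅ Downloads Reels, posts, and Stories
- ✅ Extracts captions, hashtags, likes, and comments
- ✅ Supports public accounts, and private accounts when logged in with an account that follows them
- ✅ Command-line tool plus Python API

Installation: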

pip install instaloader

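Basic usage:

import instaloader

# Initialize
L = instaloader.Instaloader()

# Fetch a single Reel (requires its shortcode)
# Example: https://www.instagram.com/reel/DUp-e9zkwRv/
# The shortcode is DUp-e9zkwRv
post = instaloader.Post.from_shortcode(L.context, "DUp-e9zkwRv")

# Read metadata
print(post.caption)    # caption text
print(post.likes)      # like count
print(post.video_url)  # video URL

# Download the video
L.download_post(post, target="reels")

Advantages:
- ✅ Simple to use
- ✅ Thorough documentation
- ✅ Good community support
- ✅ Can fetch public content without logging in
- ✅ Can fetch more data after logging in

Disadvantages:
- ⚠️ Instagram may block requests that are too frequent
- ⚠️ Rate limits must be handled

Best suited for:
- Periodically fetching a small amount of content
- Cases that need complete metadata
- Cases where logging in to get more data is acceptable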
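Since `Post.from_shortcode` takes a shortcode rather than a URL, a small helper can pull it out of a link. This is a minimal sketch: the function name `extract_shortcode` is hypothetical, and the set of path prefixes it recognizes (`reel`, `reels`, `p`, `tv`) is an assumption about common Instagram URL shapes.

```python
from urllib.parse import urlparse


def extract_shortcode(url: str) -> str:
    """Return the path segment that follows /reel/, /reels/, /p/ or /tv/."""
    # Work on the path only, so query strings and fragments are ignored.
    parts = [part for part in urlparse(url).path.split("/") if part]
    for kind in ("reel", "reels", "p", "tv"):  # assumed URL prefixes
        if kind in parts:
            index = parts.index(kind)
            if index + 1 < len(parts):
                return parts[index + 1]
    raise ValueError(f"no shortcode found in URL: {url}")


print(extract_shortcode("https://www.instagram.com/reel/DUp-e9zkwRv/?igsh=abc"))
# DUp-e9zkwRv
```

Parsing the path instead of splitting the raw string keeps share-link parameters out of the shortcode.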

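Option 2: Reverse-Engineering GraphQL ⭐⭐⭐⭐

Overview:
- Mimics browser behavior
- Uses Instagram's internal GraphQL API
- Fetches public content without logging in

How it works:
1. The Instagram web client uses a GraphQL API
2. Inspect the Network requests in the browser's DevTools
3. Identify the GraphQL endpoint and its parameters
4. Replay the request with Python requests

Key finding:

POST https://www.instagram.com/graphql/query

Python example: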

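import json

import requests

def get_reel_data(shortcode):
    url = "https://www.instagram.com/graphql/query"

    # Browser-like headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "X-IG-App-ID": "936619743392459",  # Instagram web app ID
    }

    # GraphQL query parameters
    params = {
        "query_hash": "...",  # must be copied from DevTools
        "variables": json.dumps({"shortcode": shortcode}),
    }

    # Note: the endpoint is listed above as POST, but this query_hash lookup
    # is sent as a GET with query-string parameters; use whichever method
    # DevTools shows for the request being replicated.
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()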

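Advantages:
- ✅ No login required
- ✅ No API key required
- ✅ The returned data can be tailored to what is needed
- ✅ Lightweight (fetches only the data that is needed)

Disadvantages:
- ⚠️ Requires reverse engineering (finding the query_hash)
- ⚠️ May break when Instagram updates
- ⚠️ Needs ongoing maintenance

Best suited for:
- Teams with strong technical skills
- Customized fetching
- Public content only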
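Because the response format can change when Instagram updates, it may help to read nested fields defensively instead of indexing directly. The sketch below uses a hypothetical helper, `dig`, and a made-up sample payload; the key names in the sample are assumptions for illustration, not a documented schema.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key, returning `default` when any level is missing."""
    for key in keys:
        if isinstance(data, dict) and key in data:
            data = data[key]
        else:
            return default
    return data


# Hypothetical payload shape, for illustration only.
sample = {"data": {"shortcode_media": {"owner": {"username": "someone"}}}}

print(dig(sample, "data", "shortcode_media", "owner", "username"))  # someone
print(dig(sample, "data", "shortcode_media", "video_url", default=""))  # prints an empty line
```

A missing key then yields a default value rather than a `KeyError`, which makes format drift easier to detect and log.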

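Option 3: Apify Instagram Scraper ⭐⭐⭐

Overview:
- Commercial cloud scraping service
- Provides an Instagram Reel Scraper API
- Billed by usage

Features:
- ✅ Fetches Reels, posts, and Stories
- ✅ Extracts text, likes, shares, views, and transcripts
- ✅ Extracts hashtags, mentions, and comments
- ✅ Provides a REST API

Usage: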

from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")

# Run the Instagram Reel Scraper
# (input field names depend on the actor's input schema; check the actor page)
run_input = {
    "urls": ["https://www.instagram.com/reel/DUp-e9zkwRv/"],
}

run = client.actor("apify/instagram-reel-scraper").call(run_input=run_input)

# Read the results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

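Advantages:
- ✅ Maintenance-free (Apify handles updates)
- ✅ Stable and reliable
- ✅ Handles rate limits and anti-scraping measures
- ✅ Runs in the cloud (no local resources used)

Disadvantages:
- ⚠️ Paid service (the free quota is limited)
- ⚠️ Depends on a third-party service

Pricing:
- Free plan: limited quota
- Paid plans: billed by usage

Best suited for:
- Projects with sufficient budget
- Cases that need a stable service
- Teams that do not want to maintain a scraper themselves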

Option 4: DrissionPage Browser Automation ⭐⭐⭐

Overview:
- Uses browser automation (similar to Playwright)
- Mimics real user behavior
- Supports dynamically loaded content

Python example:

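from DrissionPage import ChromiumPage

def scrape_instagram_reel(url):
    page = ChromiumPage()
    page.get(url)

    # Wait for the document to finish loading. This assumes DrissionPage 4.x,
    # where wait.doc_loaded() is available; wait.load_start() only waits for
    # a navigation to begin.
    page.wait.doc_loaded()

    # Extract data. '.caption' and '.likes' are placeholder selectors;
    # replace them with selectors taken from the live page.
    caption = page.ele('.caption').text
    likes = page.ele('.likes').text

    return {
        'caption': caption,
        'likes': likes
    }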

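Advantages:
- ✅ Handles JavaScript-rendered content
- ✅ Mimics a real user, so it is harder to detect
- ✅ Can take screenshots and interact with the page

Disadvantages:
- ⚠️ Requires a browser to be installed
- ⚠️ Slower to run (a browser must be launched)
- ⚠️ Higher resource consumption
- ⚠️ May need to handle login

Best suited for:
- Complex page interactions
- Cases that need screenshots
- Local execution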
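Text scraped from a rendered page, such as the `likes` value in the example above, arrives as a display string rather than a number. The sketch below converts such strings to integers. The helper name `parse_count` is hypothetical, and the input formats it handles ("1,234 likes", "12.5K") are assumptions about how an English-language page might render counts.

```python
import re
from typing import Optional

# Multipliers for abbreviated counts such as "12.5K" or "1.2M".
_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text: str) -> Optional[int]:
    """Convert a display string like '1,234 likes' or '12.5K' to an int."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([kmb])?\b", text.strip(), re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return int(round(number * _SUFFIXES.get(suffix, 1)))


print(parse_count("1,234 likes"))  # 1234
print(parse_count("12.5K"))        # 12500
```

Returning `None` for unparseable text lets the caller distinguish "no number found" from a real zero.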

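Option 5: Session ID + requests ⭐⭐

Overview:
- Uses the session cookie of an Instagram account
- Fetches the page directly with requests
- Parses the JSON data

Steps:
1. Log in to Instagram in a browser
2. Copy the sessionid value from the cookies
3. Send requests with the sessionid attached

Python example: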

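import requests

def get_reel_with_session(shortcode, sessionid):
    url = f"https://www.instagram.com/p/{shortcode}/?__a=1&__d=dis"

    # Session cookie copied from a logged-in browser
    cookies = {
        'sessionid': sessionid
    }

    headers = {
        "User-Agent": "Mozilla/5.0 ...",
    }

    response = requests.get(url, cookies=cookies, headers=headers, timeout=30)
    response.raise_for_status()
    # response.json() raises if an HTML login page comes back instead of JSON
    data = response.json()

    return data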

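Advantages:
- ✅ Simple and direct
- ✅ Can fetch private accounts (if the logged-in account follows them)
- ✅ Lightweight

Disadvantages:
- ⚠️ Requires a logged-in account (carries risk)
- ⚠️ The sessionid expires
- ⚠️ Scraping too often may get the account locked
- ⚠️ The JSON response format of this endpoint may change

Best suited for:
- Small-volume fetching
- Cases where an Instagram account already exists
- Short-term use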
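Since the sessionid expires, it is useful to notice when a response no longer looks like an authenticated JSON reply. The following is a heuristic sketch, not a documented Instagram contract: the function name `looks_logged_out` is hypothetical, and the three signals it checks (401/403 status, a redirect to a login path, a non-JSON content type) are assumptions.

```python
def looks_logged_out(status_code: int, final_url: str, content_type: str) -> bool:
    """Heuristic: True when a response suggests the sessionid is no longer valid."""
    if status_code in (401, 403):
        return True
    # Assumption: an expired session is redirected to a login page.
    if "/accounts/login" in final_url:
        return True
    # Assumption: a valid session returns JSON rather than an HTML page.
    return "application/json" not in content_type.lower()


# With a requests.Response object named `response`, the call would be:
# looks_logged_out(response.status_code, response.url,
#                  response.headers.get("Content-Type", ""))
print(looks_logged_out(200, "https://www.instagram.com/accounts/login/", "text/html"))  # True
```

Checking this before calling `response.json()` gives a clearer error than a JSON decoding failure.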

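Comparison Table

Option	Difficulty	Cost	Stability	Speed	Rating
Instaloader	Low	Free	High	Medium	⭐⭐⭐⭐⭐
GraphQL reverse engineering	High	Free	Medium	Fast	⭐⭐⭐⭐
Apify	Low	Paid	Very high	Fast	⭐⭐⭐
DrissionPage	Medium	Free	Medium	Slow	⭐⭐⭐
Session ID	Medium	Free	Low	Fast	⭐⭐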

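Integration Recommendations for the Knowledge-Base System

Recommended Option: Instaloader

Why Instaloader:
1. ✅ Simple to use (similar to the existing Playwright script)
2. ✅ Free and open source
3. ✅ Mature and stable
4. ✅ Can be extended with Python scripts
5. ✅ Thorough documentation

Implementation Steps

1. Install Instaloader

cd /home/allen/.openclaw/workspace/knowledge-base
npm install  # package.json already exists
pip3 install instaloader

2. Create fetch-instagram.py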

#!/usr/bin/env python3
import sys
import json
from urllib.parse import urlparse

import instaloader

def fetch_instagram_reel(url):
    """
    Extract the shortcode from an Instagram URL and fetch the post data.
    URL format: https://www.instagram.com/reel/DUp-e9zkwRv/
    """
    # Extract the shortcode from the URL path so that a trailing slash or a
    # query string (for example ?igsh=...) does not end up in the result
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    if not path_parts:
        print(f"Error: no shortcode found in URL: {url}", file=sys.stderr)
        sys.exit(1)
    shortcode = path_parts[-1]

    # Initialize
    L = instaloader.Instaloader()

    try:
        # Fetch the post
        post = instaloader.Post.from_shortcode(L.context, shortcode)

        # Collect the fields of interest
        result = {
            "author": post.owner_username,
            "timestamp": post.date_utc.isoformat(),
            "caption": post.caption if post.caption else "",
            "likes": post.likes,
            "comments": post.comments,  # comment count
            "is_video": post.is_video,
            "video_url": post.video_url if post.is_video else None,
            "url": url,
        }

        print(json.dumps(result, ensure_ascii=False, indent=2))

    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 fetch-instagram.py <instagram-url>", file=sys.stderr)
        sys.exit(1)

    fetch_instagram_reel(sys.argv[1])

3. Test Run

cd knowledge-base
python3 fetch-instagram.py "https://www.instagram.com/reel/DUp-e9zkwRv/"

4. Integrate into the Existing Workflow

Used the same way as fetch-threads.js
Outputs JSON, which is easy to convert into Markdown
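The JSON-to-Markdown step could look like the sketch below. It assumes the record has the fields printed by fetch-instagram.py (`author`, `timestamp`, `caption`, `likes`, `comments`, `url`); the function name `to_markdown` and the Markdown layout are hypothetical choices, not an existing knowledge-base convention.

```python
def to_markdown(record: dict) -> str:
    """Render one fetched post as a small Markdown document."""
    caption = record.get("caption") or "(no caption)"
    lines = [
        f"# Instagram post by @{record['author']}",
        "",
        f"- Source: {record['url']}",
        f"- Posted (UTC): {record['timestamp']}",
        f"- Likes: {record['likes']}",
        f"- Comments: {record['comments']}",
        "",
        caption,
    ]
    return "\n".join(lines) + "\n"


example = {
    "author": "someone",
    "timestamp": "2026-02-26T08:00:00",
    "caption": "",
    "likes": 10,
    "comments": 2,
    "url": "https://www.instagram.com/reel/DUp-e9zkwRv/",
}
print(to_markdown(example))
```

The record can be loaded with `json.loads` from the script's standard output before being passed to the function.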

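Advanced Features (optional)

Logging In to Fetch More Data

L = instaloader.Instaloader()
L.load_session_from_file("username")  # reuse a previously saved session
# or
L.login("username", "password")  # log in directly (avoid hard-coding a plaintext password)

Batch Processing

# A batch script can process multiple URLs in one run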
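One possible shape for such a batch script is sketched below. The function name `fetch_all` and its parameters are hypothetical; `fetch_one` stands in for whatever single-URL fetcher is used, and the 5-second default delay follows the pacing suggested in this document rather than any documented Instagram limit.

```python
import time
from typing import Any, Callable, Dict, Iterable, List


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], Any],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Fetch each URL in turn, pausing between requests and recording failures."""
    results: List[Dict[str, Any]] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        try:
            results.append({"url": url, "ok": True, "data": fetch_one(url)})
        except Exception as exc:  # one bad URL should not stop the batch
            results.append({"url": url, "ok": False, "error": str(exc)})
    return results
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.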

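Caveats and Risks

Legal and Ethical Considerations

✅ Fetch public content only
✅ Comply with Instagram's Terms of Use
✅ Do not use for commercial purposes
✅ Respect the original author's copyright

Technical Risks

⚠️ Rate limits: do not scrape too fast (an interval of 5-10 seconds is suggested)
⚠️ IP blocking: excessive use may get the IP blocked by Instagram
⚠️ Account risk: a logged-in account may be restricted
⚠️ API changes: Instagram updates may break the tools

Best Practices

Rate control: wait at least 5 seconds between requests
Error handling: handle failures gracefully (retry mechanism)
User agent: mimic a real browser
Low volume: avoid large-scale scraping
Regular updates: keep instaloader on the latest version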
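The retry mechanism mentioned above can be sketched with exponential backoff. The function name `with_retries`, the default of three attempts, and the doubling schedule are assumptions chosen for illustration; the 5-second base delay follows the interval suggested in this document.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on any exception with a doubling delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            sleep(base_delay * (2 ** attempt))  # 5s, 10s, 20s, ...
    raise RuntimeError("unreachable")
```

A call such as `with_retries(lambda: fetch(url))`, where `fetch` is any single-URL fetcher, would try up to three times and wait 5 and then 10 seconds between attempts.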

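Fallback Options (if Instaloader stops working)

Plan B: Apify (paid but stable)

If all free options stop working
Consider paying for Apify
Highest stability, maintenance-free

Plan C: Manual Collection

Screenshots plus manually copied text
Most reliable but most time-consuming
Suitable for small amounts of content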

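Summary

Ready-to-Use Option

✅ Instaloader (recommended)
- Easy to install: pip3 install instaloader
- Easy to use: similar to the existing fetch-threads.js
- Free, open source, and stable

Next Steps

Install Instaloader
Create the fetch-instagram.py script
Test with the Reel URL provided by Allen
Convert the result into a Markdown article
Add it to the knowledge base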

References

Instaloader official documentation: https://instaloader.github.io/
Instaloader GitHub: https://github.com/instaloader/instaloader
Apify Instagram Scraper: https://apify.com/apify/instagram-reel-scraper
Article on GraphQL reverse engineering: https://medium.com/@seotanvirbd/...

Tags

#Instagram #Scraping #Instaloader #Python #GraphQL #WebScraping #Reel #KnowledgeBase