web-dev-qa-db-ja.com

How can I web-scrape followers from Instagram's web interface?

Can someone tell me how to access the underlying URL to view a given user's Instagram followers? I can do this with the Instagram API, but given the pending changes to the approval process, I have decided to switch to scraping.

The Instagram web interface lets you view the follower list of any public user. For example, to view instagram's followers, visit https://www.instagram.com/instagram and click the followers link, which opens a window listing the account's followers (note: you must be logged in to an account to view this).

Note that when this window pops up, the URL changes to https://www.instagram.com/instagram/followers — but I can't seem to view the underlying page source for that URL.

Since it appears in my browser window, I figure I should be able to scrape it. But do I need a package like Selenium? Or does someone know what the underlying URL is, so I can avoid Selenium altogether?

As an example, I can directly access the underlying feed data by visiting instagram.com/instagram/media/, and from that data I can scrape and paginate through every iteration. I'd like to do something similar with the follower list and access that data directly (rather than going through Selenium).
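For context, at the time this was asked the media endpoint returned JSON shaped roughly like `{"items": [...], "more_available": true}`, and you paginated by passing the last item's `id` back as `max_id`. Below is a minimal sketch of that pagination pattern — the endpoint URL and payload shape are assumptions from that era (Instagram has since removed this endpoint), so treat it as an illustration, not working code:

```python
import json
from urllib.request import urlopen


def next_max_id(payload):
    """Given one page of the old /media/ JSON payload, return the max_id
    for the next page, or None when there are no more pages."""
    if not payload.get("more_available") or not payload.get("items"):
        return None
    return payload["items"][-1]["id"]


def fetch_all_media(account):
    """Paginate through the (now-removed) media endpoint.
    Kept only to illustrate the pagination loop the question describes."""
    url = "https://www.instagram.com/{0}/media/".format(account)
    max_id = None
    while True:
        page_url = url if max_id is None else url + "?max_id=" + max_id
        payload = json.load(urlopen(page_url))
        for item in payload.get("items", []):
            yield item
        max_id = next_max_id(payload)
        if max_id is None:
            break
```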

6
user812783765

EDIT: December 2018 update:

Things have changed in Instagram-land since this was posted. Here is an updated script that is a bit more Pythonic and makes better use of XPath/CSS paths.

Note that to use this updated script, you either need to install the explicit package (pip install explicit), or convert each waiter line to pure-Selenium explicit waits.

import itertools

from explicit import waiter, XPATH
from selenium import webdriver


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    waiter.find_write(driver, "//div/input[@name='username']", username, by=XPATH)
    waiter.find_write(driver, "//div/input[@name='password']", password, by=XPATH)
    waiter.find_element(driver, "//div/button[@type='submit']", by=XPATH).click()

    # Wait for the user dashboard page to load
    waiter.find_element(driver, "//a/span[@aria-label='Find People']", by=XPATH)


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    # driver.find_element_by_partial_link_text("follower").click()
    waiter.find_element(driver, "//a[@href='/instagram/followers/']", by=XPATH).click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # model by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(follower_index))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = 'instagram'
    driver = webdriver.Chrome()
    try:
        login(driver)
        # Print the first 75 followers for the "instagram" account
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
            if count >= 75:
                break
    finally:
        driver.quit()

I ran a quick benchmark to show how sharply performance degrades the more followers you try to scrape this way — the time per additional 100 followers keeps growing, so the total time grows superlinearly:

$ python example.py
Followers of the "instagram" account
Found    100 followers in 11 seconds
Found    200 followers in 19 seconds
Found    300 followers in 29 seconds
Found    400 followers in 47 seconds
Found    500 followers in 71 seconds
Found    600 followers in 106 seconds
Found    700 followers in 157 seconds
Found    800 followers in 213 seconds
Found    900 followers in 284 seconds
Found   1000 followers in 375 seconds
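The slowdown makes sense: each nth-child lookup walks an ever-growing list in the modal. Recomputing the per-100-follower increments from the totals above makes the trend visible — the increments themselves grow roughly linearly, which means the cumulative time grows roughly quadratically:

```python
# Cumulative seconds from the benchmark above, one entry per 100 followers
totals = [11, 19, 29, 47, 71, 106, 157, 213, 284, 375]

# Incremental cost of each successive batch of 100 followers
deltas = [b - a for a, b in zip([0] + totals, totals)]
print(deltas)  # each batch of 100 costs more than the one before it
```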

Original post: Your question is a little confusing. For example, I'm not really sure what "scrape and paginate through every iteration" actually means. What are you currently using to scrape and paginate?

In any case, instagram.com/instagram/media/ is not the same kind of endpoint as instagram.com/instagram/followers. The media endpoint appears to be a REST API, configured to return an easily parseable JSON object.

The followers endpoint, as far as I can tell, is not really a RESTful endpoint. Rather, Instagram AJAXes the information into the page source (using React?) after you click the Followers button. I don't think you can get that information without something like Selenium, which can load/render the JavaScript that displays the followers to the user.

This example code works:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "//div[@style='position: relative; z-index: 1;']/div/div[2]/div/div[1]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    # You'll need to figure out some scrolling magic here. Something that can
    # scroll to the bottom of the followers modal, and know when its reached
    # the bottom. This is pretty impractical for people with a lot of followers

    # Finally, scrape the followers
    xpath = "//div[@style='position: relative; z-index: 1;']//ul/li/div/div/div/div/a"
    followers_elems = driver.find_elements_by_xpath(xpath)

    return [e.text for e in followers_elems]


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        login(driver)
        followers = scrape_followers(driver, "instagram")
        print(followers)
    finally:
        driver.quit()

This approach is problematic for a number of reasons, chief among them being how slow it is compared to the API.

16
Levi Noecker

I noticed that the previous answer no longer works, so I made an updated version based on it that includes a scrolling feature (to get all the users in the list, not just the ones loaded initially). In addition, it scrapes both followers and following. (You will also need to download chromedriver.)

import time
from selenium import webdriver as wd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The account you want to check
account = ""

# Chrome executable
chrome_binary = r"chrome.exe"   # Add your path here


def login(driver):
    username = ""   # Your username
    password = ""   # Your password

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followersbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followersbox.scrollHeight;")

    # We need to scroll the followers modal to ensure that all followers are loaded
    while True:
        driver.execute_script("followersbox.scrollTo(0, followersbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followersbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the followers
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    followers_elems = driver.find_elements_by_xpath(xpath)

    followers_temp = [e.text for e in followers_elems]  # List of followers (username, full name, follow text)
    followers = []  # List of followers (usernames only)

    # Go through each entry in the list, append the username to the followers list
    for i in followers_temp:
        username, sep, name = i.partition('\n')
        followers.append(username)

    print("______________________________________")
    print("FOLLOWERS")

    return followers

def scrape_following(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Following' link
    driver.find_element_by_partial_link_text("following").click()

    # Wait for the following modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followingbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followingbox.scrollHeight;")

    # We need to scroll the following modal to ensure that all following are loaded
    while True:
        driver.execute_script("followingbox.scrollTo(0, followingbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followingbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the following
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    following_elems = driver.find_elements_by_xpath(xpath)

    following_temp = [e.text for e in following_elems]  # List of following (username, full name, follow text)
    following = []  # List of following (usernames only)

    # Go through each entry in the list, append the username to the following list
    for i in following_temp:
        username, sep, name = i.partition('\n')
        following.append(username)

    print("\n______________________________________")
    print("FOLLOWING")
    return following


if __name__ == "__main__":
    options = wd.ChromeOptions()
    options.binary_location = chrome_binary # chrome.exe
    driver_binary = r"chromedriver.exe"
    driver = wd.Chrome(driver_binary, chrome_options=options)
    try:
        login(driver)
        followers = scrape_followers(driver, account)
        print(followers)
        following = scrape_following(driver, account)
        print(following)
    finally:
        driver.quit()
3
Morten Amundsen

Update: March 2020

This is Levi's answer with small updates in places, because as it was, it didn't quit the driver successfully. This version also gets all of the followers by default and, as everyone else has said, it isn't intended for accounts with many followers.

import itertools

from explicit import waiter, XPATH
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep

def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")
    sleep(3)
    # Login
    driver.find_element_by_name("username").send_keys(username)
    driver.find_element_by_name("password").send_keys(password)
    submit = driver.find_element_by_tag_name('form')
    submit.submit()

    # Wait for the user dashboard page to load
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    # driver.find_element_by_partial_link_text("follower").click
    sleep(2)
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)
    allfoll = int(driver.find_element_by_xpath("//li[2]/a/span").text)
    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # model by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            if follower_index > allfoll:
                return  # PEP 479: raising StopIteration inside a generator becomes RuntimeError in Python 3.7+
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be large amount of time between that call and this one,
        # and the element might have gone stale. Lets just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(group+11))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = ""  # <account to check>
    driver = webdriver.Firefox(executable_path="./geckodriver")
    try:
        login(driver)
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
    finally:
        driver.quit()
0
Germán Ruelas