Python: check whether a website exists

Question

I want to check whether a certain website exists. This is what I'm doing:

    import urllib2

    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    link = "http://www.abc.com"
    req = urllib2.Request(link, headers=headers)
    page = urllib2.urlopen(req).read()  # ERROR 402 generated here!

If the page doesn't exist (error 402, or any other error), what can I do at the page = ... line to make sure the page I'm reading actually exists?

Adem Öztaş · Accepted Answer

You can use a HEAD request instead of GET. It downloads only the headers, not the content. You can then check the response status from the headers.

    import httplib  # Python 2; the module is named http.client in Python 3

    c = httplib.HTTPConnection('www.example.com')
    c.request("HEAD", '/')  # ask for the headers of the root path
    if c.getresponse().status == 200:
        print('web site exists')

Or you can use urllib2:

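    import urllib2  # Python 2; split into urllib.request and urllib.error in Python 3

    try:
        urllib2.urlopen('http://www.example.com/some_page')
    except urllib2.HTTPError as e:
        print(e.code)  # the server answered with an error status
    except urllib2.URLError as e:
        print(e.args)  # the server could not be reached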

Or you can use requests:

    import requests

    response = requests.get('http://www.example.com')
    if response.status_code == 200:
        print('Web site exists')
    else:
        print('Web site does not exist')
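The httplib and urllib2 snippets in this answer use Python 2 module names. As a rough Python 3 counterpart, here is a minimal sketch of the HEAD approach using only the standard library. The function name `url_exists`, the 5 second timeout and the "below 400" threshold are choices made for this illustration, not part of the original answer.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def url_exists(url, timeout=5.0):
    """Return True if a HEAD request to url ends with a status below 400."""
    request = Request(url, method="HEAD")  # ask for headers only, no body
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except HTTPError:
        # The server answered, but with a 4xx or 5xx status.
        return False
    except (URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        return False
```

Some servers reject or mishandle HEAD, so a False result from this sketch may be worth confirming with a normal GET.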

alecxe · Answer

It is better to check that the status code is below 400, as in the linked example. Here is what the status codes mean (taken from Wikipedia):

1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
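The classes listed above can be expressed as a small helper. This is only an illustrative sketch; the names `status_class` and `looks_reachable` are made up for this example.

```python
def status_class(code):
    """Map an HTTP status code to the name of its class."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of the code.
    return classes.get(code // 100, "unknown")


def looks_reachable(code):
    """Apply the 'below 400' rule: 1xx, 2xx and 3xx count as reachable."""
    return 100 <= code < 400
```

With this, `looks_reachable(301)` is true while `looks_reachable(402)` is false, which matches the advice to test for a status below 400 rather than for exactly 200.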

If you want to check whether a page exists without downloading the whole page, you should use a HEAD request:

    import httplib2

    h = httplib2.Http()
    resp = h.request("http://www.google.com", 'HEAD')
    assert int(resp[0]['status']) < 400

Taken from this answer.

If you do want to download the whole page, make a normal request and check the status code. Example using requests:

    import requests

    response = requests.get('http://google.com')
    assert response.status_code < 400

See also these similar topics:

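Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2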

Hope that helps.

keas · Answer

    from urllib2 import Request, urlopen, HTTPError, URLError

    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    link = "http://www.abc.com/"
    req = Request(link, headers=headers)
    try:
        page_open = urlopen(req)
    except HTTPError as e:
        print(e.code)
    except URLError as e:
        print(e.reason)
    else:
        print('ok')

To answer unutbu's comment:

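Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. (Source: the urllib2 HOWTO in the Python documentation.)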
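The behaviour described in that quote can be observed against a throwaway local server. The sketch below uses the Python 3 module names (urllib.request, urllib.error); the handler class, the paths /old, /new and /missing, and the helper `observed_status` are all invented for this demonstration. Proxy settings are disabled so the request stays on the local machine.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old redirects to /new, /new exists, the rest is 404."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Default handlers (including redirect handling), but with proxies disabled.
opener = build_opener(ProxyHandler({}))


def observed_status(url):
    """Return the status code the caller ends up seeing for url."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except HTTPError as e:
        return e.code


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

redirect_status = observed_status(base + "/old")     # redirect followed silently
missing_status = observed_status(base + "/missing")  # surfaces as HTTPError

server.shutdown()
server.server_close()
```

The 302 never reaches the caller because the redirect is followed and the final 200 is reported, while the 404 arrives as an HTTPError carrying its code.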

Raj · Answer

Code:

    import urllib  # Python 2

    a = "http://www.example.com"
    try:
        print(urllib.urlopen(a))
    except IOError:  # raised when the connection cannot be made
        print(a + " site does not exist")

Maxfield · Answer

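There is an excellent answer by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about whether a resource exists, the answer can be improved for the case where the resource is large.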

Previous answers for requests suggested something like the following:

    import requests

    def uri_exists_get(uri: str) -> bool:
        try:
            response = requests.get(uri)
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
        except requests.exceptions.ConnectionError:
            return False

requests.get attempts to pull the entire resource at once, so for large media files the above snippet would try to pull the entire file into memory. To solve this, we can stream the response:

    def uri_exists_stream(uri: str) -> bool:
        try:
            with requests.get(uri, stream=True) as response:
                try:
                    response.raise_for_status()
                    return True
                except requests.exceptions.HTTPError:
                    return False
        except requests.exceptions.ConnectionError:
            return False

I ran the above snippets with timers attached against two web resources:

1) http://bbb3d.renderfarming.net/download.html , a very light HTML page

2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4 , a decently sized video file

Timing results below:

    uri_exists_get("http://bbb3d.renderfarming.net/download.html")
    # Completed in: 0:00:00.611239

    uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
    # Completed in: 0:00:00.000007

    uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
    # Completed in: 0:01:12.813224

    uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
    # Completed in: 0:00:00.000007
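The answer does not show how the timers were attached. One way such measurements might be collected is sketched below; the helper name `timed` is hypothetical, and this is not necessarily the harness that produced the numbers above.

```python
import time
from datetime import timedelta


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed as a timedelta)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = timedelta(seconds=time.perf_counter() - started)
    return result, elapsed


# Stand-in workload so the sketch runs without network access.
total, elapsed = timed(sum, range(1000))
print("Completed in:", elapsed)
```

Wrapping a call such as `timed(uri_exists_stream, url)` in the same way would print the elapsed time in a format similar to the results above.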

As a last note: this function also works when the resource host doesn't exist. For example, "http://abcdefghblahblah.com/test.mp4" will return False.

DiegoPacheco · Answer

    import urllib.request
    from urllib.error import HTTPError, URLError

    def isok(mypath):
        try:
            thepage = urllib.request.urlopen(mypath)
        except HTTPError as e:
            return 0
        except URLError as e:
            return 0
        else:
            return 1

Vishal · Answer

Try this:

    import urllib2

    website = 'https://www.allyourmusic.com'
    try:
        response = urllib2.urlopen(website)
        if response.code == 200:
            print("site exists!")
        else:
            print("site doesn't exist!")
    except urllib2.HTTPError as e:
        print(e.code)
    except urllib2.URLError as e:
        print(e.args)