Check whether a remote file exists in bash

Question

I am downloading files with this script:

parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'

Is it possible to skip the download, check on the remote side whether each file exists, and create a dummy file locally when it does?

Something like:

if wget --spider "$url" 2>/dev/null; then
    touch img.file
fi

That should work, but I don't know how to combine this code with GNU Parallel.

Edit:

Based on Ole's answer, I wrote the following code:

#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
  #get filename from $url
  url2=${url##*/}
  wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url

parallel --progress -a urls.txt do_url {}

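It works, but it fails for some files. I cannot find any pattern in why it works for some files and fails for others. Maybe it has something to do with the last part of the file name. The second wget tries to access the correct URLs, but the touch command after it simply does not create the desired files. The first wget always (correctly) handles the main image, the one without _001.jpg, _002.jpg.

Example urls.txt:

http://Host.com/092401.jpg (works correctly; _001.jpg .. _005.jpg are downloaded)
http://Host.com/HT11019.jpg (does not work; only the main image is downloaded)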

Ole Tange · Accepted Answer

It is quite hard to understand what you really want to achieve. Let me try to rephrase your question.

I have urls.txt containing:

http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg

On example.com these URLs exist:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg

On example.org these URLs exist:

http://example.org/dira/foo_001.jpg

Given urls.txt, I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. For example:

http://example.com/dira/foo.jpg

becomes:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
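
The expansion step described above can be sketched in plain shell. This is only an illustration: expand_url is a hypothetical helper name, and the sketch assumes every line of urls.txt ends in .jpg.

```shell
#!/usr/bin/env bash
# Print the original URL followed by its _001.jpg .. _005.jpg variants.
# expand_url is a hypothetical helper name, not part of GNU Parallel or wget.
expand_url() {
    url=$1
    base=${url%.jpg}              # strip the trailing .jpg
    printf '%s\n' "$url"
    for i in 1 2 3 4 5; do
        printf '%s_%03d.jpg\n' "$base" "$i"   # zero-padded suffix: _001 .. _005
    done
}

expand_url "http://example.com/dira/foo.jpg"
```

Run as written, it prints the six URLs listed above, one per line, which could then be fed to whatever existence check is used.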

Then I want to test whether these URLs exist without downloading the files. As there are many URLs, I want to do this in parallel.

If the URL exists, I want an empty file created.

(Version 1): I want the empty file created in a similar directory structure under the dir images. This is needed because some of the images have the same name but live in different dirs.

So the files created should be:

images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg

(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.

So the files created should be:

images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg

(Version 3): I want the empty file created in the dir images, named after the entry in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.

images/foo.jpg
images/bar.jpg
images/baz.jpg

#!/bin/bash

do_url() {
  url="$1"

  # Version 1:
  # If you want to keep the folder structure from the server (similar to wget -m):
  wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"

  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$3"

  # Version 3:
  # If all the images have unique names when _###.jpg is removed and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url

parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
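
To make the four arguments passed to do_url easier to follow, here is a rough plain-shell equivalent of the replacement strings for a single input line and a single suffix. The variable names are illustrative only, and the parameter expansions are approximations of what GNU Parallel computes for URLs of this shape.

```shell
#!/usr/bin/env bash
# Plain-shell approximations of the GNU Parallel replacement strings used
# above, for one combination of inputs. Variable names are illustrative.
line="http://example.com/dira/foo.jpg"   # {1}: one line of urls.txt
suffix="_001.jpg"                        # {2}: one value from the ::: list

no_ext=${line%.*}        # {1.}  -> input without its extension
dir=${line%/*}           # {1//} -> directory part
file=${line##*/}         # {1/}  -> file name only
file_no_ext=${file%.*}   # {1/.} -> file name without its extension

printf 'arg1=%s\n' "$no_ext$suffix"        # $1: URL to test
printf 'arg2=%s\n' "$dir"                  # $2: directory for Version 1
printf 'arg3=%s\n' "$file_no_ext$suffix"   # $3: file name for Version 2
printf 'arg4=%s\n' "$file"                 # $4: file name for Version 3
```

GNU Parallel runs do_url once per combination of a line from urls.txt and a suffix from the ::: list, so each URL line produces six jobs.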

GNU Parallel has an overhead of a few milliseconds per job. When the jobs are this short, that overhead affects the timing. If none of your CPU cores is running at 100%, you can run more jobs in parallel:

parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg

You can also "unroll" the loop. This saves the startup overhead of five jobs per URL:

do_url() {
  url="$1"
  # File name without the directory part, so touch needs no sub-directories
  name="${url##*/}"
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url".jpg && touch images/"$name".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$name"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$name"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$name"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$name"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$name"_005.jpg
}
export -f do_url

parallel -j0 do_url {.} :::: urls.txt
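
The same "one job per line of urls.txt" idea can also be written with a loop inside the function instead of six near-identical lines. This is a sketch, not the answer's code: url_exists is a hypothetical wrapper introduced here so the check can be swapped out, and the sketch assumes all file names are unique (Version 2).

```shell
#!/usr/bin/env bash
# url_exists is a hypothetical wrapper around the existence check; here it
# sends a HEAD request with wget, as the answer above does.
url_exists() {
    wget -q --method=HEAD "$1"
}

# Check base.jpg and base_001.jpg .. base_005.jpg and create an empty file
# in images/ for each URL that exists. "$1" is the URL without ".jpg".
do_url() {
    base=$1
    name=${base##*/}              # file name without the directory part
    mkdir -p images
    for suffix in "" _001 _002 _003 _004 _005; do
        if url_exists "$base$suffix.jpg"; then
            touch "images/$name$suffix.jpg"
        fi
    done
}

# With bash and GNU Parallel this could be run as:
#   export -f url_exists do_url
#   parallel -j0 do_url {.} :::: urls.txt
```

Each URL still gets its own request and its own decision, but GNU Parallel only starts one job per input line.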

Finally, you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround

AnythingIsFine · Answer

You could use curl instead to check whether the URLs you are parsing exist, without downloading the files themselves:

if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
fi

Explanation:

--fail makes the exit status non-zero on a failed request.
--head avoids downloading the file contents.
--silent keeps the check itself from printing status or error messages.
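
The three flags can be wrapped in a small function so the check reads like a predicate. This is a minimal sketch; remote_exists and check are hypothetical helper names, and the sketch prints a message instead of touching a file so that it has no side effects.

```shell
#!/usr/bin/env bash
# remote_exists is a hypothetical helper name. It returns curl's exit status:
# 0 when the HEAD request succeeds, non-zero otherwise (for example on a 404,
# because of --fail).
remote_exists() {
    curl --head --fail --silent "$1" >/dev/null
}

# Report the result for one URL.
check() {
    if remote_exists "$1"; then
        echo "exists: $1"
    else
        echo "missing: $1"
    fi
}
```

Some servers treat HEAD differently from GET, so a failure here is a strong hint rather than proof that the file is missing.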

To solve the "looping" issue, you can do:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if curl --head --silent --fail "$url" > /dev/null; then
        touch ./images/"${url##*/}"
    fi
done

darnir · Answer

As far as I can see, your question is not really about how to use wget to test for the existence of a file, but about how to write the correct loop in a shell script.

Here is a simple solution for that:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if wget -q --method=HEAD "$url"; then
        touch ./images/"${url##*/}"
    fi
done

This invokes Wget with the --method=HEAD option. With a HEAD request, the server simply reports whether the file exists, without returning any data.
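
One likely reason the script in the question behaved inconsistently (an inference from wget's documented exit-status rules, not something confirmed in this thread) is that a single wget call given several URLs exits non-zero if any of them fails, so the chained touch runs for all names or for none. The sketch below imitates that with a stand-in: probe and PRESENT are hypothetical and no network access is involved.

```shell
#!/usr/bin/env bash
# PRESENT is a made-up list of URLs that "exist" on an imaginary server.
PRESENT="http://example.com/a_001.jpg http://example.com/a_003.jpg"

# Stand-in for one wget call: succeeds only if EVERY URL given is present.
probe() {
    for p in "$@"; do
        case " $PRESENT " in
            *" $p "*) ;;        # this URL exists, keep checking
            *) return 8 ;;      # a single miss fails the whole call
        esac
    done
}

# All-or-nothing: one probe call for all URLs, as in the question's script.
all_at_once() {
    if probe "$@"; then
        for u in "$@"; do echo "${u##*/}"; done
    fi
}

# Per-URL: each URL gets its own probe call and its own decision.
one_by_one() {
    for u in "$@"; do
        if probe "$u"; then echo "${u##*/}"; fi
    done
}
```

Called with a_001.jpg, a_002.jpg and a_003.jpg, all_at_once prints nothing because a_002.jpg is missing, while one_by_one prints a_001.jpg and a_003.jpg. That matches the loop above, where every URL is tested on its own.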

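Of course, with a large data set this is quite inefficient, because you create a new connection to the server for every file you try. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2 you can test all of these in parallel and use the new --stats-site option to get a list of all the files together with the specific return code the server gave for each. For example: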

$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:

  http://example.com:
    Status    No. of docs
       404              3
         http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
         http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
         http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200              1
         http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)

You can also have this data printed as CSV or JSON to make it easier to parse.

Burghard Hoffmann · Answer

Why not just loop over the names?

for uname in ${url%.jpg}_{001..005}.jpg
do
  if wget --spider $uname 2>/dev/null; then
    touch ./images/${uname##*/}
  fi
done