Finding duplicate files and removing them

Question

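I am writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files as well. I am using the SHA-1 algorithm.

How can I find these duplicate files and remove them?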

Todor Minakov · Answer

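Fastest algorithm - 100x performance increase compared to the accepted answer (really :))

The approaches in the other solutions are very good, but they overlook an important property of duplicate files: they have the same file size. Calculating the expensive hash only on files of the same size saves a tremendous amount of CPU; the performance comparison is at the end.

This iterates on the solid answer given by @nosklo and borrows @Raffi's idea of taking a fast hash of just the beginning of each file, then calculating the full hash only for collisions in the fast hash. Here are the steps:

Build a hash table of the files, with the file size as the key.
For files of the same size, build a hash table keyed by the hash of their first 1024 bytes; elements that do not collide are unique.
For files with the same hash on the first 1k bytes, calculate the hash of the full contents; files whose full hashes match are NOT unique.

The code: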

    #!/usr/bin/env python
    import sys
    import os
    import hashlib


    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk


    def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
        hashobj = hash()
        # the with-block closes the file even if reading fails
        with open(filename, 'rb') as file_object:
            if first_chunk_only:
                hashobj.update(file_object.read(1024))
            else:
                for chunk in chunk_reader(file_object):
                    hashobj.update(chunk)
        return hashobj.digest()


    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes_by_size = {}
        hashes_on_1k = {}
        hashes_full = {}

        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    try:
                        # if the target is a symlink (soft one), this will
                        # dereference it - change the value to the actual target file
                        full_path = os.path.realpath(full_path)
                        file_size = os.path.getsize(full_path)
                    except (OSError, IOError):
                        # not accessible (permissions, etc) - pass on
                        continue

                    # create the list for this file size on first use
                    hashes_by_size.setdefault(file_size, []).append(full_path)

        # For all files with the same file size, get their hash on the 1st 1024 bytes
        for __, files in hashes_by_size.items():
            if len(files) < 2:
                continue    # this file size is unique, no need to spend CPU cycles on it

            for filename in files:
                try:
                    small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
                except (OSError, IOError):
                    # the file access might've changed till the exec point got here
                    continue

                # create the list for this 1k hash on first use
                hashes_on_1k.setdefault(small_hash, []).append(filename)

        # For all files with the same hash on the 1st 1024 bytes, get their hash
        # on the full file - collisions will be duplicates
        for __, files in hashes_on_1k.items():
            if len(files) < 2:
                continue    # this hash of the first 1k bytes is unique, no need to spend CPU cycles on it

            for filename in files:
                try:
                    full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
                except (OSError, IOError):
                    # the file access might've changed till the exec point got here
                    continue

                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: %s and %s" % (filename, duplicate))
                else:
                    hashes_full[full_hash] = filename


    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")

And now the fun part - the performance comparison.

Baseline:

a directory with 1047 files: 32 mp4 and 1015 jpg, total size 5445.998 MiB - i.e. my phone's camera auto-upload directory :)
a small (but fully functional) processor - 1600 BogoMIPS, 1.2 GHz, 32 KB L1 + 256 KB L2 cache; from /proc/cpuinfo:

    Processor : Feroceon 88FR131 rev 1 (v5l)
    BogoMIPS  : 1599.07

(i.e. my low-end NAS :), running Python 2.7.11.

So, the output of @nosklo's very handy solution:

    root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

    real    5m44.198s
    user    4m44.550s
    sys     0m33.530s

And here is the version that filters on file size first, then on the small hash, and finally on the full hash if collisions are found:

    root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
    Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
    Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
    Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
    Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

    real    0m1.398s
    user    0m1.200s
    sys     0m0.080s

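Both versions were run 3 times each to get the average of the time needed.

So v1 takes 284 s of user time (about 318 s of user + sys in the run shown above), and the other takes under 2 s; quite a difference, huh :) With this speedup, one could move to SHA512, or something even fancier - the performance penalty would be offset by the smaller number of calculations needed.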
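As a quick check of the arithmetic, the user and sys values from the two `time` outputs above can be added up and compared. This is only a sketch over the single runs shown here, not over the three-run averages; the helper name `to_seconds` is made up for this example.

```python
import re


def to_seconds(stamp):
    """Convert a value printed by `time`, such as '4m44.550s', to seconds."""
    minutes, seconds = re.match(r'(\d+)m([\d.]+)s$', stamp).groups()
    return int(minutes) * 60 + float(seconds)


# user and sys values copied from the two runs shown above
full_hash_cpu = to_seconds('4m44.550s') + to_seconds('0m33.530s')
staged_cpu = to_seconds('0m1.200s') + to_seconds('0m0.080s')

print('full hash of every file: %.2f s of CPU time' % full_hash_cpu)
print('size + 1k hash + full:   %.2f s of CPU time' % staged_cpu)
print('ratio: %.1f' % (full_hash_cpu / staged_cpu))
```

Measured this way, the ratio comes out well above the 100x mentioned in the heading.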

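Drawbacks:

More disk access than the other versions - every file is accessed once for its size stats (that's cheap, but it is still disk IO), and every duplicate is opened twice (for the small hash of the first 1k bytes, and for the hash of the full contents)
Will consume more memory, because the hash tables are kept in memory at runtime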

nosklo · Answer

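Recursive folders version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan all of them recursively and report all duplicates found.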

    import sys
    import os
    import hashlib


    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk


    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes = {}
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    hashobj = hash()
                    # the with-block makes sure every file gets closed again
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    if duplicate:
                        print("Duplicate found: %s and %s" % (full_path, duplicate))
                    else:
                        hashes[file_id] = full_path


    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")

zalew · Answer

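    import hashlib
    import os


    def remove_duplicates(dir):
        unique = []
        for filename in os.listdir(dir):
            # os.listdir() returns bare names, so build the full path first
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                # read in binary mode so the hash covers the exact bytes on disk
                with open(filepath, 'rb') as f:
                    filehash = hashlib.md5(f.read()).hexdigest()
                if filehash not in unique:
                    unique.append(filehash)
                else:
                    os.remove(filepath)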

//edit:

For mp3 you may also be interested in this topic: "Detect duplicate MP3 files with different bitrates and/or different ID3 tags?"
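To illustrate the ID3 half of that problem: two copies of the same MP3 with different tags will never have the same whole-file hash. A minimal sketch, assuming the tags are a standard ID3v2 block at the start of the file (10-byte header whose last 4 bytes hold a synchsafe size) and/or a 128-byte ID3v1 block at the end, is to hash only the bytes in between. The names `synchsafe_to_int` and `audio_hash` are made up for this example; other tag formats (APE, for instance) and copies encoded at different bitrates are not handled, and the whole file is read into memory for simplicity.

```python
import hashlib


def synchsafe_to_int(raw):
    """Decode the 4-byte synchsafe integer (7 useful bits per byte) of an ID3v2 header."""
    b = bytearray(raw)
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def audio_hash(path, hash=hashlib.sha1):
    """Hash an MP3 file while skipping a leading ID3v2 tag and a trailing ID3v1 tag."""
    with open(path, 'rb') as f:
        data = bytearray(f.read())

    start, end = 0, len(data)

    # ID3v2: 'ID3', 2 version bytes, 1 flags byte, 4 size bytes, then the tag body
    if len(data) >= 10 and data[:3] == b'ID3':
        tag_size = synchsafe_to_int(data[6:10])
        has_footer = bool(data[5] & 0x10)   # ID3v2.4 footer flag adds 10 more bytes
        start = min(end, 10 + tag_size + (10 if has_footer else 0))

    # ID3v1: fixed 128-byte block at the very end, starting with 'TAG'
    if end - start >= 128 and data[end - 128:end - 125] == b'TAG':
        end -= 128

    return hash(bytes(data[start:end])).hexdigest()
```

Files whose `audio_hash` values match could then be treated as duplicate candidates in any of the scripts in this thread, in place of the plain content hash.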

John Millikin · Answer

I wrote one in Python some time ago; you're welcome to use it.

    import sys
    import os
    import hashlib

    # Report filepath if a file with the same SHA-1 was already seen, else remember it
    check_path = (lambda filepath, hashes, p = sys.stdout.write:
            (lambda hash = hashlib.sha1 (open (filepath, 'rb').read ()).hexdigest ():
                    ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                              '   %s\n'
                                              'of %s\n' % (filepath, hashes[hash])))
                     or hashes.setdefault (hash, filepath)))())

    # Walk dirpath recursively; entry is the (root, dirs, files) tuple from os.walk
    scan = (lambda dirpath, hashes = {}:
            list (map (lambda entry:
                       list (map (lambda filename: check_path (os.path.join (entry[0], filename), hashes), entry[2])),
                       os.walk (dirpath))))

    ((len (sys.argv) > 1) and scan (sys.argv[1]))

Raffi · Answer

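Faster algorithm

If many files of "big size" have to be analyzed (images, mp3, pdf documents), it would be interesting/faster to use the following comparison algorithm:

A first, fast hash is computed on the first N bytes of the file (say 1 KB). This hash tells you without doubt when two files are different, but it cannot tell you whether two files are exactly the same (because of the accuracy of the hash and the limited amount of data read from disk).
A second, slower hash, which is more accurate and computed on the whole content of the file, is used if a collision occurs in the first stage.

Here is an implementation of this algorithm: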

    import hashlib


    def Checksum(current_file_name, check_type='sha512', first_block=False):
        """Computes the hash for the given file. If first_block is True,
        only the first block of size size_block is hashed."""
        size_block = 1024 * 1024  # the first N bytes (1 MiB as written here)

        d = {'sha1': hashlib.sha1, 'md5': hashlib.md5, 'sha512': hashlib.sha512}

        if check_type not in d:
            raise Exception("Unknown checksum method")

        with open(current_file_name, 'rb') as f:
            key = d[check_type]()
            while True:
                s = f.read(size_block)
                key.update(s)
                if len(s) < size_block or first_block:
                    break
        return key.hexdigest().upper()


    def find_duplicates(files):
        """Find duplicates among a set of files.
        The implementation uses two types of hashes:
        - a small and fast one on the first block of the file,
        - and in case of collision a complete hash on the file. The complete hash
        is not computed twice.
        It flushes the files that seem to have the same content
        (according to the hash method) at the end.
        """

        print('Analyzing %d files' % len(files))

        # this dictionary will receive small hashes
        d = {}
        # this dictionary will receive full hashes. It is filled
        # only in case of collision on the small hash (contains at least two
        # elements)
        duplicates = {}

        for f in files:
            # small hash to be fast
            check = Checksum(f, first_block=True, check_type='sha1')

            if check not in d:
                # d[check] is a list of files that have the same small hash
                d[check] = [(f, None)]
            else:
                l = d[check]
                l.append((f, None))

                for index, (ff, checkfull) in enumerate(l):
                    if checkfull is None:
                        # computes the full hash in case of collision
                        checkfull = Checksum(ff, first_block=False)
                        l[index] = (ff, checkfull)

                        # for each new full hash computed, check if there is
                        # a collision in the duplicate dictionary.
                        if checkfull not in duplicates:
                            duplicates[checkfull] = [ff]
                        else:
                            duplicates[checkfull].append(ff)

        # prints the detected duplicates
        if len(duplicates) != 0:
            print('')
            print("The following files have the same sha512 hash")

            for h, lf in duplicates.items():
                if len(lf) == 1:
                    continue
                print('Hash value %s' % h)
                for f in lf:
                    print('\t%s' % f)
        return duplicates

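The find_duplicates function takes a list of files. This way, it is also possible to compare two directories (for instance, to better synchronize their content). Below is an example of a function that builds a list of files with the specified extensions and avoids entering certain directories: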

    import os


    def getFiles(_path, extensions=['.png'],
                 subdirs=False, avoid_directories=None):
        """Returns the list of files in the path :'_path',
           of extension in 'extensions'. 'subdir' indicates if
           the search should also be performed in the subdirectories.
           If extensions = [] or None, all files are returned.
           avoid_directories: if set, do not parse subdirectories that
           match any element of avoid_directories."""
        l = []
        extensions = [p.lower() for p in extensions] if extensions is not None \
            else None
        for root, dirs, files in os.walk(_path, topdown=True):

            for name in files:
                if(extensions is None or len(extensions) == 0 or
                   os.path.splitext(name)[1].lower() in extensions):
                    l.append(os.path.join(root, name))

            if(not subdirs):
                # emptying dirs in place stops os.walk from descending
                while(len(dirs) > 0):
                    dirs.pop()
            elif(avoid_directories is not None):
                for d in avoid_directories:
                    if(d in dirs):
                        dirs.remove(d)

        return l

This method is convenient for not parsing .svn paths, for instance, which would surely trigger colliding files in find_duplicates.
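The directory skipping relies on a documented property of os.walk: with topdown=True, the dirs list may be modified in place, and os.walk then only descends into the names that remain. A minimal, self-contained sketch of just that pruning step follows; `walk_files` and `skip_dirs` are names made up for this example.

```python
import os


def walk_files(top, skip_dirs=('.svn',)):
    """Yield every file path under `top`, never descending into `skip_dirs`."""
    for root, dirs, files in os.walk(top, topdown=True):
        # Slice assignment changes the list object that os.walk itself holds;
        # rebinding the name (dirs = [...]) would have no effect on the walk.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            yield os.path.join(root, name)
```

The resulting paths can be collected with list(walk_files(path)) and handed to a function that expects a list of files.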

Feedback is welcome.

qun · Answer

@IanLee1521 has a nice solution, reproduced below. It is very efficient because it first checks for duplicates based on the file size.

    #! /usr/bin/env python

    # Originally taken from:
    # http://www.pythoncentral.io/finding-duplicate-files-with-python/
    # Original Author: Andres Torres

    # Adapted to only compute the md5sum of files with the same size

    import argparse
    import os
    import sys
    import hashlib


    def find_duplicates(folders):
        """
        Takes in an iterable of folders and prints & returns the duplicate files
        """
        dup_size = {}
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to dup_size
                join_dicts(dup_size, find_duplicate_size(i))
            else:
                print('%s is not a valid path, please verify' % i)
                return {}

        print('Comparing files with the same size...')
        dups = {}
        for dup_list in dup_size.values():
            if len(dup_list) > 1:
                join_dicts(dups, find_duplicate_hash(dup_list))
        print_results(dups)
        return dups


    def find_duplicate_size(parent_dir):
        # Dups in format {size:[names]}
        dups = {}
        for dirName, subdirs, fileList in os.walk(parent_dir):
            print('Scanning %s...' % dirName)
            for filename in fileList:
                # Get the path to the file
                path = os.path.join(dirName, filename)
                # Check to make sure the path is valid.
                if not os.path.exists(path):
                    continue
                # Calculate sizes
                file_size = os.path.getsize(path)
                # Add or append the file path
                if file_size in dups:
                    dups[file_size].append(path)
                else:
                    dups[file_size] = [path]
        return dups


    def find_duplicate_hash(file_list):
        print('Comparing: ')
        for filename in file_list:
            print('    {}'.format(filename))
        dups = {}
        for path in file_list:
            file_hash = hashfile(path)
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
        return dups


    # Joins two dictionaries
    def join_dicts(dict1, dict2):
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]


    def hashfile(path, blocksize=65536):
        afile = open(path, 'rb')
        hasher = hashlib.md5()
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        afile.close()
        return hasher.hexdigest()


    def print_results(dict1):
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates Found:')
            print(
                'The following files are identical. The name could differ, but the'
                ' content is identical'
            )
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')

        else:
            print('No duplicate files found.')


    def main():
        parser = argparse.ArgumentParser(description='Find duplicate files')
        parser.add_argument(
            'folders', metavar='dir', type=str, nargs='+',
            help='A directory to parse for duplicates',
        )
        args = parser.parse_args()

        find_duplicates(args.folders)


    if __name__ == '__main__':
        sys.exit(main())

ady · Answer

    import hashlib
    import os
    import sys


    def read_chunk(fobj, chunk_size=2048):
        """ Files can be huge so read them in chunks of bytes. """
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk


    def remove_duplicates(dir, hashfun=hashlib.sha512):
        unique = set()  # the built-in set replaces the deprecated sets.Set
        for filename in os.listdir(dir):
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                hashobj = hashfun()
                with open(filepath, 'rb') as fobj:
                    for chunk in read_chunk(fobj):
                        hashobj.update(chunk)
                # the size of the hashobj is constant
                # print "hashfun: ", hashfun.__sizeof__()
                hashfile = hashobj.hexdigest()
                if hashfile not in unique:
                    unique.add(hashfile)
                else:
                    os.remove(filepath)


    try:
        hashfun = hashlib.sha256
        remove_duplicates(sys.argv[1], hashfun)
    except IndexError:
        print("""Please pass a path to a directory with
        duplicate files as a parameter to the script.""")

Basj · Answer

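For safety (removing files automatically can be dangerous if something goes wrong!), here is what I use, based on @zalew's answer.

Please also note that the md5 sum code differs slightly from @zalew's original (it opens the files in binary mode), because his code generated too many wrong duplicate files (that's why I said removing them automatically is dangerous!).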

    import hashlib, os

    unique = dict()
    for filename in os.listdir('.'):
        if os.path.isfile(filename):
            filehash = hashlib.md5(open(filename, 'rb').read()).hexdigest()

            if filehash not in unique:
                unique[filehash] = filename
            else:
                print(filename + ' is a duplicate of ' + unique[filehash])
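Reporting first and deleting later can also be combined in one function with a dry-run switch. The following is a sketch along those lines, not code from any of the answers: `file_md5`, `remove_duplicates` and the `dry_run` parameter are names chosen for this example. It keeps the alphabetically first file of each group and only calls os.remove when dry_run is explicitly set to False.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, blocksize=65536):
    """MD5 of a file's bytes, read in blocks so large files fit in memory."""
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            hasher.update(block)
    return hasher.hexdigest()


def remove_duplicates(directory, dry_run=True):
    """Return the duplicate paths found directly in `directory`.

    Nothing is deleted unless dry_run is False.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # size and hash together identify the content
            groups[(os.path.getsize(path), file_md5(path))].append(path)

    duplicates = []
    for paths in groups.values():
        for path in paths[1:]:          # paths[0] is the copy that is kept
            print('%s is a duplicate of %s' % (path, paths[0]))
            if not dry_run:
                os.remove(path)
            duplicates.append(path)
    return duplicates
```

A cautious workflow would be to call remove_duplicates('.') first, read the printed list, and only then repeat the call with dry_run=False.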