How to find duplicate records in PostgreSQL

Question

I have a PostgreSQL table called "user_links" that includes the following fields:

year, user_id, sid, cid

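The only unique constraint at the moment is on the first field, called "id", but I am now trying to add a constraint that makes sure the combination of year, user_id, sid and cid is unique.

Is there a way to find all the duplicates?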

Marcin Zablocki · Accepted Answer

The basic idea is to use a nested query with a count aggregate:

select * from yourTable ou where (select count(*) from yourTable inr where inr.sid = ou.sid) > 1

You can adjust the where clause of the inner query to narrow the search.
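Applied to the table in the question, the inner where clause would compare all four columns rather than just sid. A minimal sketch, assuming the four columns are NOT NULL (a plain = never matches NULL, so NULL-containing rows would need IS NOT DISTINCT FROM instead):

```sql
-- Every row of user_links whose (year, user_id, sid, cid) combination
-- occurs more than once. Assumes none of the four columns is NULL.
SELECT ou.*
FROM user_links ou
WHERE (
    SELECT count(*)
    FROM user_links inr
    WHERE inr.year    = ou.year
      AND inr.user_id = ou.user_id
      AND inr.sid     = ou.sid
      AND inr.cid     = ou.cid
) > 1
ORDER BY ou.year, ou.user_id, ou.sid, ou.cid, ou.id;
```

The ORDER BY is only there so that the members of each duplicate group show up next to each other.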

There is another good solution mentioned in the comments (but not everyone reads them):

select Column1, Column2, count(*) from yourTable group by Column1, Column2 HAVING count(*) > 1

Or shorter:

SELECT (yourTable.*)::text, count(*) FROM yourTable GROUP BY yourTable.* HAVING count(*) > 1
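For the user_links table from the question, the group by form might look like the sketch below. The copies and ids output columns are names picked here for illustration. Note that the whole-row variant compares every column, including id, so on a table where id is already unique it would not report anything; the explicit column list is the one to use there.

```sql
-- One output row per duplicated (year, user_id, sid, cid) combination,
-- with the number of copies and the ids of the rows that share it.
SELECT year, user_id, sid, cid,
       count(*)                  AS copies,
       array_agg(id ORDER BY id) AS ids
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1
ORDER BY copies DESC;
```

Unlike a comparison with =, GROUP BY puts NULLs into the same group, so rows that agree on the non-NULL columns and are NULL in the same places are reported as duplicates too.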

alexkovelsky · Answer

A smart solution comes from "Find duplicate rows with PostgreSQL":

select * from ( SELECT id, ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row FROM tbl ) dups where dups.Row > 1
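As a rough illustration of the same window-function approach on the question's table: the sketch below numbers the rows inside each (year, user_id, sid, cid) group by ascending id and returns the full rows numbered 2 and up, i.e. everything except the first row of each group. The alias rn is an arbitrary choice.

```sql
-- Rows of user_links that repeat an earlier row's
-- (year, user_id, sid, cid) combination. The lowest id in each
-- group gets rn = 1 and is therefore not returned.
SELECT *
FROM (
    SELECT ul.*,
           ROW_NUMBER() OVER (
               PARTITION BY year, user_id, sid, cid
               ORDER BY id
           ) AS rn
    FROM user_links ul
) dups
WHERE dups.rn > 1
ORDER BY year, user_id, sid, cid, id;
```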

pwnyexpress · Answer

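You can join the table to itself on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1), then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by the tn1.id field. This produces a result set that contains the id of a record and an array of all the ids that match the join conditions.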

select tn1.id, array_agg(tn2.id) as duplicate_entries
from table_name tn1
join table_name tn2
  on tn1.year = tn2.year
 and tn1.sid = tn2.sid
 and tn1.user_id = tn2.user_id
 and tn1.cid = tn2.cid
 and tn1.id <> tn2.id
group by tn1.id;

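Obviously, the ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of "truth", the one record that shouldn't get deleted. Maybe you could do something like this: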

with dupe_set as (
  select tn1.id, array_agg(tn2.id) as duplicate_entries
  from table_name tn1
  join table_name tn2
    on tn1.year = tn2.year
   and tn1.sid = tn2.sid
   and tn1.user_id = tn2.user_id
   and tn1.cid = tn2.cid
   and tn1.id <> tn2.id
  group by tn1.id
  order by tn1.id asc
)
select ds.id
from dupe_set ds
where not exists (
  select de from unnest(ds.duplicate_entries) as de where de < ds.id
)

This selects the lowest id in each group of duplicates (assuming the id is an increasing int PK). These would be the ids that you keep around.
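Once you have decided to keep the lowest id per group, one possible follow-up is to delete the other rows and then add the constraint the question is after. This is only a sketch against the question's user_links table: the constraint name is made up, the columns are assumed to be NOT NULL, and since the delete is destructive it is wrapped in a transaction so you can inspect the result and ROLLBACK instead of COMMIT.

```sql
BEGIN;

-- Delete every row for which another row with the same
-- (year, user_id, sid, cid) and a lower id exists, so that only
-- the lowest id of each duplicate group survives.
DELETE FROM user_links a
USING user_links b
WHERE a.year    = b.year
  AND a.user_id = b.user_id
  AND a.sid     = b.sid
  AND a.cid     = b.cid
  AND a.id      > b.id;

-- With the duplicates gone the constraint can be created.
-- The constraint name is a hypothetical example.
ALTER TABLE user_links
    ADD CONSTRAINT user_links_year_user_id_sid_cid_key
    UNIQUE (year, user_id, sid, cid);

COMMIT;
```

If the ALTER TABLE still fails, some duplicates are left and the whole transaction is aborted, so the delete is not applied either.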