Pythonを使用して複雑なテキストファイルを解析する方法は？

Question

複雑なテキストファイルを解析してpandas DataFrame。

もっと簡潔/高速/ Pythony /読みやすくする方法はありますか？

また、この質問を Code Review に付けました。

サンプルファイルを次に示します。

Sample text A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores. School = Riverdale High Grade = 1 Student number, Name 0, Phoebe 1, Rachel Student number, Score 0, 3 1, 7 Grade = 2 Student number, Name 0, Angela 1, Tristan 2, Aurora Student number, Score 0, 6 1, 3 2, 9 School = Hogwarts Grade = 1 Student number, Name 0, Ginny 1, Luna Student number, Score 0, 8 1, 7 Grade = 2 Student number, Name 0, Harry 1, Hermione Student number, Score 0, 5 1, 10 Grade = 3 Student number, Name 0, Fred 1, George Student number, Score 0, 0 1, 0

解析後に結果をどのように表示するかを以下に示します。

 Name Score School Grade Student number Hogwarts 1 0 Ginny 8 1 Luna 7 2 0 Harry 5 1 Hermione 10 3 0 Fred 0 1 George 0 Riverdale High 1 0 Phoebe 3 1 Rachel 7 2 0 Angela 6 1 Tristan 3 2 Aurora 9

現在、私はそれをどのように解析していますか：

import re import pandas as pd def parse(filepath): """ Parse text at given filepath Parameters ---------- filepath : str Filepath for file to be parsed Returns ------- data : pd.DataFrame Parsed data """ data = [] with open(filepath, 'r') as file: line = file.readline() while line: reg_match = _RegExLib(line) if reg_match.school: school = reg_match.school.group(1) if reg_match.grade: grade = reg_match.grade.group(1) grade = int(grade) if reg_match.name_score: value_type = reg_match.name_score.group(1) line = file.readline() while line.strip(): number, value = line.strip().split(',') value = value.strip() dict_of_data = { 'School': school, 'Grade': grade, 'Student number': number, value_type: value } data.append(dict_of_data) line = file.readline() line = file.readline() data = pd.DataFrame(data) data.set_index(['School', 'Grade', 'Student number'], inplace=True) # consolidate df to remove nans data = data.groupby(level=data.index.names).first() # upgrade Score from float to integer data = data.apply(pd.to_numeric, errors='ignore') return data class _RegExLib: """Set up regular expressions""" # use https://regexper.com to visualise these if required _reg_school = re.compile('School = (.*)
') _reg_grade = re.compile('Grade = (.*)
') _reg_name_score = re.compile('(Name|Score)') def __init__(self, line): # check whether line has a positive match with all of the regular expressions self.school = self._reg_school.match(line) self.grade = self._reg_grade.match(line) self.name_score = self._reg_name_score.search(line) if __name__ == '__main__': filepath = 'sample.txt' data = parse(filepath) print(data)

Ezer K · Answer

ここでは、splitとpd.concat（「txt」は質問の元のテキストのコピーを意味します）を使用した私の提案です。名前とグレードがcsvのような形式であるという事実の。ここに行きます：

import pandas as pd from io import StringIO schools = txt.lower().split('school = ') schools_dfs = [] for school in schools[1:]: grades = school.split('grade = ') grades_dfs = [] for grade in grades[1:]: features = grade.split('student number,') feature_dfs = [] for feature in features[1:]: feature_dfs.append(pd.read_csv(StringIO(feature))) feature_df = pd.concat(feature_dfs, axis=1) feature_df['grade'] = features[0].replace('
','') grades_dfs.append(feature_df) grades_df = pd.concat(grades_dfs) grades_df['school'] = grades[0].replace('
','') schools_dfs.append(grades_df) schools_df = pd.concat(schools_dfs) schools_df.set_index(['school', 'grade'])

Mike Robins · Answer

元のコードと同様の方法で、解析正規表現を定義します

import re import pandas as pd parse_re = { 'school': re.compile(r'School = (?P<school>.*)$'), 'grade': re.compile(r'Grade = (?P<grade>\d+)'), 'student': re.compile(r'Student number, (?P<info>\w+)'), 'data': re.compile(r'(?P<number>\d+), (?P<value>.*)$'), } def parse(line): '''parse the line by regex search against possible line formats returning the id and match result of first matching regex, or None if no match is found''' return reduce(lambda (i,m),(id,rx): (i,m) if m else (id, rx.search(line)), parse_re.items(), (None,None))

その後、各生徒に関する情報を収集する行をループします。レコードが完成したら（Scoreがあればレコードは完成します）、レコードをリストに追加します。

行ごとの正規表現一致によって駆動される小さな状態マシンは、各レコードを照合します。特に、スコアと名前は入力ファイルで別々に提供されるため、生徒を番号で成績で保存する必要があります。

results = [] with open('sample.txt') as f: record = {} for line in f: id, match = parse(line) if match is None: continue if id == 'school': record['School'] = match.group('school') Elif id == 'grade': record['Grade'] = int(match.group('grade')) names = {} # names is a number indexed dictionary of student names Elif id == 'student': info = match.group('info') Elif id == 'data': number = int(match.group('number')) value = match.group('value') if info == 'Name': names[number] = value Elif info == 'Score': record['Student number'] = number record['Name'] = names[number] record['Score'] = int(value) results.append(record.copy())

最後に、レコードのリストがDataFrameに変換されます。

df = pd.DataFrame(results, columns=['School', 'Grade', 'Student number', 'Name', 'Score']) print df

出力：

 School Grade Student number Name Score 0 Riverdale High 1 0 Phoebe 3 1 Riverdale High 1 1 Rachel 7 2 Riverdale High 2 0 Angela 6 3 Riverdale High 2 1 Tristan 3 4 Riverdale High 2 2 Aurora 9 5 Hogwarts 1 0 Ginny 8 6 Hogwarts 1 1 Luna 7 7 Hogwarts 2 0 Harry 5 8 Hogwarts 2 1 Hermione 10 9 Hogwarts 3 0 Fred 0 10 Hogwarts 3 1 George 0

いくつかの最適化は、最も一般的な正規表現を最初に比較し、空白行を明示的にスキップすることです。データフレームを作成することで、データの余分なコピーを避けることができますが、データフレームへの追加はコストのかかる操作であると思います。