Fast extraction of a time range from a syslog logfile?

Question

I have a log file in the standard syslog format. It looks like this, except with hundreds of lines per second:

Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...

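It doesn't roll over at exactly midnight, but it never has more than two days in it.

I often have to extract a time slice from this file. I'd like to write a general-purpose script for this that I can call like this: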

$ timegrep 22:30-02:00 /logs/something.log

...and have it pull out the lines from 22:30 onward, across the midnight boundary, until 2 a.m. the next day.

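There are a few caveats:

I don't want to have to type the date(s) on the command line, just the times. The program should be smart enough to figure them out.
The log date format doesn't include the year, so the program should guess based on the current year, but still do the right thing around New Year's Day.
I want it to be fast: it should use the fact that the lines are in order to seek around in the file and use a binary search.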
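
The year-guessing caveat can be illustrated with a minimal Python 3 sketch. The function name `parse_syslog_timestamp` is hypothetical, and the rule it applies (a log line is never from the future, so a timestamp that would fall after "now" belongs to the previous year) is an assumption of this sketch, not something stated in the question. It also assumes an English locale for `%b`.

```python
from datetime import datetime

def parse_syslog_timestamp(text, now):
    """Parse a year-less stamp such as 'Jan 11 07:48:46' into a datetime.

    Assumption (not from the original post): log lines are never from the
    future, so a stamp that would land after `now` is from last year.
    """
    for year in (now.year, now.year - 1):
        try:
            # Put the candidate year in front so Feb 29 parses in leap years.
            stamp = datetime.strptime("{0} {1}".format(year, text),
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # e.g. Feb 29 combined with a non-leap candidate year
        if stamp <= now:
            return stamp
    raise ValueError("cannot place {0!r} at or before {1}".format(text, now))
```

For example, with `now` set to five minutes past midnight on New Year's Day, a line stamped `Dec 31 23:59:58` resolves to the previous year, while `Jan  1 00:01:00` stays in the current one.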

Before I spend a lot of time writing this, does it already exist?

Paused until further notice. · Answer

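Update: I've replaced the original code with an updated version that has numerous improvements. Let's call this (actual) alpha quality.

This version includes:

command-line option handling
command-line date format validation
some try blocks
line reading moved into a function

Original text:

Well, what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than that AWK script that other guy wrote.

It's (pre?)alpha quality. It should have try blocks and input validation and lots of testing, and it could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code: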

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# timegrep.py by Dennis Williamson 20100113
# in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile

# thanks to serverfault user http://serverfault.com/users/1545/mike
# for the inspiration

# Perform a binary search through a log file to find a range of times
# and print the corresponding lines

# tested with Python 2.6

# TODO: Make sure that it works if the seek falls in the middle of
#       the first or last line
# TODO: Make sure it's not blind to a line where the sync read falls
#       exactly at the beginning of the line being searched for and
#       then gets skipped by the second read
# TODO: accept arbitrary date

# done: add -l long and -s short options
# done: test time format

version = "0.01a"

import os, sys
from stat import *
from datetime import date, datetime
import re
from optparse import OptionParser

# Function to read lines from file and extract the date and time
def getdata():
    """Read a line from a file

    Return a Tuple containing:
        the date/time in a format such as 'Jan 15 20:14:01'
        the line itself

    The last colon and seconds are optional and
    not handled specially

    """
    try:
        line = handle.readline(bufsize)
    except:
        print("File I/O Error")
        exit(1)
    if line == '':
        print("EOF reached")
        exit(1)
    if line[-1] == '\n':
        line = line.rstrip('\n')
    else:
        if len(line) >= bufsize:
            print("Line length exceeds buffer size")
        else:
            print("Missing newline")
        exit(1)
    words = line.split(' ')
    if len(words) >= 3:
        linedate = words[0] + " " + words[1] + " " + words[2]
    else:
        linedate = ''
    return (linedate, line)
# End function getdata()

# Set up option handling
parser = OptionParser(version = "%prog " + version)

parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\
\twhere times are in the form hh:mm[:ss]"

parser.description = "Search a log file for a range of times occurring yesterday \
and/or today using the current time to intelligently select the start and end. \
A date may be specified instead. Seconds are optional in time arguments."

parser.add_option("-d", "--date", action = "store", dest = "date",
                default = "",
                help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.")

parser.add_option("-l", "--long", action = "store_true", dest = "longout",
                default = False,
                help = "Span the longest possible time range.")

parser.add_option("-s", "--short", action = "store_true", dest = "shortout",
                default = False,
                help = "Span the shortest possible time range.")

parser.add_option("-D", "--debug", action = "store", dest = "debug",
                default = 0, type = "int",
                help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2")

(options, args) = parser.parse_args()

if not 0 <= options.debug <= 2:
    parser.error("debug level out of range")
else:
    debug = options.debug    # 1 = print some debug output, 2 = print a little more, 0 = none

if options.longout and options.shortout:
    parser.error("options -l and -s are mutually exclusive")

if options.date:
    parser.error("date option not yet implemented")

if len(args) != 3:
    parser.error("invalid number of arguments")

start = args[0]
end   = args[1]
file  = args[2]

# test for times to be properly formatted, allow hh:mm or hh:mm:ss
p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')

if not p.match(start) or not p.match(end):
    print("Invalid time specification")
    exit(1)

# Determine Time Range
yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d")
today     = datetime.now().strftime("%b %d")
now       = datetime.now().strftime("%R")

if start > now or start > end or options.longout or options.shortout:
    searchstart = yesterday
else:
    searchstart = today

if (end > start > now and not options.longout) or options.shortout:
    searchend = yesterday
else:
    searchend = today

searchstart = searchstart + " " + start
searchend = searchend + " " + end

try:
    handle = open(file,'r')
except:
    print("File Open Error")
    exit(1)

# Set some initial values
bufsize = 4096  # handle long lines, but put a limit on them
rewind  =  100  # arbitrary, the optimal value is highly dependent on the structure of the file
limit   =   75  # arbitrary, allow for a VERY large file, but stop it if it runs away
count   =    0
size    = os.stat(file)[ST_SIZE]
beginrange   = 0
midrange     = size / 2
oldmidrange  = midrange
endrange     = size
linedate     = ''

pos1 = pos2  = 0

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend))

# Seek using binary search
while pos1 != endrange and oldmidrange != 0 and linedate != searchstart:
    handle.seek(midrange)
    linedate, line = getdata()    # sync to line ending
    pos1 = handle.tell()
    if midrange > 0:              # if not BOF, discard first read
        if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line))
        linedate, line = getdata()

    pos2 = handle.tell()
    count += 1
    if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate))
    if searchstart > linedate:
        beginrange = midrange
    else:
        endrange = midrange
    oldmidrange = midrange
    midrange = (beginrange + endrange) / 2
    if count > limit:
        print("ERROR: ITERATION LIMIT EXCEEDED")
        exit(1)

if debug > 0: print("...stopping: '{0}'".format(line))

# Rewind a bit to make sure we didn't miss any
seek = oldmidrange
while linedate >= searchstart and seek > 0:
    if seek < rewind:
        seek = 0
    else:
        seek = seek - rewind
    if debug > 0: print("...rewinding")
    handle.seek(seek)

    linedate, line = getdata()    # sync to line ending
    if debug > 1: print("...junk: '{0}'".format(line))

    linedate, line = getdata()
    if debug > 0: print("...comparing: '{0}'".format(linedate))

# Scan forward
while linedate < searchstart:
    if debug > 0: print("...skipping: '{0}'".format(linedate))
    linedate, line = getdata()

if debug > 0: print("...found: '{0}'".format(line))

if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate))

# Now that the preliminaries are out of the way, we just loop,
#     reading lines and printing them until they are
#     beyond the end of the range we want

while linedate <= searchend:
    print line
    linedate, line = getdata()

if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend))
handle.close()
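
The two TODO items about a seek landing mid-line, or exactly on a line boundary, are the tricky part of this approach. The following is a minimal Python 3 sketch of one way to handle them. It is not the script above: the names `first_line_start` and `seek_to_first` are hypothetical, the file is read in binary mode, and the default key (the first 15 bytes of each line) only sorts chronologically within a single month, which is an assumption of the sketch.

```python
import io

def first_line_start(handle, offset):
    """Return the offset of the first line that begins at or after `offset`."""
    if offset == 0:
        return 0
    # Back up one byte: if `offset` is already a line start, the byte before
    # it is the newline, so readline() consumes only that byte.
    handle.seek(offset - 1)
    handle.readline()
    return handle.tell()

def seek_to_first(handle, size, target, key=lambda line: line[:15]):
    """Leave `handle` at the first line whose key is >= `target`.

    Assumes the file is sorted by `key`. Returns the byte offset,
    or `size` if no line qualifies.
    """
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        handle.seek(first_line_start(handle, mid))
        line = handle.readline()
        if line and key(line) < target:
            lo = mid + 1   # the wanted line starts after `mid`
        else:
            hi = mid       # wanted line is here or earlier (or we hit EOF)
    start = first_line_start(handle, lo)
    handle.seek(start)
    return start

# Small in-memory stand-in for a log file opened with open(path, 'rb').
sample = io.BytesIO(
    b"Jan 11 07:48:46 a\n"
    b"Jan 11 07:49:00 b\n"
    b"Jan 11 07:50:13 c\n"
    b"Jan 11 07:51:22 d\n"
    b"Jan 11 07:58:04 e\n"
)
offset = seek_to_first(sample, len(sample.getvalue()), b"Jan 11 07:50:00")
print(offset, sample.readline())
```

Searching over byte offsets rather than line numbers means no rewind step is needed, at the cost of one extra `readline()` per probe.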

Fred · Answer

Here is a C++ program that applies a binary search. It would need simple modifications (i.e. calling strptime) to work with text dates.

http://gitorious.org/bs_grep/

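I had a previous version with support for text dates, but it was still too slow for the scale of our log files. Profiling showed that over 90% of the time was spent in strptime, so we modified the log format to include a numeric UNIX timestamp as well.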
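
To show why a numeric timestamp is cheap to work with, here is a small Python sketch. The line layout (an integer epoch as the first whitespace-separated field) is an assumption, since the answer does not say where the timestamp was placed, and the names `epoch_key` and `lines_in_range` are hypothetical.

```python
def epoch_key(line):
    """Return the leading integer timestamp of a log line.

    Assumed layout: '<epoch seconds> <rest of the line>'.
    """
    return int(line.split(None, 1)[0])

def lines_in_range(lines, start_epoch, end_epoch):
    """Yield lines whose timestamp lies in [start_epoch, end_epoch].

    Relies on the input being sorted, so it can stop at the first
    line past the end of the range.
    """
    for line in lines:
        stamp = epoch_key(line)
        if stamp > end_epoch:
            break
        if stamp >= start_epoch:
            yield line
```

Extracting the key is a single split and an integer conversion, with no date parsing in the comparison path.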

Jeffrey Devloo · Answer

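This answer is far too late, but it might be useful to some.

I've converted the code from @Dennis Williamson into a Python class that can be used from other Python code.

I've added support for multiple date formats.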

import os
from stat import *
from datetime import date, datetime
import re

# @TODO Support for rotated log files - currently using the current year for 'Jan 01' dates.
class LogFileTimeParser(object):
    """
    Extracts parts of a log file based on a start and enddate
    Uses binary search logic to speed up searching

    Common usage: validate log files during testing

    Faster than awk parsing for big log files
    """
    version = "0.01a"

    # Set some initial values
    BUF_SIZE = 4096  # handle long lines, but put a limit to them
    REWIND = 100  # arbitrary, the optimal value is highly dependent on the structure of the file
    LIMIT = 75  # arbitrary, allow for a VERY large file, but stop it if it runs away

    line_date = ''
    line = None
    opened_file = None

    @staticmethod
    def parse_date(text, validate=True):
        # Supports Aug 16 14:59:01 , 2016-08-16 09:23:09 Jun 1 2005 1:33:06PM (with or without seconds, milliseconds)
        for fmt in ('%Y-%m-%d %H:%M:%S %f', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M',
                    '%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S',
                    '%b %d %Y %H:%M:%S %f', '%b %d %Y %H:%M', '%b %d %Y %H:%M:%S',
                    '%b %d %Y %I:%M:%S%p', '%b %d %Y %I:%M%p', '%b %d %Y %I:%M:%S%p %f'):
            try:
                if fmt in ['%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S']:
                    return datetime.strptime(text, fmt).replace(datetime.now().year)
                return datetime.strptime(text, fmt)
            except ValueError:
                pass
        if validate:
            raise ValueError("No valid date format found for '{0}'".format(text))
        else:
            # Cannot use NoneType to compare datetimes. Using minimum instead
            return datetime.min

    # Function to read lines from file and extract the date and time
    def read_lines(self):
        """
        Read a line from a file
        Return a Tuple containing:
            the date/time in a format supported in parse_date and the line itself
        """
        try:
            self.line = self.opened_file.readline(self.BUF_SIZE)
        except:
            raise IOError("File I/O Error")
        if self.line == '':
            raise EOFError("EOF reached")
        # Remove \n from read lines.
        if self.line[-1] == '\n':
            self.line = self.line.rstrip('\n')
        else:
            if len(self.line) >= self.BUF_SIZE:
                raise ValueError("Line length exceeds buffer size")
            else:
                raise ValueError("Missing newline")
        words = self.line.split(' ')
        # This results into Jan 1 01:01:01 000000 or 1970-01-01 01:01:01 000000
        if len(words) >= 3:
            self.line_date = self.parse_date(words[0] + " " + words[1] + " " + words[2],False)
        else:
            self.line_date = self.parse_date('', False)
        return self.line_date, self.line

    def get_lines_between_timestamps(self, start, end, path_to_file, debug=False):
        # Set some initial values
        count = 0
        size = os.stat(path_to_file)[ST_SIZE]
        begin_range = 0
        mid_range = size / 2
        old_mid_range = mid_range
        end_range = size
        pos1 = pos2 = 0

        # If only hours are supplied
        # test for times to be properly formatted, allow hh:mm or hh:mm:ss
        p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')
        if p.match(start) or p.match(end):
            # Determine Time Range
            yesterday = date.fromordinal(date.today().toordinal() - 1).strftime("%Y-%m-%d")
            today = datetime.now().strftime("%Y-%m-%d")
            now = datetime.now().strftime("%R")
            if start > now or start > end:
                search_start = yesterday
            else:
                search_start = today
            if end > start > now:
                search_end = yesterday
            else:
                search_end = today
            search_start = self.parse_date(search_start + " " + start)
            search_end = self.parse_date(search_end + " " + end)
        else:
            # Set dates
            search_start = self.parse_date(start)
            search_end = self.parse_date(end)
        try:
            self.opened_file = open(path_to_file, 'r')
        except:
            raise IOError("File Open Error")
        if debug:
            print("File: '{0}' Size: {1} Start: '{2}' End: '{3}'"
                  .format(path_to_file, size, search_start, search_end))

        # Seek using binary search -- ONLY WORKS ON FILES THAT ARE SORTED BY DATE (should be true for log files)
        try:
            while pos1 != end_range and old_mid_range != 0 and self.line_date != search_start:
                self.opened_file.seek(mid_range)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                pos1 = self.opened_file.tell()
                # if not beginning of file, discard first read
                if mid_range > 0:
                    if debug:
                        print("...partial: (len: {0}) '{1}'".format((len(self.line)), self.line))
                    self.line_date, self.line = self.read_lines()
                pos2 = self.opened_file.tell()
                count += 1
                if debug:
                    print("#{0} Beginning: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".
                          format(count, begin_range, mid_range, end_range, pos1, pos2, self.line_date))
                if search_start > self.line_date:
                    begin_range = mid_range
                else:
                    end_range = mid_range
                old_mid_range = mid_range
                mid_range = (begin_range + end_range) / 2
                if count > self.LIMIT:
                    raise IndexError("ERROR: ITERATION LIMIT EXCEEDED")
            if debug:
                print("...stopping: '{0}'".format(self.line))

            # Rewind a bit to make sure we didn't miss any
            seek = old_mid_range
            while self.line_date >= search_start and seek > 0:
                if seek < self.REWIND:
                    seek = 0
                else:
                    seek -= self.REWIND
                if debug:
                    print("...rewinding")
                self.opened_file.seek(seek)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...junk: '{0}'".format(self.line))
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...comparing: '{0}'".format(self.line_date))

            # Scan forward
            while self.line_date < search_start:
                if debug:
                    print("...skipping: '{0}'".format(self.line_date))
                self.line_date, self.line = self.read_lines()
            if debug:
                print("...found: '{0}'".format(self.line))
            if debug:
                print("Beginning: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".
                      format(begin_range, mid_range, end_range, pos1, pos2, self.line_date))

            # Now that the preliminaries are out of the way, we just loop,
            # reading lines and printing them until they are beyond the end of the range we want
            while self.line_date <= search_end:
                # Exclude our 'Nonetype' values
                if not self.line_date == datetime.min:
                    print self.line
                self.line_date, self.line = self.read_lines()
            if debug:
                print("Start: '{0}' End: '{1}'".format(search_start, search_end))
            self.opened_file.close()
        # Do not display EOFErrors:
        except EOFError as e:
            pass

Michael Graff · Answer

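From a quick search on the net, there are tools that extract based on keywords (such as FIRE), but nothing that extracts a date range from a file.

It doesn't seem hard to do what you propose:

Search for the start time.
Print out that line.
If the end time is < the start time, and the date of a line is > end and < start, then stop.
If the end time is > the start time, and the date of a line is > end, then stop.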
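
The steps above can be sketched as a plain linear scan in Python. This is only an illustration under stated assumptions: the function name `time_slice` is hypothetical, times are given as `HH:MM:SS` strings, and each line is assumed to carry its time of day in columns 7 to 14, as in `Jan 11 07:48:46 ...`. Because it compares times of day only, it also assumes the first line at or after the start time belongs to the intended day.

```python
def time_slice(lines, start, end):
    """Yield the lines between `start` and `end` (both 'HH:MM:SS' strings)."""
    wraps = end < start            # e.g. 22:30:00 to 02:00:00 crosses midnight
    started = False
    for line in lines:
        t = line[7:15]             # 'HH:MM:SS' part of 'Jan 11 07:48:46 ...'
        if not started:
            if t < start:
                continue           # still searching for the start time
            started = True
        if wraps and end < t < start:
            break                  # wrapped range: past the end, before the start
        if not wraps and t > end:
            break                  # same-day range: past the end
        yield line                 # inside the range, so emit the line
```

A call such as `list(time_slice(open(path), "22:30:00", "02:00:00"))` would read every line up to the end of the range, so it trades the speed of a binary search for simplicity.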

It seems straightforward, and I could write it for you if you don't mind Ruby :)

Paused until further notice. · Answer

This will print the range of entries between a start time and an end time, based on how they relate to the current time ("now").

Usage:

timegrep [-l] start end filename

Example:

$ timegrep 18:47 03:22 /some/log/file

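The -l (long) option makes the output span the longest possible range. With it, the start time is interpreted as yesterday even when its hours-and-minutes value is less than both the end time and "now", and the end time is interpreted as today even when the HH:MM values of both the start and end times are greater than "now".

Assuming that "now" is "Jan 11 19:00", this is how various example start and end times will be interpreted (without -l, except as noted):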

start  end    range begin  range end
19:01  23:59  Jan 10       Jan 10
19:01  00:00  Jan 10       Jan 11
00:00  18:59  Jan 11       Jan 11
18:59  18:58  Jan 10       Jan 10
19:01  23:59  Jan 10       Jan 11     # -l
00:00  18:59  Jan 10       Jan 11     # -l
18:59  19:01  Jan 10       Jan 11     # -l
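
The date selection can also be written as a small Python function, which makes it easy to try against a fixed "now". This is a sketch that mirrors the string comparisons used by the scripts in this thread. The name `resolve_range` and its parameters are hypothetical, and it assumes an English locale for `%b`.

```python
from datetime import datetime, timedelta

def resolve_range(start, end, now, long_range=False):
    """Map 'HH:MM' start/end times to 'Mon DD HH:MM' strings around `now`."""
    today = now.strftime("%b %d")
    yesterday = (now - timedelta(days=1)).strftime("%b %d")
    clock = now.strftime("%H:%M")
    # Start falls on yesterday if it is later than now, later than the end,
    # or the longest range was requested.
    if start > clock or start > end or long_range:
        start_day = yesterday
    else:
        start_day = today
    # End falls on yesterday only if the whole range is later than now
    # and the longest range was not requested.
    if end > start > clock and not long_range:
        end_day = yesterday
    else:
        end_day = today
    return start_day + " " + start, end_day + " " + end

print(resolve_range("19:01", "00:00", datetime(2010, 1, 11, 19, 0)))
# → ('Jan 10 19:01', 'Jan 11 00:00')
```

One caveat for real log files: `%d` zero-pads the day, whereas traditional syslog timestamps pad single-digit days with a space ("Jan  1"), so the strings would need adjusting during the first nine days of a month.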

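Almost all of the script is setup. The last two lines do all the work.

Warning: no argument validation or error checking is done. Edge cases have not been thoroughly tested. This was written using gawk; other versions of AWK may misbehave.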

#!/usr/bin/awk -f
BEGIN {
    arg=1
    if ( ARGV[arg] == "-l" ) {
        long = 1
        ARGV[arg++] = ""
    }
    start = ARGV[arg]
    ARGV[arg++] = ""
    end = ARGV[arg]
    ARGV[arg++] = ""
    yesterday = strftime("%b %d", mktime(strftime("%Y %m %d -24 00 00")))
    today = strftime("%b %d")
    now = strftime("%R")
    if ( start > now || start > end || long )
        startdate = yesterday
    else
        startdate = today
    if ( end > now && end > start && start > now && ! long )
        enddate = yesterday
    else
        enddate = today
    startdate = startdate " " start
    enddate = enddate " " end
}
$1 " " $2 " " $3 > enddate {exit}
$1 " " $2 " " $3 >= startdate {print}

I think AWK is very efficient at searching through files. I don't think anything else is necessarily going to be faster at searching an unindexed text file.