Mining Invariants From Logs For System Problem Detection

Transcription

Mining Invariants from Logs forSystem Problem DetectionJian-Guang LOU, Qiang FUSoftware AnalyticsMicrosoft Research Asia

BackgroundMany systems produce log messages fortrouble-shooting.– Logs usually record important system actions orevents for trouble shooting, and can reflect theexecution paths of a program– Logs are used to detect problems: Work flow errors -- there are errors occur in theexecution paths; Low performance problems -- the execution time orthe loop number is much larger than normal cases.

Problems Manually inspecting logs is not feasible– Large scale of system– High complexity of system Traditional rule/keyword based log analysis tools:– Heavily depend on the knowledge of operators– Difficult to keep rules updated when components arefrequently revised or upgraded

Automatic Log Analysis Statistical model based methods:-- treat a log sequence as a feature vector– [Xu et al. 2009]: Mine console logs for large-scale systemproblem detection based on PCA analysis.– [Mirgorodskiy et al. 2006]: Use string edit distance tocategorize logs and detect anomalies. Behavior model based methods:-- view a log sequence as a program work flow– [Tan et al. 2008]: Learn and visualize control flow modelsfrom Hadoop logs based on some pre-defined log tokens.– [Cotroneo et al. 2007]: Derive work flow models for a JavaVM.4

Our Basic Idea18:00:01.123 Incoming request no. 118:00:01.125 Start handle req. 118:00:01.256 Request 1 enqueue 18:00.01.375 Request 1. commit18:00.01.576 Request 1. complete Linear InvariantsSample LogsCount(open file) Count(close file)Incoming request no. %dStart handle req. %dCount(task add) Count(sel localdata) Count(sel remotedata) Request %d. commitRequest %d. complete Constant Strings inEXE/DLL files18:00:01.123 Incoming request no. 118:00:01.125 Start handle req. 118:00:01.123 Incoming request no. 118:00:01.256 Request 1 enqueue18:00:01.125 Start handle req. 118:00:01.123 Incoming request no. 1 18:00:01.256 Request 1 enqueue18:00:01.125Start handle req. 118:00.01.375 Request1. commit 18:00:01.256Request 1 enqueue18:00.01.576 Request1. complete18:00.01.375 Request 1. commit 18:00.01.576 Request 1. complete18:00.01.375 Request 1. commit 18:00.01.576 Request 1. complete New LogsLog ParserID0123 xxx LOG KEYIncoming request#Start task WriteID file size LOG KEYWritefile complete0Incoming request#1Start task LOG KEY 2Write IDfile size 0 file complete Incoming request#3Write1 Start task Return response 2 Write file size 3Write file complete xxxReturn response xxxReturn response Structured Logs18:00:102 Incoming request#1.18:00:124 Start task request#118:00:430 Write file size 150K.18:00:500 Write file complete18:00:503 Read 18:00:908 Return response .18:00:102 Incoming request#1.18:00:124 Start task request#118:00:430 Write file size 150K.18:00:500 Write file complete18:00:503 Read 18:00:908 Return response .Anomalies DetectionNormal behaviors can be learned from logs, and then beused to detect anomalies.5

Linear Program Invariant A predicate always holds the same valueunder different normal executions.– For example:ABX 0CCond.X! 0𝑐𝑜𝑢𝑛𝑡 𝐴 𝑐𝑜𝑢𝑛𝑡 𝐵 𝑐𝑜𝑢𝑛𝑡(𝐸)𝑐𝑜𝑢𝑛𝑡 𝐵 𝑐𝑜𝑢𝑛𝑡 𝐶 𝑐𝑜𝑢𝑛𝑡(𝐷)ED

Invariant and Execution Path𝑐𝑜𝑢𝑛𝑡 𝐴 𝑐𝑜𝑢𝑛𝑡 𝐵 𝑐𝑜𝑢𝑛𝑡(𝐸)𝑐𝑜𝑢𝑛𝑡 𝐵 𝑐𝑜𝑢𝑛𝑡 𝐶 𝑐𝑜𝑢𝑛𝑡(𝐷)X 0ABSequential ExecutionCCond.X! 0EDExecution BranchLinear invariants reflect the properties ofexecution path.7

Invariant Violation and Anomaly(1) A violation of invariant often indicates asystem problem.Enter CriticalSection𝑐𝑜𝑢𝑛𝑡 𝐸𝑛𝑡𝑒𝑟 perationsLeave CriticalSectionProblemon Critical Section Operations

Invariant Violation and Anomaly(2) Violated invariants often give diagnosis cues.𝑐𝑜𝑢𝑛𝑡 𝐴 𝑐𝑜𝑢𝑛𝑡(𝐵)𝑐𝑜𝑢𝑛𝑡 𝐵 𝑐𝑜𝑢𝑛𝑡 𝐶 𝑐𝑜𝑢𝑛𝑡(𝐷)X 0ABCond.X! 0Sequential ExecutionCExecution BranchED

Formulation of Invariant A linear invariant can be presented as a linearequation:𝑎0 𝑎1 𝑥1 𝑎2 𝑥2 𝑎𝑚 𝑥𝑚 0where xi is the message count of message i. Given a set of logs, we have11𝑿𝜃 1where𝑥11𝑥21 𝑥𝑛1𝑥12𝑥22 𝑥𝑛2𝜃 a0 , a1 , a2 , , am T𝑥1𝑚𝑥2𝑚𝜃 0 𝑥𝑛𝑚

What Is A Meaningful Invariant?-- Sparse Non-zero Coefficients𝑐 𝐵 𝑐 𝐶 𝑐(𝐷)X 0𝑐 𝐴 𝑐 𝐵Aare more meaningful thanBCCond.X! 0𝑐 𝐴 3𝑐 𝐵 2𝑐 𝐸 2𝑐 𝐶 2𝑐 𝐷 0Any vector in the Null Space of X is an invariant;Only sparse invariants are interested.ED

What Is A Meaningful Invariant?-- Integer CoefficientsElementary work flow structures can be interpreted byinteger invariants.AAAB BSequentialBCBranchCJoinInteger invariants are easy to be understood by humanoperators.

Problem Statement Due to noise pollution, mining invariants is to findinteger sparse solutions of regression.11𝑿𝜃 1𝑥11𝑥21 𝑥𝑛1𝑥12𝑥22 𝑥𝑛2 𝑥1𝑚𝑥2𝑚𝜃 0 𝑥𝑛𝑚𝒂𝒓𝒈 𝒎𝒊𝒏 𝑿𝜃𝟎– Challenges: A typical integer sparse regulation problem (NP-Hard) Traditional method is to relax 0-norm to 1-norm. However, itcannot guarantee to find all invariants.

Learning Invariant OverviewParsingMessageCount Vector00001: [ 1,1,1,2,5,3]00002: [1,3,3,3,6,3]00003: [1,2,2,2,4,2]00004: [ 1,1,1,2,5,3]00006: [1,3,3,3,6,3]00007: [1,2,2,2,4,2]00007: [1,2,2,2,4,2]00007: [1,2,2,2,4,2]---------------------------- Invariant 1 [ 1,1,0,0,0,0,0], 2 [0,0,1, 1,0,0,0], 3 [0,0,0,0, 1,1, 1].Anomalies00091: [ 1,1,0,0,0,0]violates the invariantθ200732: [1,3,3,3,4,2]violates the invariantθ3 ------------------------- Four Steps:Log parsing, Message Grouping and Counting,Search Invariants, and Anomaly Detection

Step 1: Log Parsing Goal: Free text logs structured logsParametersNew job added to schedule, jobId 8821, priority 64Log Msg Signature Basic idea:– Log messages of the same message type usually havea high similarity.– Words of log message signatures are often embeddedas constant strings in DLL/EXE files (e.g. symbols).

Parsing: Our SolutionParametersExe/DLLSample LogsCS in LogMessageClassification & SplittingConst StringsLog SignatureStatusParametersLimitation:Coverage depends on the sample logs.

Step 2: Message Grouping-- Cogenetic Parameters Cogenetic parameters: parameters record thevalue of the same system variable.Job ID18:51:05.767 Image file of job 00001 loaded in 0 seconds, size 57717.18:51:06.048 .18:51:06.329 Start a new thread 0x0CE4 to lunch job 00001.18:51:06.501 .18:51:06.658 Job 00001 finished.18:51:06.673 Image file of job 00002 loaded in 0 seconds, size 70795.

Detecting Cogenetic Parameter Groups Two parameters are cogenetic, if– (1) they have the same value set(e.g. A B),– (2) one parameter’s value set is a sub-set of the other’s (e.g.B C),– (3) there is a mid-parameter satisfying (2) with these twoparameters. (e.g. B C && B D)X 0ABCCond.X! 0ED Each cogentic parameter group corresponds a systemvariable.

Results of Parameter Grouping Testing on Hadoop logs, we detect thefollowing meaningful program variables:– Map/Reduce Task ID, Map/Reduce Task AttemptID, Block ID, and JVM ID, Storage ID, IP addressand port, and write data size of task shuffling. Testing on CloudDB, we found programvariables:– request ID, CASNode ID, replica operation ID,

Message Grouping Variables that identify objects are often recorded inlogs, which can be used to group messages.– e.g. Request ID, task ID, .18:51:05.767 Image file of job 00001 loaded .18:51:06.048. 18:51:06.329 Start a new thread 0x0CE4 tolunch job 00001.18:51:06.433Image file of job 00002 loaded in 0 seconds, size 70795.18:51:06.501. 18:51:06.629Start a new thread 0x0DE5 to lunchjob 00002. 18:51:06.701. 18:51:06.758 Job 00001 finished. 18:51:06.927Job 00002 finished.

Message Count Vector Count number for each message type in a group.18:51:05.767 Image file of job 00001 loaded . 18:51:06.329 Start a new thread 0x0CE4 tolunch job 00001. 18:51:06.758 Job 00001 finished. 18:51:06.048 .18:51:06.433 Image file of job 00002 loaded in0 seconds, size 70795.18:51:06.501 .18:51:06.629 Start a new thread 0x0DE5 tolunch job 00002.18:51:06.701 . 18:51:06.927 Job 00002 finished. Image loadedJob lunchJob finished Log group 1111 Log group 2111

Step 3: Search Invariant A hypotheses and testing framework:– Try any combination of non-zero coefficients toconstruct invariant candidates,– Then validate whether they fit the log data. Computational cost How to reduce computational cost based onproperties of log analysis?

Reduce Computational Cost(1) Divide and conquer: (message grouping)– Log messages often form some groups. There are strong correlationsin the same group, and no obvious inter-group correlation.𝑚1𝑚2𝑖𝑐𝑚 𝑜1𝑜𝑖 1𝑖𝑐𝑚 2𝑖 1 Limit number of non-zero coefficients :– In most systems, the dimension of row space of X is often very small(e.g. r 5), the number of non-zero coefficients is very small [Xu2009].

Reduce Computational Cost(2) Early termination:– According to linear algebra, there is at most k independentinvariants, k is the dimension of X’s Null Space. We canearly terminate the search process once we obtain kindependent invariants. Skipping strategy:– A linear combination of invariants is also a valid invariant.We can skip the searching on the combinations ofobtained invariants.

Results of Computational CostReductionOur reduction strategy largely reduces the computational cost,especially when the dimension is large.Message group of related object identifierFull search spacesizeOur search spacesizeHadoop logs with MapTask ID12837Hadoop logs with ReduceTask ID86Hadoop logs with MapTask Attempt ID2684354563310Hadoop logs with ReduceTask Attempt ID33554432730Hadoop logs with JVM ID12816

Results of Discovered Invariants onHadoopLogBased on manually checking, we find all discovered invariants are reasonable, nofalse positive.Message groups of related object identifiersInvariants( 3 coef.)Invariants ( 4coef.)Hadoop logs with MapTask ID30Hadoop logs with ReduceTask ID10Hadoop logs with MapTask Attempt ID213Hadoop logs with ReduceTask Attempt ID170Hadoop logs with Data Block ID90Hadoop logs with JVM ID50Hadoop Logs with Storage ID30Logs with IP/portLogs with task write packet size4100Note: we also find invariants for CloudDB logs, but no ground truth for evaluation.

Results of Anomaly Detection onHadoop LogWe compare our results with PCA based algorithm [Xu09]. Our method not only detects anomalies,but also gives the cue why they are abnormal.Anomaly DescriptionTasks fail due to heart beat lost.A killed task continued to be in RUNNING state in both the JobTracker and thatTaskTracker for everAsk more than one node to replicate the same block to a single node simultaneouslyWrite a block already existedTask JVM hangSwap a JVM, but mark it as unknown.Swap a JVM, and delete it immediatelyTry to delete a data block when it is opened by a clientJVM inconsistent stateThe pollForTaskWithClosedJob call from a Jobtracker to a task tracker times outwhen a job completes.False Positive DescriptionKilled speculative tasksJob cleanup and job setup tasksThe data block replica of Java execution fileUnknown ReasonPCA based Method397730Our Method7791133262625204872113733252048721164163PCA MethodOur Method58532356499177777800

A Detail Case-- Anomaly of “a task JVM hang”𝑐 𝐽𝑉𝑀 𝑠𝑝𝑎𝑤𝑛𝑒𝑑 𝑐 𝐽𝑉𝑀 𝑒𝑥𝑖𝑡𝑒𝑑a JVM spawnedbut did not exitJVM was hung.𝑐 𝐽𝑉𝑀 𝑠𝑝𝑎𝑤𝑛𝑒𝑑 𝑐 𝑔𝑖𝑣𝑒𝑛 𝑡𝑎𝑠𝑘A task was givenafter a JVM spawnedJVM got hungafter it was assigned a task

Results on Production Programs CloudDB logs:– 266 invariants are learned from CASNode levellogs;– 9 anomalies are detected in one test set andconfirmed by a tester. SharePoint logs:– Detected anomalies are checked by developers;– About 78% anomalies are real errors.

Summary We proposed an automatic anomaly detectiontechnique by mining linear invariants fromlogs including:– A method to automatically identify a set ofparameters that correspond to the same programvariable;– A method to discover sparse integer linearinvariants from logs;– A method to detect anomalies based on invariantsand gives intuitive cues for diagnosis.

THANKS!

Hadoop logs with MapTask ID 128 37 Hadoop logs with ReduceTask ID 8 6 Hadoop logs with MapTask Attempt ID 268435456 3310 Hadoop logs with ReduceTask Attempt ID 33554432 730 Hadoop logs with JVM ID 128 16 Our reduction strategy largely reduces the computational cost, especially when the dimension is large.