Workload Management Scenario
Unexpected Problems
September 28, 2002, early morning: Marcus J. Hoffman, the system operations manager for Great Northwestern Utilities, phones in to inform his superior that he injured his back in a friendly basketball game last night. Overnight his condition worsened, and he will need a physician’s examination and rest before he can resume his work responsibilities. Peter, a recent hire still learning the ropes as a system programmer, will fill in for Marcus (the other experienced system programmer is away at a conference).
Marcus instructs Peter to monitor the critical jobs executed overnight and ensure that everything is set for the coming night's scheduled activities. Peter is not familiar with the tasks that Marcus performs on a daily basis, but he can use PIReporter as an aid to follow the tasks documented in the "monitoring critical job performance" activity.
Peter logs in and inspects the tasks associated with the "System Operation Manager" role. He finds the activity and starts to perform the tasks.
The first step calls for inspecting the output of the Critical Jobs Execution – Yesterday report, a report that Marcus has scheduled to run every morning. Peter can compare the output of this report to the output of the previous day's execution report to ensure that the critical jobs executed in their expected time frames and that key performance indicators are within reasonable thresholds.
While checking the report for critical jobs, Peter notices that GMLADYC, one of the mission-critical jobs, did not finish on time. GMLADYC started on time and used 2.5 CPU minutes, but it took almost 7.5 hours of elapsed time to complete. Because GMLADYC is mission critical, Peter needs to find out why it did not finish on time and fix the problem so that it does not recur.
Is it a problem, really?
Peter believes he found a problem based on the Elapsed Time key performance indicator (KPI). He consults the activity task analysis notes that read as follows:
In cases where a (critical) job displays an unexpected key performance indicator value, verify that it is an exceptional value by comparing it to results obtained for the same job in previous interval instances (weekly, monthly, etc.), as some jobs perform differently at specific intervals (a special day of the week, month or quarter, for example).
Peter finds the report results for September 21, 2002 (one week earlier), which show similar CPU time and I/O activity values. That run, however, finished in less than three hours of elapsed time, so last night's result is unlikely to be a regular weekly performance surge.
Peter still wants to compare the results to the previous month to ensure that this is not some kind of "month end" processing performance that can be expected. It seems that Marcus deleted the results of the August 28 execution, but it is easy to create the desired output by changing two report parameters (From Date, To Date) in the PIReporter report definition and executing the new report. With the new output, Peter compares the job's KPIs and once again rules out the possibility of a special time interval performance surge. It seems that Peter has identified a real problem that occurred during last night's execution. He now needs to find the cause of the problem and ensure that the critical jobs end on time, before the start of the next business day.
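The comparison Peter performs by hand can be sketched in a few lines of Python. This is only an illustration of the reasoning, not PIReporter functionality: the JobRun record, the is_exceptional helper and the 1.5 tolerance factor are all hypothetical, and the baseline elapsed times are placeholder values chosen to match the scenario ("less than three hours" a week earlier; the August figure is not given in the text).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JobRun:
    """One execution of a batch job (hypothetical record layout)."""
    job_name: str
    run_date: str          # ISO date of the run
    cpu_minutes: float
    elapsed_hours: float


def is_exceptional(current: JobRun,
                   baselines: Sequence[JobRun],
                   factor: float = 1.5) -> bool:
    """Return True when the current elapsed time exceeds every baseline
    run by more than the given factor (an assumed tolerance)."""
    if not baselines:
        raise ValueError("at least one baseline run is required")
    return all(current.elapsed_hours > factor * b.elapsed_hours
               for b in baselines)


# Last night's run, using the figures from the scenario.
last_night = JobRun("GMLADYC", "2002-09-28", 2.5, 7.5)

# Baselines one week and one month earlier (illustrative values only).
baselines = [
    JobRun("GMLADYC", "2002-09-21", 2.5, 2.9),
    JobRun("GMLADYC", "2002-08-28", 2.5, 3.0),
]

print(is_exceptional(last_night, baselines))
```

Requiring the value to exceed every baseline mirrors the task note: a run that merely matches a known weekly or monthly peak is not treated as a problem.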
Who used my CPU?
The Information Activity provides the following information in the next task:
To find potential causes for the key performance indicator problem, look at other jobs that executed during the same time window. Pay special attention to jobs that used the resource suspected of affecting the key performance indicator (for example, an I/O clash, database connections or network resources).
Peter creates a custom report, using the same model. His goal is to find out what could have prevented GMLADYC from finishing on time. He selects the same date and time window (From Date = To Date = September 28, 2002; From Time = 12:00AM, To Time = 06:00AM). He is interested in all other jobs executing during that time (Exclude Jobs = GMLADYC). To help him focus on the most probable clashing jobs, he sorts the data by CPU Time (Sort By = CPU Time, Descending).
Peter phones Marcus to review the results. Marcus does not recognize UORION, MTMERED and KIZKAGA, three of the jobs that consumed the greatest amount of system resources, so Peter continues his investigation.
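The logic behind this report (restrict to a time window, exclude the job under investigation, sort by CPU time) might look like the following Python sketch. It is not the PIReporter report definition; the clashing_jobs function, the dictionary layout, the placeholder job names JOB_A to JOB_C and all sample start, end and CPU values are assumptions made for illustration.

```python
from datetime import datetime


def clashing_jobs(runs, window_start, window_end, exclude_jobs):
    """Return runs that overlap the window, skipping excluded jobs,
    ordered by CPU time from highest to lowest."""
    overlapping = [
        run for run in runs
        if run["job"] not in exclude_jobs
        and run["start"] < window_end      # started before the window closed
        and run["end"] > window_start      # ended after the window opened
    ]
    return sorted(overlapping,
                  key=lambda run: run["cpu_minutes"],
                  reverse=True)


# Hypothetical sample data for the night of September 28, 2002.
runs = [
    {"job": "GMLADYC", "start": datetime(2002, 9, 28, 0, 30),
     "end": datetime(2002, 9, 28, 8, 0), "cpu_minutes": 2.5},
    {"job": "JOB_A", "start": datetime(2002, 9, 28, 1, 0),
     "end": datetime(2002, 9, 28, 3, 0), "cpu_minutes": 40.0},
    {"job": "JOB_B", "start": datetime(2002, 9, 28, 2, 0),
     "end": datetime(2002, 9, 28, 2, 30), "cpu_minutes": 5.0},
    {"job": "JOB_C", "start": datetime(2002, 9, 28, 7, 0),
     "end": datetime(2002, 9, 28, 7, 30), "cpu_minutes": 90.0},
]

suspects = clashing_jobs(
    runs,
    window_start=datetime(2002, 9, 28, 0, 0),   # From Time = 12:00AM
    window_end=datetime(2002, 9, 28, 6, 0),     # To Time = 06:00AM
    exclude_jobs={"GMLADYC"},                   # Exclude Jobs = GMLADYC
)
for run in suspects:
    print(run["job"], run["cpu_minutes"])
```

In this sample, JOB_C uses the most CPU but runs outside the window, so it does not appear among the suspects.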
Researching Past Behavior
The activity Peter was following does not have specific instructions for the situation he unearthed, but it has led him to identify a problem and a probable cause. He is now familiar with the Distribution Summary model and consults the model's help to define a new report that is specific to the three unknown jobs.
He sets the date range to the most recent two weeks (From Date = September 14, 2002, To Date = Today).
He restricts the output to the three suspect jobs (Job Name = KIZKAGA, UORION, MTMERED).
He sets the summary keys to be Job Name and sorts by Job Start Date.
The report shows that two of the jobs have been running around 6 am for the past two weeks, but that changed on the morning of September 28. The third job, on the other hand, always starts around 2 am, so it is not the cause of the problem.
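The pattern Peter reads off this report, a job whose start time on one date departs from its usual start time, can also be expressed as a short Python sketch. The start_time_shifts helper, the one-hour threshold and the sample rows are hypothetical. The text does not say which two of the three jobs moved or when they ran on September 28, so the sample uses placeholder names and made-up hours.

```python
from collections import defaultdict
from statistics import median


def start_time_shifts(runs, target_date, threshold_hours=1.0):
    """Compare each job's start hour on target_date with the median of
    its start hours on all other dates. Return {job: (typical, actual)}
    for jobs that moved by more than threshold_hours."""
    history = defaultdict(list)
    on_target = {}
    for job, run_date, start_hour in runs:
        if run_date == target_date:
            on_target[job] = start_hour
        else:
            history[job].append(start_hour)

    shifts = {}
    for job, actual in on_target.items():
        if not history[job]:
            continue                      # no earlier runs to compare with
        typical = median(history[job])
        if abs(actual - typical) > threshold_hours:
            shifts[job] = (typical, actual)
    return shifts


# (job, ISO run date, start hour as a decimal) -- illustrative values only.
runs = [
    ("JOB_A", "2002-09-14", 6.0), ("JOB_A", "2002-09-21", 6.0),
    ("JOB_A", "2002-09-27", 6.0), ("JOB_A", "2002-09-28", 1.0),
    ("JOB_B", "2002-09-14", 6.0), ("JOB_B", "2002-09-21", 6.25),
    ("JOB_B", "2002-09-27", 6.0), ("JOB_B", "2002-09-28", 1.5),
    ("JOB_C", "2002-09-14", 2.0), ("JOB_C", "2002-09-21", 2.0),
    ("JOB_C", "2002-09-27", 2.0), ("JOB_C", "2002-09-28", 2.0),
]

shifted = start_time_shifts(runs, target_date="2002-09-28")
for job, (typical, actual) in sorted(shifted.items()):
    print(job, typical, actual)
```

Using the median rather than the mean keeps a single earlier outlier from hiding or exaggerating the change on the target date.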
Armed with this information, Peter can find the user who submitted the jobs and call a meeting to discuss job submission times.
Conclusions
PIReporter for zOS provides a combination of data, analysis tools and embedded knowledge that enabled a user to solve a potentially serious problem. We see how Peter, an up-and-coming system programmer, was able to identify a problem, research it and find a solution, all while learning on the job:
Without the data collection and storage of PIReporter, Peter would have had to research piles of daily reports or write a custom report on archived tape data.
Without the analysis application that allowed him to change parameters and create multiple reports as he researched the problem and learned on the go, retrieving the information needed for resolution would have taken days or weeks and would have required multiple custom solutions.
The Information Activity allowed Peter to focus on the task and find probable causes for the problem by following a series of steps that, essentially, create a "best practice" for this situation. When he got close enough to understanding the cause, he was able to create a custom solution quickly using the existing analysis application that modeled heavier and lighter usage periods on the system. PIReporter's analysis applications made it possible for Peter to not only quickly find and understand the problem, but also to explain it to his superior and co-workers, so decisions can be made and actions taken.
System programmers and managers are entrusted with maintaining a vital resource in the enterprise. Effectiveness and efficiency are enterprise-critical. Installing new hardware and software components, monitoring system and subsystem performance, solving problems and planning for the future are all routine tasks for IT managers and administrators. Failure to recognize problems, find their causes, solve them or plan for optimal performance wastes precious computing resources and can bring hundreds or thousands of employees, business partners and customers to a stop.