Data analysis in practice: A case collection

by Jostein Lillestøl
https://hdl.handle.net/11250/2500748

This is a case collection in data analysis containing cases mostly relevant for business studies.

Sat, 27 Jul 2024 11:54:52 GMT

20 - Customer Satisfaction
https://hdl.handle.net/11250/2500920
Lillestøl, Jostein

Topics: Possible methods of analysis are:
- Cross tabulation and chi-square tests
- Analysis of variance
- Regression analysis
- Factor analysis
- Discriminant analysis
- Cluster analysis

Context: For producers of commodities and services it is important to acquire knowledge about customer satisfaction, in particular how much the different aspects of the product matter for the total impression and for repeat buying. In this case we will study the issue in relation to the hotel business. Our data are responses to a customer survey taken at Norwegian tourist hotels.

We want to answer the following questions:
1. How important are the different aspects of the stay for the total impression and enjoyment and the likelihood of coming back to the hotel and the region?
2. Are there different segments of guests with respect to the importance of the different aspects and opportunities for the stay?
3. Are there different segments of guests with respect to their satisfaction with their stay?

The questionnaire is provided at the end, and contains the following:
1. Background information on the respondent (Questions 1 to 10 and Question 21). This information may be used for dividing the respondents into segments (e.g. according to nationality, gender, type of tour, hotel experience etc.).
2. Importance scores (Questions 11 and 19, on a scale from 1 to 7). Question 11 deals with 27 aspects of the product, and the guest is asked to judge the importance of each for a “successful hotel experience”. In Question 19 the guest is encouraged to state the importance of different factors when planning the vacation.
Questions 11 and 19 can both be used for several purposes, among them identifying different “benefit segments”, i.e. groups of respondents who emphasize similar features of a successful stay, and identifying the underlying dimensions of judgement.
3. Satisfaction scores (Question 12, on an 11-point scale from -5 to 5). This encompasses all 27 aspects of Question 11 above, but here the issue is how satisfied the customer actually is with each aspect of the current stay. Here the respondent also had the opportunity to mark “not relevant”, which appears as missing (blank) in the data file.
4. Total satisfaction scores (Questions 14 to 17, on an 11-point scale from -5 to 5 or 1 to 11). These deal with how the guests actually judge the current stay and hotel from a total perspective. The scores on these questions can be combined to give a single score for total satisfaction, e.g. by summing the four scores.

Task

A-version: Perform a customer satisfaction study that throws light on the questions asked above, i.e.:
1. How important are the different aspects of the stay for the total impression and the likelihood of coming back to the hotel and the region?
2. Are there different segments of guests with respect to the importance of the different aspects and opportunities for the stay?
3. Are there different segments of guests with respect to their satisfaction with their stay?
In doing this it is important to be able to judge
i. the statistical issues (assumptions underlying the chosen methods, the handling of missing observations etc.),
ii. the practical implications of the findings for improvement opportunities.

B-version:
1. Analyse the importance of different aspects for the total satisfaction of the respondent. At least three different approaches are available and may be compared:
   a. Compute the averages of the importance scores of the respondents (Question 11).
   b. Perform a multiple regression analysis, using the total satisfaction as the dependent variable (Questions 14-17) and the satisfaction scores for each aspect (Question 12) as the explanatory variables. The regression coefficients will then give the relative importance each aspect has for the total satisfaction.
   c. Perform a multiple regression analysis, using the total satisfaction as the dependent variable (Questions 14-17) and different index scores as the explanatory variables (see pt. 3 below). The regression coefficients will then give the relative importance each of the indices (dimensions) has for the total satisfaction of the respondent.
2. Compare the importance of the different aspects for the total satisfaction within different respondent segments. This may be done as follows:
   a. First identify different respondent segments, which can be done in different ways:
      i. By separating the respondents into groups according to their response on specific questions (e.g. gender, frequent traveller or not, lone traveller or not).
      ii. By separating the respondents into groups according to factor scores obtained by a factor analysis of the importance variables (Question 11).
      iii. By doing a cluster analysis on the importance scores from Question 11, and identifying different segments, say 3 to 5.
   b. Then perform separate analyses for each segment as under pt. 1 above, and compare the explanatory power (R²) and the regression coefficients.
3. Identification of underlying dimensions of judgement and construction of indices. It is convenient to uncover some underlying “satisfaction dimensions” that can be brought into the analysis above. To reduce the amount of data we may
   a. either use some “product element model” from the literature, group the variables according to it directly and compute an index (average or total sum) for each respondent for each element,
   b. or go via a factor analysis of Question 12 (or 11), and for each respondent compute average or sum satisfaction scores based on Question 12 for each factor identified by the factor analysis.
4. Compare the satisfaction scores for different segments. For the segments identified under pt. 2, we can determine by an analysis of variance (ANOVA) whether some vacation products “hit” some segments better than others. In other words, is it more difficult to satisfy some segments than others? The analyses may be done alternatively for the
   a. Total satisfaction scores (Questions 14 to 17).
   b. Satisfaction scores for each aspect (Question 12).
   c. Indices based on the answers to Question 12 (see pt. 3 above).

Sun, 21 Oct 2007 00:00:00 GMT

19 - Hospital Expenses
https://hdl.handle.net/11250/2500919
Lillestøl, Jostein

Topic: Analysis of variance (ANOVA)

Context: A county wants to look into a specific type of cost at three of its regional hospitals. Over the last three years the average cost per patient day of stay, adjusted for general price increases, was:

            Year 1  Year 2  Year 3  Average
Hospital A     398     408     380      395
Hospital B     452     498     494      481
Hospital C     501     516     479      499

All three hospitals have three units: surgery (S), medicine (M) and gynaecology (G), which includes birth delivery, and should be comparable with respect to costs. However, the distribution of patients on the units may differ among the hospitals, and experience has shown that the percentage of patients at the different units at each hospital is as follows:

            Unit S  Unit M  Unit G
Hospital A     45%     35%     20%
Hospital B     35%     35%     30%
Hospital C     40%     35%     25%

Each hospital and unit has in principle the same routines, which have not been altered in the years in question, but it is possible that the routines are practised differently at the hospitals and units.
Data: The data file Hospital_Expenses.XLS contains four columns with the variables (codes/unit in parentheses): Year (1, 2, 3), Hospital (A, B, C), Unit (S, M, G) and Expense (NOK per patient day).

Task: The supervisory body for the hospitals in the county wants to see if there are systematic differences between the hospitals and their units and over the years, or whether the differences may just as well be due to chance. If chance is a plausible explanation, there is no reason to make a fuss about the observed differences.

Version A: You are assigned to the task, and decide to try out the different estimating and testing procedures you learned at school, among them analysis of variance.

Version B: Consider first the data in the first table, which disregards the hospital unit.
(a) Assume a constant yearly expected expense level within each hospital, and consider the data for the three years in the table as independent observations of the expenses. Estimate the expected expenses and test whether there are systematic differences between them. Is there sufficient evidence that any hospital has a lower expected expense than the others?
(b) Repeat the testing in (a) under the assumption of a general yearly level difference in expenses. Test also whether there are systematic yearly differences. Is the independence assumption in (a) warranted?
(c) Give reasons why the analysis in (b) may be misleading.
Now consider the expenses taking the hospital unit into account as well.
(d) Make a two-way table of mean expenses for hospital vs. unit, similar to the first table given. Exhibit the table by suitable graphs. Do you find support for the conclusion in (c)? What kind of analysis can overcome this?
(e) Perform such analyses under different assumptions where we have the opportunity to test whether there are systematic differences between (i) hospitals, (ii) units, (iii) years.
(f) To what extent are we able to check whether the standard assumptions for this kind of analysis are fulfilled?
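As an illustration of tasks (a) and (b), the following is a minimal sketch in plain Python of the two F statistics computed from the first table only. The function names are made up for this example, and the sketch stops at the F statistics; p-values would have to come from an F table or a statistics package. It is not meant as the intended solution of the case.

```python
# Yearly average cost per patient day, copied from the first table of the case.
costs = {
    "A": [398, 408, 380],
    "B": [452, 498, 494],
    "C": [501, 516, 479],
}


def mean(values):
    return sum(values) / len(values)


def one_way_anova(groups):
    """Task (a): hospitals as the only factor, years treated as replicates.

    Returns (F, df_between, df_within).
    """
    all_values = [v for g in groups.values() for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def two_way_anova_no_replication(groups):
    """Task (b): additive hospital + year model, one observation per cell.

    Returns (F_hospital, F_year, df_error).
    """
    rows = list(groups.values())
    r, c = len(rows), len(rows[0])
    grand = mean([v for row in rows for v in row])
    row_means = [mean(row) for row in rows]
    col_means = [mean([row[j] for row in rows]) for j in range(c)]
    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    df_error = (r - 1) * (c - 1)
    ms_error = ss_error / df_error
    return (ss_rows / (r - 1)) / ms_error, (ss_cols / (c - 1)) / ms_error, df_error


if __name__ == "__main__":
    print(one_way_anova(costs))
    print(two_way_anova_no_replication(costs))
```

With only nine numbers the tests have few degrees of freedom, which is part of what tasks (c) to (f) ask you to reflect on.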
Sat, 20 Oct 2007 00:00:00 GMT

18 - Union Card
https://hdl.handle.net/11250/2500918
Lillestøl, Jostein

Topic: T-tests, analysis of variance, regression

Context: The labour union UNI has doubled its membership during the last year, due to desertion from another union. Some new offers to the membership are under consideration, among others a new type of benefit card. In this connection the union has performed a survey among the members, both the old ones (groups A and B) and the new ones (groups C and D). We will here examine two of the questions, which came after a presentation of the features of the card:
- Do you recommend that UNI provides this kind of benefit card? Answer on an 11-level scale from “Absolutely not” (-5) via “Indifferent” (0) to “Absolutely” (+5).
- Will you use such a card yourself? No / Do not know / Yes

The number of members who answered the questionnaire was 400 out of xxx. The results for the following variables are stored in the file Union_Card.XLS:
GROUP (A=1, B=2, C=3, D=4)
AGE (Years)
GENDER (Female=1, Male=2)
STATUS (Single=0, Married/Partnership=1)
ATTITUDE (-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5)
USE (0=No, 1=Do not know, 2=Yes)

Task: The questions to be answered are:
1. To what extent and how is the attitude to the benefit card dependent on gender, status, profession and age (or combinations thereof)?
2. To what extent and how is the use of such a card dependent on gender, status, profession and age (or combinations thereof)?

Fri, 19 Oct 2007 00:00:00 GMT

17 - Tax Audit
https://hdl.handle.net/11250/2500917
Lillestøl, Jostein

Topic: Comparisons, regression and outliers

Context: The establishment “Nels” has live music every Friday and Saturday. The guests have to pay a cover charge of 50 kr (NOK), and if the cloakroom is used, a wardrobe fee of 10 kr.
Value added tax (VAT) shall be paid on the wardrobe fees, but the cover charge is payment for entertainment and is exempt from VAT. Numbered tickets are used, and accounting data with vouchers exist for every evening, covering the number of cover charge tickets (C), the number of wardrobe tickets (W) and the sales turnover in the bar (S).

After a review of the books at “Nels”, the tax revenue service claimed its own appraisal of the income from entrance fees for the year 1997. From the conclusion of the report of the tax auditor we read:

“The registration sheets show that for many evenings (22 out of 104) more wardrobe tickets than entrance tickets are sold. Furthermore, the average bar sales compared with the number of entrance tickets sold are often very high (more than 300 kr, which amounts to about 7 glasses of beer), and vary widely (from 161 kr to 386 kr). Evenings with high registered beer consumption are frequently evenings when more wardrobe tickets than entrance tickets are sold (among others on all 5 evenings where the bar sales per entrance ticket were above 300 kr). This indicates that there must have been far more guests than reported in the books.”

N. Nelson claims that the tax auditor interprets the data completely wrongly, and does not account for a number of obvious circumstances in running a restaurant: The number of guests, and how much money they leave behind in the bar and the wardrobe, depends on the time of the year, weekday, weather and clientele; if the attendance is low early in the evening, the doorman will waive the cover charge. Moreover, every evening will have a number of guests free of charge (X), i.e. guests with a so-called VIP card or audience left over from special arrangements (concerts, shows, cabarets etc.) held prior to the opening for regular guests (31 evenings out of 104 in 1997). N. Nelson claims that on regular evenings it is not uncommon to have more than 100 free guests, and at special arrangements more than 300.
On one occasion in February 1998 which can be documented (by video surveillance of a newly installed ticket system) the numbers were: paying guests C = 792, free guests X = 535, wardrobe tickets W = 881. This was an exceptionally high number of free guests. N. Nelson says that he reckons that 60-70% of the guests use the wardrobe in midwinter, but only 20-30% in midsummer, depending on the weather that evening. N. Nelson also refers to independent market research, which tells that some clientele on average consume about 6-7 glasses of beer per evening. Typically, more is consumed on Fridays than on Saturdays.

Task: Suppose that you, as the external auditor, have accepted the income statement, and are about to assist the owner Nels Nelson in disproving the claim of the tax revenue service.

A-version: You have no guidance on how to proceed to analyze the data.

B-version: You have obtained the following suggestions on how to proceed:
By reading the report of the tax auditor you will see that the argument is mainly connected to a comparison of C with W, and to judging S/C by what is regarded as uncommon consumption and when this occurs. You realise that the role of the number of free guests (X) is largely neglected or underestimated, and that a more proper measure of the average consumption ought to be R = S / (C + X).
(a) Let S = 200 000 and compute R for the four cases. Compare this with what the tax auditor would get if he used a fixed X = 0 or X = 300. Discuss the following claim of the tax auditor: “We regard here the number of free guests as constant, as this will not affect the judgement of the differences between high and low bar sales per guest”.
(b) Compute descriptive statistics for the variables S, C, W, S/C, W/C. Use these results, your calculation in (a) and the information given, and try to argue that the registered number of guests C is reasonable compared to S and W.
(c) A simple regression analysis may be performed, where C is explained by S.
One could consider the outliers and try to argue that they are neither more frequent nor larger than expected. Ask yourself also how extraordinary bar sales may affect the regression.
(d) To examine more closely the use of the wardrobe during the year, a regression may be performed where W/C is explained by month (a 0-1 variable for each month, with January as the baseline). One could provide a table with estimated percentages for each month and try to explain why they are misleading. These numbers may be corrected using the facts in the case description, and one may then decide whether the result supports the claim of N. Nelson concerning the use of the wardrobe.

Thu, 18 Oct 2007 00:00:00 GMT

16 - Lost Sales
https://hdl.handle.net/11250/2500916
Lillestøl, Jostein

Topic: Two-sample comparisons

Context: Our firm started a certain business service in the early 1990s. The concept was a success, and several other firms offering similar services were established in the mid 1990s. Some of them were apparently not very serious, and many customers felt that they did not get what they paid for. Complaints from customers came to the attention of the consumer organization and the media. One business magazine frequently ran stories that may have left the impression that the whole business was unsound, and this was the basis for its mention on prime-time TV. Our firm felt that the magazine was running a campaign that harmed a perfectly legal and useful business, and since the firm had a name that was close to the business concept itself, it was probably harmed more than the unserious firms. The firm planned to sue the magazine in order to get a disclaimer and also compensation for lost sales.

The service product was mainly marketed by telephone, giving the customers who signed up for the service the possibility to withdraw within a given time limit.
The service is a subscription for one year at a time, and can be cancelled at maturity the next year. During the period of bad publicity the firm experienced an increase in the number of withdrawals and non-payments for the new subscriptions. The period of bad publicity started November 11, 1995. However, the average time from a sale to payment is close to two months. Consequently, the sales at least from early September on may be affected by the bad publicity, maybe also from late August. The firm felt that “the campaign” went on during the winter and spring, but faded off in late April.

The firm had monthly data from January 1994 to October 1996. The data may be split into three periods: 20 months prior to the period affected by bad publicity (Jan. 1994 to Aug. 1995), 8 months believed to be affected by bad publicity (Sept. 1995 to April 1996), and 6 months after this period (May 1996 to October 1996). The data stored in the file Lost_Sales.XLS are as follows (currency NOK):
1. Sales in current month.
2. Paid amount in current month or later for sales in current month.
3. Cancelled amount in current month or later for sales in current month.
4. Average number of days until payment and average number of days until cancellation for sales in each month.
5. Number of salesmen employed in each month.
Note that Paid and Cancelled do not necessarily sum up to Sales, since some sales were never paid or cancelled. The firm also had a record of paid and cancelled amounts in each month, irrespective of when the sales were made.

Task: Bring forward the evidence of lost sales due to the bad publicity and estimate the amount lost. Imagine that your analysis will be used as evidence in the court case against the magazine, where the defence may challenge your analysis.

Advice: If you make your analysis too sophisticated, it may easily be dismissed by the court. Preferably it should be understood by the laymen (and the judge).
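Before any two-sample comparison, each month has to be assigned to one of the three periods described in the case. The sketch below does that in plain Python and pools the paid share of sales within each period. The function names and the record layout `(month, sales, paid)` are assumptions made for this example; the actual column layout of Lost_Sales.XLS is not given in the case text.

```python
from collections import Counter
from datetime import date


def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def period_of(month_start):
    """Classify a month using the period boundaries stated in the case."""
    if month_start < date(1995, 9, 1):
        return "before"      # Jan. 1994 to Aug. 1995
    if month_start < date(1996, 5, 1):
        return "affected"    # Sept. 1995 to April 1996
    return "after"           # May 1996 to Oct. 1996


def paid_share_by_period(records):
    """records: iterable of (month_start, sales, paid) tuples (assumed layout).

    Returns the pooled paid/sales ratio for each period.
    """
    totals = {}
    for month_start, sales, paid in records:
        key = period_of(month_start)
        s, p = totals.get(key, (0.0, 0.0))
        totals[key] = (s + sales, p + paid)
    return {key: p / s for key, (s, p) in totals.items()}


if __name__ == "__main__":
    months = list(month_range(date(1994, 1, 1), date(1996, 10, 1)))
    print(Counter(period_of(m) for m in months))
```

Keeping the classification in one small function makes it easy to rerun the comparison with a different start of the affected period, for instance late August instead of September, which is one of the choices the defence could challenge.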
Wed, 17 Oct 2007 00:00:00 GMT

15 - City Parking
https://hdl.handle.net/11250/2500915
Lillestøl, Jostein

Topic: Estimation with confidence, t-tests

Context: The city parking company of Bergen, Norway had in the period 1983-1988 two shifts for emptying the parking meters and ticket machines. The two shifts alternated on a weekly basis, such that shift 1 had this job in odd-numbered weeks and shift 2 in even-numbered weeks. Each shift had three teams. The teams as well as the routines were fixed in the period up to week 42 in 1988. The money collected was counted at the end of the week, and the settlement was entered in the books with no split on teams. The parking rates were increased 1.1.1984, 1.1.1987 and 1.1.1988. A reduction in the number of street parking meters took place 30.6.1984, when a large city garage opened. Small changes in street parking took place in the year afterwards, among others due to replacement of some parking meters by ticket machines. The available data consist of 149 settlements for each of the two shifts. Some weeks are lacking because of a strike and some other circumstances that will not affect the analysis.

In 1987 the management was warned that the equipment for emptying the meters had a weakness and that there were rumours that some employees had taken the opportunity to siphon off money for themselves. Some said these rumours were around already in 1984. The available data are stored in the file City_Parking.XLS.

Task: Imagine you are a member of the management team and responsible for analyzing the available data to see whether it is worthwhile to start further investigation. It is regarded as very unfortunate for the working climate within the company if management confronts the teams with their suspicion and it turns out that there is nothing to it.
The decision to stage further investigation should therefore be made if the statistics show beyond reasonable doubt that money is lacking. Suppose that the suspicion is strengthened by the analysis, but you are not sure whether one, two or all three work teams in shift 2 are involved. How could routines be modified to be able to reveal this? In case fraud is admitted, but for far less than the data indicate, it is of interest to estimate both the most likely embezzled amount and a lower limit beyond reasonable doubt.

A-version: You have no guidance on how to proceed, except what you have learned at school.

B-version: You follow the telephone guidance of a friend on how to proceed (see below).

Task (B-version): Your friend on the telephone recommends the following:
(a) You should first look at the data file to see if there are observations, or missing observations, that may affect the subsequent analysis. In what follows you should see if corrections or modifications are required, and whether the data problems really matter for the conclusion and the intended use of the results.
(b) Provide a time series plot of the amount per collection for each shift over the period 1983-1988. Interpret the plot and note special features that may affect the subsequent analysis. Reflect on the possibilities of differences not linked to fraud. Also reflect on the best way to plot the data to expose differences between the shifts.
(c) Compute the average amount collected by the two shifts for the entire period 1983-1988. You may perform a two-sample t-analysis to test the hypothesis of no difference and at the same time obtain a confidence interval for the mean difference, which can be projected to the total. You should examine whether the assumptions for the analysis are justified. If not, you should look for alternatives.
(d) Go on to examine each year separately. Compute average amounts for both shifts for each year. You may perform a separate two-sample t-analysis for each year.
You should test the hypotheses of no difference as well as try to compute a confidence interval for the total amount.
(e) Some further possibilities are:
- a two-factor analysis of variance (ANOVA), the two factors being shift and year,
- an analysis of suitably paired observations from the two shifts,
- an analysis of departures from a moving average smoothing of the time series.
(f) You have to judge which approach is better, and whether some improvements could be made.

Tue, 16 Oct 2007 00:00:00 GMT

14 - Operating Expenses
https://hdl.handle.net/11250/2500914
Lillestøl, Jostein

Topic: Multidimensional categorical and numerical variables - association and explanation

A civil service operates throughout the state and has a number of cars. Purchasing and maintenance are decentralized to 6 regions: Capital, East, South, West, Central and North. The running expenses, including maintenance and repair, are recorded in a journal for each car. Each account is settled four times a year, at a fixed date every quarter, and then reported to the headquarters in the capital. The receiving officer wants to examine the operating costs for different car categories in different regions, and in particular the dependence on driving length and age of the car. The intention is that the findings should be used as input to replacement calculations. For an organization with more than 5000 vehicles statewide, such calculations may be of considerable economic significance.

For a preliminary analysis the officer decides to take a sample of 325 cars from his own region and survey the operating and repair costs last year for each car in the sample. Operating costs include expenses for gasoline/diesel, oil, washing, routine maintenance and insurance. Repair costs include expenses for repair of damage and breakdown. All cars are bought new, and there are three categories: sedan, station wagon and pick-up van.
The data are available in the file Operating Expenses.XLS as follows:
District (1=Capital, 2=East, 3=South, 4=West, 5=Central, 6=North)
Car type (1=sedan, 2=station wagon, 3=pick-up van)
Age of car (in years)
Driving length (in km)
Operating costs (in local currency)
Repair costs (in local currency)

Question (A-version): How will driving length, age and type of car affect the two cost categories? Try different modes of analysis: tabular, graphical and modelling of relations.

Questions (B-version): Do each of the following tasks for both types of costs, and successively reflect on what you learn and what you may want to do next:
1. Categorize Age and Driving length in two categories. Suggested coding: Age group (1=3 years or less, 2=over 3 years), Driving length group (1=up to 20000 km, 2=over 20000 km). For each Car type, make 2x2 tables of counts for the coded variables to check that the coding is reasonable.
2. Make 2x2 table(s) showing the average and standard deviation of Cost in terms of the categorized variables Age group and Driving length group. Do this for each Car type as well.
3. Draw graphs to illustrate each of the two costs in terms of Driving length group. Repeat for Age group. (Hint: Dotplot)
4. Draw graphs to illustrate each of the two costs in terms of the original variable Driving length. Repeat for Age. (Hint: Scatterplot)
5. Compute the pairwise correlations between Cost, Driving length and Age. Do the same separately for sedans. Do you see something that may affect the interpretation of the findings above and the further analysis?
6. Establish regression relationships where Cost is explained by Age, Driving length and Car type.

Mon, 15 Oct 2007 00:00:00 GMT

13 - Response Times
https://hdl.handle.net/11250/2500913
Lillestøl, Jostein

Topics: One-sample and two-sample estimation and testing, analysis of variance and regression

Context: The users of a data base system, e.g.
in a bank, will emphasize short response times when searching the data base. The response time may vary, depending on the type of search, the number of simultaneous searches and technical circumstances related to the transmission of data. Response times over a threshold lead to irritation and delay. This threshold is typically between 5 and 10 seconds, but varies between individuals. The response times are occasionally much longer, caused by traffic getting stuck. It is possible to generate fictitious requests which can be followed through the system in order to uncover “bottlenecks”. It is also possible to make minor modifications to the system, both with respect to programming and technical solutions, so that efficiency comparisons can be made. A fictitious request every 5th minute, i.e. 72 within the common working hours from 9.30 a.m. to 3.30 p.m., is few in comparison with the total number of requests of about 5000, and will therefore have a negligible effect on the response times of the system.

We have data from Wednesdays in two consecutive weeks (Day 1 and Day 2), with a system change in between, intended to reduce response times. The effect can be judged by various criteria: the change in expected response time, the change in median response time, or the change in the chance that the response time exceeds a certain “critical” level, say 5 seconds.

Data are available in the file Respose_Times.XLS as follows:
Response time (in seconds)
Day (1-2)
Hour (1-6)
Lunch break (0-1)
Traffic (no. of requests in encompassing minute)

How the data will be analyzed depends on how much statistical theory you know. In particular some distribution theory beyond the normal distribution may be helpful.

Task (A-version): What can be said about response times before and after the system change? Can we conclude that the system change led to an improvement?

Task (B-version)
1. Present the distribution of the sampled response times for each Wednesday separately, and comment on their shape and possible differences.
2. Estimate, before and after the system change,
   a. the mean response time,
   b. the median response time,
   c. the probability that the response time exceeds 5 seconds.
   If you can, provide sampling error limits for each estimate.
3. Estimate the difference (before minus after) of
   a. the mean response times,
   b. the median response times,
   c. the probabilities that the response time exceeds 5 seconds.
   If you can, provide sampling error limits for each estimate.
4. Perform the standard formal tests of the hypothesis of no change in 3 a, b and c. Which one of the tests is most relevant for our problem? Are the assumptions for these tests satisfied?
5. *Make a reasonable assumption for the distribution of the random response time T. Do the following after the system change:
   a. Estimate the parameters of the model and compute an estimate of .
   b. Estimate the maximal response time that can be guaranteed with 95% certainty.
   Hint: A possibility is to set T = a + X, where a is the smallest possible response time, and let the overshoot X be Gamma or lognormal distributed. What can be said in favour and disfavour of assuming a specific distribution?
6. It may be that the response times vary over the day, and that this may affect the analysis above. A possibility is to pair the observations for the same time on the two Wednesdays, and then analyse the 72 differences in response time (before minus after). Estimate
   a. the expected difference in response times,
   b. the median difference in response times,
   c. the probability of a shortened response time after the system change.
7. Perform the standard formal tests of the hypothesis of no change in 6. Hint: One-sample t-test, one-sample rank test, sign test. Discuss the choice of test for these data.
Comment on the result of the testing, and whether the pairing of observations may reduce the possibility of testing improvement with respect to unacceptable response times. Do you see an alternative if the number of requests in the lunch break from 11.30 to 12.30 is typically lower?
8. Figure out whether the response times vary over the day by
   a. making suitable plots,
   b. a one-factor analysis of variance (ANOVA) for the Day 1 observations using Hour (1-6) as factor,
   c. a two-factor analysis of variance (ANOVA) for all observations using Hour (1-6) as the first factor and Day (1-2) as the second factor.
9. As a measure of the traffic, the number of requests in the minute encompassing each fictitious request is recorded. Analyse by regression analysis how the traffic affects the response times, and whether there is improvement from Day 1 to Day 2. Discuss whether the standard assumptions for regression analysis are fulfilled. An underlying assumption for some of the analyses above is that the observations are independent. Discuss the possibility of positive correlation, and the risk that this may distort our conclusions.
10. *Study the relationship between traffic and response times over 5 seconds by logistic regression.
11. *It is claimed that the number of requests in a given period is Poisson distributed. Is this reasonable or unreasonable? Can this assumption be tested? For planning purposes the expected number of requests per minute is set to 16. Is this reasonable? Compute approximately the probability that the number of requests per minute is more than 25.

Sun, 14 Oct 2007 00:00:00 GMT

12 - Cashier Fraud
https://hdl.handle.net/11250/2500911
Lillestøl, Jostein

Topic: Variation and extremes

Context: This case deals with cashier fraud at a supermarket, involving so-called “return money” when the cashiers have to correct errors. The supermarket in question has 16 cashiers, about 8-9 at work at a time.
Each cashier has her/his own code that is entered in the computerised cash register before operation. The cash registers have two rolls, one producing itemised receipts for the customer, the other recording the same amounts without item specification, to be kept by the store (called the gossip roll). Typically such rolls are replaced more than once during a workday. If an erroneous amount is entered into the register or merchandise is returned, it shall be deducted from the register, and then recorded in writing on a return/correction sheet. The wrong entry and its correction will then appear on both the customer receipt and the gossip roll. At the end of the day the cash in hand and the return/correction sheet were brought to the back office to be balanced. There was no day-to-day inspection of gossip rolls. One of the cashiers (A) seized the opportunity to deliberately enter erroneous amounts. These were not corrected in full on the rolls, but were recorded on the return/correction sheet as if they had been. The difference was then pocketed. According to the routines the cash in hand balanced the records at the end of the day. To reduce the risk of gossip rolls telling the truth, the parts in question were occasionally cut off with scissors. In May 2003 the supervisor surveyed the sales turnover and return money for each of the cashiers during the first 4 months of the year. Her suspicion was raised when she discovered time gaps between successive gossip rolls. By examining further she discovered that cashier A had fairly frequent returns and higher return amounts than other cashiers. The return/correction sheets were then compared with the gossip rolls, and mismatches were discovered. The complete records from the previous year added to the suspicion. By after-hours inspection of the garbage cans the management found a piece of a gossip roll from cashier A. Nevertheless her cash balance was apparently correct, which was taken as evidence of withdrawal of money that day. 
A complaint was then filed with the police. These findings, together with statistics from the computerised cash register, formed the basis for the examination by the police. The cashier quickly admitted having embezzled varying amounts, ranging from NOK 1000 up to, maybe, NOK 10 000 at a time, on average once a week for a little more than a year and a half (1£ = 10 NOK). Concerning the total amount she was confronted with a calculation covering the period 03.10.01 to 07.05.03 in which her return money amounted to about NOK 400 000 compared to an average of NOK 50 000 for all cashiers, thus leaving NOK 350 000 for her to account for. Although the accused could not imagine having embezzled this large an amount, the police obtained her concession based on «undisputable facts». The prosecutor-to-be quickly realised a possible flaw in using this calculation as the basis for the accusation: The accused may have lower abilities than average at the outset, and should not be punished for that. In a criminal case «every penny» embezzled must be proved, and the average argument would be a gift to the defence. After preliminary consultation with statistical expertise, taking the variation of return money between cashiers into account as well, the accusation was reduced from about NOK 350 000 to NOK 300 000. A statistician was wanted as an expert witness, since statistical arguments of this kind had not previously been brought before the court. The following commission was given to the statistician: «To analyse the information made available and establish how much the return money of the accused deviates from natural variation among the cashiers in the store. State the assumptions of the analysis, and how sensitive it is to realistic changes of the assumptions». Data: The available data included individual return amounts, number of returns and sales turnover, among other things, for every cashier and every work day in the period 03.10.01 to 07.05.03. 
Some cashiers had worked more than others, and in the file Cashier_Fraud.XLS we provide the total return amount and the number of returns for each cashier A to P. In a separate sheet we give the return amount and the number of returns for each working day of cashier A, totalling 270 working days. Of these about 40 were in year 2001, 160 in 2002 and 70 in 2003. Task: 1. Pretend you are the statistician faced with the commission above. 2. Investigate also when the supposed fraud started, and how it developed. Teacher Note. The context and the commission may be given very briefly. Afterwards the complete story can be revealed for discussion and critique. The case could alternatively be discussed from the point of view of the store supervisor, the police investigator, the prosecutor or the defence lawyer. You could even imagine role-playing. For the role of store supervisor you set the crime scene and reveal the suspicious circumstances and data, and then ask the student how to proceed. Sat, 13 Oct 2007 00:00:00 GMT https://hdl.handle.net/11250/2500911 2007-10-13T00:00:00Z 11 - Takeover https://hdl.handle.net/11250/2500910 11 - Takeover Lillestøl, Jostein Topic: Estimation with error limits Context: An industry group has for some years owned a company in the building trade with branches in several cities. Now they have decided to sell the company to a foreign company. As part of the economic settlement, the inventories in each branch have to be evaluated. The basis is the figures recorded in the data system, named system values. The situation may be described as follows: Each branch has registered a number of building items, and has recorded quantities and unit prices for each of these. The system prices are adjusted for each new shipment added to the stock, where the new system price is a weighted sum of the old system price and the last invoice. The weights are determined by the remaining stock and the added quantity. 
The system price will therefore represent an average of the unit prices attached to the current stock at the time of buying. The inventory had N=5501 items, and the computed worth based on system prices turned out to be NOK 72.549.991. At takeover the buyer of the company took a random sample of inventory items, and found that the total system value in the sample was larger than the worth computed using the unit prices from the last invoice. Since prices have generally increased over several years, you would expect the opposite, that the system prices were lower. Based on random sampling theory the buyer estimated the total overstatement to be NOK 500 000, and consequently they claimed a price reduction of this size. The seller refused this, and having no statistical expertise at hand, their only argument was that the sample of 75 items was too small. They agreed to double the sample to about 150 items. Based on the total sample, the estimated overstatement turned out to be just NOK 300 000, and the buyer reduced the claim to this amount. The seller is not happy with this, and at this stage you are brought in as a statistical advisor. The file Takeover.XLS contains Quantity, System Price and Invoice Price for a sample of n=153 items. Task (A-version): 1. Do your own calculations on the data, and judge if the claim of the buyer is justified. If not, provide convincing arguments to counter the claim. 2. Try to clarify the basic statistical issues related to the described situation, and be prepared to teach your client about your method and findings. If you feel that the context is not fully clarified, face the task under different interpretations. 
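The updating rule for system prices described in the context can be illustrated with a small sketch. It assumes that the weights are simply the remaining quantity and the added quantity, each divided by their sum; the case text only says that the weights are "determined by" these quantities, so this exact form is an assumption. The function name and the numbers in the example are hypothetical.

```python
def update_system_price(old_price, remaining_qty, invoice_price, added_qty):
    """Return the new system price after a shipment is added to stock.

    Assumption: the weights are proportional to the remaining stock and
    the added quantity, i.e. a quantity-weighted average of the two prices.
    """
    total_qty = remaining_qty + added_qty
    if total_qty <= 0:
        raise ValueError("total quantity must be positive")
    return (remaining_qty * old_price + added_qty * invoice_price) / total_qty


# Hypothetical example: 100 units in stock at NOK 50, 50 new units invoiced at NOK 56.
new_price = update_system_price(50.0, 100, 56.0, 50)
print(new_price)
```

Under this rule the system price lags behind the latest invoice price when prices rise, which is why one would expect system values to be lower, not higher, than values based on the last invoice.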
Task (B-version) as above, but with the added guidance: We may study the problem within two different contexts, leading to slightly different approaches: Context 1: Buyer and seller had agreed at the outset that the updating procedures were satisfactory, and that the system prices should be the basis for the valuation. Sampling with subsequent data analysis was nevertheless done on the buyer's own initiative, and claims were raised with hindsight. Context 2: Buyer and seller were at the outset uncertain whether the updating procedure was followed at all branches. They had agreed that possible overstatement of the worth of the inventory using system prices should be fully adjusted. An estimate of this overstatement should be determined from a sufficiently large sample. Here Context 1 corresponds to the actual situation, and may be taken as the primary one. Statistical issues: Different sampling schemes and different ways of estimating the overstatement and its error margins. The question of how the error margins depend on the sample size. Some further guidance: The Difference Method: The total overstatement is estimated by multiplying the sample mean of the differences D between system and invoice amounts for each stock item by the population size N. The standard error of the estimated overstatement is approximately equal to N S / sqrt(n), where S is the sample standard deviation of the differences, under the assumption that the sample is small in comparison with the population (otherwise a so-called finite population correction may be needed). (a) Use the difference method to estimate the total overstatement of worth with accompanying error margins corresponding to an approximate 95% confidence level. (b) What sample size will be needed if we want error margins equal to +/- 500 000? Say +/- 300 000? (c) Search a textbook (or the net) for the Ratio Method as an alternative to the Difference Method. 
(d) Search a textbook (or the net) for the following alternative sampling schemes: (i) stratified sampling (ii) sampling proportional to size Fri, 12 Oct 2007 00:00:00 GMT https://hdl.handle.net/11250/2500910 2007-10-12T00:00:00Z
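The Difference Method from the guidance above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the solution to the case: it uses the normal approximation with z = 1.96 for the 95% level, ignores the finite population correction, and takes S to be the sample standard deviation of the differences. The function names and the small data set are hypothetical; in the case itself the amounts would be Quantity times System Price and Quantity times Invoice Price from Takeover.XLS.

```python
import math
from statistics import mean, stdev


def difference_estimate(system_amounts, invoice_amounts, population_size, z=1.96):
    """Estimate the total overstatement as N * mean(D), with standard error
    N * S / sqrt(n), where D = system amount - invoice amount per sampled item."""
    diffs = [s - i for s, i in zip(system_amounts, invoice_amounts)]
    n = len(diffs)
    total = population_size * mean(diffs)
    se = population_size * stdev(diffs) / math.sqrt(n)  # no finite population correction
    return total, se, (total - z * se, total + z * se)


def required_sample_size(s, population_size, margin, z=1.96):
    """Smallest n for which z * N * S / sqrt(n) is at most the wanted margin."""
    return math.ceil((z * population_size * s / margin) ** 2)


# Hypothetical amounts for four sampled items (not the case data).
system = [100.0, 200.0, 300.0, 400.0]
invoice = [90.0, 210.0, 280.0, 390.0]
total, se, interval = difference_estimate(system, invoice, population_size=1000)
print(total, se, interval)
```

Because the margin shrinks only with the square root of n, halving the error margin requires roughly four times as many sampled items, which is the point of question (b).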