# I101: Introduction to Informatics

# Lab 9: data analysis (linear regression)

# Detailed Instructions for Tasks

*Introduction*

The aim of this lab is to understand and work with linear regression using Excel. Microsoft Excel has many powerful features for handling data analysis. Data analysis is complex, takes time to perform (as you will learn in this lab), but most algorithms are based on basic statistical analysis algorithms that you have already done in class.

We will be initially working with the basic functions of Excel to perform linear regression, and then using excel tools to do the same. Try to understand how algorithms work, since Excel uses the same algorithm which we will be using.

*understanding linear regression*

When two sets of data ( *x *and *y *)
are related in a linear manner, the data on being plotted ( *y *versus
*x *) gives a straight line. This is known as having a **linear
correlation**. This follows the equation of a straight line . Following is an example of a sample data set and the plot of a
"best-fit" straight line through the data.

Instead of plotting the data to determine the
constants ** m (slope) **and

**of the equation , we can simply apply a statistical treatment known as**

*b*(y-intercept)**linear regression**to the data to determine these constants. The Linear Regression method will be explained below as an exercise you will perform in the lab.

*Import Data From Text File*

You learned in your last lab that it is just not practical to enter data directly into Excel. For this reason Excel allows you to import data from other sources, in this case, a text file. Please download the data file here and then open it in Notepad.

- Open Microsoft Excel, you should have a new blank spreadsheet open. Save
this spreadsheat as lab9.xls.
**Remember to save your spreadsheet periodically. This is the file you will be turning in for this week's lab.** - Next we are going to import the data
file you just looked at . Go to
*Data > Import External Data*then find the data.txt file and import it like you did in lab 8. However, this time make sure you select 'tab delimited' instead of 'comma delimited'. You will see the data in your spreadsheet as shown below.

- Next, make the top row bold so we know its our heading and not data. Take
your cursor and click on the
*1*on the very left. This should highlight the entire row. Now go up to your toolbar and click the bold symbol. Alternatively, after highlighting the row you can simply press ctrl-b. If the top row is now bold, you are ready to start working with the data.

*calculate Statistics *

We will now calculate the regression statistics which help us plot the best fit line.

For this, we will need to calculate
* m* (the slope)

Also, we need to calculate * b* (the intercept)

We also need to calculate * r*. The formula for that
is

However, to calculate * m*,

*, and*

**b***, we will need to calculate a few more things first.*

**r**To do this, create column headings **xy**, **x2**,
**y2** in your sheet. Now to calculate these values, do the
following:

- To calculate
**xy**, place the cursor in cell C2, and in the formula bar above, type**=A2*B2****=(A3)*B3**. The power of Excel lies in its ability to automatically update formulas based on which row or column they are in. This process should look like the screen below

You will automatically see that the column gets populated with correct values.

- To calculate
**x2**, place the cursor in cell D2, and in the formula bar above, type**=(A2)^2**

Do the same process as you did in the previous step to now populate column D with correct value. - To calculate
**y2**, place the cursor in cell D2, and in the formula bar above, type**=(B2)^2**

Do the same process as you did in the previous step to now populate column E with correct value. - Finally, your spreadsheet should look like the one below

- Now, also create below this data, the fields as shown
- Now, let us calculate the values for the fields we just created.

- To calculate
**sum-x**, go to cell A15 (the one below**sum-x**) and click the formula button in the toolbar (as described in lab 8), and select the SUM option. Now highlight the area of cells A2:A11 (which represents the values for**x**) and press enter. The value 55 automatically shows up in the cell below**sum-x**. - Repeat the above procedure to calculate the values of
**sum-y**,**sum-xy**,**sum-x2**, and**sum-y2**. - We know that
**n**is simply the total number of values. You could just count the values but this would be tedious with a large dataset. Instead, you can put a function to count for you. In this cell (F15), enter**=COUNT(A2:A11)**. Alternatively, after entering**COUNT(**, you can click the top value and drag down to the last value you want to count. - You spreadsheet should now look like:

- To calculate
- Now, we have all the required components for calculating
,**m**, and**b**.**r**

- Using the formulae, calculate the values for
,**m**, and**b**. You can do this either using Excel, or simply using a calculator.**r**

- First, calculate the values for
,**m**, and**b**first using a calculator. Show your calculations and results in the same Excel sheet.**r** - Then use the functions SLOPE, INTERCEPT and CORREL to calculate
,**m**, and**b**using Excel. The functions are fairly simple - use them directly and see how Excel helps calculate the values easier. However, Excel has a lesser accuracy (in terms of number of decimal places).**r**

- First, calculate the values for

*Create a chart for the data obtained *

Basic Chart

- Let us create a basic chart of the x and y values. To do this, first
select the data values under columns for
**x**and**y**.

- Now, go to
, or simply press the chart button in the toolbar.**Insert > Chart** - Select the
chart option. Select the following chart sub-type and click next.**XY (Scatter)**

- You will see a preview of how the chart is going to look. Click Next.

- Add your chart title, and Value (X) axis and Value (Y) axis
titles.

- Just click Finish here. You should now see the graph appear on your
existing sheet.
- Now, click on the
**Chart**option in the taskbar and select**Add Trendline**. Select the linear option as shown, and click OK. - You will see your trend-line appear on the graph.

*using data analysis tools in excel *

- We will now see how to do the entire data analysis that you did in the lab up until this point - but using just a few clicks in Excel.
- Within the same Excel document, go to
You will see several options, and check the first two as shown, and then click OK.**Tools > Add-Ins...** - This has now activated the Analysis tools that are necessary functions to calculate various data analysis tasks, including regression, correlation, etc.
- Now go to
You will see several functions here. Select Regression, and click OK.**Tools > Data Analysis...**

- Select the
**Input Y Range**and**Input X Range**as shown, and below, check the**Line Fit Plots**checkbox. Click OK.

- You will observe that Excel automatically inserts a new sheet in the file. Go to that sheet and observe that it contains all the data analysis you performed (and a lot more things).
- Notice below, the values of
,**m**, and**b**that you calculated are circled in red in this new data analysis sheet. It also shows a graph similar to the one you plotted before. Compare the graph and trendlines. You've pretty much learned how to do the same thing a program like Excel does for you!**r**

**Turn in this file (lab9.xls) in your dropboxes in
Oncourse. **

**Now, use this file - shoesize.txt
- and perform the same analysis using both steps (normal, plus using Excel's
data analysis tools). You should use a seperate file - shoesize.xls for this,
and turn that in the dropbox in Oncourse too. **

**Also, write a description on your blog, of what
you
learned in this lab - both about data analysis as well as inductive
modeling at large. Get creative! Your score depends on the quality of
your
post (in a linearly increasing manner obviously!)**