Skip to content

Multiple Linear Regression for 50 Ad Agency

1. Data Processing

In order to know what kind of publicity expenses matters more in determining the profit of a company and what the relation would be, we collect the data from 50 different companies in 3 different cities about their expenses on print media, social media and outdoor advertisement. Based on this data, we could build a model to analyze the relationship.

1.1 Raw Data

Print Media ExpensesSocial Media ExpensesOutdoor Ad ExpensesCityProfit
165349.2136897.8471784.1Mumbai192261.8
162597.7151377.6443898.5Chandigarh191792.1
153441.5101145.6407934.5Delhi191050.4
144372.4118671.9383199.6Mumbai182902
142107.391391.77366168.4Delhi166187.9
131876.999814.71362861.4Mumbai156991.1
134615.5147198.9127716.8Chandigarh156122.5
130298.1145530.1323876.7Delhi155752.6
120542.5148719311613.3Mumbai152211.8
123334.9108679.2304981.6Chandigarh149760
101913.1110594.1229161Delhi146122
10067291790.61249744.6Chandigarh144259.4
93863.75127320.4249839.4Delhi141585.5
91992.39135495.1252664.9Chandigarh134307.4
119943.2156547.4256512.9Delhi132602.7
114523.6122616.8261776.2Mumbai129917
78013.11121597.6264346.1Chandigarh126992.9
94657.16145077.6282574.3Mumbai125370.4
91749.16114175.8294919.6Delhi124266.9
86419.7153514.10Mumbai122776.9
76253.86113867.3298664.5Chandigarh118474
78389.47153773.4299737.3Mumbai111313
73994.56122782.8303319.3Delhi110352.3
67532.53105751304768.7Delhi108734
77044.0199281.34140574.8Mumbai108552
64664.71139553.2137962.6Chandigarh107404.3
75328.87144136134050.1Delhi105733.5
72107.6127864.6353183.8Mumbai105008.3
66051.52182645.6118148.2Delhi103282.4
65605.48153032.1107138.4Mumbai101004.6
61994.48115641.391131.24Delhi99937.59
61136.38152701.988218.23Mumbai97483.56
63408.86129219.646085.25Chandigarh97427.84
55493.95103057.5214634.8Delhi96778.92
46426.07157693.9210797.7Chandigarh96712.8
46014.0285047.44205517.6Mumbai96479.51
28663.76127056.2201126.8Delhi90708.19
44069.9551283.14197029.4Chandigarh89949.14
20229.5965947.93185265.1Mumbai81229.06
38558.5182982.09174999.3Chandigarh81005.76
28754.33118546.1172795.7Chandigarh78239.91
27892.9284710.77164470.7Delhi77798.83
23640.9396189.63148001.1Chandigarh71498.49
15505.73127382.335534.17Mumbai69758.98
22177.74154806.128334.72Chandigarh65200.33
1000.231241531903.93Mumbai64926.08
1315.46115816.2297114.5Delhi49490.75
0135426.90Chandigarh42559.73
542.0551743.150Mumbai35673.41
0116983.845173.06Chandigarh14681.4

The raw data is shown above.

1.2 Drop Zero Data

Drop all data with null values in the table to prevent extreme outliers.

1.3 Categorical Variable

Column “City” contains labels, Mumbai, Chandigah and Delhi, which cannot be used for linear regression. Thus, transform them into dummy variables.

Because there are three different cities, two dummy variables z1 and z2 are needed.

As the table below shows, (0, 1) represents Munbai, (1, 0) represents Chandigah, and (0, 0) represents Delhi.

z1z2
Mumbai01
Chandigah10
Delhi00

1.4 Data Description

Based on these simply processed data, the following figures are drawn. In the data, the spendings of all 46 companies on three different advertising methods are counted.

For print media expenses and social media expenses, the overall spending level is low, basically below 200,000, mostly concentrating between 50,000 and 150,000. In outdoor advertising expenses, most companies invest more than 100,000, among which the highest reaches 450,000.

For print and social media expenses, up to 6 sets of data locates in the same range, and for outdoor advertising spending, up to 7 sets. Regarding cities of the companies, Mumbai and Chandigah each accounts for 15, and Delhi 16.

For profit of the companies, it is distributed in the range of 50,000 to 200,000. Most of the companies have a profit around 100,000. There is only one company with a profit of 50000, and four companies with a profit of 200000, accounting for a very small proportion.

1699261215166

2. Data Test

In order to use multiple linear regression to build our model, several conditions must be satisfied. Therefore, we use several tests to check out whether the independent variables are good candidates for multiple linear regression.

2.1 Linear Relationship between Independent Variables and Dependent Variable

2.1.1 Scatter Plot

Three figures below show the value of profit versus three different expenses.

Fig 2.1 and 2.3 shows a strong linear relationship between profit and print media expenses and outdoor ad expenses. Fig 2.2 shows the weak linear relationships between profit and social media expenses.

1699261082472

2.1.2 Correlation Coefficient

To better analyze the linear relationship between profit and expenses, we compute the correlation coefficients, which is shown above.

The table shows strong linear relationships between y and x1,x3. But, the relationship between y and x2 is weak, so as z1 and z2. However, z1 and z2 are categorical variables, which just has values of 0 or 1. Therefore, generally profit has linear relationship with expenses.

x1x2x3z1z2
y0.970.130.73-0.060.03

2.2 Independence between Variables

2.2.1 Residual Plot

In the residual plot we could see that the data points show no significant trend along x **or y axis.

Regarding the source, the data comes from different companies in 3 different cities, which leads to low interaction between these companies. Together with the residual plot shown above, we roughly consider our variables to be independent.

1699261122481

2.3 Normality for Residuals

2.3.1 Q-Q Plot

To roughly see whether the residuals follow normal distribution, we scatch the quantile-quantile plot, which is shown below.

In the figure, we could see that the scatter points lie close to the marked straight line, which means that the distribution of these residuals are approximately normal.

1699261153558

2.3.2 Kolmogorov–Smirnov Test

Normality of data is of very high importance in our regression, so we do further K-S test to confirm it.

With our data and following equations, we compute Dn, the K-S statistic of our data to be 0.1153. Referring the table, we know that the limit for our data to be normal is 0.1698, which is significantly higher than 0.1153. Therefore, we could consider that our data follows normal distribution.

ri=yiy^iFexp(ri)=in,i=1,2,,nFobs(ri)=P(z<riμσ)Dn=max(|Fexp(ri)Fobs(ri)|),r1r2rnDn=0.1153<Dc=0.1698

2.4 Equal Variance

2.4.1 Residual - Predicted Value Plot

To build our multiple linear regression model, equal variance is also a very important need. So, we use the residual plot to see if there is an obvious trend of residuals regarding the change of profit. As Fig 5, the residuals seem random and homogeneous around zero, without significant increasing, decreasing or bending. Therefore, we consider our model to be with equal variance.

1699261184946

3. Data Regression

3.1 Multicollinearity

3.1.1 Correlation Matrix

x1x2x3z1z2
x110.20.71-0.080.04
x20.21-0.09-0.050.06
x30.71-0.091-0.150.03
z1-0.08-0.05-0.151 ng-0.48
z20.040.060.03-0.481

To build our regression model, we must ensure that multicollinearity is not severe between all these variables, which means that no two variables share highly similar linear relationship between profit and itself.

So, we construct the correlation matrix, which is shown in the above table. As all these computations show, most of their correlation coefficients are much smaller than 0.7. Therefore, our model shows low multicollinearity.

3.1.1 Analysis of Eigenvalues

For matrix XX, all the eigenvalues are shown above. With all the computations of k, the maximum one gives the value of 7.77, which is smaller than 100. Therefore, we could consider our variables to have very low level of multicollinearity and thus could be used in analysis.

X=[x11x21x51x12x22x52x1nx2nx5n]A=XXλ1=82.85,λ2=10.66,λ3=23.68,λ4=65.01,λ5=47.81k=λ1λ2=7.77<100

3.2 Regression

3.2.1 Forward Selection

Step 1:

R2=1SSESST
y^R2f
x1y^=54110+0.8046x10.95585952.60249
x2y^=94970+0.1812x20.01825
x3y^=63050+0.2365x30.53617

From the R2 we get, we could find that R12 > R22 > R32, and the largest R2 is from x1. Then we will take F-test of x1. And according to the table of F-test, we could find that f = 952.60249 > f0.05(k1,nk1), so x1 is significant, and we choose x1. Then the procedure will continue and we will insert a new variable.

Step 2:

R(βi|βj)=R(βi,βj)R(βi)f=SSR/(k1)SSE/(nk1)
y^R(βi,β1)f
x1,x2y^=63000+0.8141x10.07897x21938395563.51402
x1,x3y^=51860+0.7581x1+0.02580x3193319026
f0.05(k1,nk1)>3.51402

Therefore, adding more variables doesn’t give a significant change to the model. So the procedure is terminated.

Then we find only x1 gives a significant change to this model, as a result, we could get:

y^=54110+0.8046x1

That means:

Profit=54110+0.8046×PrintMediaExpenses

3.2.2 Backward Elimination

Step 1: Fit all the variables

βR2f
x10.8060.951295.978
x2-0.027
x30.027

Step 2: Remove x2

βR2f
x10.7970.950450.771
x30.030

Step 3: Remove x3

y^=54110+0.8046x1
βR2f
x10.8540.947849.789

Backward Elimination analysis was performed with Print Media Expenses, Social Media Expenses, and Outdoor Ad Expenses as independent variables, and Profit as the dependent variable. After calculation, only Print Media Expenses have the significant change to this model.

Profit=54110+0.8046x1×PrintMediaExpenses

4. Hypotheses Testing

4.1 Effects of Different Kinds of Expenses

We are firstly interested in whether one certain expense impacts profit significantly under the condition that other expenses are put. So we use the model containing all the variables to do the hypotheses testing.

H0:βj=βj0H1:βjβj0t=β^jβj0scjjβ^jtα/2scjj<βj<β^j+tα/2scjj

In the test, the t values for the critical region are -2.021 and 2.021.

For β1, we use the data to compute the corresponding t-statistic to be 19.8, which exceeds the t-statistic for the critical region.

Similarly, for β2 and β3, the corresponding t-statistics are about -1.28 and 1.38, which are smaller than the corresponding t-statistic for the critical region.

Therefore, we can reject the null hypothesis for β1 but not for β2 and β3, which means we can not reject β2 or β3 to be 0 in the presense of the other two.

That is to say, except print media expenses, one certain expense have insignificant impact on the profit under the existence of the other two expenses. Thus, print media expense may be the most important one in the three expenses.

4.2 Effects of Different Cities

4.2.1 One-Way ANOVA

We are then interested in whether the cities of these companies have a significant impact on their profit. So, we continue to construct our hypotheses-testing process.

H0:μ1=μ2=μ3H1:any of them is differentf=0.0790<f0.05(2,42)can not reject H0

We first set our null hypothesis to be that the means of profit for companies in these three cities are equal. Then, the alternative hypothesis is that any of these two means are not equal.

For only one single factor is considered there, we use One-Way ANOVA to test our hypothesis and the results are shown above. The treatment sum of squares SSA, with degrees of freedom to be 2, has its mean square about 109051702.16. And the error sum of squares SSE, with degrees of freedom to be 42, has mean square about 1380294796.39.

So, the f-statistic computed is 0.0790, which is much smaller than the f-statistic for the critical region. Therefore, we can not reject our null hypothesis. Thus, we could get the conclusion that we can not reject the profit of all the companies in three cities share the same mean value.

That is to say, the impact of cities on the profit a company make is insignificant.

SourceDFSSMSf
SSA2218103404.31109051702.160.0790
SSE4257972381448.241380294796.39
SST4458190484852.56

5. Conclusion

From all the processes above, we could get the conclusion that the print media expense matters most in determining the profit, followed by outdoor ad expense and then social media expense. What’s more, the effect of city is insignificant.

For all the models we build, Radj2 is computed to be larger than 0.95, which means that most of the observed values of profit could be explained by the model. Besides, the MAPE (Mean Absolute Percentage Error) is computed to be lower than 1%.

Additionally, the linearity between response and responsors, the normality and equal variance are all significant.

Therefore, we eveluate our regression models to be good ones.

However, shortcomings still exist.

First, the number of companies in each city is not very large. Thus, the data may not support the results very strongly.

Second, we can not confirm our variables to be of high level of independence.

Third, the multicollinearity between print media expenses and outdoor ad expenses is not very low, which may disturb the regression results.

Fourth, we can not confirm these 46 companies to have similar conditions and other expenses, without which the regression results would be inaccurate.

Therefore, if further improvements are to be conducted, ways might be found through the above four points.

6. Contributions

Team MemberContributions
Ke XuPlanning the structure; programming and coding; data searching, processing and testing; regression; hypotheses testing; writing report; making slides.
Yifan ChenPlanning the structure; coding; data searching, description and testing; regression; hypotheses testing; writing report; making slides.
Ruiwen ZhangForward selection; writing report.
Zeyi YangBackward elimination; writing report.
Zhuo WangData description; writing report.
One day we will climb the highest mountain, and suvey the smallest point.