Sunday, October 25, 2009

My favourite regression project.

INTRODUCTION & OBJECTIVE

This data set was obtained from the Statistical Online Computational Resource database.
From this data set, I wish to study the average mercury concentration in the muscle tissue of one type of fish, the Largemouth Bass.
Largemouth bass were studied in 53 different Florida lakes to examine the factors that influence the level of mercury contamination. Water samples were collected from the surface of the middle of each lake in August 1990 and then again in March 1991. The pH level and the amounts of chlorophyll, calcium, and alkalinity were measured in each sample. Next, a sample of fish was taken from each lake, with sample sizes ranging from 4 to 44 fish. The age of each fish and the mercury concentration in its muscle tissue were measured.
For this research there are 11 variables available, but for my regression model I chose 5 of them:

Dependent Variable (Y):
Average Mercury concentration in the muscle tissue of the fish

Predictor Variables (X):
X1 : pH
X2 : Alkalinity (mg/L of Calcium Carbonate)
X3 : Calcium (mg/L)
X4 (Qualitative) : Age Data
     1: Data has been recorded
     0: No data was recorded on the age

I wish to examine the level of mercury by regressing it on the pH, the alkalinity of the water, the calcium concentration in the water, and the age data. I chose alkalinity and calcium concentration because I expect the water chemistry of a lake to be related to the mercury level in its fish. As for the age data, the mercury concentration increases with the age of the fish, and since this research was conducted twice, first in 1990 and then in 1991, not all of the fish in the lake were contaminated with mercury. The label ‘1’ means the fish had been labelled before as contaminated and its age was recorded too, while the label ‘0’ means the fish’s data was never recorded, whether it is a new fish in the lake or a fish that came from outside the lake.

The objective of this study is to examine the level of mercury in the lakes that contaminated the fish. Largemouth Bass are not much affected by mercury themselves, and the longer they live in a lake that contains mercury, the larger the mercury concentration in their muscle tissue. My expectation is that the higher the alkalinity and calcium concentration, the higher the mercury concentration will be.

DIAGNOSTIC CHECKING
Before doing any calculation, I checked all of my variables using SPSS, to see whether the variables I had chosen are suitable for my regression model.



Checking whether the model is significant:
Using the F-test:

Ho : β1 = β2 = β3 = β4 = 0 (The model is not significant)
H1 : At least one βj ≠ 0 (The model is significant)
α = 0.05

Test Statistic:
According to the ANOVA table, the p-value is reported as 0.000 (that is, less than 0.001)

Decision Rule:
If p-value < α, Reject Ho

Decision:
Since p-value < α, Reject Ho
Conclusion: This model is appropriate, since the F-test shows there is a linear relationship between Y and at least one of the predictor variables.





The equation: Average Mercury = 1.137 – 0.082(pH) – 0.005(Alkalinity) + 0.004(Calcium) + 0.056(Age Data)

Thus, for every one-unit increase in pH, the Average Mercury drops by 0.082 parts per million, provided that the other predictors remain unchanged. Next, for every 1 mg/L increase in Alkalinity, the Average Mercury decreases by 0.005 parts per million, with the other predictors held constant. Similarly, for every 1 mg/L increase in Calcium, the Average Mercury increases by 0.004 parts per million. Finally, with pH, Alkalinity and Calcium held constant, the Average Mercury where age data was recorded (Age Data = 1) is 0.056 parts per million higher than where no age data was recorded (Age Data = 0).
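To make the reading of the coefficients concrete, here is a minimal Python sketch that plugs values into the fitted equation above. The original analysis was done in SPSS, so the function name and the example lake are hypothetical; only the coefficients come from the post.

```python
# Coefficients copied from the fitted equation in the post.
COEFFICIENTS = {
    "intercept": 1.137,
    "ph": -0.082,
    "alkalinity": -0.005,   # per mg/L
    "calcium": 0.004,       # per mg/L
    "age_data": 0.056,      # dummy: 1 = recorded, 0 = not recorded
}

def predict_mercury(ph, alkalinity, calcium, age_data):
    """Predicted average mercury (ppm) from the first fitted model."""
    c = COEFFICIENTS
    return (c["intercept"] + c["ph"] * ph + c["alkalinity"] * alkalinity
            + c["calcium"] * calcium + c["age_data"] * age_data)

# Hypothetical lake, values chosen only for illustration.
base = predict_mercury(ph=7.0, alkalinity=40.0, calcium=20.0, age_data=1)
one_more_ph = predict_mercury(ph=8.0, alkalinity=40.0, calcium=20.0, age_data=1)
print(round(base, 3))                # 0.499
print(round(one_more_ph - base, 3))  # -0.082
```

Raising pH by one unit while holding everything else fixed changes the prediction by exactly the pH coefficient, which is what "provided the other predictors remain unchanged" means.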

The p-values for pH, Calcium and Age Data are all greater than 0.05, so those three predictor variables are not significant for Average Mercury. The p-value for Alkalinity is 0.017, which is less than 0.05, so Alkalinity is significant for Average Mercury.

The 95% confidence interval for pH is [-0.167, 0.003], for Calcium it is [-0.002, 0.009] and for Age Data it is [-0.139, 0.251]. This indicates that those 3 predictor variables are not significant, because the value 0 falls within each interval. In contrast, the 95% confidence interval for Alkalinity is [-0.01, -0.001], which shows that this predictor variable is significant, since the value 0 does not fall within the interval.
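The "does the interval contain 0" rule can be written as a tiny Python check. This is only a sketch: the intervals are copied from the first model, and the helper name is made up for illustration.

```python
# 95% confidence intervals copied from the post (first model).
INTERVALS = {
    "pH": (-0.167, 0.003),
    "Alkalinity": (-0.010, -0.001),
    "Calcium": (-0.002, 0.009),
    "Age Data": (-0.139, 0.251),
}

def contains_zero(interval):
    """True when 0 lies inside the closed interval [lower, upper]."""
    lower, upper = interval
    return lower <= 0.0 <= upper

# A predictor is flagged as significant when its interval excludes 0.
significant = [name for name, ci in INTERVALS.items() if not contains_zero(ci)]
print(significant)  # ['Alkalinity']
```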

The VIF values are below 5, indicating that there is no problem of multicollinearity.
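For readers who want to see where a VIF comes from, here is a rough sketch using NumPy: each predictor is regressed on the others, and VIF = 1 / (1 - R²). SPSS reports these values directly; the function name and the small data set below are made up and are not the lake data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        r_squared = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        factors.append(1.0 / (1.0 - r_squared))
    return factors

# Made-up data: x2 is uncorrelated with x1, while x3 = x1 + x2 is correlated with x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, -1, 0, 0, -1, 1]
x3 = [a + b for a, b in zip(x1, x2)]
print(vif(np.column_stack([x1, x2])))  # uncorrelated pair
print(vif(np.column_stack([x1, x3])))  # correlated pair
```

With the uncorrelated pair the factors sit at 1; with the correlated pair they rise above the cut-off of 5 used in this post.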

KOLMOGOROV-SMIRNOV TEST OF NORMALITY
Ho: Distribution is normal
H1: Distribution is not normal
α = 0.05

Test Statistic:
p-value = 0.004

Decision Rule:
If p-value < α, Reject Ho

Decision:
Since p-value = 0.004 < α = 0.05, Reject Ho

Conclusion:
The distribution is not normal. To remedy this problem, transformations need to be done.
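The idea behind the Kolmogorov-Smirnov test can be sketched in plain Python: the statistic is the largest gap between the empirical CDF of the sample (presumably the residuals here) and a normal CDF fitted with the sample mean and standard deviation. This sketch computes only the statistic, not the p-value that SPSS reports, and both samples are made up.

```python
import math
import statistics

def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample standard deviation (n - 1)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
        # Check the gap just after and just before each step of the ECDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

symmetric = [-2, -1, 0, 1, 2]     # made-up, roughly bell-shaped
skewed = [0, 0.1, 0.2, 0.3, 10]   # made-up, one large outlier
print(ks_statistic_normal(symmetric))
print(ks_statistic_normal(skewed))
```

The skewed sample gives a much larger statistic than the symmetric one, which is the kind of departure from normality that leads to rejecting Ho.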




From this scatter plot of the Unstandardized Residual (e) against the Unstandardized Predicted Value (Ŷ), we can deduce that the error term has non-constant variance. The plot shows a heteroscedastic pattern, meaning that the variance of the residuals varies across the data. This might complicate the analysis, because we assume that the error variance is constant. To solve this problem, a transformation needs to be done.
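The post judges heteroscedasticity by eye from the plot. As a rough numerical companion (not a formal test, and not something SPSS produced here), one can compare the residual variance for large fitted values against that for small fitted values. The function name and both residual patterns below are made up.

```python
import statistics

def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return statistics.variance(high) / statistics.variance(low)

fitted = list(range(1, 11))
# Made-up residuals whose spread grows with the fitted value (a fan shape).
fanning = [(-1) ** i * 0.1 * f for i, f in enumerate(fitted)]
# Made-up residuals with the same spread everywhere.
steady = [(-1) ** i * 0.5 for i in range(10)]
print(variance_ratio(fitted, fanning))
print(variance_ratio(fitted, steady))
```

A ratio far from 1 matches the fan shape described above, while a ratio near 1 is what constant error variance would look like.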


TRANSFORMATION



Based on this scatter-plot matrix, the appropriate transformation for this model is log Y.

AFTER TRANSFORMATION

2nd ANALYSIS REGRESSION MODEL (AFTER TRANSFORMATION)



The equation: Log Average Mercury = 0.101 – 0.072(pH) – 0.006(Alkalinity) + 0.002(Calcium) + 0.194(Age Data)

Thus, for every one-unit increase in pH, the Log Average Mercury drops by 0.072, provided that the other predictors remain unchanged. Next, for every 1 mg/L increase in Alkalinity, the Log Average Mercury decreases by 0.006, with the other predictors held constant. Similarly, for every 1 mg/L increase in Calcium, the Log Average Mercury increases by 0.002. Finally, with pH, Alkalinity and Calcium held constant, the Log Average Mercury where age data was recorded (Age Data = 1) is 0.194 higher than where no age data was recorded (Age Data = 0). Note that these changes are on the log scale, not in parts per million.
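Because the response is now on the log scale, each coefficient corresponds to a multiplicative change in Average Mercury itself. The post does not say which log base was used, so this Python sketch shows both common possibilities; the function name is hypothetical.

```python
import math

AGE_DATA_COEF = 0.194  # from the transformed model in the post
PH_COEF = -0.072

def multiplicative_effect(coef, base):
    """Factor applied to Average Mercury per one-unit increase in a predictor."""
    return base ** coef

# The log base is an assumption either way; compare both readings.
for label, base in (("base 10", 10.0), ("natural log", math.e)):
    age = multiplicative_effect(AGE_DATA_COEF, base)
    ph = multiplicative_effect(PH_COEF, base)
    print(f"{label}: Age Data x{age:.3f}, pH x{ph:.3f}")
```

A factor above 1 means Average Mercury is multiplied up, and a factor below 1 means it shrinks, for each one-unit increase in that predictor.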

The p-values for pH and Calcium are both greater than 0.05, so those two predictor variables are not significant for Log Average Mercury. The p-value for Alkalinity is 0.003, which is less than 0.05, so Alkalinity is significant for Log Average Mercury. Lastly, the p-value for Age Data is 0.032, which is less than 0.05, so Age Data is also significant in this model.

The 95% confidence interval for pH is [-0.149, 0.005], and for Calcium it is [-0.003, 0.007]. This indicates that those 2 predictor variables are not significant, because the value 0 falls within each interval. In contrast, the 95% confidence interval for Alkalinity is [-0.01, -0.001] and for Age Data it is [0.017, 0.372], which shows that these predictor variables are significant, since the value 0 does not fall within either interval.

The VIF values are below 5, indicating that there is no serious multicollinearity problem. However, the VIF value for Alkalinity is approaching 5, which suggests that some multicollinearity may be present.



The R-Square value is 0.586, which means 58.6% of the variation in Log Average Mercury can be explained by Age Data, Calcium, pH and Alkalinity.
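Two related quantities follow from R-Square by simple arithmetic: the adjusted R-Square and the overall F statistic. The Python sketch below assumes all 53 lakes were used in the fit with 4 predictors; the post does not report these two numbers, so they are derived here, not quoted from SPSS.

```python
R_SQUARED = 0.586   # reported in the post
N_LAKES = 53        # assumes all 53 lakes were used in the fit
N_PREDICTORS = 4    # pH, Alkalinity, Calcium, Age Data

def adjusted_r_squared(r2, n, k):
    """R-Square penalised for the number of predictors k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def overall_f(r2, n, k):
    """Overall F statistic with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

print(round(adjusted_r_squared(R_SQUARED, N_LAKES, N_PREDICTORS), 4))  # 0.5515
print(round(overall_f(R_SQUARED, N_LAKES, N_PREDICTORS), 2))           # 16.99
```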


THE KOLMOGOROV-SMIRNOV TEST OF NORMALITY

Ho: Distribution is normal
H1: Distribution is not normal
α = 0.05

Test Statistic:
p-value = 0.200

Decision Rule:
If p-value < α, Reject Ho

Decision:
Since p-value= 0.200 > α=0.05, do not Reject Ho

Conclusion:
There is no evidence against normality, so the distribution can be treated as normal. The transformation works.


Even though the VIF values show that the multicollinearity problem is not serious, there are still relationships among the predictor variables. For example, when pH increases, Alkalinity increases too. This violates the assumption that the predictor variables are not strongly related to one another.

The term multicollinearity describes the situation where the independent variables are highly correlated or associated with each other. The presence of multicollinearity often makes the regression coefficients less reliable, inflates their standard errors, and makes it difficult to tell the more important predictor variables from the less important ones.

There are several ways to remedy this problem, such as dropping a variable. Adding an interaction term might also help reduce the multicollinearity problem. I decided to do both: alkalinity is measured in mg/L of calcium carbonate, so Calcium and Alkalinity are physically related, and according to the scatter-plot matrix they are also highly correlated with each other.

Dropping a variable might help reduce the multicollinearity problem, and at the same time I want to interact the two of them, since they ‘interact’ physically.

An interaction term is basically the product of two predictor variables of interest.
For this multiple regression model, I propose to add an interaction term for Calcium and Alkalinity and, at the same time, to drop both of the original predictor variables, without centering their values.
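Since an interaction term is just a product, building one is short. Here is a minimal Python sketch that shows both the uncentred product and a centred alternative for comparison; the function name and the four data points are made up and are not the lake measurements.

```python
import statistics

def interaction_term(x, z, center=False):
    """Element-wise product of two predictors, optionally mean-centred first."""
    if center:
        x_mean, z_mean = statistics.mean(x), statistics.mean(z)
        x = [v - x_mean for v in x]
        z = [v - z_mean for v in z]
    return [a * b for a, b in zip(x, z)]

# Made-up values, not taken from the lake data.
calcium = [3.0, 10.0, 20.0, 45.0]
alkalinity = [5.0, 20.0, 50.0, 110.0]
print(interaction_term(calcium, alkalinity))               # uncentred product
print(interaction_term(calcium, alkalinity, center=True))  # centred alternative
```

The new column would then enter the regression as a single predictor in place of Calcium and Alkalinity.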


Ngahaha... that's as far as I'll go. I can't share more, in case someone copies it later. Ngehehe. This is project work, sort of a presentation for my practical later. For those who want to know, this is more or less what a STATISTICIAN does. The work is building equations to estimate just about anything, and come to think of it, every field in the world needs statisticians. A statistician is taught to be precise in calculation, detailed and creative, with a well-balanced mind that is not overly analytical and stays rational. Don't look down on this field, because not many people can do it. Studying it drives you half crazy too! Ngahaha. The project I did here is still at an early stage. The basic level. Meaning, we just 'stole' this data from the internet. I don't think there was time to collect data with our own questionnaire. No time! Ngahaha.

Still don't get it? Um... think of it this way. Suppose a doctor wants to study a drug he has just created. To find out whether the drug works, he needs the help of a statistician so that his study is more thorough. The statistician runs the research something like this: he calls in 200 patients and splits them into two groups, A and B. Group A receives treatment with the new drug, while group B gets what is called a placebo. The placebo is just a kind of glucose that passes for medicine. The study might run for 2 months, after which the statistician hands out a questionnaire about the patients' health. The statistician then analyses the data and checks whether it should be analysed parametrically or non-parametrically. Non-parametric methods apply when the data received are not normally distributed, and parametric methods apply otherwise. Only then, from that analysis, does the statistician tell the doctor whether the drug works, and after that the doctor carries on with his own scientific methods.

Phew! Who knew a statistician's job was this important (ha!). In Malaysia, statisticians are still few, and not many are experienced. Hopefully the statistician candidates we have now (me included) can succeed in this field. It takes a great deal of diligence, plus the pay is big too! Ngahaha. Sigh... at this rate, I may never get married.... ngahahaha... =p
