Math1041 assignment pdf

Course

Statistics for Life and Social Science (MATH1041)

Academic year: 2022/2023
University of New South Wales

Gabriela Prasetiyo z5391941

Q1

1.a. The following RStudio code is used to determine the standard deviation of the ATAR values, after removing the missing (NA) entries.
ATAR1 <- as.numeric(unlist(data$ATAR))
ATAR2 <- ATAR1[!is.na(ATAR1)]
sd(ATAR2)
$\sigma = 14.73493$
$\sigma = 14.73$ (4 s.f.)

1.b. The following formula and RStudio code are used to determine Daniel's z-score.

$$z = \frac{x - \mu}{\sigma}$$

z_score <- (92.7 - mean(ATAR2)) / sd(ATAR2)
z_score
z-score = 0.5847759
z-score = 0.5848 (4 s.f.)

If x is an observation from a normal distribution with mean ($\mu$) and standard deviation ($\sigma$), its z-score is the number of standard deviations ($\sigma$) that x lies from the mean. Daniel's z-score is positive, indicating that his ATAR is 0.5847759 standard deviations to the right of the mean ($\mu$).
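As a minimal sketch of the two steps above (dropping missing values, then standardising), the R code below uses a small made-up vector in place of `data$ATAR`. The values and the names `atar_raw`, `atar`, `x_bar`, `s`, and `z` are illustrative assumptions, not the assignment's data.

```r
# Hypothetical ATAR values with missing entries (NOT the assignment's data).
atar_raw <- c(85.5, NA, 92.7, 70.1, NA, 99.0, 64.3)

# Keep only the observed values, mirroring the ATAR2 step.
atar <- atar_raw[!is.na(atar_raw)]

# Sample mean and sample standard deviation (sd() divides by n - 1).
x_bar <- mean(atar)
s <- sd(atar)

# z-score of an ATAR of 92.7 relative to this hypothetical sample.
z <- (92.7 - x_bar) / s
print(z)
```

A positive `z` means the observation lies above the sample mean, and a negative `z` means it lies below.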

1.c. Before computing any numerical summary, Daniel should first have filled in the NA values through a method called imputation. One form of imputation that can be used in RStudio is central imputation, in which each missing value is replaced by a central value of the dataset: the mode, median, or mean.
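Central imputation could be sketched in base R as follows. This is only an illustration on a made-up vector; the choice of the median as the central value and the names `atar_raw`, `centre`, and `atar_imputed` are assumptions.

```r
# Hypothetical ATAR values with missing entries (NOT the assignment's data).
atar_raw <- c(85.5, NA, 92.7, 70.1, NA, 99.0, 64.3)

# Central value computed from the observed entries only.
centre <- median(atar_raw, na.rm = TRUE)

# Replace each NA with the central value; observed values are unchanged.
atar_imputed <- ifelse(is.na(atar_raw), centre, atar_raw)
print(atar_imputed)
```

Filling gaps with a single central value keeps the sample size but tends to shrink the standard deviation, so summaries of spread computed afterwards should be read with that in mind.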

1.d. Isabella is right. The groups have different distributions, which the bar chart cannot show: by displaying only a measure of central tendency, the graph suggests that the data are normally distributed without any outliers, which is incorrect. The data are also incomplete, with some N/A values.

1.e.

1.f. The differences are that most female students chose "Labor" whereas most male students chose "Liberal", and that more male students than female students have a political preference between "Liberal" and "Labor". Also, more females than males voted.

Q2

2. The explanatory variable is the type of high school attended, and it is categorical.

2. The response variable is the WAM, and it is quantitative.

2. It is an observational study because the responses are only observed and the variables measured; no treatment is imposed, as there would be in an experiment.

2.

2. In order of spread, the groups are Australian public school, Australian private school, Australian selective school, and non-Australian high school. The distribution for Australian public schools is skewed to the left, the one for Australian private schools is symmetrical, and those for Australian selective schools and non-Australian high schools are skewed to the right. Australian public schools and Australian selective schools have outliers at one end, while Australian private schools and non-Australian high schools have no outliers. An outlier means that there was one unusual score compared to the other scores. The medians of Australian public schools and Australian selective schools are almost identical, while non-Australian high schools have a lower median. Non-Australian high schools have more spread than the other groups.

2. Confounding variables are variables that influence both the independent and the dependent variable, leading to a misleading relationship between the two. In this case a confounding variable is the curriculum of the school, because it is tied to the type of high school and also affects the WAM. Students from Australian high schools, which teach the curriculum of New South Wales, Victoria, Queensland, etc., seem to be doing better in terms of WAM than students who attended a non-Australian high school with a foreign curriculum.

3. The residual is the difference between the observed value and the predicted value. In this context it is Daniel's current WAM minus his predicted WAM. It is negative, meaning that the predicted value is greater than the observed value.
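The definition above (residual = observed minus predicted) can be illustrated with a small least-squares fit in R. The data below are invented, and using ATAR as the predictor of WAM is an assumption made only for this sketch.

```r
# Hypothetical (ATAR, WAM) pairs (NOT the assignment's data).
atar <- c(70, 80, 85, 90, 95)
wam <- c(60, 68, 70, 71, 80)

# Least-squares line predicting WAM from ATAR.
fit <- lm(wam ~ atar)

# Predicted WAM for each student, and residual = observed - predicted.
predicted <- fitted(fit)
resid_manual <- wam - predicted
print(resid_manual)
```

A negative entry in `resid_manual` marks a student whose observed WAM is below the value the line predicts, which is the situation described for Daniel.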

Q4

4. The formula for the confidence interval is

$$CI = \bar{x} \pm t^{*} \frac{s}{\sqrt{n}}$$

We know that the mean, mean(data$WAM), is $\bar{x}$ = 72.

And the standard deviation, sd(data$WAM), is $s$ = 7.

$t^{*}$ is the critical value from $t(n - 1)$, where $n - 1$ is the degrees of freedom (df).

$n = 108$

Therefore, we take $t^{*}$ from $t(n - 1) = t(108 - 1) = t(107)$.

Since it is a 95% confidence interval, the quantile is

$$\text{Quantile} = 0.95 + \frac{1 - 0.95}{2} = 0.975$$

and in RStudio it would be qt(0.975, df = 107) = 1.98 (2 d.p.).

Then we put these values into the confidence interval formula for the lower bound and the upper bound:

$$CI = \left[ 72 - 1.98 \times \frac{7.}{\sqrt{108}},\ 72 + 1.98 \times \frac{7.}{\sqrt{108}} \right]$$

Therefore, the 95% confidence interval for the mean WAM is (71, 73).
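The calculation above can be sketched end to end in R. Because `data$WAM` is not available here, the code simulates a stand-in sample of the same size; the simulated values and the names `wam`, `t_star`, `margin`, and `ci` are assumptions for illustration only.

```r
# Hypothetical WAM sample standing in for data$WAM (NOT the assignment's data).
set.seed(1)
wam <- rnorm(108, mean = 72, sd = 7)

n <- length(wam)

# Critical value t* for a 95% interval: the 0.975 quantile of t(n - 1).
t_star <- qt(0.975, df = n - 1)

# Margin of error t* * s / sqrt(n), then lower and upper bounds.
margin <- t_star * sd(wam) / sqrt(n)
ci <- c(mean(wam) - margin, mean(wam) + margin)
print(ci)

# Cross-check against R's built-in one-sample t interval.
print(t.test(wam, conf.level = 0.95)$conf.int)
```

The manual bounds and the `t.test` interval should agree, which is a quick way to check the hand calculation.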

4. From the confidence interval in 4., the margin of error is

$$t^{*} \frac{s}{\sqrt{n}} = 1.98 \times \frac{7.}{\sqrt{108}} = 1.$$

(4 s.f.)

Because the incomplete surveys were removed, non-response will not have an impact on the margin of error in this case.

4. The statement is untrue. The sample mean and standard deviation are computed directly from the data, whereas the confidence interval only provides a lower and an upper bound for the true mean.

The confidence interval is not a measure of probability for the true mean, as is stated in the question. The true mean is a fixed parameter that does not vary from sample to sample, so it is either within the limits or outside them, and its position cannot be expressed as a probability. It is the sample mean, and therefore the interval, that varies from sample to sample.
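One way to see this is a small simulation, sketched below under assumed values: the "true" mean and standard deviation (72 and 7) and the number of repetitions are invented for illustration. Each repetition draws a new sample and builds a new 95% interval; the true mean never moves, and each interval either contains it or does not.

```r
# Hypothetical population parameters, chosen only for illustration.
set.seed(42)
true_mean <- 72
true_sd <- 7
n <- 108
reps <- 2000

# For each simulated sample, record whether its 95% t interval
# contains the fixed true mean.
covered <- replicate(reps, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  (mean(x) - margin <= true_mean) && (true_mean <= mean(x) + margin)
})

# Proportion of intervals that captured the true mean.
coverage <- mean(covered)
print(coverage)
```

The 95% figure describes how often the procedure captures the true mean over many samples, which is why `coverage` should come out close to 0.95 even though any single interval simply hits or misses.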

P-value: $P(T \ge t)$

$P(T \ge 1) = 1 - P(T < 1)$. The null distribution is $t(n - 1)$:

$t(108 - 1) = t(107)$. In RStudio the P-value would be 1 - pt(1, df = 107) = 0. (4 s.f.)

Therefore, we can conclude that there is little to no evidence that there was an increase in the students' stress levels at Randwick university during Covid compared to the pre-pandemic situation.
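The upper-tail P-value calculation can be sketched in R as below. The data are simulated, and treating the test as a one-sample t-test of the mean change in stress against 0 is an assumption made for this illustration; the names `change`, `t_obs`, and `p_value` are hypothetical.

```r
# Hypothetical changes in stress score (NOT the assignment's data).
set.seed(7)
change <- rnorm(108, mean = 0.3, sd = 3)

n <- length(change)

# Observed test statistic for H0: mean change = 0.
t_obs <- (mean(change) - 0) / (sd(change) / sqrt(n))

# Upper-tail P-value P(T >= t_obs) under the t(n - 1) null distribution.
p_value <- 1 - pt(t_obs, df = n - 1)
print(p_value)

# Cross-check with the built-in one-sided t-test.
print(t.test(change, mu = 0, alternative = "greater")$p.value)
```

A large P-value means the data are compatible with no increase, while a small one counts as evidence of an increase.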

5.

  1. The observations are independent.
  2. The distribution of each random variable is normal with the same mean $\mu$ and standard deviation $\sigma$.

The two conditions above were satisfied. The dataset was generated at random and each observation had an equal chance of being chosen, so the differences between the stress levels at Randwick university during Covid and before the pandemic are independent. The normality condition is supported by the quantile plot.
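A normal quantile plot check could be sketched in R as follows. The data are simulated stand-ins, and summarising the plot by the correlation between theoretical and sample quantiles is an extra illustrative step, not something the assignment states.

```r
# Hypothetical differences in stress score (NOT the assignment's data).
set.seed(3)
diffs <- rnorm(108, mean = 0.3, sd = 3)

# Theoretical vs sample quantiles; plot.it = FALSE returns the points
# without drawing. Calling qqnorm(diffs); qqline(diffs) draws the plot.
qq <- qqnorm(diffs, plot.it = FALSE)

# Points close to a straight line (correlation near 1) are consistent
# with the normality condition.
r <- cor(qq$x, qq$y)
print(r)
```

Marked curvature or isolated points far from the line in the plot would suggest skewness or outliers, in which case the normality condition would be in doubt.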
