pacman::p_load(tidyverse, haven, knitr, DT,
               labelled, janitor, gtsummary)
1 PISA Data
The 2022 PISA Data is available on the OECD website.
2 Getting Started
Library | Description |
---|---|
tidyverse, janitor | For data preparation, wrangling, and exploration. |
haven | To enable R to read and write various data formats such as SAS and SPSS. |
knitr, DT, kableExtra | For dynamic report generation. |
labelled | For reading and manipulating variable labels. |
gtsummary | For summary and analytical tables. |
The code chunk above uses p_load() of the pacman package to check whether the listed packages are installed on the computer. Any missing packages are installed first, and all of them are then loaded into R.
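For context, p_load() simply automates the usual install-then-load pattern. A minimal base R sketch of the equivalent steps for a single package (illustrative only, not pacman's actual implementation):

# Install-then-load pattern that p_load() automates for each package
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven")
}
library(haven)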
3 Reading Data into R
From PISA 2022, SAS data sets (.sas7bdat) are available with all countries in the file for each respondent type.
The code chunk below imports the 2022 Student Questionnaire dataset downloaded from OECD’s PISA Database using read_sas() from the haven package.
stu <- read_sas("data/cy08msp_stu_qqq.sas7bdat")
The dataset is a tibble data frame containing 613,744 observations (rows) across 1,279 variables (columns). Each observation corresponds to an entry from a student who participated in the 2022 PISA survey for students, and the variables correspond to information from students on various aspects of their home, family, and school background.
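A quick sanity check of these dimensions (a sketch, assuming stu has been imported as above):

dim(stu)             # expect 613744 rows and 1279 columns
n_distinct(stu$CNT)  # number of distinct participating countries/economies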
CNT refers to the country of response. We can use this to filter for Singapore responses (where CNT = "SGP") for our analysis. filter() of the dplyr package allows us to perform this extraction.
stu_SG <- stu %>%
  filter(CNT == "SGP")
The resulting data contains 6,606 rows/observations across 1,279 columns/variables.
The .rds file format is usually smaller than its SAS file counterpart and will therefore take up less storage space. The .rds file also preserves data types and classes such as factors and dates, eliminating the need to redefine data types after loading the file. For fast and space-efficient data storage, files can be exported as RDS and re-imported into R using write_rds() and read_rds() respectively.
write_rds(stu_SG, "data/stu_SG.rds")
stu_SG <- read_rds("data/stu_SG.rds")
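To see the storage savings concretely, the on-disk sizes of the two files can be compared with base R's file.size() (a sketch, assuming both files exist at the paths used above):

# On-disk size in bytes of the SAS source versus the exported RDS
file.size("data/cy08msp_stu_qqq.sas7bdat")
file.size("data/stu_SG.rds")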
4 Data Wrangling
The chart below provides an overview of the different categories the team hopes to focus on to understand their impact on student scores.
flowchart TD
    A[2022 PISA Survey Student Questionnaire] -.-> A11[Gender]
    A -.-> A12[Socio-economic]
    A -.-> A13[Wellbeing]
    A -.-> A14[Attitude]
    A -.-> A15[Environment]
    A -.-> A16[Schools]
4.1 Filtering for required dataset
After perusing the Codebook and Technical Report, the team narrowed down the questions from the survey that would yield insightful results. The names of the relevant columns are stored in a vector named colname. To filter the raw dataset down to these columns, we use the select() function of the dplyr package to retain all the variables listed in the colname vector.
<- c("CNTSCHID", "ST034Q06TA", "ST265Q03JA", "ST270Q03JA", "ST004D01T", "ST296Q01JA", "ST296Q02JA", "ST296Q03JA", "STRATUM", "HISCED", "IMMIG", "ST022Q01TA", "ST230Q01JA", "ST250D06JA", "ST250D07JA", "ST251Q01JA", "ST255Q01JA", "EXERPRAC", "ST250Q01JA", "WORKHOME", "ST268Q01JA", "ST268Q02JA", "ST268Q03JA") colname
The following code chunk serves two purposes:

- select() retains the following columns:
  - variables identified in colname, and
  - columns that start with “PV” and contain either “MATH”, “SCIE”, or “READ”, to extract the plausible values of scores for Mathematics, Science, and Reading. This is performed using a combination of starts_with() and contains():
    - starts_with() matches the beginning of the column name with “PV”, and
    - contains() searches for columns containing any of the three subject strings.
- mutate() creates 3 new variables that store the mean plausible value of each subject for each row, using rowMeans() and across().
stu_SG_filtered <- stu_SG %>%
  # Retains desired variables
  select(all_of(colname), starts_with("PV") & contains(c("MATH", "READ", "SCIE"))) %>%
  # Calculates the mean of plausible values for each subject per student
  mutate(Math = rowMeans(across(starts_with("PV") & contains("MATH")), na.rm = TRUE),
         Reading = rowMeans(across(starts_with("PV") & contains("READ")), na.rm = TRUE),
         Science = rowMeans(across(starts_with("PV") & contains("SCIE")), na.rm = TRUE)) %>%
  # Drops the plausible values columns
  select(-starts_with("PV"))
stu_SG_filtered contains 6,606 observations across 26 variables. The code chunk below uses generate_dictionary() of the labelled package to display a data dictionary of the retained variables.
stu_SG_filtered %>%
  generate_dictionary() %>%
  kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"),
                            fixed_thead = T)
pos | variable | label | col_type | missing | levels | value_labels |
---|---|---|---|---|---|---|
1 | CNTSCHID | Intl. School ID | dbl | 0 | NULL | NULL |
2 | ST034Q06TA | Agree/disagree: I feel lonely at school. | dbl | 1147 | NULL | NULL |
3 | ST265Q03JA | Agree/disagree: I feel safe in my classrooms at school. | dbl | 44 | NULL | NULL |
4 | ST270Q03JA | How often: The teacher helps students with their learning. | dbl | 68 | NULL | NULL |
5 | ST004D01T | Student (Standardized) Gender | dbl | 0 | NULL | NULL |
6 | ST296Q01JA | How much time spent on homework in: Mathematics homework | dbl | 70 | NULL | NULL |
7 | ST296Q02JA | How much time spent on homework in: [Test language] homework | dbl | 77 | NULL | NULL |
8 | ST296Q03JA | How much time spent on homework in: [Science] homework | dbl | 87 | NULL | NULL |
9 | STRATUM | Stratum ID 5-character (cnt + original stratum ID) | chr | 0 | NULL | NULL |
10 | HISCED | Highest level of education of parents (ISCED) | dbl | 57 | NULL | NULL |
11 | IMMIG | Index on immigrant background (OECD definition) | dbl | 236 | NULL | NULL |
12 | ST022Q01TA | What language do you speak at home most of the time? | dbl | 42 | NULL | NULL |
13 | ST230Q01JA | How many siblings (including brothers, sisters, step-brothers, and step-sisters) do you have? | dbl | 43 | NULL | NULL |
14 | ST250D06JA | Which of the following are in your home? <Country-specific item 1> | chr | 0 | NULL | NULL |
15 | ST250D07JA | Which of the following are in your home? <Country-specific item 2> | chr | 0 | NULL | NULL |
16 | ST251Q01JA | How many of these items are there at your [home]: Cars, vans, or trucks | dbl | 47 | NULL | NULL |
17 | ST255Q01JA | How many books are there in your [home]? | dbl | 44 | NULL | NULL |
18 | EXERPRAC | Exercise or practice a sport before or after school | dbl | 47 | NULL | NULL |
19 | ST250Q01JA | Which of the following are in your [home]: A room of your own | dbl | 66 | NULL | NULL |
20 | WORKHOME | Working in household/take care of family members before or after school | dbl | 51 | NULL | NULL |
21 | ST268Q01JA | Agree/disagree: Mathematics is one of my favourite subjects. | dbl | 69 | NULL | NULL |
22 | ST268Q02JA | Agree/disagree: [Test language] is one of my favourite subjects. | dbl | 74 | NULL | NULL |
23 | ST268Q03JA | Agree/disagree: [Science] is one of my favourite subjects. | dbl | 79 | NULL | NULL |
24 | Math | NA | dbl | 0 | NULL | NULL |
25 | Reading | NA | dbl | 0 | NULL | NULL |
26 | Science | NA | dbl | 0 | NULL | NULL |
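Note that the three derived columns (Math, Reading, Science) carry no variable label (label shows NA) because they were created with mutate() rather than read from the labelled SAS file. If labelled output is desired, labels could be attached with set_variable_labels() of the labelled package; a minimal sketch with hypothetical label text:

# Hypothetical labels for the derived mean-score columns
stu_SG_filtered <- stu_SG_filtered %>%
  set_variable_labels(Math    = "Mean of plausible values: Mathematics",
                      Reading = "Mean of plausible values: Reading",
                      Science = "Mean of plausible values: Science")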
4.2 Renaming Columns
To make the variables easier to work with, rename() of the dplyr package is used to give each column a descriptive name:

stu_SG_filtered <- stu_SG_filtered %>%
  dplyr::rename(
    "SchoolID" = "CNTSCHID",
    "Loneliness" = "ST034Q06TA",
    "ClassroomSafety" = "ST265Q03JA",
    "TeacherSupport" = "ST270Q03JA",
    "Gender" = "ST004D01T",
    "Homework_Math" = "ST296Q01JA",
    "Homework_Reading" = "ST296Q02JA",
    "Homework_Science" = "ST296Q03JA",
    "SchoolType" = "STRATUM",
    "ParentsEducation" = "HISCED",
    "Immigration" = "IMMIG",
    "HomeLanguage" = "ST022Q01TA",
    "Sibling" = "ST230Q01JA",
    "Aircon" = "ST250D06JA",
    "Helper" = "ST250D07JA",
    "Vehicle" = "ST251Q01JA",
    "Books" = "ST255Q01JA",
    "Exercise" = "EXERPRAC",
    "OwnRoom" = "ST250Q01JA",
    "FamilyCommitment" = "WORKHOME",
    "Preference_Math" = "ST268Q01JA",
    "Preference_Reading" = "ST268Q02JA",
    "Preference_Science" = "ST268Q03JA"
  )
Rows: 6,606
Columns: 26
$ SchoolID <dbl> 70200052, 70200134, 70200112, 70200004, 70200152, 7…
$ Loneliness <dbl> 3, 3, 3, NA, 4, 4, 3, 4, 3, 3, 3, 4, NA, NA, 3, 4, …
$ ClassroomSafety <dbl> 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, …
$ TeacherSupport <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 3, 2, 1, 1, 2, …
$ Gender <dbl> 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, …
$ Homework_Math <dbl> 1, 3, 2, 3, 4, 1, 1, 2, 1, 3, 3, 4, 3, 1, 2, 1, 3, …
$ Homework_Reading <dbl> 1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 3, 2, 1, 4, 1, 3, …
$ Homework_Science <dbl> 2, 3, 3, 2, 4, 1, 1, 2, 1, 2, 3, 4, 3, 1, 3, 1, 3, …
$ SchoolType <chr> "SGP01", "SGP01", "SGP01", "SGP01", "SGP01", "SGP01…
$ ParentsEducation <dbl> 8, 7, 4, 6, 7, 9, 6, 9, 8, 8, 4, 9, 10, 9, 6, 9, 9,…
$ Immigration <dbl> 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1, 1, 3, 1, 2, 3, …
$ HomeLanguage <dbl> 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, …
$ Sibling <dbl> 4, 4, 2, 4, 4, 3, 2, 2, 3, 4, 1, 3, 4, 1, 4, 3, 2, …
$ Aircon <chr> "7020002", "7020001", "7020001", "7020002", "702000…
$ Helper <chr> "7020002", "7020001", "7020002", "7020002", "702000…
$ Vehicle <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 3, 3, 1, 2, 2, 1, 2, 2, 1, …
$ Books <dbl> 7, 4, 4, 3, 2, 2, 4, 5, 7, 4, 3, 7, 4, 4, 2, 4, 5, …
$ Exercise <dbl> 1, 4, 2, 5, 9, 1, 2, 0, 3, 5, 1, 2, 5, 2, 4, 0, 2, …
$ OwnRoom <dbl> 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, …
$ FamilyCommitment <dbl> 10, 2, 0, 10, 5, 5, 7, 0, 0, 4, 2, 2, 10, 0, 10, 0,…
$ Preference_Math <dbl> 2, 4, 3, 2, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 1, 3, 2, …
$ Preference_Reading <dbl> 3, 3, 2, 3, 4, 3, 3, 2, 2, 2, 2, 3, 2, 2, 4, 3, 2, …
$ Preference_Science <dbl> 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 4, 2, 4, 2, 2, 2, …
$ Math <dbl> 605.2533, 689.9528, 676.7768, 401.0528, 436.1151, 5…
$ Reading <dbl> 667.4296, 627.6078, 582.9252, 361.3969, 475.6763, 4…
$ Science <dbl> 639.7873, 672.0703, 660.0384, 343.6425, 479.2390, 4…
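As a design note, the same renaming can also be driven by a named lookup vector, which keeps the old-to-new mapping in a single reusable object. A sketch of the equivalent pattern (name_map is a hypothetical object, abbreviated to three pairs):

# Named vector of the form new_name = "old_name"
name_map <- c(SchoolID = "CNTSCHID",
              Loneliness = "ST034Q06TA",
              Gender = "ST004D01T")  # ...remaining pairs omitted

stu_SG_filtered <- stu_SG_filtered %>%
  rename(all_of(name_map))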
4.3 Dropping Invalid Responses
Some responses in our data are marked as invalid or missing. In the Helper and Aircon variables, these are coded as “9999999”. In this next step, we convert these codes to NA and drop the affected rows.
stu_SG_filtered <- stu_SG_filtered %>%
  mutate(Aircon = na_if(Aircon, "9999999"),
         Helper = na_if(Helper, "9999999")) %>%
  na.omit()
stu_SG_filtered now contains 5,158 observations across 26 variables. Note that na.omit() removes every row containing an NA in any column, which is why the number of observations falls from 6,606 to 5,158.
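Had the intent been to drop only the rows that are invalid in Aircon or Helper while keeping other partial responses, drop_na() of the tidyr package with selected columns would be the targeted alternative. A sketch (stu_SG_targeted is hypothetical, and the pipeline assumes the version of stu_SG_filtered from before the chunk above):

# Drop only rows with NA in the two country-specific items
stu_SG_targeted <- stu_SG_filtered %>%
  mutate(Aircon = na_if(Aircon, "9999999"),
         Helper = na_if(Helper, "9999999")) %>%
  drop_na(Aircon, Helper)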
4.4 Recoding and Ranking Questionnaire Responses
There are several types of responses in the Student Questionnaire. We store the response levels for each question in separate vectors and subsequently combine them into a global dictionary named dicts.
<- c('1' = "0",
Books '2' = "1 - 10",
'3' = "11 - 25",
'4' = "26 - 100",
'5' = "101 - 200",
'6' = "201-500",
'7' = ">500")
<- c('1' = "English",
HomeLanguage '2' = "Others")
# Likert Scales: Strong Disagree to Strongly Agree
<- c('1' = "Strongly Disagree",
Preference_Math '2' = "Disagree",
'3' = "Agree",
'4' = "Strongly Agree")
<- c('1' = "Strongly Disagree",
Preference_Reading '2' = "Disagree",
'3' = "Agree",
'4' = "Strongly Agree")
<- c('1' = "Strongly Disagree",
Preference_Science '2' = "Disagree",
'3' = "Agree",
'4' = "Strongly Agree")
# Likert Scales: Strong Agree to Strongly Disagree
<- c('1' = "Strongly Agree",
Loneliness '2' = "Agree",
'3' = "Disagree",
'4' = "Strongly Disagree")
<- c('1' = "Strongly Agree",
ClassroomSafety '2' = "Agree",
'3' = "Disagree",
'4' = "Strongly Disagree")
# Binary
<- c('SGP01' = "Public",
SchoolType 'SGP03' = "Private")
<- c('1' = "Yes",
OwnRoom '2' = "No")
<- c('7020001' = "Yes",
Aircon '7020002' = "No")
<- c('7020001' = "Yes",
Helper '7020002' = "No")
# Frequency responses
<- c('0' = "0",
Exercise '1' = "1",
'2' = "2",
'3' = "3",
'4' = "4",
'5' = "5",
'6' = "6",
'7' = "7",
'8' = "8",
'9' = "9",
'10' = "10")
<- c('0' = "0",
FamilyCommitment '1' = "1",
'2' = "2",
'3' = "3",
'4' = "4",
'5' = "5",
'6' = "6",
'7' = "7",
'8' = "8",
'9' = "9",
'10' = "10")
# Time Periods
<- c('1' = "≤ 0.5hr",
Homework_Math '2' = "0.5hr - 1hr",
'3' = "1hr - 2hr",
'4' = "2hr - 3hr",
'5' = "3 - 4 hr",
'6' = "> 4hr")
<- c('1' = "≤ 0.5hr",
Homework_Reading '2' = "0.5hr - 1hr",
'3' = "1hr - 2hr",
'4' = "2hr - 3hr",
'5' = "3 - 4 hr",
'6' = "> 4hr")
<- c('1' = "≤ 0.5hr",
Homework_Science '2' = "0.5hr - 1hr",
'3' = "1hr - 2hr",
'4' = "2hr - 3hr",
'5' = "3 - 4 hr",
'6' = "> 4hr")
# Gender
<- c('1' = "Female",
Gender '2' = "Male")
# Immigrant Background
<- c('1' = "Native",
Immigration '2' = "2nd Generation",
'3' = "3rd Generation")
# Education Level
<- c('1'="Pre-Primary",
ParentsEducation '2'="Primary",
'3'="Secondary",
'4'='Secondary',
'6'="Post-Secondary",
'7'="Post-Secondary",
'8'="Tertiary",
'9'="Tertiary",
'10'="Tertiary")
# Posessions
<- c('1' = "0",
Vehicle '2' = "1",
'3' = "2",
'4' = "≥3")
<- c('1' = "0",
Sibling '2' = "1",
'3' = "2",
'4' = "≥3")
# Support
<- c('1' = "Every lesson",
TeacherSupport '2' = "Most lesson",
'3' = "Some lessons",
'4' = "Never or almost never")
# Global Dictionary
<- list(
dicts "Loneliness" = Loneliness,
"ClassroomSafety" = ClassroomSafety,
"TeacherSupport" = TeacherSupport,
"Gender" = Gender,
"Homework_Math" = Homework_Math,
"Homework_Reading" = Homework_Reading,
"Homework_Science" = Homework_Science,
"SchoolType" = SchoolType,
"ParentsEducation" = ParentsEducation,
"Immigration" = Immigration,
"HomeLanguage" = HomeLanguage,
"Sibling" = Sibling,
"Aircon" = Aircon,
"Helper" = Helper,
"Vehicle" = Vehicle,
"Books" = Books,
"Exercise" = Exercise,
"OwnRoom" = OwnRoom,
"FamilyCommitment" = FamilyCommitment,
"Preference_Math" = Preference_Math,
"Preference_Reading" = Preference_Reading,
"Preference_Science" = Preference_Science
)
The helper function below recodes each column based on the global recode dictionary, dicts, using functions from the base R, dplyr, and rlang packages:

- names(x) retrieves the column name of the input data frame,
- recode() recodes the values in the column using dicts,
- !!sym(x_nm) unquotes and evaluates the column name that matches the names of the dictionaries, while !!!dicts[[x_nm]] unquotes and splices the global recoding dictionary corresponding to the column name.
rcd <- function(x) {
  x_nm <- names(x)
  mutate(x, !!x_nm := recode(!!sym(x_nm), !!!dicts[[x_nm]]))
}
lmap_at() of the purrr package applies the helper function to every column of the data frame whose name matches a key of the dictionaries.
stu_SG_rcd <- lmap_at(stu_SG_filtered,
                      names(dicts),
                      rcd)
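The key detail is that lmap_at() passes each matched column to rcd as a one-column data frame (a list of length 1), which is why names(x) inside the helper returns the column name. A quick illustration on a single column, assuming the objects defined above:

# Calling rcd directly on a one-column data frame mirrors what lmap_at() does
rcd(stu_SG_filtered["Gender"]) %>% head()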
The mutate()
function in the dplyr package and the fct_relevel()
function in the forcats package are then used to set the order for ordinal variables.
stu_SG_rcd <- stu_SG_rcd %>%
  mutate_if(is.character, as.factor) %>%
mutate(SchoolID = factor(SchoolID)) %>%
mutate(Books = fct_relevel(Books,
"0",
"1 - 10",
"11 - 25",
"26 - 100",
"101 - 200",
"201-500",
">500"),
Preference_Math = fct_relevel(Preference_Math,
"Strongly Disagree",
"Disagree",
"Agree",
"Strongly Agree"),
Preference_Reading = fct_relevel(Preference_Reading,
"Strongly Disagree",
"Disagree",
"Agree",
"Strongly Agree"),
Preference_Science = fct_relevel(Preference_Science,
"Strongly Disagree",
"Disagree",
"Agree",
"Strongly Agree"),
Loneliness = fct_relevel(Loneliness,
"Strongly Disagree",
"Disagree",
"Agree",
"Strongly Agree"),
ClassroomSafety = fct_relevel(ClassroomSafety,
"Strongly Disagree",
"Disagree",
"Agree",
"Strongly Agree"),
Exercise = fct_relevel(Exercise,
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"10"),
FamilyCommitment = fct_relevel(FamilyCommitment,
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"10"),
Homework_Math = fct_relevel(Homework_Math,
"≤ 0.5hr",
"0.5hr - 1hr",
"1hr - 2hr",
"2hr - 3hr",
"3 - 4 hr",
"> 4hr"),
Homework_Reading = fct_relevel(Homework_Reading,
"≤ 0.5hr",
"0.5hr - 1hr",
"1hr - 2hr",
"2hr - 3hr",
"3 - 4 hr",
"> 4hr"),
Homework_Science = fct_relevel(Homework_Science,
"≤ 0.5hr",
"0.5hr - 1hr",
"1hr - 2hr",
"2hr - 3hr",
"3 - 4 hr",
"> 4hr"),
Immigration = fct_relevel(Immigration,
"Native",
"2nd Generation",
"3rd Generation"),
ParentsEducation = fct_relevel(ParentsEducation,
"Pre-Primary",
"Primary",
"Secondary",
"Post-Secondary",
"Tertiary"),
Vehicle = fct_relevel(Vehicle,
"0",
"1",
"2",
"≥3"),
Sibling = fct_relevel(Sibling,
"0",
"1",
"2",
"≥3"),
TeacherSupport = fct_relevel(TeacherSupport,
"Never or almost never",
"Some lessons",
"Most lesson",
"Every lesson"))
Characteristic | N | N = 5,158¹ |
---|---|---|
SchoolID | 5,158 | |
70200001 | 50 (1.0%) | |
70200002 | 29 (0.6%) | |
70200003 | 31 (0.6%) | |
70200004 | 41 (0.8%) | |
70200005 | 30 (0.6%) | |
70200006 | 25 (0.5%) | |
70200007 | 27 (0.5%) | |
70200008 | 30 (0.6%) | |
70200009 | 30 (0.6%) | |
70200010 | 30 (0.6%) | |
70200011 | 51 (1.0%) | |
70200012 | 41 (0.8%) | |
70200013 | 49 (0.9%) | |
70200014 | 30 (0.6%) | |
70200015 | 22 (0.4%) | |
70200016 | 30 (0.6%) | |
70200017 | 29 (0.6%) | |
70200018 | 41 (0.8%) | |
70200019 | 29 (0.6%) | |
70200020 | 48 (0.9%) | |
70200021 | 35 (0.7%) | |
70200022 | 26 (0.5%) | |
70200023 | 29 (0.6%) | |
70200024 | 27 (0.5%) | |
70200025 | 29 (0.6%) | |
70200026 | 45 (0.9%) | |
70200027 | 44 (0.9%) | |
70200028 | 20 (0.4%) | |
70200029 | 33 (0.6%) | |
70200030 | 26 (0.5%) | |
70200031 | 42 (0.8%) | |
70200032 | 31 (0.6%) | |
70200033 | 24 (0.5%) | |
70200034 | 26 (0.5%) | |
70200035 | 42 (0.8%) | |
70200036 | 26 (0.5%) | |
70200037 | 32 (0.6%) | |
70200038 | 32 (0.6%) | |
70200039 | 31 (0.6%) | |
70200040 | 44 (0.9%) | |
70200041 | 4 (<0.1%) | |
70200042 | 30 (0.6%) | |
70200043 | 40 (0.8%) | |
70200044 | 44 (0.9%) | |
70200045 | 47 (0.9%) | |
70200046 | 26 (0.5%) | |
70200047 | 32 (0.6%) | |
70200048 | 27 (0.5%) | |
70200049 | 43 (0.8%) | |
70200050 | 22 (0.4%) | |
70200051 | 29 (0.6%) | |
70200052 | 44 (0.9%) | |
70200053 | 35 (0.7%) | |
70200054 | 27 (0.5%) | |
70200055 | 28 (0.5%) | |
70200056 | 22 (0.4%) | |
70200057 | 20 (0.4%) | |
70200058 | 28 (0.5%) | |
70200059 | 33 (0.6%) | |
70200060 | 24 (0.5%) | |
70200061 | 30 (0.6%) | |
70200062 | 42 (0.8%) | |
70200063 | 32 (0.6%) | |
70200064 | 27 (0.5%) | |
70200065 | 29 (0.6%) | |
70200066 | 47 (0.9%) | |
70200067 | 41 (0.8%) | |
70200068 | 30 (0.6%) | |
70200069 | 28 (0.5%) | |
70200070 | 29 (0.6%) | |
70200071 | 48 (0.9%) | |
70200072 | 27 (0.5%) | |
70200073 | 30 (0.6%) | |
70200074 | 30 (0.6%) | |
70200075 | 49 (0.9%) | |
70200076 | 29 (0.6%) | |
70200077 | 30 (0.6%) | |
70200078 | 16 (0.3%) | |
70200079 | 20 (0.4%) | |
70200080 | 28 (0.5%) | |
70200081 | 29 (0.6%) | |
70200082 | 46 (0.9%) | |
70200083 | 28 (0.5%) | |
70200084 | 29 (0.6%) | |
70200085 | 28 (0.5%) | |
70200086 | 32 (0.6%) | |
70200087 | 28 (0.5%) | |
70200088 | 32 (0.6%) | |
70200089 | 29 (0.6%) | |
70200090 | 31 (0.6%) | |
70200091 | 26 (0.5%) | |
70200092 | 30 (0.6%) | |
70200093 | 33 (0.6%) | |
70200094 | 41 (0.8%) | |
70200095 | 21 (0.4%) | |
70200096 | 28 (0.5%) | |
70200097 | 28 (0.5%) | |
70200098 | 25 (0.5%) | |
70200099 | 23 (0.4%) | |
70200100 | 27 (0.5%) | |
70200101 | 31 (0.6%) | |
70200102 | 32 (0.6%) | |
70200103 | 24 (0.5%) | |
70200104 | 29 (0.6%) | |
70200105 | 45 (0.9%) | |
70200106 | 31 (0.6%) | |
70200107 | 28 (0.5%) | |
70200108 | 27 (0.5%) | |
70200109 | 27 (0.5%) | |
70200110 | 45 (0.9%) | |
70200111 | 47 (0.9%) | |
70200112 | 32 (0.6%) | |
70200113 | 28 (0.5%) | |
70200114 | 37 (0.7%) | |
70200115 | 14 (0.3%) | |
70200116 | 29 (0.6%) | |
70200117 | 29 (0.6%) | |
70200118 | 44 (0.9%) | |
70200119 | 40 (0.8%) | |
70200120 | 28 (0.5%) | |
70200121 | 23 (0.4%) | |
70200122 | 26 (0.5%) | |
70200123 | 28 (0.5%) | |
70200124 | 26 (0.5%) | |
70200125 | 30 (0.6%) | |
70200126 | 34 (0.7%) | |
70200127 | 29 (0.6%) | |
70200128 | 30 (0.6%) | |
70200129 | 28 (0.5%) | |
70200130 | 46 (0.9%) | |
70200131 | 26 (0.5%) | |
70200132 | 46 (0.9%) | |
70200133 | 28 (0.5%) | |
70200134 | 32 (0.6%) | |
70200135 | 31 (0.6%) | |
70200136 | 26 (0.5%) | |
70200137 | 31 (0.6%) | |
70200138 | 14 (0.3%) | |
70200139 | 46 (0.9%) | |
70200140 | 28 (0.5%) | |
70200141 | 35 (0.7%) | |
70200142 | 44 (0.9%) | |
70200143 | 32 (0.6%) | |
70200144 | 32 (0.6%) | |
70200145 | 43 (0.8%) | |
70200146 | 31 (0.6%) | |
70200147 | 20 (0.4%) | |
70200148 | 23 (0.4%) | |
70200149 | 18 (0.3%) | |
70200151 | 33 (0.6%) | |
70200152 | 25 (0.5%) | |
70200153 | 28 (0.5%) | |
70200154 | 30 (0.6%) | |
70200155 | 44 (0.9%) | |
70200156 | 28 (0.5%) | |
70200157 | 28 (0.5%) | |
70200158 | 24 (0.5%) | |
70200159 | 48 (0.9%) | |
70200160 | 26 (0.5%) | |
70200161 | 27 (0.5%) | |
70200162 | 34 (0.7%) | |
70200163 | 27 (0.5%) | |
70200164 | 16 (0.3%) | |
70200165 | 29 (0.6%) | |
Loneliness | 5,158 | |
Strongly Disagree | 1,412 (27%) | |
Disagree | 2,747 (53%) | |
Agree | 783 (15%) | |
Strongly Agree | 216 (4.2%) | |
ClassroomSafety | 5,158 | |
Strongly Disagree | 72 (1.4%) | |
Disagree | 149 (2.9%) | |
Agree | 2,261 (44%) | |
Strongly Agree | 2,676 (52%) | |
TeacherSupport | 5,158 | |
Never or almost never | 80 (1.6%) | |
Some lessons | 568 (11%) | |
Most lessons | 1,799 (35%) | |
Every lesson | 2,711 (53%) | |
Gender | 5,158 | |
Female | 2,542 (49%) | |
Male | 2,616 (51%) | |
Homework_Math | 5,158 | |
≤ 0.5hr | 1,185 (23%) | |
0.5hr - 1hr | 1,671 (32%) | |
1hr - 2hr | 1,580 (31%) | |
2hr - 3hr | 525 (10%) | |
3hr - 4hr | 133 (2.6%) | |
> 4hr | 64 (1.2%) | |
Homework_Reading | 5,158 | |
≤ 0.5hr | 2,008 (39%) | |
0.5hr - 1hr | 1,777 (34%) | |
1hr - 2hr | 1,085 (21%) | |
2hr - 3hr | 219 (4.2%) | |
3hr - 4hr | 40 (0.8%) | |
> 4hr | 29 (0.6%) | |
Homework_Science | 5,158 | |
≤ 0.5hr | 1,150 (22%) | |
0.5hr - 1hr | 1,564 (30%) | |
1hr - 2hr | 1,684 (33%) | |
2hr - 3hr | 578 (11%) | |
3hr - 4hr | 128 (2.5%) | |
> 4hr | 54 (1.0%) | |
SchoolType | 5,158 | |
Private | 354 (6.9%) | |
Public | 4,804 (93%) | |
ParentsEducation | 5,158 | |
Pre-Primary | 7 (0.1%) | |
Primary | 49 (0.9%) | |
Secondary | 637 (12%) | |
Post-Secondary | 1,559 (30%) | |
Tertiary | 2,906 (56%) | |
Immigration | 5,158 | |
Native | 3,742 (73%) | |
2nd Generation | 573 (11%) | |
3rd Generation | 843 (16%) | |
HomeLanguage | 5,158 | |
English | 3,229 (63%) | |
Others | 1,929 (37%) | |
Sibling | 5,158 | |
0 | 643 (12%) | |
1 | 2,397 (46%) | |
2 | 1,287 (25%) | |
≥3 | 831 (16%) | |
Aircon | 5,158 | 4,543 (88%) |
Helper | 5,158 | 1,276 (25%) |
Vehicle | 5,158 | |
0 | 2,033 (39%) | |
1 | 2,605 (51%) | |
2 | 423 (8.2%) | |
≥3 | 97 (1.9%) | |
Books | 5,158 | |
0 | 160 (3.1%) | |
1 - 10 | 727 (14%) | |
11 - 25 | 909 (18%) | |
26 - 100 | 1,880 (36%) | |
101 - 200 | 826 (16%) | |
201 - 500 | 482 (9.3%) | |
>500 | 174 (3.4%) | |
Exercise | 5,158 | |
0 | 1,340 (26%) | |
1 | 466 (9.0%) | |
2 | 795 (15%) | |
3 | 663 (13%) | |
4 | 507 (9.8%) | |
5 | 414 (8.0%) | |
6 | 312 (6.0%) | |
7 | 96 (1.9%) | |
8 | 150 (2.9%) | |
9 | 43 (0.8%) | |
10 | 372 (7.2%) | |
OwnRoom | 5,158 | 3,214 (62%) |
FamilyCommitment | 5,158 | |
0 | 1,917 (37%) | |
1 | 371 (7.2%) | |
2 | 529 (10%) | |
3 | 400 (7.8%) | |
4 | 319 (6.2%) | |
5 | 584 (11%) | |
6 | 198 (3.8%) | |
7 | 104 (2.0%) | |
8 | 140 (2.7%) | |
9 | 57 (1.1%) | |
10 | 539 (10%) | |
Preference_Math | 5,158 | |
Strongly Disagree | 561 (11%) | |
Disagree | 1,186 (23%) | |
Agree | 2,010 (39%) | |
Strongly Agree | 1,401 (27%) | |
Preference_Reading | 5,158 | |
Strongly Disagree | 530 (10%) | |
Disagree | 1,902 (37%) | |
Agree | 2,121 (41%) | |
Strongly Agree | 605 (12%) | |
Preference_Science | 5,158 | |
Strongly Disagree | 384 (7.4%) | |
Disagree | 1,184 (23%) | |
Agree | 2,339 (45%) | |
Strongly Agree | 1,251 (24%) | |
¹ n (%)
4.5 Data Health
get_dupes()
of the janitor package is used to hunt for duplicate records. The results show that there are no duplicated rows.
get_dupes(stu_SG_rcd)
[1] SchoolID Loneliness ClassroomSafety TeacherSupport
[5] Gender Homework_Math Homework_Reading Homework_Science
[9] SchoolType ParentsEducation Immigration HomeLanguage
[13] Sibling Aircon Helper Vehicle
[17] Books Exercise OwnRoom FamilyCommitment
[21] Preference_Math Preference_Reading Preference_Science Math
[25] Reading Science dupe_count
<0 rows> (or 0-length row.names)
5 Our Final Dataset
The cleaned and recoded dataset is saved in both CSV and RDS formats for subsequent analyses:

write_csv(stu_SG_rcd, "data/stu_SG_rcd.csv")
write_rds(stu_SG_rcd, "data/stu_SG_rcd.rds")
A final check with colSums() and is.na() confirms that no missing values remain:

colSums(is.na(stu_SG_rcd))
SchoolID Loneliness ClassroomSafety TeacherSupport
0 0 0 0
Gender Homework_Math Homework_Reading Homework_Science
0 0 0 0
SchoolType ParentsEducation Immigration HomeLanguage
0 0 0 0
Sibling Aircon Helper Vehicle
0 0 0 0
Books Exercise OwnRoom FamilyCommitment
0 0 0 0
Preference_Math Preference_Reading Preference_Science Math
0 0 0 0
Reading Science
0 0