Phase 1: Data Wrangling

1 PISA Data

The 2022 PISA Data is available on the OECD website.

2 Getting Started

Library | Description
tidyverse, janitor | For data preparation, wrangling, and exploration.
haven | To enable R to read and write various data formats such as SAS and SPSS.
knitr, DT, kableExtra | For dynamic report generation.
labelled | For reading and manipulating variable labels.
gtsummary | For summary and analytical tables.

The following code chunk uses p_load() from the pacman package to check whether the listed packages are installed. Any that are missing are installed first, and then all of them are loaded into the R session.

pacman::p_load(tidyverse, haven, knitr, DT,
               labelled, janitor, gtsummary)
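
For readers who do not have pacman, the check-then-load behaviour described above can be approximated in base R. This is a minimal sketch, not pacman's actual implementation; the helper name missing_pkgs() is hypothetical, and the install and load steps are left commented out so that nothing is installed by accident.

```r
# Packages used in this phase (same list as the p_load() call above)
pkgs <- c("tidyverse", "haven", "knitr", "DT",
          "labelled", "janitor", "gtsummary")

# Hypothetical helper: returns the packages that are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(pkgs)

# Uncomment to install whatever is missing, then attach everything:
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(pkgs, library, character.only = TRUE))
```

Keeping the check separate from the install step makes it easy to inspect to_install before anything is downloaded.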

3 Reading Data into R

For PISA 2022, the data are available as SAS data sets (.sas7bdat), with one file per respondent type covering all participating countries.

The code chunk below imports the 2022 Student Questionnaire dataset, downloaded from the OECD's PISA Database, using read_sas() from the haven package.

stu <- read_sas("data/cy08msp_stu_qqq.sas7bdat")

The dataset is imported as a tibble containing 613,744 observations (rows) across 1,279 variables (columns). Each observation corresponds to one student who participated in the 2022 PISA student survey, and the variables record information from students on various aspects of their home, family, and school background.

CNT refers to the country of response, so we can use it to keep only the Singapore responses (where CNT == "SGP") for our analysis. filter() of the dplyr package performs this extraction by participating country.

stu_SG <- stu %>%
  filter(CNT == "SGP") 

The resulting data contains 6,606 rows/observations across 1,279 columns/variables.

The .rds file format is usually smaller than its SAS counterpart and therefore takes up less storage space. An .rds file also preserves data types and classes such as factors and dates, eliminating the need to redefine data types after loading the file. For fast and space-efficient storage, data can be exported to RDS with write_rds() and re-imported into R with read_rds().

write_rds(stu_SG, "data/stu_SG.rds")
stu_SG <- read_rds("data/stu_SG.rds")
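
To illustrate the point about preserved classes, the sketch below round-trips a toy data frame through RDS and CSV. It uses base R's saveRDS() and readRDS() (which readr's write_rds() and read_rds() wrap) so that it runs without extra packages. The toy data and temporary file paths are made up for illustration, and the CSV behaviour shown assumes R 4.0 or later, where read.csv() does not convert strings to factors by default.

```r
# Toy data with a factor column that has a custom level order
toy <- data.frame(
  id = 1:3,
  level = factor(c("Low", "High", "Medium"),
                 levels = c("Low", "Medium", "High"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

levels(from_rds$level)  # factor class and level order survive the round trip
class(from_csv$level)   # the CSV copy comes back as plain character
```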

4 Data Wrangling

The chart below gives an overview of the categories of factors the team wants to examine for their impact on student scores.

flowchart TD

    A[2022 PISA Survey Student Questionnaire]-.-> A11[Gender]
    A -.-> A12[Socio-economic]
    A -.-> A13[Wellbeing]
    A -.-> A14[Attitude]   
    A -.-> A15[Environment] 
    A -.-> A16[Schools] 

4.1 Filtering for required dataset

After going through the Codebook and Technical Report, the team narrowed down the survey questions most likely to yield insightful results. The names of the corresponding columns are stored in a vector named colname. To subset the raw dataset to these columns, we use the select() function of the dplyr package together with all_of(), which selects every variable listed in the colname vector.

colname <- c("CNTSCHID", "ST034Q06TA", "ST265Q03JA", "ST270Q03JA", "ST004D01T", "ST296Q01JA", "ST296Q02JA", "ST296Q03JA", "STRATUM", "HISCED", "IMMIG", "ST022Q01TA", "ST230Q01JA", "ST250D06JA", "ST250D07JA", "ST251Q01JA", "ST255Q01JA", "EXERPRAC", "ST250Q01JA", "WORKHOME", "ST268Q01JA", "ST268Q02JA", "ST268Q03JA")

The code chunk below does the following:

  • select() function to retain the following columns:

    • Variables identified in colname and

    • Columns that start with “PV” and contain either “MATH”, “SCIE”, or “READ”, to extract the plausible values of scores for the subjects of Mathematics, Science, and Reading. This is performed using a combination of starts_with() and contains().

      • starts_with(): Matches the beginning of the column name with “PV”, and

      • contains(): Searches for columns containing three alternative subjects to be matched.

  • mutate() to create 3 new variables to store the mean plausible values for each subject for each row using rowMeans() and across().

stu_SG_filtered <- 
  stu_SG %>% 

  # Retains desired variables
  select(all_of(colname), starts_with("PV") & contains(c("MATH", "READ", "SCIE"))) %>% 

  # Calculates the mean of plausible values for each subject per student
  mutate(Math = rowMeans(across(starts_with("PV") & contains("MATH")), na.rm = TRUE),
         Reading = rowMeans(across(starts_with("PV") & contains("READ")), na.rm = TRUE),
         Science = rowMeans(across(starts_with("PV") & contains("SCIE")), na.rm = TRUE),
         ) %>% 
  
  # Drops Plausible Values columns
  select(-starts_with("PV"))
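
The same select-then-average pattern can be tried on a small made-up tibble. This is only a sketch: the column values and the extra PVOTHER column are invented to show that columns starting with “PV” but not matching a subject are dropped.

```r
library(dplyr)

# Toy data: two plausible values per subject plus one unrelated "PV" column
toy <- tibble(
  ID      = c("s1", "s2"),
  PV1MATH = c(500, 600),
  PV2MATH = c(520, 580),
  PV1READ = c(480, 610),
  PV2READ = c(500, 590),
  PVOTHER = c(1, 2)
)

toy_means <- toy %>%
  # Keep ID and only the PV columns for the listed subjects (PVOTHER is dropped)
  select(ID, starts_with("PV") & contains(c("MATH", "READ"))) %>%
  # Row-wise mean of the plausible values for each subject
  mutate(Math    = rowMeans(across(starts_with("PV") & contains("MATH")), na.rm = TRUE),
         Reading = rowMeans(across(starts_with("PV") & contains("READ")), na.rm = TRUE)) %>%
  # Drop the individual plausible values
  select(-starts_with("PV"))

toy_means
```

For the first toy student, Math is the mean of 500 and 520, and Reading is the mean of 480 and 500.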

stu_SG_filtered contains 6,606 observations across 26 variables. No rows are dropped at this stage, because select() and mutate() only change the columns.

stu_SG_filtered %>%  
  generate_dictionary() %>% 
  kable() %>% 
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"),
                            fixed_thead = TRUE)
pos | variable | label | col_type | missing | levels | value_labels
1 | CNTSCHID | Intl. School ID | dbl | 0 | NULL | NULL
2 | ST034Q06TA | Agree/disagree: I feel lonely at school. | dbl | 1147 | NULL | NULL
3 | ST265Q03JA | Agree/disagree: I feel safe in my classrooms at school. | dbl | 44 | NULL | NULL
4 | ST270Q03JA | How often: The teacher helps students with their learning. | dbl | 68 | NULL | NULL
5 | ST004D01T | Student (Standardized) Gender | dbl | 0 | NULL | NULL
6 | ST296Q01JA | How much time spent on homework in: Mathematics homework | dbl | 70 | NULL | NULL
7 | ST296Q02JA | How much time spent on homework in: [Test language] homework | dbl | 77 | NULL | NULL
8 | ST296Q03JA | How much time spent on homework in: [Science] homework | dbl | 87 | NULL | NULL
9 | STRATUM | Stratum ID 5-character (cnt + original stratum ID) | chr | 0 | NULL | NULL
10 | HISCED | Highest level of education of parents (ISCED) | dbl | 57 | NULL | NULL
11 | IMMIG | Index on immigrant background (OECD definition) | dbl | 236 | NULL | NULL
12 | ST022Q01TA | What language do you speak at home most of the time? | dbl | 42 | NULL | NULL
13 | ST230Q01JA | How many siblings (including brothers, sisters, step-brothers, and step-sisters) do you have? | dbl | 43 | NULL | NULL
14 | ST250D06JA | Which of the following are in your home? <Country-specific item 1> | chr | 0 | NULL | NULL
15 | ST250D07JA | Which of the following are in your home? <Country-specific item 2> | chr | 0 | NULL | NULL
16 | ST251Q01JA | How many of these items are there at your [home]: Cars, vans, or trucks | dbl | 47 | NULL | NULL
17 | ST255Q01JA | How many books are there in your [home]? | dbl | 44 | NULL | NULL
18 | EXERPRAC | Exercise or practice a sport before or after school | dbl | 47 | NULL | NULL
19 | ST250Q01JA | Which of the following are in your [home]: A room of your own | dbl | 66 | NULL | NULL
20 | WORKHOME | Working in household/take care of family members before or after school | dbl | 51 | NULL | NULL
21 | ST268Q01JA | Agree/disagree: Mathematics is one of my favourite subjects. | dbl | 69 | NULL | NULL
22 | ST268Q02JA | Agree/disagree: [Test language] is one of my favourite subjects. | dbl | 74 | NULL | NULL
23 | ST268Q03JA | Agree/disagree: [Science] is one of my favourite subjects. | dbl | 79 | NULL | NULL
24 | Math | NA | dbl | 0 | NULL | NULL
25 | Reading | NA | dbl | 0 | NULL | NULL
26 | Science | NA | dbl | 0 | NULL | NULL
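
The label column in the dictionary comes from the variable labels attached to each column, which is why the three derived score columns show NA. A small sketch with made-up data, assuming the labelled package is installed, shows the same behaviour:

```r
library(labelled)

# Toy data: one column that will carry a variable label, one derived column without
toy <- data.frame(ST004D01T = c(1, 2, 2),
                  Math      = c(510.2, 498.7, 530.1))
var_label(toy$ST004D01T) <- "Student (Standardized) Gender"

# One row per variable; unlabelled columns get a missing label
toy_dict <- generate_dictionary(toy)
toy_dict$label
```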

4.2 Renaming Columns

stu_SG_filtered <-
  stu_SG_filtered %>% 
  dplyr::rename(
    "SchoolID" = "CNTSCHID",    
    "Loneliness" = "ST034Q06TA",
    "ClassroomSafety" = "ST265Q03JA",
    "TeacherSupport" = "ST270Q03JA",
    "Gender" = "ST004D01T",
    "Homework_Math" = "ST296Q01JA",
    "Homework_Reading" = "ST296Q02JA",
    "Homework_Science" = "ST296Q03JA",
    "SchoolType" = "STRATUM",
    "ParentsEducation" = "HISCED",
    "Immigration" = "IMMIG",
    "HomeLanguage" = "ST022Q01TA",
    "Sibling" = "ST230Q01JA",
    "Aircon" = "ST250D06JA",
    "Helper" = "ST250D07JA",
    "Vehicle" = "ST251Q01JA",
    "Books" = "ST255Q01JA",
    "Exercise" = "EXERPRAC",
    "OwnRoom" = "ST250Q01JA",
    "FamilyCommitment" = "WORKHOME",
    "Preference_Math" = "ST268Q01JA",
    "Preference_Reading" = "ST268Q02JA",
    "Preference_Science" = "ST268Q03JA"
  )

Output of glimpse(stu_SG_filtered):
Rows: 6,606
Columns: 26
$ SchoolID           <dbl> 70200052, 70200134, 70200112, 70200004, 70200152, 7…
$ Loneliness         <dbl> 3, 3, 3, NA, 4, 4, 3, 4, 3, 3, 3, 4, NA, NA, 3, 4, …
$ ClassroomSafety    <dbl> 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, …
$ TeacherSupport     <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 3, 2, 1, 1, 2, …
$ Gender             <dbl> 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, …
$ Homework_Math      <dbl> 1, 3, 2, 3, 4, 1, 1, 2, 1, 3, 3, 4, 3, 1, 2, 1, 3, …
$ Homework_Reading   <dbl> 1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 3, 2, 1, 4, 1, 3, …
$ Homework_Science   <dbl> 2, 3, 3, 2, 4, 1, 1, 2, 1, 2, 3, 4, 3, 1, 3, 1, 3, …
$ SchoolType         <chr> "SGP01", "SGP01", "SGP01", "SGP01", "SGP01", "SGP01…
$ ParentsEducation   <dbl> 8, 7, 4, 6, 7, 9, 6, 9, 8, 8, 4, 9, 10, 9, 6, 9, 9,…
$ Immigration        <dbl> 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1, 1, 3, 1, 2, 3, …
$ HomeLanguage       <dbl> 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, …
$ Sibling            <dbl> 4, 4, 2, 4, 4, 3, 2, 2, 3, 4, 1, 3, 4, 1, 4, 3, 2, …
$ Aircon             <chr> "7020002", "7020001", "7020001", "7020002", "702000…
$ Helper             <chr> "7020002", "7020001", "7020002", "7020002", "702000…
$ Vehicle            <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 3, 3, 1, 2, 2, 1, 2, 2, 1, …
$ Books              <dbl> 7, 4, 4, 3, 2, 2, 4, 5, 7, 4, 3, 7, 4, 4, 2, 4, 5, …
$ Exercise           <dbl> 1, 4, 2, 5, 9, 1, 2, 0, 3, 5, 1, 2, 5, 2, 4, 0, 2, …
$ OwnRoom            <dbl> 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, …
$ FamilyCommitment   <dbl> 10, 2, 0, 10, 5, 5, 7, 0, 0, 4, 2, 2, 10, 0, 10, 0,…
$ Preference_Math    <dbl> 2, 4, 3, 2, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 1, 3, 2, …
$ Preference_Reading <dbl> 3, 3, 2, 3, 4, 3, 3, 2, 2, 2, 2, 3, 2, 2, 4, 3, 2, …
$ Preference_Science <dbl> 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 4, 2, 4, 2, 2, 2, …
$ Math               <dbl> 605.2533, 689.9528, 676.7768, 401.0528, 436.1151, 5…
$ Reading            <dbl> 667.4296, 627.6078, 582.9252, 361.3969, 475.6763, 4…
$ Science            <dbl> 639.7873, 672.0703, 660.0384, 343.6425, 479.2390, 4…

4.3 Dropping Invalid Responses

Some responses are marked as invalid or missing in our data. In the Helper and Aircon variables, these are coded as “9999999”. In this step, we convert that code to NA with na_if() and then use na.omit() to drop every row that has a missing value in any of the 26 variables.

stu_SG_filtered <- stu_SG_filtered %>% 
  mutate(Aircon = na_if(Aircon, "9999999"),
         Helper = na_if(Helper, "9999999")) %>% 
  na.omit()
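
Because na.omit() removes a row if any column is missing, it drops more than just the rows with the “9999999” code. The toy example below (values invented for illustration) makes this visible:

```r
library(dplyr)

toy <- tibble(
  Aircon = c("7020001", "9999999", "7020002", "7020001"),
  Books  = c(3, 4, NA, 2)
)

cleaned <- toy %>%
  mutate(Aircon = na_if(Aircon, "9999999")) %>%  # sentinel code becomes NA
  na.omit()                                      # drops rows 2 and 3

nrow(cleaned)
```

Row 2 is removed because of the recoded sentinel, and row 3 is removed because Books was already missing.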

stu_SG_filtered now contains 5,158 observations across 26 variables.

4.4 Recoding and Ranking Questionnaire Responses

There are several types of responses in the Student Questionnaire. We store the response levels for each question in separate named vectors and then combine them into a global dictionary named dicts.

Books <- c('1' = "0",
               '2' = "1 - 10",
               '3' = "11 - 25",
               '4' = "26 - 100",
               '5' = "101 - 200",
               '6' = "201-500",
               '7' = ">500")

HomeLanguage <- c('1' = "English",
             '2' = "Others")

# Likert Scales: Strongly Disagree to Strongly Agree
Preference_Math <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

Preference_Reading <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

Preference_Science <- c('1' = "Strongly Disagree",
           '2' = "Disagree",
           '3' = "Agree",
           '4' = "Strongly Agree")

# Likert Scales: Strongly Agree to Strongly Disagree
Loneliness <- c('1' = "Strongly Agree",
           '2' = "Agree",
           '3' = "Disagree",
           '4' = "Strongly Disagree")

ClassroomSafety <- c('1' = "Strongly Agree",
           '2' = "Agree",
           '3' = "Disagree",
           '4' = "Strongly Disagree")

# Binary
SchoolType <- c('SGP01' = "Public",
           'SGP03' = "Private")

OwnRoom <- c('1' = "Yes", 
                '2' = "No")

Aircon <- c('7020001' = "Yes",
            '7020002' = "No")

Helper <- c('7020001' = "Yes",
            '7020002' = "No")

# Frequency responses
Exercise <- c('0' = "0",
          '1' = "1", 
          '2' = "2",
          '3' = "3",
          '4' = "4",
          '5' = "5",
          '6' = "6",
          '7' = "7",
          '8' = "8",
          '9' = "9",
          '10' = "10")

FamilyCommitment <- c('0' = "0",
          '1' = "1", 
          '2' = "2",
          '3' = "3",
          '4' = "4",
          '5' = "5",
          '6' = "6",
          '7' = "7",
          '8' = "8",
          '9' = "9",
          '10' = "10")

# Time Periods
Homework_Math <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

Homework_Reading <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

Homework_Science <- c('1' = "≤ 0.5hr",
                '2' = "0.5hr - 1hr",
                '3' = "1hr - 2hr",
                '4' = "2hr - 3hr",
                '5' = "3 - 4 hr",
                '6' = "> 4hr")

# Gender
Gender <- c('1' = "Female",
            '2' = "Male")


# Immigrant Background
Immigration <- c('1' = "Native",
           '2' = "2nd Generation",
           '3' = "3rd Generation")

# Education Level
ParentsEducation <- c('1'="Pre-Primary",   
         '2'="Primary", 
         '3'="Secondary",
         '4'='Secondary',
         '6'="Post-Secondary",
         '7'="Post-Secondary",
         '8'="Tertiary",
         '9'="Tertiary",
         '10'="Tertiary")

# Possessions
Vehicle <- c('1' = "0",
            '2' = "1",
            '3' = "2",
            '4' = "≥3")

Sibling <- c('1' = "0",
            '2' = "1",
            '3' = "2",
            '4' = "≥3")

# Support
TeacherSupport <- c('1' = "Every lesson",
            '2' = "Most lesson",
            '3' = "Some lessons",
            '4' = "Never or almost never")

# Global Dictionary
dicts <- list(
  "Loneliness" = Loneliness,
  "ClassroomSafety" = ClassroomSafety,
  "TeacherSupport" = TeacherSupport,
  "Gender" = Gender,
  "Homework_Math" = Homework_Math,
  "Homework_Reading" = Homework_Reading,
  "Homework_Science" = Homework_Science,
  "SchoolType" = SchoolType,
  "ParentsEducation" = ParentsEducation,
  "Immigration" = Immigration,
  "HomeLanguage" = HomeLanguage,
  "Sibling" = Sibling,
  "Aircon" = Aircon,
  "Helper" = Helper,
  "Vehicle" = Vehicle,
  "Books" = Books,
  "Exercise" = Exercise,
  "OwnRoom" = OwnRoom,
  "FamilyCommitment" = FamilyCommitment,
  "Preference_Math" = Preference_Math,
  "Preference_Reading" = Preference_Reading,
  "Preference_Science" = Preference_Science
)

The helper function below recodes a column based on the global recode dictionary, dicts, using functions from base R, dplyr, and rlang:

  • names(x) retrieves the column names of the input dataframe

  • recode() helps to recode values in the columns using dicts

  • !!sym(x_nm) unquotes and evaluates the column name that matches the names of the dictionaries, while !!!dicts[[x_nm]] unquotes and splices the global recoding dictionary corresponding to the column name.

rcd <- function(x) {
  x_nm <- names(x)
  mutate(x, !! x_nm := recode(!! sym(x_nm), !!! dicts[[x_nm]]))
}

lmap_at() of the purrr package applies the helper function to each column of the dataframe whose name matches a key in dicts, and leaves the other columns unchanged.

stu_SG_rcd <- lmap_at(stu_SG_filtered, 
        names(dicts),
        rcd)
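
The recoding pattern can be checked on a two-column toy tibble before running it on the full data. The names toy, toy_dicts, and rcd_toy are hypothetical stand-ins for stu_SG_filtered, dicts, and rcd; only the OwnRoom column has a dictionary, so Score should pass through untouched.

```r
library(dplyr)
library(purrr)
library(rlang)

toy <- tibble(OwnRoom = c(1, 2, 1),
              Score   = c(500, 520, 480))

# One dictionary per column to recode, keyed by column name
toy_dicts <- list(OwnRoom = c('1' = "Yes", '2' = "No"))

# Receives a one-column tibble and recodes it using the matching dictionary
rcd_toy <- function(x) {
  x_nm <- names(x)
  mutate(x, !!x_nm := recode(!!sym(x_nm), !!!toy_dicts[[x_nm]]))
}

# Apply the helper only to columns that have a dictionary
toy_rcd <- lmap_at(toy, names(toy_dicts), rcd_toy)
toy_rcd
```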

Character columns are first converted to factors. The mutate() function of the dplyr package and the fct_relevel() function of the forcats package are then used to set the level order of the ordinal variables.

stu_SG_rcd <- stu_SG_rcd %>%
  mutate_if(is.character, as.factor) %>% 
  mutate(SchoolID = factor(SchoolID)) %>% 
  mutate(Books = fct_relevel(Books, 
                             "0",
                             "1 - 10",
                             "11 - 25",
                             "26 - 100",
                             "101 - 200",
                             "201-500",
                             ">500"),
         Preference_Math = fct_relevel(Preference_Math,
                                       "Strongly Disagree",
                                       "Disagree",
                                       "Agree",
                                       "Strongly Agree"),
         Preference_Reading = fct_relevel(Preference_Reading,
                                          "Strongly Disagree",
                                          "Disagree",
                                          "Agree",
                                          "Strongly Agree"),
         Preference_Science = fct_relevel(Preference_Science,
                                          "Strongly Disagree",
                                          "Disagree",
                                          "Agree",
                                          "Strongly Agree"),
         Loneliness = fct_relevel(Loneliness,
                                  "Strongly Disagree",
                                  "Disagree",
                                  "Agree",
                                  "Strongly Agree"),
         ClassroomSafety = fct_relevel(ClassroomSafety,
                                       "Strongly Disagree",
                                       "Disagree",
                                       "Agree",
                                       "Strongly Agree"),
         Exercise = fct_relevel(Exercise,
                                "0",
                                "1", 
                                "2",
                                "3",
                                "4",
                                "5",
                                "6",
                                "7",
                                "8",
                                "9",
                                "10"),
         FamilyCommitment = fct_relevel(FamilyCommitment,
                                        "0",
                                        "1",
                                        "2",
                                        "3",
                                        "4",
                                        "5",
                                        "6",
                                        "7",
                                        "8",
                                        "9",
                                        "10"),
         Homework_Math = fct_relevel(Homework_Math,
                                     "≤ 0.5hr",
                                     "0.5hr - 1hr",
                                     "1hr - 2hr",
                                     "2hr - 3hr",
                                     "3 - 4 hr",
                                     "> 4hr"),
         Homework_Reading = fct_relevel(Homework_Reading,
                                        "≤ 0.5hr",
                                        "0.5hr - 1hr",
                                        "1hr - 2hr",
                                        "2hr - 3hr",
                                        "3 - 4 hr",
                                        "> 4hr"),
         Homework_Science = fct_relevel(Homework_Science,
                                        "≤ 0.5hr",
                                        "0.5hr - 1hr",
                                        "1hr - 2hr",
                                        "2hr - 3hr",
                                        "3 - 4 hr",
                                        "> 4hr"),
         Immigration = fct_relevel(Immigration,
                                   "Native",
                                   "2nd Generation",
                                   "3rd Generation"),
         ParentsEducation = fct_relevel(ParentsEducation,
                                        "Pre-Primary",
                                        "Primary", 
                                        "Secondary",
                                        "Post-Secondary",
                                        "Tertiary"),
         Vehicle = fct_relevel(Vehicle,
                               "0",
                               "1",
                               "2",
                               "≥3"),
         Sibling = fct_relevel(Sibling,
                               "0",
                               "1",
                               "2",
                               "≥3"),
         TeacherSupport = fct_relevel(TeacherSupport,
                                      "Never or almost never",
                                      "Some lessons",
                                      "Most lesson",
                                      "Every lesson"))
Table of Variable Summary
Characteristic | N | N = 5,158¹
SchoolID | 5,158 |
    70200001 | | 50 (1.0%)
    70200002 | | 29 (0.6%)
    70200003 | | 31 (0.6%)
    70200004 | | 41 (0.8%)
    70200005 | | 30 (0.6%)
    70200006 | | 25 (0.5%)
    70200007 | | 27 (0.5%)
    70200008 | | 30 (0.6%)
    70200009 | | 30 (0.6%)
    70200010 | | 30 (0.6%)
    70200011 | | 51 (1.0%)
    70200012 | | 41 (0.8%)
    70200013 | | 49 (0.9%)
    70200014 | | 30 (0.6%)
    70200015 | | 22 (0.4%)
    70200016 | | 30 (0.6%)
    70200017 | | 29 (0.6%)
    70200018 | | 41 (0.8%)
    70200019 | | 29 (0.6%)
    70200020 | | 48 (0.9%)
    70200021 | | 35 (0.7%)
    70200022 | | 26 (0.5%)
    70200023 | | 29 (0.6%)
    70200024 | | 27 (0.5%)
    70200025 | | 29 (0.6%)
    70200026 | | 45 (0.9%)
    70200027 | | 44 (0.9%)
    70200028 | | 20 (0.4%)
    70200029 | | 33 (0.6%)
    70200030 | | 26 (0.5%)
    70200031 | | 42 (0.8%)
    70200032 | | 31 (0.6%)
    70200033 | | 24 (0.5%)
    70200034 | | 26 (0.5%)
    70200035 | | 42 (0.8%)
    70200036 | | 26 (0.5%)
    70200037 | | 32 (0.6%)
    70200038 | | 32 (0.6%)
    70200039 | | 31 (0.6%)
    70200040 | | 44 (0.9%)
    70200041 | | 4 (<0.1%)
    70200042 | | 30 (0.6%)
    70200043 | | 40 (0.8%)
    70200044 | | 44 (0.9%)
    70200045 | | 47 (0.9%)
    70200046 | | 26 (0.5%)
    70200047 | | 32 (0.6%)
    70200048 | | 27 (0.5%)
    70200049 | | 43 (0.8%)
    70200050 | | 22 (0.4%)
    70200051 | | 29 (0.6%)
    70200052 | | 44 (0.9%)
    70200053 | | 35 (0.7%)
    70200054 | | 27 (0.5%)
    70200055 | | 28 (0.5%)
    70200056 | | 22 (0.4%)
    70200057 | | 20 (0.4%)
    70200058 | | 28 (0.5%)
    70200059 | | 33 (0.6%)
    70200060 | | 24 (0.5%)
    70200061 | | 30 (0.6%)
    70200062 | | 42 (0.8%)
    70200063 | | 32 (0.6%)
    70200064 | | 27 (0.5%)
    70200065 | | 29 (0.6%)
    70200066 | | 47 (0.9%)
    70200067 | | 41 (0.8%)
    70200068 | | 30 (0.6%)
    70200069 | | 28 (0.5%)
    70200070 | | 29 (0.6%)
    70200071 | | 48 (0.9%)
    70200072 | | 27 (0.5%)
    70200073 | | 30 (0.6%)
    70200074 | | 30 (0.6%)
    70200075 | | 49 (0.9%)
    70200076 | | 29 (0.6%)
    70200077 | | 30 (0.6%)
    70200078 | | 16 (0.3%)
    70200079 | | 20 (0.4%)
    70200080 | | 28 (0.5%)
    70200081 | | 29 (0.6%)
    70200082 | | 46 (0.9%)
    70200083 | | 28 (0.5%)
    70200084 | | 29 (0.6%)
    70200085 | | 28 (0.5%)
    70200086 | | 32 (0.6%)
    70200087 | | 28 (0.5%)
    70200088 | | 32 (0.6%)
    70200089 | | 29 (0.6%)
    70200090 | | 31 (0.6%)
    70200091 | | 26 (0.5%)
    70200092 | | 30 (0.6%)
    70200093 | | 33 (0.6%)
    70200094 | | 41 (0.8%)
    70200095 | | 21 (0.4%)
    70200096 | | 28 (0.5%)
    70200097 | | 28 (0.5%)
    70200098 | | 25 (0.5%)
    70200099 | | 23 (0.4%)
    70200100 | | 27 (0.5%)
    70200101 | | 31 (0.6%)
    70200102 | | 32 (0.6%)
    70200103 | | 24 (0.5%)
    70200104 | | 29 (0.6%)
    70200105 | | 45 (0.9%)
    70200106 | | 31 (0.6%)
    70200107 | | 28 (0.5%)
    70200108 | | 27 (0.5%)
    70200109 | | 27 (0.5%)
    70200110 | | 45 (0.9%)
    70200111 | | 47 (0.9%)
    70200112 | | 32 (0.6%)
    70200113 | | 28 (0.5%)
    70200114 | | 37 (0.7%)
    70200115 | | 14 (0.3%)
    70200116 | | 29 (0.6%)
    70200117 | | 29 (0.6%)
    70200118 | | 44 (0.9%)
    70200119 | | 40 (0.8%)
    70200120 | | 28 (0.5%)
    70200121 | | 23 (0.4%)
    70200122 | | 26 (0.5%)
    70200123 | | 28 (0.5%)
    70200124 | | 26 (0.5%)
    70200125 | | 30 (0.6%)
    70200126 | | 34 (0.7%)
    70200127 | | 29 (0.6%)
    70200128 | | 30 (0.6%)
    70200129 | | 28 (0.5%)
    70200130 | | 46 (0.9%)
    70200131 | | 26 (0.5%)
    70200132 | | 46 (0.9%)
    70200133 | | 28 (0.5%)
    70200134 | | 32 (0.6%)
    70200135 | | 31 (0.6%)
    70200136 | | 26 (0.5%)
    70200137 | | 31 (0.6%)
    70200138 | | 14 (0.3%)
    70200139 | | 46 (0.9%)
    70200140 | | 28 (0.5%)
    70200141 | | 35 (0.7%)
    70200142 | | 44 (0.9%)
    70200143 | | 32 (0.6%)
    70200144 | | 32 (0.6%)
    70200145 | | 43 (0.8%)
    70200146 | | 31 (0.6%)
    70200147 | | 20 (0.4%)
    70200148 | | 23 (0.4%)
    70200149 | | 18 (0.3%)
    70200151 | | 33 (0.6%)
    70200152 | | 25 (0.5%)
    70200153 | | 28 (0.5%)
    70200154 | | 30 (0.6%)
    70200155 | | 44 (0.9%)
    70200156 | | 28 (0.5%)
    70200157 | | 28 (0.5%)
    70200158 | | 24 (0.5%)
    70200159 | | 48 (0.9%)
    70200160 | | 26 (0.5%)
    70200161 | | 27 (0.5%)
    70200162 | | 34 (0.7%)
    70200163 | | 27 (0.5%)
    70200164 | | 16 (0.3%)
    70200165 | | 29 (0.6%)
Loneliness | 5,158 |
    Strongly Disagree | | 1,412 (27%)
    Disagree | | 2,747 (53%)
    Agree | | 783 (15%)
    Strongly Agree | | 216 (4.2%)
ClassroomSafety | 5,158 |
    Strongly Disagree | | 72 (1.4%)
    Disagree | | 149 (2.9%)
    Agree | | 2,261 (44%)
    Strongly Agree | | 2,676 (52%)
TeacherSupport | 5,158 |
    Never or almost never | | 80 (1.6%)
    Some lessons | | 568 (11%)
    Most lesson | | 1,799 (35%)
    Every lesson | | 2,711 (53%)
Gender | 5,158 |
    Female | | 2,542 (49%)
    Male | | 2,616 (51%)
Homework_Math | 5,158 |
    ≤ 0.5hr | | 1,185 (23%)
    0.5hr - 1hr | | 1,671 (32%)
    1hr - 2hr | | 1,580 (31%)
    2hr - 3hr | | 525 (10%)
    3 - 4 hr | | 133 (2.6%)
    > 4hr | | 64 (1.2%)
Homework_Reading | 5,158 |
    ≤ 0.5hr | | 2,008 (39%)
    0.5hr - 1hr | | 1,777 (34%)
    1hr - 2hr | | 1,085 (21%)
    2hr - 3hr | | 219 (4.2%)
    3 - 4 hr | | 40 (0.8%)
    > 4hr | | 29 (0.6%)
Homework_Science | 5,158 |
    ≤ 0.5hr | | 1,150 (22%)
    0.5hr - 1hr | | 1,564 (30%)
    1hr - 2hr | | 1,684 (33%)
    2hr - 3hr | | 578 (11%)
    3 - 4 hr | | 128 (2.5%)
    > 4hr | | 54 (1.0%)
SchoolType | 5,158 |
    Private | | 354 (6.9%)
    Public | | 4,804 (93%)
ParentsEducation | 5,158 |
    Pre-Primary | | 7 (0.1%)
    Primary | | 49 (0.9%)
    Secondary | | 637 (12%)
    Post-Secondary | | 1,559 (30%)
    Tertiary | | 2,906 (56%)
Immigration | 5,158 |
    Native | | 3,742 (73%)
    2nd Generation | | 573 (11%)
    3rd Generation | | 843 (16%)
HomeLanguage | 5,158 |
    English | | 3,229 (63%)
    Others | | 1,929 (37%)
Sibling | 5,158 |
    0 | | 643 (12%)
    1 | | 2,397 (46%)
    2 | | 1,287 (25%)
    ≥3 | | 831 (16%)
Aircon | 5,158 | 4,543 (88%)
Helper | 5,158 | 1,276 (25%)
Vehicle | 5,158 |
    0 | | 2,033 (39%)
    1 | | 2,605 (51%)
    2 | | 423 (8.2%)
    ≥3 | | 97 (1.9%)
Books | 5,158 |
    0 | | 160 (3.1%)
    1 - 10 | | 727 (14%)
    11 - 25 | | 909 (18%)
    26 - 100 | | 1,880 (36%)
    101 - 200 | | 826 (16%)
    201-500 | | 482 (9.3%)
    >500 | | 174 (3.4%)
Exercise | 5,158 |
    0 | | 1,340 (26%)
    1 | | 466 (9.0%)
    2 | | 795 (15%)
    3 | | 663 (13%)
    4 | | 507 (9.8%)
    5 | | 414 (8.0%)
    6 | | 312 (6.0%)
    7 | | 96 (1.9%)
    8 | | 150 (2.9%)
    9 | | 43 (0.8%)
    10 | | 372 (7.2%)
OwnRoom | 5,158 | 3,214 (62%)
FamilyCommitment | 5,158 |
    0 | | 1,917 (37%)
    1 | | 371 (7.2%)
    2 | | 529 (10%)
    3 | | 400 (7.8%)
    4 | | 319 (6.2%)
    5 | | 584 (11%)
    6 | | 198 (3.8%)
    7 | | 104 (2.0%)
    8 | | 140 (2.7%)
    9 | | 57 (1.1%)
    10 | | 539 (10%)
Preference_Math | 5,158 |
    Strongly Disagree | | 561 (11%)
    Disagree | | 1,186 (23%)
    Agree | | 2,010 (39%)
    Strongly Agree | | 1,401 (27%)
Preference_Reading | 5,158 |
    Strongly Disagree | | 530 (10%)
    Disagree | | 1,902 (37%)
    Agree | | 2,121 (41%)
    Strongly Agree | | 605 (12%)
Preference_Science | 5,158 |
    Strongly Disagree | | 384 (7.4%)
    Disagree | | 1,184 (23%)
    Agree | | 2,339 (45%)
    Strongly Agree | | 1,251 (24%)
¹ n (%)
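
The releveled variables in the summary table follow their natural order, while SchoolType, which was not releveled, stays alphabetical (Private before Public). A small sketch with a made-up Likert vector shows what fct_relevel() changes; it assumes the forcats package is installed.

```r
library(forcats)

# factor() sorts levels alphabetically by default
x <- factor(c("Agree", "Strongly Disagree", "Disagree", "Strongly Agree"))
levels(x)

# fct_relevel() moves the named levels to the front, in the order given
x_ord <- fct_relevel(x,
                     "Strongly Disagree",
                     "Disagree",
                     "Agree",
                     "Strongly Agree")
levels(x_ord)
```

Only the level order changes; the underlying values of the factor stay the same.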

4.5 Data Health

get_dupes() of the janitor package is used to hunt for duplicate records. The results show that there are no duplicated rows.

get_dupes(stu_SG_rcd)
 [1] SchoolID           Loneliness         ClassroomSafety    TeacherSupport    
 [5] Gender             Homework_Math      Homework_Reading   Homework_Science  
 [9] SchoolType         ParentsEducation   Immigration        HomeLanguage      
[13] Sibling            Aircon             Helper             Vehicle           
[17] Books              Exercise           OwnRoom            FamilyCommitment  
[21] Preference_Math    Preference_Reading Preference_Science Math              
[25] Reading            Science            dupe_count        
<0 rows> (or 0-length row.names)

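To see what get_dupes() returns when duplicates do exist, the sketch below runs it on a made-up data frame with one repeated row. When no columns are named, janitor checks all columns, as in the call above.

```r
library(janitor)

# Toy data: rows 1 and 2 are identical
toy <- data.frame(
  SchoolID = c("A", "A", "B"),
  Gender   = c("Female", "Female", "Male"),
  Math     = c(500, 500, 480)
)

# Returns every row that belongs to a duplicated group, plus a dupe_count column
dupes <- get_dupes(toy)
dupes

# After removing the repeat, nothing is returned
no_dupes <- get_dupes(unique(toy))
nrow(no_dupes)
```
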
5 Our Final Dataset

write_csv(stu_SG_rcd, "data/stu_SG_rcd.csv")
write_rds(stu_SG_rcd, "data/stu_SG_rcd.rds")
colSums(is.na(stu_SG_rcd))
          SchoolID         Loneliness    ClassroomSafety     TeacherSupport 
                 0                  0                  0                  0 
            Gender      Homework_Math   Homework_Reading   Homework_Science 
                 0                  0                  0                  0 
        SchoolType   ParentsEducation        Immigration       HomeLanguage 
                 0                  0                  0                  0 
           Sibling             Aircon             Helper            Vehicle 
                 0                  0                  0                  0 
             Books           Exercise            OwnRoom   FamilyCommitment 
                 0                  0                  0                  0 
   Preference_Math Preference_Reading Preference_Science               Math 
                 0                  0                  0                  0 
           Reading            Science 
                 0                  0