Intro to Conjoint Experiments | Lab - 1
1 Lab sessions: why and how?
- Apply theoretical knowledge
- Increase understanding by interacting with data
- Learn to use some packages in R
- How:
- Relatively unstructured
- Go at your own pace, try to do the exercises yourself (do yourself a favour and do not just copy paste and run the solutions)
- “There is never time to do it right, but there is always time to do it over”
1.1 Software used: R
- This is not an R course!
- We will learn some R as we go along
- I will use RStudio
- Many packages or libraries exist to do specific analyses
2 Setting up R
2.1 Why R
- Free and open source (think of science in developing countries)
- Good online documentation (much better than some commercial software such as Mplus)
- Lively community of users (forums etc.)
- Visualization capabilities (ggplot …)
- Cooperates with other programs and programming languages (e.g. Python)
- Popularity (See popularity statistics on books, blogs, forums)
- RStudio as powerful integrated development environment (IDE) for R
- Evolves into a scientific work suite optimizing workflow (replication, reproducibility etc.)
- Institutions/people (Gary King, Andrew Gelman etc.)
2.2 Where to learn R
- If you have not used R before, you will need to learn some basics first.
- Data Camp (free trial but it is commercial)
- Try R: A short interactive intro to the language can be found here: http://tryr.codeschool.com/
- Swirl: Learn R interactively within R itself: http://swirlstats.com/
2.3 Install R on your machine
Below are some notes on the installation and setup of R and relevant packages on your own computer:
- Install Rtools for Windows machines from CRAN (https://cran.r-project.org/bin/windows/Rtools/).
- If you are using OS X, you will need to install XCode, available for free from the App Store. This installs a compiler (if you don’t already have one), which is needed when installing packages from GitHub that require compilation from C++ source code.
- If you are using OS X, you can also use Homebrew to install R (https://mac.install.guide/homebrew/index.html).
- Install the latest version of R from CRAN (https://cran.r-project.org/).
- Install the latest version of RStudio (https://www.rstudio.com/products/RStudio/). RStudio is the editor, i.e. you’ll write code in RStudio which is subsequently sent to and run within R.
- Install the latest versions of the packages that we need. You can also update your installed packages by running “update.packages(ask = FALSE)” in the R console.
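To see which of the packages used in this lab still need to be installed, something along these lines may help. It is only a sketch: the package list is copied from the environment preparation section, the variable names are made up for illustration, and the install call is left commented out so that nothing is installed by accident.

```r
# Packages used in this lab (copied from the environment preparation section)
required_pkgs <- c("readr", "qualtRics", "here", "dplyr",
                   "ggplot2", "cjoint", "cregg", "factorEx")

# Names of all packages currently installed in your R library
installed_pkgs <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_pkgs <- setdiff(required_pkgs, installed_pkgs)
print(missing_pkgs)

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

If `missing_pkgs` prints `character(0)`, everything on the list is already available.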
3 What we are going to cover
- Load data
- From Qualtrics
- Data exploration
- Attributes
- Levels
- Checks and data manipulation
- Validity checks
- Removing NAs
4 Dataset used
The data set used throughout is the CEU Experimental Political Science dataset from January 2019. We will restrict the analysis to a few specific variables. Each row in the data set represents a respondent recruited through Qualtrics from a representative panel of the US population.
Codebook:
- Q578 Q579 Q580: Choice CJ Task
- F-*-*: Conjoint features
- Q78: Employment status
- Q77: Race
- Q76: Education
- Q75: Gender
- Q74: Age
- Q581: Religiosity
5 Environment preparation
# ### Data import ###
# install.packages("readr") # read datasets
# install.packages("qualtRics") # read qualtrics datasets
# install.packages("here") # absolute path management
# ### Data manipulation ###
# install.packages("dplyr") # pipes and data manipulation
# ### Visualization ###
# install.packages("ggplot2") # graphing capabilities
# ### Estimation ###
# install.packages("cjoint") # base amce package
# install.packages("cregg") # amce and mm
# install.packages("factorEx") # amce with non-uniform distribution
## Custom build functions
# library(devtools)
# install_github("albertostefanelli/cjoint") # fixes some problem with cjoint
### Data import ###
library("readr")
library("qualtRics")
library("here")
### Data manipulation ###
library("dplyr")
### Visualization ###
library("ggplot2")
### Estimation ###
library("cjoint")
library("cregg")
library("factorEx")
6 Load the data: Qualtrics
- Let’s take a look at how to deal with the Qualtrics format
- Qualtrics includes HTML tags that are used for diagnostics
- We need to get rid of them
- We are going to use the qualtRics package for removing the tags
df_base <- qualtRics::read_survey(here("data","experimental_political_science_2019.csv"), # name of our data file
                                  legacy=FALSE, # new or old version of Qualtrics
                                  strip_html=TRUE # remove the html tag
)
head(df_base)
# A tibble: 6 × 45
`Duration (in seconds)` Finished ResponseId Q578 Q579 Q580 Q271_1 Q271_2 Q271_3 Q271_4 Q271_5 Q271_6 Q78 Q77 Q76 Q75 Q74 Q581 `F-1-1` `F-1-1-1` `F-1-2` `F-1-1-2` `F-1-3` `F-1-1-3` `F-1-2-1` `F-1-2-2` `F-1-2-3` `F-2-1` `F-2-1-1` `F-2-2` `F-2-1-2` `F-2-3` `F-2-1-3` `F-2-2-1` `F-2-2-2`
<dbl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 227 TRUE R_2rzostKY… Cand… Cand… Cand… 4 - N… 6 3 5 3 5 Empl… Hisp… Mast… Fema… 25 Regu… Penal … The cand… Past P… Approxim… Policy… Acceptan… The cand… Approxim… Collecti… Penal … No proce… Past P… None Policy… Collecti… No proce… Approxim…
2 195 TRUE R_5dluK6Rw… Cand… Cand… Cand… 5 3 2 4 - N… 2 5 Unem… Cauc… Some… Male 28 Never Policy… Collecti… Penal … The cand… Past P… Approxim… Welcome … The cand… None Policy… Acceptan… Penal … The cand… Past P… Approxim… Welcome … The cand…
3 295 TRUE R_3QMq6Ezo… Cand… Cand… Cand… 6 6 6 6 3 5 Stud… Cauc… Some… Male 22 Never Past P… None Policy… Welcome … Penal … The cand… Approxim… Acceptan… No proce… Past P… Approxim… Policy… Acceptan… Penal … The cand… Approxim… Welcome …
4 358 TRUE R_2ZWYdunD… Cand… Cand… Cand… 6 3 3 6 1 - S… 7 - S… Unem… Cauc… Some… Male 32 Never Past P… Approxim… Penal … No proce… Policy… Welcome … Approxim… The cand… Acceptan… Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… The cand…
5 310 TRUE R_3GEedrAF… Cand… Cand… Cand… 6 3 6 5 3 6 Stud… Asian Some… Male 21 Never Penal … The cand… Past P… None Policy… Acceptan… No proce… None Acceptan… Penal … The cand… Past P… Approxim… Policy… Collecti… The cand… Approxim…
6 365 TRUE R_emv0H4Gn… Cand… Cand… Cand… 6 5 5 4 - N… 5 5 Empl… Cauc… Some… Male 30 Regu… Past P… None Policy… Acceptan… Penal … The cand… None Collecti… No proce… Past P… Approxim… Policy… Acceptan… Penal … No proce… None Collecti…
# … with 10 more variables: `F-2-2-3` <chr>, `F-3-1` <chr>, `F-3-1-1` <chr>, `F-3-2` <chr>, `F-3-1-2` <chr>, `F-3-3` <chr>, `F-3-1-3` <chr>, `F-3-2-1` <chr>, `F-3-2-2` <chr>, `F-3-2-3` <chr>
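To get an intuition for what removing the tags means, here is a rough base R sketch. It is not necessarily how qualtRics implements `strip_html`; the example strings and the simple regular expression are assumptions made purely for illustration.

```r
# Toy strings imitating cells that still carry HTML markup (invented examples)
raw_cells <- c("<b>Penal proceedings</b>",
               "<span style=\"color:red\">No proceedings</span>",
               "None")

# Remove anything that looks like a tag: "<", one or more non-">" characters, ">"
clean_cells <- gsub("<[^>]+>", "", raw_cells)

print(clean_cells)
```

A pattern this simple would also delete legitimate text written between angle brackets, which is one reason to rely on the package option rather than on a hand-written regular expression.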
If you first need to download the data from the GitHub repository:
githubURL <- "https://github.com/albertostefanelli/conjoint_class/raw/master/data/experimental_political_science_2019.csv"
download.file(githubURL,
              destfile = here("data","experimental_political_science_2019.csv"))
7 Understanding the data
The F-*-* variables are our conjoint features, i.e. the experimental manipulations. We need to make sense of them and be extra careful to handle them correctly.
df_base |> dplyr::select(dplyr::starts_with("F-"))
# A tibble: 222 × 27
`F-1-1` `F-1-1-1` `F-1-2` `F-1-1-2` `F-1-3` `F-1-1-3` `F-1-2-1` `F-1-2-2` `F-1-2-3` `F-2-1` `F-2-1-1` `F-2-2` `F-2-1-2` `F-2-3` `F-2-1-3` `F-2-2-1` `F-2-2-2` `F-2-2-3` `F-3-1` `F-3-1-1` `F-3-2` `F-3-1-2` `F-3-3` `F-3-1-3` `F-3-2-1` `F-3-2-2` `F-3-2-3`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Penal proceedings The candidate has been convicted of … Past P… Approxim… Policy… Acceptan… The cand… Approxim… Collecti… Penal … No proce… Past P… None Policy… Collecti… No proce… Approxim… Welcome … Penal … The cand… Past P… Approxim… Policy… Acceptan… No proce… Approxim… Welcome …
2 Policy Proposal Collective expulsion of immigrants a… Penal … The cand… Past P… Approxim… Welcome … The cand… None Policy… Acceptan… Penal … The cand… Past P… Approxim… Welcome … The cand… Approxim… Policy… Collecti… Penal … No proce… Past P… Approxim… Acceptan… The cand… None
3 Past Political Experience None Policy… Welcome … Penal … The cand… Approxim… Acceptan… No proce… Past P… Approxim… Policy… Acceptan… Penal … The cand… Approxim… Welcome … No proce… Past P… Approxim… Policy… Collecti… Penal … No proce… Approxim… Welcome … The cand…
4 Past Political Experience Approximately 10 years Penal … No proce… Policy… Welcome … Approxim… The cand… Acceptan… Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… The cand… Welcome … Past P… Approxim… Penal … No proce… Policy… Acceptan… Approxim… No proce… Welcome …
5 Penal proceedings The candidate is under investigation… Past P… None Policy… Acceptan… No proce… None Acceptan… Penal … The cand… Past P… Approxim… Policy… Collecti… The cand… Approxim… Collecti… Penal … The cand… Past P… None Policy… Welcome … No proce… None Collecti…
6 Past Political Experience None Policy… Acceptan… Penal … The cand… None Collecti… No proce… Past P… Approxim… Policy… Acceptan… Penal … No proce… None Collecti… The cand… Past P… Approxim… Policy… Welcome … Penal … The cand… Approxim… Collecti… The cand…
7 Policy Proposal Welcome immigrants and organise huma… Past P… Approxim… Penal … The cand… Welcome … Approxim… The cand… Policy… Welcome … Past P… None Penal … The cand… Acceptan… Approxim… The cand… Policy… Acceptan… Past P… None Penal … The cand… Welcome … None The cand…
8 Policy Proposal Collective expulsion of immigrants a… Penal … The cand… Past P… Approxim… Welcome … The cand… None Policy… Acceptan… Penal … No proce… Past P… None Welcome … The cand… None Policy… Collecti… Penal … The cand… Past P… Approxim… Welcome … The cand… Approxim…
9 Penal proceedings No proceedings Past P… None Policy… Acceptan… The cand… None Acceptan… Penal … The cand… Past P… None Policy… Collecti… No proce… Approxim… Collecti… Penal … No proce… Past P… None Policy… Acceptan… The cand… Approxim… Welcome …
10 Policy Proposal Acceptance of immigrants conditional… Past P… Approxim… Penal … The cand… Collecti… None The cand… Policy… Collecti… Past P… Approxim… Penal … The cand… Welcome … Approxim… The cand… Policy… Acceptan… Past P… Approxim… Penal … No proce… Collecti… Approxim… The cand…
# … with 212 more rows
Let’s dive deeper and try to figure out which ones are the attributes (i.e. the experimental characteristics/manipulations) of the experiment
df_base |> dplyr::select("F-1-1") |> table()
Past Political Experience Penal proceedings Policy Proposal
75 69 78
df_base |> dplyr::select("F-1-1-1") |> table()
Acceptance of immigrants conditional on certain requisites Approximately 10 years Approximately 20 years Collective expulsion of immigrants and closure of the border
18 26 22 31
No proceedings None The candidate has been convicted of corruption The candidate is under investigation for corruption
26 27 25 18
Welcome immigrants and organise human corridor
29
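One way to check which levels belong to which attribute is to cross-tabulate an attribute column against its level column. The sketch below uses a small toy data frame rather than the real data: the attribute/level pairs are taken from the rows printed above, while the object names are made up for illustration.

```r
# Toy data: attribute names (F-1-1) and the matching levels (F-1-1-1)
toy <- data.frame(
  `F-1-1` = c("Penal proceedings", "Policy Proposal", "Past Political Experience",
              "Penal proceedings", "Past Political Experience", "Policy Proposal"),
  `F-1-1-1` = c("No proceedings",
                "Welcome immigrants and organise human corridor",
                "None",
                "The candidate is under investigation for corruption",
                "Approximately 10 years",
                "Acceptance of immigrants conditional on certain requisites"),
  check.names = FALSE,       # keep the hyphenated column names as they are
  stringsAsFactors = FALSE
)

# Rows are levels, columns are attributes
attr_by_level <- table(toy[["F-1-1-1"]], toy[["F-1-1"]])
print(attr_by_level)

# For every level, count under how many attributes it occurs
n_attr_per_level <- rowSums(attr_by_level > 0)
print(n_attr_per_level)
```

If the design was exported correctly, every level should occur under exactly one attribute. On the real data the same idea would read `table(df_base$"F-1-1-1", df_base$"F-1-1")`.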
8 Checks and data manipulation
8.1 Validity Checks
- We are going to try to detect abnormal patterns in the survey duration
- Exclude respondents who did not complete the survey or were too fast in completing it
- Exclude observations that took too long to complete the survey
# duration of the survey in seconds
df_base$"Duration (in seconds)" <- as.numeric(df_base$"Duration (in seconds)")
# let's transform it in minutes
df_base$duration_mins <- df_base$"Duration (in seconds)"/60
# plot the density
ggplot(df_base, aes(x=duration_mins)) +
  geom_density()
- The survey was tested and estimated to take between 8 and 15 minutes to complete
- Q: What can you see from the graph above?
- We are going to exclude outliers
- Less than 4 mins
- More than 3 standard deviations above the mean
# calculate the % of respondents that completed the survey in less than 4 minutes
(sum(df_base$duration_mins < 4)/nrow(df_base)*100)
[1] 11.71171
# flag the observations that completed the survey in more than 4 minutes
lower_bound <- df_base$duration_mins > 4
# subset the data, keeping only the respondents that completed the survey in more than 4 minutes
df_greater_base <- df_base[lower_bound,]
# to get the upper bound we use another approach
# convert the durations to z-scores
df_greater_base$scaled_duration_mins <- scale(df_greater_base$duration_mins)
# calculate the % of respondents that are 3 sd or more above the mean
(sum(df_greater_base$scaled_duration_mins > 3)/nrow(df_greater_base)*100)
[1] 1.530612
# upper bound
upper_bound <- df_greater_base$scaled_duration_mins < 3
# subset the data, keeping only the respondents that are less than 3 sd above the mean
df_greater_base <- df_greater_base[upper_bound,]
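The two exclusion rules can also be wrapped in a small helper, which makes it easy to try different cut-offs. This is a sketch on invented toy durations: the function name `filter_by_duration` and its arguments are hypothetical and not part of any package.

```r
# Toy durations in minutes (invented): two speeders, 24 regular respondents, one very slow case
toy <- data.frame(
  id = 1:27,
  duration_mins = c(2, 3, rep(c(8, 9, 10, 11, 12, 13, 14, 15), times = 3), 240)
)

# Hypothetical helper applying the two rules in the same order as above
filter_by_duration <- function(data, min_mins = 4, max_sd = 3) {
  # Rule 1: keep respondents who took more than min_mins minutes
  kept <- data[data$duration_mins > min_mins, ]
  # z-scores are computed on the remaining respondents only
  z <- as.numeric(scale(kept$duration_mins))
  # Rule 2: keep respondents less than max_sd standard deviations above the mean
  kept[z < max_sd, ]
}

toy_clean <- filter_by_duration(toy)
print(nrow(toy_clean))
print(range(toy_clean$duration_mins))
```

Note that the z-scores are computed after the speeders have been dropped, so the order of the two steps matters: very short durations would otherwise pull the mean down and change which cases end up beyond the upper bound.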
8.2 Removing NA on the DV
- Calculate how many respondents did not answer the CJ questions
(sum(is.na(df_greater_base$Q578))/nrow(df_greater_base)*100)
[1] 4.145078
(sum(is.na(df_greater_base$Q579))/nrow(df_greater_base)*100)
[1] 4.145078
(sum(is.na(df_greater_base$Q580))/nrow(df_greater_base)*100)
[1] 3.626943
- The similarity in the % suggests that a specific subset of the respondents did not complete the survey
- This is good news but let’s check if this is the case
- Let’s investigate which respondents did not answer the CJ questions
df_greater_base[is.na(df_greater_base$Q578),"ResponseId"]
# A tibble: 8 × 1
ResponseId
<chr>
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_2S3ngM3YxFabDkc
7 R_3nDDFSDtufMIwpL
8 R_3QPGHPBx2jnXRIQ
df_greater_base[is.na(df_greater_base$Q579),"ResponseId"]
# A tibble: 8 × 1
ResponseId
<chr>
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_2S3ngM3YxFabDkc
7 R_3nDDFSDtufMIwpL
8 R_3QPGHPBx2jnXRIQ
df_greater_base[is.na(df_greater_base$Q580),"ResponseId"]
# A tibble: 7 × 1
ResponseId
<chr>
1 R_3dEN4IvpNRlrYGZ
2 R_24GD2YvqjePlgR8
3 R_9QSDhesusl6rm8N
4 R_2QMWfW6np5x4tOU
5 R_1r8icC6TonV8OPo
6 R_3nDDFSDtufMIwpL
7 R_3QPGHPBx2jnXRIQ
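Instead of comparing the printed lists of IDs by eye, the overlap can be checked with set operations. The sketch below runs on a toy data frame with invented IDs and placeholder answers; only the variable names Q578, Q579 and Q580 come from the codebook.

```r
# Toy data: invented respondent IDs and placeholder answers
toy <- data.frame(
  ResponseId = c("R_a", "R_b", "R_c", "R_d", "R_e"),
  Q578 = c("Candidate 1", NA, NA, "Candidate 2", NA),
  Q579 = c("Candidate 2", NA, NA, "Candidate 1", NA),
  Q580 = c("Candidate 1", NA, "Candidate 2", "Candidate 2", NA),
  stringsAsFactors = FALSE
)
cj_vars <- c("Q578", "Q579", "Q580")

# IDs with a missing answer, separately for each task
missing_ids <- lapply(cj_vars, function(v) toy$ResponseId[is.na(toy[[v]])])
names(missing_ids) <- cj_vars

# Respondents who skipped every task, and respondents who skipped at least one
skipped_all <- Reduce(intersect, missing_ids)
skipped_any <- Reduce(union, missing_ids)

# Respondents who skipped some tasks but not all of them
partial <- setdiff(skipped_any, skipped_all)

print(skipped_all)
print(partial)
```

A non-empty `partial` vector signals that some respondents answered only part of the conjoint block, which is worth knowing before deciding how to drop cases.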
- The same eight respondents skipped the first two CJ tasks, and seven of them also skipped the third
- We can safely remove them from our sample
- Let’s now exclude the missing values
- NB: If more than 5% of the responses are missing, something may have gone wrong and further investigation is needed
df_greater_base_wo_NA <- df_greater_base[!is.na(df_greater_base$Q578),]
df_greater_base_wo_NA <- df_greater_base_wo_NA[!is.na(df_greater_base_wo_NA$Q579),]
df_greater_base_wo_NA <- df_greater_base_wo_NA[!is.na(df_greater_base_wo_NA$Q580),]
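The three filtering steps can be collapsed into a single one with `complete.cases()`, which flags rows that have no missing value in the selected columns. This is an equivalent alternative, sketched here on invented toy data rather than on the survey itself.

```r
# Toy data: invented respondent IDs and placeholder answers
toy <- data.frame(
  ResponseId = c("R_a", "R_b", "R_c", "R_d"),
  Q578 = c("x", NA, "y", "x"),
  Q579 = c("y", NA, "x", "x"),
  Q580 = c("x", NA, NA, "y"),
  stringsAsFactors = FALSE
)
cj_vars <- c("Q578", "Q579", "Q580")

# TRUE for respondents who answered all three CJ tasks
keep <- complete.cases(toy[, cj_vars])

# Keep only those respondents
toy_wo_NA <- toy[keep, ]
print(toy_wo_NA$ResponseId)
```

On the real data the same idea would read `df_greater_base[complete.cases(df_greater_base[, c("Q578", "Q579", "Q580")]), ]`.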