Talk to them on the phone. Well, not really, because correlation does not imply causation, but of the following activities with friends - going to the friend’s house, hanging out after school (not at the friend’s house), hanging out on the weekend, talking about a problem, or talking on the telephone - talking on the telephone was the best predictor of friendship reciprocity. If you talked on the phone, the likelihood that your friendship was reciprocated increased from 26% to 54%. Of course, this data is from 1994-1995. Maybe now it’s text messaging or Facebook chat that would predict reciprocity.
In a moment of panic, I realized that I wanted to use likelihood ratio tests in addition to comparing AIC to determine which model fits best, but that I had not stored the likelihood from each model. Thankfully, it isn’t hard to calculate given the number of parameters (k) and the AIC. Whew.
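Since AIC = 2k - 2 ln(L), the log-likelihood is ln(L) = k - AIC/2. Here is a minimal Python sketch of that recovery and of the likelihood ratio statistic built from it. The function names and the AIC values are made up for illustration, and the two models are assumed to be nested and fit to the same data.

```python
def loglik_from_aic(aic, k):
    """Recover the log-likelihood from AIC = 2k - 2*logL."""
    return k - aic / 2.0


def lr_statistic(aic_restricted, k_restricted, aic_full, k_full):
    """Return the likelihood ratio statistic 2*(logL_full - logL_restricted)
    and the difference in the number of parameters (its degrees of freedom)."""
    ll_restricted = loglik_from_aic(aic_restricted, k_restricted)
    ll_full = loglik_from_aic(aic_full, k_full)
    return 2.0 * (ll_full - ll_restricted), k_full - k_restricted


# Hypothetical AIC values, not taken from my models.
stat, df = lr_statistic(aic_restricted=1010.0, k_restricted=3,
                        aic_full=1000.0, k_full=5)
print(stat, df)
```

The statistic is then compared against a chi-square distribution with df degrees of freedom.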
In the Add Health survey, students were asked to list their top five female friends and then their top five male friends. As a result, students may have been more likely to list cross-gender friends than they would have if they hadn’t been prompted to list friends for each gender.
Of all of the friendship ties listed, 40% are between students of different genders. 95% of students who list at least one friend list at least one same-gender friend, while 73% list a cross-gender friend. The likelihood that a student lists a cross-gender friend increases as students get older. 65% of 7th graders list a cross-gender friend, while 76% of 12th graders list a cross-gender friend.
Here is the full breakdown of the percent of students who list certain numbers of same gender and cross gender friends:
Same-gender friends are slightly more likely to be similar in terms of parents’ education:
           |      same_gender
same_educ3 |         0          1 |     Total
-----------+----------------------+----------
         0 |    73,940    106,621 |   180,561
           |     55.97      54.57 |     55.14
-----------+----------------------+----------
         1 |    58,172     88,747 |   146,919
           |     44.03      45.43 |     44.86
-----------+----------------------+----------
     Total |   132,112    195,368 |   327,480
           |    100.00     100.00 |    100.00
This difference is small, but statistically significant (sample size is pretty big), even when controlling for the composition of the school.
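As a rough check on the "small but significant" point, here is a sketch of a plain Pearson chi-square test on the counts in the table. It is not the model I actually ran: it does not control for school composition, and it treats friendship ties as independent, which they are not. The function name is just for illustration.

```python
# Counts from the table: rows are same_educ3 (0, 1), columns are same_gender (0, 1).
observed = [[73940, 106621],
            [58172, 88747]]


def pearson_chi2(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


chi2 = pearson_chi2(observed)
# With 1 degree of freedom, the 5% critical value is about 3.84.
print(round(chi2, 1), chi2 > 3.84)
```

With over 300,000 ties, even a gap of about 1.4 percentage points clears the critical value easily.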
Two schools that were normal-sized included information on students’ residential locations, in the form of x and y coordinates. I wanted to quickly plot the segregation by parent education. This was how it went down.
First plot I made had some crazy outliers. Not even gonna show that one. Next one zoomed in a bit.
It was still hard to see what was going on, so I zoomed in even more and also excluded the middle group of students whose parents attended some college.
I didn’t see much here. So, I looked at the residential locations of students whose parents had an advanced degree vs those whose parents only attended high school or less.
Maybe a little more segregation?
The parents of these students were also interviewed and asked about their income. The median income is 45k in school 58, and 36k in school 77. Across the entire sample the median income is 38k, so these schools are right around the median. Only 34 (2%) students out of the 1635 students have parents with a family income over 100k. 121 (7%) have parents with a family income over 80k.
The parent income distributions of the two schools are the following:
Income is correlated with parent education. 65% of the high income parents had a bachelor’s degree, while 11% of the low income parents had a bachelor’s degree.
 educ3_imp |         0          1 |     Total
-----------+----------------------+----------
         1 |       167         16 |       183
           |     62.78      13.22 |     47.29
-----------+----------------------+----------
         2 |        69         27 |        96
           |     25.94      22.31 |     24.81
-----------+----------------------+----------
         3 |        30         78 |       108
           |     11.28      64.46 |     27.91
-----------+----------------------+----------
     Total |       266        121 |       387
           |    100.00     100.00 |    100.00
The neighborhood segregation is below:
Just read this op-ed in the NYT, another discussion of the idea that there is increasing segregation in our social worlds.
High school is one place where people don’t have quite as much say in who they are put into contact with (though it is their parents who decide where to live and therefore who their children interact with in school). How common is it for a student who has a parent who graduated from college to befriend a student whose parent has no more than a high school degree? I’ve definitely looked at this before but didn’t remember the exact numbers…
35% of high-SES students (those with a parent with a bachelor’s degree) have no friends whose parents have only a high school degree or less. But that means 65% of these students DO have a friend whose parents have a high school degree or less. About 70% of low-SES students have a friend with a parent with a bachelor’s degree or higher.
Of the students whose parents have an advanced degree beyond a college degree, 55% have a friend whose parents have no post-secondary education.
130 schools. How many are all low-SES? How many all high-SES?
I look at the distribution of students in schools by SES where SES is measured by a student’s parents’ level of education. First, I look at three categories of parent education: bachelor’s degree or more, some college, or a high school degree or less.
Most schools have a mix of students from different backgrounds. There are a few schools in which almost all of the students’ parents attended college, but most schools have between 20% and 60% students with parents who attended college.
I then break down parents’ education into five categories instead of three, and look at what share of each school is in each category.
In contrast, there are many more schools that are homogeneous by race.
Can also stack them. It is not the case that all of the high-SES schools are the all-white schools.
Parent education is more correlated with income than it is with race:
I realized that I hadn’t looked at some simple stats on inequality in my sample in a while, and I hadn’t written about those either.
So here goes!
The first measure I look at is the Add Health Picture Vocabulary Test (ah_pvt), standardized by student age. It’s a measure of vocabulary, which serves as an indicator of both IQ and academic achievement. The median score is 100. The bottom 25% have scores below 90 while the top 25% have scores above 111.
I take the maximum value of mother’s and father’s education level and then recode into three categories. 1 indicates students whose max parent education is high school. 2 is for students with a parent with some college education. 3 indicates the student had at least one parent with a college degree.
The scores increase for students with more highly-educated parents.
. table educ3, c(n ah_pvt mean ah_pvt median ah_pvt)
----------------------------------------------------
    educ3 |   N(ah_pvt)  mean(ah_pvt)   med(ah_pvt)
----------+-----------------------------------------
        1 |       7,209            94            95
        2 |       5,780           101           101
        3 |       6,484           105           107
----------------------------------------------------
A score of 95 indicates the 35th percentile score, while 106 is at the 65th percentile. So going from having no parents with a college degree to one parent with a college degree means moving from the 35th to the 65th percentile.
The disparities are extreme at the top of the achievement scale. Of the students at the 90th percentile of achievement and above, only 15% have parents who did not attend college. Of the students below the 10th percentile, 65% have parents who did not attend college, while 16% have parents who did attend college.
The story is similar for GPA. Median GPA is 2.75, the bottom 25% is below 2.25, the top 25% are above 3.33.
. table educ3, c(n gpa mean gpa median gpa)
-------------------------------------------
    educ3 |    N(gpa)  mean(gpa)   med(gpa)
----------+--------------------------------
        1 |     7,350   2.566429        2.5
        2 |     5,961   2.713485       2.75
        3 |     6,742   2.982362          3
-------------------------------------------
2.5 is at the 42nd percentile, while 3 is at the 66th percentile. Of the students below the 10th percentile, 50% have parents who did not attend college, while 16% have parents who attended college. Of the students at or above the 90th percentile, 22% have parents who did not attend college, while 51% have parents who attended college. There are more low-SES students with high GPAs than with high test scores, likely because of between-school segregation: low-SES students in majority low-SES schools are at the top end of their school’s achievement distribution and receive higher grades.
Programming is so empowering. That moment when you have to do something tedious and repetitive, and you realize that you have the power to get a computer to do it for you: priceless! Running ERG models across multiple schools is one of those tedious and repetitive tasks and I now have the power to automate it!
Unfortunately I do not know R nearly as well as I know Stata, and all network models are in R, but I am confident I can do it, thanks to Stanford CS106AB107.
The first thing I want to do is to check if there are enough students in a given category to estimate a homophily coefficient. It’s pretty likely that the coefficient for a covariate where there are fewer than five students in a school will be “Inf”. So I can check whether there are more than five students in each category and only include it if there are more than five. When specifying the model, this means that if there are three white students, coded as race = 1, fifty black students, coded as race = 2, and one hundred Hispanic students, coded as race = 3, then I should include keep = c(2,3) in the model.
I have a feeling there is an easier way to do this, but the first thing I tried is a bit brute force - first count the number of categories where there are more than five students (so I can figure out the last one and not add a comma), and then paste each into a string for the model specification.
Voila! It works!
library(ergm)

# count the students in each race category (codes 1, 2, 3, ...)
race_table <- tabulate(nodes$race)
min_in_category <- 5
n_race_categories <- 0
races_to_keep_string <- "c("

# figure out how many categories have enough students
for (i in seq_along(race_table)) {
  if (race_table[i] > min_in_category) {
    n_race_categories <- n_race_categories + 1
  }
}

# create string with categories to keep
counter <- 0
for (i in seq_along(race_table)) {
  if (race_table[i] > min_in_category) {
    counter <- counter + 1
    if (counter < n_race_categories) {
      # not the last category: follow it with a comma
      races_to_keep_string <- paste(races_to_keep_string, i, ",")
    } else {
      # last category: close the vector
      races_to_keep_string <- paste(races_to_keep_string, i, ")")
    }
  }
}

# build the model call as a string, then parse and evaluate it
erg_code <- paste("ergm(net ~ edges + nodematch('race', diff=TRUE, keep=", races_to_keep_string, "))")
m1 <- eval(parse(text = erg_code))
summary(m1)
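To make the filtering logic easier to see, here it is sketched in Python for the example from above: 3 students in category 1, 50 in category 2, and 100 in category 3. The function names are hypothetical, and this only builds the keep string; it does not run any model.

```python
def categories_to_keep(counts, min_in_category=5):
    """Return the 1-based category codes with more than min_in_category students."""
    return [code for code, n in enumerate(counts, start=1) if n > min_in_category]


def r_vector_literal(codes):
    """Format a list of codes as an R vector literal, e.g. c(2,3)."""
    return "c(" + ",".join(str(code) for code in codes) + ")"


# Counts for race = 1, 2, 3 in the example above.
race_counts = [3, 50, 100]
keep_string = r_vector_literal(categories_to_keep(race_counts))
print(keep_string)  # c(2,3)
```

Joining the codes with a separator avoids having to count the categories first just to know where the last comma goes.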
Is it the case that students are even more likely to be friends if they are both from the same socioeconomic background and the same race? According to literature on homophily, the more similar the better when it comes to making friends.
However, it looks like this isn’t the case in the schools in the Add Health sample. I include an interaction term between SES and race in models predicting friendship formation and do not find a clear indication that the association is positive. I obtain a coefficient on the interaction term and then look at the distribution of the term across schools as well as whether or not the term is statistically significant. Taking the average and weighting more precise estimates results in an average of -.003. Of the statistically significant coefficients, 8 are positive and 11 are negative.
The following plot shows each coefficient and the standard error bars sorted by size. I omit the top three coefficients and the bottom three coefficients as these are much larger and make it difficult to see the rest of the coefficients. Darker lines indicate estimates that are statistically significantly different from zero.
I noticed that often when there are very few students of a racial group, the coefficient on matching for that group is -Inf in an ERGM. At first, this seemed like something very bad. However, I wondered if it made a difference for the other coefficients in the model or whether it was the same as dropping a term in a regression. I’ve now run models where I omit the problematic small groups with -Inf coefficients, and the coefficients on the other terms in the model appear to be exactly the same in dyad-independent models. For example, in one school, when I run the model:
m1 = ergm(net ~ edges + nodematch("educ3") + nodematch("race", diff = TRUE) +
  nodematch("educ3_race") + nodematch("grd", diff = TRUE) + nodematch("sex") +
  nodeofactor("educ3", base = 1) + nodeofactor("race", base = 1) +
  nodeofactor("grd", base = 1) + nodeofactor("sex"))
summary(m1)
The coefficients on nodematch.race.3 and nodematch.race.4 are -Inf, with NA for the standard error and p-value. When I restrict the model to only consider race = 1:
m1 = ergm(net ~ edges + nodematch("educ3") + nodematch("race", diff = TRUE, keep = c(1)) +
  nodematch("educ3_race") + nodematch("grd", diff = TRUE) + nodematch("sex") +
  nodeofactor("educ3", base = 1) + nodeofactor("race", base = 1) +
  nodeofactor("grd", base = 1) + nodeofactor("sex"))
summary(m1)
I no longer have any -Inf coefficients, but the coefficients on the other covariates are exactly the same. It appears that -Inf indicates that the configuration does not exist in the network so it is like the term is dropped from the model.
However, I do see differences in the coefficients in models where I include network statistics that take into account the dependencies between ties. Now the coefficients are different. In addition, there are no AIC or BIC measures to help determine model fit when there is an -Inf coefficient.
An article I am a co-author of with Sean, Demetra, and Erica was covered in the Atlantic and then also was picked up by a few other news sources such as ABC and Stanford. Exciting! While the paper doesn’t actually say that Brown vs. Board of Ed was a failure, it does find that segregation increases when school districts are released from court-ordered desegregation.
For my final project for CS448 - Data Visualization - I worked with three colleagues to develop a website that would enable users to upload network data and view the data on a geographical map as well as a node-link diagram and to see some stats on the network. We did not know of any existing tool that would allow for this kind of quick exploratory analysis of network data with geography.
Our final project is definitely a work in progress, but I thought it turned out pretty well!
Link is here:
http://cs448b.digitallyinclined.net:8000/
The Airbnb data is actually not very interesting - it was mostly just to test using lat/lng and shows every region tied to every other region.
We also need to work on our documentation :) Right now we can accept two file formats - a JSON file and csv. Example of the JSON can be found on github:
https://github.com/jalperin/geonetviz/blob/master/server/static/collab2008.json
The csv requires fields with the following names: ego_name, ego_lat, ego_lng, alter_name, alter_lat, alter_lng, weight.
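Here is a small Python sketch of what a file with those fields looks like, along with a check that the header is complete. This is not the site’s actual validation code; the function name and the example row (names, coordinates, weight) are invented for illustration.

```python
import csv
import io

# Field names required by the csv format described above.
REQUIRED_FIELDS = ["ego_name", "ego_lat", "ego_lng",
                   "alter_name", "alter_lat", "alter_lng", "weight"]


def missing_fields(csv_text):
    """Return the required field names that are absent from the csv header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [field for field in REQUIRED_FIELDS if field not in header]


# Hypothetical one-tie example; the values are made up.
example = (
    "ego_name,ego_lat,ego_lng,alter_name,alter_lat,alter_lng,weight\n"
    "Stanford,37.4275,-122.1697,Berkeley,37.8719,-122.2585,3\n"
)
print(missing_fields(example))  # []
```

Each row is one tie: the ego and the alter each get a name and a lat/lng pair, and weight is the strength of the tie.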
Here is what we show for the collaboration network.
I’ve been looking into why we find an association between having friends with parents who went to college and college attendance. One theory is that having high-SES friends means students have higher expectations/a norm for college attendance. We control for the student’s current expectation of college attendance as well as the average of their friends’ expectations for college attendance, so that does not appear to be the only factor at work.
The average student has 46% of friends with a parent who has graduated from college. The average student has a mean friend college expectation of 6.5, which corresponds to answering between “pretty likely” and “it will happen” in response to a question asking students about the chances that they will graduate from college. For students with the fewest high-SES friends, the mean of their friends’ college expectations is 6 (“pretty likely”), while for students with over 70% high-SES friends, the mean is 7 (still between “pretty likely” and “it will happen”, which is an 8). The two variables are correlated at 0.3.
Looking at the students rather than their friends, 62% of students with a parent who went to college answer “it will happen” to the question about their chances of graduating from college. 43% of students whose parents have a high school degree or less answer “it will happen.” Students with high-SES friends are more likely to answer that they will graduate from college.
Though the average of friends’ college expectations does not end up being a significant predictor of college attendance, another question is whether or not students with high-SES friends will then raise their expectations. Students without a parent who graduated from college who have over 70% of friends with a parent who graduated from college have an average answer of 6.75. The average of their friends’ college expectations is 6.83. Do these students have higher expectations a year later?
The question that was asked in both wave 1 and wave 2 is slightly different - it asks the student about the likelihood that they will go to college, not the likelihood that they will graduate from college, and the answer is on a scale from 1 to 5.
College expectations in wave one are 3.16 on average. A year later, average college expectations are 3.11. If I look just at ninth and tenth graders, the expectations rise slightly, from 3.066 to 3.077. For students with over 70% high-SES friends, college expectations rise by 0.03. For other students, expectations decrease by 0.019. This difference is not statistically significant.
For a more rigorous test, I run an ordered logistic regression predicting the student’s answer to the question in wave 2 using their answer in wave 1 and only look at students who were in 9th and 10th grade in wave 1. I then control for the share of high-SES friends in wave 1. The results show that both are statistically significant predictors of expectations in wave 2, and both are positive. The higher the share of high-SES friends, the higher students’ expectations of college attendance are in wave 2, controlling for their expectations in wave 1.
A conditional logit is used when characteristics of the choices influence the likelihood of the choices, not just characteristics of the individuals making the choices.
I know that an ERGM is the same as a logistic regression when the predictors are attributes of the individuals or of the ties themselves (like whether it is same-race). The data consists of all possible ties and the outcome is whether the tie is realized. But the equivalence no longer holds when ties are not independent.
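The simplest case of this equivalence is a model with only an edges term: the data is one row per possible tie, and the fitted coefficient is the log odds of a tie, just like the intercept in an intercept-only logit. Here is a toy Python sketch; the four-node directed network is made up.

```python
import math
from itertools import permutations

# Made-up directed toy network: four students and three nominations.
nodes = ["a", "b", "c", "d"]
ties = {("a", "b"), ("b", "a"), ("c", "d")}

# One row per possible directed tie; the outcome is whether the tie is realized.
dyads = [(i, j, int((i, j) in ties)) for i, j in permutations(nodes, 2)]

# Share of possible ties that are realized.
density = sum(y for _, _, y in dyads) / len(dyads)

# An intercept-only logit on these rows has intercept log(p / (1 - p)),
# which is also the maximum likelihood estimate for an edges-only ERGM.
edges_coef = math.log(density / (1 - density))
print(len(dyads), density, round(edges_coef, 3))
```

Adding dyad-level predictors such as same-race just adds columns to this dataset. Terms that depend on other ties (triangles, for example) are where the logit setup stops being valid.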
I’ve been looking more into the relationship between friendship segregation and achievement gaps. I look at the gap between high and low SES test scores as measured by a vocabulary test and then correlate the gap with the level of friendship segregation between low and high SES students in the school. Because high-SES students tend to be the most likely to nominate same-SES friends, I look at the average share of low-SES friends nominated by high-SES students in the school and correlate that with the achievement gap. Because schools have different proportions of low-SES students, I normalize by 1 - the share of low-SES students in the school.
I find that schools in which high-SES students nominate more low-SES friends have a smaller achievement gap between high and low-SES students. The figure below plots the achievement gap - the difference in test scores between low and high SES students - vs the inbreeding homophily of high to low SES students - the share of low-SES friends nominated by high-SES students relative to the share in the school divided by 1 - the share of low-SES students in the school. High SES students tend to nominate a lower share of low-SES students than the proportion of low-SES students in the school, which is why that number is generally negative. When inbreeding homophily is 0 that means that high-SES students are nominating a share of low-SES friends that is equal to the share in the school. More negative numbers mean that high-SES students are nominating a lower share of low-SES friends than the share of low-SES students in the school. The standard deviation of the word test is 15 points and the average gap is 7 points - about half a standard deviation.
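Written out, the measure is (H - w) / (1 - w), where H is the share of low-SES friends nominated by high-SES students and w is the share of low-SES students in the school. I’m reading “relative to” as a difference here, which matches the measure being 0 when the two shares are equal. Below is a minimal Python sketch; the function name and the example shares are made up.

```python
def inbreeding_homophily(share_nominated, share_in_school):
    """Normalized gap between the share of low-SES friends nominated by
    high-SES students (H) and the share of low-SES students in the school (w):
    (H - w) / (1 - w)."""
    if not 0 <= share_in_school < 1:
        raise ValueError("share_in_school must be in [0, 1)")
    return (share_nominated - share_in_school) / (1 - share_in_school)


# Hypothetical school where 40% of students are low-SES.
print(inbreeding_homophily(0.4, 0.4))  # 0.0: nominations match the school share
print(inbreeding_homophily(0.2, 0.4))  # negative: fewer low-SES friends than expected
```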
Recently I’ve had to do analyses on datasets that are too large to work with in R, and I’ve been learning how to use the command line tools awk, sed, and grep instead. Today I wanted to do a little data cleaning and then take a small subset of the data to load into R, so I used awk. Awk reads a file one line at a time and performs operations on each line one at a time, rather than loading the entire file first. It recognizes fields separated by spaces or tabs (you can specify the separator) and columns can be referred to using $1 for the first column, $2 for the second, etc.
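Here is the line of code I used to create a new text file containing all the rows from the original data file where the first field is equal to “New York” ($0 means the entire line). Because “New York” contains a space, the separator has to be something other than whitespace; the version below assumes a tab-delimited file, hence -F'\t'.
cat all_data.txt | awk -F'\t' '{if ($1 == "New York") print $0}' > new_york.txt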
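For comparison, here is a sketch of the same one-line-at-a-time filter in Python. It assumes a tab-delimited file; the function name and the sample rows are made up and stand in for all_data.txt.

```python
import io


def filter_rows(lines, value, sep="\t"):
    """Yield each line whose first field equals value, one line at a time."""
    for line in lines:
        first_field = line.rstrip("\n").split(sep)[0]
        if first_field == value:
            yield line


# Made-up sample rows standing in for all_data.txt (assumed tab-delimited).
sample = io.StringIO("New York\t10\nBoston\t7\nNew York\t3\n")
kept = list(filter_rows(sample, "New York"))
print(kept)
```

With a real file, passing an open file handle as lines and writing the result with writelines keeps only one line in memory at a time, the same way awk does.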
Here are some of the references I’ve looked at:
http://www.ibm.com/developerworks/linux/library/l-awk1/index.html
Exponential random graph (ERG) models are great for modeling networks. I’ve been using these for a while now to look at how much being from the same socioeconomic background increases the likelihood that two students are friends. The most basic network model would involve creating a dataset with every possible tie between students, and then predicting whether or not a tie exists based on whether or not students are the same-SES (using a logit, for example). However, logits assume that ties are independent of each other, which is often not the case for networks. If I’m friends with A and A is friends with B then my friendship with B is more likely. ERGM allows me to model this type of dependency. This paper gives a good outline of the terms that can be included in ERG models.
I wanted to add a term for whether or not two students participated in an extra-curricular activity together to see if students who participated in extra-curricular activities together were more likely to be friends. Looking through the list of ERGM terms, I decided the edgecov statistic seemed best. The first argument for edgecov can be a network and the second the name of the edge attribute. I create two data sets. One contains a list of student id and friend id for the actual friendship ties that I use to make a network object called “actual_net”. The other contains all possible combinations of students and then the number of shared activities between the two students. I can then estimate my models with the following code:
library(foreign)  # for read.dta
library(network)
library(ergm)

actual_edges = read.dta("schl1_dyads.dta")
all_possible_edges = read.dta("schll_all_dyads.dta")
actual_net = network(actual_edges[, c("id", "fid")])
all_net = network(all_possible_edges[, c("id", "fid")])
# attach the number of shared activities to every possible tie
set.edge.attribute(all_net, "share_activity", all_possible_edges[, "share_activity"])
summary(ergm(actual_net ~ edges + edgecov(all_net, "share_activity")))
This ergm is pretty simple. In any case, I get that the coefficient on the edgecov term for share_activity is positive and statistically significant, indicating that students who are in an extra-curricular activity together are more likely to be friends. The “edges” term in the model is like an intercept - it gives the overall propensity of students in the network to form ties.
I often use the capture command at the start of a line if I want Stata to keep going through code in a loop even if an error occurs in some instances. The other nice thing about the capture command is that it will store the return code in the built-in scalar _rc. If _rc != 0, then there was an error, and the value of _rc identifies which error it was.
One of the puzzling parts of some of my findings is that it appears that there are many schools with extreme racial friendship segregation, but much less extreme friendship segregation by SES. Aren’t race and SES correlated?
My main measure of SES is parent level of education. I use both a three category measure of SES - high school or less, some college, college or more - and a five category measure - less than high school, high school, some college, college, college +. The table below shows the distributions for the three-level SES measure. White and other students are more likely to have parents who attended college, with Hispanic students the least likely.
There is certainly variation in SES not explained by race. I regress parent education on dummy variables for each race and find that even when I use the most categories of SES the r-squared is 0.015 for non-imputed and 0.037 for imputed data (about 7,000 more cases of SES, imputed at the school level), indicating that at a max, race explains 3.7% of the variance in SES. When I regress parent income on the four categories of race instead of the SES measure based on parent education, race explains 1.8% of the variance for the non-imputed, and 1.7% of the variance for the imputed. If I look at the correlation in each school, the largest r-squared I find is 0.14.
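The r-squared from a regression on a full set of group dummies is just the between-group share of the variance: OLS fits each group’s mean, so R² = SS_between / SS_total. Here is a toy Python sketch of that calculation. The six observations are made up and are not Add Health data, and the function name is hypothetical.

```python
def r_squared_from_groups(groups, values):
    """R-squared from regressing values on dummies for each group.
    OLS on group dummies fits the group means, so this equals the
    between-group sum of squares divided by the total sum of squares."""
    grand_mean = sum(values) / len(values)
    by_group = {}
    for group, value in zip(groups, values):
        by_group.setdefault(group, []).append(value)
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    return ss_between / ss_total


# Made-up toy data: a race code and a parent education level for six students.
race = [1, 1, 1, 2, 2, 2]
educ = [1, 2, 3, 2, 3, 4]
r2 = r_squared_from_groups(race, educ)
print(round(r2, 3))
```

An r-squared of 0.037 therefore says that the racial group means of parent education sit close together relative to the spread within each group.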
Next, I look at schools that have high levels of friendship segregation by race and look at the SES composition of each racial group in the school and the patterns of friendship by SES in these schools. There are twelve schools where the average white and average black student have greater than 75% same-race friends, and where between 25% and 75% of the school is white and black.
Here are pictures of three of those schools. First, I show the nodes colored by race. Second, by the three category education variable. No legend module in Gephi yet as far as I can tell, so I don’t have a legend. But, the patterns are still pretty clear. High segregation by race in these schools, less clear seg by SES.
School 45:
School 50:
Of course, in the process of doing this analysis I noticed differences between the imputed and non-imputed data that I had not expected. I had previously been imputing the max of mother’s and father’s education but had switched to imputing mother’s education and father’s education separately and then taking the max of those. Turns out father’s education is missing in about 35% of cases, while mother’s education is there for 24% of those. So I switched back to imputing the max where there are just 10% missing values. I’ll run everything with both non-imputed and imputed data to check how the results differ.
This morning I’m going to figure out Sweave - a way to link data analysis in R with LaTeX so that I can generate all the tables and figures and have them automatically added to a LaTeX document with one command.
To start, I read this incredibly helpful blog post to set up Sweave on OSX. The example runs Sweave from inside R but it is easy to run from the command line by typing
R CMD Sweave myfile.Rnw
Two other tips that were helpful were to use
<<echo=FALSE, results=tex>>=
which results in a tex file with no R code or sweave tags, and
print(xtable(res), include.rownames = FALSE)
to suppress row numbers being automatically added to a table.