Yield Rate Gender Gap, Part 1: Scraping the Web

This is Part 1 of an investigation into the yield rate disparity in college admissions between women and men. It's a personal project I started to tie together skills in Python: web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

University admissions and the yield rate

In my previous work running the education startup Newton China, I frequently ended up using the website CollegeData.com as a reference – it lists a ton of data for a lot of schools.

As an example, searching for my alma mater brings up Penn State’s page, which is actually one of six separate CollegeData pages of information: Overview, Admission, Money Matters, Academics, Campus Life, and Students.

Scrolling down, you’ll find a bunch of numbers organized somewhat regularly into rows. Some table rows, like Average GPA, are easy enough, containing only one value. But others, like the row labeled Students Enrolled, contain three pieces of data.

The first number in 'Students Enrolled' is the straightforward number of students enrolled, presumably 9,800 students attending school in the fall as incoming freshmen. The third number – 29,878 – is the number of students who were offered admission. The second number – 33% – is called the yield rate, and it is the number that raised questions that inspired me to collect and analyze this data.
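The relationship between those three figures is simple division; plugging in the Penn State numbers from the table:

```python
# Yield rate = students enrolled / students offered admission
# (figures from Penn State's CollegeData page).
enrolled = 9_800
admitted = 29_878

yield_rate = enrolled / admitted
print(f"{yield_rate:.0%}")  # 33%
```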

But why care about the yield rate?

The yield rate is the percentage of admitted students who actually enroll, and it is of considerable importance to many college admissions departments. If the admissions department enrolls too few students, the school will not receive enough tuition to cover its costs. If it enrolls too many, the extra students will push dorms and classrooms beyond capacity.

Not every student who is given an offer letter – a letter of acceptance – will accept it. In fact, it seems only 33% of Penn State’s offer letters were accepted. Presumably, the other 67% of admitted students elected to attend another school. If this 33% yield rate is relatively consistent year to year, which it normally is, the school will have a good idea of how many offer letters to send out so that it ends up with the right number of enrolled students.

Curiously, the yield rate at Penn State for men (37%) seems a good amount higher than that for women (29%). Men seem more likely to accept Penn State’s offers than women are. After looking at a few other schools’ yield rates, I became curious if there actually is a significant disparity in yield rate among schools – and if so, how is this disparity distributed and what factors may predict it?

Scraping collegedata.com

The collegedata.com URL for Penn State’s Admission information page looks like this:

http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId=59

The 59 at the end of the URL is the schoolId. The 02 part of the URL indicates you are on the 'Admission' information page. Not every schoolId corresponds to a school, but most do, especially in the lower range of numbers. Attempting to access a page for a number with no school loads a default page titled "Retrieve a Saved Search". After poking around a bit, I found that at larger schoolIds, especially over 1000, real school pages became sparser, and I’m fairly confident there are no schools listed with a schoolId over 2500.

BASE_URL = "http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId="
START_ID = 0
END_ID = 2500
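With those constants, building any page's URL is just string concatenation. A quick sketch (the helper name is mine, and the constant is repeated so the sketch runs standalone):

```python
# Repeated here so this sketch is self-contained.
BASE_URL = "http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId="

def page_url(school_id):
    """Build the Admission-page URL for a given schoolId."""
    return BASE_URL + str(school_id)

print(page_url(59))  # ends with "schoolId=59", Penn State's page
```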

In order to determine if the yield disparity is real, I only need to scrape the yield rate from each page. But grabbing as much other data as possible from the six collegedata.com pages of information for each school at the same time could prove useful later when trying to find predictors for yield rate.

I decided a good balance would be to only get the data in the <td> tags under the heading 'Profile of Fall Admission', identified as <div id='section8'> in the HTML, in addition to the school’s name and its location. I can always come back to scrape more if I need it in the future.

Building a BeautifulSoup scraper

I used the wonderful BeautifulSoup library to help me get what I need from CollegeData’s pages. It maps the HTML into a tree that can be descended down (and up) and side to side to access the elements and strings you want.

The school’s name is conveniently located in the first <h1> tag. The 'location' is a bit trickier to grab, as it isn’t nested inside a unique tag, but it can be found easily by searching for the text 'City, State, Zip' and taking the next sibling cell.
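Those two lookups can be sketched against a toy HTML fragment. The markup below is mine, imitating the structure described, and I use Python’s built-in parser so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page structure described above.
html = """<h1>Penn State University Park</h1>
<table><tr><th>City, State, Zip</th><td>University Park, PA 16802</td></tr></table>"""

page = BeautifulSoup(html, "html.parser")

schoolname = page.h1.string  # the first <h1> holds the school's name
location = page.find('th', text="City, State, Zip").next_sibling.string

print(schoolname)  # Penn State University Park
print(location)    # University Park, PA 16802
```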

Finding 'section8' is no problem; from there, a relatively simple for loop walks through all of its <tr> descendants, each of which may have a <th> child to use as a key and a <td> child to use as a value. Though most of the <th> values are unique, some are not, including 'Women', 'Men', and the SAT score rows further down the section.

from bs4 import BeautifulSoup

def scrape(response, REJECT_PAGE_TITLE = "Retrieve a Saved Search"):
    # response is from requests.get()
    page = BeautifulSoup(response.text, "lxml")
    if page.h1 is None or page.h1.string == REJECT_PAGE_TITLE:
        return None

    scraped = {}
    scraped['schoolname'] = page.h1.string
    scraped['location'] = page.find('th', text="City, State, Zip").next_sibling.string
    for tag in page.find(id="section8").descendants:
        if tag.name == 'tr' and tag.th is not None and tag.td is not None:
            key = " ".join(tag.th.stripped_strings)
            val = " ".join(tag.td.stripped_strings)
            while key in scraped:  # temporarily mark duplicate table header cells
                key += "*"
            scraped[key] = val
    return scraped

An empty dictionary 'scraped' is created to hold our scraped values. We add 'schoolname' and 'location' before looping through id=section8, saving every non-empty pair of <th> and <td>. To avoid overwriting duplicates, we append a '*' to the key as a temporary marker until we can clean up the key names later. Finally, the 'scraped' dictionary is returned to the caller.

Requesting 2500 pages!

I’ve read about methods for evading anti-scraping measures, but I encountered no pushback from CollegeData when making these basic requests. Running through the whole loop took me about 40 minutes.
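For the record, had the site pushed back, the gentlest countermeasure is simply pausing between requests. A hypothetical helper (the name, delay, and retry count are all my own choices, not anything CollegeData required) might look like:

```python
from time import sleep
from requests import get

def polite_get(url, delay=1.0, retries=3):
    """GET with a pause before each attempt and a simple retry.
    A courtesy sketch I didn't end up needing."""
    response = None
    for _ in range(retries):
        sleep(delay)               # wait before every request
        response = get(url)
        if response.status_code == 200:
            return response
    return response                # last response; caller checks status_code
```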

from requests import get
from IPython.core.display import clear_output
from warnings import warn

scraped_pages = {}
for schoolId in range(START_ID, END_ID+1):
    url = BASE_URL + str(schoolId) 
    print(url)   # a status update
    response = get(url)
    if response.status_code != 200:
        warn('Request {} caused status code {}.'.format(schoolId, response.status_code))
    scraped = scrape(response)
    if scraped:
        scraped_pages[schoolId] = scraped
    clear_output(wait = True) # clear the status update

print('Requested schoolId {} to {}, scraping {} pages.'.format(START_ID, END_ID, len(scraped_pages)))
Requested schoolId 0 to 2500, scraping 1813 pages.

We now have a dictionary, 'scraped_pages', filled with keys 'schoolId' associated with scraped dictionaries, which themselves are filled with scraped <th> keys and <td> values, in addition to 'location' and 'schoolname'. Pandas’ 'DataFrame.from_dict' method will quickly convert this to a DataFrame, which we will then backup to .csv before moving forward.

Inspecting the results

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1813 entries, 6 to 2099
Data columns (total 53 columns):
schoolname                               1813 non-null object
location                                 1813 non-null object
Overall Admission Rate                   1813 non-null object
Women                                    1813 non-null object
Men                                      1813 non-null object
Students Enrolled                        1813 non-null object
Women*                                   1813 non-null object
Men*                                     1813 non-null object
Early Decision Admission Rate            245 non-null object
Early Action Admission Rate              392 non-null object
Students Offered Wait List               335 non-null object
Students Accepting Wait List Position    295 non-null object
Students Admitted From Wait List         324 non-null object
Average GPA                              1813 non-null object
3.75 and Above                           1807 non-null object
3.50 - 3.74                              1807 non-null object
3.25 - 3.49                              1799 non-null object
3.00 - 3.24                              1804 non-null object
2.50 - 2.99                              1774 non-null object
2.00 - 2.49                              1557 non-null object
SAT Math                                 1813 non-null object
Score of 700 - 800                       1813 non-null object
Score of 600 - 700                       1813 non-null object
Score of 500 - 600                       1813 non-null object
Score of 400 - 500                       1813 non-null object
Score of 300 - 400                       1813 non-null object
Score of 200 - 300                       1813 non-null object
SAT Critical Reading                     1813 non-null object
Score of 700 - 800*                      1813 non-null object
Score of 600 - 700*                      1813 non-null object
Score of 500 - 600*                      1813 non-null object
Score of 400 - 500*                      1813 non-null object
Score of 300 - 400*                      1813 non-null object
Score of 200 - 300*                      1813 non-null object
SAT Writing                              1813 non-null object
Score of 700 - 800**                     1813 non-null object
Score of 600 - 700**                     1813 non-null object
Score of 500 - 600**                     1813 non-null object
Score of 400 - 500**                     1813 non-null object
Score of 300 - 400**                     1813 non-null object
Score of 200 - 300**                     1813 non-null object
ACT Composite                            1813 non-null object
Score of 30 - 36                         1813 non-null object
Score of 24 - 29                         1813 non-null object
Score of 18 - 23                         1813 non-null object
Score of 12 - 17                         1813 non-null object
Score of 6 - 11                          1813 non-null object
Score of 5 or Below                      1813 non-null object
High School Class Rank                   1107 non-null object
National Merit Scholar                   1813 non-null object
Valedictorian                            1813 non-null object
Class President                          1813 non-null object
Student Government Officer               1813 non-null object
dtypes: object(53)
memory usage: 764.9+ KB

I scraped 53 separate columns for 1813 schools. According to the US Dept. of Education’s National Center for Education Statistics, there were 2870 4-year colleges as of 2011. It would be very helpful to know what criteria the people behind CollegeData used to add schools to their site. Are CollegeData’s schools only the largest? Only those who filled out a survey? Only those who paid?

I’m not sure how accurately this sample represents the overall population of 4-year colleges, nor how well insights drawn from analyzing it will generalize to that greater population; more investigation into CollegeData’s data acquisition process is warranted. Be warned!

import pandas as pd
college = pd.DataFrame.from_dict(scraped_pages, orient='index') # convert scraped pages to a pandas DataFrame
college.to_csv('collegedata_scraped_backup.csv')   # backup
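One note on the backup: 'to_csv' writes the schoolId index as an unnamed first column, so restoring it later needs 'index_col=0'. A round-trip sketch with stand-in data (the two fake schools below are mine):

```python
import pandas as pd

# A tiny stand-in for scraped_pages: two fake schoolIds.
scraped_pages = {59: {'schoolname': 'Penn State', 'Students Enrolled': '9,800'},
                 60: {'schoolname': 'Another U', 'Students Enrolled': '1,234'}}

college = pd.DataFrame.from_dict(scraped_pages, orient='index')
college.to_csv('collegedata_scraped_backup.csv')

# index_col=0 restores the schoolId index on reload.
restored = pd.read_csv('collegedata_scraped_backup.csv', index_col=0)
print(restored.index.tolist())  # [59, 60]
```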

With that said, I will move on to cleaning the data. Perhaps after some exploration we will have a better idea of the quality of this sample and of the inferences we can make about the yield rate disparity.
