As information becomes more accessible on the internet, privacy has become a major concern. Many do not appreciate having their identifiable information stored on the internet, much less accessible by anyone. Michigan State University has a very concerning directory of all students and faculty called MSU PeopleSearch, and it contains all of the information needed to stalk, harass, or phish those unlucky enough to have their info listed.
The MSU PeopleSearch directory can be accessed by anyone, which is a privacy concern in and of itself, but perhaps the most important aspect of this directory is that everyone is automatically enrolled in it without being clearly notified. Sure, people can unenroll by editing the privacy settings in the Directory section of their StuInfo accounts, but the fact that students are required to go out of their way to remove their personal data is something that doesn’t sit well with me. It didn’t sit well with my team at SpartaHack 2017 either.
SpartaHack was co-sponsored by Amazon this year, and so we wanted to do an Echo-based project. We eventually settled on one extremely creepy but awareness-raising Alexa skill that would be able to recite the phone number of everyone listed in the PeopleSearch Database. We called it Howdy.
Scraping the Data
I wrote a web scraper in C# that would search through the database, leaving no stone unturned. It used a permutation generator to search for unique last names. If a search matched more than fifty results (the site only displays the first fifty), it would add a character to the first name field and continue permuting the searches. The program took around three hours to write, and another two and a half hours to scrape the entire database of students, which held a remarkable 47,000 entries.
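The refinement idea is easier to see in code. What follows is a minimal sketch in Python, not the original C# scraper: it assumes a single name field instead of separate first and last name fields, and `search` is a hypothetical stand-in for whatever function runs one query and returns the capped result list. The names `enumerate_directory`, `cap`, and `max_depth` are made up for illustration.

```python
import string
from typing import Callable, List, Set

RESULT_CAP = 50  # the post says the site only displays the first fifty results


def enumerate_directory(
    search: Callable[[str], List[str]],  # hypothetical: one query -> capped results
    alphabet: str = string.ascii_lowercase,
    cap: int = RESULT_CAP,
    max_depth: int = 6,  # assumption: stop refining after this many characters
) -> Set[str]:
    """Collect entries by lengthening the search prefix whenever a query hits the cap."""
    found: Set[str] = set()
    pending = list(alphabet)  # start with every one-character prefix
    while pending:
        prefix = pending.pop()
        results = search(prefix)
        found.update(results)
        # A full page means results may have been cut off, so try longer prefixes.
        if len(results) >= cap and len(prefix) < max_depth:
            pending.extend(prefix + ch for ch in alphabet)
    return found
```

Short prefixes that return fewer results than the cap are done after one query, so the extra searches only pile up around the most common names.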
With everything scraped and stored in JSON, I passed the data on to my team member Scott, who set up a DynamoDB instance on Amazon Web Services, split up the data into 25-entry segments, and wrote them to the database using a bash script (AWS limits free users to 25 writes per second). The upload took another two or so hours with the rate limit, but the data ended up safely stored in our private database.
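Scott's upload was a bash script, which isn't reproduced here. The splitting step itself can be sketched in a few lines of Python. In this sketch, `write_batch` is a hypothetical callable standing in for whatever actually sends one segment to the database, and the one-second pause is an assumption about how the rate limit might be respected, not a description of the original script.

```python
import time
from typing import Any, Callable, Iterator, List, Sequence

BATCH_SIZE = 25  # segment size mentioned in the post


def chunk(records: Sequence[Any], size: int = BATCH_SIZE) -> Iterator[List[Any]]:
    """Yield consecutive segments of at most `size` records."""
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def upload_in_batches(
    records: Sequence[Any],
    write_batch: Callable[[List[Any]], None],  # hypothetical: sends one segment
    pause_seconds: float = 1.0,  # assumption: wait between segments
) -> int:
    """Send every segment through `write_batch` and return how many were sent."""
    sent = 0
    for batch in chunk(records):
        write_batch(batch)
        sent += 1
        time.sleep(pause_seconds)
    return sent
```

At 25 entries per segment, roughly 47,000 entries works out to about 1,880 segments.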
With the data in a format and location accessible by Amazon Lambda, teammates Adam, Brian, and Koshiro wrote the front-end and back-end code that interfaced the Amazon Echo with our database. After a lot of tweaking and trickery, we had a working product: an Amazon Echo that could tell us the phone number of nearly every student currently enrolled at Michigan State University.