A Snarkmarket mini-collaboration: Snarksyllabi

Via Tim’s retweet, I saw that Dan Cohen at the Center for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, spanning 2002 to 2009.

I think this might make a fun Snarkmarket mini-collaboration. My tender programming chops are such that I can cook up a simple script to parse the data. I’m happy to share the code (and/or collaborate a bit) on GitHub, too, though I’m no pro with version control.
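For the curious, a first pass might look something like this. This is purely a sketch, assuming (hypothetically) that the dataset unpacks to a folder of plain-text files; the real format may well differ:

```python
import collections
import pathlib
import re

# Hypothetical layout: one plain-text syllabus per file in ./syllabi/
# (the actual dataset's format may differ).
SYLLABI_DIR = pathlib.Path("syllabi")

def word_counts(text):
    """Lowercase word frequencies for a single syllabus."""
    return collections.Counter(re.findall(r"[a-z']+", text.lower()))

def corpus_counts(directory=SYLLABI_DIR):
    """Aggregate word frequencies across every syllabus in a directory."""
    totals = collections.Counter()
    for path in directory.glob("*.txt"):
        totals += word_counts(path.read_text(errors="ignore"))
    return totals
```

Nothing fancy, but it would give us a baseline word-frequency table to poke at.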

So the real question is: what sort of questions should we ask?

I’m open to anything, but my bias goes towards something slightly wacky, rather than, you know, something of scholarly significance. Let’s reverse-engineer an inquiry by starting with a Slate headline!

I mean, think about it—a syllabus is

  • a course of study,
  • a set of instructions,
  • a statement of values,
  • a collection of related documents,
  • an indirect payment to a bunch of authors,

and more, all in one.

What might we learn from a million of them all together?

Drop an idea, suggestion, meditation or musing in the comments!

Update: Bit of a hiccup with the database, per Dan’s second update here. As soon as the full version with the cached HTML pages is live, we’ll start playing with it. I’m leaning toward something simple and ngrammy to start, per Tim’s comment.

28 comments

Also, my instinct is that something about books would be interesting… something like: let’s look at some constrained list of books (winners of the Pulitzer or the National Book Award?) and see how often they occur in syllabi. Then try to spot patterns: is it books from a particular era that show up more? Books by men or women? Etc., etc. Essentially we’d be cross-referencing a second data set — a set of books — with the syllabi.
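That cross-reference could be sketched very simply. The book list below is a stand-in (any award list would do), and the substring match is naive, but it shows the shape of the thing:

```python
from collections import Counter

# Stand-in award-winner list: (title, publication_year).
# A real run would load the full Pulitzer/NBA lists.
BOOKS = [
    ("Beloved", 1987),
    ("The Road", 2006),
]

def book_hits(syllabi, books=BOOKS):
    """Count how many syllabi mention each book title (naive substring match)."""
    hits = Counter()
    for text in syllabi:
        lowered = text.lower()
        for title, year in books:
            if title.lower() in lowered:
                hits[title] += 1
    return hits
```

From there, grouping the hits by publication year or author would get at the era and gender questions.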

Dan Munz says…

I’d love to at least start with something like Google’s Ngrams project. It’d be interesting to see how various works increased or decreased in Perceived Scholarly Importance (PSI), as measured in syllabus inclusions per academic year, over time.
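PSI-over-time is easy to sketch, assuming (hypothetically) we can get the data into (year, syllabus text) pairs:

```python
from collections import Counter

def inclusions_per_year(records, work):
    """records: hypothetical (year, syllabus_text) pairs.
    Returns how many syllabi per year mention the given work."""
    by_year = Counter()
    for year, text in records:
        if work.lower() in text.lower():
            by_year[year] += 1
    return by_year
```

Run that for each work of interest and you have the raw series for an ngram-style chart.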

Also: I haven’t been able to look at the DB yet, but if it records field of study, it would be extremely interesting to look at (and possibly even visualize) works that start big in one field of study but slowly (or quickly?) migrate into others. I’m thinking particularly of Sociology texts that might migrate into Economics courses, or things about game theory that make their way into IR courses, but there are surely a ton of other examples as well.

Tim Carmody says…

My brain’s in information theory lately, which was partly concerned with things like frequency/infrequency of word & character strings. (Did you know that if a word ends with the letter “u,” it’s overwhelmingly likely to be the word “you”?)

So maybe we can spit out some lists of frequently- and infrequently-seen phrases. Something like “In this class, we will” is probably super-common. But maybe there are phrases that are more common than you might think, that turn out to be super-interesting. Like, what if “I want us to have fun” actually appears in > 100 syllabi? That seems significant!
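A sketch of the phrase-counting, nothing more than n-grams over whitespace-split words (real text would want better tokenization):

```python
from collections import Counter

def phrase_counts(syllabi, n=5):
    """Count every n-word phrase across a collection of syllabus texts."""
    counts = Counter()
    for text in syllabi:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def docs_containing(syllabi, phrase):
    """How many syllabi contain a given phrase at all (for the
    'I want us to have fun' test)."""
    return sum(phrase in text.lower() for text in syllabi)
```

Sort the counter, skim the tail, and close-read whatever looks weird.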

So it’d be a simple (I think) program that would generate a lot of data, and some of it we could then close-read for serendipitous silliness. That’s my first volley, anyways.

Suggestion from @faketv:

@robinsloan @tcarmody syllabi have to pay dues to a “canon” & then present tastefully hip alternatives. can you track that over time? … e.g. “adorno vs. the matrix” #infographic

To which I said:

@faketv @tcarmody Ha haha. That’s really good. We’d need a big set of canon/countercanon pairs though. Hmm. #crowdsourcing

Dan Munz says…

Another thought: I feel like you could get some kind of prediction market going if you could manage to get data on a relatively frequent basis (like, every semester) from even a representative group of schools.

(Meta-note: it’s funny how much of the initial conversation about this actually played out on Twitter, instead of here in the comments. Things are chaaaanging!)

Tim Carmody says…

Yes – and it wasn’t like you tweeted “hey, I’m doing this, what do you guys think?” You literally just wrote “psst,” just to me.

Lance Hunter says…

How about the average rate of progression of textbook editions, perhaps divided by field (to learn which field tends to update to new textbooks more frequently)? [You would assume that due to constant advances in the field, the engineering and CS courses would be the ones to update books the most, but I wouldn’t be surprised if it turns out to be one of the humanities fields.]

Which field has the highest rate of textbooks written by the course instructor? [AKA the “Squeezing every penny I can out of teaching this class” award.]

Learn which discipline is more authoritarian by finding which has the highest average number of references to school disciplinary policy (& perhaps the phrase “no exceptions”) per syllabus. [I’m betting on English for this one.]

This would probably be hard to code for, but we could try to learn which discipline wins the “goddamn hippies” award by checking for slang terms and informal language in the syllabus, or finding the field that is most likely to have absolutely no attendance penalties.

I’m trying to think of a way to check for a patriarchal and/or sexist tone in text, because I think that would be something really interesting to be able to review from this dataset. Sadly, I have no idea where to even begin finding ways to analyze that in code.
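The edition-tracking ideas above hinge on pulling edition numbers out of free text. A rough standard-library sketch (the regex is a heuristic, not exhaustive; it would miss spelled-out forms like “second edition”):

```python
import re

# Matches e.g. "3rd edition", "2nd ed.", "10th Edition" (rough heuristic).
EDITION_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\s+ed(?:ition|\.)?", re.IGNORECASE)

def edition_numbers(text):
    """Extract textbook edition numbers mentioned in a syllabus."""
    return [int(m) for m in EDITION_RE.findall(text)]
```

Average those per field, per year, and you have the “rate of progression” Lance is after.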

> How about the average rate of progression of textbook editions, perhaps divided by field (to learn which field tends to update to new textbooks more frequently)?

I think this, and variants thereof, could be not just interesting but actually valuable. A professor who goes out of their way to find affordable books (or at least doesn’t demand that students pay for the latest edition) should get points and be findable.

I think just harvesting all the recommended textbooks and papers, scoring them by number of times cited, required vs. recommended status, and the departments and levels (upper-division vs. lower-division) that assign them, would create a valuable database. You could then crowdsource the assignment of gender or race identities to each book (i.e., do a little websurfing to figure out if it’s a male Robin or a female Robin authoring ‘The New Liberal Arts’) and analyze the breakdown by field.

Hey all,
I just got a torrent tracker up and running and am seeding from home. Spread the word & help seed if you can!

Let me know if it breaks
@mcburton

Variation on the hippie test: a trendiness test. Correlation between course content and TV series airing that year. “The Simpsons and Sociobiology”.

Oh that is good.

Yes, I particularly love this.

David K. says…

Just as there are “write only” journals, what are the journals that actually get assigned and presumably “read”? How do syllabi differ from / compare to Web of Science impacts, etc.?

Lauren says…

One thought: professors requiring students to buy textbooks they wrote. (Royalties, anyone?) Broken down by school, subject, prof… Maybe compare with how popular said book is with other profs teaching the same class?

How many times is the word innovation used? And does it increase proportionally with the growth of the Internet? Is it used in one field more than another? And, in turn, is that field doing better than others?

Dare you.

I’m actually most interested in assignment types and grading breakdowns.

Assignment types: does keeping a blog suddenly show up circa 2004? Are there movements from weekly papers to big ones? What’s the trend with group work? By department?

Grading breakdown: terribly curious about the correlation between ‘class participation’ and subject area.

I’ll tell you what I’m going to do…look at collocations by ISBN (and maybe titles if I get fancy later). What books get assigned together, and what can we learn about canons (esp. contemporary lit, which is what I study) by exploring these as networks? What are the clusters in the network, and what are the central nodes gluing different clusters together?
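The co-assignment counting can be done with the standard library alone. This sketch assumes we can first extract per-syllabus lists of ISBNs; the resulting weighted edge list could then be handed to any graph library for the cluster analysis:

```python
from collections import Counter
from itertools import combinations

def isbn_collocations(syllabi_isbns):
    """Given per-syllabus lists of ISBNs, count how often each pair of
    books is assigned together: an edge list for a co-assignment graph."""
    edges = Counter()
    for isbns in syllabi_isbns:
        # Deduplicate within a syllabus, sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(isbns)), 2):
            edges[(a, b)] += 1
    return edges
```

High-weight edges are your canon pairings; nodes that bridge otherwise-separate clusters are the interesting glue books.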

Let me know if anyone wants to work on this with me.

I do! I do like the idea of collocation. It would be particularly neat to relate this graph to any kind of relevant academic genealogy.

Some time ago I picked up Franco Moretti’s Graphs, Maps, Trees, and was very taken by the start of it, but then mislaid it. (A pox upon overly thin books!) But I’ve come across it again, which doubles my enthusiasm for this little project.

And silly me, you’re probably quite familiar with Moretti yourself. 😉

Great! I’m also excited to see what other people come up with. I think this is an interesting example of a big chunk of “dark matter” from academia suddenly coming to light.

I really want to see your results here.

It would be awesome to build a syllabus recommendation engine.

David Kahane says…

Thinking of Slate headlines: see whether the gender, discipline, institution, or institution prestige (if you have a source for this) of the instructor correlates with their propensity to assign their own work. Perhaps factor in the cost of the person’s own work being assigned.

Insofar as you can parse # of pages assigned from syllabus data, it’d be interesting to see how this correlates with discipline.

Should we be concerned with the fact that all the syllabi came from the internet and therefore might show some kind of bias toward those inclined to post their syllabi online? Maybe that’s not a significant or relevant aspect of the data set, but it occurred to me.

Oh, that’s probably an interesting signal in itself!

I saw that Harvard Law School just posted its entire collection of exams from 1871-1998. I immediately thought of your post when I saw this announcement because some ngram action would be great:

http://oasis.lib.harvard.edu/oasis/deliver/deepLink?_collection=oasis&uniqueId=law00237

Perhaps to identify the rise and fall of different legal movements and doctrines?

Whoah! That’s super interesting. Good tip, EC.

I find the phrase ‘digital surrogate’ to be delightfully creepy.
