Play a Game to Improve a Philosophical Research Tool (guest post)


“What papers talk about P?”
“Who argues that P?”

Those are the kinds of questions to which David Bourget (Western) hopes PhilPapers, the comprehensive online philosophical index and bibliography, will be able to reliably provide direct, concise, and correct answers.

To make that happen, Bourget, one of PhilPapers’ co-founders, is working on incorporating AI models into PhilPapers, and has developed a game called Beat AI: a contest that uses philosophical concepts to help train those models.

In the following guest post, Bourget explains how he plans to make use of AI at PhilPapers, how Beat AI is played, and how playing it helps the AI improve.

He also provides a link to the game so you can go play it. Maybe you’ll get a high score.*


Play a Game to Improve a Philosophical Research Tool
by David Bourget

Today the PhilPapers team launched an online game called Beat AI: A contest using philosophical concepts. This post offers some information on the game’s technical background and its place in the broader PhilPapers ecosystem. It’s also an invitation to ask questions in the discussion thread.

In short, the objective of the game is to trick AI models using your superior mastery of philosophical concepts. You do this by submitting a triplet of expressions: an anchor, a substitute, and a decoy. The anchor and substitute are supposed to be close in meaning. The decoy is supposed to be an expression that AI models may think is closer in meaning to the anchor than the substitute is. You win when the AI models make mistakes.
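
To make the game’s logic concrete, here’s a minimal sketch of how a triplet might be judged against an embedding model. The model name and the exact scoring rule are my illustrative assumptions, not the game’s actual code:

```python
# Hypothetical sketch of triplet scoring; the real game's code may differ.
# BAAI/bge-large-en-v1.5 stands in for the competing embedding models.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ai_is_fooled(anchor: str, substitute: str, decoy: str) -> bool:
    """The player wins when the model rates the decoy closer to the anchor than the substitute."""
    e_anchor, e_sub, e_decoy = model.encode([anchor, substitute, decoy])
    return cosine(e_anchor, e_decoy) > cosine(e_anchor, e_sub)

# Illustrative triplet: a near-synonym substitute and a superficially similar decoy.
print(ai_is_fooled("qualia", "phenomenal properties", "quality control"))
```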

In case it’s not obvious, our primary aim in making this game was to generate a large set of examples that we can use to train better AI models. Ultimately, we want to enable more powerful search and other features on PhilPapers. Even if you’re not fully taken with the game, please consider giving it a good run to help us collect the data we need!

There are three AI models competing in the game: OpenAI’s ada3-large, BGE-large, and my own home-grown model, which I call philai-embeddings. These models are not “talking” AIs like ChatGPT (they aren’t “generative AI”). They are what are called embedding models. An embedding model is a neural network designed to convert a piece of natural language into a large vector of numbers. In the case of BGE-large and philai-embeddings, the output is a vector of 1024 real numbers; OpenAI’s ada3-large outputs vectors of 3072 numbers.
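
As a concrete illustration, the interface is simply text in, fixed-size vector out. The sketch below uses the openly released BAAI/bge-large-en-v1.5 as a stand-in for the models named above:

```python
# Minimal sketch: an embedding model maps text to a fixed-size numeric vector.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # a 1024-dimensional BGE variant
vec = model.encode("Knowledge is justified true belief.")
print(vec.shape)  # (1024,)
```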

Embedding models are trained so that certain semantic relationships between their inputs are reflected in numerical relationships between the corresponding outputs. In the case of philai-embeddings, the chief aim was for closeness of topic between inputs to correspond to the angle between the output vectors: the more similar the topics of two inputs, the smaller the angle between their output vectors (the “embeddings”). By “topics”, I mean what the input texts are about: the subjects, views, properties, relations, and other things mentioned in them. My model only cares about sameness of topic, not about logical relations such as entailment or consistency, because it’s designed for a first-pass retrieval of passages relevant to a query on PhilPapers, and in a first pass it’s simpler to ignore logical relations.
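
Here’s a small sketch of that correspondence, again with a stand-in model and hand-picked sentences of my own; the point is only that same-topic pairs should come out at smaller angles:

```python
# Sketch: closeness of topic should show up as a smaller angle between embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # stand-in model
a, b, c = model.encode([
    "Consciousness cannot be reduced to physical processes.",
    "Phenomenal experience resists physicalist explanation.",  # same topic as the first
    "Utilitarianism evaluates acts by their consequences.",    # different topic
])

def angle_degrees(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

print(angle_degrees(a, b))  # expect a smaller angle (shared topic)
print(angle_degrees(a, c))  # expect a larger angle (different topics)
```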

This brings me to the broader purpose of the project. Embedding models are meant to allow efficient comparison of large numbers of passages for relevant semantic relationships. In my case, I want to encode all works indexed on PhilPapers using the model, so that we can use the embedding of a user’s search query (or of something derived from it) to retrieve likely relevant passages by comparing the angles between this embedding and the stored embeddings.
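
In code, the retrieval step is little more than a matrix-vector product over pre-computed embeddings. The passages and model below are my illustrative stand-ins, and a production system would use an approximate nearest-neighbor index rather than this brute-force scan:

```python
# Sketch of embedding-based retrieval: embed the corpus once, then rank
# passages by cosine similarity to the query embedding.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
passages = [
    "Zombies are conceivable; therefore physicalism is false.",
    "Moral facts are mind-independent.",
    "Knowledge is incompatible with epistemic luck.",
]
# With unit-normalized embeddings, the dot product equals the cosine of the angle.
corpus = model.encode(passages, normalize_embeddings=True)

query = model.encode("arguments against physicalism", normalize_embeddings=True)
scores = corpus @ query                # cosine similarity to each stored passage
for i in np.argsort(-scores)[:2]:      # two most relevant passages
    print(f"{scores[i]:.3f}  {passages[i]}")
```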

The advantage of this technique over keyword search is that embeddings, when they work well, cope with two major stumbling blocks for keyword search: synonymy and ambiguity. The problem of synonymy, as the name indicates, is that distinct keywords can convey the same or roughly the same meaning, but if you search for just one of the synonymous keywords, you will find only the texts that use that keyword. The problem of ambiguity is that a word may not always carry the meaning that is relevant to your query. Embeddings, when well trained, cut through both problems: they give synonymous expressions nearby vectors, while keeping apart inputs that share keywords but differ in meaning (provided the disambiguating context is part of the input).
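
A quick sketch of both points, using a stand-in model and examples of my own choosing: a synonym for the query’s sense should land nearby, while the same keyword in a different sense should not:

```python
# Sketch of the synonymy/ambiguity point: synonyms should embed close together,
# while shared keywords used in different senses should not.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # stand-in model
emb = model.encode([
    "physicalism",                          # query term
    "materialism about the mind",           # synonym for the query's sense
    "materialism as love of possessions",   # same keyword, different sense
])
print(util.cos_sim(emb[0], emb[1]))  # expect higher similarity (synonymy handled)
print(util.cos_sim(emb[0], emb[2]))  # expect lower similarity (ambiguity handled)
```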

So, my aim is to convert all the text available on PhilPapers to embeddings in order to enable more powerful search. I plan to embed not just the abstracts of papers, but also the over 1.5M full texts to which we have access. For the full texts, we will embed paragraph-sized chunks individually, which will allow very fine-grained searches. At the end of the day, I hope that we can reliably give direct, concise answers to queries such as “What papers talk about P?” or “Who argues that P?”
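
For the chunking step, a simple paragraph splitter is enough to convey the idea. The heuristic below (blank-line-separated paragraphs with a length cap) is my illustration, not necessarily what PhilPapers will use:

```python
# Illustrative paragraph-level chunking for full texts before embedding.
def paragraph_chunks(text: str, max_chars: int = 1500) -> list[str]:
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Split very long paragraphs so every chunk stays roughly paragraph-sized.
        for start in range(0, len(para), max_chars):
            chunks.append(para[start:start + max_chars])
    return chunks

doc = "First paragraph about qualia.\n\nSecond paragraph about zombies."
print(paragraph_chunks(doc))  # two paragraph-sized chunks, embedded individually
```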

The home-grown model that Beat AI players are competing against, philai-embeddings, is only a first step. It’s a version of Google’s BERT that I further trained on a large number of sentences from PhilArchive, which made it much better with philosophical text than the base BERT. Surprisingly, it does about as well as OpenAI’s ada3-large despite being much smaller and more computationally efficient. But there’s still a long way to go, as players of Beat AI will probably find. Please go play and let our future models learn from you!
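
The post doesn’t spell out the training recipe, but for the curious, one standard way to adapt a plain BERT into an embedding model using sentence pairs is contrastive training with the sentence-transformers library. The pairs and hyperparameters below are purely illustrative, not the actual philai-embeddings setup:

```python
# Illustrative sketch of adapting BERT into an embedding model; not the actual
# philai-embeddings training code. Related expressions serve as positive pairs,
# with in-batch negatives supplied by MultipleNegativesRankingLoss.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("bert-base-uncased")  # plain BERT plus mean pooling

train_examples = [
    InputExample(texts=["qualia", "phenomenal properties"]),
    InputExample(texts=["the trolley problem", "a runaway-trolley dilemma"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

# One pass over the (tiny, illustrative) training set.
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```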


[Beat AI Individual Leaderboard as of late last night]

30 Comments

Dr. Lothar Leidernicht
1 month ago

Is this a good thing?

David Bourget
Reply to  Dr. Lothar Leidernicht
1 month ago

Of course! PhilPapers’ AI isn’t going to take away philosophers’ jobs or turn everything into paper clips; it’s just going to help people find stuff. We don’t have the budget to make world-destroying or job-stealing AIs!

Samantha
Reply to  David Bourget
1 month ago

…sure, but will the company this is sold to have the budget?

Cecil Burrow
Reply to  Samantha
1 month ago

I don’t think a program that classifies philosophy papers is exactly the sort of thing venture capitalists are going to be fighting each other over…

Dr. Lothar Leidernicht
Reply to  Cecil Burrow
1 month ago

But what if it can start classifying philosophers? Not all classifications are value-neutral…

Samantha
Reply to  Cecil Burrow
1 month ago

“A subscription to this software enables you to automatically grade student papers in X department. Given the proof of concept, we will be able to roll out AI grading of student papers in other humanities departments. Major cost savings!”

Nicolas Delon
Reply to  Samantha
1 month ago

I know anything’s possible, but: PhilPapers is a repository for professional philosophy ‘papers’ (as in, preprints and publications), and the goal of the model, it seems, will be to classify papers, not grade them. What does this have to do with grading student papers?

Kenny Easwaran
Reply to  Dr. Lothar Leidernicht
1 month ago

Only if you believe it’s a good thing for people to be able to find philosophy papers that talk about particular topics, or make claims relevant to particular arguments. If you think that the invention of subfields and keywords was a bad thing, and that we’re generally better off trying to find papers by looking for prestigious journals and authors we know, then you might well think this is bad.

Dr. Lothar Leidernicht
Reply to  Kenny Easwaran
1 month ago

“Only if you believe it’s a good thing for people to be able to find philosophy papers that talk about particular topics, or make claims relevant to particular arguments.” I agree with that. But there is something about making things too easy that, in my experience, hinders understanding or knowledge. I now read a lot of papers that refer to other papers where it often seems the authors have not really read those other papers, but instead did something like a keyword search in them to find claims that fit what they want. I wonder whether a tool like this will make that tendency for shortcuts, and the habits of taking them that ultimately undermine our own understanding, even worse.

Chris
Reply to  Dr. Lothar Leidernicht
1 month ago

I dunno, but in the old days people sometimes just cited articles because they were already cited by someone else, without reading them, etc. So shortcuts have been around.

Dr. Lothar Leidernicht
Reply to  Chris
1 month ago

Good point.

David Bourget
Reply to  Kenny Easwaran
1 month ago

True, I’m making some big assumptions here!

Joshua Miller
1 month ago

Are you running all models locally? If not, what is being shared with OpenAI’s servers, and under what license?

David Bourget
Reply to  Joshua Miller
1 month ago

We run the BGE and PhilAi models on our own servers (well, cloud-based virtual machines). OpenAI’s embedding model runs only on their servers, but they say they do not keep any data submitted through this API.

Joshua Miller
Reply to  David Bourget
1 month ago

Thanks David. I’m thinking about the NYT lawsuit, and Reddit’s $60M contract. It seems plausible that these models were built by scraping PhilPapers already, and that they might be interested in supporting PhilPapers in exchange for access to the ongoing output. Have you considered a partnership with them?

Michel
1 month ago

I’d like to complain about R2, who incorrectly rejected a submission. =)

David Bourget
1 month ago

An update on how things are going since launching this morning: so far we’re really happy with the response. We’re approaching 4500 submissions and it looks like the game is just getting started. The pace of submissions just keeps going up. I’m seeing many, many great submissions that will really enrich our AI training. Thanks to everyone who’s contributed already! I also want to say that we’re trying to fine-tune the refereeing process so that we can better keep up. Right now we can’t keep up, but that’s a good problem to have.

Colm
1 month ago

Maybe it’s just me, but it seems like having an appeal process might be useful. For example, “Category” as the anchor and “Pure concept of the understanding” as the substitute was denied, whereas “Dasein” and “The being for whom being is a problem” was approved. The approvals and rejections seem kind of arbitrary.

Colm
Reply to  Colm
1 month ago

Another example: “Authenticity” and “Ownedness” were accepted, whereas “Ready-to-hand” and “Equipment” were rejected.

David Bourget
Reply to  Colm
1 month ago

Thanks for the feedback! The “Category” anchor probably seemed too context-dependent to the referee. Here’s a rule of thumb: if you picked a random fragment from a random philosophy book and found the anchor in it, could you reasonably assume it means the same as the substitute? I think here the answer is “no”, because the word “category” can be used in many ways, for example, “this book is in a different category”. If you had said “Kant’s categories”, that would have worked.

Colm
Reply to  David Bourget
1 month ago

Fair enough, maybe ‘category’ was too broad a term, but my Kant-focused brain didn’t see it that way. Either way, rather than just leveling criticism, I’ll say that I think the little game is great (and fun to play), and I’m always down to help out a great project from PhilPapers.

Shen-yi Liao
Reply to  David Bourget
1 month ago

If context-dependence is the concern, then I am not sure that some of the accepted examples aren’t context-dependent in the same way. For example, surely the meaning of ‘materialism’ is context-dependent too? Even exclusively within philosophy, it’s not always used in the philosophy of mind sense.

Rob Hughes
1 month ago

Does this project have outside funding? If yes, what is the source?

David Bourget
Reply to  Rob Hughes
1 month ago

The project doesn’t have funding outside the PhilPapers Foundation.

Rob Hughes
Reply to  David Bourget
1 month ago

Good to know. Thanks.

Nicolas Delon
1 month ago

This is fun. I’ve been playing around with it today. I’d be happy to referee too, but it won’t let me, even though I’ve exceeded the minimum requirements.

David Bourget
Reply to  Nicolas Delon
1 month ago

Thanks for contributing and for wanting to referee! We added a constraint that referees need to have at least 10 referee-vetted submissions, and because the referees are behind, that threshold has been hard to meet. Unfortunately, the extra constraint wasn’t explained everywhere.

Nicolas Delon
Reply to  David Bourget
1 month ago

Ten human-refereed submissions? Maybe that’s why, because a lot more than ten of my submissions have been validated.

Justin Smith-Ruiu
1 month ago

Tom Sawyer would have been good at getting others to “play” his AI-training “game” too:

“Say – I’m going in a-swimming, I am. Don’t you wish you could? But of course you’d druther work – wouldn’t you? Course you would!”
Tom contemplated the boy a bit, and said:
“What do you call work?”
“Why, ain’t that work?”
Tom resumed his whitewashing, and answered carelessly:
“Well, maybe it is, and maybe it ain’t. All I know, is, it suits Tom Sawyer.”
“Oh come, now, you don’t mean to let on that you like it?”
The brush continued to move.
“Like it? Well, I don’t see why I oughtn’t to like it. Does a boy get a chance to whitewash a fence every day?”
That put the thing in a new light. Ben stopped nibbling his apple. Tom swept his brush daintily back and forth – stepped back to note the effect – added a touch here and there – criticised the effect again – Ben watching every move and getting more and more interested, more and more absorbed. Presently he said:
“Say, Tom, let me whitewash a little.”

Nick
Reply to  Justin Smith-Ruiu
1 month ago

And then the two decided that for every board whitewashed, they would draw a little line in the sand next to their buckets. The one with more lines would be the ‘winner’ and would get to wear a little badge that Tom had been carrying around.