Lila: A Unified Benchmark for Mathematical Reasoning

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best performing model only obtains 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
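To make "solutions in the form of Python programs" concrete, here is a minimal sketch of what a program-form solution might look like. The word problem, the function name `solution`, and the variable names are all made up for illustration; they are not taken from LILA or any of its source datasets.

```python
# Hypothetical word problem (not from the benchmark):
# "A shopper buys 4 bags with 6 apples each and eats 5 apples.
#  How many apples are left?"

def solution() -> int:
    # Each intermediate quantity gets a name, so the program
    # doubles as an explanation of how the answer was reached.
    apples_per_bag = 6
    bags = 4
    eaten = 5
    total_bought = apples_per_bag * bags
    remaining = total_bought - eaten
    return remaining

print(solution())
```

The point of this format is that running the program yields the answer, while the named steps show the reasoning behind it.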

It’s pretty remarkable how good cutting-edge language models have gotten at answering difficult technical questions. We’re very close to AIs being able to beat the average human student on many undergrad or PhD STEM exams.

This project is impressive; they use LLMs to provide explanations of AI papers on request as you read them.

I wasn’t aware of explainpaper, actually works pretty well! I tried it on a technical paper (non-AI), and it worked decently.

I feel like technology like this could be super-useful to navigate law codes too.

Input: Poisson Editing. Even with the above support regions Lempitsky and Ivanov’s global adjustment does not eliminate all visible seams, see an example in Figure 11 (bottom row, center). Thus, subsequent to global adjustment we additionally perform local Poisson image editing [16]. Gal et al. [9] do this as well, but in a way that makes the computation prohibitively expensive: They Poisson edit complete texture patches, which results in huge linear systems (with > 10^7 variables for the largest patches in our datasets).

Output: The text is discussing a method of reducing seams in an image. The method is called Poisson Editing. Gal et al. [9] use a method that is very computationally expensive.

In a nutshell, kind of right. 🙂
