Last week, we benchmarked the new Claude 3.5 Sonnet (Upgraded) model on our five datasets. Here are the results:
- On LegalBench, it's now exactly tied with GPT-4o, and it beats GPT-4o on CorpFin and CaseLaw.
- It usually, but not always, performs a few percentage points better than the previous version - for example, on LegalBench (+1.3%), ContractLaw Overall (+0.5%), and CorpFin (+0.8%).
- In some instances it regressed, including TaxEval Free Response (-3.2%) and CaseLaw Overall (-0.1%).
- Although it's competitive with GPT-4o, it's still not at the level of OpenAI o1, which claims the top spots on almost all of our leaderboards.
You can view the full results at https://www.vals.ai
About us
- Website: https://vals.ai
- Industry: Software Development
- Company size: 2-10 employees
- Headquarters: San Francisco
- Type: Privately Held
Locations
- Primary: San Francisco, US
Updates
-
Vals AI reposted this
We're officially announcing the Vals Legal AI Report, studying the performance of the most widely used generative AI legal applications on questions sourced from top US law firms. Thank you for the support from Tara L. Waters, Nicola Shaver, and Jeroen Plink, as well as to Legaltech News for the coverage. If you are a lawyer and would like to contribute to the dataset, please get in touch.
There has been a proliferation of powerful new AI tools in legal tech, but no neutral, third-party evaluation on real questions from law firms. We aim to change this by collecting data from Am Law 100 firms, including Fisher Phillips and Reed Smith LLP, and benchmarking some of the top legal tech providers: Harvey, Thomson Reuters, LexisNexis, vLex, and Vecflow. You can read more on our website: https://www.vals.ai/vlair. https://lnkd.in/dxza7GUa
Law Firms, Legal Research Companies Collaborate With Vals AI on Gen AI Benchmarking Study
law.com
-
Vals AI reposted this
Vals AI, supported by LegalTechnology Hub, announced it is collaborating with a number of top US law firms, including Reed Smith LLP and Fisher Phillips; AI vendors, including Thomson Reuters, LexisNexis, Harvey, vLex, and Vecflow; and an ALSP, Cognia Law, to produce a new legal AI benchmarking study that will evaluate the accuracy and efficacy of some of the legal industry’s most used generative AI platforms. This marks the first time that multiple law firms and vendors have come together to objectively assess legal AI platform performance on real-world examples of legal tasks. The new study will focus on specific legal tasks across transactional, disputes, and advisory disciplines, including document-related tasks, case analysis, and legal and market research.
Rayan K., Co-Founder and CEO of Vals AI, said, “Vals AI, Legaltech Hub and the participating law firms and AI vendors share the common goals of increasing transparency within the legal industry and building trust in generative AI for legal practitioners. Following the release of our LegalBench report evaluating individual model performance on standard legal analyses, the natural next step was to apply our methodology and platform under more real-world conditions. We are thrilled to work with such a prestigious and forward-thinking intra-industry group. In undertaking this new study, we hope to push the market forward in its understanding and adoption of generative AI and offer a new standard for legal AI benchmarking.”
https://lnkd.in/e_uVQU-S
#LegalTechTalk #LegalTech #Legal #Law #AI #GenerativeAI #GenAI #Tech #LegalAI #Partnership
-
Vals AI reposted this
Multi-award winning Digital strategist | Legal innovator | Start-up adviser | Emerging tech evangelist
#Trust and #transparency are two of the most frequently mentioned concerns when discussing #GenAI for #legal. That is why I am so excited about this collaboration between Vals AI, LegalTechnology Hub, Thomson Reuters, LexisNexis, vLex, Harvey, Vecflow, Fisher Phillips, Reed Smith LLP, Cognia Law and others. This is the first (and hopefully not the last) study to benchmark the performance of both lawyers and multiple legal AI solutions across a range of real-world transactional, disputes and advisory tasks. My thanks to Rayan K. and Langston Nashold for inviting me to help support this important initiative!
Law Firms, Legal Research Companies Collaborate With Vals AI on Gen AI Benchmarking Study
law.com
-
We're excited to officially announce the release of our newest dataset, CaseLaw, in collaboration with the Jurisage Group. CaseLaw is based on court decisions from the Canadian legal system. With a trained law librarian, Holly James, MLIS, we’ve curated a completely private set of questions a lawyer could ask about these cases in the course of real practice, along with the set of correct answers. The model is expected to answer either by selecting explicit excerpts or by paraphrasing elements of the decision. The answer must be grounded in citations from the decision itself - it is insufficient to answer in general terms (see the sketch below). Unsurprisingly, o1 currently holds a high score on the leaderboard. You can view the full results, and some sample questions, at https://lnkd.in/dbfrR8zU
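The grounding rule described above - that an answer must cite the decision itself rather than speak in general terms - lends itself to a simple automated pre-check. Below is a minimal sketch of such a check in Python; it is an illustration of the criterion, not Vals AI's actual grader, and the function name and the double-quote convention for excerpts are assumptions.

```python
import re

def quotes_are_grounded(answer: str, decision_text: str) -> bool:
    """Return True if every double-quoted excerpt in the answer
    appears verbatim in the source decision."""
    excerpts = re.findall(r'"([^"]+)"', answer)
    # An answer with no quoted excerpts fails the grounding requirement
    # outright; otherwise every excerpt must occur in the decision text.
    return bool(excerpts) and all(e in decision_text for e in excerpts)
```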
-
We've finished benchmarking o1, and when it works, it works extremely well, setting new records on almost all of our benchmarks. See the full results at https://www.vals.ai.
🚀 It set a new state of the art on 4 of our 5 benchmarks. On TaxEval, it beat the previous best by double digits (73.2% accuracy, up from 60.1%).
🚀 It works especially well on numerical reasoning tasks and tasks that require reasoning across multiple parts of a legal document. On simpler tasks, the difference is less pronounced.
❗ It accepts no temperature setting and no system prompt, and it is much harder to limit the tokens in the final output. As a result, it is much harder to get it to produce output in the correct format (a sketch of what a call looks like under these constraints follows below).
💰 It is slow (>10s latency) and comes at a high cost ($15 / MTok input, $60 / MTok output). Output charges now also include intermediate "reasoning tokens".
Vals.ai: LegalBench
vals.ai
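For context on those constraints, here is a minimal sketch of an o1 call through the OpenAI Python SDK, as we understand the public API at the time of writing. The prompt and the token budget are illustrative assumptions, not our benchmark harness.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    # o1-preview rejects the "system" role, so formatting instructions
    # must be folded into the user prompt itself.
    messages=[{
        "role": "user",
        "content": (
            "Answer with only the letter of the correct option.\n\n"
            "Q: <benchmark question here>"  # illustrative placeholder
        ),
    }],
    # max_tokens is not supported for o1; max_completion_tokens caps the
    # sum of hidden reasoning tokens and visible output tokens, so too
    # small a budget can leave no room for the visible answer.
    max_completion_tokens=4096,
    # temperature, top_p, etc. are fixed for o1 and cannot be overridden.
)

print(response.choices[0].message.content)
# usage.completion_tokens includes the billed (but hidden) reasoning tokens.
print(response.usage.completion_tokens)
```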
-
We had a great time as panelists at the LTH conference last week, speaking about evaluating legal AI and how to mitigate risk in production Gen AI systems. Big thank you to the Legal Tech Hub team for all their hard work putting an event like this together: Nicola Shaver, Jeroen Plink, Cheryl Wilson Griffin, and Deborah Tesser. We also enjoyed the opportunity to connect (and re-connect) with others in the legal tech space: Rutvik Rau, Thomas Bueler-Faudree, Isha Marathe, Evan Shenkman, Elizabeth Wilkinson, Lauren Hudson, Daniel Hoadley, Max Junestrand, and many others!
-
Vals AI reposted this
Since I can't let Campbell have all of the Day 0 fun regarding OpenAI's o1-preview announcement:
1. The naming convention is a travesty. We went from gpt-{number}-{suffix} to gpt-{number}{letter}-{suffix} to now {letter}{number}-{suffix}. 😫🤮
2. o1 works through RL finetuning that bakes agentic planning/reasoning into a "Reasoning" generation phase, which helps with complex reasoning tasks. These Reasoning tokens are then discarded, and only the post-Reasoning answer is returned to the user. Anthropic is rumored to use a similar technique for their Claude.ai chatbot model, which uses <thinking> tokens to plan that are not revealed in the answer but can be exposed through complicated prompt engineering.
2a. Obscuring the Reasoning tokens feels like a play to "own" agentic reasoning behind a moat, while charging a premium for it. They can optimize and differentiate the 4o models for cost-efficient zero-shot performance, with the 🤑premium🤑 o1 reasoning models as the high-end offering.
2b. All of this Reasoning means increased cost and inference time. To support it, the o1 models now have 32k output limits. Since input, output, and reasoning tokens share the same pool, this could mean reserving a lot more output tokens to prevent truncated answers (via the new "max_completion_tokens" API parameter added to support the o1 models). This isn't likely to matter most of the time, and it's hard to know for sure with the Reasoning tokens obscured (some rough cost math follows below).
3. I don't have a Tier 5 account, so I have to wait to set my credit card on fire, but ChatGPT+ has the o1 models selectable today. Just keep in mind they are limited to 30/50 requests per WEEK right now. This will likely get raised soon, but for now, make them count!
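To make point 2b concrete, here is a quick back-of-the-envelope cost helper, assuming the pricing quoted in the post above ($15 / MTok input, $60 / MTok output) and that hidden reasoning tokens are billed at the output rate. The function is a hypothetical illustration, not part of any SDK.

```python
def o1_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one o1 call in USD. completion_tokens should
    include reasoning tokens (the API folds them into the completion count)."""
    return prompt_tokens / 1e6 * 15.00 + completion_tokens / 1e6 * 60.00

# Example: a 2,000-token prompt that burns 5,000 hidden reasoning tokens
# plus 500 visible output tokens costs roughly $0.36.
print(f"${o1_cost_usd(2_000, 5_500):.2f}")
```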
-
Super exciting work released by Harvey yesterday! We believe that strong evals are the foundation of great products, and in-house evals are an essential step toward this. It's also important that benchmarks reflect real-world workflows - in this case, BigLaw Bench represents work that real lawyers do daily. There’s still plenty of work to do around neutral, third-party review. As alluded to in their appendix, we’re looking forward to collaborative efforts to develop industry-standard benchmarks for legal tasks through our vals.ai effort.
Excited to announce BigLaw Bench, a new standard to evaluate legal AI systems based on real-world billable work that lawyers actually do. We define performance on this benchmark as "What % of a lawyer-quality work product does the model complete for the user?" Harvey’s AI systems outperform leading foundation models on domain-specific tasks, producing 74% of a final, expert lawyer-quality work product. The outputs are more detailed, capture more nuance, and are much closer to final lawyer quality. More details and performance on different tasks coming soon. https://lnkd.in/gV9BSEsB
Introducing BigLaw Bench
harvey.ai