Improved test diagnosis page, SAML SSO, design refreshes, + more
🔎🩹 Quickly identify issues with the improved test diagnosis page
Diagnosing issues is a core part of the eval process, and that’s why we want to make sure our test diagnosis page is as helpful as possible. To make it easier to figure out why a test has been skipped or errored, we’ve added a list view where you can easily scan through test results and view any related error messages. We’ve added an overview so you can see the test result breakdown at a glance, as well as recent issues and failures. We’ve also made each section of the page collapsible for a smoother experience.
🔐 Make your login even more secure by enabling SAML SSO
We now support SAML SSO using any of the major providers so that you can make sure your team’s workspace has that extra layer of security.
Custom metrics, rotating API keys, and new models for direct-to-API calls
We understand that you may have metrics that are highly specific to your use case, and you want to use them alongside standard metrics to eval your AI systems. That’s why we built custom metrics. You can now upload any custom metric to Openlayer simply by specifying it in your openlayer.json file. These metrics can then be used in a number of ways on the platform: they’ll show up as metric tests that you can run, or as project-wide metrics that will be computed on all of your data. Now, your evals on Openlayer are more comprehensive than ever.
We’ve shipped more exciting features and improvements this month, including the ability to create multiple API keys and a bunch of new models available for direct-to-API calls, so be sure to read below for a full list of updates!
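For illustration only, here is a rough sketch of what a custom-metric entry in openlayer.json might contain, shown as an annotated TypeScript object so the assumptions can be called out inline. The key names ("metrics", "name", "path") and the file path are assumptions, not the confirmed schema; check the Openlayer docs for the exact format.

```typescript
// Hedged sketch of what a custom-metric entry in openlayer.json might contain.
// The key names ("metrics", "name", "path") and the file path are assumptions
// for illustration only; consult the Openlayer docs for the real schema.
const openlayerConfig = {
  metrics: [
    {
      name: "ticket_resolution_quality", // hypothetical custom metric name
      path: "./metrics/ticket_quality.py", // hypothetical path to its implementation
    },
  ],
};

// Serialize to the JSON that would live in openlayer.json at the project root.
console.log(JSON.stringify(openlayerConfig, null, 2));
```

Once a metric is declared this way and committed with your project, it would surface as a metric test you can run, or as a project-wide metric computed over all of your data.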
Improved quality control over your LLM’s responses with annotations and human feedback
Setting up alerts is an essential first step to monitoring your LLMs, but in order to understand why issues arise in production, it’s helpful to have humans review requests. This process is now easier than ever in Openlayer: you can add annotations, with custom values, to any request. If you’ve set up tracing, you can annotate each individual step of the trace for more granularity.
Every request can also be rated with a thumbs up or thumbs down, making it easy to scan through good and bad responses and figure out where your model is going wrong.
We’ve released some other huge features and improvements this month, so make sure to read the full changelog below!
Now you can download requests data right from the workspace. This is especially helpful if you’ve applied filters and want to download the filtered cohort of data
Updated navigation
Our navigation has a new layout featuring breadcrumbs at the top, making it much easier to navigate between projects and understand the hierarchy
Annotation and human feedback
You can now annotate any request with custom values. You can also give every request a thumbs up or thumbs down to make identifying error patterns even easier.
Most of us get how crucial AI evals are now. The thing is, almost all the eval platforms we’ve seen are clunky: there’s too much manual setup and adaptation needed, which breaks developers’ workflows.
Last week, we released a radically simpler workflow. You can now connect your GitHub repo to Openlayer, and every commit on GitHub will also commit to Openlayer, triggering your tests. You now have continuous evaluation without extra effort.
You can customize the workflow using our CLI and REST API. We also offer template repositories around common use cases to get you started quickly.
You can leverage the same setup to monitor your live AI systems after you deploy them. It’s just a matter of setting some variables, and your Openlayer tests will run on top of your live data and send alerts if they start failing.
We’re very excited for you to try out this new workflow, and as always, we’re here to help and all feedback is welcome.
We’re thrilled to share the latest update to Openlayer: comprehensive tracing capabilities and enhanced request streaming with function calling support.
Now, you can trace every step of a request to gain detailed insights in Openlayer. This granular view helps you debug and optimize performance.
Additionally, we’ve expanded our request streaming capabilities to include support for function calling. This means that requests you stream to Openlayer are no longer a black box, giving you improved control and flexibility.
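To make the tracing idea concrete, here is a minimal sketch of wrapping the individual steps of a request so each one shows up in the same trace. The `trace` helper and its import path are assumptions for illustration, not confirmed Openlayer API, and the `retrieve`/`generate` steps are hypothetical.

```typescript
// Hedged sketch: `trace` and its import path are assumed, not confirmed API.
// The idea is that each wrapped step is recorded as part of the same trace.
import { trace } from "openlayer/lib/tracing"; // assumed import path

// Hypothetical retrieval step: fetch context for the user's query.
const retrieve = trace(async (query: string): Promise<string[]> => {
  return [`passage relevant to "${query}"`];
});

// Hypothetical generation step: call your LLM with the retrieved context.
const generate = trace(async (query: string, context: string[]): Promise<string> => {
  return `answer to "${query}" grounded in ${context.length} passage(s)`;
});

export async function answer(query: string): Promise<string> {
  const context = await retrieve(query);
  return generate(query, context); // both steps appear as one trace in Openlayer
}
```

With both steps wrapped, the per-step inputs, outputs, and timings would be visible when you inspect the request, rather than just the final answer.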
We’ve added more ways to test latency. Beyond mean, max, and total, you can now test latency with minimum, median, 90th percentile, and 99th percentile metrics. Just head over to the Performance page and the new test types are there.
You can also create more granular data tests by applying subpopulation filters to run the tests on specific clusters of your data. Just add filters in the Data Integrity or Data Consistency pages, and the subpopulation will be applied.
Go deep on test result history and add multiple criteria to GPT evaluation tests
You can now click on any test to dive deep into the test result history. Select specific date ranges to see the requests from that time period, scrub through the graph to spot patterns over time, and get a full picture of performance.
We’ve also added the ability to add multiple criteria to GPT evaluation tests. Let’s say you’re using an LLM to parse customer support tickets and want to make sure every output contains the correct name, email address, and account ID for the customer. You can now set a unique threshold for each of these criteria in one test.
Cost-per-request, new tests, subpopulation support for data tests, and more precise row filtering
We’re excited to introduce the newest set of tests to hit Openlayer! Make sure column averages fall within a certain range with the Column average test. Ensure that your outputs contain specific keywords per request with our Column contains string test, where the values in Column B must contain the string values in Column A. Monitor and manage your costs by setting Max cost, Mean cost, and Total cost tests.
As additional support for managing costs, we now show you the cost of every request in the Requests page.
You can now filter data when creating integrity or consistency tests so that the results are calculated on specific subpopulations of your data, just like performance goals.
That’s not all, so make sure to read all the updates below. Join our Discord community to follow along on our development journey, and stay tuned for more updates from the changelog! 📩🤝
Log multi-turn interactions, sort and filter production requests, and token usage and latency graphs
Introducing support for multi-turn interactions. You can now log and refer back to the full chat history of each of your production requests in Openlayer. Sort by timestamp, token usage, or latency to dig deeper into your AI’s usage, and view graphs of these metrics over time.
There’s more: we now support Google’s new Gemini model. Try out the new model and compare its performance against others.
⬇️ Read the full changelog below for all the tweaks and improvements we’ve shipped over the last few weeks and, as always, stay closer to our development journey by joining our Discord!
Log multi-turn interactions in monitoring mode, and inspect individual production requests to view the full chat history alongside other metadata like token usage and latency
Sort and filter through your production requests
View a graph of the token usage and latency across all your requests over time
Support for Gemini is now available in-platform: experiment with Google’s new model and see how it performs on your tests
View row-by-row explanations for tests using GPT evaluation
Expanded the Openlayer TypeScript/JavaScript library to support all methods of logging requests, including those that use providers or workflows other than OpenAI
Improved commit selector shows the message and date published for each commit
New notifications for uploading reference datasets and for exceeding data limits in monitoring mode
Only send email notifications when test statuses have changed from the previous evaluation in monitoring
Added sample projects for monitoring
Enhancements to the onboarding, including a way to quickstart a monitoring project by sending a sample request through the UI
No longer navigate away from the current page when toggling between development and monitoring, unless the mode does not apply to the page
Allow reading and setting project descriptions from the UI
Update style of selected state for project mode toggles in the navigation pane for clarity
Clarify that thresholds involving percentages currently require inputting floats
Allow computing PPS tests for columns other than the features
Test results automatically update without having to refresh the page in monitoring mode
Add dates of last/next evaluation to monitoring projects and a loading indication when they recompute
Surface error messages when tests fail to compute
Add callouts for setting up notifications and viewing current usage against plan limits in the navigation
Graphs with only a single data point have a clearer representation now
Improvements to the experience of creating tests with lots of parameters/configuration
Improvements to the experience of creating Great Expectations tests
Add alert when using Openlayer on mobile
Default request volume, token usage, and latency graphs to monthly view
GPT evaluation, Great Expectations, real-time streaming, TypeScript support, and new docs
Openlayer now offers built-in GPT evaluation for your model outputs. You can write descriptive evaluations like “Make sure the outputs do not contain profanity,” and we will use an LLM to grade your agent or model against this criterion.
We also added support for creating and running tests from Great Expectations (GX). GX offers hundreds of unique tests on your data, which are now available in all your Openlayer projects. Besides these, there are many other new tests available across different project task types. View the full list below ⬇️
You can now stream data to Openlayer in real time rather than uploading in batch. Alongside this, there is a new page for viewing all your model’s requests in monitoring mode. You can now see a table of your model’s usage in real time, as well as metadata like token count and latency per row.
We’ve shipped the V1 of our new TypeScript client! You can use this to log your requests to Openlayer if you are using OpenAI as a provider directly. Later, we will expand this library to support other providers and use cases. If you are interested, reach out and we can prioritize.
Finally, we’re releasing a brand new http://docs.openlayer.com/ that offers more guidance on how to get the most out of Openlayer and features an updated, sleek UI.
As always, stay tuned for more updates and join our Discord community to be a part of our ongoing development journey 🤗
GPT evaluation
You can now create tests that rely on an LLM to evaluate your outputs given any descriptive criteria. Try it out by going to Create tests > Performance in either monitoring or development mode!
Great Expectations
We added support for Great Expectations tests, which allows you to create hundreds of new kinds of tests. To try it out, navigate to Create tests > Integrity in either monitoring or development mode.
New and improved data integrity & consistency tests
Class imbalance ratio (integrity) (tabular classification & text classification) — The ratio between the most common class and the least common class
Predictive power score (integrity) (tabular classification & tabular regression) — PPS for a feature (or index) must be within a specific range
Special characters ratio (integrity) (LLM & text classification) — Check the ratio of special characters to alphanumeric characters in the dataset
Feature missing values (integrity) (tabular classification & tabular regression) — Similar to null rows but for a specific feature: ensure specified features are not missing values
Quasi-constant features (integrity) (tabular classification & tabular regression) — Same as quasi-constant feature count but for a specific feature: expect specified features to be near-constant, with very low variance
Empty feature (integrity) (tabular classification & tabular regression) — Same as empty feature count but for a specific feature: expect specified features to not contain only null values
Updates to existing tests
Set percentages as the threshold for duplicate rows, null rows, conflicting labels, ill-formed rows, and train-val leakage tests
We’ve added a new endpoint for streaming your data to Openlayer rather than uploading in batch
The new requests page allows you to see a real-time stream of your model’s requests, and per-row metadata such as token count and latency
The new Openlayer TypeScript library allows users who are directly leveraging OpenAI to monitor their requests (see the sketch below, after these items)
Our brand new docs are live, with more guided walkthroughs and in-depth information on the Openlayer platform and API
We have decided that the word “test” is a more accurate representation than “goal”, and have updated all references in our product, docs, website, and sample notebooks
Polish and improvements to the new onboarding and navigation flows, including an updated “Getting started” page with more resources to help you get the most out of Openlayer
Creating a project in the UI is now presented as a modal
Creating a project in the UI opens up subsequent onboarding modals for adding an initial commit (development) or setting up an inference pipeline (monitoring)
Added commit statuses and button for adding new commits and inference pipelines to the navigation pane
Once a commit is added in development mode, new tests are suggested that are personalized to your model and data and identify critical failures and under-performing subpopulations
Added a clarifying tooltip on how to enable subpopulation filtering for performance tests in monitoring mode
Improved wording of various suggested test titles
Default test groupings appropriately by mode
Floating point thresholds were difficult for users to input
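As referenced in the TypeScript library item above, here is a hedged sketch of what logging OpenAI requests through the client could look like. The `traceOpenAI` helper and its import path are assumptions rather than confirmed API; the OpenAI calls themselves use the standard openai package.

```typescript
// Hedged sketch: `traceOpenAI` and its import path are assumed names, not
// confirmed Openlayer API; check the docs for the library's real entry point.
import OpenAI from "openai";
import { traceOpenAI } from "openlayer/lib/tracing"; // assumed import path

// Wrapping the client is meant to forward each request's prompt, completion,
// token counts, and latency to your Openlayer inference pipeline.
const openai = traceOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }));

async function main(): Promise<void> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: "Summarize this support ticket: ..." }],
  });
  console.log(completion.choices[0].message.content);
}

main();
```

Everything sent through the wrapped client then shows up on the Requests page alongside the per-row metadata described above.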
Enhanced onboarding, redesigned navigation, and new goals
We’re thrilled to announce a new and improved onboarding flow, designed to make your start with us even smoother. We’ve also completely redesigned the app navigation, making it more intuitive than ever.
You can now use several new consistency and integrity goals: fine-grained feature & label drift, dataset size ratios, new category checks, and more. These are described in more detail below.
You’ll also notice a range of improvements: new Slack and email notifications for monitoring projects, enhanced dark mode colors, and improved transactional email deliverability. We’ve reorganized several features for ease of use, including the subpopulation filter flow and the performance goal page layout.
If you’re working in dev mode, check out the dedicated commit page where you can view all the commit’s metadata and download your models and data to use locally.
Stay tuned for more updates and join our Discord community to be a part of our ongoing development journey. 🚀👥
Evals for LLMs, real-time monitoring, Slack notifications and so much more!
It’s been a couple of months since we posted our last update, but not without good reason! Our team has been cranking away at our two most requested features: support for LLMs and real-time monitoring / observability. We’re so excited to share that they are both finally here! 🚀
We’ve also added a Slack integration, so you can receive all your Openlayer notifications right where you work. Additionally, you’ll find tons of improvements and bug fixes that should make your experience using the app much smoother.
We’ve also upgraded all Sandbox accounts to a free Starter plan that allows you to create your own project in development and production mode. We hope you find this useful!
Join our Discord for more updates like this and get closer to our development journey!
Revamped onboarding for more guidance on how to get started quickly with Openlayer in development and production
Better names for suggested tests
Add a search bar to filter integrity and consistency goals on the create page
Reduce feature profile size for better app performance
Add a test activity item for when a suggestion is accepted
Improved commit history allows for better comparison of the changes in performance between versions of your model and data across chosen metrics and goals
Added indicators to the aggregate metrics on the project page showing how they have changed from the previous commit in development mode
Improved logic for skipping or failing tests that don’t apply
Updated design of the performance goal creation page for a more efficient and clear UX
Allow specifying MAPE as metric for the regression heatmap
Improvements to data tables throughout the app, including better performance and faster loading times
Improved UX for viewing performance insights across cohorts of your data in various distribution tables and graphs
Updated and added new tooltips throughout the app for better clarity of concepts
Regression projects, toasts, and artifact retrieval
This week we shipped a huge set of features and improvements, including our solution for regression projects!
Finally, you can use Openlayer to evaluate your tabular regression models. We’ve updated our suite of goals for these projects, added new metrics like mean squared error (MSE) and mean absolute error (MAE), and delivered a new set of tailored insights and visualizations such as residuals plots.
This update also includes an improved notification system: toasts that appear in the bottom right corner when creating or updating goals, projects, and commits. Now, you can create all your goals at once with fewer button clicks.
Last but not least, you can now download the models and datasets under a commit within the platform. Simply navigate to your commit history and click on the options icon to download artifacts. Never worry about losing track of your models or datasets again.
Sign in with Google, sample projects, mentions and more!
We are thrilled to release the first edition of our company’s changelog, marking an exciting new chapter in our journey. We strive for transparency and constant improvement, and this changelog will serve as a comprehensive record of all the noteworthy updates, enhancements, and fixes that we are constantly shipping. With these releases, we aim to foster a tighter collaboration with all our amazing users, ensuring you are up to date on the progress we make and exciting features we introduce. So without further ado, let’s dive into the new stuff!
Added support for mentioning users, goals, and commits in goal comments and descriptions — type @ to mention another user in your workspace, or # to mention a goal or commit
Added the ability to upload “shell” models (just the predictions on a dataset) without the model binary (the binary is only required for explainability, robustness, and text classification fairness goals)
Added ROC AUC to available project metrics
Added an overview page to browse and navigate to projects
Added an in-app onboarding flow to help new users get set up with their workspace
Added announcement bars for onboarding and workspace plan information