Scoring Framework
There are three main data structures that define PearWiseAI's scoring framework: Interactions, Sessions, and Scores.
Interaction
An Interaction in PearWiseAI is defined as a pair of model inputs and outputs. It represents a single instance where the model processes input data and produces a corresponding output. Interactions are the fundamental units of evaluation.
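As an illustration, you can think of an Interaction as a record pairing one input with one output. The sketch below is a hypothetical Python representation for intuition only, not the PearWiseAI SDK:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One model input/output pair -- the fundamental unit of evaluation."""
    input: str   # what was sent to the model, e.g. a user prompt
    output: str  # what the model produced in response

interaction = Interaction(
    input="Summarize the attached essay in two sentences.",
    output="The essay argues that ...",
)
```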
Session
A Session consists of one or more Interactions, providing broader context for model evaluation. It allows you to aggregate and analyze a model's performance over a series of interactions. Sessions are useful because the surrounding context often determines how good an individual interaction really is.
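Continuing the same hypothetical sketch (the Interaction class is repeated so the snippet stays self-contained; the field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    input: str
    output: str

@dataclass
class Session:
    """Groups related Interactions so they can be evaluated in context."""
    session_id: str
    interactions: list[Interaction] = field(default_factory=list)

session = Session(session_id="chat-42")
session.interactions.append(Interaction("Hi, what can you do?", "I can help with ..."))
session.interactions.append(Interaction("Summarize this essay.", "The essay argues ..."))
```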
Scores
Multiple scores can be attached to each Interaction. The scoring system is flexible and accommodates various aspects of model performance.
We are also looking to support attaching comments to scores in the future.
Every score needs a name. The name identifies what the score measures and lets you define different scoring rubrics. Here are some examples:
- User Thumbs Up: Recording whether users gave a message a thumbs up or a thumbs down.
- Essay Scoring Rubrics A: Recording the scores teachers assigned to an essay generated by the model.
- User Responsiveness: Measuring how quickly a user took another action in response to the output.
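As a rough sketch, the names above might correspond to score records along these lines (the field names are illustrative assumptions, not the actual PearWiseAI API):

```python
# Hypothetical score records, all attached to the same Interaction.
scores = [
    {"name": "User Thumbs Up", "value": "ThumbsUp"},
    {"name": "Essay Scoring Rubrics A", "value": 78.5},
    {"name": "User Responsiveness", "value": 3.2},  # e.g. seconds until the next user action
]
```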
Continuous vs Discrete Scores
There are two types of scores: Continuous and Discrete. Continuous scores take values in a numeric range, e.g. -100 to 100. Discrete scores are class labels instead, e.g. "ThumbsUp" and "ThumbsDown". It's important to differentiate the two, as our EvaluatorAI handles them differently.
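A minimal sketch of how the two kinds of score might be modelled, assuming hypothetical type definitions rather than the actual EvaluatorAI interface:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ContinuousScore:
    name: str
    value: float  # anywhere in a numeric range, e.g. -100.0 to 100.0

@dataclass
class DiscreteScore:
    name: str
    label: str    # one of a fixed set of classes, e.g. "ThumbsUp" / "ThumbsDown"

Score = Union[ContinuousScore, DiscreteScore]

rubric = ContinuousScore(name="Essay Scoring Rubrics A", value=78.5)
feedback = DiscreteScore(name="User Thumbs Up", label="ThumbsUp")
```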
Collecting Scores
To kickstart the training process, we need some training data for your EvaluatorAI. The more data you can provide, the better: more data allows EvaluatorAI to be more accurate and confident in its score predictions.
There are two main ways to collect these scores: explicitly or implicitly.
Explicit Score Collection
Explicit score collection captures scores from deliberate actions taken specifically to evaluate the model. Here are some examples:
- Domain Expert Scoring - you may want to provide an evaluation dashboard for your domain experts to score a model's interactions against a predefined rubric.
- User Feedback Scoring - you may want to add simple thumbs up and thumbs down buttons to your chat application that let your users score model interactions as a means of providing feedback to you (see the sketch after this list).
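As a concrete sketch of the thumbs up/down case, a feedback handler might turn the button press into a discrete score record (the function and field names are hypothetical):

```python
# Hypothetical handler wired to thumbs up/down buttons in a chat UI.
def on_user_feedback(interaction_id: str, thumbs_up: bool) -> dict:
    """Turn an explicit user action into a discrete score record."""
    return {
        "interaction_id": interaction_id,
        "name": "User Thumbs Up",
        "label": "ThumbsUp" if thumbs_up else "ThumbsDown",
    }

record = on_user_feedback("int-1029", thumbs_up=True)
# The record would then be stored and used as training data for your EvaluatorAI.
```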
Implicit Score Collection
Implicit score collection derives scores from user behaviour that you measure, without asking anyone for an evaluation directly. Here are some examples:
- User Action Completed - you may want to assign a session a positive score if the intended user action was completed successfully, and a negative score if the user abandoned the user flow midway (see the sketch after this list).
- User Responses - you may want to assign a session a score based on the number of user and model interactions that occurred. Depending on your use case, fewer user interactions can mean a higher score, e.g. a poor explainer model requires students to repeatedly ask for further clarification.
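A sketch of the first case, deriving a continuous session score from whether the user flow completed (the names and score range are illustrative assumptions):

```python
# Hypothetical: convert observed behaviour into a continuous session score.
def score_session_completion(completed: bool) -> dict:
    """Positive score for a completed flow, negative for abandonment."""
    return {
        "name": "User Action Completed",
        "value": 100.0 if completed else -100.0,
    }

print(score_session_completion(completed=False))
# {'name': 'User Action Completed', 'value': -100.0}
```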
Conclusion
That's it! That's everything you need to know to get started. For further enquiries, email us at hello@pearwise.dev