Sklearn Text Classifier model for Label Studio
The Sklearn Text Classifier model is a custom machine learning backend for Label Studio. It uses a Logistic Regression model from the Scikit-learn library to classify text data. This model is particularly useful for text classification tasks in Label Studio, providing an efficient way to generate pre-annotations based on the model’s predictions.
The model is trained on the labeled texts collected in Label Studio, and it uses the Label Studio API to fetch the labeled tasks for training. Because of this integration, the model can be retrained and updated as new labeled data becomes available.
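As a rough illustration of the kind of classifier described above, the following minimal sketch fits a Logistic Regression model on a couple of labeled texts with Scikit-learn. The TF-IDF featurization, the toy texts, and the helper name train_text_classifier are assumptions made for illustration; they are not confirmed details of this backend's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_text_classifier(texts, labels, C=10.0):
    """Fit a TF-IDF + Logistic Regression pipeline (hypothetical helper)."""
    # C is the inverse regularization strength; 10 mirrors the default
    # documented for LOGISTIC_REGRESSION_C.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=C))
    model.fit(texts, labels)
    return model


# Toy data standing in for labeled tasks fetched from Label Studio.
texts = ["great movie, loved it", "terrible film, hated it"]
labels = ["positive", "negative"]
model = train_text_classifier(texts, labels)

# Predict a label and a confidence score for an unseen text.
query = ["loved this great movie"]
prediction = model.predict(query)[0]
score = float(model.predict_proba(query).max())
print(prediction, round(score, 3))
```

In a pre-annotation workflow, the predicted label and its probability would be returned to Label Studio as the suggested choice and its score.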
Before you begin
You must install the Label Studio ML backend before you start.
This tutorial uses the sklearn_text_classifier example.
Labeling configuration
The Sklearn Text Classifier model is designed to work with the default labeling configuration for text classification in Label Studio. This configuration includes a single <Choices> output and a single <Text> input. The model retrieves the first occurrence of each of these tags from the labeling configuration and uses them for its predictions:
<View>
<Text name="text" value="$text" />
<Choices name="label" toName="text" choice="single" showInLine="true">
<Choice value="positive" />
<Choice value="negative" />
</Choices>
</View>
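To make "the first occurrence of these tags" concrete, here is a minimal sketch that extracts the first <Text> and <Choices> tags from the configuration above using only the Python standard library. The function name first_text_and_choices is hypothetical, and the backend itself may parse the configuration differently.

```python
import xml.etree.ElementTree as ET

LABEL_CONFIG = """
<View>
  <Text name="text" value="$text" />
  <Choices name="label" toName="text" choice="single" showInLine="true">
    <Choice value="positive" />
    <Choice value="negative" />
  </Choices>
</View>
"""


def first_text_and_choices(config_xml):
    """Return (from_name, to_name, data_key, labels) for the first tags found."""
    root = ET.fromstring(config_xml.strip())
    # find() returns the first matching element in document order.
    text_tag = root.find(".//Text")
    choices_tag = root.find(".//Choices")
    if text_tag is None or choices_tag is None:
        raise ValueError("config needs a <Text> tag and a <Choices> tag")
    labels = [choice.get("value") for choice in choices_tag.findall("Choice")]
    # value="$text" refers to the "text" key of each task's data.
    data_key = text_tag.get("value").lstrip("$")
    return choices_tag.get("name"), text_tag.get("name"), data_key, labels


print(first_text_and_choices(LABEL_CONFIG))
```

The returned names identify which control tag the prediction belongs to, which object tag it annotates, and which task field holds the input text.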
Note that you must set the LABEL_STUDIO_HOST and LABEL_STUDIO_API_KEY environment variables in order to download examples for training the model. These variables should contain the URL of your Label Studio instance and its API key, respectively. For more information about finding your Label Studio API key, see our documentation.
For training, you must label at least 2 examples with different labels.
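The requirement above can be checked before training is attempted. The sketch below collects (text, label) pairs from tasks and verifies that at least two distinct labels are present. It assumes tasks follow the usual Label Studio JSON layout (data, annotations, result, value.choices); the helper names collect_examples and has_enough_labels are hypothetical.

```python
def collect_examples(tasks, data_key="text"):
    """Gather (texts, labels) from tasks in Label Studio JSON form (assumed layout)."""
    texts, labels = [], []
    for task in tasks:
        for annotation in task.get("annotations", []):
            # Skipped annotations carry no usable label.
            if annotation.get("was_cancelled"):
                continue
            for item in annotation.get("result", []):
                choices = item.get("value", {}).get("choices")
                if choices:
                    texts.append(task["data"][data_key])
                    labels.append(choices[0])
    return texts, labels


def has_enough_labels(labels):
    """A classifier needs at least two examples with different labels."""
    return len(set(labels)) >= 2


sample_tasks = [
    {"data": {"text": "I love it"},
     "annotations": [{"result": [{"value": {"choices": ["positive"]}}]}]},
    {"data": {"text": "I hate it"},
     "annotations": [{"result": [{"value": {"choices": ["negative"]}}]}]},
    {"data": {"text": "No label yet"}, "annotations": []},
]
texts, labels = collect_examples(sample_tasks)
print(labels, has_enough_labels(labels))
```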
Running with Docker (recommended)
- Start the Machine Learning backend on http://localhost:9090 with the prebuilt image:
docker-compose up
- Validate that the backend is running:
$ curl http://localhost:9090/
{"status":"UP"}
- Create a project in Label Studio. Then from the Model page in the project settings, connect the model. The default URL is http://localhost:9090.
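The health check from the steps above can also be scripted, for example to wait for the backend before connecting it to a project. This is a minimal sketch using only the standard library; the function names is_up and check_backend are hypothetical, and the expected response body is the one shown in the curl example.

```python
import json
import urllib.request


def is_up(body):
    """Return True if a health response body reports status UP."""
    try:
        return json.loads(body).get("status") == "UP"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_backend(url="http://localhost:9090/", timeout=5):
    """Fetch the health endpoint; return False if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.read().decode("utf-8"))
    except OSError:
        # Covers connection errors and timeouts raised by urllib.
        return False


if __name__ == "__main__":
    print("backend up:", check_backend())
```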
Building from source (advanced)
To build the ML backend from source, you have to clone the repository and build the Docker image:
docker-compose build
Running without Docker (advanced)
To run the ML backend without Docker, you have to clone the repository and install all dependencies using pip:
python -m venv ml-backend
source ml-backend/bin/activate
pip install -r requirements.txt
Then you can start the ML backend:
label-studio-ml start ./dir_with_your_model
Configuration
Parameters can be set in docker-compose.yml before running the container.
The following common parameters are available. Each one is set through the environment variable of the same name:
- LOGISTIC_REGRESSION_C: The inverse regularization strength for Logistic Regression. It is a float value and defaults to 10 if not set.
- LABEL_STUDIO_HOST: The host URL of Label Studio, used for training. It defaults to http://localhost:8080 if not set.
- LABEL_STUDIO_API_KEY: The API key for Label Studio, used for training. There is no default value, so it must be set.
- START_TRAINING_EACH_N_UPDATES: The number of updates after which training starts. It is an integer value and defaults to 10 if not set.
- BASIC_AUTH_USER: The basic auth user for the model server.
- BASIC_AUTH_PASS: The basic auth password for the model server.
- LOG_LEVEL: The log level for the model server.
- WORKERS: The number of workers for the model server.
- THREADS: The number of threads for the model server.
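To show how the training-related parameters and their documented defaults fit together, here is a minimal sketch that reads them from the environment. The function name load_settings and the dictionary keys are hypothetical; only the variable names and default values come from the list above.

```python
import os


def load_settings(env=None):
    """Read training settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("LABEL_STUDIO_API_KEY")
    if not api_key:
        # No default exists for the API key, so fail early with a clear message.
        raise RuntimeError("LABEL_STUDIO_API_KEY must be set for training")
    return {
        "C": float(env.get("LOGISTIC_REGRESSION_C", 10)),
        "host": env.get("LABEL_STUDIO_HOST", "http://localhost:8080"),
        "api_key": api_key,
        "train_every": int(env.get("START_TRAINING_EACH_N_UPDATES", 10)),
    }


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"LABEL_STUDIO_API_KEY": "example-key"})
print(settings)
```

Passing a plain dictionary instead of os.environ keeps the helper easy to exercise without changing the process environment.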
Customization
The ML backend can be customized by adding your own models and logic inside the ./dir_with_your_model directory.