Project post
I need an experienced data scientist or AI specialist to assist with evaluating a Korean dataset using OpenAI API calls. My goal is to set up a solid testing template that I can repeat.
Key Aspects:
- Inference Speed: Balancing for optimal inference speed is vital for this project. The evaluation should sustain a moderate inference speed at a load of 2,000 users.
- Total Processing Accuracy (TPA): The total processing accuracy of this evaluation must be high. We aim for a balanced trade-off between accuracy and inference speed; under 1 second per request is imperative (a minimal latency sketch follows this list).
- Implement this in a shared Google Colab or Jupyter Notebook for on-demand use of the application.
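A minimal sketch of how per-request latency could be measured against the OpenAI API; the model name, prompt, and the 1-second threshold check are illustrative assumptions:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def timed_completion(prompt: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Send one chat completion and return (answer, latency in seconds)."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,  # placeholder; swap in the model under test
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    return response.choices[0].message.content, latency

answer, latency = timed_completion("안녕하세요? 한 문장으로 자기소개를 해 주세요.")
print(f"latency: {latency:.2f}s, under 1s: {latency < 1.0}")
```

For the 2,000-user target, the same function can be fanned out with a thread pool or asyncio and the latency distribution (p50/p95) recorded rather than a single sample.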
Ideal Skills and Experience:
- Proficiency in Korean language and understanding of language nuances.
- Previous experience with data evaluation using OpenAI API calls.
- Proven track record in achieving optimal inference speeds for data processing.
- Demonstrated ability to balance and optimize processing accuracy alongside inference speed.
- Strong communication skills to provide regular updates on the project’s progress.
Refer to this documentation: https://python.langchain.com/docs/langsmith/walkthrough/
Request access to the example document, where all testing data and instructions should be maintained and updated: https://docs.google.com/document/d/16GFgABbbnH7RXgL56SXNu41xRspstgtVcy0TQxmwd-A/edit?usp=drive_link
API example: https://api-engtoprod.meta-wedit.com/api-docs#/USER%20API/UsersController_perpectConversation
Deliverables:
- The test procedure as described and per our discussion, in the format of the sample Google Doc.
- A set of scripts: JMeter configs, plus Python for the accuracy and inference tests (ideally in Colab); a JMeter-runner sketch follows this list.
- A working PC implementation to run the tests, on both a machine and a VM (dockerized), so the same testing environment can be replicated repeatedly.
- Full documentation of the test procedure: steps, screenshots, and reference links for running the tests and for the setups.
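A minimal sketch of driving the JMeter load test from Python so runs are repeatable; the test-plan and output paths are hypothetical:

```python
import subprocess
from pathlib import Path

PLAN = Path("plans/korean_eval.jmx")   # hypothetical JMeter test plan
RESULTS = Path("results/run.jtl")      # raw sample log
REPORT = Path("results/report")        # HTML dashboard (must not already exist)

RESULTS.parent.mkdir(parents=True, exist_ok=True)

# Non-GUI run: -n no GUI, -t test plan, -l results file,
# -e generate the HTML report, -o report output directory.
subprocess.run(
    ["jmeter", "-n", "-t", str(PLAN), "-l", str(RESULTS), "-e", "-o", str(REPORT)],
    check=True,
)
print(f"JMeter results in {RESULTS}, report in {REPORT}")
```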
Revised requirements
Key responsibilities:
- Conducting performance tests on LLMs.
- Analysing the results and comparing them to identify the best performing system.
- Providing recommendations on how to optimize the performance of the chosen LLM.
- Set up a local testing environment (chatbot, playground dashboard, JMeter 5.5)
- Set up a cloud server (API server)
- Google Colab, Jupyter Notebook
- All configuration and test scripts (in Python, Bash, and exe)
- Dockerized images for future implementation
- Our Korean dataset of 20,000 entries, based on an existing Korean dataset and published to a Hugging Face workspace (see the publishing sketch after this list)
- Documentation: 1. testing items/procedures; 2. instructions and manuals for this project
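A minimal sketch, assuming the 20,000-entry dataset sits in a local JSONL file, of publishing it to the Hugging Face Hub; the file name and repo id are placeholders:

```python
from datasets import Dataset  # pip install datasets

# Hypothetical local file, one JSON object per line,
# e.g. {"instruction": "...", "output": "..."} in Korean.
ds = Dataset.from_json("korean_eval_20k.jsonl")
print(ds)  # sanity-check the row count (~20,000) and columns

# Requires `huggingface-cli login` first; the repo id is a placeholder.
ds.push_to_hub("your-workspace/korean-eval-20k")
```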
Models to consider:
llama3-8b-8192
text-davinci-003
Mixtral 8x22b
Whisper
Reference:
https://pf7.eggs.or.kr/aigenerative_overview.html
Sample testing items (please send a permission request with your name):
https://docs.google.com/document/d/16GFgABbbnH7RXgL56SXNu41xRspstgtVcy0TQxmwd-A/edit
Project resources
Google Drive
Overview: | TestItems | uconc | q&aDataset / hfConverse | hfQ&A / hfQ&A2 | projectBlog | hfQ&A3
datasets
hfDataset: check under Metrics for the various testing items
Hugging Face Korean datasets (loading sketch below):
- Daily conversation (일상대화)
- Various Q&A 1 (다양한 질의답1)
- Naver Knowledge-iN Q&A (네이버 지식인 질의답)
- Various Q&A 2 (다양한 질의답2): https://huggingface.co/datasets/unoooo/alpaca-korean
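A minimal sketch of loading the linked dataset for evaluation; the column names assume the usual Alpaca-style layout and should be verified against the actual schema:

```python
from datasets import load_dataset  # pip install datasets

# Load the Korean Alpaca-style dataset linked above.
ds = load_dataset("unoooo/alpaca-korean", split="train")
print(ds.column_names, len(ds))

# Assuming Alpaca-style columns; adjust to the actual schema.
example = ds[0]
prompt = example.get("instruction", "")
reference = example.get("output", "")
print(prompt[:100], "->", reference[:100])
```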
our dataset
other models
Models to consider: llama3-8b-8192 | text-davinci-003 | Mixtral 8x22b | Whisper
AWS server
Windows Server port: via MS Remote Desktop; the instance is accessible only through the AWS console
Windows Server
Setup, JMeter
Credentials and IP
mlOps
[JupyterNotebook]
LLM candidates
Models to consider: | llama3-8b-8192 | text-davinci-003 | Mixtral 8x22b | Whisper |
Mistral: Mistral LLM API docs
GroqCloud and its playground console (see the endpoint sketch after this list)
NCSoft VARCO: a Korean LLM by NCSoft, based on the VARCO LLM and Kendra
GPT4All: a locally hosted, open-source chatbot
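Several candidates are reachable through OpenAI-compatible endpoints (Groq hosts llama3-8b-8192, Mistral hosts the Mixtral models), so one client can query them all; the base URLs follow the providers' documentation, and the model ids should be confirmed in each console:

```python
import os
from openai import OpenAI  # pip install openai

# OpenAI-compatible endpoints; verify base URLs and model ids per provider.
PROVIDERS = {
    "groq": ("https://api.groq.com/openai/v1", "GROQ_API_KEY", "llama3-8b-8192"),
    "mistral": ("https://api.mistral.ai/v1", "MISTRAL_API_KEY", "open-mixtral-8x22b"),
}

def ask(provider: str, prompt: str) -> str:
    base_url, key_env, model = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for name in PROVIDERS:
    print(name, "->", ask(name, "한국어로 한 문장으로 자기소개를 해 주세요."))
```

Note that Whisper is a speech-to-text model rather than a chat LLM, so it would need a separate audio-transcription test path.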
Day 1, Day 2, Day 3, Day 4, Day 5, Day 6
Day 1
- Shared creds with freelancers
- Tasks assigned among the group
- AWS Windows Server instance
- Completed the testing PC on the server
Day 2
- Application Server
- API server
- Lib install and setup
- Download LLM models
- Write scripts
Day 3 (today)
- Write scripts using quantization of the LLM models (see the quantized-loading sketch after this log)
- Set up the application server
- Python library issue
- Model candidates per quantization
- Flask app installed
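A minimal sketch of loading one candidate with 4-bit quantization via transformers and bitsandbytes; the model id is a placeholder and a CUDA GPU is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# pip install transformers accelerate bitsandbytes

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder candidate

# 4-bit NF4 quantization so the model fits on a smaller GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("서울에 대해 한 문장으로 설명해 주세요.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```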
Performance test
Metrics
Perplexity, accuracy, WER (see the metrics sketch after the tree below)
├───metrics
│ ├───accuracy
│ ├───bertscore
│ ├───bleu
│ ├───bleurt
│ ├───cer
│ ├───chrf
│ ├───code_eval
│ ├───comet
│ ├───competition_math
│ ├───coval
│ ├───cuad
│ ├───exact_match
│ ├───f1
│ ├───frugalscore
│ ├───glue
│ ├───google_bleu
│ ├───indic_glue
│ ├───mae
│ ├───mahalanobis
│ ├───matthews_correlation
│ ├───mauve
│ ├───mean_iou
│ ├───meteor
│ ├───mse
│ ├───pearsonr
│ ├───perplexity
│ ├───precision
│ ├───recall
│ ├───roc_auc
│ ├───rouge
│ ├───sacrebleu
│ ├───sari
│ ├───seqeval
│ ├───spearmanr
│ ├───squad
│ ├───squad_v2
│ ├───super_glue
│ ├───ter
│ ├───wer
│ ├───wiki_split
│ ├───xnli
│ └───xtreme_s
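The tree above appears to match the metric modules shipped with the Hugging Face evaluate library. A minimal sketch of scoring outputs with the three metrics named above (WER, accuracy, perplexity), where the example predictions and references are made up:

```python
import evaluate  # pip install evaluate jiwer scikit-learn

# Hypothetical model outputs vs. gold references from the Korean test set.
predictions = ["서울은 한국의 수도입니다.", "고양이는 동물입니다."]
references = ["서울은 대한민국의 수도입니다.", "고양이는 동물입니다."]

# Word error rate over the generated text.
wer = evaluate.load("wer")
print("WER:", wer.compute(predictions=predictions, references=references))

# Accuracy expects label ids, e.g. 1 = answer judged correct, 0 = incorrect.
accuracy = evaluate.load("accuracy")
print("Accuracy:", accuracy.compute(predictions=[0, 1], references=[1, 1]))

# Perplexity scores text with a causal LM; the model_id is a placeholder.
# perplexity = evaluate.load("perplexity", module_type="metric")
# print(perplexity.compute(predictions=predictions, model_id="gpt2"))
```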