Add selectable tokenizer support for Ooba (#281)

# PR Checklist
- [ ] Did you check if it works normally in all models? *ignore this
when it doesn't use models*
- [ ] Did you check if it works normally in all of web, local and node
hosted versions? If it doesn't, did you block it in those versions?
- [ ] Did you add a type def?

# Description
I made a small code change that lets the user choose the tokenizer.

As I wrote in https://github.com/kwaroran/RisuAI/issues/280, differences
between tokenizers cause errors when using Mistral-based models.


![image](https://github.com/kwaroran/RisuAI/assets/62899533/3eb07735-874f-46d0-bc0c-c92a32ef927b)
As I'm not good at JavaScript, I implemented this simply: the user writes
the name of the tokenizer model, and tokenizer.ts selects the matching one.

I tested it on my node-hosted RisuAI by sending a long context to my own server.

![image](https://github.com/kwaroran/RisuAI/assets/62899533/5b1f22a0-5b1b-4472-a994-bfe5472ba159)
As a result, ooba returned 15858 prompt tokens.


![image](https://github.com/kwaroran/RisuAI/assets/62899533/6d4c2185-07c9-4de1-8460-0983b6e45141)
And when I tested with the official tokenizer implementations, they showed a
difference of about 1k tokens between the llama tokenizer and the mistral tokenizer.

So I think adding this option will help users use oobabooga with fewer
errors.
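
To illustrate why a gap of about 1k tokens matters, here is a minimal sketch of a context-budget check. It is not code from RisuAI: `fitsContext`, the 16000-token limit and the 512-token reply reserve are assumptions for illustration only. 15858 is the prompt token count ooba reported in the test above; the client-side estimate assumes the wrong tokenizer under-counts by 1k, and the direction of the gap is also an assumption.

```typescript
// Hypothetical helper, not part of RisuAI: checks a prompt against a context budget.
function fitsContext(promptTokens: number, maxContext: number, reservedForReply: number): boolean {
    return promptTokens + reservedForReply <= maxContext
}

const serverCount = 15858                 // prompt tokens reported by ooba in the test above
const clientEstimate = serverCount - 1000 // assumed: the client tokenizer under-counts by about 1k
const maxContext = 16000                  // assumed limit, for illustration only
const reservedForReply = 512              // assumed reply budget, for illustration only

// The client believes the prompt fits, while the server-side count says it does not.
const clientSaysFits = fitsContext(clientEstimate, maxContext, reservedForReply)
const serverSaysFits = fitsContext(serverCount, maxContext, reservedForReply)
```

Under these assumed numbers the client would send a prompt that the server then has to truncate or reject, which is the kind of error a matching tokenizer avoids.
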
kwaroran
2024-01-06 19:16:58 +09:00
committed by GitHub
3 changed files with 15 additions and 2 deletions

@@ -61,6 +61,8 @@
 <OptionalInput marginBottom={true} bind:value={$DataBase.reverseProxyOobaArgs.chat_instruct_command} />
 {/if}
 {/if}
+<span class="text-textcolor">tokenizer</span>
+<OptionalInput marginBottom={true} bind:value={$DataBase.reverseProxyOobaArgs.tokenizer} />
 <span class="text-textcolor">min_p</span>
 <OptionalInput marginBottom={true} bind:value={$DataBase.reverseProxyOobaArgs.min_p} numberMode />
 <span class="text-textcolor">top_k</span>

@@ -11,6 +11,7 @@ export interface OobaChatCompletionRequestParams {
     greeting?: string
     chat_instruct_command?: string
     preset?: string; // The '?' denotes that the property is optional
+    tokenizer?: string;
     min_p?: number;
     top_k?: number;
     repetition_penalty?: number;

@@ -24,10 +24,20 @@ async function encode(data:string):Promise<(number[]|Uint32Array|Int32Array)>{
     if(db.aiModel.startsWith('local_') ||
         db.aiModel === 'mancer' ||
         db.aiModel === 'textgen_webui' ||
-        (db.aiModel === 'reverse_proxy' && db.reverseProxyOobaMode ||
-        db.aiModel === 'ooba')){
+        (db.aiModel === 'reverse_proxy' && db.reverseProxyOobaMode)){
         return await tokenizeWebTokenizers(data, 'llama')
     }
+    if(db.aiModel === 'ooba'){
+        if(db.reverseProxyOobaArgs.tokenizer === 'mixtral' || db.reverseProxyOobaArgs.tokenizer === 'mistral'){
+            return await tokenizeWebTokenizers(data, 'mistral')
+        }
+        else if(db.reverseProxyOobaArgs.tokenizer === 'llama'){
+            return await tokenizeWebTokenizers(data, 'llama')
+        }
+        else{
+            return await tokenizeWebTokenizers(data, 'llama')
+        }
+    }
     return await tikJS(data)
 }
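
The tokenizer selection for `ooba` in the hunk above could also be written as a lookup table. The following is a minimal sketch, not the code in this commit: `WebTokenizer`, `TOKENIZER_ALIASES` and `resolveOobaTokenizer` are hypothetical names, and trimming and lower-casing the input is an added assumption (the diff compares the string exactly).

```typescript
// Hypothetical type: the two tokenizer names the diff passes to tokenizeWebTokenizers.
type WebTokenizer = 'llama' | 'mistral'

// Hypothetical alias table: user-entered name -> tokenizer to load.
// A Map is used so that names like "constructor" cannot hit inherited object keys.
const TOKENIZER_ALIASES = new Map<string, WebTokenizer>([
    ['llama', 'llama'],
    ['mistral', 'mistral'],
    ['mixtral', 'mistral'],
])

// Resolves the optional `tokenizer` setting; empty or unknown values fall back
// to 'llama', matching the final else branch of the diff.
function resolveOobaTokenizer(name?: string): WebTokenizer {
    const key = (name ?? '').trim().toLowerCase() // normalization is an assumption
    return TOKENIZER_ALIASES.get(key) ?? 'llama'
}
```

With such a helper, the `ooba` branch in `encode` would reduce to one call, `tokenizeWebTokenizers(data, resolveOobaTokenizer(db.reverseProxyOobaArgs.tokenizer))`, and supporting another tokenizer name would only need a new table entry.
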