LoRA(Low-Rank Adaptation,低秩微调)。
全面微调(Full Fine-Tuning)一个几百亿参数的大模型像是在“重新装修整栋摩天大楼”(成本极高、极易塌房),那么 LoRA 就是在摩天大楼外面搭几根轻量级的“外挂管道”。
我们来一步一步实现这个过程:
准备
uv pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
uv pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
uv pip install --no-deps unsloth
一、加载底模:
from unsloth import FastLanguageModel
import torch
MODEL = "unsloth/Qwen3-14B"
model, tokenizer = FastLanguageModel.from_pretrained(
model_name =MODEL,
max_seq_length = 2048,
dtype = None,
load_in_4bit = True,
full_finetuning = False
)
model = FastLanguageModel.get_peft_model(
model,
r = 32,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 32,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
use_rslora = False,
loftq_config = None,
)
140 亿(14B 参数) 的庞然大物(通义千问最新版)塞进你的电脑里。
算笔账: 如果按标准出厂格式(Float16,每个参数占 2 字节),140亿参数光是安静地躺在显存里,就需要 28 GB 的显存!如果强行全参微调,显存直接飙升到 80GB 以上。普通人根本玩不起。
max_seq_length = 2048
是模型的“短期记忆草稿纸”大小(上下文窗口)。
为什么不设成 128K? 虽然 Qwen 支持极长的上下文,但在微调时,长度越长,计算 Attention 时的中间矩阵(特别是 KV Cache)就越庞大。设定为 2048,是为了在“能看懂足够长的问题”和“不把显卡内存撑爆”之间找一个黄金平衡。
full_finetuning = False
load_in_4bit = True
为你下一步接入上一章讲的 LoRA 外挂管道 做准备!
这就是在向 PyTorch 的内存管家下达死命令:“给我把这 140 亿个被压缩成 4-bit 的原生脑细胞彻底冻结(Freeze)!不管等会儿训练老师(Loss 交叉熵)发多大火,绝对不准修改它们中的任何一个数字!”
先用 load_in_4bit = True 把一栋 140 亿块砖头的摩天大楼,强行压缩成了极其轻量的“马赛克微缩模型”,塞进了你的游戏显卡里。
然后你用 full_finetuning = False 给这栋大楼贴上了封条,严禁拆卸。
大模型并不是靠某几个极其精确的数字来思考的,而是靠几百亿个数字的整体网络结构和相对大小。所以即使把精度砍到 4-bit,它的智商也几乎只有极其微小的下降。
二、LoRA微调
# 获取高效微调模型 (PEFT model)
model = FastLanguageModel.get_peft_model(
model,
# 目标模块 (需进行 LoRA 适配的层)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
# 使用梯度检查点 (优化显存技术),这里指定 Unsloth 的实现
use_gradient_checkpointing = "unsloth",
r = 16, # 秩 (Rank),低秩适应层的大小
lora_alpha = 4, # LoRA 的缩放因子
lora_dropout = 0, # LoRA 层的 Dropout 概率 (防止过拟合)
bias = "none" # 是否微调偏置 (bias) 项
)
代码使用了目前极其火爆的极速微调库 Unsloth(结合了 HuggingFace 的 PEFT)。每一行都在精妙地控制我们刚才说的那根“外挂管道”:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]- 含义: 我们要把外挂管道接在哪里?
- 直觉: 大模型里有无数个矩阵,这里指定只把 LoRA 挂在注意力机制(Attention)的 Q、K、V 和输出层 O 上。这是业界公认性价比最高的“挂载点”。
use_gradient_checkpointing = "unsloth"- 含义: 开启梯度检查点(一种省显存的黑科技)。
- 直觉: “时间换空间”。不把所有中间计算结果都存在显卡里,而是用的时候再重新算一遍。
unsloth提供了极其优化的版本,能让你在小显存游戏显卡(比如 RTX 3090/4090)上跑起庞然大物。
r = 16(最关键的参数)- 含义: 就是流程图里的那个“秩(Rank)”维度。
- 直觉: 也就是那个“沙漏”最窄的地方有多宽。$r$ 越大,这根管道能流过去的信息越多,模型能学到的新知识越复杂,但需要的显存也越大。通常设置在 8、16、32 左右。
lora_alpha = 4- 含义: LoRA 的缩放比例(Scaling factor)。
- 直觉: 相当于“外挂管道”的音量旋钮。由于最后两路信号要相加($\oplus$),
alpha决定了外挂信号的声音有多大。在数学上,外挂权重会乘以 $\frac{alpha}{r}$。
lora_dropout = 0和bias = "none"- 含义: 关闭 Dropout,不训练偏置项。
- 直觉: 极致的抠门学!因为 LoRA 模块本来就极小(参数很少),不容易发生过拟合,所以不需要用 Dropout 来“毒死”神经元。不训练 bias 也是为了进一步榨干显存,把所有算力全用在核心矩阵上。

graph TD
%% 定义样式类
classDef input fill:#e1bee7,stroke:#8e24aa,stroke-width:2px,color:#000
classDef frozen fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5,color:#000
classDef loraA fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
classDef loraB fill:#ffcdd2,stroke:#d32f2f,stroke-width:2px,color:#000
classDef operation fill:#eeeeee,stroke:#9e9e9e,stroke-width:2px,color:#000
classDef output fill:#bbdefb,stroke:#1976d2,stroke-width:2px,color:#000
classDef note fill:#f5f5f5,stroke:#666666,stroke-width:1px,stroke-dasharray: 3 3,color:#333
%% 节点定义
X["输入数据 X (维度 d)"]:::input
subgraph Pretrained [预训练模型 - 微调时冻结]
W["原生权重矩阵 W<br>(d × k)"]:::frozen
end
subgraph LoRAPath [LoRA 外挂管道 - 微调时只学习它们]
A["降维矩阵 A<br>(d × r)"]:::loraA
MUL(("⊗ 相乘")):::operation
B["升维矩阵 B<br>(r × k)"]:::loraB
end
ADD(("⊕ 相加")):::operation
Y["输出数据 Y (维度 k)"]:::output
%% 注释节点
Note["💡 提示:<br>秩 r 远小于 d 和 k (如 r=16)<br>所以外挂管道的参数量极小!"]:::note
%% 连接关系
X --> W
X --> A
W -->|保持不变的原生输出| ADD
A --> MUL
MUL --> B
B -->|微调学到的增量输出| ADD
ADD --> Y
%% 注释连线
B -.-> Note
这张图解释了“如何在不改变原有大模型智商的前提下,教它新知识”。
- 左侧的主路(预训练模型):
- 这是大模型出厂时自带的“原生知识矩阵”。
- 注意旁边的字:(微调时冻结)。这意味着在训练时,我们绝对不修改大模型原来的参数(彻底避免了“灾难性遗忘”,不用担心它学了写诗就忘了算数学)。
- 右侧的旁路(降维矩阵A 和 升维矩阵B):
- 这是我们加装的“外挂管道”(LoRA 模块)。旁边写着(微调时只学习它们)。
- 核心魔法在于这个“沙漏形”结构:
- 假设输入数据 x 是 4096 维的。
- 矩阵 A (降维): 把 4096 维的信息,狠狠压缩到一个极小的维度 r(比如图中的 $r=16)。
- 矩阵 B (升维): 拿着这 16 维的精华特征,再把它重新放大回 4096 维。
- 顶部的相加(+):
- 原始大模型的输出,和我们“外挂管道”算出来的输出,在最后一步直接相加。
- 物理直觉: “原生大脑”负责提供基础常识,“外挂大脑”负责提供特定领域的微调风格,两者混合,输出最终结果。
三、加载训练数据集
from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
上面可以看到,我们加载了2种数据集,一种是推理型数据样本,一种是非推理型数据样本,格式是不一样的:
# 处理推理数据集
from pprint import pprint
pprint(reasoning_dataset[0])
# 手写一个 generate_conversation 函数,把原始的 problem(问题)和 generated_solution(生成的解答)强行打包成了标准的字典列表 [{"role": "user", ...}, {"role": "assistant", ...}]。
# 样例特征与翻译要求的呼应:
# 思维链 (Chain-of-Thought): 一开始就是 <think> 标签,里面装满了诸如 "Okay, let's see...", "Let me try that..." 这样的碎碎念。这就是在教模型“三思而后行”,不要急于给出答案,把中间的推理步骤显式地打印出来。
# 清晰标记最终答案 (Mark the final answer clearly): 在回复的最后,使用了 \boxed{14}。
# 兼容评估基准 (Compatibility with evaluation benchmarks): 为什么要用 \boxed{}?因为主流的开源数学能力测试集(如 GSM8K, MATH),它们的自动判分脚本都是靠正则表达式去抓取 \boxed{} 里面的内容来判断对错的。如果你不教模型输出这个框,就算它算对了,在测试跑分时也会得零分。
def generate_conversation(examples):
problems = examples["problem"]
solutions = examples["generated_solution"]
conversations = []
for problem, solution in zip(problems, solutions):
conversations.append([
{"role" : "user", "content" : problem},
{"role" : "assistant", "content" : solution},
])
return { "conversations": conversations, }
reasoning_conversations = tokenizer.apply_chat_template(
reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
tokenize = False,
)
pprint(reasoning_dataset[0])
# 整理完的推理型的例子
<|im_start|>user
Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>
<|im_start|>assistant
<think>
Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.
First, let me write down the equation again to make sure I have it right:
√(x² + 165) - √(x² - 52) = 7.
Okay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:
√(x² + 165) = 7 + √(x² - 52).
Now, if I square both sides, maybe I can get rid of the square roots. Let's do that:
(√(x² + 165))² = (7 + √(x² - 52))².
Simplifying the left side:
x² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².
The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).
So putting it all together:
x² + 165 = 49 + 14√(x² - 52) + x² - 52.
Hmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:
165 = 49 + 14√(x² - 52) - 52.
Simplify the constants on the right:
49 - 52 is -3, so:
165 = -3 + 14√(x² - 52).
Now, add 3 to both sides to isolate the radical term:
165 + 3 = 14√(x² - 52).
So 168 = 14√(x² - 52).
Divide both sides by 14:
168 / 14 = √(x² - 52).
12 = √(x² - 52).
Now, square both sides again to eliminate the square root:
12² = x² - 52.
144 = x² - 52.
Add 52 to both sides:
144 + 52 = x².
196 = x².
So x = √196 = 14.
But wait, since the problem states that x is positive, we only take the positive root. So x = 14.
But hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.
Let's plug x = 14 back into the original equation:
√(14² + 165) - √(14² - 52) = ?
Calculate each term:
14² is 196.
So first radical: √(196 + 165) = √361 = 19.
Second radical: √(196 - 52) = √144 = 12.
So 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.
Therefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.
</think>
To solve the equation \(\sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7\) for positive \(x\), we proceed as follows:
1. Start with the given equation:
\[
\sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7
\]
2. Isolate one of the square roots by moving \(\sqrt{x^2 - 52}\) to the right side:
\[
\sqrt{x^2 + 165} = 7 + \sqrt{x^2 - 52}
\]
3. Square both sides to eliminate the square root on the left:
\[
(\sqrt{x^2 + 165})^2 = (7 + \sqrt{x^2 - 52})^2
\]
Simplifying both sides, we get:
\[
x^2 + 165 = 49 + 14\sqrt{x^2 - 52} + (x^2 - 52)
\]
4. Combine like terms on the right side:
\[
x^2 + 165 = x^2 - 52 + 49 + 14\sqrt{x^2 - 52}
\]
Simplifying further:
\[
x^2 + 165 = x^2 - 3 + 14\sqrt{x^2 - 52}
\]
5. Subtract \(x^2\) from both sides:
\[
165 = -3 + 14\sqrt{x^2 - 52}
\]
6. Add 3 to both sides to isolate the term with the square root:
\[
168 = 14\sqrt{x^2 - 52}
\]
7. Divide both sides by 14:
\[
12 = \sqrt{x^2 - 52}
\]
8. Square both sides again to eliminate the square root:
\[
12^2 = x^2 - 52
\]
Simplifying:
\[
144 = x^2 - 52
\]
9. Add 52 to both sides to solve for \(x^2\):
\[
196 = x^2
\]
10. Take the positive square root (since \(x\) is positive):
\[
x = \sqrt{196} = 14
\]
11. Verify the solution by substituting \(x = 14\) back into the original equation:
\[
\sqrt{14^2 + 165} - \sqrt{14^2 - 52} = \sqrt{196 + 165} - \sqrt{196 - 52} = \sqrt{361} - \sqrt{144} = 19 - 12 = 7
\]
The solution checks out.
Thus, the only positive solution is:
\[
\boxed{14}
\]<|im_end|>
同样整理非推理型的数据
# 非推理型
pprint(non_reasoning_dataset[0])
# 直接调用了开源微调神器 unsloth 自带的魔法库:standardize_sharegpt 和 apply_chat_template。
# ShareGPT 是目前开源界最流行的人机对话数据格式。
# 这两行代码会自动给数据穿上模型认识的“制服”(比如自动添加 <|im_start|> 和 <|im_end|> 这种特殊的控制字符)。
# 非推理型通常处理常识问答、代码编写、文本翻译等“知道就知道,不知道就不知道”的任务
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)
non_reasoning_conversations = tokenizer.apply_chat_template(
dataset["conversations"],
tokenize = False,
)
pprint(non_reasoning_dataset[0])
# 整理完的非推理型的例子
<|im_start|>user
Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code.
Furthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.
Finally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>
<|im_start|>assistant
<think>
</think>
Boolean operators are logical operators used in programming to manipulate boolean values. They operate on one or more boolean operands and return a boolean result. The three main boolean operators are "AND" (&&), "OR" (||), and "NOT" (!).
The "AND" operator returns true if both of its operands are true, and false otherwise. For example:
```python
x = 5
y = 10
result = (x > 0) and (y < 20) # This expression evaluates to True
```
The "OR" operator returns true if at least one of its operands is true, and false otherwise. For example:
```python
x = 5
y = 10
result = (x > 0) or (y < 20) # This expression evaluates to True
```
The "NOT" operator negates the boolean value of its operand. It returns true if the operand is false, and false if the operand is true. For example:
```python
x = 5
result = not (x > 10) # This expression evaluates to True
```
Operator precedence refers to the order in which operators are evaluated in an expression. It ensures that expressions are evaluated correctly. In most programming languages, logical AND has higher precedence than logical OR. For example:
```python
result = True or False and False # This expression is evaluated as (True or (False and False)), which is True
```
Short-circuit evaluation is a behavior where the second operand of a logical operator is not evaluated if the result can be determined based on the value of the first operand. In short-circuit evaluation, if the first operand of an "AND" operator is false, the second operand is not evaluated because the result will always be false. Similarly, if the first operand of an "OR" operator is true, the second operand is not evaluated because the result will always be true.
In programming languages that support short-circuit evaluation natively, you can use it to improve performance or avoid errors. For example:
```python
if x != 0 and (y / x) > 10:
# Perform some operation
```
In languages without native short-circuit evaluation, you can implement your own logic to achieve the same behavior. Here's an example in pseudocode:
```
if x != 0 {
if (y / x) > 10 {
// Perform some operation
}
}
```
Truthiness and falsiness refer to how non-boolean values are evaluated in boolean contexts. In many programming languages, non-zero numbers and non-empty strings are considered truthy, while zero, empty strings, and null/None values are considered falsy.
When evaluating boolean expressions, truthiness and falsiness come into play. For example:
```python
x = 5
result = x # The value of x is truthy, so result is also truthy
```
To handle cases where truthiness and falsiness are implemented differently across programming languages, you can explicitly check the desired condition. For example:
```python
x = 5
result = bool(x) # Explicitly converting x to a boolean value
```
This ensures that the result is always a boolean value, regardless of the language's truthiness and falsiness rules.<|im_end|>
数据集大小也不一样
print(len(reasoning_conversations))
19252
print(len(non_reasoning_conversations))
100000
四、对齐格式,合并训练数据集
data = pd.concat([
pd.Series(reasoning_conversations),
pd.Series(non_reasoning_subset)
])
data.name = "text"
from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)
在微调之前,我们必须将数据集准备成对话格式:
对于推理型数据,我们提取解答过程(solution)和最终答案(solution keys)。
对于非推理型数据,我们使用标准化方法,将其转换为所需的格式。
五、定义一个训练对象
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = combined_dataset,
eval_dataset = None,
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
max_steps = 60,
learning_rate = 2e-4,
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none",
)
)
1. 显存刺客的克星:积少成多的“吞咽法”
per_device_train_batch_size = 2(单次处理量)- 物理直觉: 每次只给模型看 2 道题。为什么这么小?因为即使模型被压到了 4-bit,训练时的中间变量(激活值)依然极其庞大。设为 2 是为了保证绝不爆显存(OOM)。
gradient_accumulation_steps = 4(梯度累加步数)- 物理直觉: 这是深度学习最伟大的工程发明之一!如果每次只看 2 道题就去更新参数,模型学到的知识会非常“偏科”和“震荡”。
- 魔法效果: 它告诉模型:“你看完 2 道题,先别急着改参数,把心得体会(梯度)记在脑子里;接着再看 2 道,再看 2 道……直到连续看了 4 批(总共 2 * 4 = 8 道题)之后,把心得体会加在一起,统一修改一次参数!”
- 结果: 你用极其微小的显存,完美模拟了
Batch Size = 8的平滑训练效果。
2. 学习节奏的艺术:先热身,再冲刺
learning_rate = 2e-4(学习率:0.0002)- 物理直觉: 决定了模型每次修改参数时的“步子跨多大”。对于 LoRA 微调来说,
2e-4是一个极其经典的“甜点值”。步子太大容易忘记常识(灾难性遗忘),步子太小半天学不会。
- 物理直觉: 决定了模型每次修改参数时的“步子跨多大”。对于 LoRA 微调来说,
warmup_steps = 5(热身步数)- 物理直觉: 刚开始训练时,外挂管道里的参数是随机初始化的。如果一开始就用全速(2e-4)去冲,会引发梯度爆炸,瞬间摧毁模型。
- 魔法效果: 在前 5 步,学习率会从 0 极其平滑地慢慢上升到 2e-4。就像开车起步先轻踩油门。
lr_scheduler_type = "linear"(学习率调度器)- 物理直觉: 过了 5 步热身期后,随着训练步数越来越接近尾声,学习率会变成直线下降,越来越小。这意味着模型在训练后期会进行更精细的“微调雕刻”,而不是大刀阔斧地乱改。
3. 终极算力压榨:8-bit 优化器
optim = "adamw_8bit"- 物理直觉: 训练模型不光需要保存模型本身的参数,还要保存一个叫“优化器状态(动量)”的东西。标准的 32 位 AdamW 优化器是个内存巨兽,会吃掉极大量的显存。
- 魔法效果: 这是 bitsandbytes 库的又一神作。它把优化器的状态强行压缩成了 8-bit!这能让你额外省出好几个 G 的显存,是单卡微调必开的终极选项。
4. 其他常规设置
max_steps = 60(总训练步数): 你只打算让模型更新 60 次参数。这通常不是为了训练出一个最终成品,而是一个 “跑通验证 (Sanity Check)” —— 用来测试代码会不会报错、Loss(误差)会不会正常下降。真正训练可能需要几百上千步(通常用num_train_epochs控制)。weight_decay = 0.01(权重衰减): 防止模型死记硬背训练集(过拟合)的惩罚机制。seed = 3407: 固定随机种子,保证每次跑的结果完全一样。(圈内冷知识:3407 是 AI 学术界著名的玄学种子,有一篇著名论文叫《Torch.manual_seed(3407) is all you need》,发现用这个数字通常能跑出好结果。)report_to = "none": 关闭监控面板(比如 WandB),专注于在终端看日志。
六、开始训练
trainer_stats = trainer.train()
# 测测看
messages = [
{"role" : "user", "content" : "Solve (x^2 + 5x + 6 = 0)."}
]
text = tokenizer.apply_chat_template(
messages,
tokenize = False,
add_generation_prompt = True,
enable_thinking = False,
)
from transformers import TextStreamer
_ = model.generate(
**tokenizer(text, return_tensors = "pt").to("cuda"),
max_new_tokens = 2048,
temperature = 0.7, top_p = 0.8, top_k = 20,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# 保存
model.save_pretrained("qwen3_lora_model")
tokenizer.save_pretrained("qwen3_lora_model")
会存下以下文件:
('qwen3_lora_model/tokenizer_config.json',
'qwen3_lora_model/special_tokens_map.json',
'qwen3_lora_model/vocab.json',
'qwen3_lora_model/merges.txt',
'qwen3_lora_model/added_tokens.json',
'qwen3_lora_model/tokenizer.json')
加载训练成果
from unsloth import FastLanguageModel
finetuned_model, finetuned_tokenizer = FastLanguageModel.from_pretrained(
model_name = "qwen3_lora_model",
max_seq_length = 2048,
load_in_4bit = True,
)
# 测一把
messages = [{"role" : "user", "content" : "Solve (x^2 + 5x + 6 = 0)."}]
text = finetuned_tokenizer.apply_chat_template(
messages,
tokenize = False,
add_generation_prompt = True,
enable_thinking = False,
)
from transformers import TextStreamer
_ = finetuned_model.generate(
**finetuned_tokenizer(text, return_tensors = "pt").to("cuda"),
max_new_tokens = 2048,
temperature = 0.7, top_p = 0.8, top_k = 20,
streamer = TextStreamer(finetuned_tokenizer, skip_prompt = True),
)