Transformers框架基础教程及更多实战内容
以商品评价数据集为例,使用bert进行情感分析。
数据集来源:https://github.com/SophonPlus/ChineseNlpCorpus
Step1 导包
1 2
| from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments from datasets import load_dataset,load_from_disk
|
Step2 加载数据集
1
| dataset = load_dataset("csv", data_files="./JD.com_comments.csv", split='train')
|
这个数据集共包含720 万条评论,包含用户ID,商品ID,评分,时间,评论标题,评论内容这六个字段。看一下第一条数据。
1 2 3 4 5 6 7 8
| { 'userId': 29792, 'productId': 345003, 'rating': 5, 'timestamp': 1368720000, 'title': '路由器很好', 'comment': '这款路由器用着感觉不错 感觉是好久没卖出去了吧,那么多的灰尘' }
|
Step3 数据集预处理
在数据集中,rating,title,comment分别表示评分,评价标题,评价内容。我们仅使用rating和comment,剔除数据集中rating或comment为空的数据
1
| dataset=dataset.filter( lambda item: item['rating'] is not None and item['comment'] is not None)
|
使用分词器进行编码数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
| tokenizer = AutoTokenizer.from_pretrained("hfl/rbt3")
def process_function(examples): ''' 数据处理函数 rating在数据集中共可取1,2,3,4,5。我们最终的结果与rating的对应关系为: 好评(0):4,5 中性(1):3 差评(2):1,2 ''' content = examples['comment'] tokenized_examples = tokenizer(content, max_length=128, truncation=True, padding="max_length")
rates = [int(ra) for ra in examples["rating"]] tokenized_examples["labels"] = [0 if rate > 3 else 2 if rate < 3 else 1 for rate in rates] return tokenized_examples
|
进行数据处理
1
| tokenized_datasets = dataset.map(process_function, batched=True, remove_columns=dataset.column_names)
|
看一下处理完的数据集,并保存到本地,方便下次直接加载
1 2 3
| tokenized_datasets.save_to_disk('./waimai_data')
tokenized_datasets=load_from_disk('./waimai_data')
|
Step4 划分数据集
把整个数据集按比例划分为训练集和测试集
1
| tokenized_datasets=tokenized_datasets.train_test_split(test_size=0.1)
|
最终的数据集如下,训练集360w条,测试集40w:
1 2 3 4 5 6 7 8 9 10
| DatasetDict({ train: Dataset({ features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'], num_rows: 3673050 }) test: Dataset({ features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'], num_rows: 408117 }) })
|
Step5 创建模型
1 2
| model = AutoModelForSequenceClassification.from_pretrained("hfl/rbt3",num_labels=3)
|
Step6 创建评估函数
1 2 3 4 5 6 7 8 9 10 11 12
| import evaluate
acc_metric = evaluate.load("accuracy") f1_metirc = evaluate.load("f1")
def eval_metric(eval_predict): predictions, labels = eval_predict predictions = predictions.argmax(axis=-1) acc = acc_metric.compute(predictions=predictions, references=labels) f1 = f1_metirc.compute(predictions=predictions, references=labels,average='micro') acc.update(f1) return acc
|
Step7 创建TrainingArguments
1 2 3 4 5 6 7 8 9 10 11 12 13
| train_args = TrainingArguments( output_dir="./checkpoints", per_device_train_batch_size=2048, per_device_eval_batch_size=2048, logging_steps=10, evaluation_strategy="epoch", save_strategy="epoch", save_total_limit=3, learning_rate=2e-5, num_train_epochs=10, weight_decay=0.01, metric_for_best_model="f1", load_best_model_at_end=True)
|
Step8 创建Trainer
1 2 3 4 5 6 7
| from transformers import DataCollatorWithPadding trainer = Trainer(model=model, args=train_args, train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["test"], data_collator=DataCollatorWithPadding(tokenizer=tokenizer), compute_metrics=eval_metric)
|
Step9 模型训练
接下来就是漫长的等待时间了,最终训练结果如下:
1 2 3 4 5 6 7 8 9 10 11
| TrainOutput( global_step=250, training_loss=0.43612158012390134, metrics={ 'train_runtime': 711.8635, 'train_samples_per_second': 715.39, 'train_steps_per_second': 0.351, 'total_flos': 8548939048135680.0, 'train_loss': 0.43612158012390134, 'epoch': 10.0} )
|
Step10 模型预测
1 2 3 4 5 6 7 8 9 10 11
| from transformers import pipeline
id2_label = { 1: "中性",2:"差评!",0: "好评!"} model.eval() model.config.id2label = id2_label pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)
print(pipe("还不错,老板送小礼品了")) print(pipe("一般般,勉强能用")) print(pipe("太垃圾了,跟图片上差远了"))
|
预测结果:
1 2 3
| [{'label': '好评!', 'score': 0.9717923998832703}] [{'label': '中性', 'score': 0.5955721139907837}] [{'label': '差评!', 'score': 0.8747568726539612}]
|
代码地址
商品评价情感分析
更多内容