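Hugging Face (nicknamed 抱抱脸 in Chinese) is a startup headquartered in New York that focuses on natural language processing, artificial intelligence, and distributed systems. Its chatbot technology has long been popular, but the company is better known for its contributions to the open-source NLP community.
Hugging Face aims to democratize NLP: it wants everyone to be able to use state-of-the-art (SOTA) NLP techniques instead of being held back by a lack of training resources.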
Hugging Face model hub
On the model hub you can download the models you need, and you can also upload models you have fine-tuned for a specific task.
Hugging Face documentation
https://huggingface.co/transformers/master/index.html
English-Chinese translation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Chinese -> English: load the Opus-MT model and tokenizer, then wrap them in a translation pipeline
zh2en_model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
zh2en_tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
zh2en_translation = pipeline('translation_zh_to_en',
                             model=zh2en_model,
                             tokenizer=zh2en_tokenizer)
# The input sentence means "Python is a very powerful programming language!"
zh2en_translation('Python是一门非常强大的编程语言!')
[{'translation_text': 'Python is a very powerful programming language!'}]
# English -> Chinese: same steps with the reverse-direction model
en2zh_model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
en2zh_tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
en2zh_translation = pipeline('translation_en_to_zh',
                             model=en2zh_model,
                             tokenizer=en2zh_tokenizer)
en2zh_translation('Python is a very powerful programming language!')
[{'translation_text': 'Python是一个非常强大的编程语言!'}]
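Both directions above follow the same pattern, so the steps can be wrapped in small helpers. The following is only a minimal sketch: the function names `opus_mt_names`, `translate_texts`, and `make_translator` are hypothetical, and the naming pattern is inferred from the two model ids used above, so it is an assumption that a given language pair actually exists on the hub. `fake_translator` is a stand-in that mimics the pipeline's output shape so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List, Tuple


def opus_mt_names(src: str, tgt: str) -> Tuple[str, str]:
    """Return (model id, pipeline task) for a language pair.

    Assumption: the pair follows the same naming pattern as the
    zh-en and en-zh models used above.
    """
    model_id = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    task = f"translation_{src}_to_{tgt}"
    return model_id, task


def translate_texts(
    translator: Callable[[str], List[Dict[str, str]]],
    texts: List[str],
) -> List[str]:
    """Translate texts one at a time and keep only the translated strings."""
    return [translator(text)[0]["translation_text"] for text in texts]


def make_translator(src: str, tgt: str):
    """Build a real translation pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    model_id, task = opus_mt_names(src, tgt)
    return pipeline(task, model=model_id)


def fake_translator(text: str) -> List[Dict[str, str]]:
    """Stand-in with the same output shape as the pipelines above."""
    return [{"translation_text": text.upper()}]


print(opus_mt_names("zh", "en"))
print(translate_texts(fake_translator, ["hello", "world"]))
```

With transformers installed and network access, `translate_texts(make_translator('zh', 'en'), [...])` would use the real model instead of the stand-in.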
Text classification
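The model uer/roberta-base-finetuned-chinanews-chinese belongs to a set of Chinese RoBERTa-base classifiers fine-tuned on five Chinese text classification datasets:
- JD full, JD binary, and Dianping contain user reviews with different sentiment polarities.
- Ifeng and Chinanews contain news text from different topic categories.
As its name suggests, this checkpoint is the one fine-tuned on Chinanews, so it predicts news topics rather than sentiment.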
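from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
# 'sentiment-analysis' is the pipeline alias for text classification; this model predicts news topics
text_classification = pipeline('sentiment-analysis',
                               model=model,
                               tokenizer=tokenizer)
test_text = "上证指数大涨2%"  # "The Shanghai Composite Index surges 2%"
# return_all_scores=True returns a score for every label instead of only the best one
# (newer transformers releases deprecate it in favor of top_k=None)
text_classification(test_text, return_all_scores=True)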
[[{'label': 'mainland China politics', 'score': 0.0002807585697155446},
{'label': 'Hong Kong - Macau politics', 'score': 0.00015504546172451228},
{'label': 'International news', 'score': 6.818029214628041e-05},
{'label': 'financial news', 'score': 0.9991051554679871},
{'label': 'culture', 'score': 0.00011297615128569305},
{'label': 'entertainment', 'score': 0.00012184812658233568},
{'label': 'sports', 'score': 0.0001558474759804085}]]
test_text = "Python是一门强大的编程语言"  # "Python is a powerful programming language"
text_classification(test_text, return_all_scores=True)
[[{'label': 'mainland China politics', 'score': 0.02050291746854782},
{'label': 'Hong Kong - Macau politics', 'score': 0.0030984438490122557},
{'label': 'International news', 'score': 0.005687597207725048},
{'label': 'financial news', 'score': 0.03360358253121376},
{'label': 'culture', 'score': 0.913349986076355},
{'label': 'entertainment', 'score': 0.010810119099915028},
{'label': 'sports', 'score': 0.012947351671755314}]]
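In practice you usually want the single most likely label rather than the full score list. The sketch below ranks the labels of one result; the helper names `rank_labels` and `top_label` are hypothetical, and the code assumes the nested list shape printed above (one inner list per input text), which other transformers versions or arguments may not produce. The sample data is copied from the second output above.

```python
from typing import Dict, List, Tuple, Union

LabelScore = Dict[str, Union[str, float]]


def rank_labels(result: List[List[LabelScore]]) -> List[Tuple[str, float]]:
    """Sort the labels of the first (and only) input text by descending score."""
    scores = result[0]  # assumption: nested shape as printed above
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    return [(item["label"], item["score"]) for item in ranked]


def top_label(result: List[List[LabelScore]]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    return rank_labels(result)[0]


# Output copied from the "Python是一门强大的编程语言" example above.
python_result = [[
    {'label': 'mainland China politics', 'score': 0.02050291746854782},
    {'label': 'Hong Kong - Macau politics', 'score': 0.0030984438490122557},
    {'label': 'International news', 'score': 0.005687597207725048},
    {'label': 'financial news', 'score': 0.03360358253121376},
    {'label': 'culture', 'score': 0.913349986076355},
    {'label': 'entertainment', 'score': 0.010810119099915028},
    {'label': 'sports', 'score': 0.012947351671755314},
]]

print(top_label(python_result))  # ('culture', 0.913349986076355)
```

Note that the model has no "technology" category, so the sentence about Python lands in "culture", the closest of the seven topics it knows.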
Code download
https://github.com/hidadeng/DaDengAndHisPython/tree/master/20211108HuggingFace学习