Huggingface LLM Course - Chapter1
I've been immersing myself in development literature, especially focused on ML. Here are the books I've read (I'm listing them in Korean since that's my native language :D):
- 스트리트 코더 : 프로그래밍 세계에서 살아남기 위한 개발자 생존 가이드!
- 그림으로 이해하는 서버 구조와 기술
- 혼자 공부하는 컴퓨터 구조 + 운영체제
- 파이썬 텍스트 마이닝 완벽 가이드 : 자연어 처리 기초부터 딥러닝 기반 BERT 모델까지
- 파이토치 딥러닝 모델 AI 앱 개발 입문
These books cover a wide range of topics, including fundamental CS knowldege, development conventions, traditional machine learning methods, and practical applications of deep learning models using PyTorch.
I'm currently working to understand deep learning and its development process, though I recognize this is likely a lifelong journey😂. After going through the materials mentioned above, I've now moved on to the Hugging Face documentation.
I'm genuinely impressed by the documentation's content. It's incredibly user-friendly and helpful for beginners like me who need to start with the basics such as how to use libraries and understanding the numerous methods available. Thanks to the authors' thoughtful approach, I plan to write down the key points in this post.
Huggingface gives us a high-level operation of the model so that we can easily use them. However, There is no an exact details about 'real' implementation at the code level. So, after finishing the huggingface docs I'm going to read pytorch tutorial and '밑바닥부터 시작하는 딥러닝' series.
What is NLP?
NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.
pipeline
The pipeline() function executes high-level API tasks for NLP.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
>>> [{'label': 'POSITIVE', 'score': 0.9598047137260437}]
from transformers import pipeline
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
>>>[{'generated_text': 'In this course, we will teach you how to understand and use '
'data flow and data interchange when handling user data. We '
'will be working with one or more of the most commonly used '
'data flows — data flows of various types, as seen by the '
'HTTP'}]
For my learning purposes, the pipeline approach isn't quite suitable. I'm just familiarizing myself with low-level libraries rather than using it for my actual task.
Bias and Limitations
One thing that's particularly impressive about this...
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
>>>['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']
Oh... this is a serious problem. as huggingface tells us:
When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won’t make this intrinsic bias disappear.
Brief Summary of Models
Models | Examples | Tasks |
---|---|---|
Encoder | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
Decoder | CTRL, GPT, GPT-2, Transformer XL | Text generation |
Encoder-Decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering |
Chapter 1 is complete! 🎉
Reference
Huggingface LLM Course: https://huggingface.co/learn/llm-course/chapter1/1