Notes on "Patterns for Building LLM-based Systems & Products"

These are my notes from reading the article "Patterns for Building LLM-based Systems & Products".

Evals: To measure performance

G-Eval is a framework that applies LLMs with Chain-of-Thought (CoT) and a form-filling paradigm to evaluate LLM outputs.

So there is a framework that evaluates LLM outputs using GPT-4 together with CoT and similar techniques. I want to read it later.

The QLoRA paper apparently also uses GPT-4 for evaluation.

Relative to human judgments which are typically noisy (due to differing biases among annotators), LLM judgments tend to be less noisy (as the bias is more systematic) but more biased.

LLM-based evaluation is said to have the following biases.

Position bias: LLMs like GPT-4 tend to favor the response in the first position. To mitigate this, we can evaluate the same pair of responses twice while swapping their order. If the same response is preferred in both orders, we mark it as a win. Otherwise, it’s a tie.

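Position bias. The judge favors the response in the first position, so each pair should be evaluated in both orders by swapping the responses.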
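The swap-and-compare mitigation from the quote can be sketched as follows. This is a minimal sketch, not the article's code: the `Judge` signature and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for illustration):
# takes (first_response, second_response) and returns "first" or "second".
Judge = Callable[[str, str], str]


def compare_with_swap(judge: Judge, response_a: str, response_b: str) -> str:
    """Return "a", "b", or "tie" after judging the pair in both orders."""
    # Round 1: A is shown in the first position.
    round1 = "a" if judge(response_a, response_b) == "first" else "b"
    # Round 2: B is shown in the first position.
    round2 = "b" if judge(response_b, response_a) == "first" else "a"
    # Count a win only if the same response is preferred in both orders.
    return round1 if round1 == round2 else "tie"


def always_first(first: str, second: str) -> str:
    """Toy judge with extreme position bias: always prefers the first slot."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


print(compare_with_swap(always_first, "x", "yy"))    # position-biased judge
print(compare_with_swap(prefers_longer, "x", "yy"))  # position-independent judge
```

A purely position-biased judge contradicts itself across the two rounds, so its verdict collapses to a tie instead of counting as a win for whichever response happened to be listed first.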

Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones, even if the latter is clearer and of higher quality. A possible solution is to ensure that comparison responses are similar in length.

Verbosity bias. The judge favors longer, wordier responses.

Self-enhancement bias: LLMs have a slight bias towards their own answers. GPT-4 favors itself with a 10% higher win rate while Claude-v1 favors itself with a 25% higher win rate. To counter this, don’t use the same LLM for evaluation tasks.

Self-enhancement bias. The judge is biased toward its own answers.

Retrieval-Augmented Generation: To add knowledge

I wondered whether the definition of RAG also covers things like web search, or whether it refers only to embedding-based retrieval.

The RAG paper is below; I want to read it later. https://arxiv.org/abs/2005.11401

There also seems to be a paper on web search; I want to read it later. https://arxiv.org/abs/2203.05115

RAG has also been applied to non-QA tasks such as code generation.

Besides QA, RAG can also be used for tasks such as code generation. Makes sense.

Why not embedding-based search only? While it’s great in many instances, there are situations where it falls short, such as:

Searching for a person or object’s name (e.g., Eugene, Kaptir 2.0)

Searching for an acronym or phrase (e.g., RAG, RLHF)

Searching for an ID (e.g., gpt-3.5-turbo, titan-xlarge-v1.01)

Examples of cases where embedding-based search does not work well:

Names of people
Acronyms (RAG, RLHF)
IDs (gpt-3.5-turbo, titan-xlarge-v1.01)

I can see why these would not work well.
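For queries like these, an exact-token keyword match is one way to cover the gap. The sketch below is my own illustration, not something taken from the article: `tokenize`, `keyword_search`, and the sample documents are all hypothetical, and a real system would more likely use a proper keyword index alongside the embedding search.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split, keeping dots and hyphens inside tokens so that
    IDs such as "gpt-3.5-turbo" survive as a single token."""
    raw = re.findall(r"[a-z0-9][a-z0-9.\-]*", text.lower())
    # Strip sentence punctuation that got attached to the end of a token.
    return {token.rstrip(".-") for token in raw}


def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return the documents that contain every query token exactly."""
    query_tokens = tokenize(query)
    return [doc for doc in docs if query_tokens <= tokenize(doc)]


docs = [
    "The endpoint uses gpt-3.5-turbo by default.",
    "RLHF aligns a model with human preferences.",
    "Turbo engines compress intake air.",
]

print(keyword_search("gpt-3.5-turbo", docs))
print(keyword_search("RLHF", docs))
```

The ID is matched only as a whole token, so the unrelated "Turbo engines" document is not returned for the `gpt-3.5-turbo` query.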

Fine-tuning: To get better at specific tasks

Similar to prefix tuning, they found that LoRA outperformed several baselines including full fine-tuning. Again, the hypothesis is that LoRA, thanks to its reduced rank, provides implicit regularization. In contrast, full fine-tuning, which updates all weights, could be prone to overfitting.

Full fine-tuning is prone to overfitting, so LoRA may give better results. I had assumed full fine-tuning would be better, so this is interesting.
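To get a feel for what "reduced rank" means here: LoRA freezes the pretrained weight matrix W (d x k) and trains only two small matrices B (d x r) and A (r x k), whose product BA is added to W. The parameter counts can be compared with a few lines of arithmetic. The dimensions below are illustrative assumptions, not values from the article.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update: B is d x r and A is r x k."""
    return r * (d + k)


# Illustrative sizes (assumed): one 4096 x 4096 projection matrix, rank 8.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)

print(full)          # 16777216
print(lora)          # 65536
print(full // lora)  # 256
```

With far fewer trainable parameters per matrix, the update has much less capacity to memorize the fine-tuning data, which fits the implicit-regularization hypothesis in the quote.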

Caching: To reduce latency and cost

In the space of serving LLM generations, the popularized approach is to cache the LLM response keyed on the embedding of the input request.

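A common approach is to cache responses keyed on the embedding of the input. It surprised me that the key is not the raw input. But it makes sense: when users can type freely, as in a chatbot, an exact-match cache would rarely be hit, so using embeddings seems useful.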
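A minimal sketch of the idea, assuming a similarity threshold decides what counts as a cache hit. Everything here is hypothetical: `toy_embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` is not GPTCache's API, and the 0.8 threshold is an arbitrary choice.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SemanticCache:
    """Cache keyed on the request embedding instead of the raw string."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed cutoff for "similar enough"
        self.entries: list[tuple[Counter, str]] = []

    def put(self, request: str, response: str) -> None:
        self.entries.append((toy_embed(request), response))

    def get(self, request: str) -> Optional[str]:
        """Return the response of the most similar cached request, if any
        is at least as similar as the threshold; otherwise None."""
        query = toy_embed(request)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("What is the capital of France ?"))  # near-duplicate wording
print(cache.get("how do I bake bread"))              # unrelated request
```

The near-duplicate request differs from the cached one in casing and a trailing token, so an exact-match cache would miss it, while the similarity lookup still returns the cached response. The threshold is the main design choice: set too low, the cache returns answers to questions that were not actually asked.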

I have heard of GPTCache but have not tried it yet, so I want to try it. https://github.com/zilliztech/GPTCache

Guardrails: To ensure output quality

An example is the Guardrails package.

I did not know about this package. https://github.com/ShreyaR/guardrails

I knew about NeMo Guardrails but have not tried it yet, so I want to try it.

Nvidia’s NeMo-Guardrails follows a similar principle but is designed to guide LLM-based conversational systems. Rather than focusing on syntactic guardrails, it emphasizes semantic ones.

The two also differ in what they focus on.

Guidance enforces the schema by injecting tokens that make up the structure.

Guidance enforces structure by injecting tokens. I did not know that. It does seem like a good way to enforce structure.
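As I understand the idea, the program writes the structural tokens itself and the model is asked only for the values in between. The sketch below illustrates that concept and does not use the Guidance library's actual API: `fill_schema` and `toy_model` are hypothetical, with `toy_model` standing in for an LLM completion call.

```python
import json
from typing import Callable


def fill_schema(fields: list[str], generate_value: Callable[[str], str]) -> str:
    """Build a JSON object where braces, keys, and separators are injected
    by the program and only the values come from the model."""
    parts = ["{"]
    for index, field in enumerate(fields):
        if index > 0:
            parts.append(", ")
        # Injected structure: the key and the colon are never generated.
        parts.append(json.dumps(field) + ": ")
        # The model sees the text so far and produces just the next value;
        # json.dumps quotes and escapes it so the output stays valid JSON.
        value = generate_value("".join(parts))
        parts.append(json.dumps(value))
    parts.append("}")
    return "".join(parts)


def toy_model(prompt_so_far: str) -> str:
    """Hypothetical stand-in for an LLM: answers based on the last key."""
    return "Alice" if prompt_so_far.endswith('"name": ') else "wizard"


result = fill_schema(["name", "class"], toy_model)
print(result)
```

Because the model never emits the braces or keys, the output parses as JSON no matter what text the model returns for each value.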

Defensive UX: To anticipate & handle errors gracefully

To learn more about defensive UX, we can look at Human-AI guidelines from Microsoft, Google, and Apple.

The article introduces guidelines from Microsoft, Google, and Apple.

make it easy to dismiss or ignore undesired AI system services

This is about letting users dismiss or turn off AI services they do not want.

it prevents it from becoming a nuisance and potentially reducing customer satisfaction in the long term.

Doing so prevents the AI from making the product less convenient. Makes sense.

However, I question whether chat is the right UX for most user experiences—it just takes too much effort relative to the familiar UX of clicking on text and images.

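I agree that chat takes more input effort than just clicking on text or images. If everything is turned into chat (free-form input), entering input is likely to become very tedious.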

Collect user feedback: To build our data flywheel

Midjourney is another good example. After images are generated, users can generate a new set of images (negative feedback), tweak an image by asking for a variation (positive feedback), or upscale and download the image (strong positive feedback). This enables Midjourney to gather rich comparison data on the outputs generated.

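It is interesting that, in image generation, whether the user asked for a variation of an output serves as feedback. With this example in mind, the idea seems applicable to many other cases.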