How to Prompt Thinking Models like DeepSeek R1 and OpenAI o3
The release of the DeepSeek R1 model, part of a new class of language models designed to reason more effectively than their predecessors, has sparked significant interest in thinking models.
These models effectively internalize the Chain-of-Thought prompting process that was first popularized by Google researchers in a 2022 paper. Unlike traditional LLMs—which require CoT to "reason"—they handle reasoning natively, which often leads to better results.
However, their reasoning ability also means you need to prompt them differently for the best results, and most people get it wrong.
Top Thinking Models in February 2025
- DeepSeek R1
- OpenAI o3 & o1
- Google's Gemini 2.0 Flash Thinking
- Alibaba's Qwen QwQ
Key Takeaways
Thinking models require a different approach compared to standard LLMs. Here’s a quick summary before diving into details:
✅ DOs
- Use minimal prompting to let the model think independently.
- Encourage more reasoning for better performance at complex tasks by prompting for additional processing time.
- Use delimiters for clarity to separate distinct parts of input, improving interpretation.
- Use ensembling for highly complex tasks requiring high accuracy.
❌ DON’Ts
- Avoid few-shot and CoT prompting. Unlike traditional LLMs, thinking models perform worse when overloaded with examples or asked to "reason" step-by-step.
- Don’t use thinking models for structured outputs unless absolutely necessary. They perform worse with structured outputs than traditional models.
- Avoid overloading the model with unnecessary details. More context isn’t always better.
Disclaimer: These insights come from research carried out by OpenAI and Microsoft, as well as our own experience.
Let's get into it!
1. Use Minimal Prompting
Thinking models work best when given concise, direct, and structured prompts.
Too much information can actually reduce accuracy. Unlike previous models where more context helped, thinking models already structure their reasoning internally.
The best approach is to state the problem clearly and let the model figure out the steps.
✅ What are the main differences between classical and operant conditioning?
---
❌ In psychology, there are different learning theories. Classical conditioning was discovered by Pavlov, while operant conditioning was developed by Skinner. Could you please explain the difference between classical conditioning and operant conditioning? Make sure to include an example for each.
Fewer instructions allow the model to engage its reasoning process naturally.
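To make this concrete, here is a minimal sketch of what this looks like over the API. It assumes the OpenAI Python SDK and the model name `o3-mini`; both are placeholders, and any reasoning model or OpenAI-compatible endpoint (such as DeepSeek's) works the same way.

```python
# Minimal prompting: state the problem plainly and let the model reason on its own.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY env var;
# the model name "o3-mini" is illustrative -- swap in whichever reasoning model you use.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {
            "role": "user",
            # One clear question, no role-play preamble, no step-by-step instructions.
            "content": "What are the main differences between classical and operant conditioning?",
        }
    ],
)

print(response.choices[0].message.content)
```

Note that there is no system prompt assigning a persona and no list of sub-questions; the model decides how to break the problem down.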
2. Encourage More Reasoning for Complex Tasks
More complex problems benefit from additional reasoning time.
Thinking models use reasoning tokens, which allow them to internally process a problem before outputting an answer. By prompting the model to take its time, you can improve the quality of the response. However, this also increases token usage, impacting cost.
✅ Analyze the economic impact of renewable energy adoption over the next 20 years. Consider factors such as job creation, energy prices, and carbon reduction. Take your time and think through each aspect carefully.
---
❌ How does renewable energy impact the economy? Answer quickly.
Encouraging longer reasoning helps for multi-step problems, improving accuracy significantly.
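There are two practical ways to do this: a prompt-level instruction ("take your time"), and, where the API supports it, a higher reasoning-effort setting. The sketch below assumes the OpenAI Python SDK; `o3-mini` and the `reasoning_effort` parameter apply to OpenAI's o-series models, and other providers expose different knobs, so treat the specifics as assumptions and check your provider's docs.

```python
# Nudging the model to reason longer: a prompt-level instruction plus, where
# supported, a higher reasoning-effort setting. Model name and parameter are
# illustrative for OpenAI's o-series.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Analyze the economic impact of renewable energy adoption over the next 20 years. "
    "Consider factors such as job creation, energy prices, and carbon reduction. "
    "Take your time and think through each aspect carefully."
)

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low" | "medium" | "high" on o-series models
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)

# Reasoning tokens are billed, so it is worth watching how many each request burns
# (field names may differ by provider).
details = response.usage.completion_tokens_details
print("reasoning tokens:", details.reasoning_tokens)
```

Because reasoning tokens are billed like output tokens, keep an eye on usage when you dial effort up.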
3. Avoid Few-Shot and Chain-of-Thought Prompting
Traditional few-shot prompting (where you provide several worked examples) and CoT prompting strategies reduce performance for thinking models.
According to the research mentioned above, OpenAI's o1-preview model performed worse when given few-shot examples. This contrasts with older models, where few-shot learning improved results. Thinking models are already designed to break down problems internally, so explicit step-by-step guidance can interfere with their reasoning.
✅ What is the capital of Canada?
---
❌ Example 1:
Q: What is the capital of France?
A: Paris
Example 2:
Q: What is the capital of Japan?
A: Tokyo
Now answer this: What is the capital of Canada?
For thinking models, zero-shot prompts worked better than few-shot prompts.
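If you want to verify this on your own tasks, a quick A/B harness is enough. The sketch below is purely illustrative: the questions and the `o3-mini` model name are placeholders, and with a reasoning model the zero-shot variant is typically all you need.

```python
# Quick A/B check: run the same question zero-shot and few-shot and compare answers.
from openai import OpenAI

client = OpenAI()

question = "What is the capital of Canada?"

zero_shot = question

few_shot = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    f"Q: {question}\nA:"
)

for label, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(label, "->", response.choices[0].message.content.strip())
```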
4. Use Thinking Models for Complex Multi-Step Tasks
Thinking models perform best on tasks that require five or more steps.
When solving problems with 3-5 steps, thinking models offered a slight improvement over standard models. For simpler tasks (fewer than 3 steps), performance may actually degrade compared to traditional LLMs, because they "overthink."
If a task is highly structured or simple, a regular LLM like GPT-4 may be a better choice.
✅ Break down the process of solving a complex physics problem involving momentum conservation. Explain each step clearly and logically.
---
❌ What is 2+2?
Tip 💡
To gauge how many steps a problem requires, run it through the web version of a reasoning model and count the reasoning steps it shows before answering.
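One way to automate that check is to ask a cheap standard model to estimate the step count, then route simple tasks to a regular LLM and genuinely multi-step tasks to a reasoning model. This is our own sketch, not a published recipe: the threshold of 5, the model names, and the estimation prompt are all illustrative assumptions.

```python
# Hypothetical router: estimate step count with a cheap model, then pick the model.
from openai import OpenAI

client = OpenAI()

def route(task: str) -> str:
    estimate = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for any cheap standard model
        messages=[{
            "role": "user",
            "content": "How many distinct reasoning steps does this task need? "
                       f"Reply with a single integer only.\n\nTask: {task}",
        }],
    )
    try:
        steps = int(estimate.choices[0].message.content.strip())
    except ValueError:
        steps = 5  # if the estimate is unparsable, assume the task is complex

    # 5+ steps: reasoning model; fewer: standard model (threshold is an assumption).
    model = "o3-mini" if steps >= 5 else "gpt-4o"
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return answer.choices[0].message.content

print(route("What is 2+2?"))
```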
5. Use Delimiters to Structure Prompts for Consistent Outputs
For regular LLMs, developers typically use delimiters like triple quotation marks, XML tags, or section titles to clearly define distinct sections of the input. This makes it easier for the model to interpret the information correctly.
Thinking models, however, struggle with structured outputs but can be guided to maintain consistency. If you need a structured response (e.g., JSON, tables, fixed formats), structure your prompt carefully.
✅ [Task: Summarize the following text]
Text: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
---
❌ Summarize this: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
If structured output is critical, you’re better off using a standard LLM instead of a thinking model.
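For the input side, delimiters are cheap to add in code. The sketch below uses XML-style tags, but triple quotes or section headers work just as well; the model name is again a placeholder.

```python
# Delimiters separate the instruction from the data so the model cannot confuse them.
from openai import OpenAI

client = OpenAI()

article = (
    "The mitochondrion is the powerhouse of the cell. It produces ATP, "
    "the energy currency of the cell, through cellular respiration."
)

prompt = (
    "<task>Summarize the text inside <text> in one sentence.</task>\n"
    f"<text>{article}</text>"
)

response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```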
6. Use Ensembling for Highly Complex Tasks
For high-stakes or complex problems, ensembling improves performance.
Ensembling involves running multiple prompts (either the same prompt multiple times or variations of the prompt) and aggregating the results. This approach increases accuracy but raises costs because multiple queries are required.
✅ # Prompt 1:
What are the primary causes of climate change? Provide a well-reasoned answer.
# Prompt 2:
Explain the major contributors to climate change, focusing on human activities and natural factors.
# Prompt 3 (aggregation):
Explain what causes climate change.
<Context>
[Response 1 + Response 2]
</Context>
---
❌ What causes climate change? Answer in a single response.
While ensembling boosts performance, it’s expensive and should only be used when high accuracy is critical.
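In code, ensembling is a fan-out followed by an aggregation call. The sketch below mirrors the prompt pattern above; the prompt variants, the model name, and the aggregation wording are illustrative, and every extra call adds cost.

```python
# Ensembling sketch: fan out several phrasings, then aggregate the candidate answers.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

variants = [
    "What are the primary causes of climate change? Provide a well-reasoned answer.",
    "Explain the major contributors to climate change, focusing on human activities and natural factors.",
]

candidates = [ask(p) for p in variants]

context = "\n\n".join(
    f"<response_{i + 1}>\n{c}\n</response_{i + 1}>" for i, c in enumerate(candidates)
)

final = ask(
    "Explain what causes climate change.\n\n"
    "Use the draft responses below as context and reconcile any differences:\n"
    f"{context}"
)

print(final)
```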
Observe your thinking models with Helicone 💡
Get full visibility into your thinking models with Helicone. Get started in 5 minutes.
Conclusion
Prompting thinking models like DeepSeek R1 and OpenAI o1 and o3 requires a different mindset and approach. In summary:
- Minimal, clear prompts work best.
- Forcing few-shot examples or step-by-step reasoning reduces effectiveness.
- When you must get a structured output, use standard LLMs instead.
- For even better responses, ensembling is a powerful (though costly) technique.
With these guidelines, you can optimize your interactions with thinking models and get the best possible responses. Enjoy!
You might find these useful:
- DeepSeek R1: New Open-Source MoE Model
- OpenAI o3 Released: Benchmarks and Comparison to o1
- How to Test Your LLM Prompts
- Building a Simple Chatbot with OpenAI Structured Outputs
Questions or feedback?
Is the information out of date? Please raise an issue or contact us; we'd love to hear from you!