Insights·2026-06-11

Why Doesn't ChatGPT Remember Your Conversations?

ChatGPT and other LLMs do not remember conversations. Each turn, the entire previous dialogue is sent again, and the model reads that bundle like text it has never seen before, simply predicting the next token. This is why the context window fills up and costs balloon as conversations grow longer. From the tokenizer's inefficiency with Korean to hallucination, temperature, and system prompts, all of it can be verified on screen with just a few lines of code.

Why Does the Conversation Seem to Continue?

The secret is in the messages array. The core of an API call is a message list made up of three roles: system, user, and assistant. To build a multi-turn conversation, every question and answer so far must be bundled and resent on every turn. It is a structure where you retell the whole conversation from the beginning to someone you just met, every single time. That is why, as conversations grow longer, history management becomes a core design task in practice, in terms of both context window and cost.

What Do You See When You Open the Tokenizer?

You see Korean's structural disadvantage. 'Annyeong' splits into two tokens and 'Annyeonghaseyo, eotteoke jinaeseyo?' into eight, while 'How are you?' takes only six. Measuring the same text as a token-to-character ratio, Korean lands around 0.47-0.75 while English sits around 0.13-0.26. Even with the same context window size, Korean can hold less content. If you are planning a Korean-language AI service, this is a constraint you must build in from the starting line.

Why Does the Model Plausibly Describe a Paper That Doesn't Exist?

Ask about a fictional 2019 journal paper on Korean sentiment analysis, and the model says it cannot name the authors yet plausibly invents the paper's main contributions. For a next-token predictor, continuing with plausible tokens comes more naturally than admitting it does not know. The knowledge cutoff shows up in the same place: ask for today's date and the exchange rate, and the answer reveals knowledge frozen at October 2023. A service that needs real-time information cannot rely on the model alone; it needs complementary structures like RAG or tool use.

Same Model, Same Input, So Why Do the Answers Diverge?

Because temperature reshapes the probability distribution. Run the same sentence three times at 0.1 and you get nearly identical answers; raise it to 1.8 and a completely different sentence begins every time. You can control this directly: keep it low when consistency matters, as in code generation, and higher when you need ideas. System prompts carry the same weight. Swap a single line among 'friendly science teacher,' 'physics PhD,' and 'Socratic educator,' and the same black-hole question returns an analogy, equations, and counter-questions respectively. One line of prompt redefines the model's entire behavior.

Why Do Non-Developers Need Code Demos Too?

Because hearing an explanation and seeing it on screen carry different weight. Someone who has watched hallucination happen attaches verification steps instead of blaming the model, and someone who has seen the full history retransmitted every turn starts treating context management in long conversations as a design task. This is why SH Consulting insists on showing the tokenizer and live API calls even to practitioners with no programming experience in its AX training. A person who has seen how the machine works once handles the tool at a different depth than one who has only heard about it.