Insights·Jun 5, 2026·5 min read

A/B testing newsletter subject lines: the math, and when to skip

Most A/B testing guides skip the question that matters first: do you have the list size to detect anything? The sample-size math, the threshold at which a test pays, and what beats A/B testing under 5,000 subscribers.

Nashra research team

A/B testing newsletter subject lines is the most-repeated advice in email marketing. For most newsletters, it is also the wrong advice.

The reason is not philosophical. It is statistical. To detect whether one subject line beats another, you need enough opens on each variation to separate signal from noise. Most newsletters do not have that many subscribers. Until they do, A/B tests on subject lines mostly produce confident-looking conclusions about nothing.

This is what the math actually looks like, where the threshold sits, and what works better under it.

What an A/B test actually measures

An email A/B test splits the list into two random groups, sends each group a different subject line, and compares open rates. The version with the higher open rate is declared the winner.

That definition has one problem already. "Open rate" is not a clean measurement anymore. Since Apple Mail Privacy Protection (MPP) shipped in 2021, Apple Mail pre-loads tracking pixels through Apple's servers for users who have the feature turned on, regardless of whether a human ever opened the message. Apple Mail accounts for roughly half of tracked opens across the industry, per Litmus's email client share data.

The practical effect: a chunk of every "open" in your dashboard is a robot loading a pixel. The ratio can be steady enough that A/B tests still work, because the noise hits both variations equally. But raw open rate is no longer the truth it pretends to be. A 2025 send showing a 43% open rate is plausibly closer to 22% real human opens once you back out Apple MPP, per MailerLite's 2025 benchmark report.

That confound matters because it sets the floor for sample size. Smaller real signal, more data needed to detect a difference.
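The comparison at the top of this section, two groups and two open rates, is usually scored with a two-proportion z-test. A minimal sketch using only the Python standard library follows. The function name and the send figures are hypothetical, chosen to show how a gap that looks like a winner can sit well inside the noise on a small list.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(opens_a, sent_a, opens_b, sent_b):
    """Return (z, two-sided p-value) for the difference in open rates."""
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    # Pooled open rate under the null hypothesis that both subject lines perform the same.
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical send: 500 recipients per variation, 25% vs 28% open rate.
z, p = two_proportion_z_test(125, 500, 140, 500)
print(f"z = {z:.2f}, p = {p:.2f}")
```

In this made-up example the p-value comes out well above 0.05, so the 3-point gap cannot be told apart from chance at the usual 95% confidence level.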

The sample-size math

A/B test sample size is a function of three things: the baseline rate you are measuring against, the smallest difference you want to detect, and how confident you want to be that the difference is real rather than random.

The standard convention is 95% confidence, 80% power, two-tailed test. Plug those into a sample-size calculator and the rough rule that falls out is this:

  • If your real open rate is 25% and you want to detect a 5-point absolute lift (from 25% to 30%), you need roughly 1,200 recipients per variation to call it.
  • If you want to detect a 3-point lift (25% to 28%), you need roughly 3,400 recipients per variation.
  • If you want to detect a 1-point lift, you need roughly 30,000 recipients per variation.

These numbers come from a standard two-proportion test. You can verify them on Evan Miller's sample-size calculator.

Three points to take from this.

First, the number that counts is recipients in each variation, not the size of the whole list. If you split 4,000 subscribers in half, each variation gets 2,000 recipients, or about 500 real opens at a 25% open rate. By the same math, that is enough to detect a lift of about 4 points and underpowered for anything smaller.
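For readers who want to see the arithmetic rather than trust a calculator, here is a rough sketch of the normal-approximation formula for a two-sided two-proportion test. It is one common textbook form, not necessarily the exact formula any particular calculator uses, and the function name is made up for illustration. The n it returns counts the people in each group, which for an email test means recipients per variation.

```python
from math import ceil
from statistics import NormalDist


def recipients_per_variation(baseline, lift, alpha=0.05, power=0.80):
    """Approximate sample size per variation for a two-sided two-proportion test.

    n = (z_alpha/2 + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
    """
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


for lift in (0.05, 0.03, 0.01):
    n = recipients_per_variation(0.25, lift)
    print(f"{lift:.0%} lift from a 25% baseline: {n:,} per variation")
```

The results land within a few percent of the figures above. Calculators differ slightly in how they treat the variance term, so small disagreements between tools are expected.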

Second, the threshold for a test is set by the effect you are hoping to find. Most subject-line A/B tests find lifts in the 2 to 5 point range. The list size needed to detect that range starts in the low five figures, not the low four.

Third, the sample-size requirement is per variation. A three-way test needs that number in each of three cells, so it takes a list at least half again as large as a two-way test.
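The same formula can be run in reverse: given a list size, what is the smallest lift a test could plausibly detect? The sketch below does that by bisection, under the same assumptions as before (95% confidence, 80% power, two-sided, normal approximation) and reading the formula's n as recipients per variation. Function names are illustrative, and the three-way case ignores any correction for multiple comparisons.

```python
from statistics import NormalDist


def required_n(baseline, lift, alpha=0.05, power=0.80):
    """Recipients needed per variation to detect `lift` over `baseline`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p2 = baseline + lift
    return z ** 2 * (baseline * (1 - baseline) + p2 * (1 - p2)) / lift ** 2


def min_detectable_lift(n_per_variation, baseline):
    """Smallest absolute lift detectable with n recipients per variation."""
    lo, hi = 1e-4, 1 - baseline - 1e-4
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(baseline, mid) > n_per_variation:
            lo = mid  # this lift needs more recipients than we have; look at larger lifts
        else:
            hi = mid
    return hi


list_size = 4000
for cells in (2, 3):
    n = list_size // cells
    lift = min_detectable_lift(n, baseline=0.25)
    print(f"{cells}-way split: {n:,} per cell, detectable lift ~{lift * 100:.1f} points")
```

Under these assumptions, a 4,000-subscriber list split two ways resolves a lift of roughly 4 points, and splitting the same list three ways pushes that closer to 5.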

The threshold for a test that pays

Working backward from those numbers, a useful rule of thumb:

Roughly 5,000 active subscribers with a real open rate above 20% is the floor. That puts 2,500 recipients in each variation on a 50/50 split, enough to reliably detect a 5-point lift. Below that, you are running tests that mostly cannot tell you anything.

A more honest threshold for catching the smaller lifts subject-line tests actually produce (2 to 3 points) is closer to 20,000 active subscribers, which is consistent with the rule of thumb HubSpot lands on in its email A/B testing guide.

This is not a counsel of despair. It is the same logic you would apply to product analytics. With fifty visitors on your pricing page, you cannot A/B test the headline. With fifty thousand, you can. Newsletter subject-line tests follow the same curve. Until you have the volume, the test is theatre.

What to do under 5,000 subs

The instinct to test is correct. The mechanism is wrong. Under 5,000 subs, the better mechanism is judgment, not statistics.

Three subject-line moves that consistently outperform their alternatives, well established enough that you can adopt them as defaults without testing:

  • Specifics beat generics. "The three sentences a reader sees first" outperforms "How to write a great opening." Numbers, nouns, and named things win because they signal a concrete payoff.
  • A question that maps to the reader's own question beats a statement about you. "Why is your open rate dropping?" outperforms "Our new open-rate report." The first is about the reader. The second is about the sender.
  • Length should match what the inbox preview will do to it. Most opens happen on phones, where the preview cuts off around 35 to 40 characters. A short subject either fits whole or gets clipped at a useful word. A longer subject only earns its length when it leaves unfinished a thought the reader wants to finish.

None of these are A/B test results. They are patterns that fall out of how reading attention works on a glance.
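The length point is the easiest of the three to check before you hit send. Here is a small sketch that clips a subject line the way a phone preview might. The 40-character limit is taken from the range above, real clients clip by pixel width rather than character count, and the helper name and the third subject line are invented for illustration.

```python
def inbox_preview(subject, limit=40):
    """Return the subject as a phone inbox might show it, clipped at `limit` characters."""
    if len(subject) <= limit:
        return subject
    return subject[:limit].rstrip() + "..."


subjects = (
    "Why is your open rate dropping?",
    "The three sentences a reader sees first",
    "How to write a great opening for your next newsletter issue",  # hypothetical
)
for subject in subjects:
    print(f"{len(subject):>3}  {inbox_preview(subject)}")
```

The first two subjects fit whole. The third gets cut mid-thought, which is only worth it if the clipped part is something the reader wants to finish.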

The other move that pays at small list sizes: ask new subscribers to reply. A reply tells Gmail this sender matters to this person, which lifts your inbox placement for that recipient on every subsequent send. The downstream effect on your real open rate is larger than anything a subject-line A/B test will tell you, and it is also one of the few wins from our deliverability guide that compounds without any test setup at all.

What to test once you are past the threshold

Past 5,000 active subscribers, A/B testing earns its place. The order to attempt things in, by expected lift:

  1. Send time. A two- or three-hour shift in send time produces larger swings than most subject-line tweaks, because it changes where in the inbox stack your message lands.
  2. Sender name. A real first name from a person outperforms a brand name in almost every reader survey worth reading. Worth one definitive test, then a global change.
  3. Subject-line angle. Test angles, not phrasings. "A question vs a statement" gives you a lesson you can reapply. Three phrasings of the same angle give you noise.
  4. Subject-line length. Test once at scale. Apply the answer everywhere.

Once you have a stable answer on each of those, the marginal value of more subject-line tests drops fast. The lever moves to content: cadence, length, what you write about, whether the reader replies. None of those are A/B testable in a useful way at any list size below seven figures.
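Whichever of the four tests above you run, the split itself has to be random rather than alphabetical or by signup date, so the groups differ only in the thing being tested. A minimal sketch of that step, assuming your list is available as a plain Python list; the function name and the example addresses are placeholders, and most email platforms handle this for you.

```python
import random


def split_list(subscribers, cells=2, seed=None):
    """Shuffle subscribers and deal them into `cells` near-equal random groups."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(subscribers)
    rng.shuffle(shuffled)
    return [shuffled[i::cells] for i in range(cells)]


# Hypothetical list of 5,000 addresses.
subscribers = [f"reader{i}@example.com" for i in range(5000)]
group_a, group_b = split_list(subscribers, seed=42)
print(len(group_a), len(group_b))
```

Passing a seed is a design choice for auditability: you can regenerate the same groups later to check who received which version.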

The point of the test

Underneath the statistics is a simpler argument. The reason a subject line matters is the same reason every other writing decision matters: it is the one chance a reader gives you before deciding whether to read on. That chance compounds across the relationship. A subscriber converts roughly 10× better than a follower not because the subject line was perfect, but because the inbox is one of the last places left where a reader chose to be reached.

The right place to spend the time you would have spent A/B testing under 5,000 subs: writing the second email well enough that the first one matters. The infrastructure that makes that work, one editor, one list, one place where the open rate actually means something, is what we built Nashra's email newsletter for. The subscribers are the spine. The subject line is a knob you can tune properly once the system is in place.
