Comparing apples with oranges – mistakes with analyses

This blog post is part 4 of a “Beyond the default rate” series, which investigates what the default rate actually means, how it affects you and why you should know more than simply the number you see on some report.

If you haven’t yet read the previous posts in the series, please do so before reading this one. This post builds on the foundation set in those posts.

One of the most common mistakes I’ve noticed in the analyses posted and discussed on the Bondora forums could be described as comparing apples with oranges. It looks as if you have found a significant result, but in reality the result is simply an artifact of an underlying mistake in sample selection.

This mistake has made all of these analyses on other blogs at least partly invalid, and simply using their results as a basis for investments is not the brightest idea.

While it is great to see new people taking a crack at the data for fun and for higher profits, I want to highlight the typical mistakes in order to avoid the spread of false results and of bad investment decisions based on them.

Starting your analyses with the proper question

Mostly this issue happens when people take the data and quickly run different analyses on it without starting any of them with a proper question in mind.

For example, let’s say you wanted to know whether loan consolidation is a higher risk loan than others or not. What would your question here be?

If it’s simply “Is loan consolidation a higher risk loan?”, then that’s that. Based on the analysis I did a few weeks ago for a person who asked exactly this question, I can tell you that you’ll arrive at the answer that yes, it is riskier than most (unless, of course, you use professional statistical analysis tools, which might actually give you the correct result by accounting for other factors too).

[Figure: default rate for loan consolidation]

However, if you actually think the question through and realize that you perhaps only invest in the A1000 group, then you’d see that for this analysis you should exclude all loans that are not A1000 from the sample.

And guess what? The results suddenly show that there’s not much of a difference for loan consolidation: its default rate is somewhere in the middle, along with most of the other loan purposes.
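To make the effect concrete, here is a minimal sketch in Python. The loans are entirely made up for illustration (the "OTHER" group, the purposes and all counts are my assumptions, not Bondora data). It calculates the default rate per loan purpose once on the whole sample and once on A1000 loans only.

```python
from collections import defaultdict

def make_loans(group, purpose, total, defaults):
    # Build `total` toy loans of which the first `defaults` are marked as defaulted.
    return [(group, purpose, i < defaults) for i in range(total)]

# Made-up toy sample: (credit_group, purpose, defaulted). Not real Bondora data.
loans = (
    make_loans("A1000", "loan consolidation", 20, 2)
    + make_loans("A1000", "other", 80, 8)
    + make_loans("OTHER", "loan consolidation", 80, 32)  # hypothetical riskier group
    + make_loans("OTHER", "other", 20, 8)
)

def default_rate_by_purpose(rows):
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for _group, purpose, defaulted in rows:
        totals[purpose] += 1
        defaults[purpose] += int(defaulted)
    return {purpose: defaults[purpose] / totals[purpose] for purpose in totals}

# Question 1: "Is loan consolidation a higher risk loan?" (whole sample)
all_rates = default_rate_by_purpose(loans)

# Question 2: "... within the group I actually invest in?" (A1000 only)
a1000_rates = default_rate_by_purpose([row for row in loans if row[0] == "A1000"])

print(all_rates)
print(a1000_rates)
```

In this toy sample loan consolidation looks more than twice as risky as the rest when all loans are pooled, purely because most consolidation loans sit in the riskier group. Within A1000 the two purposes default at the same rate.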

The same thing happens with the analysis about interest rates highlighted on the BTL Finder blog. Because a given payment history group generally contains more higher-risk borrowers, it is also assigned a higher interest rate. So while it may seem as if a higher interest rate causes higher risk, it is often the other way around: higher-risk groups are more likely to be assigned a higher interest rate. This type of analysis then becomes meaningless and unprofitable, because you would steer clear of quality borrowers who are overpaying for their loans simply because they pay a higher interest rate.

Time aspect in default rate

Since most of the default rate analyses people do are not based on yearly or monthly default rates, another important thing to look at is the time aspect of your sample. This is the main reason for the weird outcomes in the analyses linked at the beginning of the post.

For example, let’s take the first graph about verification type. If you remember the graph on when defaults happen, you’ll probably understand that the longer a certain group of loans has been on the market, the higher its default rate will become, because after a year or two even good quality borrowers might run into trouble every now and then.

For a while there was only one verification method with bank statements and the new ones were added in April of 2014 (if you don’t know when something happened, you can check the approximate timeline of changes at Bondora from this post).

However, it took around a month for loans with the different verification types to actually start coming in, and the new Income unverified option was added even later. In addition, I’m not entirely sure whether the oldest portion of the dataset was included in the analyses, but some of the verification methods were present in 2009-2011 and then absent for the next 3 years. This means that the other aspects that affect the default rate, discussed in previous posts, affected only certain portions of the sample and not the other verification methods.

As a result, you end up with a very weird result that is unlikely to be an accurate reflection of what’s really going on.

The countries

I know that the previous example can be somewhat confusing, because the verification methods existed at one point, were then removed and were later re-added with a new and different process.

So let’s look at the country part of the analyses, because here it is a bit easier to explain the effect.

Brett at BTL Finder blog came to the conclusion that the default rate is lowest in Finland, then Spain, then Slovakia, and highest in Estonia. If you’ve ever received any newsletter and opened it, you might be a bit skeptical of this result, and you have reason to be.

The first thing to do here is to remind yourself when each of the markets was added:

  • Estonia – 2009
  • Finland – July 2013
  • Spain – October 2013
  • Slovakia – February 2014

Now remind yourself that the longer ago a group of loans was issued, the higher its default rate will be compared with a similar group that has been on the market for less time (in other words, EST A1000 loans issued in 2012 have in total a higher default rate than EST A1000 loans issued in 2013 or 2014).

This means that even if all the markets contained loans of exactly equal risk, this calculation would still show different default rates, ordered from highest to lowest: Estonia, Finland, Spain and Slovakia.

The result would then suggest that Slovakia is the lowest risk, Spain the second lowest, then Finland, and that Estonia is the worst place to invest in. However, that would simply be the result of comparing apples with oranges; in that scenario each market would in reality carry equal risk.
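The equal-risk scenario can be simulated with a rough sketch. Everything numeric here is an assumption for illustration only: every loan in every country defaults with the same 1% probability per month, each market issues one equally sized batch of loans per month since launch, loans never mature, the data is downloaded in January 2015, and Estonia launched in January 2009 (only the year is given above).

```python
from datetime import date

MONTHLY_DEFAULT_PROB = 0.01  # assumed, identical for every loan in every country
SNAPSHOT = date(2015, 1, 1)  # assumed date on which the dataset is downloaded

# Launch months from the list above; Estonia's month is an assumption.
launches = {
    "Estonia": date(2009, 1, 1),
    "Finland": date(2013, 7, 1),
    "Spain": date(2013, 10, 1),
    "Slovakia": date(2014, 2, 1),
}

def months_between(start, end):
    return (end.year - start.year) * 12 + (end.month - start.month)

def observed_default_rate(launch):
    # One equally sized batch issued every month since launch, so the
    # batches are 1, 2, ..., n months old at the snapshot date.
    ages = range(1, months_between(launch, SNAPSHOT) + 1)
    # Share of a batch that has defaulted at least once after `age` months.
    shares = [1 - (1 - MONTHLY_DEFAULT_PROB) ** age for age in ages]
    return sum(shares) / len(shares)

rates = {country: observed_default_rate(launch) for country, launch in launches.items()}
ranking = sorted(rates, key=rates.get, reverse=True)

print(ranking)
```

Although the per-month risk is identical by construction, the ranking from highest to lowest observed default rate simply follows the age of each market.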

In reality, of course, there are different risk levels in each of the countries (some countries have a much higher proportion of riskier loans than others), but with this analysis you would never know what the difference is.

Doing it properly

There would be several ways to do this analysis properly and I’ll highlight 2 possibilities:

Wait for more data

One option is to wait for more data to come in, so that you can compare the numbers without too many extra manipulations of the dataset. For example, if both countries have had loans issued for the last 3 years, you’ll likely have enough data for the results to be comparable.

Of course, this is not always an option for such a young and volatile industry as peer-to-peer lending.

Adjust your sample range

Thus the more practical option would be to adjust the range of your sample to match that of the other group.

In the example of analysing Estonian and Finnish loans, you’d want to exclude any loans issued before a decent number of Finnish loans started being issued. Let’s say that Finnish loans started coming in at the end of August 2013.

In this case you would exclude ALL loans issued before the end of August 2013 from your analysis and then calculate the default rates for Finland and Estonia.

This would give you the default rates for both countries for loans issued within the same timeframe.
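As a minimal sketch of that filtering step, assuming the data is available as simple (country, issue date, defaulted) rows; the rows below are made up for illustration and are not Bondora data:

```python
from datetime import date

CUTOFF = date(2013, 8, 31)  # the "end of August 2013" assumption from the example

# Made-up rows: (country, issue_date, defaulted). Not real Bondora data.
loans = [
    ("Estonia", date(2011, 5, 10), True),
    ("Estonia", date(2012, 3, 2), True),
    ("Estonia", date(2013, 9, 15), False),
    ("Estonia", date(2014, 1, 20), True),
    ("Finland", date(2013, 9, 1), False),
    ("Finland", date(2013, 11, 5), True),
    ("Finland", date(2014, 2, 14), False),
    ("Finland", date(2014, 4, 1), False),
]

def default_rates(rows):
    counts = {}
    for country, _issued, defaulted in rows:
        total, bad = counts.get(country, (0, 0))
        counts[country] = (total + 1, bad + int(defaulted))
    return {country: bad / total for country, (total, bad) in counts.items()}

# Apples with oranges: every loan ever issued in each country.
naive = default_rates(loans)

# Same timeframe: only loans issued on or after the cutoff.
same_window = default_rates([row for row in loans if row[1] >= CUTOFF])

print(naive)
print(same_window)
```

With real data the same filter would be applied to the issue date column of the downloaded dataset before calculating anything else.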

To take this one step further and avoid skewed results (like in the loan consolidation example above), you may also exclude loan groups that you don’t invest in anyway. Different countries might have different proportions of these groups, and this can lead to conclusions that aren’t comparable.

Now if you want to compare Finland, Estonia and Slovakia, you would have to again exclude all the loans issued before Slovakian loans started coming in…
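One possible way to pick the cutoff from the data itself is sketched below. It treats the first month in which a country has at least a chosen number of loans as its real starting point, and then uses the latest of those months across all compared countries. The threshold and the issue dates are made-up assumptions for illustration.

```python
from collections import Counter
from datetime import date

MIN_LOANS_PER_MONTH = 2  # assumed threshold for "a decent amount"; tune for real data

# Made-up issue dates per country. Not real Bondora data.
issue_dates = {
    "Estonia": [date(2012, 1, 5), date(2012, 1, 20), date(2013, 9, 3), date(2013, 9, 9)],
    "Finland": [date(2013, 7, 30), date(2013, 9, 2), date(2013, 9, 18)],
    "Slovakia": [date(2014, 2, 27), date(2014, 3, 4), date(2014, 3, 21)],
}

def first_busy_month(dates, min_loans):
    # Count loans per (year, month) and return the earliest month that
    # reaches the threshold. Raises ValueError if no month reaches it.
    per_month = Counter((d.year, d.month) for d in dates)
    busy = [month for month, count in per_month.items() if count >= min_loans]
    year, month = min(busy)
    return date(year, month, 1)

def common_cutoff(countries):
    # The latest starting point among the compared countries.
    return max(first_busy_month(issue_dates[c], MIN_LOANS_PER_MONTH) for c in countries)

two_way = common_cutoff(["Estonia", "Finland"])
three_way = common_cutoff(["Estonia", "Finland", "Slovakia"])

print(two_way, three_way)
```

Adding Slovakia to the comparison pushes the cutoff later, which is exactly why each extra group shrinks the usable sample.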

This is not always possible with all analyses (sometimes there’s not enough data, or different groups were issued in different periods rather than simultaneously), but at a minimum you should account for the fact that this can affect your results, sometimes significantly.


While it is possible to do different types of analyses with simple tools like Excel and some knowledge of statistical analysis, it is also quite easy to come to incorrect conclusions by making small mistakes in the sample selection.

That’s why you need to:

a) be aware that you need to account for these kinds of things in the data and

b) probably have a reference guide on what kind of changes have happened over time (especially if you’re new to the platform) and then you can probably find the best cutoff points from the data itself.

As this is probably kind of a complicated topic to grasp for many people, I decided to leave the EAD1 vs EAD2 part for the next post in the series.


By Taavi

Taavi has been investing into P2P-lending platforms since 2010.

2 replies on “Comparing apples with oranges – mistakes with analyses”

First thing that comes to mind when looking at Estimated net return / Proportion. If I understand correctly, estimated return should be statistics from the past and not pre-assigned interest-rate without history? And secondly assuming proportion is percentage from all loans then: the lower the proportion (%) the higher mostly the net return is.

Purpose         Net return   Prop. %
Educ.           -0,7         2
Loan cons.      10,8         40
Car             11,9         12
Other           13,4         17
Home rep.       13,6         18
Real estate     14,3         2
Health          15,8         3
Entrepreneur    16,5         3
Travel          21,0         3

As you can see, the situation improves with lower proportion, with the only exceptions being Education and perhaps Car, because of the net cashflow of these people (they have more funds coming in and going out). Mostly young undergraduates do not possess funds yet, and the people who need this for transportation clearly are not buying a Porsche; instead they’re taking a gamble by fixing or buying a car, and maybe this will help them earn money.
And the people who buy homes, travel or for the company have probably more cash at hand. The health part I do not have an adequate explanation but presumably they are people of older age and perhaps more stable after reflecting back at their life from illness.

Interesting thought. Although I wouldn’t be so sure that there’s such a straightforward connection as fewer loans in a group = better return.

Could be simply that in certain smaller groups most loans are within the lower risk group for some rather random reason. Although of course, could be a good reason why they tend to be in that group too 🙂

Additionally in some cases it could just as well be that the numbers are so small that the result can be rather random.

However, your train of thought does make sense on an intuitive level, although I wouldn’t think that they take a gamble on buying a car and hoping it’ll increase their income. I’m pretty sure that most such people are simply buying a car because that’s what you do and that’d be comfortable (without realizing that owning a car is owning a liability in itself already).
