Post by Beerhunter on Wed Dec 22, 2010 2:49 am

I've been pondering this for a while and haven't come to a firm conclusion about how to do it, so I'm asking those people with a statistics background - help!

The limitations are

* the data held centrally is encrypted and hence cannot be queried to get any price information (this is done to avoid the issue of replicating other sites' databases). However, each user could calculate various stats based on the properties they have looked at (say over the last month), and send them to property bee to generate an overall index.

* at the moment, it's got to be one overall index for the whole country (or per site), not for areas; we don't have the location information or the user numbers to drill down yet

* I don't see how replicating what everyone else does for house price indexes helps, so let's be radical!

* we can't determine individual property prices centrally (as this would be replicating databases), but I feel a user can derive information from a group of properties and send that to us.

What we can do

* generate stats from 10,000+ individual users each month in real-time

* willingness to try something new and run several theories in parallel based on the data collected, e.g. the toolbar could (it doesn't at the moment) provide mean/median/variance/quantiles for each user's data, which could be interpreted in several different ways at the server

* provide stats from Jan 2008 onwards, so you can test your theories using past data and compare it to other indexes

The question (as my stats knowledge is limited to the basics of modes/medians/standard deviations etc.) is: how could stats from individual users be combined in a meaningful way to generate a real house price index, and is it even possible?

I have a few ideas, but will hold back whilst everyone else gives it some thought ;)

Post by rlph on Thu Dec 23, 2010 11:03 am

Possible? Yes, but not exactly trivial:
  • There would need to be a way of not double-counting properties across multiple users (or down-weighting ones watched by many users), and differently weighting different users' stats based on the number/total weight of properties they follow, lest the results get skewed.
  • Strictly speaking, median/quantile information can't be computed exactly in a distributed manner. However, assuming prices closely follow a lognormal distribution (i.e. ln(price) is normally distributed), the mean and variance of ln(price) can be computed from per-user sums, so price percentiles can be inferred from those.
Can expand on the hows and whys of the above if you like. (Disclaimer: I took one semester of probability and statistics at uni, and have since forgotten most of it. :D)
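
To make the lognormal idea concrete, here is a minimal Python sketch, not a tested design. It assumes each toolbar sends only three numbers (total weight, weighted sum of ln(price), weighted sum of ln(price) squared) and that the server simply adds them up. The function names and the optional per-property weights are hypothetical, and the whole thing is only as good as the lognormal assumption.

```python
import math
from statistics import NormalDist

def user_summary(prices, weights=None):
    """Run on the user's machine; only these three numbers are sent."""
    if weights is None:
        weights = [1.0] * len(prices)  # assumption: every property counts once
    logs = [math.log(p) for p in prices]
    total_weight = sum(weights)
    sum_log = sum(w * x for w, x in zip(weights, logs))
    sum_log_sq = sum(w * x * x for w, x in zip(weights, logs))
    return (total_weight, sum_log, sum_log_sq)

def combine(summaries):
    """Run on the server; returns mean and std deviation of ln(price)."""
    total_weight = sum(s[0] for s in summaries)
    sum_log = sum(s[1] for s in summaries)
    sum_log_sq = sum(s[2] for s in summaries)
    mu = sum_log / total_weight
    variance = max(sum_log_sq / total_weight - mu * mu, 0.0)  # guard rounding
    return mu, math.sqrt(variance)

def price_percentile(mu, sigma, q):
    """Price at quantile q (0 < q < 1), valid only if prices are lognormal."""
    z = NormalDist().inv_cdf(q)
    return math.exp(mu + sigma * z)

# Made-up prices for two users, purely for illustration.
mu, sigma = combine([user_summary([100000, 200000]), user_summary([400000])])
median_price = price_percentile(mu, sigma, 0.5)
```

Because the three sums are additive, the server gets the same answer as if it had seen every price itself. One possible way to handle double counting would be to give a property watched by k users a weight of 1/k, but that assumes each toolbar can somehow learn k.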

Myself, rather than yet another property price index I'd be more interested in "alternative" stats reflecting market sentiment, like:
  • Percentage of monitored properties whose asking price is lowered (or raised - yes it happens) in any given month, and by what percentage
  • Average percentage by which asking price is reduced before selling
  • Average length of time on market
  • Number of "false starts" properties go through before selling - i.e. status goes from Available to Under Offer/Sold STC then back to Available as offer doesn't pan out

Post by Beerhunter on Fri Dec 31, 2010 11:52 am

Hi rlph,

I agree about the more interesting stats, and have started work on adding code to the toolbar to find:

* the date/price the property was discovered
* the date/price the property was sold
* the date/price the property was withdrawn

from which it's easy to calculate

* percentage drop in price before being sold/withdrawn
* time on market before being sold/withdrawn

In addition I've added code to calculate

* the number of false starts (i.e. the number of times the property has gone from Sold STC to Available - at the moment I'm ignoring Under Offer as I think some agents are a bit lazy and don't update the status as often as perhaps they should)
* the number of price changes
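
As a rough illustration of the calculations listed above (this is not the toolbar's actual code), here is a Python sketch. It assumes a property's history is a list of (date, status, price) tuples, oldest first; that layout, the status strings and the function name `summarise` are all assumptions.

```python
from datetime import date

def summarise(history):
    """history: list of (date, status, price) tuples, oldest first."""
    first_date, _, first_price = history[0]
    last_date, last_status, last_price = history[-1]
    false_starts = 0
    price_changes = 0
    # Walk through consecutive pairs of observations.
    for (_, prev_status, prev_price), (_, status, price) in zip(history, history[1:]):
        # Only Sold STC -> Available counts as a false start; Under Offer is ignored.
        if prev_status == "Sold STC" and status == "Available":
            false_starts += 1
        if price != prev_price:
            price_changes += 1
    return {
        "final_status": last_status,
        "pct_drop": 100.0 * (first_price - last_price) / first_price,
        "days_on_market": (last_date - first_date).days,
        "false_starts": false_starts,
        "price_changes": price_changes,
    }

# Invented history for one property, purely for illustration.
example = summarise([
    (date(2010, 1, 4), "Available", 250000),
    (date(2010, 3, 1), "Available", 240000),
    (date(2010, 4, 12), "Sold STC", 240000),
    (date(2010, 5, 10), "Available", 240000),
    (date(2010, 6, 7), "Available", 225000),
    (date(2010, 7, 5), "Sold STC", 225000),
])
```

For this invented history the property drops 10% from its first asking price, spends 182 days on the market, and has one false start and two price changes.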

Even if this data isn't shared, it's still useful - either by displaying it in the sidebar and/or by changing the csv export from the current (and horrendous) multiple-rows-per-property format into 1 row per property with more of a summary feel.
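
A one-row-per-property export might be produced along these lines. This is only a Python sketch: the column names (`property_id`, `date`, `status`, `price`) are hypothetical and are not the real export's headers.

```python
import csv
import io
from collections import defaultdict

def one_row_per_property(rows):
    """Collapse many event rows into one summary dict per property."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["property_id"]].append(row)
    summary = []
    for property_id, events in grouped.items():
        events.sort(key=lambda r: r["date"])  # assumes ISO dates, which sort as text
        prices = [int(e["price"]) for e in events]
        summary.append({
            "property_id": property_id,
            "first_seen": events[0]["date"],
            "first_price": prices[0],
            "last_seen": events[-1]["date"],
            "last_price": prices[-1],
            "last_status": events[-1]["status"],
            "price_changes": sum(a != b for a, b in zip(prices, prices[1:])),
        })
    return summary

# Invented sample data in the assumed multi-row layout.
sample = """property_id,date,status,price
A1,2010-01-04,Available,250000
A1,2010-03-01,Available,240000
B2,2010-02-10,Available,180000
A1,2010-04-12,Sold STC,240000
"""
rows = one_row_per_property(csv.DictReader(io.StringIO(sample)))
```

The result can be written back out with `csv.DictWriter`, using the dictionary keys as the header row.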

I was wondering how many users were viewing the same properties (after all, that's the whole idea of working in a bee!), and plotted a graph:

[Attachment: number_of_users_per_property.png]

* there are just under 2 million properties (30.1%) which have only been viewed by one user
* one property has been viewed by 667 users (I guess it was posted on a forum somewhere!)
* on average, a property is viewed by 2.11 users
* 69.9% of properties have been viewed by more than one user
* 12.5% of properties have been viewed by more than 10 users
* 1.0% of properties have been viewed by more than 34 users
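
For anyone wanting to reproduce figures like these from their own counts, here is a small Python sketch. It assumes the data is available as a histogram mapping "number of users" to "number of properties"; the function name is hypothetical and the toy numbers below are made up, not the real distribution.

```python
def viewer_stats(histogram, thresholds=(1, 10, 34)):
    """histogram: {users_viewing: number_of_properties}."""
    total_properties = sum(histogram.values())
    total_views = sum(users * count for users, count in histogram.items())
    # Percentage of properties viewed by MORE than each threshold of users.
    pct_more_than = {
        t: 100.0 * sum(c for users, c in histogram.items() if users > t) / total_properties
        for t in thresholds
    }
    return {
        "properties": total_properties,
        "mean_viewers": total_views / total_properties,
        "pct_more_than": pct_more_than,
    }

# Toy histogram: 3 properties seen by 1 user, 4 by 2 users, and so on.
stats = viewer_stats({1: 3, 2: 4, 12: 2, 40: 1})
```

With the toy histogram, the mean is 7.5 users per property and 70% of properties have been viewed by more than one user.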

