Sunday, December 2, 2012

Postgres random in subquery

There was a use-case in a project, where huge number of test data was required, but it had to meet a constraint for one property. The database schema looked like the following:

item:
item_namepricecolorstore
Item 132rednull
Item 23bluenull
Item 37yellownull

store:
store_idnamecoordinate
1Store 112°34′56″N 12°34′56″E
2Store 212°34′56″N 12°34′56″W
3Store 312°34′56″S 12°34′56″E
4Store 412°34′56″S 12°34′56″W

The requirement was that the tables need to be populated with approximately 1 million items, and these items should be randomly associated with the stores. The stores were previously defined, and the store table has already been populated with the data. The items were partially defined, only the store association was null.

The database was Postgres 8.3.

The first SQL attempt was the following:

update item set store = (select store_id from store order by random() limit 1) where item.store is null;
The result was:
item_namepricecolorstore
Item 132red2
Item 23blue2
Item 37yellow2

All items received the same store id. It seemed as if the evaluation of the random selection only happened once, and was cached for every record. This can happen because the inner selection does not have any dependencies, so it can be treated as a completely separate query.

The solution is that it has to depend on the outer query, even if the dependency is quite silly from the developer point of view:

update item t1 set t1.store  = (select t2.store_id from store where t1=t1 order by random() limit 1) item.store is null;

The dependency here is the where t1=t1 part. Using this query, the result was exactly what we wanted to achieve:

item_namepricecolorstore
Item 132red2
Item 23blue4
Item 37yellow1


No comments:

Post a Comment