Shuffle and Sampling - 随机抽样和洗牌

洗牌算法

Given an array, write a program to generate a random permutation of array elements. This question is also asked as “shuffle a deck of cards” or “randomize a given array”.

题解

这里以 Fisher–Yates shuffle 算法为例,伪代码如下:

  1. To shuffle an array a of n elements (indices 0..n-1):
  2. for i from 0 downto i do
  3. j random integer such that 0 j i
  4. exchange a[j] and a[i]

转化为代码为:

  1. /*
  2. * shuffle cards
  3. */
  4. public static void shuffleCard(int[] cards) {
  5. if (cards == null || cards.length == 0) return;
  6. Random rand = new Random();
  7. for (int i = 0; i < cards.length; i++) {
  8. int k = rand.nextInt(i + 1); // 0~i (inclusive)
  9. int temp = cards[i];
  10. cards[i] = cards[k];
  11. cards[k] = temp;
  12. }
  13. }

看了算法和代码后让我们来使用归纳法简单证明下这个洗牌算法的正确性。我们要证明的问题是:数组中每个元素在每个索引处出现的概率均相等。

对于单个元素来讲,以上算法显然正确,因为交换后仍然只有一个元素。接下来我们不妨假设其遍历到数组索引为i-1时满足随机排列特性,那么当遍历到数组索引为i时,随机数ki的概率为1/i, 为0~i-1的概率为(i-1)/i. 接下来与索引为i的值交换,可以得知card[i]出现在索引i的位置的概率为1/i, 在其他索引位置的概率也为1/i; 而对于card[i]之前的元素,以索引j处的元素card[j]为例进行分析可知其在位置j的概率为1/(i-1) * (i-1)/i = 1/i, 具体含义为遍历到索引i-1card[j]位于索引j的概率(1/(i-1))乘以遍历到索引i时随机数未选择与索引j的数进行交换的概率((i-1)/i).

需要注意的是前面的j <= i-1, 那么card[j]位于索引i的概率又是多少呢?要位于索引i,则随机数k须为i, 这种概率为1/i.

综上,以上算法可以实现完美洗牌(等概率)。

Random sampling - 随机抽样

随机抽样也称为水池抽样,Randomly choosing a sample of k items from a list S containing n items. 大意是从大小为 n 的数组中随机选出 m 个整数,要求每个元素被选中的概率相同。

题解

比较简洁的有算法 Algorithm R, 伪代码如下:

  1. /*
  2. S has items to sample, R will contain the result
  3. */
  4. ReservoirSample(S[1..n], R[1..k])
  5. // fill the reservoir array
  6. for i = 1 to k
  7. R[i] := S[i]
  8. // replace elements with gradually decreasing probability
  9. for i = k+1 to n
  10. j := random(1, i) // important: inclusive range
  11. if j <= k
  12. R[j] := S[i]

转化为代码为:

  1. /*
  2. * random sample
  3. */
  4. public static int[] randomSample(int[] nums, int m) {
  5. if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
  6. int[] sample = new int[m];
  7. for (int i = 0; i < m; i++) {
  8. sample[i] = nums[i];
  9. }
  10. Random random = new Random();
  11. for (int i = m; i < nums.length; i++) {
  12. int k = random.nextInt(i + 1); // 0~i(inclusive)
  13. if (k < m) {
  14. sample[k] = nums[i];
  15. }
  16. }
  17. return sample;
  18. }

和洗牌算法类似,我们要证明的问题是:数组中每个元素在最终采样的数组中出现的概率均相等且为m/n. 洗牌算法中是排列,而对于随机抽样则为组合。

维基百科上的证明相对容易懂一些,这里我稍微复述下。首先将数组前 m 个元素填充进新数组sample, 然后从m开始遍历直至数组最后一个索引。随机数k的范围为0~i, 如果k < m,新数组的索引为 k 的元素则和原数组索引为i的元素交换;如果k >= m, 则不进行交换。i == m时,以原数组中第j个元素为例,它被nums[m]替换的概率为1/(m+1), 也就是说保留在sample数组中的概率为m/(m+1). 对与第m+1个元素nums[m]来说,其位于sample数组中的概率则为m*1/(m+1)(可替换 m 个不同的元素).

接下来仍然使用数学归纳法证明,若i遍历到r时,其之前的元素出现的概率为m/(r-1), 那么其之前的元素中任一元素nums[j]被替换的概率为m/r * 1/m = 1/r, 不被替换的概率则为(r-1)/r. 故元素nums[j]i遍历完r后仍然保留的概率为m/(r-1) * (r-1)/r = m/r. 而对于元素nums[r]来说,其要被替换至sample数组中的概率则为m/r(随机数小于m 的个数为 m).

综上,以上算法在遍历完长度为 n 的数组后每个数出现在最终sample数组中的概率都为m/n.

Implementation and Test case

Talk is cheap, show me the code!

Java

  1. import java.util.*;
  2. import java.util.Random;
  3. public class Probability {
  4. public static void main(String[] args) {
  5. int[] cards = new int[10];
  6. for (int i = 0; i < 10; i++) {
  7. cards[i] = i;
  8. }
  9. // 100000 times test
  10. final int times = 100000;
  11. final int m = 5;
  12. int[][] count = new int[cards.length][cards.length];
  13. int[][] count2 = new int[cards.length][m];
  14. for (int i = 0; i < times; i++) {
  15. shuffleCard(cards);
  16. shuffleTest(cards, count);
  17. int[] sample = randomSample(cards, m);
  18. shuffleTest(sample, count2);
  19. }
  20. System.out.println("Shuffle cards");
  21. shufflePrint(count);
  22. System.out.println();
  23. System.out.println("Random sample");
  24. shufflePrint(count2);
  25. }
  26. /*
  27. * shuffle cards
  28. */
  29. public static void shuffleCard(int[] cards) {
  30. if (cards == null || cards.length == 0) return;
  31. Random rand = new Random();
  32. for (int i = 0; i < cards.length; i++) {
  33. int k = rand.nextInt(i + 1);
  34. int temp = cards[i];
  35. cards[i] = cards[k];
  36. cards[k] = temp;
  37. }
  38. }
  39. /*
  40. * random sample
  41. */
  42. public static int[] randomSample(int[] nums, int m) {
  43. if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
  44. m = Math.min(m, nums.length);
  45. int[] sample = new int[m];
  46. for (int i = 0; i < m; i++) {
  47. sample[i] = nums[i];
  48. }
  49. Random random = new Random();
  50. for (int i = m; i < nums.length; i++) {
  51. int k = random.nextInt(i + 1);
  52. if (k < m) {
  53. sample[k] = nums[i];
  54. }
  55. }
  56. return sample;
  57. }
  58. /*
  59. * nums[i] = j, num j appear in index i ==> count[j][i]
  60. */
  61. public static void shuffleTest(int[] nums, int[][] count) {
  62. if (nums == null || nums.length == 0) return;
  63. for (int i = 0; i < nums.length; i++) {
  64. count[nums[i]][i]++;
  65. }
  66. }
  67. /*
  68. * print shuffle test
  69. */
  70. public static void shufflePrint(int[][] count) {
  71. if (count == null || count.length == 0) return;
  72. // print index
  73. System.out.print(" ");
  74. for (int i = 0; i < count[0].length; i++) {
  75. System.out.printf("%-7d", i);
  76. }
  77. System.out.println();
  78. // print num appear in index i in total
  79. for (int i = 0; i < count.length; i++) {
  80. System.out.print(i + ": ");
  81. for (int j = 0; j < count[i].length; j++) {
  82. System.out.printf("%-7d", count[i][j]);
  83. }
  84. System.out.println();
  85. }
  86. }
  87. }

以十万次试验为例,左侧是元素i, 列代表在相应索引位置出现的次数。可以看出分布还是比较随机的。

  1. Shuffle cards
  2. 0 1 2 3 4 5 6 7 8 9
  3. 0: 10033 9963 10043 9845 9932 10020 9964 10114 10043 10043
  4. 1: 9907 9951 9989 10071 10059 9966 10054 10023 10015 9965
  5. 2: 10042 10046 9893 10080 10050 9994 10024 9852 10098 9921
  6. 3: 10039 10023 10039 10024 9919 10057 10188 9916 9907 9888
  7. 4: 9944 9913 10196 10059 9838 10205 9899 9945 9850 10151
  8. 5: 10094 9971 10054 9958 10022 9922 10047 9978 9965 9989
  9. 6: 9995 10147 9824 10015 10023 9804 10050 10192 9939 10011
  10. 7: 9941 10131 9902 9920 10040 10121 10010 9928 9984 10023
  11. 8: 10010 9926 9883 10098 10083 10028 9801 9936 10200 10035
  12. 9: 9995 9929 10177 9930 10034 9883 9963 10116 9999 9974
  13. Random sample
  14. 0 1 2 3 4
  15. 0: 9966 10026 10078 9966 9891
  16. 1: 9958 9806 10066 10022 10039
  17. 2: 9923 9936 9964 10051 10083
  18. 3: 10165 10088 10184 9928 9916
  19. 4: 9998 9990 9973 9931 9832
  20. 5: 10026 9932 9873 10085 10035
  21. 6: 9942 9972 9990 10030 10026
  22. 7: 9903 10153 9997 10051 10044
  23. 8: 10082 10066 9804 9899 10147
  24. 9: 10037 10031 10071 10037 9987

Reference