Shuffle and Sampling - 隨機抽樣和洗牌

洗牌算法

Given an array, write a program to generate a random permutation of array elements. This question is also asked as “shuffle a deck of cards” or “randomize a given array”.

題解

這裡以 Fisher–Yates shuffle 演算法爲例,僞代碼如下:

  1. To shuffle an array a of n elements (indices 0..n-1):
  2. for i from 0 downto i do
  3. j random integer such that 0 j i
  4. exchange a[j] and a[i]

轉化爲代碼爲:

  1. /*
  2. * shuffle cards
  3. */
  4. public static void shuffleCard(int[] cards) {
  5. if (cards == null || cards.length == 0) return;
  6. Random rand = new Random();
  7. for (int i = 0; i < cards.length; i++) {
  8. int k = rand.nextInt(i + 1); // 0~i (inclusive)
  9. int temp = cards[i];
  10. cards[i] = cards[k];
  11. cards[k] = temp;
  12. }
  13. }

看了算法和代碼後讓我們來使用歸納法簡單證明下這個洗牌算法的正確性。我們要證明的問題是:數組中每個元素在每個索引處出現的機率均相等。

對於單個元素來講,以上算法顯然正確,因爲交換後仍然只有一個元素。接下來我們不妨假設其遍歷到數組索引爲i-1時滿足隨機排列特性,那麼當遍歷到數組索引爲i時,隨機數ki的機率爲1/i, 爲0~i-1的機率爲(i-1)/i. 接下來與索引爲i的值交換,可以得知card[i]出現在索引i的位置的機率爲1/i, 在其他索引位置的機率也爲1/i; 而對於card[i]之前的元素,以索引j處的元素card[j]爲例進行分析可知其在位置j的機率爲1/(i-1) * (i-1)/i = 1/i, 具體含義爲遍歷到索引i-1card[j]位於索引j的機率(1/(i-1))乘以遍歷到索引i時隨機數未選擇與索引j的數進行交換的機率((i-1)/i).

需要注意的是前面的j <= i-1, 那麼card[j]位於索引i的機率又是多少呢?要位於索引i,則隨機數k須爲i, 這種機率爲1/i.

綜上,以上算法可以實現完美洗牌(等機率)。

Random sampling - 隨機抽樣

隨機抽樣也稱爲水池抽樣,Randomly choosing a sample of k items from a list S containing n items. 大意是從大小爲 n 的數組中隨機選出 m 個整數,要求每個元素被選中的機率相同。

題解

比較簡潔的有算法 Algorithm R, 僞代碼如下:

  1. /*
  2. S has items to sample, R will contain the result
  3. */
  4. ReservoirSample(S[1..n], R[1..k])
  5. // fill the reservoir array
  6. for i = 1 to k
  7. R[i] := S[i]
  8. // replace elements with gradually decreasing probability
  9. for i = k+1 to n
  10. j := random(1, i) // important: inclusive range
  11. if j <= k
  12. R[j] := S[i]

轉化爲代碼爲:

  1. /*
  2. * random sample
  3. */
  4. public static int[] randomSample(int[] nums, int m) {
  5. if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
  6. int[] sample = new int[m];
  7. for (int i = 0; i < m; i++) {
  8. sample[i] = nums[i];
  9. }
  10. Random random = new Random();
  11. for (int i = m; i < nums.length; i++) {
  12. int k = random.nextInt(i + 1); // 0~i(inclusive)
  13. if (k < m) {
  14. sample[k] = nums[i];
  15. }
  16. }
  17. return sample;
  18. }

和洗牌算法類似,我們要證明的問題是:數組中每個元素在最終採樣的數組中出現的機率均相等且爲m/n. 洗牌算法中是排列,而對於隨機抽樣則爲組合。

維基百科上的證明相對容易懂一些,這裏我稍微複述下。首先將數組前 m 個元素填充進新數組sample, 然後從m開始遍歷直至數組最後一個索引。隨機數k的範圍爲0~i, 如果k < m,新數組的索引爲 k 的元素則和原數組索引爲i的元素交換;如果k >= m, 則不進行交換。i == m時,以原數組中第j個元素爲例,它被nums[m]替換的機率爲1/(m+1), 也就是說保留在sample數組中的機率爲m/(m+1). 對與第m+1個元素nums[m]來說,其位於sample數組中的機率則爲m*1/(m+1)(可替換 m 個不同的元素).

接下來仍然使用數學歸納法證明,若i遍歷到r時,其之前的元素出現的機率爲m/(r-1), 那麼其之前的元素中任一元素nums[j]被替換的機率爲m/r * 1/m = 1/r, 不被替換的機率則爲(r-1)/r. 故元素nums[j]i遍歷完r後仍然保留的機率爲m/(r-1) * (r-1)/r = m/r. 而對於元素nums[r]來說,其要被替換至sample數組中的機率則爲m/r(隨機數小於m 的個數爲 m).

綜上,以上算法在遍歷完長度爲 n 的數組後每個數出現在最終sample數組中的機率都爲m/n.

Implementation and Test case

Talk is cheap, show me the code!

Java

  1. import java.util.*;
  2. import java.util.Random;
  3. public class Probability {
  4. public static void main(String[] args) {
  5. int[] cards = new int[10];
  6. for (int i = 0; i < 10; i++) {
  7. cards[i] = i;
  8. }
  9. // 100000 times test
  10. final int times = 100000;
  11. final int m = 5;
  12. int[][] count = new int[cards.length][cards.length];
  13. int[][] count2 = new int[cards.length][m];
  14. for (int i = 0; i < times; i++) {
  15. shuffleCard(cards);
  16. shuffleTest(cards, count);
  17. int[] sample = randomSample(cards, m);
  18. shuffleTest(sample, count2);
  19. }
  20. System.out.println("Shuffle cards");
  21. shufflePrint(count);
  22. System.out.println();
  23. System.out.println("Random sample");
  24. shufflePrint(count2);
  25. }
  26. /*
  27. * shuffle cards
  28. */
  29. public static void shuffleCard(int[] cards) {
  30. if (cards == null || cards.length == 0) return;
  31. Random rand = new Random();
  32. for (int i = 0; i < cards.length; i++) {
  33. int k = rand.nextInt(i + 1);
  34. int temp = cards[i];
  35. cards[i] = cards[k];
  36. cards[k] = temp;
  37. }
  38. }
  39. /*
  40. * random sample
  41. */
  42. public static int[] randomSample(int[] nums, int m) {
  43. if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
  44. m = Math.min(m, nums.length);
  45. int[] sample = new int[m];
  46. for (int i = 0; i < m; i++) {
  47. sample[i] = nums[i];
  48. }
  49. Random random = new Random();
  50. for (int i = m; i < nums.length; i++) {
  51. int k = random.nextInt(i + 1);
  52. if (k < m) {
  53. sample[k] = nums[i];
  54. }
  55. }
  56. return sample;
  57. }
  58. /*
  59. * nums[i] = j, num j appear in index i ==> count[j][i]
  60. */
  61. public static void shuffleTest(int[] nums, int[][] count) {
  62. if (nums == null || nums.length == 0) return;
  63. for (int i = 0; i < nums.length; i++) {
  64. count[nums[i]][i]++;
  65. }
  66. }
  67. /*
  68. * print shuffle test
  69. */
  70. public static void shufflePrint(int[][] count) {
  71. if (count == null || count.length == 0) return;
  72. // print index
  73. System.out.print(" ");
  74. for (int i = 0; i < count[0].length; i++) {
  75. System.out.printf("%-7d", i);
  76. }
  77. System.out.println();
  78. // print num appear in index i in total
  79. for (int i = 0; i < count.length; i++) {
  80. System.out.print(i + ": ");
  81. for (int j = 0; j < count[i].length; j++) {
  82. System.out.printf("%-7d", count[i][j]);
  83. }
  84. System.out.println();
  85. }
  86. }
  87. }

以十萬次試驗爲例,左側是元素i, 列代表在相應索引位置出現的次數。可以看出分佈還是比較隨機的。

  1. Shuffle cards
  2. 0 1 2 3 4 5 6 7 8 9
  3. 0: 10033 9963 10043 9845 9932 10020 9964 10114 10043 10043
  4. 1: 9907 9951 9989 10071 10059 9966 10054 10023 10015 9965
  5. 2: 10042 10046 9893 10080 10050 9994 10024 9852 10098 9921
  6. 3: 10039 10023 10039 10024 9919 10057 10188 9916 9907 9888
  7. 4: 9944 9913 10196 10059 9838 10205 9899 9945 9850 10151
  8. 5: 10094 9971 10054 9958 10022 9922 10047 9978 9965 9989
  9. 6: 9995 10147 9824 10015 10023 9804 10050 10192 9939 10011
  10. 7: 9941 10131 9902 9920 10040 10121 10010 9928 9984 10023
  11. 8: 10010 9926 9883 10098 10083 10028 9801 9936 10200 10035
  12. 9: 9995 9929 10177 9930 10034 9883 9963 10116 9999 9974
  13. Random sample
  14. 0 1 2 3 4
  15. 0: 9966 10026 10078 9966 9891
  16. 1: 9958 9806 10066 10022 10039
  17. 2: 9923 9936 9964 10051 10083
  18. 3: 10165 10088 10184 9928 9916
  19. 4: 9998 9990 9973 9931 9832
  20. 5: 10026 9932 9873 10085 10035
  21. 6: 9942 9972 9990 10030 10026
  22. 7: 9903 10153 9997 10051 10044
  23. 8: 10082 10066 9804 9899 10147
  24. 9: 10037 10031 10071 10037 9987

Reference