Methods

Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
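As a minimal illustration of the idea, here is a stdlib-only sketch of nearest-vector lookup. The vectors below are hand-made toy values standing in for real model output, and `cosine` and `search` are hypothetical helpers for this example only:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector magnitudes
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings: similar concepts sit close together in vector space
vectors = {
    "feline": [0.9, 0.1, 0.0],
    "cat": [0.85, 0.15, 0.05],
    "economy": [0.0, 0.2, 0.9],
}

def search(query_vector, limit=1):
    # Rank stored vectors by similarity to the query vector
    scores = sorted(vectors.items(), key=lambda kv: cosine(query_vector, kv[1]), reverse=True)
    return [key for key, _ in scores[:limit]]

# A query vector near "cat"/"feline" matches by meaning, not keyword overlap
print(search([0.88, 0.12, 0.02], 2))
```

A real embeddings database replaces the hand-made vectors with model-generated ones and the linear scan with an index, but the matching principle is the same.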

Source code in txtai/embeddings/base.py

```python
class Embeddings:
    """
    Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts
    will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results
    that have the same meaning, not necessarily the same keywords.
    """

    # pylint: disable = W0231
    def __init__(self, config=None, models=None, **kwargs):
        """
        Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.

        Args:
            config: embeddings configuration
            models: models cache, used for model sharing between embeddings
            kwargs: additional configuration as keyword args
        """

        # Index configuration
        self.config = None

        # Dimensionality reduction - word vectors only
        self.reducer = None

        # Dense vector model - transforms data into similarity vectors
        self.model = None

        # Approximate nearest neighbor index
        self.ann = None

        # Index ids when content is disabled
        self.ids = None

        # Document database
        self.database = None

        # Resolvable functions
        self.functions = None

        # Graph network
        self.graph = None

        # Sparse vectors
        self.scoring = None

        # Query model
        self.query = None

        # Index archive
        self.archive = None

        # Subindexes for this embeddings instance
        self.indexes = None

        # Models cache
        self.models = models

        # Merge configuration into single dictionary
        config = {**config, **kwargs} if config and kwargs else kwargs if kwargs else config

        # Set initial configuration
        self.configure(config)

    def score(self, documents):
        """
        Builds a term weighting scoring index. Only used by word vectors models.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
        """

        # Build scoring index for word vectors term weighting
        if self.isweighted():
            self.scoring.index(Stream(self)(documents))

    def index(self, documents, reindex=False):
        """
        Builds an embeddings index. This method overwrites an existing index.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
            reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
        """

        # Initialize index
        self.initindex(reindex)

        # Create transform and stream
        transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
        stream = Stream(self, Action.REINDEX if reindex else Action.INDEX)

        with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
            # Load documents into database and transform to vectors
            ids, dimensions, embeddings = transform(stream(documents), buffer)

            if embeddings is not None:
                # Build LSA model (if enabled). Remove principal components from embeddings.
                if self.config.get("pca"):
                    self.reducer = Reducer(embeddings, self.config["pca"])
                    self.reducer(embeddings)

                # Save index dimensions
                self.config["dimensions"] = dimensions

                # Create approximate nearest neighbor index
                self.ann = self.createann()

                # Add embeddings to the index
                self.ann.index(embeddings)

            # Save indexids-ids mapping for indexes with no database, except when this is a reindex
            if ids and not reindex and not self.database:
                self.ids = self.createids(ids)

        # Index scoring, if necessary
        # This must occur before graph index in order to be available to the graph
        if self.issparse():
            self.scoring.index()

        # Index subindexes, if necessary
        if self.indexes:
            self.indexes.index()

        # Index graph, if necessary
        if self.graph:
            self.graph.index(Search(self, True), Ids(self), self.batchsimilarity)

    def upsert(self, documents):
        """
        Runs an embeddings upsert operation. If the index exists, new data is
        appended to the index, existing data is updated. If the index doesn't exist,
        this method runs a standard index operation.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
        """

        # Run standard insert if index doesn't exist or it has no records
        if not self.count():
            self.index(documents)
            return

        # Create transform and stream
        transform = Transform(self, Action.UPSERT)
        stream = Stream(self, Action.UPSERT)

        with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
            # Load documents into database and transform to vectors
            ids, _, embeddings = transform(stream(documents), buffer)

            if embeddings is not None:
                # Remove principal components from embeddings, if necessary
                if self.reducer:
                    self.reducer(embeddings)

                # Append embeddings to the index
                self.ann.append(embeddings)

            # Save indexids-ids mapping for indexes with no database
            if ids and not self.database:
                self.ids = self.createids(self.ids + ids)

        # Scoring upsert, if necessary
        # This must occur before graph upsert in order to be available to the graph
        if self.issparse():
            self.scoring.upsert()

        # Subindexes upsert, if necessary
        if self.indexes:
            self.indexes.upsert()

        # Graph upsert, if necessary
        if self.graph:
            self.graph.upsert(Search(self, True), Ids(self), self.batchsimilarity)

    def delete(self, ids):
        """
        Deletes from an embeddings index. Returns list of ids deleted.

        Args:
            ids: list of ids to delete

        Returns:
            list of ids deleted
        """

        # List of internal indices for each candidate id to delete
        indices = []

        # List of deleted ids
        deletes = []

        if self.database:
            # Retrieve indexid-id mappings from database
            ids = self.database.ids(ids)

            # Parse out indices and ids to delete
            indices = [i for i, _ in ids]
            deletes = sorted(set(uid for _, uid in ids))

            # Delete ids from database
            self.database.delete(deletes)
        elif self.ann or self.scoring:
            # Find existing ids
            for uid in ids:
                indices.extend([index for index, value in enumerate(self.ids) if uid == value])

            # Clear embeddings ids
            for index in indices:
                deletes.append(self.ids[index])
                self.ids[index] = None

        # Delete indices for all indexes and data stores
        if indices:
            # Delete ids from ann
            if self.isdense():
                self.ann.delete(indices)

            # Delete ids from scoring
            if self.issparse():
                self.scoring.delete(indices)

            # Delete ids from subindexes
            if self.indexes:
                self.indexes.delete(indices)

            # Delete ids from graph
            if self.graph:
                self.graph.delete(indices)

        return deletes

    def reindex(self, config=None, function=None, **kwargs):
        """
        Recreates embeddings index using config. This method only works if document content storage is enabled.

        Args:
            config: new config
            function: optional function to prepare content for indexing
            kwargs: additional configuration as keyword args
        """

        if self.database:
            # Merge configuration into single dictionary
            config = {**config, **kwargs} if config and kwargs else config if config else kwargs

            # Keep content and objects parameters to ensure database is preserved
            config["content"] = self.config["content"]
            if "objects" in self.config:
                config["objects"] = self.config["objects"]

            # Reset configuration
            self.configure(config)

            # Reset function references
            if self.functions:
                self.functions.reset()

            # Reindex
            if function:
                self.index(function(self.database.reindex(self.config)), True)
            else:
                self.index(self.database.reindex(self.config), True)

    def transform(self, document):
        """
        Transforms document into an embeddings vector.

        Args:
            document: (id, data, tags), (id, data) or data

        Returns:
            embeddings vector
        """

        return self.batchtransform([document])[0]

    def batchtransform(self, documents, category=None):
        """
        Transforms documents into embeddings vectors.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
            category: category for instruction-based embeddings

        Returns:
            embeddings vectors
        """

        # Initialize default parameters, if necessary
        self.defaults()

        # Convert documents into sentence embeddings
        embeddings = self.model.batchtransform(Stream(self)(documents), category)

        # Reduce the dimensionality of the embeddings. Scale the embeddings using this
        # model to reduce the noise of common but less relevant terms.
        if self.reducer:
            self.reducer(embeddings)

        return embeddings

    def count(self):
        """
        Total number of elements in this embeddings index.

        Returns:
            number of elements in this embeddings index
        """

        if self.ann:
            return self.ann.count()
        if self.scoring:
            return self.scoring.count()
        if self.database:
            return self.database.count()
        if self.ids:
            return len([uid for uid in self.ids if uid is not None])

        # Default to 0 when no suitable method found
        return 0

    def search(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
        """
        Finds documents most similar to the input query. This method will run either an index search
        or an index + database search depending on if a database is available.

        Args:
            query: input query
            limit: maximum results
            weights: hybrid score weights, if applicable
            index: index name, if applicable
            parameters: dict of named parameters to bind to placeholders
            graph: return graph results if True

        Returns:
            list of (id, score) for index search
            list of dict for an index + database search
            graph when graph is set to True
        """

        results = self.batchsearch([query], limit, weights, index, [parameters], graph)
        return results[0] if results else results

    def batchsearch(self, queries, limit=None, weights=None, index=None, parameters=None, graph=False):
        """
        Finds documents most similar to the input queries. This method will run either an index search
        or an index + database search depending on if a database is available.

        Args:
            queries: input queries
            limit: maximum results
            weights: hybrid score weights, if applicable
            index: index name, if applicable
            parameters: list of dicts of named parameters to bind to placeholders
            graph: return graph results if True

        Returns:
            list of (id, score) per query for index search
            list of dict per query for an index + database search
            list of graph per query when graph is set to True
        """

        # Determine if graphs should be returned
        graph = graph if self.graph else False

        # Execute search
        results = Search(self, graph)(queries, limit, weights, index, parameters)

        # Create subgraphs using results, if necessary
        return [self.graph.filter(x) for x in results] if graph else results

    def similarity(self, query, data):
        """
        Computes the similarity between query and list of data. Returns a list of
        (id, score) sorted by highest score, where id is the index in data.

        Args:
            query: input query
            data: list of data

        Returns:
            list of (id, score)
        """

        return self.batchsimilarity([query], data)[0]

    def batchsimilarity(self, queries, data):
        """
        Computes the similarity between list of queries and list of data. Returns a list
        of (id, score) sorted by highest score per query, where id is the index in data.

        Args:
            queries: input queries
            data: list of data

        Returns:
            list of (id, score) per query
        """

        # Convert queries to embedding vectors
        queries = self.batchtransform(((None, query, None) for query in queries), "query")
        data = self.batchtransform(((None, row, None) for row in data), "data")

        # Dot product on normalized vectors is equal to cosine similarity
        scores = np.dot(queries, data.T).tolist()

        # Add index and sort desc based on score
        return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]

    def explain(self, query, texts=None, limit=None):
        """
        Explains the importance of each input token in text for a query. This method requires either content to be enabled
        or texts to be provided.

        Args:
            query: input query
            texts: optional list of (text|list of tokens), otherwise runs search query
            limit: optional limit if texts is None

        Returns:
            list of dict per input text where higher token scores represent higher importance relative to the query
        """

        results = self.batchexplain([query], texts, limit)
        return results[0] if results else results

    def batchexplain(self, queries, texts=None, limit=None):
        """
        Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled
        or texts to be provided.

        Args:
            queries: input queries
            texts: optional list of (text|list of tokens), otherwise runs search queries
            limit: optional limit if texts is None

        Returns:
            list of dict per input text per query where higher token scores represent higher importance relative to the query
        """

        return Explain(self)(queries, texts, limit)

    def terms(self, query):
        """
        Extracts keyword terms from a query.

        Args:
            query: input query

        Returns:
            query reduced down to keyword terms
        """

        return self.batchterms([query])[0]

    def batchterms(self, queries):
        """
        Extracts keyword terms from a list of queries.

        Args:
            queries: list of queries

        Returns:
            list of queries reduced down to keyword term strings
        """

        return Terms(self)(queries)

    def exists(self, path=None, cloud=None, **kwargs):
        """
        Checks if an index exists at path.

        Args:
            path: input path
            cloud: cloud storage configuration
            kwargs: additional configuration as keyword args

        Returns:
            True if index exists, False otherwise
        """

        # Check if this exists in a cloud instance
        cloud = self.createcloud(cloud=cloud, **kwargs)
        if cloud:
            return cloud.exists(path)

        # Check if this is an archive file and exists
        path, apath = self.checkarchive(path)
        if apath:
            return os.path.exists(apath)

        # Return true if path has a config.json or config file with an offset set
        return path and (os.path.exists(f"{path}/config.json") or os.path.exists(f"{path}/config")) and "offset" in self.loadconfig(path)

    def load(self, path=None, cloud=None, config=None, **kwargs):
        """
        Loads an existing index from path.

        Args:
            path: input path
            cloud: cloud storage configuration
            config: configuration overrides
            kwargs: additional configuration as keyword args
        """

        # Load from cloud, if configured
        cloud = self.createcloud(cloud=cloud, **kwargs)
        if cloud:
            path = cloud.load(path)

        # Check if this is an archive file and extract
        path, apath = self.checkarchive(path)
        if apath:
            self.archive.load(apath)

        # Load index configuration
        self.config = self.loadconfig(path)

        # Apply config overrides
        self.config = {**self.config, **config} if config else self.config

        # Approximate nearest neighbor index - stores dense vectors
        self.ann = self.createann()
        if self.ann:
            self.ann.load(f"{path}/embeddings")

        # Dimensionality reduction model - word vectors only
        if self.config.get("pca"):
            self.reducer = Reducer()
            self.reducer.load(f"{path}/lsa")

        # Index ids when content is disabled
        self.ids = self.createids()
        if self.ids:
            self.ids.load(f"{path}/ids")

        # Document database - stores document content
        self.database = self.createdatabase()
        if self.database:
            self.database.load(f"{path}/documents")

        # Sparse vectors - stores term sparse arrays
        self.scoring = self.createscoring()
        if self.scoring:
            self.scoring.load(f"{path}/scoring")

        # Subindexes
        self.indexes = self.createindexes()
        if self.indexes:
            self.indexes.load(f"{path}/indexes")

        # Graph network - stores relationships
        self.graph = self.creategraph()
        if self.graph:
            self.graph.load(f"{path}/graph")

        # Dense vectors - transforms data to embeddings vectors
        self.model = self.loadvectors()

        # Query model
        self.query = self.loadquery()

    def save(self, path, cloud=None, **kwargs):
        """
        Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
        In those cases, the index is stored as a compressed file.

        Args:
            path: output path
            cloud: cloud storage configuration
            kwargs: additional configuration as keyword args
        """

        if self.config:
            # Check if this is an archive file
            path, apath = self.checkarchive(path)

            # Create output directory, if necessary
            os.makedirs(path, exist_ok=True)

            # Copy sentence vectors model
            if self.config.get("storevectors"):
                shutil.copyfile(self.config["path"], os.path.join(path, os.path.basename(self.config["path"])))

                self.config["path"] = os.path.basename(self.config["path"])

            # Save index configuration
            self.saveconfig(path)

            # Save approximate nearest neighbor index
            if self.ann:
                self.ann.save(f"{path}/embeddings")

            # Save dimensionality reduction model (word vectors only)
            if self.reducer:
                self.reducer.save(f"{path}/lsa")

            # Save index ids
            if self.ids:
                self.ids.save(f"{path}/ids")

            # Save document database
            if self.database:
                self.database.save(f"{path}/documents")

            # Save scoring index
            if self.scoring:
                self.scoring.save(f"{path}/scoring")

            # Save subindexes
            if self.indexes:
                self.indexes.save(f"{path}/indexes")

            # Save graph
            if self.graph:
                self.graph.save(f"{path}/graph")

            # If this is an archive, save it
            if apath:
                self.archive.save(apath)

            # Save to cloud, if configured
            cloud = self.createcloud(cloud=cloud, **kwargs)
            if cloud:
                cloud.save(apath if apath else path)

    def close(self):
        """
        Closes this embeddings index and frees all resources.
        """

        self.ann, self.config, self.graph, self.archive = None, None, None, None
        self.reducer, self.query, self.model, self.models = None, None, None, None
        self.ids = None

        # Close database connection if open
        if self.database:
            self.database.close()
            self.database, self.functions = None, None

        # Close scoring instance if open
        if self.scoring:
            self.scoring.close()
            self.scoring = None

        # Close indexes if open
        if self.indexes:
            self.indexes.close()
            self.indexes = None

    def info(self):
        """
        Prints the current embeddings index configuration.
        """

        if self.config:
            # Print configuration
            print(json.dumps(self.config, sort_keys=True, default=str, indent=2))

    def issparse(self):
        """
        Checks if this instance has an associated scoring instance with term indexing enabled.

        Returns:
            True if term index is enabled, False otherwise
        """

        return self.scoring and self.scoring.hasterms()

    def isdense(self):
        """
        Checks if this instance has an associated ANN instance.

        Returns:
            True if this instance has an associated ANN, False otherwise
        """

        return self.ann is not None

    def isweighted(self):
        """
        Checks if this instance has an associated scoring instance with term weighting enabled.

        Returns:
            True if term weighting is enabled, False otherwise
        """

        return self.scoring and not self.scoring.hasterms()

    def configure(self, config):
        """
        Sets the configuration for this embeddings index and loads config-driven models.

        Args:
            config: embeddings configuration
        """

        # Configuration
        self.config = config

        # Dimensionality reduction model
        self.reducer = None

        # Create scoring instance for word vectors term weighting
        scoring = self.config.get("scoring") if self.config else None
        self.scoring = self.createscoring() if scoring and (not isinstance(scoring, dict) or not scoring.get("terms")) else None

        # Dense vectors - transforms data to embeddings vectors
        self.model = self.loadvectors() if self.config else None

        # Query model
        self.query = self.loadquery() if self.config else None

    def initindex(self, reindex):
        """
        Initialize new index.

        Args:
            reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
        """

        # Initialize default parameters, if necessary
        self.defaults()

        # Initialize index ids, only created when content is disabled
        self.ids = None

        # Create document database, if necessary
        if not reindex:
            self.database = self.createdatabase()

            # Reset archive since this is a new index
            self.archive = None

        # Initialize ANN, will be created after index transformations complete
```
  554. self.ann = None
  555. # Create scoring only if term indexing is enabled
  556. scoring = self.config.get(“scoring”)
  557. if scoring and isinstance(scoring, dict) and self.config[“scoring”].get(“terms”):
  558. self.scoring = self.createscoring()
  559. # Create subindexes, if necessary
  560. self.indexes = self.createindexes()
  561. # Create graph, if necessary
  562. self.graph = self.creategraph()
  563. def defaults(self):
  564. “””
  565. Apply default parameters to current configuration.
  566. Returns:
  567. configuration with default parameters set
  568. “””
  569. self.config = self.config if self.config else {}
  570. # Expand sparse index shortcuts
  571. if not self.config.get(“scoring”) and any(self.config.get(key) for key in [“keyword”, hybrid”]):
  572. self.config[“scoring”] = {“method”: bm25”, terms”: True, normalize”: True}
  573. # Check if default model should be loaded
  574. if not self.model and self.defaultallowed():
  575. self.config[“path”] = sentence-transformers/all-MiniLM-L6-v2
  576. # Load dense vectors model
  577. self.model = self.loadvectors()
  578. def defaultallowed(self):
  579. “””
  580. Tests if this embeddings instance can use a default model if not otherwise provided.
  581. Returns:
  582. True if a default model is allowed, False otherwise
  583. “””
  584. params = [(“keyword”, False), (“defaults”, True)]
  585. return all(self.config.get(key, default) == default for key, default in params)
  586. def loadconfig(self, path):
  587. “””
  588. Loads index configuration. This method supports both config.json and config pickle files.
  589. Args:
  590. path: path to directory
  591. Returns:
  592. dict
  593. “””
  594. # Configuration
  595. config = None
  596. # Determine if config is json or pickle
  597. jsonconfig = os.path.exists(f”{path}/config.json”)
  598. # Set config file name
  599. name = config.json if jsonconfig else config
  600. # Load configuration
  601. with open(f”{path}/{name}”, r if jsonconfig else rb”, encoding=”utf-8 if jsonconfig else None) as handle:
  602. config = json.load(handle) if jsonconfig else pickle.load(handle)
  603. # Add format parameter
  604. config[“format”] = json if jsonconfig else pickle
  605. # Build full path to embedding vectors file
  606. if config.get(“storevectors”):
  607. config[“path”] = os.path.join(path, config[“path”])
  608. return config
  609. def saveconfig(self, path):
  610. “””
  611. Saves index configuration. This method defaults to JSON and falls back to pickle.
  612. Args:
  613. path: path to directory
  614. Returns:
  615. dict
  616. “””
  617. # Default to JSON config
  618. jsonconfig = self.config.get(“format”, json”) == json
  619. # Set config file name
  620. name = config.json if jsonconfig else config
  621. # Write configuration
  622. with open(f”{path}/{name}”, w if jsonconfig else wb”, encoding=”utf-8 if jsonconfig else None) as handle:
  623. if jsonconfig:
  624. # Write config as JSON
  625. json.dump(self.config, handle, default=str, indent=2)
  626. else:
  627. # Write config as pickle format
  628. pickle.dump(self.config, handle, protocol=_pickle)
  629. def loadvectors(self):
  630. “””
  631. Loads a vector model set in config.
  632. Returns:
  633. vector model
  634. “””
  635. # Create model cache if subindexes are enabled
  636. if indexes in self.config and self.models is None:
  637. self.models = {}
  638. # Model path
  639. path = self.config.get(“path”)
  640. # Check if model is cached
  641. if self.models and path in self.models:
  642. return self.models[path]
  643. # Load and store uncached model
  644. model = VectorsFactory.create(self.config, self.scoring)
  645. if self.models is not None and path:
  646. self.models[path] = model
  647. return model
  648. def loadquery(self):
  649. “””
  650. Loads a query model set in config.
  651. Returns:
  652. query model
  653. “””
  654. if query in self.config:
  655. return Query(self.config[“query”])
  656. return None
  657. def checkarchive(self, path):
  658. “””
  659. Checks if path is an archive file.
  660. Args:
  661. path: path to check
  662. Returns:
  663. (working directory, current path) if this is an archive, original path otherwise
  664. “””
  665. # Create archive instance, if necessary
  666. self.archive = ArchiveFactory.create()
  667. # Check if path is an archive file
  668. if self.archive.isarchive(path):
  669. # Return temporary archive working directory and original path
  670. return self.archive.path(), path
  671. return path, None
  672. def createcloud(self, cloud):
  673. “””
  674. Creates a cloud instance from config.
  675. Args:
  676. cloud: cloud configuration
  677. “””
  678. # Merge keyword args and keys under the cloud parameter
  679. config = cloud
  680. if cloud in config and config[“cloud”]:
  681. config.update(config.pop(“cloud”))
  682. # Create cloud instance from config and return
  683. return CloudFactory.create(config) if config else None
  684. def createann(self):
  685. “””
  686. Creates an ANN from config.
  687. Returns:
  688. new ANN, if enabled in config
  689. “””
  690. return ANNFactory.create(self.config) if self.config.get(“path”) or self.defaultallowed() else None
  691. def createdatabase(self):
  692. “””
  693. Creates a database from config. This method will also close any existing database connection.
  694. Returns:
  695. new database, if enabled in config
  696. “””
  697. # Free existing database resources
  698. if self.database:
  699. self.database.close()
  700. config = self.config.copy()
  701. # Create references to callable functions
  702. self.functions = Functions(self) if functions in config else None
  703. if self.functions:
  704. config[“functions”] = self.functions(config)
  705. # Create database from config and return
  706. return DatabaseFactory.create(config)
  707. def creategraph(self):
  708. “””
  709. Creates a graph from config.
  710. Returns:
  711. new graph, if enabled in config
  712. “””
  713. if graph in self.config:
  714. # Get or create graph configuration
  715. config = self.config[“graph”] if self.config[“graph”] else {}
  716. # Create configuration with custom columns, if necessary
  717. config = self.columns(config)
  718. return GraphFactory.create(config)
  719. return None
  720. def createids(self, ids=None):
  721. “””
  722. Creates indexids when content is disabled.
  723. Args:
  724. ids: optional ids to add
  725. Returns:
  726. new indexids, if content disabled
  727. “””
  728. # Load index ids when content is disabled
  729. return IndexIds(self, ids) if not self.config.get(“content”) else None
  730. def createindexes(self):
  731. “””
  732. Creates subindexes from config.
  733. Returns:
  734. list of subindexes
  735. “””
  736. # Load subindexes
  737. if indexes in self.config:
  738. indexes = {}
  739. for index, config in self.config[“indexes”].items():
  740. # Create index with shared model cache
  741. indexes[index] = Embeddings(config, models=self.models)
  742. # Wrap as Indexes object
  743. return Indexes(self, indexes)
  744. return None
  745. def createscoring(self):
  746. “””
  747. Creates a scoring from config.
  748. Returns:
  749. new scoring, if enabled in config
  750. “””
  751. # Free existing resources
  752. if self.scoring:
  753. self.scoring.close()
  754. if scoring in self.config:
  755. # Expand scoring to a dictionary, if necessary
  756. config = self.config[“scoring”]
  757. config = config if isinstance(config, dict) else {“method”: config}
  758. # Create configuration with custom columns, if necessary
  759. config = self.columns(config)
  760. return ScoringFactory.create(config)
  761. return None
  762. def columns(self, config):
  763. “””
  764. Adds custom text/object column information if its provided.
  765. Args:
  766. config: input configuration
  767. Returns:
  768. config with column information added
  769. “””
  770. # Add text/object columns if custom
  771. if columns in self.config:
  772. # Work on copy of configuration
  773. config = config.copy()
  774. # Copy columns to config
  775. config[“columns”] = self.config[“columns”]
  776. return config
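The `loadconfig`/`saveconfig` pair above round-trips configuration as JSON when possible and as pickle otherwise, detecting the format by file name on load. A minimal standalone sketch of that pattern (independent of txtai, with a hypothetical config dict):

```python
import json
import os
import pickle
import tempfile

def saveconfig(config, path):
    # Default to JSON, fall back to pickle when the config requests it
    jsonconfig = config.get("format", "json") == "json"
    name = "config.json" if jsonconfig else "config"
    with open(f"{path}/{name}", "w" if jsonconfig else "wb", encoding="utf-8" if jsonconfig else None) as handle:
        if jsonconfig:
            json.dump(config, handle, default=str, indent=2)
        else:
            pickle.dump(config, handle)

def loadconfig(path):
    # Detect format by file name, record it back into the config
    jsonconfig = os.path.exists(f"{path}/config.json")
    name = "config.json" if jsonconfig else "config"
    with open(f"{path}/{name}", "r" if jsonconfig else "rb", encoding="utf-8" if jsonconfig else None) as handle:
        config = json.load(handle) if jsonconfig else pickle.load(handle)
    config["format"] = "json" if jsonconfig else "pickle"
    return config

with tempfile.TemporaryDirectory() as path:
    saveconfig({"path": "model", "content": True}, path)
    config = loadconfig(path)

print(config["format"])  # json
```

JSON is preferred because it keeps saved indexes portable and human-readable; pickle remains as a fallback for configurations containing values JSON cannot represent.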

__init__(config=None, models=None, **kwargs)

Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.

Parameters:

config: embeddings configuration (default: None)
models: models cache, used for model sharing between embeddings (default: None)
kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

def __init__(self, config=None, models=None, **kwargs):
    """
    Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.

    Args:
        config: embeddings configuration
        models: models cache, used for model sharing between embeddings
        kwargs: additional configuration as keyword args
    """

    # Index configuration
    self.config = None

    # Dimensionality reduction - word vectors only
    self.reducer = None

    # Dense vector model - transforms data into similarity vectors
    self.model = None

    # Approximate nearest neighbor index
    self.ann = None

    # Index ids when content is disabled
    self.ids = None

    # Document database
    self.database = None

    # Resolvable functions
    self.functions = None

    # Graph network
    self.graph = None

    # Sparse vectors
    self.scoring = None

    # Query model
    self.query = None

    # Index archive
    self.archive = None

    # Subindexes for this embeddings instance
    self.indexes = None

    # Models cache
    self.models = models

    # Merge configuration into single dictionary
    config = {**config, **kwargs} if config and kwargs else kwargs if kwargs else config

    # Set initial configuration
    self.configure(config)
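The last step merges the config dict and keyword arguments into one dictionary, with keyword args taking precedence on key collisions. A minimal sketch of just that merge logic, using hypothetical config keys:

```python
def merge(config=None, **kwargs):
    # Keyword args override keys in config when both are given;
    # otherwise whichever one is present wins
    return {**config, **kwargs} if config and kwargs else kwargs if kwargs else config

print(merge({"path": "model-a", "content": True}, path="model-b"))
# {'path': 'model-b', 'content': True}
```

This is why `Embeddings(config, content=True)` and `Embeddings({**config, "content": True})` are equivalent ways to override a single setting.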

batchexplain(queries, texts=None, limit=None)

Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled or texts to be provided.

Parameters:

queries: input queries (required)
texts: optional list of (text|list of tokens), otherwise runs search queries (default: None)
limit: optional limit if texts is None (default: None)

Returns:

list of dicts per input text per query, where higher token scores represent higher importance relative to the query

Source code in txtai/embeddings/base.py

def batchexplain(self, queries, texts=None, limit=None):
    """
    Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled
    or texts to be provided.

    Args:
        queries: input queries
        texts: optional list of (text|list of tokens), otherwise runs search queries
        limit: optional limit if texts is None

    Returns:
        list of dict per input text per query where a higher token score represents higher importance relative to the query
    """

    return Explain(self)(queries, texts, limit)

batchsearch(queries, limit=None, weights=None, index=None, parameters=None, graph=False)

Finds documents most similar to the input queries. This method runs either an index search or an index + database search, depending on whether a database is available.

Parameters:

queries: input queries (required)
limit: maximum results (default: None)
weights: hybrid score weights, if applicable (default: None)
index: index name, if applicable (default: None)
parameters: list of dicts of named parameters to bind to placeholders (default: None)
graph: return graph results if True (default: False)

Returns:


list of (id, score) per query for index search

list of dict per query for an index + database search

list of graph per query when graph is set to True

Source code in txtai/embeddings/base.py

def batchsearch(self, queries, limit=None, weights=None, index=None, parameters=None, graph=False):
    """
    Finds documents most similar to the input queries. This method will run either an index search
    or an index + database search depending on if a database is available.

    Args:
        queries: input queries
        limit: maximum results
        weights: hybrid score weights, if applicable
        index: index name, if applicable
        parameters: list of dicts of named parameters to bind to placeholders
        graph: return graph results if True

    Returns:
        list of (id, score) per query for index search
        list of dict per query for an index + database search
        list of graph per query when graph is set to True
    """

    # Determine if graphs should be returned
    graph = graph if self.graph else False

    # Execute search
    results = Search(self, graph)(queries, limit, weights, index, parameters)

    # Create subgraphs using results, if necessary
    return [self.graph.filter(x) for x in results] if graph else results

batchsimilarity(queries, data)

Computes the similarity between a list of queries and a list of data. Returns a list of (id, score) sorted by highest score per query, where id is the index in data.

Parameters:

queries: input queries (required)
data: list of data (required)

Returns:


list of (id, score) per query

Source code in txtai/embeddings/base.py

def batchsimilarity(self, queries, data):
    """
    Computes the similarity between list of queries and list of data. Returns a list
    of (id, score) sorted by highest score per query, where id is the index in data.

    Args:
        queries: input queries
        data: list of data

    Returns:
        list of (id, score) per query
    """

    # Convert queries to embedding vectors
    queries = self.batchtransform(((None, query, None) for query in queries), "query")
    data = self.batchtransform(((None, row, None) for row in data), "data")

    # Dot product on normalized vectors is equal to cosine similarity
    scores = np.dot(queries, data.T).tolist()

    # Add index and sort desc based on score
    return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]
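The scoring step relies on a standard identity: for unit-length vectors, the dot product equals cosine similarity. A dependency-free sketch of that computation for a single query, using hypothetical 2D vectors (txtai does this with numpy over batches):

```python
import math

def normalize(vector):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity(query, data):
    # Dot product on normalized vectors equals cosine similarity
    query = normalize(query)
    scores = [sum(q * d for q, d in zip(query, normalize(row))) for row in data]

    # Add index and sort descending by score
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

print(similarity([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]]))  # [(1, 1.0), (0, 0.0)]
```

The returned ids are positions in `data`, matching the contract described above.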

batchterms(queries)

Extracts keyword terms from a list of queries.

Parameters:

queries: list of queries (required)

Returns:


list of queries reduced down to keyword term strings

Source code in txtai/embeddings/base.py

def batchterms(self, queries):
    """
    Extracts keyword terms from a list of queries.

    Args:
        queries: list of queries

    Returns:
        list of queries reduced down to keyword term strings
    """

    return Terms(self)(queries)

batchtransform(documents, category=None)

Transforms documents into embeddings vectors.

Parameters:

documents: iterable of (id, data, tags), (id, data) or data (required)
category: category for instruction-based embeddings (default: None)

Returns:


embeddings vectors

Source code in txtai/embeddings/base.py

def batchtransform(self, documents, category=None):
    """
    Transforms documents into embeddings vectors.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
        category: category for instruction-based embeddings

    Returns:
        embeddings vectors
    """

    # Initialize default parameters, if necessary
    self.defaults()

    # Convert documents into sentence embeddings
    embeddings = self.model.batchtransform(Stream(self)(documents), category)

    # Reduce the dimensionality of the embeddings. Scale the embeddings using this
    # model to reduce the noise of common but less relevant terms.
    if self.reducer:
        self.reducer(embeddings)

    return embeddings

close()

Closes this embeddings index and frees all resources.

Source code in txtai/embeddings/base.py

def close(self):
    """
    Closes this embeddings index and frees all resources.
    """

    self.ann, self.config, self.graph, self.archive = None, None, None, None
    self.reducer, self.query, self.model, self.models = None, None, None, None
    self.ids = None

    # Close database connection if open
    if self.database:
        self.database.close()
        self.database, self.functions = None, None

    # Close scoring instance if open
    if self.scoring:
        self.scoring.close()
        self.scoring = None

    # Close indexes if open
    if self.indexes:
        self.indexes.close()
        self.indexes = None

count()

Total number of elements in this embeddings index.

Returns:


number of elements in this embeddings index

Source code in txtai/embeddings/base.py

def count(self):
    """
    Total number of elements in this embeddings index.

    Returns:
        number of elements in this embeddings index
    """

    if self.ann:
        return self.ann.count()
    if self.scoring:
        return self.scoring.count()
    if self.database:
        return self.database.count()
    if self.ids:
        return len([uid for uid in self.ids if uid is not None])

    # Default to 0 when no suitable method found
    return 0
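In the final fallback, count works from the id mapping alone and skips entries that prior deletes cleared to None. A small sketch of just that fallback, with hypothetical ids:

```python
def count(ids):
    # Deleted entries are cleared to None rather than removed,
    # so only non-None ids contribute to the total
    return len([uid for uid in ids if uid is not None])

print(count(["doc1", None, "doc2", None, "doc3"]))  # 3
```

This keeps counts consistent with delete, which leaves None placeholders so internal index positions stay stable.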

delete(ids)

Deletes from an embeddings index. Returns list of ids deleted.

Parameters:

ids: list of ids to delete (required)

Returns:


list of ids deleted

Source code in txtai/embeddings/base.py

def delete(self, ids):
    """
    Deletes from an embeddings index. Returns list of ids deleted.

    Args:
        ids: list of ids to delete

    Returns:
        list of ids deleted
    """

    # List of internal indices for each candidate id to delete
    indices = []

    # List of deleted ids
    deletes = []

    if self.database:
        # Retrieve indexid-id mappings from database
        ids = self.database.ids(ids)

        # Parse out indices and ids to delete
        indices = [i for i, _ in ids]
        deletes = sorted(set(uid for _, uid in ids))

        # Delete ids from database
        self.database.delete(deletes)
    elif self.ann or self.scoring:
        # Find existing ids
        for uid in ids:
            indices.extend([index for index, value in enumerate(self.ids) if uid == value])

        # Clear embeddings ids
        for index in indices:
            deletes.append(self.ids[index])
            self.ids[index] = None

    # Delete indices for all indexes and data stores
    if indices:
        # Delete ids from ann
        if self.isdense():
            self.ann.delete(indices)

        # Delete ids from scoring
        if self.issparse():
            self.scoring.delete(indices)

        # Delete ids from subindexes
        if self.indexes:
            self.indexes.delete(indices)

        # Delete ids from graph
        if self.graph:
            self.graph.delete(indices)

    return deletes
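Without a database, delete resolves ids to internal index positions, records the deleted ids and clears the slots to None instead of removing them, so remaining positions stay valid for the ANN index. A self-contained sketch of that branch, with hypothetical ids (not txtai's actual implementation):

```python
def delete(index_ids, ids):
    # Resolve each candidate id to its internal index positions
    indices = []
    for uid in ids:
        indices.extend([index for index, value in enumerate(index_ids) if uid == value])

    # Record deleted ids and clear slots in place; positions stay
    # stable so the remaining internal indices remain valid
    deletes = []
    for index in indices:
        deletes.append(index_ids[index])
        index_ids[index] = None

    return deletes

index_ids = ["a", "b", "c", "b"]
print(delete(index_ids, ["b"]))  # ['b', 'b']
print(index_ids)                 # ['a', None, 'c', None]
```

Note that a single id can map to multiple internal positions, so one logical delete may clear several slots.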

exists(path=None, cloud=None, **kwargs)

Checks if an index exists at path.

Parameters:

path: input path (default: None)
cloud: cloud storage configuration (default: None)
kwargs: additional configuration as keyword args (default: {})

Returns:


True if index exists, False otherwise

Source code in txtai/embeddings/base.py

def exists(self, path=None, cloud=None, **kwargs):
    """
    Checks if an index exists at path.

    Args:
        path: input path
        cloud: cloud storage configuration
        kwargs: additional configuration as keyword args

    Returns:
        True if index exists, False otherwise
    """

    # Check if this exists in a cloud instance
    cloud = self.createcloud(cloud=cloud, **kwargs)
    if cloud:
        return cloud.exists(path)

    # Check if this is an archive file and exists
    path, apath = self.checkarchive(path)
    if apath:
        return os.path.exists(apath)

    # Return true if path has a config.json or config file with an offset set
    return path and (os.path.exists(f"{path}/config.json") or os.path.exists(f"{path}/config")) and "offset" in self.loadconfig(path)
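The final check treats a directory as a valid index only when it holds a config file whose contents include an offset. A standalone sketch of that local-path check, restricted to JSON configs for brevity:

```python
import json
import os
import tempfile

def exists(path):
    # A directory is a valid index when it has a config file with an offset set
    config = f"{path}/config.json"
    if not os.path.exists(config):
        return False
    with open(config, encoding="utf-8") as handle:
        return "offset" in json.load(handle)

with tempfile.TemporaryDirectory() as path:
    before = exists(path)
    with open(f"{path}/config.json", "w", encoding="utf-8") as handle:
        json.dump({"offset": 100}, handle)
    after = exists(path)

print(before, after)  # False True
```

Requiring the offset key, not just the file, guards against treating a directory with a partially written or empty config as a loadable index.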

explain(query, texts=None, limit=None)

Explains the importance of each input token in text for a query. This method requires either content to be enabled or texts to be provided.

Parameters:

query: input query (required)
texts: optional list of (text|list of tokens), otherwise runs search query (default: None)
limit: optional limit if texts is None (default: None)

Returns:

list of dicts per input text, where higher token scores represent higher importance relative to the query

Source code in txtai/embeddings/base.py

def explain(self, query, texts=None, limit=None):
    """
    Explains the importance of each input token in text for a query. This method requires either content to be enabled
    or texts to be provided.

    Args:
        query: input query
        texts: optional list of (text|list of tokens), otherwise runs search query
        limit: optional limit if texts is None

    Returns:
        list of dict per input text where a higher token score represents higher importance relative to the query
    """

    results = self.batchexplain([query], texts, limit)
    return results[0] if results else results

index(documents, reindex=False)

Builds an embeddings index. This method overwrites an existing index.

Parameters:

documents: iterable of (id, data, tags), (id, data) or data (required)
reindex: if this is a reindex operation, in which case database creation is skipped (default: False)

Source code in txtai/embeddings/base.py

def index(self, documents, reindex=False):
    """
    Builds an embeddings index. This method overwrites an existing index.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
        reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
    """

    # Initialize index
    self.initindex(reindex)

    # Create transform and stream
    transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
    stream = Stream(self, Action.REINDEX if reindex else Action.INDEX)

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
        # Load documents into database and transform to vectors
        ids, dimensions, embeddings = transform(stream(documents), buffer)

        if embeddings is not None:
            # Build LSA model (if enabled). Remove principal components from embeddings.
            if self.config.get("pca"):
                self.reducer = Reducer(embeddings, self.config["pca"])
                self.reducer(embeddings)

            # Save index dimensions
            self.config["dimensions"] = dimensions

            # Create approximate nearest neighbor index
            self.ann = self.createann()

            # Add embeddings to the index
            self.ann.index(embeddings)

    # Save indexids-ids mapping for indexes with no database, except when this is a reindex
    if ids and not reindex and not self.database:
        self.ids = self.createids(ids)

    # Index scoring, if necessary
    # This must occur before graph index in order to be available to the graph
    if self.issparse():
        self.scoring.index()

    # Index subindexes, if necessary
    if self.indexes:
        self.indexes.index()

    # Index graph, if necessary
    if self.graph:
        self.graph.index(Search(self, True), Ids(self), self.batchsimilarity)
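index accepts (id, data, tags) tuples, (id, data) tuples or bare data; internally a stream normalizes every entry to the full three-element form before vectorization. A minimal sketch of that normalization idea (a hypothetical helper, not txtai's actual Stream class):

```python
def normalize(documents):
    # Expand each entry to the full (id, data, tags) form,
    # assigning sequential ids to bare data entries
    for index, document in enumerate(documents):
        if isinstance(document, tuple):
            uid, data = document[0], document[1]
            tags = document[2] if len(document) > 2 else None
            yield (uid, data, tags)
        else:
            yield (index, document, None)

docs = ["text a", ("doc1", "text b"), ("doc2", "text c", "tag")]
print(list(normalize(docs)))
```

Accepting a generator here is deliberate: documents stream through transformation in batches, so an index can be built over datasets larger than memory.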

info()

Prints the current embeddings index configuration.

Source code in txtai/embeddings/base.py

def info(self):
    """
    Prints the current embeddings index configuration.
    """

    if self.config:
        # Print configuration
        print(json.dumps(self.config, sort_keys=True, default=str, indent=2))
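The `default=str` argument matters here: configuration values that are not JSON-serializable are rendered via `str()` instead of raising `TypeError`, and `sort_keys=True` keeps the output stable. A small demonstration with a hypothetical config containing a non-serializable value:

```python
import json

config = {"path": "model", "dimensions": 384, "loaded": complex(1, 2)}

# default=str converts non-serializable values instead of raising TypeError
output = json.dumps(config, sort_keys=True, default=str, indent=2)
print(output)
```

Without `default=str`, the dump would fail as soon as the config held anything beyond plain JSON types.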

isdense()

Checks if this instance has an associated ANN instance.

Returns:


True if this instance has an associated ANN, False otherwise

Source code in txtai/embeddings/base.py

def isdense(self):
    """
    Checks if this instance has an associated ANN instance.

    Returns:
        True if this instance has an associated ANN, False otherwise
    """

    return self.ann is not None

issparse()

Checks if this instance has an associated scoring instance with term indexing enabled.

Returns:


True if term index is enabled, False otherwise

Source code in txtai/embeddings/base.py

def issparse(self):
    """
    Checks if this instance has an associated scoring instance with term indexing enabled.

    Returns:
        True if term index is enabled, False otherwise
    """

    return self.scoring and self.scoring.hasterms()

isweighted()

Checks if this instance has an associated scoring instance with term weighting enabled.

Returns:


True if term weighting is enabled, False otherwise

Source code in txtai/embeddings/base.py

def isweighted(self):
    """
    Checks if this instance has an associated scoring instance with term weighting enabled.

    Returns:
        True if term weighting is enabled, False otherwise
    """

    return self.scoring and not self.scoring.hasterms()

load(path=None, cloud=None, config=None, **kwargs)

Loads an existing index from path.

Parameters:

path: input path (default: None)
cloud: cloud storage configuration (default: None)
config: configuration overrides (default: None)
kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

  1. def load(self, path=None, cloud=None, config=None, kwargs):
  2. “””
  3. Loads an existing index from path.
  4. Args:
  5. path: input path
  6. cloud: cloud storage configuration
  7. config: configuration overrides
  8. kwargs: additional configuration as keyword args
  9. “””
  10. # Load from cloud, if configured
  11. cloud = self.createcloud(cloud=cloud, kwargs)
  12. if cloud:
  13. path = cloud.load(path)
  14. # Check if this is an archive file and extract
  15. path, apath = self.checkarchive(path)
  16. if apath:
  17. self.archive.load(apath)
  18. # Load index configuration
  19. self.config = self.loadconfig(path)
  20. # Apply config overrides
  21. self.config = {self.config, config} if config else self.config
  22. # Approximate nearest neighbor index - stores dense vectors
  23. self.ann = self.createann()
  24. if self.ann:
  25. self.ann.load(f”{path}/embeddings”)
  26. # Dimensionality reduction model - word vectors only
  27. if self.config.get(“pca”):
  28. self.reducer = Reducer()
  29. self.reducer.load(f”{path}/lsa”)
  30. # Index ids when content is disabled
  31. self.ids = self.createids()
  32. if self.ids:
  33. self.ids.load(f”{path}/ids”)
  34. # Document database - stores document content
  35. self.database = self.createdatabase()
  36. if self.database:
  37. self.database.load(f”{path}/documents”)
  38. # Sparse vectors - stores term sparse arrays
  39. self.scoring = self.createscoring()
  40. if self.scoring:
  41. self.scoring.load(f”{path}/scoring”)
  42. # Subindexes
  43. self.indexes = self.createindexes()
  44. if self.indexes:
  45. self.indexes.load(f”{path}/indexes”)
  46. # Graph network - stores relationships
  47. self.graph = self.creategraph()
  48. if self.graph:
  49. self.graph.load(f”{path}/graph”)
  50. # Dense vectors - transforms data to embeddings vectors
  51. self.model = self.loadvectors()
  52. # Query model
  53. self.query = self.loadquery()
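The config override step merges overrides on top of the configuration read from disk, so keys passed via `config` win over stored keys. With plain dicts (illustrative values only):

```python
# Configuration loaded from the index directory (illustrative values)
stored = {"path": "model-path", "content": True, "backend": "faiss"}

# Overrides passed to load(config=...)
overrides = {"content": False}

# Later keys win in a dict merge, so overrides replace stored values
config = {**stored, **overrides} if overrides else stored

print(config)  # {'path': 'model-path', 'content': False, 'backend': 'faiss'}
```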

reindex(config=None, function=None, **kwargs)

Recreates embeddings index using config. This method only works if document content storage is enabled.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| config | new config | None |
| function | optional function to prepare content for indexing | None |
| kwargs | additional configuration as keyword args | {} |

Source code in txtai/embeddings/base.py

```python
def reindex(self, config=None, function=None, **kwargs):
    """
    Recreates embeddings index using config. This method only works if document content storage is enabled.

    Args:
        config: new config
        function: optional function to prepare content for indexing
        kwargs: additional configuration as keyword args
    """

    if self.database:
        # Merge configuration into a single dictionary
        config = {**config, **kwargs} if config and kwargs else config if config else kwargs

        # Keep content and objects parameters to ensure database is preserved
        config["content"] = self.config["content"]
        if "objects" in self.config:
            config["objects"] = self.config["objects"]

        # Reset configuration
        self.configure(config)

        # Reset function references
        if self.functions:
            self.functions.reset()

        # Reindex
        if function:
            self.index(function(self.database.reindex(self.config)), True)
        else:
            self.index(self.database.reindex(self.config), True)
```
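`reindex` merges `config` and keyword args into one dictionary, then copies the `content` (and, when present, `objects`) settings from the current configuration so the document database survives the rebuild. The merge step can be sketched in isolation (hypothetical `mergeconfig` helper, not a txtai function):

```python
def mergeconfig(config, kwargs, current):
    # Merge configuration into a single dictionary; kwargs win over config
    config = {**config, **kwargs} if config and kwargs else config if config else kwargs

    # Keep content and objects parameters so the database is preserved
    config["content"] = current["content"]
    if "objects" in current:
        config["objects"] = current["objects"]

    return config

current = {"content": True, "objects": True, "path": "old-model"}
print(mergeconfig({"path": "new-model"}, {}, current))
```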

save(path, cloud=None, **kwargs)

Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip. In those cases, the index is stored as a compressed file.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| path | output path | required |
| cloud | cloud storage configuration | None |
| kwargs | additional configuration as keyword args | {} |

Source code in txtai/embeddings/base.py

```python
def save(self, path, cloud=None, **kwargs):
    """
    Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
    In those cases, the index is stored as a compressed file.

    Args:
        path: output path
        cloud: cloud storage configuration
        kwargs: additional configuration as keyword args
    """

    if self.config:
        # Check if this is an archive file
        path, apath = self.checkarchive(path)

        # Create output directory, if necessary
        os.makedirs(path, exist_ok=True)

        # Copy sentence vectors model
        if self.config.get("storevectors"):
            shutil.copyfile(self.config["path"], os.path.join(path, os.path.basename(self.config["path"])))
            self.config["path"] = os.path.basename(self.config["path"])

        # Save index configuration
        self.saveconfig(path)

        # Save approximate nearest neighbor index
        if self.ann:
            self.ann.save(f"{path}/embeddings")

        # Save dimensionality reduction model (word vectors only)
        if self.reducer:
            self.reducer.save(f"{path}/lsa")

        # Save index ids
        if self.ids:
            self.ids.save(f"{path}/ids")

        # Save document database
        if self.database:
            self.database.save(f"{path}/documents")

        # Save scoring index
        if self.scoring:
            self.scoring.save(f"{path}/scoring")

        # Save subindexes
        if self.indexes:
            self.indexes.save(f"{path}/indexes")

        # Save graph
        if self.graph:
            self.graph.save(f"{path}/graph")

        # If this is an archive, save it
        if apath:
            self.archive.save(apath)

        # Save to cloud, if configured
        cloud = self.createcloud(cloud=cloud, **kwargs)
        if cloud:
            cloud.save(apath if apath else path)
```
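Whether `save` writes a directory or a single compressed file depends on the path suffix. A hypothetical helper (not the txtai `checkarchive` implementation) that makes the same decision:

```python
def isarchive(path):
    # Compressed-file suffixes handled by save(); any other path is treated as a directory
    return any(path.endswith(suffix) for suffix in (".tar.gz", ".tar.bz2", ".tar.xz", ".zip"))

print(isarchive("index.tar.gz"))   # True
print(isarchive("index.zip"))      # True
print(isarchive("backups/index"))  # False
```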

score(documents)

Builds a term weighting scoring index. Only used by word vectors models.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| documents | iterable of (id, data, tags), (id, data) or data | required |

Source code in txtai/embeddings/base.py

```python
def score(self, documents):
    """
    Builds a term weighting scoring index. Only used by word vectors models.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
    """

    # Build scoring index for word vectors term weighting
    if self.isweighted():
        self.scoring.index(Stream(self)(documents))
```

search(query, limit=None, weights=None, index=None, parameters=None, graph=False)

Finds documents most similar to the input query. This method will run either an index search or an index + database search depending on if a database is available.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| query | input query | required |
| limit | maximum results | None |
| weights | hybrid score weights, if applicable | None |
| index | index name, if applicable | None |
| parameters | dict of named parameters to bind to placeholders | None |
| graph | return graph results if True | False |

Returns:

list of (id, score) for index search

list of dict for an index + database search

graph when graph is set to True

Source code in txtai/embeddings/base.py

```python
def search(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
    """
    Finds documents most similar to the input query. This method will run either an index search
    or an index + database search depending on if a database is available.

    Args:
        query: input query
        limit: maximum results
        weights: hybrid score weights, if applicable
        index: index name, if applicable
        parameters: dict of named parameters to bind to placeholders
        graph: return graph results if True

    Returns:
        list of (id, score) for index search
        list of dict for an index + database search
        graph when graph is set to True
    """

    results = self.batchsearch([query], limit, weights, index, [parameters], graph)
    return results[0] if results else results
```
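`search` is a thin wrapper that delegates to `batchsearch` with a single-element list and unwraps the first result set. The same wrapping pattern in isolation, with a stand-in batch function (hypothetical, not the txtai implementation):

```python
def batchsearch(queries, limit=3):
    # Stand-in batch API: returns one fixed result list per query
    return [[(0, 0.9), (1, 0.5)][:limit] for _ in queries]

def search(query, limit=3):
    # Wrap the single query, then unwrap the first (and only) result set
    results = batchsearch([query], limit)
    return results[0] if results else results

print(search("feel good story"))           # [(0, 0.9), (1, 0.5)]
print(search("feel good story", limit=1))  # [(0, 0.9)]
```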

similarity(query, data)

Computes the similarity between query and list of data. Returns a list of (id, score) sorted by highest score, where id is the index in data.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| query | input query | required |
| data | list of data | required |

Returns:

list of (id, score)

Source code in txtai/embeddings/base.py

```python
def similarity(self, query, data):
    """
    Computes the similarity between query and list of data. Returns a list of
    (id, score) sorted by highest score, where id is the index in data.

    Args:
        query: input query
        data: list of data

    Returns:
        list of (id, score)
    """

    return self.batchsimilarity([query], data)[0]
```
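The contract is that each element of `data` is scored against the query and results come back as (id, score) pairs sorted by descending score, where id is the element's position in `data`. A toy version using token overlap instead of embeddings (illustrative only, not how txtai scores):

```python
def similarity(query, data):
    # Toy scorer: fraction of query tokens found in each data element
    tokens = set(query.lower().split())
    scores = []
    for uid, text in enumerate(data):
        words = set(text.lower().split())
        scores.append((uid, len(tokens & words) / len(tokens)))

    # Highest score first; id is the index in data
    return sorted(scores, key=lambda x: x[1], reverse=True)

print(similarity("climate change", ["climate change policy", "baseball scores"]))
# [(0, 1.0), (1, 0.0)]
```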

terms(query)

Extracts keyword terms from a query.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| query | input query | required |

Returns:

query reduced down to keyword terms

Source code in txtai/embeddings/base.py

```python
def terms(self, query):
    """
    Extracts keyword terms from a query.

    Args:
        query: input query

    Returns:
        query reduced down to keyword terms
    """

    return self.batchterms([query])[0]
```
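Reducing a natural language query to keyword terms amounts to dropping function words and keeping content words. A simplified stopword filter shows the idea (the word list here is illustrative, not txtai's):

```python
# Illustrative stopword list, not the one txtai uses
STOPWORDS = {"the", "a", "an", "of", "for", "to", "in", "is", "what", "are"}

def terms(query):
    # Keep only tokens that are not stopwords
    return " ".join(word for word in query.lower().split() if word not in STOPWORDS)

print(terms("What are the best sites in the city"))  # "best sites city"
```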

transform(document)

Transforms document into an embeddings vector.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| document | (id, data, tags), (id, data) or data | required |

Returns:

embeddings vector

Source code in txtai/embeddings/base.py

```python
def transform(self, document):
    """
    Transforms document into an embeddings vector.

    Args:
        document: (id, data, tags), (id, data) or data

    Returns:
        embeddings vector
    """

    return self.batchtransform([document])[0]
```

upsert(documents)

Runs an embeddings upsert operation. If the index exists, new data is appended to the index, existing data is updated. If the index doesn’t exist, this method runs a standard index operation.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| documents | iterable of (id, data, tags), (id, data) or data | required |

Source code in txtai/embeddings/base.py

  1. 153
  2. 154
  3. 155
  4. 156
  5. 157
  6. 158
  7. 159
  8. 160
  9. 161
  10. 162
  11. 163
  12. 164
  13. 165
  14. 166
  15. 167
  16. 168
  17. 169
  18. 170
  19. 171
  20. 172
  21. 173
  22. 174
  23. 175
  24. 176
  25. 177
  26. 178
  27. 179
  28. 180
  29. 181
  30. 182
  31. 183
  32. 184
  33. 185
  34. 186
  35. 187
  36. 188
  37. 189
  38. 190
  39. 191
  40. 192
  41. 193
  42. 194
  43. 195
  44. 196
  45. 197
  46. 198
  1. def upsert(self, documents):
  2. “””
  3. Runs an embeddings upsert operation. If the index exists, new data is
  4. appended to the index, existing data is updated. If the index doesnt exist,
  5. this method runs a standard index operation.
  6. Args:
  7. documents: iterable of (id, data, tags), (id, data) or data
  8. “””
  9. # Run standard insert if index doesn’t exist or it has no records
  10. if not self.count():
  11. self.index(documents)
  12. return
  13. # Create transform and stream
  14. transform = Transform(self, Action.UPSERT)
  15. stream = Stream(self, Action.UPSERT)
  16. with tempfile.NamedTemporaryFile(mode=”wb”, suffix=”.npy”) as buffer:
  17. # Load documents into database and transform to vectors
  18. ids, _, embeddings = transform(stream(documents), buffer)
  19. if embeddings is not None:
  20. # Remove principal components from embeddings, if necessary
  21. if self.reducer:
  22. self.reducer(embeddings)
  23. # Append embeddings to the index
  24. self.ann.append(embeddings)
  25. # Save indexids-ids mapping for indexes with no database
  26. if ids and not self.database:
  27. self.ids = self.createids(self.ids + ids)
  28. # Scoring upsert, if necessary
  29. # This must occur before graph upsert in order to be available to the graph
  30. if self.issparse():
  31. self.scoring.upsert()
  32. # Subindexes upsert, if necessary
  33. if self.indexes:
  34. self.indexes.upsert()
  35. # Graph upsert, if necessary
  36. if self.graph:
  37. self.graph.upsert(Search(self, True), Ids(self), self.batchsimilarity)