Tags: meta-toolkit/meta

v3.0.2

release MeTA v3.0.2

Bug fixes
- Fix issues caused by using `MAKE_NUMERIC_IDENTIFIER` instead of
  `MAKE_NUMERIC_IDENTIFIER_UDL` on GCC 7.1.1.
- Work around (what we assume is) a bug on MSYS2 where `cmake` would link
  in additional exception handling libraries that would cause a crash
  during indexing. The workaround builds the `mman-win32` library as
  shared.
- Silence fallthrough warnings on Clang from `murmur_hash`.

v3.0.1

Verified

This tag was signed with the committer’s verified signature.
skystrife Chase Geigle
release MeTA v3.0.1

New features
- Add an optional `xz{i,o}fstream` to `meta::io` if compiled with liblzma
  available.
- `util::disk_vector<const T>` can now be used to specify a read-only view
  of a disk-backed vector.

Bug fixes
- `ir_eval::print_stats` now takes a `num_docs` parameter to properly
  display evaluation metrics at a certain cutoff point, which was always 5
  beforehand. This fixes a bug in `query-runner` where the stats were not
  being computed according to the cutoff point specified in the
  configuration.
- `ir_eval::avg_p` now correctly stops computing after `num_docs`. Before,
  if you specified `num_docs` as a smaller value than the size of the
  result list, it would erroneously keep calculating until the end of the
  result list instead of stopping after `num_docs` elements.
- `{inverted,forward}_index` can now be loaded from read-only filesystems.
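
The `avg_p` fix above comes down to where the loop over the result list stops. The following is a minimal sketch, assuming the result list has been reduced to per-rank relevance flags. The name `average_precision_at` and the normalization by `min(num_docs, total_relevant)` are illustrative assumptions, not MeTA's actual `ir_eval` code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: average precision over the first `num_docs` results.
// relevant[i] is true when the result at rank i + 1 is relevant.
double average_precision_at(const std::vector<bool>& relevant,
                            std::size_t total_relevant, std::size_t num_docs)
{
    // Stop after num_docs elements, even if the result list is longer.
    const std::size_t limit = std::min(num_docs, relevant.size());
    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < limit; ++i)
    {
        if (relevant[i])
        {
            ++hits;
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    // Assumed normalization: the most hits achievable within the cutoff.
    const std::size_t denom = std::min(num_docs, total_relevant);
    return denom == 0 ? 0.0 : sum / static_cast<double>(denom);
}
```

With relevance flags `{true, false, true, true}`, three relevant documents in total, and `num_docs = 2`, only the first two ranks contribute, so the later hits at ranks 3 and 4 no longer affect the score.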

v3.0.0

release MeTA v3.0.0

New features
- Add an `embedding_analyzer` that represents documents with their averaged word
  vectors.
- Add a `parallel::reduction` algorithm designed for parallelizing complex
  accumulation operations (like an E step in an EM algorithm)
- Parallelize feature counting in feature selector using the new
  `parallel::reduction`
- Add a `parallel::for_each_block` algorithm to run functions on
  (relatively) equal sub-ranges of an iterator range in parallel
- Add a parallel merge sort as `parallel::sort`
- Add a `util/traits.h` header for general useful traits
- Add a Markov model implementation in `sequence::markov_model`
- Add a generic unsupervised HMM implementation. This implementation
  supports HMMs with discrete observations (what is used most often) and
  sequence observations (useful for log mining applications). The
  forward-backward algorithm is implemented using both the scaling method
  and the log-space method. The scaling method is used by default, but the
  log-space method is useful for HMMs with sequence observations to avoid
  underflow issues when the output probabilities themselves are very small.
- Add the KL-divergence retrieval function using pseudo-relevance feedback
  with the two-component mixture-model approach of Zhai and Lafferty,
  called `kl_divergence_prf`. This ranker internally can use any
  `language_model_ranker` subclass like `dirichlet_prior` or
  `jelinek_mercer` to perform the ranking of the feedback set and the
  result documents with respect to the modified query.

  The EM algorithm used for the two-component mixture model is provided as
  the `index::feedback::unigram_mixture` free function and returns the
  feedback model.
- Add the Rocchio algorithm (`rocchio`) for pseudo-relevance feedback in
  the vector space model.
- **Breaking Change.** To facilitate the above two changes, we have also
  broken the `ranker` hierarchy into one more level. At the top we have
  `ranker`, which has a pure virtual function `rank()` that can be
  overridden to provide entirely custom ranking behavior. This is the class
  the KL-divergence and Rocchio methods derive from, as we need to
  re-define what it means to rank documents (first retrieving a feedback
  set, then ranking documents with respect to an updated query).

  Most of the time, however, you will want to derive from the second level
  `ranking_function`, which is what was called `ranker` before. This class
  provides a definition of `rank()` to perform document-at-a-time ranking,
  and expects deriving classes to instead provide `initial_score()` and
  `score_one()` implementations to define the scoring function used for
  each document. **Existing code that derived from `ranker` prior to this
  version of MeTA likely needs to be changed to instead derive from
  `ranking_function`.**
- Add the `util::transform_iterator` class and `util::make_transform_iterator`
  function for providing iterators that transform their output according to
  a unary function.
- **Breaking Change.** `whitespace_tokenizer` now emits *only* word tokens
  by default, suppressing all whitespace tokens. The old default was to
  emit tokens containing whitespace in addition to actual word tokens. The
  old behavior can be obtained by passing `false` to its constructor, or
  setting `suppress-whitespace = false` in its configuration group in
  `config.toml`. (Note that whitespace tokens are still needed if using a
  `sentence_boundary` filter but, in nearly all circumstances,
  `icu_tokenizer` should be preferred.)
- **Breaking Change.** Co-occurrence counting for embeddings now uses
  history that crosses sentence boundaries by default. The old behavior
  (clearing the history when starting a new sentence) can be obtained by
  ensuring that a tokenizer is being used that emits sentence boundary tags
  and by setting `break-on-tags = true` in the `[embeddings]` table of
  `config.toml`.
- **Breaking Change.** All references in the embeddings library to "coocur"
  have been changed to "cooccur". This means that some files and binaries
  have been renamed. Much of the co-occurrence counting part of the
  embeddings library has also been moved to the public API.
- Co-occurrence counting now is performed in parallel. Behavior of its
  merge strategy can be configured with the new `[embeddings]` config
  parameter `merge-fanout = n`, which specifies the maximum number of
  on-disk chunks to allow before kicking off a multi-way merge (default 8).

- Add additional `packed_write` and `packed_read` overloads: for
  `std::pair`, `stats::dirichlet`, `stats::multinomial`,
  `util::dense_matrix`, and `util::sparse_vector`
- Additional functions have been added to `ranker_factory` to allow
  construction/loading of `language_model_ranker` subclasses (useful for the
  `kl_divergence_prf` implementation)
- Add a `util::make_fixed_heap` helper function to simplify the declaration
  of `util::fixed_heap` classes with lambda function comparators.
- Add regression tests for the rankers' MAP and NDCG scores. This adds a new
  dataset `cranfield` that contains non-binary relevance judgments to
  facilitate these new tests.
- Bump bundled version of ICU to 58.2.
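
The `ranker` / `ranking_function` split described in the breaking change above can be pictured with a toy two-level hierarchy. This is a sketch only: the `sketch` namespace, the dense `weights` vectors, and the simplified `rank()`, `initial_score()`, and `score_one()` signatures are assumptions made for illustration and do not match MeTA's real interfaces, which work against an index.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace sketch
{
// Toy stand-ins: a document and a query are dense term-weight vectors.
using weights = std::vector<double>;

struct search_result
{
    std::size_t doc;
    double score;
};

// Top level: only rank() is required, so feedback-style methods can
// replace the whole ranking procedure.
class ranker
{
  public:
    virtual ~ranker() = default;
    virtual std::vector<search_result>
    rank(const std::vector<weights>& docs, const weights& query,
         std::size_t num_results) const = 0;
};

// Second level: supplies rank() and asks subclasses for the scoring pieces.
class ranking_function : public ranker
{
  public:
    std::vector<search_result> rank(const std::vector<weights>& docs,
                                    const weights& query,
                                    std::size_t num_results) const override
    {
        std::vector<search_result> results;
        for (std::size_t d = 0; d < docs.size(); ++d)
        {
            double score = initial_score(query);
            const std::size_t terms = std::min(query.size(), docs[d].size());
            for (std::size_t t = 0; t < terms; ++t)
                if (query[t] != 0.0 && docs[d][t] != 0.0)
                    score += score_one(query[t], docs[d][t]);
            results.push_back({d, score});
        }
        // Highest score first; ties keep document order.
        std::stable_sort(results.begin(), results.end(),
                         [](const search_result& a, const search_result& b) {
                             return a.score > b.score;
                         });
        if (results.size() > num_results)
            results.resize(num_results);
        return results;
    }

    virtual double initial_score(const weights& query) const = 0;
    virtual double score_one(double query_weight, double doc_weight) const = 0;
};

// What a typical user-defined ranker looks like after the change: it
// derives from ranking_function and never touches rank().
class dot_product : public ranking_function
{
  public:
    double initial_score(const weights&) const override { return 0.0; }
    double score_one(double query_weight, double doc_weight) const override
    {
        return query_weight * doc_weight;
    }
};
} // namespace sketch
```

In this picture, a feedback method would derive directly from `sketch::ranker` and implement `rank()` itself, while `sketch::dot_product` shows the common case of only supplying the per-term scoring functions.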

Bug Fixes
- Fix bug in NDCG calculation (ideal-DCG was computed using the wrong
  sorting order for non-binary judgments)
- Fix bug where the final chunks to be merged in index creation were not
  being deleted when merging completed
- Fix bug where GloVe training would allocate the embedding matrix before
  starting the shuffling process, causing it to exceed the "max-ram"
  config parameter.
- Fix bug with consuming MeTA from a build directory with `cmake` when
  building a static ICU library. `meta-utf` is now forced to be a shared
  library, which (1) should save on binary sizes and (2) ensures that the
  statically built ICU is linked into the `libmeta-utf.so` library to avoid
  undefined references to ICU functions.
- Fix bug with consuming Release-mode MeTA libraries from another project
  being built in Debug mode. Before, `identifiers.h` would change behavior
  based on the `NDEBUG` macro's setting. This behavior has been removed,
  and opaque identifiers are always on.
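
The NDCG fix listed above concerns the order in which judgments are sorted before computing the ideal DCG. The sketch below assumes a gain of `2^rel - 1`, a `log2(rank + 1)` discount, and an ideal ordering built from the same list of judgments. These choices and the function names are illustrative assumptions, not a copy of MeTA's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// DCG of graded relevance judgments given in ranked order.
double dcg_sketch(const std::vector<int>& rels)
{
    double total = 0.0;
    for (std::size_t i = 0; i < rels.size(); ++i)
    {
        const double gain = std::pow(2.0, rels[i]) - 1.0;
        // Rank is i + 1, so the discount is log2(rank + 1) = log2(i + 2).
        total += gain / std::log2(static_cast<double>(i) + 2.0);
    }
    return total;
}

// NDCG: actual DCG divided by the DCG of the best possible ordering.
double ndcg_sketch(std::vector<int> rels)
{
    const double actual = dcg_sketch(rels);
    // The ideal ranking puts the HIGHEST judgments first. With binary
    // judgments many orderings tie, but graded judgments need a strict
    // descending sort.
    std::sort(rels.begin(), rels.end(), std::greater<int>{});
    const double ideal = dcg_sketch(rels);
    return ideal > 0.0 ? actual / ideal : 0.0;
}
```

Sorting ascending instead would make the "ideal" DCG smaller than the actual DCG for a good ranking, which pushes the ratio above 1.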

Deprecation
- `disk_index::doc_name` and `disk_index::doc_path` have been deprecated in
  favor of the more general (and less confusing) `metadata()`. They will be
  removed in a future major release.
- Support for 32-bit architectures is provided on a best-effort basis. MeTA
  makes heavy use of memory mapping, which is best paired with a 64-bit
  address space. Please move to a 64-bit platform for using MeTA if at all
  possible (most consumer machines should support 64-bit if they were made
  in the last 5 years or so).

v2.4.2

Verified

This tag was signed with the committer’s verified signature.
skystrife Chase Geigle
release MeTA v2.4.2

Bug fixes
- Properly shuffle documents when doing an even-split classification test
- Make forward indexer listen to `indexer-num-threads` config option.
- Use correct number of threads when deciding block sizes for
    `parallel_for`
- Add workaround to `filesystem::remove_all` for Windows systems to avoid
    spurious failures caused by virus scanners keeping files open after we
    deleted them
- Fix invalid memory access in `gzstreambuf::underflow`

v2.4.1

release MeTA v2.4.1

Bug fixes
- Eliminate excess warnings on Darwin about double preprocessor definitions
- Fix issue finding `config.h` when used as a sub-project via
    `add_subdirectory()`

v2.4.0

Verified

This tag was signed with the committer’s verified signature.
smassung Sean Massung
release MeTA v2.4.0

New features
- Add a minimal perfect hashing implementation for `language_model`, and unify
  the querying interface with the existing language model.
- Add a CMake `install()` command to install MeTA as a library (issue #143). For
  example, once the library is installed, users can do:

    find_package(MeTA 2.4 REQUIRED)

    add_executable(my-program src/my_program.cpp)
    # or whatever other libs you need from MeTA
    target_link_libraries(my-program meta-index)
- Feature selection functionality added to `multiclass_dataset` and
  `binary_dataset` and views (issues #111, #149 and PR #150 thanks to @siddshuk).

    auto selector = features::make_selector(*config, training_vw);
    uint64_t total_features_selected = 20;
    selector->select(total_features_selected);
    auto filtered_dset = features::filter_dataset(dset, *selector);

- Users can now, similar to `hash_append`, declare standalone functions in the
  same scope as their type called `packed_read` and `packed_write` which will be
  called by `io::packed::read` and `io::packed::write`, respectively, via
  argument-dependent lookup.
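
The argument-dependent lookup mechanism mentioned above can be sketched without MeTA itself. In the example below, the `sketch::packed` namespace stands in for `io::packed`, the fixed-width integer encoding is a simplification chosen for brevity, and `geo::point` is a hypothetical user type. None of these are MeTA's real definitions.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

namespace sketch
{
namespace packed
{
// Library-provided overloads for a built-in type.
inline void packed_write(std::ostream& os, std::uint32_t value)
{
    for (int shift = 0; shift < 32; shift += 8)
        os.put(static_cast<char>((value >> shift) & 0xFF));
}

inline void packed_read(std::istream& is, std::uint32_t& value)
{
    value = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        const auto byte = static_cast<unsigned char>(is.get());
        value |= static_cast<std::uint32_t>(byte) << shift;
    }
}

// Entry points: the unqualified calls let ADL find overloads that live in
// the namespace of the argument's type.
template <class T>
void write(std::ostream& os, const T& value)
{
    packed_write(os, value);
}

template <class T>
void read(std::istream& is, T& value)
{
    packed_read(is, value);
}
} // namespace packed
} // namespace sketch

namespace geo
{
struct point
{
    std::uint32_t x;
    std::uint32_t y;
};

// Standalone functions declared in the same scope as the type.
inline void packed_write(std::ostream& os, const point& p)
{
    sketch::packed::write(os, p.x);
    sketch::packed::write(os, p.y);
}

inline void packed_read(std::istream& is, point& p)
{
    sketch::packed::read(is, p.x);
    sketch::packed::read(is, p.y);
}
} // namespace geo

// Demo: serialize and deserialize a point through an in-memory stream.
inline geo::point round_trip(const geo::point& p)
{
    std::stringstream ss;
    sketch::packed::write(ss, p);
    geo::point q{0, 0};
    sketch::packed::read(ss, q);
    return q;
}
```

The design point is that the library never needs to know about `geo::point`: the user adds two free functions next to the type, and the generic `write`/`read` entry points pick them up.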

Bug fixes
- Fix edge-case bug in the succinct data structures
- Fix off-by-one error in `lm::diff`

Enhancements
- Added functionality to the `meta::hashing` library: `hash_append` overload for
  `std::vector`, manually-seeded hash function
- Further isolate ICU in MeTA to allow CMake to `install()`
- Updates to EWS (UIUC) build guide
- Add `std::vector` operations to `io::packed`
- Consolidated all variants of chunk iterators into one template
- Add MeTA's citation to the README!

v2.3.0

Release MeTA v2.3.0

New features
- Forward and inverted indexes are now stored in one directory. **To make
    use of your existing indexes, you will need to move their
    directories.** For example, a configuration that used to look like the
    following

    dataset = "20newsgroups"
    corpus = "line.toml"
    forward-index = "20news-fwd"
    inverted-index = "20news-inv"

    will now look like the following

    dataset = "20newsgroups"
    corpus = "line.toml"
    index = "20news-index"

    and your folder structure should now look like

    20news-index
    ├── fwd
    └── inv

    You can do this by simply moving the old folders around like so:

    mkdir 20news-index
    mv 20news-fwd 20news-index/fwd
    mv 20news-inv 20news-index/inv

- `stats::multinomial` now can report the number of unique event types
    counted (`unique_events()`)
- `std::vector` can now be hashed via `hash_append`.

Bug fixes
- Fix rounding bug in language model-based rankers. This bug caused
    severely degraded performance for these rankers with short queries. The
    unit tests have been improved to prevent such a regression in the
    future.

Enhancements
- The bundled ICU version has been bumped to ICU 57.1.
- MeTA will now attempt to build its own version of ICU on Windows if it
    fails to find a suitable ICU installed.
- CI support for GCC 6.x was added for all three major platforms.
- CI support also uses a fixed version of LLVM/libc++ instead of trunk.

v2.2.0

Release MeTA v2.2.0

New features
- Parallelized versions of PageRank and Personalized PageRank have been
  added. A demo is available in `wiki-page-rank`; see the website for
  more information on obtaining the required data.
- Add a disk-based streaming minimal perfect hash function library. A
  sub-component of this is a small memory-mapped succinct data structure
  library for answering rank/select queries on bit vectors.
- Much of our CMake magic has been moved into a separate project included
  as a submodule: https://github.com/meta-toolkit/meta-cmake, which can
  now be used in other projects to simplify initial build system
  configuration.

Bug fixes
- Fix parameter settings in language model rankers not being range checked
  (issue #134).
- Fix incorrect incoming edge insertion in `directed_graph::add_edge()`.
- Fix `find_first_of` and `find_last_of` in `util::string_view`.

Enhancements
- `forward_index` now knows how to tokenize a document down to a
  `feature_vector`, provided it was generated with a non-LIBSVM analyzer.
- Allow loading of an existing index where its corpus is no longer
  available.
- Data is no longer shuffled in `batch_train`. Shuffling the data
  causes horrible access patterns in the postings file, so the data
  should instead be shuffled before indexing.
- `util::array_view`s can now be constructed as empty.
- `util::multiway_merge` has been made more generic. You can now specify
  both the comparison function and merging criteria as parameters, which
  default to `operator<` and `operator==`, respectively.
- Simple utility classes `io::mifstream` and `io::mofstream` have been
  added for places where a moveable `ifstream` or `ofstream` is desired
  as a workaround for older standard libraries lacking these move
  constructors.
- The number of indexing threads can be controlled via the configuration
  key `indexer-num-threads` (which defaults to the number of threads on
  the system), and the number of threads allowed to concurrently write to
  disk can be controlled via `indexer-max-writers` (which defaults to 8).
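
The idea behind a multi-way merge with a configurable comparison function and merging criterion, as described for `util::multiway_merge` above, can be sketched on in-memory chunks. The function name `multiway_merge_sketch`, its parameter list, and the `combine` callback are assumptions for illustration. The real utility operates on on-disk chunks and has a different interface.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge several sorted chunks into one sorted output. Records for which
// should_merge(a, b) holds are folded together with combine(a, b).
template <class T, class Less, class ShouldMerge, class Combine>
std::vector<T> multiway_merge_sketch(const std::vector<std::vector<T>>& chunks,
                                     Less less, ShouldMerge should_merge,
                                     Combine combine)
{
    struct cursor
    {
        std::size_t chunk;
        std::size_t pos;
    };
    // priority_queue is a max-heap, so invert the comparison to pop the
    // smallest record first.
    auto later = [&](const cursor& a, const cursor& b) {
        return less(chunks[b.chunk][b.pos], chunks[a.chunk][a.pos]);
    };
    std::priority_queue<cursor, std::vector<cursor>, decltype(later)> heap(
        later);
    for (std::size_t i = 0; i < chunks.size(); ++i)
        if (!chunks[i].empty())
            heap.push(cursor{i, 0});

    std::vector<T> out;
    while (!heap.empty())
    {
        cursor cur = heap.top();
        heap.pop();
        const T& item = chunks[cur.chunk][cur.pos];
        if (!out.empty() && should_merge(out.back(), item))
            combine(out.back(), item);
        else
            out.push_back(item);
        if (++cur.pos < chunks[cur.chunk].size())
            heap.push(cur);
    }
    return out;
}

// Demo: merge two sorted chunks of (term, count) pairs, summing counts for
// equal terms.
inline std::vector<std::pair<std::string, int>> merge_counts_demo()
{
    using entry = std::pair<std::string, int>;
    std::vector<entry> first{entry{"a", 1}, entry{"c", 2}};
    std::vector<entry> second{entry{"a", 3}, entry{"b", 1}};
    std::vector<std::vector<entry>> chunks{first, second};
    return multiway_merge_sketch(
        chunks,
        [](const entry& x, const entry& y) { return x.first < y.first; },
        [](const entry& x, const entry& y) { return x.first == y.first; },
        [](entry& into, const entry& from) { into.second += from.second; });
}
```

Keeping the ordering and the merging criterion as separate parameters lets the same routine merge records that sort by one key but should only be combined under a stricter (or looser) notion of equality.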

v2.1.0

Release MeTA v2.1.0

New features
- Add the [GloVe algorithm](http://www-nlp.stanford.edu/pubs/glove.pdf) for
  training word embeddings and a library class `word_embeddings` for loading and
  querying trained embeddings. To facilitate returning word embeddings, a simple
  `util::array_view` class was added.
- Add simple vector math library (and move `fastapprox` into the `math`
  namespace).

Bug fixes
- Fix `probe_map::extract()` for `inline_key_value_storage` type; old
  implementation forgot to delete all sentinel values before returning the
  vector.
- Fix incorrect definition of `l1norm()` in `sgd_model`.
- Fix `gmap` calculation where 0 average precision was ignored
- Fix progress output in `multiway_merge`.
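
For the `gmap` fix above, the subtlety is that a geometric mean cannot simply skip queries with an average precision of 0. One common convention is to floor each value at a small constant before taking logarithms. The sketch below assumes that convention and a floor of `1e-5`. Whether MeTA uses the same constant is not stated in these notes, and `gmap_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Geometric mean of per-query average precision values. Zero values are
// floored rather than ignored, so a failed query still lowers the result.
double gmap_sketch(const std::vector<double>& avg_precisions,
                   double floor_value = 1e-5)
{
    if (avg_precisions.empty())
        return 0.0;
    double log_sum = 0.0;
    for (double ap : avg_precisions)
        log_sum += std::log(std::max(ap, floor_value));
    return std::exp(log_sum / static_cast<double>(avg_precisions.size()));
}
```

Under this convention, adding a query with an average precision of 0 pulls the geometric mean sharply toward zero instead of leaving it unchanged.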

Enhancements
- Improve performance of `printing::progress`. Before, `progress::operator()` in
  tight loops could dramatically hurt performance, particularly due to frequent
  calls to `std::chrono::steady_clock::now()`. Now, `progress::operator()`
  simply sets an atomic iteration counter and a background thread periodically
  wakes to update the progress output.
- Allow full text storage in index as metadata field. If `store-full-text =
  true` (default false) in the corpus config, the string metadata field
  "content" will be added. This is to simplify the creation of full text
  metadata: the user doesn't have to duplicate their dataset in `metadata.dat`,
  and `metadata.dat` will still be somewhat human-readable without large strings
  of full text added.
- Allow `make_index` to take a user-supplied corpus object.
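
The `printing::progress` change above, where the hot loop only stores an atomic counter and a background thread does the reporting, can be sketched as follows. The class name `progress_sketch`, its constructor parameters, and the percentage-per-line output format are assumptions for illustration, not MeTA's actual class.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

class progress_sketch
{
  public:
    progress_sketch(std::uint64_t total, std::chrono::milliseconds interval,
                    std::ostream& out)
        : total_{total},
          interval_{interval},
          out_(out),
          reporter_{[this] { run(); }}
    {
    }

    ~progress_sketch()
    {
        {
            std::lock_guard<std::mutex> lock{mutex_};
            done_ = true;
        }
        cv_.notify_one();
        reporter_.join();
        print(); // report the final state once the reporter has stopped
    }

    // Called from the tight loop: a single relaxed atomic store, with no
    // clock reads and no output.
    void operator()(std::uint64_t iter)
    {
        iter_.store(iter, std::memory_order_relaxed);
    }

  private:
    void run()
    {
        std::unique_lock<std::mutex> lock{mutex_};
        // Wake up once per interval until the destructor sets done_.
        while (!cv_.wait_for(lock, interval_, [this] { return done_; }))
            print();
    }

    void print()
    {
        const auto iter = iter_.load(std::memory_order_relaxed);
        out_ << (total_ == 0 ? 100 : iter * 100 / total_) << "%\n";
    }

    std::uint64_t total_;
    std::chrono::milliseconds interval_;
    std::ostream& out_;
    std::atomic<std::uint64_t> iter_{0};
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread reporter_; // declared last so it starts after other members
};

// Demo: run a loop of 1000 iterations and return everything that was printed.
inline std::string run_progress_demo()
{
    std::ostringstream out;
    {
        progress_sketch progress{1000, std::chrono::milliseconds{5}, out};
        for (std::uint64_t i = 1; i <= 1000; ++i)
            progress(i);
    }
    return out.str();
}
```

The cost per iteration no longer depends on how often the display refreshes, which is the property the release note describes.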

Miscellaneous
- ZLIB is now a required dependency.
- Switch to just using the standalone `./unit-test` instead of `ctest`. There
  aren't really many advantages for us to using CTest at this point with the new
  unit test framework, so just use our unit test executable.

v2.0.1

release MeTA v2.0.1

Bug fixes
- Fix issue where `metadata_parser` would not consume spaces in string
    metadata fields. Thanks to @Hopsalot on the forum for the bug report!
- Fix build issue on OS X with Xcode 6.4 and `clang` related to their
    shipped version of `string_view` lacking a const `to_string()` method

Enhancements
- The `./profile` executable ensures that the file exists before operating on
  it. Thanks to @domarps for the PR!
- Add a generic `util::multiway_merge` algorithm for performing the
    merge-step of an external memory merge sort.
- Build with the following Xcode versions on Travis CI:
  * Xcode 6.1 and OS X 10.9 (as before)
  * Xcode 6.4 and OS X 10.10 (new)
  * Xcode 7.1.1 and OS X 10.10 (new)
  * Xcode 7.2 and OS X 10.11 (new)