Improve README.
PiperOrigin-RevId: 187375987
Lukasz Kaiser authored and Ryan Sepassi committed Mar 2, 2018
commit 2df7a71e69ac4e43b01e009ea0b814a7619fb06c
README.md: 11 changes (6 additions & 5 deletions)
@@ -12,10 +12,10 @@ welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CO
 
 [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor), or
 [T2T](https://github.com/tensorflow/tensor2tensor) for short, is a library
-of deep learning models and datasets designed to [accelerate deep learning
-research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html) and make it more accessible.
-
-T2T is actively used and maintained by researchers and engineers within the
+of deep learning models and datasets designed to make deep learning more
+accessible and [accelerate ML
+research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).
+is actively used and maintained by researchers and engineers within the
 [Google Brain team](https://research.google.com/teams/brain/) and a community
 of users. We're eager to collaborate with you too, so feel free to
 [open an issue on GitHub](https://github.com/tensorflow/tensor2tensor/issues)
@@ -368,6 +368,7 @@ T2T](https://research.googleblog.com/2017/06/accelerating-deep-learning-research
 * [Discrete Autoencoders for Sequence Models](https://arxiv.org/abs/1801.09797)
 * [Generating Wikipedia by Summarizing Long
   Sequences](https://arxiv.org/abs/1801.10198)
-* [Image Transformer](https://openreview.net/forum?id=r16Vyf-0-)
+* [Image Transformer](https://arxiv.org/abs/1802.05751)
+* [Training Tips for the Transformer Model](http://ufallab.ms.mff.cuni.cz/~popel/training-tips-transformer.pdf)
 
 *Note: This is not an official Google product.*
docs/index.md: 5 changes (3 additions & 2 deletions)
@@ -11,8 +11,9 @@ welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CO
 
 [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor), or
 [T2T](https://github.com/tensorflow/tensor2tensor) for short, is a library
-of deep learning models and datasets designed to [accelerate deep learning
-research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html) and make it more accessible.
+of deep learning models and datasets designed to make deep learning more
+accessible and [accelerate ML
+research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).
 
 
 ## Basics
docs/new_problem.md: 4 changes (4 additions & 0 deletions)
@@ -9,6 +9,10 @@ welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CO
 [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
 [![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
 
+Another good overview of this part together with training is given in
+[The Cloud ML Poetry Blog
+Post](https://cloud.google.com/blog/big-data/2018/02/cloud-poetry-training-and-hyperparameter-tuning-custom-text-models-on-cloud-ml-engine)
+
 Let's add a new dataset together and train the
 [Transformer](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/models/transformer.py)
 model on it. We'll give the model a line of poetry, and it will learn to
docs/walkthrough.md: 11 changes (6 additions & 5 deletions)
@@ -12,10 +12,10 @@ welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CO
 
 [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor), or
 [T2T](https://github.com/tensorflow/tensor2tensor) for short, is a library
-of deep learning models and datasets designed to [accelerate deep learning
-research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html) and make it more accessible.
-
-T2T is actively used and maintained by researchers and engineers within the
+of deep learning models and datasets designed to make deep learning more
+accessible and [accelerate ML
+research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).
+is actively used and maintained by researchers and engineers within the
 [Google Brain team](https://research.google.com/teams/brain/) and a community
 of users. We're eager to collaborate with you too, so feel free to
 [open an issue on GitHub](https://github.com/tensorflow/tensor2tensor/issues)
@@ -368,6 +368,7 @@ T2T](https://research.googleblog.com/2017/06/accelerating-deep-learning-research
 * [Discrete Autoencoders for Sequence Models](https://arxiv.org/abs/1801.09797)
 * [Generating Wikipedia by Summarizing Long
   Sequences](https://arxiv.org/abs/1801.10198)
-* [Image Transformer](https://openreview.net/forum?id=r16Vyf-0-)
+* [Image Transformer](https://arxiv.org/abs/1802.05751)
+* [Training Tips for the Transformer Model](http://ufallab.ms.mff.cuni.cz/~popel/training-tips-transformer.pdf)
 
 *Note: This is not an official Google product.*
tensor2tensor/data_generators/generator_utils.py: 4 changes (2 additions & 2 deletions)
@@ -493,14 +493,14 @@ def __init__(self, first_sequence, spacing=2):
     self._spacing = spacing
     self._ids = first_sequence[:]
     self._segmentation = [1] * len(first_sequence)
-    self._position = range(len(first_sequence))
+    self._position = list(range(len(first_sequence)))
 
   def add(self, ids):
     padding = [0] * self._spacing
     self._ids.extend(padding + ids)
     next_segment_num = self._segmentation[-1] + 1 if self._segmentation else 1
     self._segmentation.extend(padding + [next_segment_num] * len(ids))
-    self._position.extend(padding + range(len(ids)))
+    self._position.extend(padding + list(range(len(ids))))
 
   def can_fit(self, ids, packed_length):
     return len(self._ids) + self._spacing + len(ids) <= packed_length
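The two `list(range(...))` changes above are Python 3 compatibility fixes: under Python 3, `range` returns a lazy range object that cannot be concatenated to a list with `+`. A minimal sketch of the failure mode and the fix:

```python
padding = [0] * 2

# Python 2: range() returns a list, so `padding + range(3)` works.
# Python 3: range() returns a range object, and the same expression raises
#   TypeError: can only concatenate list (not "range") to list

# Wrapping in list() behaves identically under both versions:
positions = padding + list(range(3))
print(positions)  # [0, 0, 0, 1, 2]
```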
tensor2tensor/data_generators/problem.py: 17 changes (17 additions & 0 deletions)
Expand Up @@ -421,15 +421,32 @@ def get_hparams(self, model_hparams=None):
return self._hparams

def maybe_reverse_features(self, feature_map):
"""Reverse features between inputs and targets if the problem is '_rev'."""
if not self._was_reversed:
return
inputs, targets = feature_map["inputs"], feature_map["targets"]
feature_map["inputs"], feature_map["targets"] = targets, inputs
if "inputs_segmentation" in feature_map:
inputs_seg = feature_map["inputs_segmentation"]
targets_seg = feature_map["targets_segmentation"]
feature_map["inputs_segmentation"] = targets_seg
feature_map["targets_segmentation"] = inputs_seg
if "inputs_position" in feature_map:
inputs_pos = feature_map["inputs_position"]
targets_pos = feature_map["targets_position"]
feature_map["inputs_position"] = targets_pos
feature_map["targets_position"] = inputs_pos

def maybe_copy_features(self, feature_map):
if not self._was_copy:
return
feature_map["targets"] = feature_map["inputs"]
if ("inputs_segmentation" in feature_map and
"targets_segmentation" not in feature_map):
feature_map["targets_segmentation"] = feature_map["inputs_segmentation"]
if ("inputs_position" in feature_map and
"targets_position" not in feature_map):
feature_map["targets_position"] = feature_map["inputs_position"]

def dataset(self,
mode,
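To see what the new bookkeeping buys, here is a minimal standalone sketch (a toy dict of lists, not the actual `Problem` class) of the `'_rev'` swap; the point is that the packed-example features `inputs_segmentation`/`inputs_position` must travel with the tokens they annotate:

```python
def reverse_features(feature_map):
  """Standalone sketch of maybe_reverse_features for a '_rev' problem."""
  # Swap the token streams.
  feature_map["inputs"], feature_map["targets"] = (
      feature_map["targets"], feature_map["inputs"])
  # For packed examples, swap the bookkeeping features too; otherwise the
  # segmentation/position features no longer match the tokens they describe.
  for key in ("segmentation", "position"):
    in_key, tgt_key = "inputs_" + key, "targets_" + key
    if in_key in feature_map:
      feature_map[in_key], feature_map[tgt_key] = (
          feature_map[tgt_key], feature_map[in_key])

# Two packed sentence pairs: segmentation numbers the pairs, position restarts
# at 0 for each sentence, and 0 entries mark the spacing padding.
features = {
    "inputs": [3, 4, 0, 5, 6, 7],
    "targets": [8, 0, 9, 10],
    "inputs_segmentation": [1, 1, 0, 2, 2, 2],
    "targets_segmentation": [1, 0, 2, 2],
    "inputs_position": [0, 1, 0, 0, 1, 2],
    "targets_position": [0, 0, 0, 1],
}
reverse_features(features)
print(features["inputs"])           # [8, 0, 9, 10]
print(features["inputs_position"])  # [0, 0, 0, 1]
```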
tensor2tensor/data_generators/translate_enmk.py: 22 changes (18 additions & 4 deletions)
@@ -23,13 +23,10 @@
 
 from tensor2tensor.data_generators import problem
 from tensor2tensor.data_generators import text_encoder
+from tensor2tensor.data_generators import text_problems
 from tensor2tensor.data_generators import translate
 from tensor2tensor.utils import registry
 
-import tensorflow as tf
-
-FLAGS = tf.flags.FLAGS
-
 # End-of-sentence marker.
 EOS = text_encoder.EOS_ID
 
@@ -49,6 +46,10 @@
 ]]
 
 
+# See this PR on github for some results with Transformer on these Problems.
+# https://github.com/tensorflow/tensor2tensor/pull/626
+
+
 @registry.register_problem
 class TranslateEnmkSetimes32k(translate.TranslateProblem):
   """Problem spec for SETimes En-Mk translation."""
@@ -64,3 +65,16 @@ def vocab_filename(self):
   def source_data_files(self, dataset_split):
     train = dataset_split == problem.DatasetSplit.TRAIN
     return _ENMK_TRAIN_DATASETS if train else _ENMK_TEST_DATASETS
+
+
+@registry.register_problem
+class TranslateEnmkSetimesCharacters(translate.TranslateProblem):
+  """Problem spec for SETimes En-Mk translation."""
+
+  @property
+  def vocab_type(self):
+    return text_problems.VocabType.CHARACTER
+
+  def source_data_files(self, dataset_split):
+    train = dataset_split == problem.DatasetSplit.TRAIN
+    return _ENMK_TRAIN_DATASETS if train else _ENMK_TEST_DATASETS
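The registry derives problem names from class names (CamelCase to snake_case), so, assuming the usual convention, the two specs above register as `translate_enmk_setimes32k` and `translate_enmk_setimes_characters`; the `_rev` suffix handled by `maybe_reverse_features` above should then give the Mk-to-En direction for free. A hedged sketch:

```python
# Importing the module triggers the @registry.register_problem decorators.
from tensor2tensor.data_generators import translate_enmk  # pylint: disable=unused-import
from tensor2tensor.utils import registry

# Subword vs. character-level variants of the same SETimes data
# (names assumed from the registry's CamelCase -> snake_case convention).
subword = registry.problem("translate_enmk_setimes32k")
chars = registry.problem("translate_enmk_setimes_characters")

# The "_rev" suffix yields the reversed (Mk -> En) problem.
reversed_problem = registry.problem("translate_enmk_setimes32k_rev")
```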
tensor2tensor/data_generators/translate_envi.py (new file): 65 changes (65 additions & 0 deletions)
@@ -0,0 +1,65 @@
+# coding=utf-8
+# Copyright 2018 The Tensor2Tensor Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Data generators for En-Vi translation."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+# Dependency imports
+
+from tensor2tensor.data_generators import problem
+from tensor2tensor.data_generators import text_encoder
+from tensor2tensor.data_generators import translate
+from tensor2tensor.utils import registry
+
+# End-of-sentence marker.
+EOS = text_encoder.EOS_ID
+
+# For English-Vietnamese the IWSLT'15 corpus
+# from https://nlp.stanford.edu/projects/nmt/ is used.
+# The original dataset has 133K parallel sentences.
+_ENVI_TRAIN_DATASETS = [[
+    "https://github.com/stefan-it/nmt-en-vi/raw/master/data/train-en-vi.tgz",  # pylint: disable=line-too-long
+    ("train.en", "train.vi")
+]]
+
+# For development 1,553 parallel sentences are used.
+_ENVI_TEST_DATASETS = [[
+    "https://github.com/stefan-it/nmt-en-vi/raw/master/data/dev-2012-en-vi.tgz",  # pylint: disable=line-too-long
+    ("tst2012.en", "tst2012.vi")
+]]
+
+
+# See this PR on github for some results with Transformer on this Problem.
+# https://github.com/tensorflow/tensor2tensor/pull/611
+
+
+@registry.register_problem
+class TranslateEnviIwslt32k(translate.TranslateProblem):
+  """Problem spec for IWSLT'15 En-Vi translation."""
+
+  @property
+  def approx_vocab_size(self):
+    return 2**15  # 32768
+
+  @property
+  def vocab_filename(self):
+    return "vocab.envi.%d" % self.approx_vocab_size
+
+  def source_data_files(self, dataset_split):
+    train = dataset_split == problem.DatasetSplit.TRAIN
+    return _ENVI_TRAIN_DATASETS if train else _ENVI_TEST_DATASETS
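With the new file in place, the problem should be usable like any other registered T2T problem. A hedged sketch of generating the data and inspecting the spec (the problem name is assumed from the registry's CamelCase-to-snake_case convention, and `$DATA_DIR`/`$TMP_DIR` are placeholder paths):

```python
# Data generation would typically go through the t2t-datagen CLI, e.g.:
#   t2t-datagen --problem=translate_envi_iwslt32k \
#       --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR

# Importing the module triggers @registry.register_problem.
from tensor2tensor.data_generators import translate_envi  # pylint: disable=unused-import
from tensor2tensor.utils import registry

envi = registry.problem("translate_envi_iwslt32k")
print(envi.approx_vocab_size)  # 32768
print(envi.vocab_filename)     # vocab.envi.32768
```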