PyTorch is a Python-first deep learning framework open-sourced by the Torch7 team. It provides two high-level features: powerful GPU-accelerated Tensor computation (similar to numpy), and deep neural networks built on a tape-based autograd system.
Table of contents
 Breaking changes: removed reinforce()
 New features
 Unreduced losses
 A profiler for the autograd engine
 More functions support Higher order gradients
 New features in Optimizers
 New layers and nn functionality
 New Tensor functions and Features
 Other additions
 API changes
 Performance improvements
 Big reduction in framework overhead (helps small models)
 4x to 256x faster Softmax/LogSoftmax
 More...
 Framework Interoperability
 DLPack Interoperability
 Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
 Bug Fixes (a lot of them)
Breaking changes
Stochastic functions, i.e. Variable.reinforce(), were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid bookkeeping of sampled values. In practice, users were still bookkeeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.
We introduce the torch.distributions package to replace Stochastic functions.
Your previous code typically looked like this:
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()
This is the new equivalent code:
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = m.log_prob(action) * reward
loss.backward()
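For reference, here is a minimal self-contained sketch of the new torch.distributions API; the probabilities are dummy values standing in for a policy network's output:

```python
import torch
from torch.distributions import Categorical

# Dummy action probabilities standing in for policy_network(state)
probs = torch.tensor([0.25, 0.25, 0.5])
m = Categorical(probs)
action = m.sample()            # 0-dim LongTensor with value in {0, 1, 2}
log_prob = m.log_prob(action)  # differentiable log-probability of that action
```

Multiplying log_prob by a reward and calling backward() on the result reproduces the REINFORCE update above.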
New features
Unreduced losses
Now, some loss functions can compute per-sample losses in a minibatch.
 By default, PyTorch sums losses over the minibatch and returns a single scalar loss. This was limiting for users.
 Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the minibatch.
 Example: loss = nn.CrossEntropyLoss(..., reduce=False)
 Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
 More loss functions will be covered in the next release
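A small runnable sketch of per-sample losses with dummy data (in later releases reduce=False was replaced by reduction='none'):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)             # minibatch of 4 samples, 10 classes
targets = torch.randint(0, 10, (4,))

# reduce=False returns one loss per sample instead of a single scalar
criterion = nn.CrossEntropyLoss(reduce=False)
per_sample = criterion(logits, targets)
```

Averaging the per-sample losses recovers the default (reduced) scalar loss.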
An inbuilt Profiler in the autograd engine
We built a low-level profiler to help you identify bottlenecks in your models.
Let us start with an example:
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
>>> print(prof)
  
---------------------------------  ----------  ----------
Name                                 CPU time   CUDA time
---------------------------------  ----------  ----------
PowConstant                         142.036us     0.000us
N5torch8autograd9GraphRootE          63.524us     0.000us
PowConstantBackward                 184.228us     0.000us
MulConstant                          50.288us     0.000us
PowConstant                          28.439us     0.000us
Mul                                  20.154us     0.000us
N5torch8autograd14AccumulateGradE    13.790us     0.000us
N5torch8autograd5CloneE               4.088us     0.000us
The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your Python program with a special nvprof prefix. For example:
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>
# in python
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
Then, you can load trace_name.prof in PyTorch and print a summary profile report.
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
Read additional documentation here
Higher order gradients
Added higher-order gradients support for the following layers:
 ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
 PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
 MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
 DataParallel
Optimizers
 optim.SparseAdam: Implements a lazy version of the Adam algorithm suitable for sparse tensors.
 In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
 Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
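A quick sketch of add_param_group with two dummy parameters:

```python
import torch
from torch import optim

w1 = torch.randn(3, requires_grad=True)
w2 = torch.randn(3, requires_grad=True)

optimizer = optim.SGD([w1], lr=0.1)
# Add a second parameter group, with its own learning rate, after construction
optimizer.add_param_group({'params': [w2], 'lr': 0.01})
```

This is handy when unfreezing layers mid-training, since the new group can carry its own hyperparameters.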
New layers and nn functionality
 Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
 Added LPPool1d
 F.pad now has support for:
 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
 constant padding on n-d signals
 nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
 grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
 Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters
 parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
 vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters
 Convenient for some reinforcement learning algorithms, such as the cross-entropy method, TRPO, etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
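A minimal round-trip sketch of the two helpers on a toy layer:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = nn.Linear(2, 3)  # 2*3 weights + 3 biases = 9 parameters in total

flat = parameters_to_vector(net.parameters())  # 1D tensor of length 9

# Modify the flat vector, then copy the values back into the network
vector_to_parameters(torch.zeros_like(flat), net.parameters())
```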
 Allow user to not specify certain input dimensions for AdaptivePool*d and infer them at runtime. For example:
# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))
 DataParallel container on CPU is now a no-op (instead of erroring out)
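A runnable version of the snippet above, showing the inferred dimension:

```python
import torch
import torch.nn as nn

# Output height is None, so it is inferred from the input (stays 10);
# the output width is adapted to 7.
m = nn.AdaptiveMaxPool2d((None, 7))
out = m(torch.randn(1, 3, 10, 9))
```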
New Tensor functions and features
 Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
 Adds broadcasting support to bitwise operators
 Added Tensor.put_ and torch.take, similar to numpy.take and numpy.put.
 The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
 The put function copies values into a tensor, also using linear indices.
 Differences from the numpy equivalents:
 numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
 numpy.put repeats the values if necessary to make them as long as the indices. This behavior is not yet replicated.
 Add zeros and zeros_like for sparse Tensors.
 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
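A short sketch of take and put_ using linear (flattened) indices:

```python
import torch

src = torch.tensor([[1., 2.],
                    [3., 4.]])
idx = torch.tensor([0, 3])       # linear indices into the flattened src

taken = torch.take(src, idx)     # output has the same shape as idx

dst = torch.zeros(2, 2)
dst.put_(idx, torch.tensor([10., 40.]))  # writes at linear indices, in place
```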
Other additions
 Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)
 If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
 torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
 torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
API changes
 softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
 torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
 Remove all instances of device_id and replace it with device, to make things consistent
 torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This gets useful when using torch.autograd.grad in large graphs with lists of inputs / outputs. For example:
x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y]) # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
 pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
 Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
 torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
 Adds zero_() to Variable
 Variable.shape returns the size of the Tensor (now made consistent with Tensor)
 torch.version.cuda specifies the CUDA version that PyTorch was compiled with
 Add a missing function random_ for CUDA.
 torch.load and torch.save can now take a pathlib.Path object, which is a standard Python3 typed filepath object
 If you want to load a model's state_dict into another model (for example to fine-tune a pre-trained network), load_state_dict was strict on matching the key names of the parameters. Now we provide a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys.
 Added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag
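A small sketch of strict=False loading between two toy models that only share their first layer:

```python
import torch
import torch.nn as nn

pretrained = nn.Sequential(nn.Linear(4, 8))
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# strict=False loads the parameters whose keys match ('0.weight', '0.bias')
# and ignores the keys that exist only in one of the two models.
model.load_state_dict(pretrained.state_dict(), strict=False)
```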
Performance Improvements
 The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
 softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the gpu kernels
 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
 nn.Embedding's renorm option is much faster on the GPU. For embedding dimensions of 100k x 128 and a batch size of 1024, it is 33x faster.
 All pointwise ops now use OpenMP and get multi-core CPU benefits
 Added dedicated CUDA kernels for group convolutions where groups == nInputPlane (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details as well as this table.
 Fixed optim.SGD's memory usage for sparse gradients (for ex. nn.Embedding(..., sparse=True)), reducing the usage on a user-provided test script by 10x.
 Optional NNPack integration for faster CPU convolutions (not part of binaries)
 Reduce overhead of broadcasting if Tensors aren't broadcastable
 torch.nn.utils.weight_norm over the right-most dimensions is faster
 Backward of torch.norm is sped up by ~1.5x
 Improve the performance of pack_padded_sequence
 Add a single-argument version of torch.arange. For example: torch.arange(10)
Framework Interoperability
DLPack Interoperability
DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.to_dlpack(x) and torch.utils.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.
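A round-trip sketch; note that in current releases these helpers live under torch.utils.dlpack:

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

x = torch.arange(6, dtype=torch.float32)
capsule = to_dlpack(x)    # opaque DLPack capsule
y = from_dlpack(capsule)  # zero-copy: shares memory with x

y[0] = 42.0               # visible through x, since no copy was made
```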
Model exporter to ONNX
ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet and Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models.

The operations supported in this release are:
 add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
 expand (only when used before a broadcasting ONNX operator; e.g., add)
 prelu (single weight shared among input channels not supported)
 threshold (nonzero threshold/nonzero value not supported)
 Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
 elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
 unfold (experimental support with ATen-Caffe2 integration)
 Embedding (no optional arguments supported)
 RNN
 FeatureDropout (training mode not supported)
 Index (constant integer and tuple indices supported)
Usability Improvements
 More cogent error messages during indexing of Tensors / Variables
 Add proper error message for specifying dimension on a tensor with no dimensions
 Better error messages for Conv*d input shape checking
 More user-friendly error messages for LongTensor indexing
 Better error messages and argument checking for Conv*d routines
 Trying to construct a Tensor from a Variable fails more appropriately
 If you are using a PyTorch binary with insufficient CUDA version, then a warning is printed to the user.
 Fixed incoherent error messages in load_state_dict
 Fix error message for type mismatches with sparse tensors
Bug fixes
torch
 Fix CUDA lazy initialization to not trigger on calls to torch.manual_seed (instead, the calls are queued and run when CUDA is initialized)
Tensor
 If x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]]
 x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
 Tensor constructors with numpy input: torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))
 torch will now copy the contents of the array in a storage of appropriate type.
 If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
 On CUDA, torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)) will now work by making a copy.
 ones_like and zeros_like now create Tensors on the same device as the original Tensor
 torch.multinomial on the CPU would reshape the input prob_dist in-place. Fixed this to make sure the prob_dist input's shape is unchanged after the call to multinomial
 expand and expand_as allow expanding an empty Tensor to another empty Tensor
 When [..., None, ...] was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
 Fix exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
 torch.HalfTensor supports numpy() and torch.from_numpy
 Add additional size checking for torch.scatter
 Fix torch.tril and torch.triu on the GPU for storage-offset Tensors (would return incorrect result).
 Fix a memory leak in CUDA qr decomposition
 Fix stream-awareness issues in THCUNN kernels
 Fix kwargs parsing in torch.topk
 Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
 Fix ZeroDivisionError: float division by zero when printing certain Tensors
 torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
 Add a check in tensor.numpy() that checks if no positional arguments are passed
 Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is contiguous
 any and all work on empty Tensors on the cpu (previously errored out)
 Fix symeig on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
 Improved the numerical stability of torch.var and torch.std by using Welford's algorithm
 The Random Number Generator returned uniform samples with inconsistent bounds (inconsistency in cpu implementation and running into a cublas bug). Now, all uniform sampled numbers will return within the bounds [0, 1), across all types and devices
 Fix torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
 Allows empty index Tensor for index_select (instead of erroring out)
 Previously when eigenvector=False, symeig returns some unknown value for the eigenvectors. Now we zero them out.
sparse
 Fix bug with 'coalesced' calculation in sparse 'cadd'
 Fixes .type() not converting indices tensor.
 Fixes sparse tensor coalesce on the GPU in corner cases
autograd
 Fixed crashes when calling backwards on leaf variable with requires_grad=False
 Fix bug on Variable type() around non-default GPU input.
 When torch.norm returned 0.0, the gradient was NaN. We now use the subgradient at 0.0, so the gradient is 0.0.
 Fix a correctness issue with advanced indexing and higher-order gradients
 torch.prod's backward was failing on the GPU due to a type error, fixed.
 Advanced Indexing on Variables now allows the index to be a LongTensor backed Variable
 Variable.cuda() and Tensor.cuda() are consistent in kwargs options
optim
 torch.optim.lr_scheduler is now imported by default.
nn
 Returning a dictionary from a nn.Module's forward function is now supported (used to throw an error)
 When register_buffer("foo", ...) is called, and self.foo already exists, then instead of silently failing, now raises a KeyError
 Fixed loading of older checkpoints of RNN/LSTM which were missing _data_ptrs attributes.
 nn.Embedding had a hard error when using the max_norm option. This is fixed now.
 When using the max_norm option, the passed-in indices are written upon (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
 F.affine_grid now can take non-contiguous inputs
 EmbeddingBag can accept both 1D and 2D inputs now.
 Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
 Fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
 If BatchNorm has only 1 value per channel in total, raise an error in training mode.
 Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
 Fix grid_sample backward when gradOutput is a zero-strided Tensor
 Fix a segmentation fault when reflection padding is out of Tensor bounds.
 If LogSoftmax has only 1 element, -inf was returned. Now this correctly returns 0.0
 Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
 Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
 Fixed ELU higher order gradients when applied in-place
 Workaround a CuDNN RNN bug for half-precision
 Prevent numerical issues with poisson_nll_loss when log_input=False by adding a small epsilon
distributed and multi-gpu
 Allow kwargs-only inputs to DataParallel. This used to fail: n = nn.DataParallel(Net()); out = n(input=i)
 DistributedDataParallel calculates num_samples correctly in python2
 Fix the case of DistributedDataParallel when 1-GPU-per-process is used.
 Allow some params to be requires_grad=False in DistributedDataParallel
 Fixed DataParallel to specify GPUs that don't include GPU0
 DistributedDataParallel's exit doesn't error out anymore, the daemon flag is set.
 Fix a bug in DistributedDataParallel in the case when model has no buffers (previously raised incoherent error)
 Fix __get_state__ to be functional in DistributedDataParallel (was returning nothing)
 Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other
Others
 model.zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib
 Fix error when default_collate is passed a collection of numpy.str_
Downloads
Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
Package documentation for this release is available at http://pytorch.org/docs/0.2.0/
We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.
Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.
Table of contents:
 Tensor Broadcasting (numpy-style)
 Advanced Indexing for Tensors and Variables
 Higher-order gradients
 Distributed PyTorch (multi-node training, etc.)
 Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
 New in torch and autograd: matmul, inverse, etc.
 Easier debugging, better error messages
 Bug Fixes
 Important Breakages and Workarounds
Tensor Broadcasting (numpystyle)
In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).
PyTorch Broadcasting semantics closely follow numpystyle broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.
General Semantics
Two tensors are “broadcastable” if the following rules hold:
 Each tensor has at least one dimension.
 When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
For Example:
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:
 If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
 Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.
For Example:
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at nonsingleton dimension 1
More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.
Advanced Indexing for Tensors and Variables
PyTorch now supports a subset of NumPy-style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.
Let's look at some examples:
x = torch.Tensor(5, 5, 5)

# Pure Integer Array Indexing - specify arbitrary indices at each dimension
x[[1, 2], [3, 2], [1, 0]]
# --> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

# also supports broadcasting, duplicates
x[[2, 3, 2], [0], [1]]
# --> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

# arbitrary indexer shapes allowed
x[[[1, 0], [0, 1]], [0], [1]].shape
# --> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
#                          [x[0][0][1], x[1][0][1]]]

# can use colon, ellipse
x[[0, 3], :, :]
x[[0, 3], ...]
# --> both yield a 2x5x5 Tensor [x[0], x[3]]

# also use Tensors to index!
y = torch.LongTensor([0, 2, 4])
x[y, :, :]
# --> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

# selection with less than ndim, note the use of comma
x[[1, 3], ]
# --> yields a 2x5x5 Tensor [x[1], x[3]]
Higher order gradients
Now you can evaluate higher order differentials in PyTorch. For example, you can compute HessianVector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.
In the 0.2 release, we've enabled the ability to compute higher-order gradients for all of torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.
Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.
import torch
from torchvision.models import resnet18
from torch.autograd import Variable
model = resnet18().cuda()
# dummy inputs for the example
input = Variable(torch.randn(2,3,224,224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())
# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)
grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes
# It instead returns the gradients as Variable tuples.
# now compute the 2norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()
We see two new concepts here:
 torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients.
 You can operate on the gradients, and call backward() on them.
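A tiny self-contained illustration of double backward, computing d²y/dx² for y = x³ at x = 2 (both derivatives evaluate to 12):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([2.0]), requires_grad=True)
y = (x ** 3).sum()

# First derivative: dy/dx = 3x^2; create_graph=True keeps the graph alive
# so the gradient itself can be differentiated.
(g,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative: d2y/dx2 = 6x
(g2,) = torch.autograd.grad(g.sum(), x)
```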
The list of nn layers that support higher order gradients are:
 AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d,2d, Linear, Bilinear
 pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
 ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
 L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d
To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.
Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.
Distributed PyTorch
We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger minibatches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).
For each of the machines to first identify each other and assign unique numbers to each other (ranks), we provide simple initialization methods:
 shared file system (requires that all processes can access a single file system)
 IP multicast (requires that all processes are in the same network)
 environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)
Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:
import torch.distributed as dist
dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)

print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))
This would print Hello from process 2 (out of 4)! on the 3rd machine.
World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process a tensor should be sent.
Here's a snippet that shows how simple pointtopoint communication can be performed:
# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    x.normal_()
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)
Asynchronous p2p functions (isend
, irecv
) are available too.
However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using send
/recv
. One example is all_reduce
:
x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)
The distributed package is fairly low-level, so it allows you to implement more advanced algorithms and tailor the code to very specific purposes, but data-parallel training is such a common use case that we have created high-level helpers for it.
Hence, we've introduced DistributedDataParallel, which is meant to be a nearly drop-in replacement for nn.DataParallel.
Here's a code snippet demonstrating changes necessary to add it to existing training code:
# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())
# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=args.workers,
    pin_memory=True, sampler=train_sampler)

for epoch in range(args.num_epochs):
    # Use the .set_epoch() method to reshuffle the dataset partition at every epoch
    train_sampler.set_epoch(epoch)
    # training loop
    ...
You can see a fuller Imagenet training example here
New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
New features
 forward_pre_hook is introduced to execute userspecified closures right before a forward function is called.
 Convenient access to non-leaf gradients: Currently, to access and inspect gradients of intermediate values, we have to use hooks. This is not convenient for doing simple inspections. Hence, we introduce retain_grad. It is best explained via an example:
input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()
h1.retain_grad()
out.backward()
print(h1.grad)
# without calling retain_grad(), h1.grad is None
 DataParallel now supports dicts as inputs
New Layers
- Spatial Transformer Networks via `F.grid_sample` and `F.affine_grid`
- `nn.SELU` and `nn.AlphaDropout` are introduced, from the paper: Self-Normalizing Neural Networks
- `nn.GLU` (Gated Linear Unit) is introduced, from the paper Convolutional Sequence to Sequence Learning
- Weight Normalization is now implemented via `torch.nn.utils.weight_norm`
- You can now ignore specific target indices while computing `cross_entropy_loss` and `nll_loss` using the `ignore_index` argument. This is a cheap and useful way of implementing masking, where you can have a `mask` index that is ignored when computing the loss.
- `F.normalize` implements dimension-wise renormalization
- `F.upsample` and `nn.Upsample` consolidate multiple upsampling layers into one function, implementing 2d and 3d bilinear/trilinear/nearest upsampling.
- `nn.EmbeddingBag`: when building bag-of-words models, doing an `Embedding` followed by `Sum` or `Mean` is common. For variable-length sequences, computing bags of embeddings involves masking. We provide a single `nn.EmbeddingBag`, which is much more efficient and faster for computing bags of embeddings, especially for variable-length sequences.
- Numerically stable binary cross-entropy loss via `bce_with_logits`
- A negative log-likelihood loss with Poisson distribution of the target, via `PoissonNLLLoss`
- `cosine_similarity`: returns the cosine similarity between x1 and x2, computed along dim.
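The `nn.EmbeddingBag` pattern can be sketched as follows (a minimal example; the indices and offsets below are illustrative, and the plain tensor API is shown rather than the `Variable` wrappers of this release):

```python
import torch
import torch.nn as nn

# Two variable-length "bags" flattened into a single index tensor;
# offsets marks where each bag begins.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='sum')
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])  # bag 0 = indices[0:4], bag 1 = indices[4:8]

out = bag(indices, offsets)  # one summed embedding per bag
print(out.shape)  # torch.Size([2, 3])
```

This produces the same result as looking up each bag's embeddings and summing them, without materializing the intermediate per-token embeddings.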
training utilities
Learning Rate Schedulers: torch.optim.lr_scheduler provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.
There are various strategies provided, which can be used depending on the appropriate situation, more can be read in the package docs:
 ReduceLROnPlateau, LambdaLR, StepLR, MultiStepLR, ExponentialLR
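For instance, StepLR decays the learning rate by a factor of gamma every step_size epochs. A minimal sketch (the Linear model and the numbers are placeholders):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# multiply the learning rate by 0.1 every 30 epochs
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(45):
    # train(...) / validate(...) would go here
    scheduler.step()

# after 45 epochs the lr has decayed once: 0.1 -> 0.01
print(optimizer.param_groups[0]['lr'])
```

ReduceLROnPlateau works differently: instead of a fixed schedule, you pass it a validation metric each epoch and it decays the learning rate when the metric stops improving.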
- `ConcatDataset`: a convenient dataset meta-class that can merge and concatenate two individual datasets.
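A minimal sketch of ConcatDataset (the tensor datasets here are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset

a = TensorDataset(torch.zeros(5, 3))
b = TensorDataset(torch.ones(7, 3))

combined = ConcatDataset([a, b])
print(len(combined))    # 12: indices 0-4 hit `a`, 5-11 hit `b`
sample, = combined[6]   # transparently indexes into the second dataset
print(sample)           # a row of ones
```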
New in torch and autograd
- All reduce functions such as `sum` and `mean` now default to squeezing the reduced dimension. For example, `torch.sum(torch.randn(10, 20))` returns a 1D Tensor.
- `x.shape`, similar to numpy. A convenience property equivalent to `x.size()`
- `torch.matmul`, similar to np.matmul
- Bitwise and, or, xor, lshift, rshift
- Autograd support for `inverse`, `gesv`, `cumprod`, `atan2`
- Unbiased `var` and `std` are now available via a keyword argument
- `torch.scatter_add`: like `torch.scatter`, except that when duplicate indices are encountered, the values are summed.
- `torch.median` behaves similarly to `torch.sum` when no arguments are given, i.e. it reduces all dimensions and returns a single median value of the flattened Tensor.
- `masked_copy_` has been renamed to `masked_scatter_` (with a deprecation on `masked_copy_`)
- `torch.manual_seed` now seeds all CUDA devices as well
- You can now specify the random number generator object via keyword arguments: `torch.rand(1000, generator=gen)`
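To illustrate the duplicate-index behavior of scatter_add, a minimal sketch:

```python
import torch

dest = torch.zeros(5)
index = torch.tensor([0, 0, 2])          # index 0 appears twice
src = torch.tensor([1.0, 2.0, 3.0])

# plain scatter would let one of the writes to index 0 win;
# scatter_add_ accumulates them instead: dest[0] == 1.0 + 2.0
dest.scatter_add_(0, index, src)
print(dest)  # tensor([3., 0., 3., 0., 0.])
```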
Bugfixes and small improvements
- Now we emit an error when a Variable is converted to a bool. For example:
b = Variable(torch.zeros(1))
if b[0]: # errors now
- Fix correctness bugs in qr decomposition on CUDA.
- Support for IBM PowerPC64 platform
- Check that the CuDNN version at compile-time is the same version at runtime.
- Improve error message in CUDA forked subprocess
- Faster transposed-copy on CPU
- Improve error messages in InstanceNorm
- Add more argument checking for various routines, especially BatchNorm and Convolution routines.
- Better error messages around shape reporting across the CPU backend.
- Support more than 8 GPUs per machine (works around a CUDA p2p restriction)
- Improve error message when accessing attributes that don't exist
- t() of Variable is now consistent with Tensor
- Prevent divide-by-zero when dropout p=1
- Fix sharing of CUDA tensors on non-current devices
- When BatchNorm epsilon < the allowed CuDNN value, fall back to THNN
- Fix thread-trashing when using different numbers of threads for MKL and OMP
- Improve memory usage when using CuDNN RNN
- Fix ZeroPad2d backwards with negative padding
- Add a dummy tensor.data property, to provide an interpretable error message to users
- Fix in-place division for Python 3
- Raise an error when calling from_numpy on a 0-dim array
- Empty Tensors don't error out when shared across multiprocessing
- Fix baddbmm for expanded tensors
- Let parallel_apply accept arbitrary inputs
- Keyword arguments in Tensor and Variable are now consistent
- Fix torch.inverse when Magma is not available
- Add a logical not operator for ByteTensor
- Add device asserts in scatter/gather kernels
Important Breakages and Workarounds
As you've read, we've introduced two important changes that are not
backward compatible:
- Numpy-style broadcasting
- Reduction functions such as `sum(1)` now default to `keepdim=False`
We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.
tl;dr
Here is a code snippet that you can add to the top of your scripts.
Adding this code will generate warnings highlighting incompatible code.
Fix your code to no longer generate warnings.
# insert this to the top of your scripts (usually main.py)
import sys, warnings, traceback, torch
def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))
warnings.showwarning = warn_with_traceback
warnings.simplefilter('always', UserWarning)
torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True
Once all warnings disappear, you can remove the code snippet.
More elaborately
Now, let us see the three incompatible changes with examples.
Using the (now deprecated) 1-dimensional view pointwise function
Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal. The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The "1-dimensional" pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable but have the same number of elements.
For example:
>>> torch.add(torch.ones(4), torch.ones(2,2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements. Falling back to deprecated pointwise behavior.
2
2
2
2
[torch.FloatTensor of size 4]
Broadcasting in code where it didn't happen before
The introduction of broadcasting can cause backwards incompatible changes in the case where two tensors do not have the same shape,
but are broadcastable and have the same number of elements.
For example:
>>> torch.add(torch.ones(4,1), torch.randn(4))
would previously produce a Tensor with size torch.Size([4,1]), but now produces a Tensor with size torch.Size([4,4]).
In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set torch.utils.backcompat.broadcast_warning.enabled to True, which will generate a Python warning in such cases.
For Example:
>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.
Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.
KeepDim=False for Reduction Functions
To get a warning when using a dimensional reduction function with the default keepdim argument, set torch.utils.backcompat.keepdim_warning.enabled to True. For example:
>>> torch.sum(torch.ones(2,3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value for keepdim which has changed default to False. Consider passing as kwarg.
3
3
[torch.FloatTensor of size 2]
As with torch.utils.backcompat.broadcast_warning.enabled, this warning can trigger from valid code, so you most likely want to disable it after migrating your code.
Note also that using keepdim=False can cause your existing code to "just work" with broadcasting. For example:
# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=True))
5 5 5 5
5 5 5 5
5 5 5 5
5 5 5 5
[torch.FloatTensor of size 4x4]
# new behavior with keepdim=False is equivalent to nonbroadcasted result
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=False))
5
5
5
5
[torch.FloatTensor of size 4]
API Changes
- `torch.range` is deprecated in favor of `torch.arange`, which is consistent with numpy and Python's range.
- On sparse Tensors, `contiguous` is renamed to `coalesce`, and `coalesce` is now made out-of-place. (A reminder that the sparse API is still experimental and evolving, so we don't provide backward compatibility.)
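The endpoint behavior is the visible difference between the two: `torch.arange` excludes the upper bound, matching Python's range, while the deprecated `torch.range` included it. A quick sketch:

```python
import torch

# arange follows range() semantics: the upper bound is excluded
print(torch.arange(0, 5).tolist())   # [0, 1, 2, 3, 4]
# the deprecated torch.range(0, 5) instead produced 0, 1, 2, 3, 4, 5
```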
New Features
New layers and functions
- `torch.topk` is now supported for all CUDA types, not just `torch.cuda.FloatTensor`.
- Added a three-way ranking loss: `nn.TripletMarginLoss`
- Added per-instance normalization layers: `nn.InstanceNorm1d`, `nn.InstanceNorm2d`, `nn.InstanceNorm3d`. Each channel is treated as an instance to normalize, and mean-subtraction and std-division is done. This is useful when dealing with larger images and smaller mini-batches where BatchNorm-like effects are desired.
- `nn.ZeroPad2d` and `nn.ConstantPad2d` are added.
- `nn.Bilinear` is added, which computes `Y = X1 * W * X2 + b`
Negative dimension support for all functions
Every single function that took a dimension argument will also allow taking negative dimensions.
A negative dimension will index the tensor from the last dimension.
For example:
x = torch.randn(10, 20, 30)
y = torch.mean(x, dim=-1)
Here, since x has 3 dimensions and dim=-1, the last dimension, i.e. dim=2, is picked for taking the mean.
The functions with dimension arguments are:
narrow, transpose, size, cat, chunk, gather, index_select, split, squeeze,
stack, unbind, unsqueeze, cumprod, cumsum, mean, median, mode, norm, prod, std,
sum, var, kthvalue, max, min, sort, topk, renorm,
index_add, index_copy, index_fill, scatter, select, unfold
CUDA support for Sparse Tensors, faster CPU sparse
Now a part of the torch.sparse API is also supported for torch.cuda.sparse.*Tensor.
Functions that are supported on CUDA are:
sparse_mask, to_dense, coalesce, transpose, spaddmm
spcadd, mul, div, cadd, csub, cmul
nn.Embedding now supports sparse gradients even on CUDA (with the sparse=True flag), leveraging these sparse functions.
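With the sparse=True flag, the embedding's weight gradient comes back as a sparse tensor touching only the looked-up rows. A minimal sketch on CPU (the same flag applies on CUDA per the note above; shown with the plain tensor API rather than this release's Variable wrappers):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 16, sparse=True)
ids = torch.tensor([[3, 7, 7]])

emb(ids).sum().backward()

grad = emb.weight.grad
print(grad.is_sparse)  # True: only rows 3 and 7 carry gradient
print(sorted(set(grad.coalesce().indices()[0].tolist())))  # [3, 7]
```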
A new hybrid matrix-multiply operation, hspmm, multiplies a sparse matrix with a dense matrix and returns a matrix in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension).
Several of the CPU sparse functions have more efficient implementations.
In a quickly hacked up Embedding classifier training script by @martinraison we see CUDA sparse performing as well as CUDA dense:
https://gist.github.com/martinraison/1e7c18c6f6eda87f1cb4995b0e6a22a5
Times in seconds per batch:

|        | CPU  | CUDA |
|--------|------|------|
| Dense  | 10   | 0.86 |
| Sparse | 0.15 | 0.13 |
named_parameters to filter out specific parameter types
Let's say that you want to add weight decay to all parameters of your model except for the biases. How do you get only the biases of your model?
We introduce nn.Module.named_parameters for this.
It joins named_children and named_modules in helping you filter specific attributes of models.
Example of filtering out the biases of a model and giving them a weight_decay of 0:
import torch
import torch.nn as nn
import torch.optim as optim
m = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 20),
    nn.ReLU(),
)

weights, biases = [], []
for name, p in m.named_parameters():
    if 'bias' in name:
        biases += [p]
    else:
        weights += [p]

optim.SGD([
    {'params': weights},
    {'params': biases, 'weight_decay': 0}
], lr=1e-2, momentum=0.9, weight_decay=1e-5)
Performance Improvements
- `cumsum` and `cumprod` have been made significantly faster on the GPU by using thrust primitives where appropriate.
- `LSTMCell` and `GRUCell` are now significantly faster on the GPU via a fused kernel.
- The default algorithm for CuDNN has been changed to `PRECOMP_GEMM`, a much faster algorithm that takes a tiny bit of workspace. Previously it was `IMPLICIT_GEMM`, which took zero workspace but was significantly slower.
- 5% to 10% improvement in the data loader by collating batches directly into shared memory.
- SVD is now computed on the GPU via divide-and-conquer (sgesdd), which gives a 2x to 5x speedup.
- The commonly used function `expand` has been moved to C, for better performance in smaller models.
Bug Fixes
- Added contiguous checks on weight and bias for a large range of THNN functions
- Make the range of `random_` correct when both lower and upper bound are specified
- `parallel_apply` can now take arguments that are unhashable
- Reshape `grad` correctly in the Dot function (inputs don't have to be 1D vectors...)
- Added `Variable.type_as`
- Unify argument names of `norm` and `renorm` to have `p=norm_type, dim=dim`
- `btrisolve` works on CPU doubles
- ipython autocomplete for torch.nn.Module fixed via implementing `__dir__`
- `device_ids` can now be `None` again in `F.data_parallel` and will use all available GPUs
- Work around cudnn bugs in BatchNorm (<5.1.10) and Dilation (6.0.20)
- Padding bugfix in Conv1d CPU
- `remainder` and `cremainder` are fixed for integer types
- Fix memory leak in `btrisolve` and `getri`
- If an nn.Module's source can't be retrieved because of any exception, make serialization non-fatal
- `collate_fn` now retains the type of the numpy array
- `is_tensor` and `is_storage` are now fixed for old-style Python classes
- `torch.cat` now supports keyword arguments
- CUDA collectives supported coalescing, but the inputs were all assumed to be of the same Tensor type. This is fixed.
- Fix a deadlock bug in autograd caused by an underlying glibc bug in specific linux distros (ArchLinux in particular)
- `abs` is now fixed for `char` and `short` cuda types
- Fix `torch.diag` autograd when given a dimension argument
- Fix grouped convolution on CPU when `bias=False`
- Expose `dilated` convolutions for `ConvTranspose*d`
- Fix a bug in `HingeEmbeddingLoss` where `margin` can now be specified via kwargs
Improved error messages
- Fix errors and messages when no CUDA devices are available.
Bug fixes and performance improvements
soumith released this
Feb 14, 2018
Assets
Binaries
As always, links to our binaries are on http://pytorch.org
New features
- `reduce` argument to `PoissonNLLLoss`, to be able to compute unreduced losses #3770
- Allow `target.requires_grad=True` in `l1_loss` and `mse_loss` (compute loss wrt `target`) #3876
- Add `random_split`, which randomly splits a dataset into non-overlapping new datasets of given lengths #4435
- Allow `map_location` in `torch.load` to be a string, such as `map_location='cpu'` or `map_location='cuda:2'` #4203
Bug Fixes
Data Loader / Datasets / Multiprocessing
- Add a `timeout` option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
- Workers are now properly seeded after the `fork` syscall. Each worker will have its RNG seed set to `base_seed + worker_id`, where `base_seed` is a random int64 value generated by the parent process. You may use `torch.initial_seed()` to access this value in `worker_init_fn`, which can be used to set other seeds (e.g. NumPy) before data loading. `worker_init_fn` is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading. #4018
- Fix the `ConcatDataset.cumulative_sizes` attribute name #3534
CUDA / CuDNN
- The `enabled` argument in `torch.autograd.profiler.emit_nvtx` was being ignored #4032
- `torch.dot` #3660
CPU
torch operators
- Fix `tensor.repeat` when the underlying storage is not owned by `torch` (for example, coming from numpy) #4084
- `linspace` op, fix 4419. #4470
nn layers
Fix cosine_similarity's output shape #3811
MultiGPU
core
others
Performance improvements
Documentation and UX Improvements