Fastor - A light-weight high performance SIMD optimised tensor algebra framework in C++11/14/17

Fastor V0.6.2 is another incremental change over the V0.6 release, which introduced a significant overhaul in Fastor's internal design and exposed API. This release includes:
- SIMD support for complex numbers and complex valued arithmetic, starting from SSE2 all the way to AVX512. The SIMD implementation for complex numbers is written with optimisation and specifically FMA in mind, and it delivers performance similar to Intel's MKL JIT for complex matrix-matrix multiplication and so on. Comprehensive unit tests are added for SIMD complex valued arithmetic
- `conj` function introduced for computing the conjugate of a complex valued tensor
- `arg` function introduced for computing the argument or phase angle of a complex valued tensor
- `ctranspose` and `ctrans` functions introduced for computing the conjugate transpose of a complex valued tensor
- All boolean tensor methods such as `isequal`, `issymmetric` etc. are now implemented as free functions working on tensor expressions instead of tensors. There is no longer an underscore in the names of these functions, that is, the `is_equal` method of the tensor is now the `isequal` free function working on expressions
- Performance optimisations for creating tensors of tensors (such as `Tensor<Tensor<double,3,3>,2,2>`) or tensors of any non-primitive types (such as `Tensor<std::vector<double>,2,2>`). The `matmul` and `tmatmul` functions have been specifically tuned to work well with such composite types
- Fix an issue in `tmatmul` that was causing a compilation error on Windows with MSVC 2019
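The FMA-centred complex arithmetic mentioned above can be sketched element-wise. This is an illustrative scalar version only, not Fastor's actual kernel; Fastor applies the same decomposition across SSE2/AVX/AVX512 lanes:

```cpp
#include <cmath>
#include <complex>

// How one complex multiply decomposes into fused multiply-adds:
// (a+bi)(c+di) = (ac - bd) + (ad + bc)i, each part one mul plus one fma.
inline std::complex<double> cmul_fma(std::complex<double> x, std::complex<double> y) {
    const double re = std::fma(x.real(), y.real(), -x.imag() * y.imag());
    const double im = std::fma(x.real(), y.imag(),  x.imag() * y.real());
    return {re, im};
}
```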
romeric released this
Fastor V0.6.1 is an incremental change over the V0.6 release, which introduced a significant overhaul in Fastor's internal design and exposed API. This change includes:
- `lu` function introduced for LU decomposition of 2D tensors. Multiple variants of LU decomposition are available, including no pivoting, partial pivoting with a permutation vector and partial pivoting with a permutation matrix. This is perhaps the most performant implementation of the LU decomposition available today for small matrices of up to `64x64`. If no pivoting is used the performance is unbeaten for all sizes up to the stack limit; however, given that the implementation is based on compile time loop recursion for sizes up to `32x32`, and beyond that uses block recursion which in turn uses block-triangular inversion, compilation would be quite time consuming for bigger sizes
- `ut_inverse` and `lut_inverse` introduced for fast triangular inversion of upper and unit lower matrices using block-wise inversion
- `tmatmul` function, equivalent to BLAS's `TRMM`, for triangular matrix-matrix (or matrix-vector) multiplication, which allows either or both operands to be upper/lower triangular. The function can be used to specify which matrix is lower/upper at compile time, like `tmatmul<matrix_type::lower_tri,matrix_type::general>(A,B)`. A proper 2X speed up over `matmul` when one operand is triangular, and 4X when both are triangular, can be achieved for bigger sizes
- `det`/`determinant` can now be computed for all sizes using the LU decomposition [default for matrix sizes bigger than `4x4`]
- `inv`/`inverse` and `solve` can be performed with any variant of the LU decomposition
- There is now a unified interface for choosing the computation type of linear algebra functions, for instance `det<DetCompType::BlockLU>(A)`, `inv<InvCompType::SimpleLUPiv>(A)` or `solve<SolveCompType::BlockLUPiv>` etc.
- `tril`/`triu` functions added for getting the lower/upper part of a 2D tensor
- Comprehensive unit tests and benchmarks are added and are available for these newly added (and some old) routines
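To illustrate what the no-pivoting variant of `lu` computes, here is a minimal Doolittle-style sketch for a fixed-size matrix. This is a plain scalar reference, not Fastor's implementation, which unrolls the recursion at compile time and switches to block recursion for larger sizes:

```cpp
#include <array>
#include <cstddef>

// Doolittle LU without pivoting: A = L * U with unit-diagonal L.
template<std::size_t N>
void lu_nopivot(const std::array<std::array<double, N>, N>& A,
                std::array<std::array<double, N>, N>& L,
                std::array<std::array<double, N>, N>& U) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            L[i][j] = (i == j) ? 1.0 : 0.0;
            U[i][j] = 0.0;
        }
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t j = k; j < N; ++j) {      // row k of U
            double s = 0.0;
            for (std::size_t p = 0; p < k; ++p) s += L[k][p] * U[p][j];
            U[k][j] = A[k][j] - s;
        }
        for (std::size_t i = k + 1; i < N; ++i) {  // column k of L
            double s = 0.0;
            for (std::size_t p = 0; p < k; ++p) s += L[i][p] * U[p][k];
            L[i][k] = (A[i][k] - s) / U[k][k];     // breaks if U[k][k]==0: hence pivoting variants
        }
    }
}
```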
Fastor V0.6 is a major release that brings a lot of fundamental internal redesign and performance improvements. This is perhaps the biggest release since the inception of the Fastor project. The following lists the changes and new features in this version:
- The whole of Fastor's expression template engine has been reworked to facilitate arbitrary re-structuring of expressions. Most users will not notice this change as it pertains to internal re-architecting, but the change is quite significant. All linear algebra expressions can now chain with element-wise operations.
- A series of linear algebra expressions with less verbose names is introduced as a result, and the other existing linear algebra routines are now grouped under linear algebra. This lays out the basic building blocks of Fastor's tensor algebra library
- Multiplication operator `%` introduced that evaluates lazily and takes any expression
- Greedy-like matmul implemented. Operations like `A%B%C%D%...` will be evaluated in the most efficient order
- `inv` function introduced for lazy inversion. Extremely fast matrix inversion up to stack size `256x256`
- `trans` function introduced for lazy transpose. Extremely fast AVX512 8x8 double and 16x16 float transpose using explicit SIMD introduced
- `det` function introduced for lazy determinant
- `solve` function introduced for lazy solve. `solve` has the behaviour that if both inputs are `Tensor`s it evaluates immediately, and if either input is an expression it delays the evaluation. `solve` is now also able to solve matrices for up to stack size `256x256`
- `qr` function introduced for QR factorisation using modified Gram-Schmidt factorisation, which has the potential to be easily SIMD vectorised in the future. The scalar implementation at the moment has good performance
- `absdet` and `logdet` functions introduced for lazy computation of the absolute value and natural logarithm of a determinant
- `determinant`, `matmul`, `transpose` and most verbose linear algebra functions can now take expressions but evaluate immediately
- `einsum`, `contraction`, `inner`, `outer`, `permutation`, `cross`, `sum` and `product` now all work on expressions. `einsum`/`contraction` for expressions also dispatches to the same operation minimisation algorithms that the non-expression version does, hence the above set of new functions is as fast for expressions as for tensor types. A `cross` function for the classic cross product of vectors is introduced as well
- Most linear algebra operations like `qr`, `det`, `solve` take optional parameters (class enums) to request the type of computation, for instance `det<DetCompType::Simple>`, `qr<QRCompType::MGSR>` etc.
- MKL (JIT) backend introduced which can be used in the same way as libxsmm
- The backend `_matmul` routines are reworked and specifically tuned for AVX512, and `_matmul_mk_smalln` is cleaned up and made uniform for up to `5::SIMDVector::Size`. Most matmul routines are now available at SSE2 level when it makes sense. `matmul` is now as fast as the dedicated MKL JIT API
- AVX512 `SIMDVector` for `int32_t` and `int64_t` introduced. `SIMDVector` for `int32_t` and `int64_t` is now activated at SSE2 level
- Most intrinsics are now activated at SSE2 level
- All views are now reworked so there is no longer a need for the `FASTOR_USE_VECTORISED_EXPR_ASSIGN` macro, unless one wants to vectorise strided views
- Multi-dimensional `TensorFixedViews` introduced. This makes it possible to create arbitrary dimensional tensor views with compile time deducible sizes. This together with dynamic views completes the whole of Fastor's view expressions
- `diag` function introduced for viewing the diagonal elements of 2D tensors; it works just like other views in that it can appear on either side of an equation (can be assigned to)
- Major bug fix for in-place division of all expressions by integral numbers
- A lot of new features, traits and internal development tools
- As a result Fastor now requires a C++14 supporting compiler
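The benefit of greedy ordering for chains like `A%B%C` can be seen with a toy cost model. This is only an illustration of why parenthesisation matters (counting scalar multiplies), not Fastor's internal cost model:

```cpp
// For A (m x k), B (k x n), C (n x p), the two parenthesisations of
// A % B % C cost different numbers of scalar multiplies.
inline long long cost_left(long long m, long long k, long long n, long long p) {
    return m * k * n + m * n * p;   // (A*B) first, then (AB)*C
}
inline long long cost_right(long long m, long long k, long long n, long long p) {
    return k * n * p + m * k * p;   // (B*C) first, then A*(BC)
}
```

For m=10, k=100, n=5, p=50 the left ordering needs 7,500 multiplies against 75,000 for the right one; a greedy evaluator picks the cheaper pairing at each step.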
The next few releases from here on will be incremental and focus on ironing out corner cases while new features will be continuously rolled out.
Although tagged as a minor release, Fastor V0.5.1 includes some major changes, especially in API design, performance and stability:

- `SIMDVector` has been reworked to fix the long-standing issue of falling back to non-SIMD code for non-64-bit types. The fall-back is now always to the correct scalar type where a scalar specialisation is available, i.e. `float, double, int32_t, int64_t`, and to a fixed array of size 1 holding the type for other cases. The API is now a lot closer to `Vc` and `std::experimental::simd`
- `SIMDVector` for floating point types is now also activated at the `SSE2` level, allowing any compiler that automatically defines `SSE2` without `-march=native` to vectorise Fastor's code; for instance, all compilers these days define `SSE2` at the `-O2/-O3` level
- Fix a long-standing bug in network tensor contraction. Rework opmin_meta/cost models to be truly compile-time recursive in terms of depth first search. Strided contractions for networks have been completely removed and for pairs they are deactivated. Tensor contraction of networks now dispatches to by-pair `einsum`, which has many specialisations including dispatching to matmul. More than an order of magnitude performance gain in certain cases
- Extremely fast `matmul/gemm` routines. Fastor now potentially provides the fastest `gemm` routine for small to medium sized tensors of single and double precision. Benchmarks have been added. Many flavours of matmul implementations are now available, for different sizes and with remainder handling and mask loading/storing
- AVX512 support for singles and doubles
- Better macro handling through a series of new `FASTOR_...` macros
- Accurate `timeit` function based on `rdtsc`, together with memory clobber and serialisation for further accuracy
- Fastor is now Windows compatible. The whole test suite runs and passes on MSVC 2019
- Quite a few bugs and compiler warnings have been fixed along the way
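The scalar fall-back described above can be pictured as a one-lane vector type. The following is a minimal hypothetical sketch of the idea only; the name and layout are illustrative and do not reflect Fastor's internals:

```cpp
#include <cstddef>

// A "SIMD" vector that degrades to a fixed array of size 1 for types
// without a SIMD specialisation, so the same element-wise code compiles.
template<typename T, std::size_t N = 1>
struct scalar_fallback_vector {
    T data[N];
    static constexpr std::size_t Size = N;
    scalar_fallback_vector() = default;
    explicit scalar_fallback_vector(T v) { for (auto& d : data) d = v; }
    scalar_fallback_vector operator+(const scalar_fallback_vector& o) const {
        scalar_fallback_vector r;
        for (std::size_t i = 0; i < N; ++i) r.data[i] = data[i] + o.data[i];
        return r;
    }
};
```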
Fastor V0.5 is one hell of a release as it brings a lot of new features, fundamental performance improvements, improved flexibility working with Tensors and many bug fixes:
New Features
- Improved IO formatting. Flexible, configurable formatting for all derived tensor classes
- Generic matmul function for AbstractTensors and expressions
- Introduce a new tensor type `SingleValueTensor` for tensors of any size and dimension that have all their values the same. It is extremely space efficient as it stores a single value under the hood. It provides a much more optimised route for certain linear algebra functions. For instance, matmul of a `Tensor` and a `SingleValueTensor` is O(n) and transpose is O(1)
- New evaluation methods for all expressions, `teval` and `teval_s`, that provide fast evaluation of higher order tensors
- `cast` method to cast a tensor to a tensor of a different data type
- `get_mem_index` and `get_flat_index` to generalise indexing across all tensor classes. Eval methods now use these
- Binary comparison operators for expressions that evaluate lazily. Also binary comparison operators for SIMDVectors
- Constructing column major tensors is now supported by using `Tensor(external_data,ColumnMajor)`
- `tocolumnmajor` and `torowmajor` free functions
- `all_of`, `any_of` and `none_of` free function reducers that work on boolean expressions
- Fixed views now support the `noalias` feature
- `FASTOR_IF_CONSTEXPR` macro for C++17
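The complexity win for `SingleValueTensor` matmul comes from the algebra: when every entry of the right operand equals a single value s, each output entry is s times a row sum of the left operand, so the cubic inner loop disappears. A sketch with raw buffers (the function name is hypothetical, not Fastor's API):

```cpp
#include <cstddef>
#include <vector>

// C = A * S, where A is m x k (row-major) and S is a k x n matrix whose
// entries are all s. C[i][j] = s * sum_p A[i][p], independent of j.
inline std::vector<double> matmul_single_value(const std::vector<double>& A,
                                               std::size_t m, std::size_t k,
                                               double s, std::size_t n) {
    std::vector<double> C(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        double rowsum = 0.0;
        for (std::size_t p = 0; p < k; ++p) rowsum += A[i * k + p];
        for (std::size_t j = 0; j < n; ++j) C[i * n + j] = s * rowsum;
    }
    return C;
}
```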
Performance and other key improvements
- The `Tensor` class can now be treated as a compile time type, as it can be initialised as constexpr by defining the macro `FASTOR_ZERO_INITIALISE`
- Higher order einsum functions now dispatch to matmul whenever possible, which is much faster
- Much faster generic permutation, contraction and einsum algorithms, now based on recursive templates, that definitely beat the speed of hand-written C code. `CONTRACT_OPT` is no longer necessary
- A much faster loop tiling based transpose function. It is at least 2X faster than the implementations in other ET libraries like Eigen and Blaze across all sizes
- Introducing the libxsmm backend for matmul. The switch from in-built to libxsmm routines for matmul can be configured by the user using `BLAS_SWITCH_MATRIX_SIZE_S` for square matrices and `BLAS_SWITCH_MATRIX_SIZE_NS` for non-square matrices. Default sizes are 16 and 13 respectively. libxsmm brings substantial improvement for bigger size matrices
- Condensed unary ops and binary ops into a single more maintainable macro
- `FASTOR_ASSERT` is now a macro over `assert`, which optimises better in release builds
- Optimised `determinant` for 4x4 cases. Determinant now works on all types and not just float and double
- `all` is now an alias to `fall`, which means many tensor view expressions can now be dispatched to tensor fixed views. The implication of this is that expressions like `a(all)` and `A(all,all)` can just return the underlying tensor as opposed to creating a view with unnecessary sequences and offsets. This is much faster
- Specialised constructors for many view types that construct the tensor much faster
- Improved support for the `TensorMap` class to behave exactly the same as the `Tensor` class, including views, block indexing and so on
- Improved unit-testing under many configurations (debug and release)
- Many `Tensor` related methods and functionalities have been separated into individual files that are now usable by other tensor type classes
- Division of an expression by a scalar can now be dispatched to multiplication, which creates the opportunity for FMA
- Cofactor and adjoint can now fall back to a scalar version when SIMD types are not available
- Documentation is now available under Wiki pages
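The loop-tiled transpose idea is simple enough to sketch: process the matrix in BxB blocks so that both the reads and the strided writes of a block stay cache-resident. This is only the blocking skeleton; Fastor's version adds explicit SIMD kernels on top of the same idea:

```cpp
#include <cstddef>

// Transpose a rows x cols row-major matrix in BxB tiles.
template<std::size_t B = 8>
void tiled_transpose(const double* in, double* out,
                     std::size_t rows, std::size_t cols) {
    for (std::size_t ib = 0; ib < rows; ib += B)
        for (std::size_t jb = 0; jb < cols; jb += B)
            // Sweep one tile; bounds checks handle edge tiles.
            for (std::size_t i = ib; i < ib + B && i < rows; ++i)
                for (std::size_t j = jb; j < jb + B && j < cols; ++j)
                    out[j * rows + i] = in[i * cols + j];
}
```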
Bug fixes
- Fix a bug in the `product` method of the Tensor class (99e3ff0)
- Fix AVX store bug in backend matmul 3k3 (8f4c6ae)
- Fix bug in tensor matmul for matrix-vector case (899c6c0)
- Fix a bug in SIMDVector under scalar mode with mixed types (f707070)
- Fix bugs with math functions on SIMDVector with size>256 not compiling (ca2c74d)
- Fix bugs with matrix-vector einsum (8241ac8, 70838d2)
- Fix a bug with strided_contraction when the second matrix disappears (4ff2ea0)
- Fix a bug in the 4D tensor constructor with initialiser-lists (901d8b1)
- Fixes to fully support SIMDVector fallback to scalar version
- and many more undocumented fixes
Key changes
- Complete re-architecting of the directory hierarchy of Fastor. Fastor should now be included as `#include <Fastor/Fastor.h>`
- The `TensorRef` class has been renamed to `TensorMap`
- Expressions now evaluate based on the type of their underlying derived classes rather than the tensor that they are getting assigned to
There are many more major and minor undocumented changes.
This release brings new features, improvements and bug fixes to Fastor:
- Lots of changes to support MSVC. Thanks to @FabienPean.
- Permutation and einsum functions for generic tensor expressions.
- A `TensorRef` class that wraps over existing data and exposes Fastor's functionality over raw data.
- Some more tensor functions can now work on tensor expressions.
- Tensor functions for high order tensors operating on the last two indices (NumPy style operations).
- More variants of tensor cross product are now available for high order tensors.
- Bug fixes in backend trace and transpose.
- Bug fix in ordering of tensor networks.
- Bug fix in computing cost models.
and much more!
romeric released this
Jun 7, 2020
Fastor V0.6.3 is another incremental change over the V0.6 release, which introduced a significant overhaul in Fastor's internal design and exposed API. This release mainly includes internal changes.
New features and improvements
- `GCC-5` to `GCC-latest` and default `Clang` are covered, using both scalar and SIMD implementations
- `lut_inverse` and `ut_inverse` have been renamed to `tinverse` taking an `UpLoType`, similar to the linear algebra computation types #87
- `einsum` for inner and permuted inner products of a single tensor expression #80
- `einsum` extended by allowing the user to specify the shape of the tensor contraction output. `einsum` can now permute and can deal with inner and permuted inner products of tensors and tensor expressions #91
- New `permute` function that closely resembles NumPy's `permute` option and implements contiguous writes (instead of contiguous reads), which results in about 15-20% performance improvement. This function is not identical to `permutation`
- New math functions `cbrt`, `exp2/expm1`, `log10/log2/log1p`, `asinh/acosh/atanh/atan2`, `erf/lgamma/tgamma/hypot`, `round/floor/ceil`, `min/max` etc. Where applicable, SIMD versions of these are implemented. The SIMD math layer has been cleaned up and reworked
- `!(Expression)`, `isinf(Expression)`, `isnan(Expression)` and `isfinite(Expression)` are implemented #90
- `min(a,b)/max(a,b)` and `pow/hypot/atan2` are now available
- Use `alignas` instead of compiler specific macros for memory alignment #98
- `config.h` and `macros.h` now live under the `config` folder, previously named `commons` #58

Bug fixes

- Fix `cv`-qualified `TensorMap` to `Tensor` #94 by @feltech
- Fix `cv`-qualified tensors #99
- Fix expressions like `!isfinite(Expression)` or `!(a>b)` #93
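The contiguous-write strategy behind the new `permute` can be sketched in its simplest case, a 2D axis swap: iterate the output linearly (contiguous stores) and gather from the input with a stride. The function name here is hypothetical, not Fastor's API:

```cpp
#include <cstddef>
#include <vector>

// Swap the two axes of a d0 x d1 row-major array. The write index
// advances by 1 every iteration (contiguous writes); the reads stride.
inline std::vector<double> permute2d_swap(const std::vector<double>& in,
                                          std::size_t d0, std::size_t d1) {
    std::vector<double> out(in.size());
    std::size_t idx = 0;                  // contiguous write cursor
    for (std::size_t j = 0; j < d1; ++j)  // output shape is (d1, d0)
        for (std::size_t i = 0; i < d0; ++i)
            out[idx++] = in[i * d1 + j];  // strided read
    return out;
}
```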