halide / halide
A language for fast, portable data-parallel computation
Home Page: https://halide-lang.org
License: Other
Halide does not have an XOR operator. In my algorithm I can replace it with an inequality operator, but I've done some tests and (at least on my processor) the XOR operator is about 25% faster.
It would be nice to see an XOR operator in Halide. :)
It correctly fails, but does not issue any kind of useful error message. E.g.:
f.vectorize(x, 4).vectorize(y, 4);
Right now it just fails an assertion inside vectorize.ml
These are almost entirely duplicated functionality. They should be refactored into a single list.
They can be added trivially (with potentially poor performance) as simple loops. The longer-term plan is to rely on syrah.
Until it moves to the new NVPTX LLVM backend, the legacy PTX backend is missing most math lib functions (transcendentals, pow, etc.) and some other expected standard library features.
This would be valuable in a bunch of paths in the LLVM codegen, and potentially later for platform-specific optimization pre-passes.
I'm using the latest Halide release to implement a dilation algorithm in RGBA float, but it segfaults. If I remove the vectorize schedule, it works.
Uniform<int> radius = 20;
RDom dom(-radius, 2*radius+1, -radius, 2*radius+1);
structEl(x, y) = select(x*x + y*y <= radius, 1, 0);
dilation(x, y, c) = select(c < 3,
    maximum(select(structEl(dom.x, dom.y) == 1,
                   input(x + dom.x - radius, y + dom.y - radius, c),
                   0.0f)),
    input(x, y, 3));

// schedule
structEl.root();
dilation.vectorize(x, 8);
Is there any chance that Halide will generate OpenCL kernels for use on GPUs sometime in the future?
I want to use Halide in my desktop computer-graphics (photo processing) app, but many of my users have AMD cards, not NVIDIA.
The stateful nature of the LLVM OCaml bindings is nasty. It would be much nicer to have a thin ADT layer above this, much like the C types used in the C backend.
This could be started as part of the llvalue IR node, which would need to be constructable without a current "builder" context.
We are currently relying on a hacked branch of LLVM 3.1svn from around the SIGGRAPH deadline, which we had to patch to fix a variety of codegen bugs for ARM and add features to the PTX backend. This should be updated to 3.2svn. The limiting factor is that the PTX codegen needs to be updated to work with the new conventions of NVPTX instead of the older independent PTX target.
Hi,
If I try to write a reduction of the form:
RDom r(0, input);
f(x) += r;
and supply an input value of 0, the reduction fails to terminate instead of outputting the initial value.
Any thoughts?
Cheers
Known issues:
One of the things which makes the build difficult for first-timers is the need to install OCaml libraries on which we depend. On some platforms this is easy (modern Ubuntu tends to have many up-to-date OCaml packages in apt), while on others it involves a long chain of manual download/configure/make/installs which is an unnecessary distraction.
The build bootstrap process should have functionality which automatically fetches, builds, and installs the relevant libraries in a project-local path, and makes these discoverable to all subsequent build steps. odb is a straightforward option.
The challenge is to do this while also using the system packages when they exist and are sufficient, to avoid too much bloat.
Automatic GPU device selection should be overridable via an HL_GPU_DEVICE environment variable.
Halide works fine with the CPU backend. However, as soon as I try to use the CUDA backend with
HL_TARGET=ptx ./executable
I get an error like this:
...
%f0.v0_nextvar = add i32 %f0.v0, 1
%55 = icmp ne i32 %f0.v0_nextvar, %38
br i1 %55, label %f0.v0_loop, label %f0.v0_afterloop
f0.v0_afterloop: ; preds = %f0.v0_loop
call void @__free_buffer(%struct.buffer_t* %f0.f5_buf)
call void @fast_free(i8* %f0.f5)
ret void
}
LLVM ERROR: Program used external function 'cuCtxSynchronize' which could not be resolved!
Just add an unordered flag to Block and throw them into the task pool.
There is logic in src/myocamlbuild.ml to do this, but it seems to have stopped working correctly.
During autotuner debugging we found the following schedule for blur triggers a segfault rather than an error:
blur_y.tile(x, y, xi, yi, 2, 2)
blur_y.vectorize(xi, 8)
Identified by Victor Oliveira on the halide-dev list:
RDom r (-5, 11);
Func box_x("box_x");
box_x(x,y,c) += (clamped(c, x + r, y));
Func box_y("box_y");
box_y(x,y,c) += (box_x(x, y + r, c));
box_x.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);
box_y.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);
Options:
--use-local-llvm / --use-system-llvm
--use-local-clang / --use-system-clang
--use-local-ocaml-libs
The binary distribution of the Halide compiler is a 404 - Not Found.
Jim's interpolation algorithm test runs ~2x slower than gcc-4.6's optimized C result (on x86-64/Mac). We need to look into the generated code to sniff out why.
Would be similar to the ptx backend. Helpful for current-gen cell phones.
Currently, quite a few user errors result in assertions. These have been improved to be reasonably informative, but they should still be pushed off to a separate path from the implementation error asserts, specific to input program warnings and errors.
Hi,
Running bootstrap fails at that point:
--------------------------
Test: building halide.cmxa
--------------------------
Traceback (most recent call last):
File "util/bootstrap.py", line 80, in <module>
print ocamlbuild('-use-ocamlfind', 'halide.cmxa')
File "/home/hamstah/repos/Halide/util/pbs.py", line 352, in __call__
return RunningCommand(command_ran, process, call_args, actual_stdin)
File "/home/hamstah/repos/Halide/util/pbs.py", line 136, in __init__
if rc != 0: raise get_rc_exc(rc)(self.command_ran, self._stdout, self._stderr)
pbs.ErrorReturnCode_10:
Ran: '/usr/bin/ocamlbuild -use-ocamlfind halide.cmxa'
STDOUT:
ocamlfind ocamlopt -I /usr/lib/ocaml/ocamlbuild unix.cmxa /usr/lib/ocaml/ocamlbuild/ocamlbuildlib.cmxa myocamlbuild.ml /usr/lib/ocaml/ocamlbuild/ocamlbuild.cmx -o myocamlbuild
+ ocamlfind ocamlopt -I ... (214 more, please see e.stdout)
STDERR:
Running the command manually gives:
$ /usr/bin/ocamlbuild -use-ocamlfind halide.cmxa
Solver failed:
Ocamlbuild cannot find or build halide.ml. A file with such a name would usually be a source file. I suspect you have given a wrong target name to Ocamlbuild.
Backtrace:
- Failed to build the target halide.cmxa
- Building halide.cmxa:
- Failed to build all of these:
- Building halide.cmx:
- Failed to build all of these:
- Building halide.ml:
- Failed to build all of these:
- Building halide.mly
- Building halide.mll
- Building halide.mlpack
- Building halide.mllib
Compilation unsuccessful after building 0 targets (0 cached) in 00:00:00.
The .cmxa file seems to be missing:
$ find . -name "*.cmxa"
./llvm/Release+Asserts/lib/ocaml/llvm_bitwriter.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_analysis.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_bitreader.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_target.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_ipo.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_scalar_opts.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_executionengine.cmxa
./llvm/bindings/ocaml/target/Release+Asserts/llvm_target.cmxa
./llvm/bindings/ocaml/bitreader/Release+Asserts/llvm_bitreader.cmxa
./llvm/bindings/ocaml/llvm/Release+Asserts/llvm.cmxa
./llvm/bindings/ocaml/transforms/ipo/Release+Asserts/llvm_ipo.cmxa
./llvm/bindings/ocaml/transforms/scalar/Release+Asserts/llvm_scalar_opts.cmxa
./llvm/bindings/ocaml/executionengine/Release+Asserts/llvm_executionengine.cmxa
./llvm/bindings/ocaml/bitwriter/Release+Asserts/llvm_bitwriter.cmxa
./llvm/bindings/ocaml/analysis/Release+Asserts/llvm_analysis.cmxa
bootstrap also got stuck on the git submodule update for some reason; it might be because of older Python/modules, but running the command manually and then bootstrap again solved it.
The existing tuples support is awkward, limited, and confusing.
If we add debugging metadata to the LLVM backend, generated code will be able to show sane state in gdb.
I am trying to build Halide on VS 2010 and need some help. I have successfully compiled the code, but the problem is:
error LNK2019: unresolved external symbol "public: class Halide::DynImage __thiscall Halide::Func::realize(int)" (?realize@Func@Halide@@QAE?AVDynImage@2@H@Z) referenced in function _main
I know this is early, but I and I'm sure others would be interested in Windows support. I'm taking a guess that VC10 and maybe even VC11 don't have enough C++11 features implemented that would allow Halide to compile, so mingw is probably the way to go. Maybe someone more knowledgeable than I can explain some of the challenges that will be faced in porting to Windows.
Currently, bootstrap errors tend to be in the form of long log spew from a failed LLVM build or similar. In this case, the script tends to print the start of the log, rather than the end of it, where the actual information is.
Major issues:
- size_t strides[MAX_DIMS]: allow inputs with padded scanlines
- size_t offset[MAX_DIMS]: allow execution over a sub-region of a host buffer
- int dims field
Also of note: pointer sizes vary between architectures, which means that the size and offsets of the structure itself vary. This probably remains the right answer, but we need to be careful on 32-bit architectures.
This is a bug migrated from the trello list
Bug in the C++ layer. There should be a check for this. Possibly non-trivial, because reductions may recursively reference themselves.
TBB and Grand Central Dispatch would be valuable. This should be doable (almost?) entirely as alternative standard libraries.
When the c dimension moves from outside to inside, the initialization code radically changes (becomes incorrect) - inspect the generated code.
I'm using Halide (the precompiled libs) for an HDR fusion program and I need to create a Gaussian pyramid. There is an example in the local_laplacian code, but it doesn't use JIT (which is what I want).
I have this code:
Image<int> subsample()
{
    Func downx, downy;
    Var x, y;
    downx(x, y) = ((*this->image)(2*x-1, y) + 2 * (*this->image)(2*x, y) + (*this->image)(2*x+1, y)) / 4;
    downy(x, y) = (downx(x, 2*y-1) + 2 * downx(x, 2*y) + downx(x, 2*y+1)) / 4;
    int width = this->image->width() / 2, height = this->image->height() / 2;
    Image<int> out = downy.realize(width-1, height-1);
    return out;
}
This code obviously fails because it starts at x=0, y=0 and the indices have to be positive numbers.
How can I set the limits for "realize" on downy? Any help, please?
PS: I'm just starting with C++ because of Halide, you may think my code is horrible. :)
A very common pattern in Halide code loads from an image using a clamped index:
clamped(x, y) = input(clamp(x, 0, input.width() - 1), clamp(y, 0, input.height() - 1));
In the current backend, this generates unnecessarily conservative code when vectorized. A better strategy would be to generate a dynamic branch which detects if the index vector is near the edge of the clamp range, and if not, removes the clamp and generates a simple dense aligned vector load.
In py_bindings/test_blur.py, this schedule fails:
blur_y.root().tile(y,c,_c0,_c1,64,8)
It would be nice for users if we didn't require llc/opt to actually codegen and assemble a statically compiled pipeline. This will require either plumbing llvm-c's LLVMTargetMachineEmitToFile(...) through to the OCaml standard interface, or making our own shim in src/cllutil.c.
It's quite confusing. We need some better way to handle loop nesting order.
We don't currently do any bounds checking on input images, which causes segfaults. One subtle way this triggers is if you vectorize something which accesses the input image but the input image is not a multiple of the vector width.
See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug
We should add asserts at the function preamble that check this (conservatively).
Desired format:
<test_name>: {compile|link|run}
....E..
This was a bug listed on the trello board. It needs to be tested to see if it's still a bug and fixed if so.
Hi!
I've been trying to implement the SURF descriptor algorithm in Halide. I've run into the problem that the piece of code below (one of the parts of SURF) takes a really long time to compile (compileJIT and compileToFile) and it exits when memory completely fills up (4GB). It's really hard to know the source of the problem since there are no error messages.
I tried to reduce code size (removing some calculations), and sometimes it compiles (though very slow).
FUNC and VAR are just macros to set unique object names.
Here is the function:
Func getResponse(UniformImage resp, UniformImage src)
{
    Func FUNC(f);
    Var VAR(x), VAR(y);
    Expr scale = cast<float>(resp.width()) / cast<float>(src.width());
    Expr xx = cast<int>(scale * cast<float>(x));
    Expr yy = cast<int>(scale * cast<float>(y));
    f(x, y) = resp(clamp(xx, 0, resp.width()), clamp(yy, 0, resp.height()));
    return f;
}
Func nonMaxSuppression
    (UniformImage t, Uniform<int> t_step, Uniform<int> t_filter,
     UniformImage m, Uniform<int> m_step, Uniform<int> m_filter,
     UniformImage b, Uniform<int> b_step, Uniform<int> b_filter,
     UniformImage laplacian,
     Uniform<float> threshold)
{
    Var VAR(x), VAR(y);

    /* clamp parameters */
    Func FUNC(clamped_t), FUNC(clamped_m), FUNC(clamped_b);
    clamped_t(x,y) = t(clamp(x, 0, t.width()), clamp(y, 0, t.height()));
    clamped_m(x,y) = m(clamp(x, 0, m.width()), clamp(y, 0, m.height()));
    clamped_b(x,y) = b(clamp(x, 0, b.width()), clamp(y, 0, b.height()));

    Func response_m_t = getResponse(m, t);
    Func response_b_t = getResponse(b, t);

    Expr layerBorder = (t_filter + 1) / (2 * t_step);
    Expr validBounds = ( y > layerBorder
                      && y < t.height() - layerBorder
                      && x > layerBorder
                      && x < t.width() - layerBorder);

    Expr candidate = response_m_t(x, y);
    Expr aboveThreshold = candidate >= threshold;

    RDom r(-1, 2, -1, 2);
    Expr max_t = maximum(clamped_t(r.x+x, r.y+y));
    Expr max_m = maximum(clamped_m(r.x+x, r.y+y));
    Expr max_b = maximum(clamped_b(r.x+x, r.y+y));

    Func FUNC(GreaterNeigh);
    GreaterNeigh(x,y) = max(max_t, max(max_m, max_b));
    Expr isExt = validBounds && aboveThreshold && (GreaterNeigh(x,y) <= candidate);

    // ---------------------------------------
    // Step 1: Calculate the 3D derivative
    // ---------------------------------------
    Func FUNC(dx), FUNC(dy), FUNC(ds);
    dx(x,y) = (response_m_t(x+1, y) - response_m_t(x-1, y)) / 2.0f;
    dy(x,y) = (response_m_t(x, y+1) - response_m_t(x, y-1)) / 2.0f;
    ds(x,y) = (clamped_t(x, y) - clamped_b(x, y)) / 2.0f;

    // ---------------------------------------
    // Step 2: Calculate the inverse Hessian
    // ---------------------------------------
    Expr v;
    Func FUNC(dxx), FUNC(dyy), FUNC(dss), FUNC(dxy), FUNC(dxs), FUNC(dys);
    v = response_m_t(x, y);
    dxx(x,y) = response_m_t(x + 1, y) + m(x - 1, y) - 2 * v;
    dyy(x,y) = response_m_t(x, y + 1) + m(x, y - 1) - 2 * v;
    dss(x,y) = clamped_t(x, y) + response_b_t(x, y) - 2 * v;
    dxy(x,y) = ( response_m_t(x + 1, y + 1)
               - response_m_t(x - 1, y + 1)
               - response_m_t(x + 1, y - 1)
               + response_m_t(x - 1, y - 1) ) / 4.0;
    dxs(x,y) = ( clamped_t(x + 1, y)
               - clamped_t(x - 1, y)
               - response_b_t(x + 1, y)
               + response_b_t(x - 1, y) ) / 4.0;
    dys(x,y) = ( clamped_t(x, y + 1)
               - clamped_t(x, y - 1)
               - response_b_t(x, y + 1)
               + response_b_t(x, y - 1) ) / 4.0;

    Expr H[3][3] = {{dxx(x,y), dxy(x,y), dxs(x,y)},
                    {dxy(x,y), dyy(x,y), dys(x,y)},
                    {dxs(x,y), dys(x,y), dss(x,y)}};

    Func FUNC(invDet);
    invDet(x,y) = 1.0 /
        (H[0][0]*(H[1][1]*H[2][2]-H[2][1]*H[1][2]) -
         H[0][1]*(H[1][0]*H[2][2]-H[1][2]*H[2][0]) +
         H[0][2]*(H[1][0]*H[2][1]-H[2][2]*H[2][0]));

    Expr invH[3][3] =
        {{ (H[1][1]*H[2][2]-H[2][1]*H[1][2])*invDet(x,y), -(H[1][0]*H[2][2]-H[1][2]*H[2][0])*invDet(x,y),  (H[1][0]*H[2][1]-H[2][0]*H[1][1])*invDet(x,y)},
         {-(H[0][1]*H[2][2]-H[0][2]*H[2][1])*invDet(x,y),  (H[0][0]*H[2][2]-H[0][2]*H[2][0])*invDet(x,y), -(H[0][0]*H[2][1]-H[2][0]*H[0][1])*invDet(x,y)},
         { (H[0][1]*H[1][2]-H[0][2]*H[1][1])*invDet(x,y), -(H[0][0]*H[1][2]-H[1][0]*H[0][2])*invDet(x,y),  (H[0][0]*H[1][1]-H[1][0]*H[0][1])*invDet(x,y)}};

    // ---------------------------------------
    // Step 3: Multiply derivative and Hessian
    // ---------------------------------------
    Expr cx = (invH[0][0] * dx(x,y) * -1.0) + (invH[0][1] * dy(x,y) * -1.0) + (invH[0][2] * ds(x,y) * -1.0);
    Expr cy = (invH[1][0] * dx(x,y) * -1.0) + (invH[1][1] * dy(x,y) * -1.0) + (invH[1][2] * ds(x,y) * -1.0);
    Expr ci = (invH[2][0] * dx(x,y) * -1.0) + (invH[2][1] * dy(x,y) * -1.0) + (invH[2][2] * ds(x,y) * -1.0);

    Expr isClose = (abs(cx) < 0.5 && abs(cy) < 0.5 && abs(ci) < 0.5);
    Expr posx = cast<float>((x + cx)*t_step);
    Expr posy = cast<float>((y + cy)*t_step);
    Expr det_scale = cast<float>((0.1333)*(m_filter + (ci* (m_filter - b_filter))));

    Func FUNC(laplacianF);
    laplacianF = getResponse(laplacian, t);

    Var VAR(c);
    Func FUNC(out);
    out(x,y,c) = select(c==0, isExt && isClose,
                 select(c==1, posx,
                 select(c==2, posy,
                 select(c==3, det_scale,
                 select(c==4, laplacianF(x,y), 0.0)))));

    // schedule
    GreaterNeigh.root();
    invDet.root();

    return out;
}
int main()
{
    UniformImage t(Float(32), 2),
                 m(Float(32), 2),
                 b(Float(32), 2),
                 laplacian(Float(32), 2);
    Uniform<int> t_step, t_filter,
                 m_step, m_filter,
                 b_step, b_filter;
    Uniform<float> threshold;
    Func nms = nonMaxSuppression(t, t_step, t_filter,
                                 m, m_step, m_filter,
                                 b, b_step, b_filter,
                                 laplacian,
                                 threshold);
    nms.compileJIT(); /* error */
    return 0;
}
I am trying to find the cosine of a set of numbers; I used values from 0 to 10. My code compiled, but I get this error while running:
Halide::DynImage::Contents::Contents(const Halide::Type&, int): Assertion `a > 0 && "Images must have positive sizes\n"' failed.
Aborted (core dumped)
I understood the second part. How can I make it work for negative numbers?
It seems all that's needed for the stock Lion compiler is to remove the initializer list support for images.
I am working on CUDA and Halide. I have compiled and run a few examples. When I opened my working directory, I found a file "kernel.ptx". I opened it and found this.
Does Halide support only devices with compute capability 2.0?
I have run the code given in Getting Started and the test folder. It ran fine with g++. But when I include the shell command for CUDA, i.e.,
"g++-4.6 -std=c++0x hello_halide.cpp -L /usr/local/cuda/lib64 -lcuda halide -lHalide -ldl -lpthread -o hello_halide"
it shows:
"/usr/bin/ld: cannot find halide: File format not recognized
/usr/bin/ld: cannot find -lHalide
collect2: ld returned 1 exit status"
What should I do?
For example, the following code accesses the output image incorrectly.
f(x) = x
f.root().split(x, xo, x, 2)
f.realize(...)
A failing test has been checked in as split_reuse_inner_name_bug