README.md

Mesa testing

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge testing of Mesa drivers on various platforms, so that we can ensure no regressions are merged, as long as developers are merging code using marge-bot.

There are currently 4 automated testing systems deployed for Mesa. LAVA and gitlab-runner on the DUTs are used in pre-merge testing and are described in this document. Managing bare metal using gitlab-runner is described under [bare-metal/README.md]. Intel also has a jenkins-based CI system with restricted access that isn't connected to gitlab.

Mesa testing using LAVA

LAVA is a system for functional testing of boards including deploying custom bootloaders and kernels. This is particularly relevant to testing Mesa because we often need to change kernels for UAPI changes (and this lets us do full testing of a new kernel during development), and our workloads can easily take down boards when mistakes are made (kernel oopses, OOMs that take out critical system services).

Mesa-LAVA software architecture

The gitlab-runner will run on some host that has access to the LAVA lab, with tags like "lava-mesa-boardname" so that it only takes jobs for hardware that the LAVA lab contains. The gitlab-runner spawns a docker container with lavacli in it, and connects to the LAVA lab using a predefined token to submit jobs under a specific device type.

The LAVA instance manages scheduling those jobs to the boards present. For a job, it will deploy the kernel, device tree, and the ramdisk containing the CTS.
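
A LAVA job definition describes those deploy and boot steps. A minimal sketch is below; the device type, URLs, and boot method are placeholder assumptions, not Mesa's actual job template:

```yaml
device_type: some-board        # hypothetical; must match a type in your instance
job_name: mesa-ci-sanity-check
actions:
  - deploy:
      to: tftp
      kernel:  { url: "https://example.com/Image" }
      dtb:     { url: "https://example.com/board.dtb" }
      ramdisk: { url: "https://example.com/rootfs.cpio.gz" }
  - boot:
      method: u-boot
      commands: ramdisk
```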

Deploying a new Mesa-LAVA lab

You'll want to start with setting up your LAVA instance and getting some boards booting using test jobs. Start with the stock QEMU examples to make sure your instance works at all. Then, you'll need to define your actual boards.

The device type in lava-gitlab-ci.yml is the device type you create in your LAVA instance, which doesn't have to match the board's name in /etc/lava-dispatcher/device-types. You create your boards under that device type and the Mesa jobs will be scheduled to any of them. Instantiate your boards by creating them in the UI or at the command line attached to that device type, then populate their device dictionary (probably using an "extends" line referencing the board's template in /etc/lava-dispatcher/device-types). Finally, find a relevant healthcheck job for your board to use as a test job definition, or cobble something together from a board that boots using the same boot_method and some public images, and figure out how to get your boards booting.
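
A device dictionary is usually little more than a template include plus any board-specific settings; a hypothetical example (the template name and connection command here are made up):

```jinja2
{% extends 'some-board.jinja2' %}
{% set connection_command = 'telnet lab-serial-server 7001' %}
```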

Once you can boot your board using a custom job definition, it's time to connect Mesa CI to it. Install gitlab-runner and register as a shared runner (you'll need a gitlab admin for help with this). The runner must have a tag (like "mesa-lava-db410c") to restrict the jobs it takes or it will grab random jobs from tasks across fd.o, and your runner isn't ready for that.
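
Registration might look roughly like the following; the URL, token, image, and tag are placeholders, and the flags are from gitlab-runner's register command:

```shell
sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.freedesktop.org/ \
  --registration-token <token-from-a-gitlab-admin> \
  --executor docker \
  --docker-image debian:buster \
  --tag-list mesa-lava-db410c \
  --run-untagged=false
```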

The runner will be running an ARM docker image (we haven't done any x86 LAVA yet, so that isn't documented). If your host for the gitlab-runner is x86, then you'll need to install qemu-user-static and the binfmt support.
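
On a Debian-ish x86 host that setup is typically just (package names assumed from Debian):

```shell
apt-get install qemu-user-static binfmt-support
```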

The docker image will need access to the lava instance. If it's on a public network it should be fine. If you're running the LAVA instance on localhost, you'll need to set network_mode="host" in /etc/gitlab-runner/config.toml so it can access localhost. Create a gitlab-runner user in your LAVA instance, log in under that user on the web interface, and create an API token. Copy that into a lavacli.yaml:

default:
  token: <token contents>
  uri: <url to the instance>
  username: gitlab-runner
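
With that file in place you can sanity-check the token from inside the container; for example (subcommand assumed from lavacli's CLI):

```shell
lavacli -i default devices list
```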

Add a volume mount of that lavacli.yaml to /etc/gitlab-runner/config.toml so that the docker container can access it. You probably have a volumes = ["/cache"] already, so now it would be

  volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]

Note that this token is visible to anybody that can submit MRs to Mesa! It is not an actual secret. We could just bake it into the gitlab CI yml, but this way the current method of connecting to the LAVA instance is separated from the Mesa branches (particularly relevant as we have many stable branches all using CI).

Now it's time to define your test runner in .gitlab-ci/lava-gitlab-ci.yml.

Mesa testing using gitlab-runner on DUTs

Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test devices (DUTs), cached docker containers with VK-GL-CTS, and the normal shared x86_64 runners to build the Mesa drivers to be run inside of those containers on the DUTs.

The docker containers are rebuilt from the debian-install.sh script when DEBIAN_TAG is changed in .gitlab-ci.yml, and debian-test-install.sh when DEBIAN_ARM64_TAG is changed in .gitlab-ci.yml. The resulting images are around 500MB, and are expected to change approximately weekly (though an individual developer working on them may produce many more images while trying to come up with a working MR!).
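
Triggering a rebuild is therefore just a matter of bumping the tag in the variables of .gitlab-ci.yml; a sketch (the tag value here is made up):

```yaml
variables:
  DEBIAN_TAG: "2020-06-25"   # changing this forces the container rebuild
```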

gitlab-runner is a client that polls gitlab.freedesktop.org for available jobs, with no inbound networking requirements. Jobs can have tags, so we can have DUT-specific jobs that only run on runners with that tag marked in the gitlab UI.

Since dEQP takes a long time to run, we mark the job as "parallel" at some level, which spawns multiple jobs from one definition, and then deqp-runner.sh takes the corresponding fraction of the test list for that job.
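
In GitLab CI terms that's the job-level parallel keyword, which exports CI_NODE_INDEX and CI_NODE_TOTAL to each spawned job; a sketch (the job name and count are illustrative):

```yaml
test-deqp-somehw:
  parallel: 4   # spawns 4 jobs; each gets CI_NODE_INDEX 1..4 and CI_NODE_TOTAL 4
```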

To reduce dEQP runtime (or avoid tests with unreliable results), a deqp-runner.sh invocation can provide a list of tests to skip. If your driver is not yet conformant, you can pass a list of expected failures, and the job will only fail on tests that aren't listed (look at the job's log for which specific tests failed).
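
Both lists are plain text with one regex per line, matched against the full test name; hypothetical entries might look like:

```
# deqp-somehw-skips.txt: tests never run (e.g. flaky or far too slow)
dEQP-GLES31.functional.copy_image.*

# deqp-somehw-fails.txt: known failures; the job only fails on new ones
dEQP-GLES2.functional.shaders.builtin_variable.fragcoord_w
```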

DUT requirements

DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually time out and fail (default 1 hour). However, if the kernel can't reliably reset the GPU on failure, bugs in one MR may leak into spurious failures in another MR. This would be an unacceptable impact on Mesa developers working on other drivers.

DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker, so that we can cache the debian package installation and CTS build step across multiple test runs. Since the images are large and change approximately weekly, the DUTs also need to be running some script to prune stale docker images periodically in order to not run out of disk space as we rev those containers (perhaps this script).
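
If you don't use that script, a blunt alternative is a cron job wrapping docker's own pruning; a sketch, with the retention window an assumption:

```shell
# /etc/cron.weekly/docker-prune: drop images unused for over two weeks
docker image prune --all --force --filter "until=336h"
```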

Note that docker doesn't allow containers to be stored on NFS, and doesn't allow multiple docker daemons to interact with the same network block device, so you will probably need some sort of physical storage on your DUTs.

DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting anyone on the internet run code on your device. docker containers may provide some limited protection, but how much you trust that and what you do to mitigate hostile access is up to you.

DUTs must expose the dri device nodes to the containers

Obviously, to get access to the HW, we need to pass the render node through. This is done by adding devices = ["/dev/dri"] to the runners.docker section of /etc/gitlab-runner/config.toml.
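
In context, that section of /etc/gitlab-runner/config.toml looks like:

```toml
[[runners]]
  [runners.docker]
    devices = ["/dev/dri"]
```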

HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block unrelated work by other vendors, we require that a given driver's test farm produces a spurious failure no more than once a week. If every driver had CI and failed once a week, we would be seeing someone's code getting blocked on a spurious failure daily, which is an unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough turnaround time that people can regularly use the "Merge when pipeline succeeds" button successfully (until we get marge-bot in place on freedesktop.org). As a result, we require that the test farm be able to handle a whole pipeline's worth of jobs in less than 5 minutes (for comparison, the build stage is about 10 minutes, if you could get all your jobs scheduled on the shared runners in time).

If a test farm is short the HW to provide these guarantees, consider dropping tests to reduce runtime. VK-GL-CTS/scripts/log/bottleneck_report.py can help you find what tests were slow in a results.qpa file. Or, you can have a job with no parallel field set and:

  variables:
    CI_NODE_INDEX: 1
    CI_NODE_TOTAL: 10

to just run 1/10th of the test list.
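
How a runner can turn those variables into its share of the list: a minimal sketch assuming a simple round-robin split (deqp-runner.sh's real selection logic may differ):

```shell
#!/bin/sh
CI_NODE_INDEX=1
CI_NODE_TOTAL=10
seq 1 100 > /tmp/all-tests.txt   # stand-in for the full test list
# Each job takes every CI_NODE_TOTAL-th line, offset by its 1-based index.
awk -v i="$CI_NODE_INDEX" -v n="$CI_NODE_TOTAL" \
    'NR % n == i % n' /tmp/all-tests.txt > /tmp/shard.txt
wc -l < /tmp/shard.txt   # prints 10: this job's tenth of the list
```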

If a HW CI farm goes offline (network dies and all CI pipelines end up stalled) or its runners are consistently failing spuriously (disk full?), and the maintainer is not immediately available to fix the issue, please push through an MR disabling that farm's jobs by adding '.' to the front of the job names until the maintainer can bring things back up. If this happens, the farm maintainer should provide a report to mesa-dev@lists.freedesktop.org after the fact explaining what happened and what the mitigation plan is for that failure next time.