Exploring TouchDesigner functionality and theorising potential.

TouchDesigner

The plugin I will be using allows users to run Google's real time MediaPipe machine learning models in TouchDesigner. This is Dom Scott and Torin Mathis's plugin.

It tracks faces, facial landmarks, hands, poses, objects (including classification). There is also the option for image segmentation which could prove very useful later.

Further Information

What is a Tox?
.bullet A Tox is a TouchDesigner component file which is used to condense complex node strings into one input and output. The purpose of a Tox is to maintain visual clarity and complexity. The MediaPipe plugin has 8 components, each relating to a machine learning model

There are various machine learning models including
.bullet Face_Tracking
.bullet Hand_Tracking
.bullet Pose_Tracking
.bullet Object_Tracking
.bullet Face_Detector
.bullet Image_Classification
.bullet Image_Embeddings
.bullet Image_Segmentation

Limitations

.bullet "Currently the model is limited to 720p input resolution" which could cause problems for high quality graphics. Fortunately, there are image upscalers that could counteract this issue. The issue with realtime generation is that quality is limited to hardware: Even the most powerful computers will struggle with high quality real time generation and most AI image outputs (including cloud-based) hover around the 1024x1024 mark. As mentioned, the fix for this would be post-processing which would include a seperate upscaling function. Whilst the user input resolution is not necessarily equal to the output resolution the free version of TouchDesigner is limited to an output of 1280x1280

Node system 01 Node system 02 Node system 03

Here, the data is extracted and re-separated, the values become usable to modify the properties of other nodes: For example a circle's coordinates can be edited realtime by the movement of the index finger. The current setup locks the X and Y location of the circle to the index finger on the canvas and the scale variable is controlled by the Z position of the index finger.

This proves that compositions could be created using this method, with basic shapes representing each element.

Node system 04

I then experimented with adding a capsule instead of a circle. The X, Y and Z scales were mapped to the relative index finger coordinates.

Exploring StreamDiffusion within TouchDesigner.

TouchDesigner and StreamDiffusion

StableDiffusion

.bullet StableDiffusion is an AI image generation algorithm that I have been using in my professional and personal work since 2022. It is OpenSource and allows users to tweak a wide variety of parameters, allowing for a lot of creative freedom. Using StableDiffusion has played a large part in helping me understand diffusion models and their potential, it is largely responsible for the concept behind this project.

ImageToImage

.bullet Within StableDiffusion there is an ImageToImage function where a user can input an image and the algorithm will be 'inspired,' creating an image that is similar (variability is defined by the user). This opens the door to simple image input where the user has large control over the output. This counteracts one of the problems with existing models which uses a TextToImage software: ImageToImage grants the user more control over the output being that the process is more iterative.

StreamDiffusion

.bullet Made by DotSimulate
.bullet To gain access to the plugin I did have to pay a fee (cost). Whilst I am typically opposed to doing this I found it to be a valuable investment given the time and knowledge I have would not allow me to build a similar system from the ground up.
.bullet StreamDiffusion is a version of StableDiffusion that is optimised for real time visual generation: It is essentially a smaller, more optimised version. It does compromise on complete image generation through resolution and quality of output however I believe it could be a very helpful tool in the iterative process I intend to create. The StreamDiffusion output will not be perfect but the user will be able to define when to take the next step which will involve refining the real time generation using a near real time generation system that requires more computational power.
.bullet The system is well optimised and ready to be tweaked, allowing me to morph it into something new later down the line

Why is this helpful?

.bullet StreamDiffusion will allow me to use gesture controls to control an output in realtime. Whilst StableDiffusion and other more conventional ImageToImage systems are more detailed and accurate, an intuitive system must allow the user to see what they are doing in relation to the output.
.bullet Any delay between input and output could affect engagement which is very important in an exhibition context.

Node system 01 TDSDMountainSunset
TDSDNoiseAdjustment
Node system 01 TDSDGalaxy
Node system 01 TDSDInputDistortion

Further Exploration

Instead of taking a direct approach to using the machine learning models in TouchDesigner I theorised how a silhouette-based system could function. I realised it would require instruction for the user to put their hands in front of the camera, causing an, although not steep, learning curve.

A silhouette-based approach felt like it could be more intuitive and would appeal to the viewer that walks by and is picked up by the camera. I have been thinking about how people would interact with the system and have concluded the majority of 'introductory' experiences will consist of accidentally triggering the machine learning models: For example someone walks behind in the background and is picked up by the model. Initially I thought of this as a distraction for the main user of the system however I now see this as a potential collaborative experience: Design is the profession of communication and having it start with asking for feedback is not dissimilar to a conventional client to designer relationship.

SilhouetteTestRiverAndMountains

Whilst these experiments have been enlightening, they are simply leveraging existing tools rather than solving any new problems. The idea behind this segment of the project is to fix the compositional issue regarding AI image generation. It is imperative my next steps begin to address the issues I set out to solve.

Exploring the capabilities of ComfyUI and generating some test images and compositions.

Learning ComfyUI

I thought it was important to explore the capabilities of ComfyUI as I understand it is much more flexible compared to the Automatic1111 workspace for StableDiffusion.

ComfyUI Nodes ComfyUI Output

.bullet The LoadCheckpoint node is used to load the model.
.bullet The CLIP Text Encoder node is used to encode the text prompts (positive and negative).
.bullet The KSampler node controls the settings for the image generation: How the noise is applied to the image.
.bullet The EmptyLatentImage node is used to create a blank canvas (noise) for the image generation.
.bullet The VAE Decoder node is used to decode the image.
.bullet The SaveImage node is used to display and save the image.

ComfyUI Initial Experiments

.enter

_
_ _
_


Landscapes better than StreamDiffusion
Prompted for a poster: no text rendered
Wanted to see how multi-subject images were handled: better than expected

The installation process for StableDiffusion, MediaPipe, StreamDiffusion and ComfyUI have all come with their individual issues. The main problems being the technical understanding and hardware requirements to get these systems running. Such issues do make me wonder how accessible the system I aim to create would be. In any context outside of an exhibition it would be incredibly difficult for a non-technical, non-designer to install the system, making me wonder how successful the project would be.

I believe the future for AI will remain server-based (like ChatGPT and MidJourney) which, for the best versions, are behind a paywall. Server-side for this project would demand a subscription fee for the cloud hosting. It could be argued that design will always be gatekept, access to industry standard design programs is presently behind a paywall and I can't see this shifting anytime soon.

Flux Setup

FluxNodes03

Further Examples

.enter

FluxNodes01 FluxNodes02

I began thinking how I could separate individual elements from generations as there is no alpha channel so there need to be some more creative workarounds. I thought a bluescreen/ greenscreen setup with some additional nodes could be a viable solution.

BGRem

BGRemNodes01
BGRemSource01 BGRemOutput01

I found a custom background removal node by John-mnz which automatically separates the subject from the background. This is incredibly helpful regarding composition as each individual element having their own background would not allow for any compositing.

BGRemNodes02
BGRemSource02 BGRemOutput02 BGRemMask02

I then started exporting the masks of the image outputs as this may prove useful in reverse engineering the system to mimic some of my previous TouchDesigner experimentation: The masks could be used to isolate areas of the canvas in future generations.

BGRemSource03 BGRemOutput03 BGRemMask03
BGRemSource04 BGRemOutput04 BGRemMask04

I was very impressed with the outputs and how clean the images and image masks were. This is the first time that I can see the system as something that could be very helpful in my own process. Until now I have used image generation as a very rough starting point that is heavily tweaked in Photoshop, this feels like it could replace a substantial amount of the Photoshop process.

Not only is this beneficial to my personal practice but once the system is initialized, all that’s needed is a prompt and button click: Very beginner friendly.

The next step I would like to explore is prompt generation: I often use LLMs to generate detailed prompts for image generation requests. The more detailed the prompt is, the better the output is likely to be. When humans explain images to computers we assume a large amount of detail and context as when talking to a human it is implied. Using an LLM to flesh out the prompt and add context would further automate the process without taking away too much creative control: The user should be able to intervene and add/ remove details if they do not fit their vision.

AutoPrompting

AutoPromptNodes01
AutoPromptNodes01 AutoPromptNodes01 AutoPromptNodes01

I found this tutorial by Sebastian Kamph that uses a custom node to generate more detailed prompts based on simple user input.

For the image above, all I did was type in "Cyberpunk Woman" and pressed generate. The LLM inside ComfyUI then generated this prompt "A gritty cyberpunk scene unfolds with a fierce female protagonist at its center. The image features a striking woman dressed in sleek, futuristic attire that combines cutting-edge technology and high fashion elements. Her dark, metallic eyes are augmented with cybernetic enhancements, adding to her enigmatic allure. She stands tall against the backdrop of neon cityscapes and towering skyscrapers adorned with holographic advertisements, creating a stark contrast between organic life and technological advancement. The image is rendered in a gritty yet vibrant style that encapsulates the essence of cyberpunk culture: an amalgamation of dystopian futurism and neo-noir aesthetics that explores themes such as power, identity, and rebellion."

Whilst it could be argued that this setup offloads some of the creative control of the human I would argue this enhances the process. If a user has a vague idea they will provide a vague prompt but in doing so they would have assumed the context, lighting and details of the element they have imagined. By running the initial prompt through an LLM these contextual details we take for granted are specified. This creates a more complete image but also embeds prompting conventions, leading to better outputs.

Further Examples

.enter

Here are further examples of workflows enhanced with AI prompting.

AutoPromptNodes02
AutoPromptNodes02 AutoPromptNodes02 AutoPromptNodes02
AutoPromptNodes03 AutoPromptNodes03 AutoPromptNodes03

Reflection

Whilst this increased level of automation is exciting it does ask questions of responsibility. If my final outcome allows anyone to create compelling design there are some ethical questions that need to be asked. For example, it is the responsibility of the designer to communicate clearly and ensure the design conveys a message which they are comfortable with. Politically, a tool like this could be weaponized. Arguably, design has always been highly political, however a machine able to create content at such quantities has never been seen before. The point could be made that whoever is in charge of the 'system of unlimited output' will have immense influence. Is this something that should be created? The only way to counteract this threat is to democratise not only design but the design process.

If nothing else, an automated graphic system would increase the quantity of design work but would likely decrease the quality. It is very important that the system I create is not one that is overly focused on output. The goal of my project is to improve the quality of outputs through design principles. Until now I have focused on the automotive aspect of the system however it is becoming increasingly important to address the initial issues I set out to solve.

Compositional Denoising

One of my theories at the start of the project was that users would be able to compose their images and re-render them with a low denoising strength through AI. I experimented with this in ComfyUI however I have not yet found the sweet spot in using basic denoising. I theorise the solution for this is using different LoRas that are designed for multi-subject images, enhanced detailing, lighting and composition. Below is some experimentation of using an image to image system with a low denoising strength.

Compositional Denoising 02 Compositional Denoising 01
Compositional Denoising 03 Compositional Denoising 04 Compositional Denoising 05
Compositional Denoising 06 Compositional Denoising 07 Compositional Denoising 08

Compiled Img2Img

.enter

_
_ _

Another solution I thought of was to use two sets of smaller denoising strengths on the same image. This however yielded even worse results.

The outputs were created using a variety of denoising strengths, ranging from 0.8 to 0.2. A denoising strength of 0 means the image will be unchanged and a denoising strength of 1 will completely change an image. My intention was for the process to improve lighting cohesion and add shadows to the individual elements so they fit together. Unfortunately, the outputs seemed to make each individual element worse alongside the aforementioned intended changes. Perhaps one fix is to use masks to ensure the subjects themselves are less affected by the denoising process.

Exploring how individually generated images could be combined together without distortion.

Multi-Comp

Philosophy

I wanted to find a system that would allow me to create multi-subject compositions. This is one of the primary objectives of the project and without tackling it I would deem this a failure.

I found this tutorial which seemed to be a perfect solution to my project, however, the custom nodes have been depreciated in the most recent versions of ComfyUI.

AreaCompOutput01 AreaCompNodes01

Set Area

I followed this tutorial to assign region-based prompts to the canvas. Whilst this is helpful for existing compositions I struggled to modify it for my use cases.

Further Experimentation

.enter

AreaCompOutput02 AreaCompNodes02
AreaCompOutput06 AreaCompNodes06
AreaCompOutput07 AreaCompNodes07

Here I tried using the same method to create an image composed of multiple subjects from scratch. Unfortunately I was unable to create a system that functioned completely independently.

Furthermore, the system only functions based on two regions of the canvas; the deer has been removed as there was no specific mention of it. Additionally, the rendering of the knight has strayed a lot from its original form which is something I directly would like to avoid.

I tried to use more complex image segmentation however this did not work either.

Mask Generation

GradientGenerationOutput01 GradientGenerationNodes01
GradientGenerationOutput02 GradientGenerationNodes02
GradientGenerationOutput03 GradientGenerationNodes03

Leading on from my experimentation I looked into generating gradients (and therefore masks) within comfyUI. I theorised that this may be an effective way of segmenting the canvas.

Live Masking

One of the objectives for this project was giving the user more control in the 'latent' space. I have found in the past that people struggle to understand the concept of masking and composition, however an understanding of silhouettes is more universal. Following my experimentation in TouchDesigner I explored the in-program capabilities of ComfyUI when it comes to image segmentation.

Caption
Caption Caption Caption
Caption Caption

Nodes

.enter

Caption Caption
Caption Caption
Caption Caption

I experimented using the masking feature as an ImageToImage input and as a native ControlNet but none of my outcomes fit my exact needs. I was hoping to replicate the StreamDiffusion effect where none of the original artefacts are directly visible whilst dictating the composition of the outcome. The test for this system was generating scenery as this requires almost all properties of the original input (human silhouette) to be ignored. The test failed and I theorised how adapting the mask could increase abstraction without forfeiting compositional control.

Caption
Caption Caption
Caption Caption Caption Caption

In my experimentation I accidentally kept in a previous prompt from my testing and found the system was incredibly effective for subjects. The standard I aimed to achieve was near complete abstraction of the input but the true value of this system is in dictating the subject generations. The scenery workflow was designed to generate the background which can afford to differ greatly from the input, the subjects however must be more aligned with the input. It fundamentally gives the user more control, satisfying a key project objective.

ControlNet

ControlNetImagePreProcessing

I theorised there was a better way of creating an image to image system and that is using ControlNet. I followed this tutorial which aided my understanding of ControlNet. I found there are multiple ways of extracting composition-based image data autonomously which seemed to be a much more effective system to the previous methodology. Whilst this is still not an independent system it is highly adaptive and would be more versatile in relation to complex layouts.

ControlNetOutput05 ControlNetNodes05

Further Experimentation

.enter

ControlNetOutput01 ControlNetNodes01 ControlNetOutput02 ControlNetNodes02 ControlNetOutput03 ControlNetNodes03 ControlNetOutput04 ControlNetNodes04 ControlNetOutput06 ControlNetNodes06

I then practically experimented with the limitations of ControlNet. Above is what I deem to be the most successful generation: True to the original composition without straying from the original elements. Furthermore, the elements have been blended well together with universal lighting being applied across the elements and reflections appearing where they should.

Automatic Image Segmentation

Based on the proficiency of ControlNet and depth maps to replicate compositions I explored how masking could further develop the system. Whilst compositionally accurate, ControlNet continually distorts elements: It doesn’t distinguish individual elements and renders everything simultaneously.

Caption

Node Development

.enter

Caption Caption

To avoid this problem I developed an auto-masking system using the depth map output and levels adjustment node. The combination creates a threshold effect, allowing outputs to be passed through a further system as masks to isolate areas of the canvas. One problem with this system is that images have different distances for the foreground, midground and background, meaning complete accuracy could not be guaranteed.

Despite this, a depth map represents the foreground in absolute white and the background in absolute black with greys between. This means the system will be somewhat accurate but would also depend on the number of layers in an image as I have only accounted for three depths for three isolation points.

Exploring how LoRas and ControlNet can be leveraged to increase control in image generation.

LoRas

Philosophy

A LoRa is a Low Rank Adaptation of a model. It is useful in the case of image generation as it allows a system to be more heavily weighted towards a specific style. This is particularly helpful for this project as it introduces a large degree of control back to the project.

In theory, I would be able to create a dataset that represents the compositional and aesthetic style that I want to reproduce: There is the option to train a LoRa on my work.

Recreation01Reference Recreation01Output

Nodes

.enter

Recreation01Nodes

As part of my experimentation I decided to try and recreate specific images using the system I had created. The methodology I used to create this was through more specific prompting. Whilst this was effective in this case I felt it was not a sustainable system. I started looking into LoRas as a way to remedy this. I also thought about how I could potentially train my own LoRa: This would allow me to have a large amount of visual control over the outputs and would be an opportunity to force my style on the generations.

LoRaExperimentation01Output LoRaExperimentation02Output
LoRaExperimentation03Output LoRaExperimentation04Output

Here are side by side comparisons of images generated with and without LoRas: An Art Nouveau LoRa specifically. The prompts remain identical however the styling is completely different. Maintaining the composition of images whilst changing their style opens the door to regenerating images with a higher degree of control without compromising creative freedom.

Nodes

.enter

LoRaExperimentation01Nodes LoRaExperimentation02Nodes LoRaExperimentation03Nodes LoRaExperimentation04Nodes

LoRas and ControlNet

To maintain composition and apply a specific style I chained ControlNet with LoRa models to achieve a poster effect. I used This poster LoRa. My original plan was to use Flux to handle more complex compositions but in my experience StableDiffusion (particularly the earlier models) are more efficient at generating images.

Caption
Caption Caption Caption Caption

I experimented with the strength of the LoRa to explore its effect. Because the LoRa already has a strong foundation from the ControlNet the effect is not as noticable but I still see value in using LoRas to reinforce principles of poster design.

Further Documentation

.enter

Caption
Caption Caption Caption
Caption

Element Refinement System

To ensure quality throughout the process I added an 'element refinement' system that takes the generated images and runs them through an ImageToImage workflow. The higher step count and more robust model generates a higher quality output. The system segments the process and prioritises quality over speed.

Caption Caption

The refinement system functions however the time it takes to run the workflow calls its value into question. This exemplifies the two project objectives clashing: The goal to make a tool that I find valuable conflicts with creating a tool for the non-designer in an exhibition context, skill ceiling vs barrier of entry. It's important to reevaluate the outcome of the project: Am I designing for the designer or democratising design? Given the original project objective I will prioritise the latter.

Composition Refinement System

Similar to the Element Refinement System I theorised how composition could be maintained whilst upscaling and improved cohesion (lighting, shadows and colours) is applied to individual elements. I created a workflow that utilised the Florence2Run custom node which allows images to be automatically captioned by machine learning systems. Manually captioning a composition through combining nodes is possible but would not convey the positional data of the individual elements, disregarding the critical composition phase of the process.

Re-describing a composition benefits the system in that it has a more accurate starting point. Re-generating the image without a caption would almost certainly cause hallucinations which would take value from the previous steps.

Caption Caption
Caption Caption Caption

I tested the system first with simple, single-element compositions to evaluate accuracy. Whilst effective compositionally, the colours were not completely accurate despite the original image's colours being described by Florence2Run. In hindsight I should have used the reference image as the latent image to pass through to the KSampler. This would have increased similarity to the original image but would risk making the image too similar. The goal of the project is not to duplicate existing designs.

Colour Replication

Transferring the colour palette from one image to another would help me autonomously create designs that are visually similar to a reference image. This would allow the user to replicate the style of images without having to define their visual properties. This system would be beneficial to speed of production but would hold most value in allowing users to 'take inspiration' from existing images without needing the specific terminology or knowledge to transfer the properties. This is a vital part of democratising design as it is not necessarily taste that makes the designer (although this is often the first skill needed to become one) but rather the understanding of what is or isn’t successful about a piece. By keeping design decisions like this behind the scenes, the user does not need to specify what they want transferred.

Keeping methodology like this abstract could instill confidence in the user as the complexity of the system is beyond their knowledge. If this is the case is debatable but confidence in an outcome is fundamental to assuring someone they have the capacity (or even the right) to design.

Caption Caption
Caption

I had difficulty applying colour settings to images in ComfyUI: Having used Photoshop for so long I struggled to work in a completely different way which felt constrictive. Although I implement this recolouring system in my practice, the forced simplicity applies well to this context: The goal is not to give the user every option, the goal is to give them a few optimised ones.

More Replication

.enter

Caption Caption
Caption Caption
Caption Caption

Whilst the colour system works well for some references, the effects do not transfer perfectly. For example in the whisky reference the image generation is dark, meaning additional saturation and lightness effects were required to replicate the look of the reference. In the case of the sword and the sunset the colour balance is more accurate at the start meaning the additional effects are over the top.

Caption Caption
Caption Caption Caption Caption

Nodes

.enter

Caption Caption Caption

This system uses the input image as the latent image, making the outputs accurate to their original. For individual asset generation colour accuracy is not of great importance, randomness allows more 'creativity' but in the case of composition refinement (a near final process) it is important to maintain the majority of the design choices the user has made: If the 'refinement' stage completely re-rendered the image the entire system would become redundant. Keeping the properties of the individual assets (including their compositional location and positioning) is essential to the success of the project. Without this the output may as well have been generated in one step.

Creating an agentic prompt extrapolation system.

Prompt Development

Philosophy

I created a system that would autonomously generate a complex prompt, calculate the number of elements, write a prompt for each element and generate lighting, style and contextual prompts all from a single user input. The idea behind this system is that some poster concepts are more complex than others however Flux generates better outputs with a more detailed prompt.

In theory, if the user has a limited idea of what they would like, the LLM will fill in the gaps and create a fully fledged poster concept. If the user has a more specific idea, the system will take less creative freedoms.

To achieve this system I used a combination of Ollama, Qwen3:4b and a custom python script that sends individual introductory prompts to the LLM, creating an agentic approach. I took heavy inspiration from this project that uses a multi-agent system to generate a 20,000 word book. My adapted system is highly effective as the user's input is progressively converted from a simple prompt into a fully fledged poster concept. One important stage of the process is the first stage where the user's input is handled by the Detailer agent whose objective is to fill in gaps and add detail to the initial input. This output is then passed to the next agent which determines the number of elements. The process repeats until all the boxes are filled.

PromptExtrapolator01 PromptExtrapolator02

In the first outcome I realised the system being constrained to 10 elements was overkill. I instead reduced this to three to save time and computational power. This balance between output quality and quantity to time and performance continued to be difficult across all areas of the project. Whilst it is important to generate a high quality outcome it is more important not to lose the viewers attention.

PromptGenTextConcat03

I created a system that would allow me to combine multiple prompts into a single input. This would allow me to apply Lighting, Style, Context and General prompts to every image: Theoretically ensuring visual cohesion. The system is adding text together in simple terms, meaning there is virtually no performance drawback making it an efficient way of promoting coherence.

PromptGenOutput01

I started by running the detailed description of the task through my existing Flux node system. I was pleasantly surprised with the results: The system had created a main character, title, body text and background elements. I was not so pleased with the composition however: I felt the arrangements of the elements were hugely generic which was expected, further highlighting the need for a more complex composition system: The AI cannot be trusted to create dynamic compositions.

Nodes

.enter

PromptGenNodes01 PromptGenOutput02 PromptGenNodes02 PromptGenOutput03 PromptGenNodes03 PromptGenNodes04 PromptGenNodes05

Manual Composition and Rendering

PromptGenOutput04 PromptGenOutput05 RoughComp

I roughly composed these elements as a proof of concept. The resulting composition was fed through the aforementioned ControlNet system. I was very pleased with the outcome as the style of the composition is incredibly consistent yet the composition is identical to the reference image. Unfortunately the text has not been rendered into the composition however the more I think about text, the more I think it may be better included in a code-based process as AI is notoriously bad at rendering typography.

PromptGenOutput06 PromptGenNodes06

Prompt Optimisation

Qwen is a family of open source Large Language Models developed by Alibaba Cloud. They are well known for their efficiency, making them an appealing choice for this part of the project.

I found the time duration to generate a series of prompts (3-10) took between 2-4 minutes on each run. Whilst the quality of the output was satisfactory, in an exhibition setting the generation needs to be faster: The user’s attention is very valuable and if the ‘pre-processing’ stage of the system can take up to 5 minutes I would not feel confident in holding on to attention. This prompted me to explore other LLMs and theorise about how I could improve the system.

Qwen3:4b

BackgroundPrompt.txt
.bullet “A dramatic medieval castle with towering stone spires, weathered stone walls, and intricate arches looms against a vast sky filled with swirling clouds and a crimson sunset, casting long shadows over a misty valley below, with ethereal lighting and a sense of ancient mystery in an epic fantasy style.”

Element1Prompt.txt
.bullet “A majestic dragon, iridescent scales shimmering in hues of gold and crimson, leathery texture with weathered ridges, glowing amber eyes fixed on the horizon, wings spread wide in a dynamic soaring pose, epic fantasy art style, isolated on a transparent background.”

Element2Prompt.txt
.bullet “A brave knight in ornate silver armor, red cloak, holding a glowing sword, dynamic pose, epic fantasy style.”

Qwen2:1.5b

BackgroundPrompt.txt
.bullet “A grand medieval castle perched atop a towering rock formation, bathed in the golden glow of sunset, with an majestic dragon soaring overhead, its scales shimmering in the fading light. The colors are rich and vibrant, featuring deep shades of red, gold, and orange, along with subtle hues of purple and blue. This scene exudes a sense of grandeur and power, evoking a feeling of awe and wonder at the beauty of nature's creation. Epic fantasy art reigns supreme here.”

Element1Prompt.txt
.bullet “A majestic griffin, golden feathers, sharp talons, soaring pose, isolated on a transparent background.”

Element2Prompt.txt
.bullet “A magnificent dragon, soaring over a medieval castle, its scales glittering in the sunset glow. The dragon's wingspan stretches towards the heavens as it soars through the sky, its tail flickering with movement in the wind. This epic fantasy art scene is a testament to grandeur and power, with the castle standing tall atop a rock formation, adding an additional layer of awe to this magnificent dragon's journey.”

Differences:

.bullet Qwen2:1.5b is significantly faster than Qwen3:4b: 17.39s vs 44.20s
.bullet Qwen3:4b performed much better at understanding the task, correctly identifying elements, and generating distinct, relevant prompts for the background and each element.
.bullet Qwen2:1.5b struggled with the nuances of the instructions, leading to incorrect element identification (griffin for dragon) and bleeding descriptions of one element/background into others.

Perhaps the final implementation should be Qwen3:4b for the Detailer prompt that also writes an element list. This would give a more accurate foundation for Qwen2:1.5b to add minor details.

To save time a more pre-defined approach could be more efficient: Specifying style prompts rather than generating them could be more beneficial.

Another solution could be to instruct the LLM to output one cohesive prompt in the form of a table/ spreadsheet.

This testing has been done on my Mac but my PC may be better optimised, reducing the thinking time of the LLM.

Researching existing gesture controlled systems.

Physical Interface Research

Project Soli

Project Soli is a gesture recognition system designed to identify minor motions in the hand, focusing on the thumb and index finger. Soli was designed to create intuitive controls without physical buttons, such as sliding the thumb across the index finger to scroll or tapping the two together to simulate a button press (reminiscent of the new Apple Vision Pro input method). In this video, haptic feedback in gesture-controlled systems is emphasised. Whilst my final outcome will not be precise enough to identify the physical contact between the thumb and index finger, it will be able to measure the proximity between them and respond accordingly, creating a similar effect.

Focusing on the thumb and index finger stems from their communicability and dexterity. When signing letters A-Z only G, S and Z go without direct use of the index finger. The dexterity of the thumb and index finger make them good anchors for intuitive interaction design. Furthermore, using these fingers builds on existing conventions, theoretically enhancing comfort and accessibility.

Although Project Soli seemed promising, it failed to gain mainstream traction. Despite being announced in 2015 and featuring in the Google Pixel 4 (2019) and Nest Hub (2021), it has since faded from consumer devices. The likely reason is that, although the gestures were intuitive to developers, they failed to offer practical advantages over existing tactile or voice-based systems. Intuition for a developer/ designer is not the same for the average user.

This realisation resonates with the PAPA's Park project where systems that felt intuitive to me did not always translate perfectly to non-designers. I plan to develop instructional materials for ease of use. My outcomes should be a streamlined exhibition version for non-designers (to introduce design) and a more complex workflow for designers (to benefit my workflow). The exhibition will hopefully demonstrate how design interaction will change.

Additionally, Project Soli struggled as few visual/ haptic feedback mechanisms grounded the user experience. In contrast, gesture interactions like pinch-to-zoom on smartphones, became mainstream because of consistent feedback and visible outcomes. This emphasises why my project must focus on real-time visual cues to help the user understand the system response

Apple Vision Pro

The mainstream introduction of VR increases the demand for gesture controlled systems again. The Apple Vision Pro for example does away with the traditional keyboard and mouse input, instead opting to monitor the user’s hands and eyes.

The system works by using a variety of cameras to track hand and eye movement. Unlike Project Soli, the system does not use proximity sensors for the hand gestures so the user is free to move their hands wherever they need. I believe this increased freedom has a large appeal as the main selling point of a gesture controlled system is that it feels more fluid but if, like Project Soli, the user must place their hands in a specific position, they may as well use a keyboard and mouse.

Like the hand movement, the eyes of the user are tracked, meaning the system has more contextual awareness. The detriment of conventional human-machine relationships is often the contextual awareness. Humans are able to process body language subconsciously, allowing for more streamlined communication. When we communicate with a machine we often only tell half the story because a computer cannot read tone of voice or body language. This is similar to sarcasm not translating perfectly over text messages.

Increased contextual awareness in machine communication is hugely important. It’s unlikely my system will include any eye tracking but I am very interested in using this technology to improve contextual awareness and analyse designs using data extracted from the subconscious.

Leap Motion Controller

The Leap Motion Controller is a gesture-based input device that enables users to control on-screen content without using a keyboard or mouse. A notable demonstration shows applications including drawing and sculpting and navigating digital interfaces. This aligns closely with my project as I intend for users to control the system entirely through gestures, foregoing normal input devices.

Like Project Soli, Leap Motion emphasises intuitive interaction, but has remained outside mainstream adoption (despite being introduced over a decade ago). This suggests a recurring issue in gestural interface design: the absence of a shared interaction language. Without obvious visual feedback or standardised gestures, the users are unable to understand the system.

Leap Motion has however found continued relevance in VR workflows, particularly in prototyping and immersive experiences. This demonstrates that gestural systems can succeed when paired with visual feedback and spatial awareness. Similarly, my project will provide a visual mapping between input and output, especially in the exhibition, where responsiveness is essential in maintaining engagement.

Microsoft Kinect

The Microsoft Kinect is another gesture-tracking system that, like Leap Motion, aims to change digital content interaction. Designed primarily for gaming, it was released as a competitor to the Nintendo Wii’s motion controls. Although Kinect could detect body silhouettes and basic gestures, it lacked skeletal tracking capabilities which limited accuracy and responsiveness. The system was discontinued in 2015.

Recent advancements in AI detection systems have arguably solved many of Kinect's limitations. Modern tools like MediaPipe can track skeletal real-time data using a webcam. The evolution makes gesture systems more accessible and precise without specialised hardware.

The Kinect was developed with play at its core, which echoes my project's goal to encourage experimentation. The intention of the exhibition is not to transform non-designers into experts but to lower the barrier of entry and demonstrate the possibilities of AI-assisted design. Emphasising engagement through movement shows how interaction design can be playful and meaningful.

One difference is that the Kinect thrived in collaborative or social gameplay, whereas my system is currently designed for solo users. I wonder about the implications of a specific collaborative mode but this would be a future objective. Accidental interactions by bystanders could be embraced (rather than treated as noise) which would not be intentionally collaborative but effective nonetheless.

Golan Levin

Golan Levin is an artist and researcher exploring motion-based interaction as a visual language. Yellowtail is a project where gestures are linked to animated output, forming a live feedback loop to demonstrate control and authorship. His practice emphasises the performative nature of interaction, exploring how design is shaped in real time by motion. Levin believes “The mouse is the narrowest straw you can try to suck all of human expression through”

InterstitialFragmentProcessor

Levin’s Interstitial Fragment Processor explores the impact of their silhouettes in relation to creating art. This is very similar to what I am exploring in my project and so many parallels can be explored. In the work, the subjects make shapes with their silhouettes which drop to the ground using a physics simulator, this is almost exactly what I aim to do except that the shapes created will be used to develop fully fledged graphics. The silhouettes are not exclusively individual as users can create shapes using a combination of their silhouettes, increasing the potential complexity. This opportunity will be replicated in my outcome, I am interested to see if this will happen.

Levin also discusses phonesthesia: The association of shapes and letters/ words. The conclusions are inspired by Wolfgang Köhler’s exploration between what form “takete” takes and what form ”maluma” takes. Arguably this in itself is a design principle.

Levin’s work offers insight for my own system where responsiveness is crucial, showing how creating with movement becomes intuitive when the screen reacts immediately and consistently.

Designing Gestural Interfaces by Dan Saffer

Dan Saffer’s book identifies three foundational principles for gesture-based interaction: Discoverability, feedback and forgiveness.

Dan Saffer argues that discoverability is a weakness of gesture systems because they typically lack any indication of what gestures are possible. Without visible handles or hints users struggle to initiate interaction. Saffer suggests discoverability can be improved with cues, tutorials or conventions to act as entry instructions. In an exhibition setting this is especially important as interaction must be learned instinctively: The invisibility of gestures must be countered with immediate contextual clues or suggestions to make the interaction feel obvious without explicit instruction.

Feedback is described as essential because without it gestures are frustrating. A lack of response causes users to repeat gestures, introducing uncertainty. Every gesture should return a response: Even failed recognition must be acknowledged. Real-time responsiveness keeps users engaged and reassures them that their input is being interpreted. Saffer writes the feedback needs to be proportional (stronger gestures returning stronger reactions) to reinforce the cause-effect relationship.

Gesture recognition should not demand perfection, especially when new users are involved. Forgiveness means tolerating imprecise input and prioritising success over precision. Saffer believes it is better for a system to assume than reject, encouraging user confidence. In practice, this could mean larger gesture thresholds, longer timeouts or smart defaults when input is ambiguous. This approach ensures that early engagement feels rewarding and helps users learn by doing rather than being punished for small mistakes.

Saffer’s believes in a gentle learning curve and forgiving system architecture to mitigate risk of a frustrating gesture interface. This is particularly relevant in an exhibition context where there is no time for onboarding and users must understand the system immediately.

AR Filters

Snapchat Lens Studio and TikTok Effect House provide gesture-responsive effects that show users how their movement is being tracked. On-screen visuals react to face and hand positions with minimal latency, making the systems intuitive. These platforms succeed because of instant response, not because gestures are explained. The direct feedback loop reduces friction, a similar approach will ensure users understand the gesture system without instruction, although I still think these would be helpful.

Exploring a gesture controlled composition system.

P5.js Prototyping

Philosophy

Instead of looking to compose everything inside TouchDesigner from scratch it's important to prototype first. This will bypass some of the technical strain as I develop the system, once complete the principles could be converted into my TouchDesigner project.

The elements that I would like to include are only images and text, the number of which can be defined by the user. These consist of the basic elements needed in a composition which the user can arrange to their satisfaction, the intention is to then run these outcomes through an image generator to realise the user’s vision.

P5JSPrototyping01

I started by creating a simple system that would allow me to organise two element types: Text and image. I added the option to reposition and resize each element. Additionally I added the option to re-organise the layers in the composition.

Whilst this system is very basic it demonstrates how my final application could function. If the blue and red boxes are used as masks or replaced with actual images the user will be able to intuitively create complex compositions.

A potential next step would be to add some compositional guidelines that the elements snap to/ some indication to show the user that there is intelligence in the architecture.

Gesture Controls

I realised I was not directly tied to TouchDesigner as its main value was the MediaPipe plugin. Being that the plugin utilises the MediaPipe library I realised I could incorporate it into a standalone system. I implemented the MediaPipe Hand Tracking library into a P5.js sketch and began prototyping.

P5JSPrototyping01

Initially I added controls to identify when the user was 'tapping'. I found the most intuitive way to do this was to use the index finger and thumb. The system would identify when the two fingers were close enough together and trigger a tap event. The system is similar to the Apple Vision Pro's input method which speaks highly of the intuitiveness of the system.

P5JSPrototyping02

I then added some images to the canvas to test the system. Taking inspiration from my earlier prototype I created a simple drag and drop system that allowed for the repositioning of the images.

P5JSPrototyping03

I then isolated the position of the pinkie finger tip to introduce scaling and rotation. I realised an intuitive way to rescale an image was to open and close the fist, similarly rotating the hand would represent the rotation of the element intuitively.

To track these values I drew a line between the pinkie finger point and the index finger point. This allowed me to identify the initial length of the line and scale the selection based on the new size of the line. Simliarly, I could isolate the rotation of the line to determine the rotation of the element. I believe the system builds on intuition and doesn't require the user to learn a completely new interaction system.

Additional Developments

.enter

P5JSPrototyping04

I then ensured the background image was always behind the other elements to prevent it blocking them. The layering effect is a more complex problem to solve so implemented this simple solution.

P5JSPrototyping05

As the final section materialised I added a native side by side view so the user could visualise how their input was being processed. I also added live instructional text to the hand tracking to encourage movement.

Exploring the relationship between complexity and barrier to entry in design software.

Photoshop vs MS Paint

Being that my outcome is focused on intuition, I researched the development of graphic design software. It is important to understand how the tools of design have changed over time: Investigating intuitive design systems like Microsoft Paint compared to more complex software, like Photoshop, will help me understand how software changes the barrier of entry.

Photoshop began with image correction and manipulation, early uses were technical but not intimidating. As Photoshop’s growth is tied directly to user demand new features were added to serve professionals who needed more precision, more control, more output options therefore shifting the priority away from new users. Over time, the software’s identity changed from a creative aid to an industrial standard

In contrast, Microsoft Paint stayed still and remained built for quick sketches. It has not been updated to serve professionals because its user base is not professional. Photoshop expanded to meet an evolving need, industry ‘forced’ complexity.

The evolution of design software is not separate from the evolution of design itself. Photoshop’s expansion mirrors the question of how design principles can be encoded. Each Adobe program is a visual principle made operational. Adjustment layers are essentially colour theory in a panel and layers are hierarchy. The entire interface is built around design principles.

In relation to my project I have been theorising about the best way for the system to operate. Whilst I never consciously thought about layers conveying hierarchy I still gravitated towards this system. Perhaps a design system made by the designer will have an inherently higher success rate as the design principles that guide the industry are hard wired into anything I create.

Design systems often present as objective, but they encode assumptions. Photoshop’s interface teaches a way of seeing and editing. It becomes a visual language, just as Western design principles reinforce their own logic, appearing universal through repetition. Photoshop, Illustrator and InDesign also reinforce the grid and reward alignment: Visual unpredictability is, in some ways, punished.

In the PAPA’s Park project I tried to make a system that others could use without needing to know the theory behind it which is similar to the early graphic design software appeal. As these systems matured, they became harder to use because they tried to do everything. The more features to support control, the higher the barrier to entry became.

To avoid the ‘trap’ of most modern design software it is important that my outcome remains focused. In the development of the project I have seen the capabilities of the workflows I have created and have had impulses to apply them to different contexts. It is vital that the project remains focused as the smaller the scope of my outcome, the more understandable it is. It’s important to take inspiration from Golan Levin’s Interstitial Fragment Processor in the way that the exhibition premise is simple, leaving the user’s creativity to complexify the outcome.

If complexity is something that requires more exploration, adding an abstraction layer (as I did with the weather presets in PAPA’s Park) allows complexity to be added without the requirement for direct understanding. Photoshop does not include this progressive exposure feature and assumes user understanding, hence the entry barrier.

The history of software and my project revolve around encoding human decisions into procedural steps. In my system, I can decide which principles are programmed and the extent of their flexibility. The question isn’t whether a machine can design. It’s whether it can offer the user enough space to find their own voice.

Arguably simplicity is less valuable but it hides decisions. An output, like in the case of the PAPA’s Park outcome, might seem playful and unrefined, but it’s backed by a value system. The design decisions are not completely different to the choices embedded in Photoshop menus, they’re just more user specific.

Systems are only as good as their assumptions: If the goal is participation accessibility is not a compromise. If the system becomes too sophisticated it’s no longer generative. complex software enables refined aesthetics but do not necessarily produce better ideas.

More isn’t always better, feature-rich systems can lose their purpose. In the case of Photoshop this has led to more creative freedom and industrial value but the more specific a tool is, the more qualified it is to solve a problem. I touched on this idea in my dissertation (link here) but I think the future for design software/ systems is in specific tools. Systems that aim to do everything tend to hallucinate.

The final system.

Final Outcome

_

Despite the difficulties faced chaining commands previously I attempted one last time to create a linked system that generated each stage of the process individually as part of a larger system. To do this I needed to export the ComfyUI workflows as .json files and import them into my project that would call them via a python script.

Workflows

_

This node set is responsible for extrapolating the user's concept and generating a complex description. It allows the next processes to work from a more concrete platform, making the resulting prompts more detailed.

_

These nodes are responsible for generating individual positive and negative prompts. They are fed default prompts which are pre-defined in the system, ensuring quality. The first couple of steps of the system ensure the output information is of high quality regardless of the user's input.

This workflow is looped to generate prompts for the background, subject1, subject2 and subject3.

_

Taking the positive and negative prompts from the previous nodes, the system runs through an image generation cycle. The model chosen is SDXL which was selected for its high speed. The node system processes the image, creates a depth map of the resulting image and re-renders the initial output.

Segmenting the process without exporting each output allows for composition to be better controlled whilst generating a good quality output. The first generation focuses on the composition, using a low step count to maintain similarity to the main image. The second generation builds upon the composition. This system is not dissimilar to the unsuccessful 2x rounds of denoising from earlier in the project. This system however uses ControlNet which gives far more control over the composition.

I have also added an image generation node system without the background removal node for the background.

_

The final ComfyUI related workflow is a simple refinement process. The input image is automatically captioned and sent through further nodes consisting of a LoRa stack and further ControlNet nodes. The system ensures compositional consistency but gives enough 'creative freedom' to generate details within these parameters.

This is by far the longest computational process but it is the last step, meaning it can be afforded extra time as the user will be anticipating the result and therefore it is unlikely they will lose interest. It takes roughly 60 seconds.

Process

Prompting

Initial Input:
.bullet "Please generate a poster of three superheroes, one is a bunny rabbit, one is a dog and the other is a gorilla. They should be in a detailed setting, exploring a new planet."

Detailer Output:
.bullet "A visually stunning sci-fi setting showcases three superheroes exploring an alien planet, each with unique attributes. The background features vibrant, lush vegetation and towering rock formations that create an intriguing backdrop for our dynamic trio."

Elements:
"1. In the foreground stands a bunny rabbit, donned in armor with sharp claws extended, exuding both agility and ferocity as he takes point on their journey through this mysterious world.
2. The dog superhero follows closely behind, boasting sleek, armored fur and powerful paw prints that leave deep impressions in the alien soil. He carries a plasma cannon at his side, ready for action when needed.
3. Finally, the gorilla superhero towers over the other two heroes with massive strength. Clad in a high-tech suit of armor, he wields an enormous energy shield that pulses menacingly as they venture further into this alien landscape.
4. The scene is illuminated by soft, ambient glow emanating from luminescent plants and crystalline formations scattered throughout the environment.
5. Poster style: Combining realistic, photorealistic elements with dynamic comic book aesthetics to create a visually arresting design that captures the action and adventure of this superhero team as they explore an unfamiliar planet together."

Background Positive Prompt:
.bullet "Vast ancient forest at twilight, glowing bioluminescent plants, moss-covered stones, dense fog curling between towering trees, (shafts of light filtering through the canopy), enchanted atmosphere, high detail, cinematic lighting, fantasy concept art, subtle color grading, volumetric shadows, 8K ultra resolution."

Subject1 Positive Prompt:
.bullet "A bunny rabbit donned in armor with sharp claws extended, agile and fierce demeanor, photorealistic elements combined with dynamic comic book aesthetics, (epic pose), superhero genre illustration, stylized fur, strong contrasts between highlights and shadows."

Subject2 Positive Prompt:
.bullet "Dog superhero in armor, sleek fur texture, sci-fi setting, plasma cannon accessory, dynamic composition, vibrant colors, photorealistic style with comic book aesthetics (realism:0.8, detailing:1.3), Stable Diffusion art (not Poster Concept)."

Subject3 Positive Prompt:
.bullet "gorilla superhero (high-tech suit of armor, energy shield, massive strength), sci-fi setting (vibrant vegetation, towering rock formations), photorealistic comic book aesthetics (dynamic composition, soft ambient glow), cinematic lighting (luminescent plants, crystalline formations)., cinematic lighting, realistic, best quality, high quality, realistic, best quality, high quality"

Negative Prompt:
.bullet "dimly lit, cartoon, deformed, illustration, blurry, low quality, worst quality, jpeg artifacts"

Masking

_
_ _ _

I utilised another of Google's MediaPipe image segmentation models to allow the user to generate masks live. This allows further control over the image in an intuitive, gesture-controlled way.

Generation

_ _ _ _

These are the elements generated using the masks and prompts generated previously. They hold a slight resemblance to the original mask without completely constricting the AIs ability to generate images.

The quality of these images is not the best compared to the rest of my testing but optimisations had to be made for efficiency.

Composition

_ _

I then created a basic composition to display all three elements and the background.

Refinement

_

Because each element had different styles and different levels of cropping the refinement process was not effective. Some stylistic principles were applied but the elements were not distinguishable which is unfortunate being that this was a big part of the process.

Reflection

Overall, I am pleased with the final outcome however I was disappointed that the system does not function autonomously. I resorted to manually transferring between stages which does not fulfil the original objective of the project but time restrictions dictated this. I hope to continue development and have a fully autonomous system for the degree show.

The modular structure allowed each stage of the system to remain independent, testable and modifiable. Separating these functions into reusable .json ComfyUI workflows created the foundation that could easily be adapted for other applications. The system could scale to accommodate user-generated datasets, alternative image generation models, or function as a creative assistant within a graphic design studio. Now the scaffolding for the system is in place, different parts of the system can be tweaked to generate vastly different results: Adding further nodes and tweaking settings in the .json can completely change the output with relative ease. Although full automation is not yet achieved, the pipeline’s structure supports further development towards a more seamless. Aside from this change I would have liked to explore custom LoRa’s in more detail, taking even fuller control of the system.

Importantly, this final stage gave further insight into whether a rule-based, automated system could produce outcomes equal in visual success to manually crafted design. Whilst the outputs generally lacked polish or coherence, the system makes design more accessible. As a user’s taste develops they should be able to modify the system to meet their requirements, I hope this would be the first step towards encouraging people to interact with design. My idea has always been that automation can augment creativity rather than replace it.

The quality of the outputs are not to my liking, although the composition in the refinement stage is maintained I hoped the denoising stage would add further details to the elements and make them cohesive. The output seems to have made the sections visually consistent but at the cost of their individual distinction. To remedy this I would add further stages to the original image generation to prioritise quality of output. I also would reduce the impact of the masking process on the initial generation: I ensured the role of the mask was visible in the outcome so the user could identify the result of their actions. This came at the cost of the individual asset generations as the subjects were forced to follow a specified form.

Whilst I wouldn’t consider the outputs to be successful in every sense, the value of the system is not placed on the generations exclusively. Supplying detailed prompts allows the user to mood board without manually researching for images means the system can be used as a research or prototyping tool. As the system now functions as individual parts, these can be tweaked to fit different contexts, providing value to users where they see fit. As is the case with any creative tool, value stems from how people use it.

The system could be used by a junior designer to rapidly create speculative concepts based on a written brief. Reducing the time spent on research would improve professional efficiency, whereas in an educational context it could help students understand how prompts impact compositional decisions through comparing inputs and outputs.

I'm aware that, as the system does not function as intended, I have not created many actual graphics for this unit. I've realised recently that being a conventional graphic designer does not interest me as much as it did although the idea of looking at a canvas rather than code is incredibly appealing at this moment.

Evaluation of the Tate TT&E experience and exhibition approach

Tate Tech Tea & Exchange

I was an active participant in the invite-only residency exploring the potential of generative AI in a collaboration between Tate, Anthropic, UAL & Goldsmiths. I thoroughly enjoyed the experience as I learnt more about the potential of AI in a design specific context whilst asking myself questions about my personal practice.

A question I haven’t considered in huge detail is the wider implications of AI being integrated into every workflow. Conventional skills like knowledge recall and technical proficiency will surely decrease as information becomes even more accessible and most tools run through an LLM of some kind. Whilst this makes our lives ‘easier’ in the short term, it would surely reduce individual intelligence and problem solving skills in the long term. This sort of discussion reminds me of the fictional WALL-E humans who have lost almost all capacity. It could be argued that even pre-AI popularisation we were moving in this direction but the mainstream introduction of this new technology will surely accelerate our journey towards this state of being. Although AI will likely make our lives ‘better’ in the short term it will likely seize control or reduce human function, both seemingly have the same result.

Is it our responsibility to ensure humans have the most cognitive skills or is it instead more important to ensure humans have the most capacity (larger projects etc) even if they are not 100% responsible for them? Surely the answer is somewhere in the middle but this definition is extremely difficult to identify.

If there is intelligence smarter than humans do we even need to worry about our own skills? People should have the opportunity to choose what type of life they live but with freedom of choice some people can do the wrong thing: People can choose to eat healthily and exercise or eat poorly and live sedentary, many people make the wrong decision but enforcing such positions would be highly unpopular. It is inherently human to have the right to make that choice.

This may sound like I believe no good will come from increased intelligence but it is quite the opposite: I see huge benefits from an intelligence revolution and it is the choice of humans whether we will sacrifice our agency.

In its current state I see AI as more of a collaborator than a separate entity: without a well worded prompt there will be a poorly generated response. The LLM’s response should then be seen as a modifiable first draft rather than a final outcome. The role of the human in this process will surely diminish as time passes but for most tasks I can’t see humans completely excluded. AI systems are replacing repetitive tasks like layout generation or asset resizing, the system actively removes labour that justifies the involvement of a designer. When the majority of the process is automated, what role remains for the designer? Theoretically, AI frees the designer to work on higher level decisions, but it also discredits the human touch. While the system improves speed, flexibility and autonomy it threatens freelancing infrastructure: If a client can generate branded content why would they need a designer at all? In building a tool that democratises and streamlines the process I am removing the value of design work but if it’s not me someone else will get there first.

Regardless, there will always be a validation check for anything created by an AI, therefore I think humans will always play a role in a process but it will diminish in all stages but project management. It's our responsibility to use AI responsibly. Even now there are examples of people blindly using AI and resolving all responsibility: People using AI ignorantly in professional settings, for example this author left an LLM prompt in their book.

Another question that was raised often is the environmental impact of AI. Whilst this is a huge concern for server-based AI (like ChatGPT) running everything locally has a near negligible impact on the environment: Because everything is hosted on local machines, rather than a huge server farm, you only need to account for the energy costs rather than the huge amount energy in addition to the robust water-hungry cooling systems.

The experience definitely made me question how I use AI and the potential wider implications of it. For these reasons I have specifically made decisions that use AI in the most ethical ways I can think of. No one knows the true impact AI will have on society in the long run. I think the best course of action is staying up to date with AI developments and not outsourcing all work to it.

Exhibition

Tate Show Setup

As part of the Tate Tech Tea and Exchange project we had to exhibit work that used machine learning. Ambitiously I decided to exhibit my entire system working for the public. Unfortunately, this didn't materialise but I was able to talk to people about my work and show snippets of what I had created. I realised it’s important to have a non-live demonstration of work to show in case something goes wrong, this is something that I will be ensuring for the degree show.

To avoid travelling with my Windows computer across London I set up a remote access system that allowed me to access ComfyUI remotely from my Mac. I set up a Cloudflare Tunnel to allow me to access the system from anywhere. Whilst this did serve its purpose the transfer of data over the internet did add delays to the system which elongated the process. I realised the 'real time' nature of the project had disappeared and the system was in a strange place where it catered to neither the designer nor the non-designer.

The process of setting up a remote server forced me to learn a great deal about servers and remote connections. This project spans a wide range of technical processes and I have learned a huge amount. I certainly feel like a more capable person.

System Rethink

One of the biggest takeaways was that the system was taking too long to respond: The chain of processes was more or less complete but the lack of user interface and many bugs prevented full functionality. The calls made to ComfyUI took too long and the system was generally not a useful enough tool with all the delays. I understand attention is highly valuable in an exhibition context and without a faster system I would lose it.

To get around this I started investigating a system that was more integrated into ComfyUI: By creating a process that made fewer transfers between workflows and uploading and downloading I could reduce time saving and transferring, removing a large portion of the time spent in the system. The main user input is in the initial prompt, masking and composing, meaning only three transfers would be necessary.

Degree Show

Degree Show Setup

For the degree show I would like to display an interactive version of my system. The premise of this project is to give users the opportunity to create complex compositions without advanced design knowledge. I can’t see this happening outside of an exhibition context.

Because my system is geared towards non-designers it is important that my exhibition is not exclusively appealing to technical people. As my work is exclusively digital, input and output, it is important that the setup does not look like a standard setup. It will be difficult to package a screen, camera and microphone in a way that is appealing for a non-designer however a large part of my exhibition philosophy is that a projector will hugely help this.

I think the gesture controlled system will encourage interest and experimentation so I would like to spend as much time in these phases as possible. The backend prompt and image generation should be relatively efficient with a maximum non-interaction time of 30 seconds per phase. During these phases the user should be able to see how the system is processing the data.

As for a non-live demo of the system I would choose to display previous user output and a short video of their process to demonstrate the full capacity of the outcome. Similarly, I would see a lot of value in displaying some printed versions of the outcomes, especially if they relate in some way to instructions on how to interact with the system.

Design Principles

Having been so wrapped up in the process of creation I've had time to think about the design principles underpinning my outcome. Initially my plan was to apply compositional systems like the rule of thirds or scaling systems like the golden ratio however I have realised the more basic principles like form and colour are more important. Whilst I never consciously thought about making every element the same colour, for example, the workflows I have built reflect these principles. The prompt system ensures the lighting is consistent across each element, satisfying cohesion. The repeated stylistic prompts ensure visual properties are repeated. The compositor allows the user to create their own hierarchy without having to think about it. Whilst these are not the most complex ideas, they have rubbed off onto the workflow I have created. A non-designer creating a system may have not considered the importance of these factors but it has been second nature throughout the project.

Whilst I would not credit the success of any design on design principles exclusively I do believe they are a useful guideline more than anything else. For most people they are unaware of these guidelines and so do not consider them. In the case of this system the design principles are automatically applied, allowing the user to follow or violate them without needing to understand them. This is the key to the success of the system: The user can create a composition that is visually appealing without needing to understand why it is visually appealing.

Final conclusion and reflection

Conclusion

Through the final weeks I have contemplated whether creating a design system was worth the time. I’ve found that anything I have attempted to make has been incredibly difficult programmatically and I have been increasingly comparing efficiency of my system compared to the existing pillars of design software.

I appreciate the intention behind creating something larger but I realised towards the end of the project that I am not a software engineer and I don’t have the know-how to solve complex problems within highly detailed architecture. I set out with the goal of creating a relatively simple fix to a problem I had noticed in my personal work but the temptation to keep solving problems as I developed got in the way. As I continued development and added further features the complexity of the project ballooned and I struggled to keep track of every function.

With the knowledge I have now I certainly feel like a more capable designer: My understanding of ComfyUI specifically will hugely expand the projects I am able to take on. Unfortunately, an understanding of how to use a program does not translate into knowledge of how to build on it and create something truly new. Had my initial approach been to create a complex comfyUI system I would have satisfied this but integrating it into different phases of a project came with huge difficulty. I realised late in the project that the fewer transitions between systems the better but even with more condensed architecture the problem I set out to solve requires a huge amount of technical understanding that would take more than a few months to truly develop.

I found that frequently in the project I was looking for tutorials that had very little coverage. Whilst this indicates I am creating something unique, I required further guidance that wasn’t available. Individually each phase of the project has functioned perfectly but chaining them together introduced bigger problems.

Retrospectively thinking I may have approached the problems I was facing in the way a graphic designer would. My plan for the architecture revolved heavily on a hierarchical system with a chain of command. Whilst in principle this sounds sensible, processing the amount of functions and data with my initial limited knowledge of coding (especially Python) led to some big problems. The segmentation of processes often led to problems when I went to combine them but they would not function together if I couldn’t test them in isolation.

Upon reflection I think this may be a problem for a multi-person team although I am proud of the progress I made. I felt often throughout the project that my attention was being dragged in different directions. This gave me a birds eye view which was helpful as it allowed me to understand the bigger picture but not having a specialty never allowed me to truly test any features to my satisfaction. Had this been a team project I would have dedicated a phase of the system and made sure it had a use case. I think after working on this project for so long I lost sight of the original problem I set out to solve.

In hindsight being more realistic with my objective as the deadline approached would have been sensible. I was a victim of blind optimism, unfortunately my vision did not come to fruition in the way I pictured it.