LukeW

LukeW | Digital Product Design + Strategy
Expert articles about user experience, mobile, Web applications, usability, interaction design and visual design.

Video: Structuring Website Content with AI

Sun, 12/03/2023 - 2:00pm

To create useful conversational interfaces for specific sets of content like this Website, we can use a variety of AI models to add structure to videos, audio files, and text. In this 2.5 minute video from my How AI Ate My Website talk, I discuss how, and also illustrate that if you can model a behavior, you can probably train a machine to do it at scale.

Transcript

There's more document types than just web pages. Videos, podcasts, PDFs, images, and more. So let's look at some of these object types and see how we can break them down using AI models in a way that can then be reassembled into the Q&A interface we just saw.

For each video file, we first need to turn the audio into written text. For that, we use a speech-to-text AI model. Next, we need to break that transcript down into speakers. For that, we use a diarization model. Finally, a large language model allows us to make a summary, extract keyword topics, and generate a list of questions each video can answer.
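The pipeline described above can be pictured as a simple orchestrator. This is a minimal sketch, not the actual Ask Luke code: the model functions here are stand-in stubs for whatever speech-to-text, diarization, and LLM services a real implementation would call.

```python
from dataclasses import dataclass, field

# Stand-in stubs for the real models; an actual pipeline would call
# speech-to-text, diarization, and LLM services here.
def speech_to_text(video_path):
    return "welcome everyone today we talk about forms"

def diarize(transcript):
    return [{"speaker": "SPEAKER_00", "text": transcript}]

def summarize(transcript):
    return transcript[:40]

def extract_topics(transcript):
    return sorted({w for w in transcript.split() if len(w) > 5})

@dataclass
class VideoRecord:
    path: str
    transcript: str = ""
    segments: list = field(default_factory=list)
    summary: str = ""
    topics: list = field(default_factory=list)

def process_video(path):
    rec = VideoRecord(path=path)
    rec.transcript = speech_to_text(path)        # 1. speech-to-text model
    rec.segments = diarize(rec.transcript)       # 2. diarization model
    rec.summary = summarize(rec.transcript)      # 3. LLM: summary
    rec.topics = extract_topics(rec.transcript)  # 3. LLM: keyword topics
    return rec
```

A real implementation would also generate the list of questions each video can answer in the same LLM pass.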

We also explored models for identifying objects and faces, but don't use them here. But we did put together a custom model for one thing, keyframe selection. There's also a processing step that I'll get to in a bit, but first let's look at this keyframe selection use case.

We needed to pick out good thumbnails for each video to put into the user interface. Rather than manually viewing each video and selecting a specific keyframe for the thumbnail, we grabbed a bunch automatically, then quickly trained a model by providing examples of good results. Show the speaker, eyes open, no stupid grin.

In this case, you can see it nailed the which Paris girl are you backdrop, but left a little dumb grin, so not perfect. But this is a quick example of how you can really think about having AI models do a lot of things for you.

If you can model the behavior, you can probably train a machine to do it at scale. In this case, we took an existing model and just fine-tuned it with a smaller number of examples to create a useful thumbnail picker.
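One way to picture this kind of fine-tuning is a tiny classifier trained on labeled keyframe examples. This sketch is purely illustrative: the features (speaker visible, eyes open, grin) and the perceptron-style training loop are assumptions standing in for the image model that was actually fine-tuned.

```python
# Purely illustrative: label a handful of candidate keyframes with
# assumed features, then train a tiny perceptron-style classifier.
# Label 1 = good thumbnail, 0 = bad.
examples = [
    ({"speaker_visible": 1, "eyes_open": 1, "grin": 0}, 1),  # good thumbnail
    ({"speaker_visible": 1, "eyes_open": 1, "grin": 1}, 0),  # dumb grin
    ({"speaker_visible": 1, "eyes_open": 0, "grin": 0}, 0),  # eyes closed
    ({"speaker_visible": 0, "eyes_open": 0, "grin": 0}, 0),  # no speaker
]

def train(examples, epochs=20, lr=0.5):
    # Perceptron updates: nudge weights whenever a prediction is wrong.
    w = {k: 0.0 for k in examples[0][0]}
    b = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            pred = 1 if b + sum(w[k] * v for k, v in feats.items()) > 0 else 0
            err = label - pred
            if err:
                for k, v in feats.items():
                    w[k] += lr * err * v
                b += lr * err
    return w, b

def score(w, b, feats):
    return b + sum(w[k] * v for k, v in feats.items())
```

The point mirrors the talk: a small number of labeled examples of "good results" is enough to turn a generic model into a useful thumbnail picker.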

In addition to video files, we also have a lot of audio, podcasts, interviews, and so on. Lots of similar AI tasks to video files. But here I wanna discuss the processing step on the right.

There's a lot of cleanup work that goes into making sure our AI generated content is reliable enough to be used in citations and key parts of the product experience. We make sure proper nouns align, aka Luke is Luke. We attach metadata that we have about the files, date, type, location, and break it all down into meaningful chunks that can be then used to assemble our responses.
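A rough sketch of that processing step, with an assumed alias table and metadata fields: align proper nouns, then break the text into chunks that each carry the file's metadata.

```python
# Assumed alias table: make sure proper nouns align, aka Luke is Luke.
ALIASES = {"Luke Wroblewski": "Luke", "LukeW": "Luke"}

def normalize_names(text):
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text

def chunk_document(text, metadata, max_words=50):
    """Break text into fixed-size word chunks, attaching file metadata
    (date, type, location) to each chunk for later retrieval."""
    words = normalize_names(text).split()
    chunks = []
    for i in range(0, len(words), max_words):
        chunks.append({
            "text": " ".join(words[i:i + max_words]),
            **metadata,  # e.g. {"date": ..., "type": ..., "location": ...}
            "chunk_id": len(chunks),
        })
    return chunks
```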

Video: Expanding Conversational Interfaces

Thu, 11/30/2023 - 2:00pm

In this 4 minute video from my How AI Ate My Website talk, I illustrate how focusing on understanding the problem instead of starting with a solution can guide the design of conversational (AI-powered) interfaces. So they don't all have to look like chatbots.

Transcript

But what if instead we could get closer to the way I'd answer your question in real life? That is, I'd go through all the things I've written or said on the topic, pull them together into a coherent reply, and even cite the sources, so you can go deeper, get more context, or just verify what I said.

In this case, part of my response to this question comes from a video of a presentation just like this one, but called Mind the Gap. If you select that presentation, you're taken to the point in the video where this topic comes up. Note the scrubber under the video player.

The summary, transcript, topics, speaker diarization, and more are all AI generated. More on that later, but essentially, this is what happens when a bunch of AI models effectively eat all the pieces of content that make up my site and spit out a very different interaction model.

Now the first question people have about this is how is this put together? But let's first look at what the experience is, and then dig into how it gets put together. When seeing this, some of you may be thinking, I ask a question, you respond with an answer.

Isn't that just a chatbot? Chatbot patterns are very familiar to all of us, because we spend way too much time in our messaging apps. The most common design layout of these apps is a series of alternating messages. I say something, someone replies, and on it goes. If a message is long, space for it grows in the UI, sometimes even taking up a full screen.

Perhaps unsurprisingly, it turns out this design pattern isn't optimal for iterative conversations with sets of documents, like we're dealing with here. In a recent set of usability studies of LLM-based chat experiences, the Nielsen Norman Group found a bunch of issues with this interaction pattern, in particular with people's need to scroll long conversation threads to find and extract relevant information. As they called out, "this behavior is a significant point of friction, which we observed with all study participants."

To account for this, and a few additional considerations, we made use of a different interaction model, instead of the chatbot pattern. Through a series of design explorations, we iterated to something that looks a little bit more like this.

In this approach, previous question and answer pairs are collapsed, with a visible question and part of its answer. This enables quick scanning to find relevant content, so no more scrolling massive walls of text. Each question and answer pair can be expanded to see the full response, which as we saw earlier can run long due to the kinds of questions being asked.

Here's how things look on a large screen. The most recent question and answer is expanded by default, but you can quickly scan prior questions, find what you need, and then expand those as well. Net-net, this interaction works a little bit more like a FAQ pattern than a chatbot pattern, which kind of makes sense when you think about it. The Q&A process is pretty similar to a help FAQ. Have a question, get an answer.
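As a rough data-model sketch (not the site's actual code), the collapsed-thread pattern boils down to rendering every pair as a question plus a short answer preview, expanding only the most recent one:

```python
def render_thread(pairs, preview_chars=60):
    """Render question/answer pairs with only the most recent pair
    expanded; earlier pairs collapse to the question plus a short
    answer preview for quick scanning."""
    lines = []
    for i, (question, answer) in enumerate(pairs):
        expanded = i == len(pairs) - 1
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}" if expanded else f"A: {answer[:preview_chars]}…")
    return "\n".join(lines)
```

In a real UI each collapsed pair would be expandable on demand; here the truncation just stands in for that affordance.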

It's a nice example of how starting with the problem space, not the solution, is useful. I bring this up because too often designers start the design process with something like a competitive audit, where they look at what other companies are doing and, whether intentionally or not, end up copying it, instead of letting the problem space guide the solution.

In this case, starting with understanding the problem versus looking at solutions got us to more of a FAQ thing than a chatbot thing. So now we have an expandable conversational interface that collapses and extends to make finding relevant answers much easier.

AI Models Enable New Capabilities

Tue, 11/28/2023 - 2:00pm

In the introduction to my How AI Ate My Website talk, I frame AI capabilities as a set of language and vision operations that allows us to rethink how people experience Web sites. AI tasks like text summarization, speech to text, and more can be used to build new interactions with existing content as outlined in this short 3 minute video.

Transcript

How AI Ate My Website. What do most people picture when they hear that title, AI eating a website? They might perhaps imagine some scary things, like a giant computer brain eating up web pages on its way to global dominance.

In truth though, most people today probably think of AI as something more like ChatGPT, the popular large language model from OpenAI. These kinds of AI models are trained on huge amounts of data, including sites like mine, which gives them the ability to answer questions such as, Who is Luke? ChatGPT does a pretty good job, so I guess I don't need an intro slide in my presentations anymore.

But it's not just my site that's part of these massive training sets. And since large language models are essentially predicting the next token in a sequence, they can easily predict very likely, but incorrect answers. For instance, it's quite likely a product designer like me went to CMU, but I did not. Even though ChatGPT keeps insisting that I did, in this case, for a master's degree.

No problem though, because of reinforcement learning, many large language models are tuned to please us. So correct them, and they'll comply, or veer off into weird spaces.

Let's zoom out to see this relationship between large language models and websites. A website like mine, including many others, has lots of text. That text gets used as training data for these immense auto-completion machines, like ChatGPT. That's how it gets the ability to create the kinds of responses we just looked at.

This whole idea of training giant machine brains on the totality of published content on the internet can lead people to conjure scary AI narratives.

But thinking in terms of a monolithic AI brain isn't that helpful for understanding AI capabilities and how they can help us. While ChatGPT is an AI model, it's just one kind, a large language model. There's lots of different AI models that can be used for different tasks, like language operations, vision operations, and more.

Some models do more than one task, others are more specialized. What's very different from a few years ago though, is that general purpose models, things that can do a lot of different tasks, are now widely available and effectively free.

We can use these AI models to rethink what's possible when people interact with our websites, to enable experiences that were impossible before, to go from scary AI thing to awesome new capabilities, and hopefully make the web cool again, because right now, sorry, it's not very cool.

Early Glimpses of Really Personal Assistants

Fri, 11/24/2023 - 2:00pm

Recently I've stumbled into a workflow that's starting to feel like the future of work. More specifically, a future with really personal assistants that accelerate and augment people's singular productivity needs and knowledge.

"The future is already here – it's just not evenly distributed." -William Gibson, 2003

Over the past few months, I've been iterating on a feature of this Website that answers people's digital product design questions in natural language using the over 2,000 text articles, 375 presentations, 100 videos, and more that I've authored over the past 28 years. While the project started primarily as a testbed for conversational interface design, it's morphed into quite a bit more.

Increasingly, I've started to use the Ask Luke functionality as an assistant that knows my work almost as well as I do, can share it with others, and regularly expands its usefulness. For example, when asked a question on Twitter (ok, X) I can use Ask Luke to instantly formulate an answer and respond with a link to it.

Ask Luke answers use the most relevant parts of my archive of writings, presentations, and more when responding. In this case, the response includes several citations that were used to create the final answer:

  • a video that begins at the 56:04 timestamp, where the topic of name fields came up in a Q&A session after my talk
  • a PDF of a presentation I gave on mobile checkout, where specific slides outlined the pros and cons of single name fields
  • and several articles I wrote that expanded on name fields in Web forms

It's not hard to see how the process of looking across thousands of files, finding the right slides, timestamps in videos, and links to articles would have taken me a lot longer than the ~10 seconds it takes Ask Luke to generate a response. Already a big personal productivity gain.
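A minimal sketch of this kind of retrieval, with a bag-of-words stand-in for a real embedding model: rank indexed chunks by similarity to the question and return the top few along with their citation metadata.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, index, k=3):
    """Return the top-k chunks (with their citation metadata) most
    similar to the question; these ground the generated answer."""
    q = embed(question)
    ranked = sorted(index, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]
```

The retrieved chunks, each carrying its source file and timestamp or slide metadata, are what make the detailed citations in the response possible.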

I've even found that I can mostly take questions as they come to me and produce responses as this recent email example shows. No need to reformat or adjust the question, just paste it in and get the response.

But what about situations where I may have information in my head but haven't written anything on the topic? Or where I need to update what I wrote in light of new information or experiences I've come across? As these situations emerged, we expanded the admin features for Ask Luke to allow me to edit generated answers or write new answers (often through audio dictation).

Any new or edited answer then becomes part of the index used to answer subsequent questions people ask. I can also control how much an edited or new answer should influence a reply and which citations should be prioritized alongside the answer. This grows the content available in Ask Luke and helps older content remain relevant.
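One simple way to implement that kind of influence control is to scale a chunk's retrieval similarity by its curation status. The statuses and boost values here are invented for illustration:

```python
def ranked_score(chunk, similarity):
    """Scale a chunk's retrieval similarity by its curation status so
    edited or newly written answers outrank archived content.
    Boost values are invented for illustration."""
    boosts = {"edited": 2.0, "new": 1.5, "archived": 1.0}
    return similarity * boosts.get(chunk.get("status", "archived"), 1.0)
```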

Having an assistant that can accept instructions (questions) in the exact form you get them (no rewriting), quickly find relevant content in your digital exhaust (documents, presentations, recordings, etc.), assemble responses the way you would, cite them in detail, and help you grow your personal knowledge base... well it feels like touching the future.

And it's not hard to imagine how similar really personal assistants could benefit people at work, home, and school.


AI Models in Software UI

Sun, 11/19/2023 - 2:00pm

As more companies work to integrate the capabilities of powerful generative AI language and vision models into new and existing software, high-level interaction patterns are emerging. I've personally found these distinct approaches to AI integration useful for talking with folks about what might work for their specific products and use cases.

In the first approach, the primary interface affordance is an input that directly (for the most part) instructs one or more AI models. In this paradigm, people author prompts that result in text, image, video, etc. generation. These prompts can be sequential, iterative, or unrelated. Marquee examples are OpenAI's ChatGPT interface and Midjourney's use of Discord as an input mechanism. Since there are few, if any, UI affordances to guide people, these systems need to respond to a very wide range of instructions. Otherwise people get frustrated with their primarily hidden (to the user) limitations.

The second approach doesn't include any UI elements for directly controlling the output of AI models. In other words, there are no input fields for prompt construction. Instead, instructions for AI models are created behind the scenes as people go about using application-specific UI elements. People using these systems could be completely unaware that an AI model is responsible for the output they see. This approach is similar to YouTube's use of AI models (more machine learning than generative) for video recommendations.

The third approach is application specific UI with AI assistance. Here people can construct prompts through a combination of application-specific UI and direct model instructions. These could be additional controls that generate portions of those instructions in the background. Or the ability to directly guide prompt construction through the inclusion or exclusion of content within the application. Examples of this pattern are Microsoft's Copilot suite of products for GitHub, Office, and Windows.

These entry points for AI assistance don't have to be side panels; they could be overlays, modals, inline menus, and more. What they have in common, however, is that they supplement application-specific UIs instead of completely replacing them.

Actual implementations of any of these patterns are likely to blur the lines between them. For instance, even when the only UI interface is an input for prompt construction, the system may append or alter people's input behind the scenes to deliver better results. Or an AI assistance layer might primarily serve as an input for controlling the UI of an application instead of working alongside it. Despite that, I've still found these three high-level approaches to be helpful in thinking through where and how AI models are surfaced in software applications.

Until the Right Design Emerges...

Wed, 11/15/2023 - 2:00pm

Too often, the process of design is cut short. When faced with user needs or product requirements, many designers draft a mockup or wireframe informed by what they've seen or experienced before. But that's actually when the design process starts, not ends.

"Art does not begin with imitation, but with discipline."—Sun Ra, 1956

Your first design, while it may seem like a solution, is usually just an early definition of the problem you are trying to solve. This iteration surfaces unanswered questions, puts assumptions to the test, and generally works to establish what you need to learn next.

"Design is the art of gradually applying constraints until only one solution remains."—Unknown

Each subsequent iteration is an attempt to better understand what is actually needed to solve the specific problem you're trying to address with your design. The more deeply you understand the problem, the more likely you are to land on an elegant and effective solution. The process of iteration is a constant learning process that gradually reveals the right path forward.

"True simplicity is, well, you just keep on going and going until you get to the point where you go... Yeah, well, of course." —Jonathan Ive, September, 2013

When the right approach reveals itself, it feels obvious. But only in retrospect. Design is only obvious in retrospect. It takes iteration and discipline to get there. But when you do get there, it's much easier to explain your design decisions to others. You know why the design is the right one and can frame your rationale in the context of the problem you are trying to solve. This makes presenting designs easier and highlights the strategic impact of designers.

Multi-Modal Personal Assistants: Early Explorations

Mon, 11/13/2023 - 2:00pm

With growing belief that we're quickly moving to a world of personalized multi-modal software assistants, many companies are working on early glimpses of this potential future. Here are a few ways you can explore bits of what these kinds of interactions might become.

But first, some context. Today's personal multi-modal assistant explorations are largely powered by AI models that can perform a wide variety of language and vision tasks like summarizing text, recognizing objects in images, synthesizing speech, and lots more. These tasks are coupled with access to tools, information, and memory that makes them directly relevant to people's immediate situational needs.

To simplify that, here's a concrete example: faced with a rat's nest of signs, you want to know if it's ok to park your car. A personal multi-modal assistant could take an image (live camera feed or still photo), a voice command (in natural language), and possibly some additional context (time, location, historical data) as input and assemble a response (or action) that considers all these factors.

So where can you try this out? As mentioned, several companies are tackling different parts of the problem. If you squint a bit at the following list, it's hopefully clear how these explorations could add up to a new computing paradigm.

OpenAI's native iOS app can take image and audio input and respond in both text and speech using their most advanced large language model, GPT-4... if you sign up for their $20/month ChatGPT Plus subscription. With an iPhone 15 Pro ($1,000+), you can configure the phone's hardware Action button to directly open voice control in OpenAI's app. This essentially gives you an instant assistant button for audio commands. Image input, however, still requires tapping around the app and only works with static images, not a real-time camera feed.

Humane's upcoming AI Pin (preorder $699) handles multiple inputs with a built-in microphone, camera, touch surface, and sensors for light, motion, GPS, and more. It likewise makes use of a network connection ($24/month) and large language models to respond to natural language requests, but instead of relying on your smartphone's screen and speaker for output, it uses its own speaker and laser projection display. Definitely on the "different" end of the hardware and display spectrum.

Rewind's Pendant (preorder $59) is a wearable that captures what you say and hear in the real world and then transcribes, encrypts, and stores it on your phone. It's mostly focused on the audio input side of a multi-modal personal assistant, but the company's goal is to make use of what the device captures to create a "personalized AI powered by truly everything you've seen, said, or heard."

New Computer's Dot app (not yet available) has released some compelling videos of a multi-modal personal assistant that runs on iOS. In particular, the ability to add docs and images that become part of a longer term personal memory.

While I'm sure more explorations and developed products are coming, this list lets you touch parts of the future while it's being sorted out... wrinkles and all.

Always Be Learning

Wed, 11/08/2023 - 2:00pm

The mindset to “always be learning” is especially crucial in the field of digital product design where not only is technology continuously evolving, but so are the people we're designing for.

To quote Bruce Sterling, because people are “time bound entities moving from cradle to grave”, their context, expectations, and problems are always changing. So design solutions need to change along with them.

As a result, designers have to keep learning about how our products are being used, abused, or discarded and we need to feed those lessons back into our designs. Good judgement comes from experiences, and experience comes from bad judgements. Therefore, continuous learning is crucial for refining judgement and improving design outcomes.

"There’s the object, the actual product itself, and then there’s all that you learned. What you learned is as tangible as the product itself, but much more valuable, because that’s your future." -Jony Ive, 2014

So how can we always be learning? Start with the mindset that you have a lot to learn and sometimes unlearn. Spend your time in environments that encourage deeper problem-understanding and cross-disciplinary collaboration. This means not just designing but prototyping as well. Design to build, build to learn.

Recognize the patterns you encounter along the way and make time to explore them. This extends what you've learned into a more broadly useful set of skills and better prepares you for the next set of things you'll need to learn.

Rapid Iterative Testing and Evaluation (RITE)

Thu, 11/02/2023 - 2:00pm

Rapid Iterative Testing and Evaluation or RITE is a process I've used while working at Yahoo! and Google to quickly make progress on new product designs and give teams a deeper shared understanding of the problem space they're working on.

RITE is basically a continuous process of designing and building a prototype, testing it with users, and making changes within a short period, typically a few days. The goal is to quickly identify and address issues, and then iterate on the design based on what was learned. This gives teams regular face time with end users and collectively grows their knowledge of the needs, environments, and expectations of their customers.

The way I've typically implemented RITE is every Monday, Tuesday, and Wednesday, we design and build a prototype. Then every Thursday, we bring in people to use the prototype through a series of 3-5 usability tests that the whole team attends. On Friday, we discuss the results of that testing together and decide what to change during the following week. This cycle is repeated week after week. In some cases running for months.

This approach puts customers front and center in the design process and allows for quick adaptation to issues and opportunities each week. The RITE method is also useful because it provides insights not just opinions. In other words, if there's a debate about a design decision, we can simply test it with users that week. This squashes a lot of open-ended discussions that don't result in action because the cost of trying something out is incredibly low. "OK we'll try it."

The cadence of weekly user tests also really aligns teams on common goals as everyone participates in observing problems and opportunities, exploring solutions, and seeing the results of their proposals. Over and over again.

Smashing Conf: Journey in Enterprise UX

Mon, 10/09/2023 - 2:00pm

In her A Journey in Enterprise UX talk at Smashing Conf Antwerp, Stephanie Walter outlined her learnings doing UX research and design for internal enterprise users.

  • Enterprise software is complex to design due to a wide range of use cases and specific requirements. Most of the time it is ugly and hard to use, but it doesn't have to be that way.
  • An internal tool can have lots of different user groups. Before you even start research, get familiar with the "as is": the processes, the jargon, and what is currently in place.
  • Quantitative data analysis lets you learn what features get used and how much. You can also analyze the content of these features.
  • Analyzing content allows you to remove duplicated content and rework the information architecture.
  • To get internal users for research, make friends with different departments and get referrals; you'll find people who can help you improve the tools they work with.
  • Most enterprise tools are very task oriented: learn how people perform these tasks, identify pain points, and determine the content needed.
  • User research questions: tell me about..., walk me through the steps, show me how you..., if you have a magic wand what would you change?
  • Keep track of and document everything. Even if something is out of scope, it might be useful in the future.
  • People are not used to user-centered design processes; you might need to dig to find the needs instead of hearing solutions.
  • Define priorities: list the big pain points and needs, then decide with the team what is fast track vs. big topics.
  • Fast track: content and features that are low stakes and don't need extensive feedback, so they can be done quickly.
  • For big topics, you need more data: gather existing information, schedule follow-up sessions, iterate on solutions, and do usability testing.
  • Observational testing allows you to watch how people work and see where the issues are.
  • If users have questions during the session, take notes and save them for the end to not bias testing.
  • User diaries allow you to understand usage over a period of time. This helps find where people fallback to previous tools or processes.
  • Don't oversimplify interfaces for people who need features to do their job. Progressive disclosure and customization options are useful.
  • Content might be there for a reason but you're allowed to question that need.
  • People want to work with the data, let them export or copy data to move it in and out of your tools.
  • Find the small things that make people's lives easier. There are lots of these opportunities in enterprise tools.
  • Users don't care what data goes into what tool, but they care about too many clicks, especially for tasks they do regularly.
  • Offer training: some people need and expect it, others won't so make it optional and in multiple formats. Training doesn't mean your UX is bad.
  • Training can be used to collect user feedback, you can hear the questions they ask.
  • Complex internal organizations can slow things down, be patient. Things don't change overnight.
  • Understand what makes people click, and leverage it.
  • Don't bring an opinion to a data fight: measure and bring proof. Have unbiased data.
  • Enterprise users are starting to demand better tools and experiences. Make the process of designing internal tools visible to users so they understand the rationale behind designs.
  • Get champions and advocates in your user base.
  • Complexity is scary; break it into pieces and tackle it a few small parts at a time. User research helps you connect the pieces.

Smashing Conf: UX Writing with a Point of View

Mon, 10/09/2023 - 2:00pm

In his Designing a Product with a Point of View talk at Smashing Conf Antwerp, Nick DiLallo described the role of writers in defining a unique product personality and brand.

  • With placeholder content, it's hard to evaluate the interface. Words help make products simple and clear but also provide a personality.
  • The first step to writing is defining your audience. This helps inform more than words.
  • When creating an audience, don't be too broad: "film obsessives" rather than "people who watch movies" helps you make more decisions.
  • Another way to focus an audience definition is to add "people who...". The point is to provide focus for designs.
  • Say something interesting. Start with a sentence to plant a flag or establish a point of view.
  • A lot of companies use words like "fast, simple, or fun". But this sounds like everyone, so it's not interesting.
  • Sometimes we define a feature, but instead of "Keep track of your runs," consider "Compete with thousands of runners." These sentences can help guide a lot of design decisions.
  • Write out words to describe features and content in your product. This communicates a perspective on what you are doing.
  • Think really deeply on what words to use in the interface and why. There's many ways to frame the same action.
  • Not all parts of an interface need to be creative, some require conventional labels to be clear like: add to cart.
  • What you include is what you care about. "we think this is important..." What you include communicates a point of view.
  • Bigger means more important. What you emphasize communicates what you care about.
  • Writers look for opportunities to communicate in an interface. Even tiny moments (like the footer) can say a lot about who you are and how you think.
  • You can also overdo it. Be careful about adding brand voice in places that don't need it; places like maps and calendars might not need a lot of brand voice.
  • It's not just words but the entire interface that communicates with users.
  • When you work in UX you have to make hard decisions about how to surface potentially offensive issues: gender, race, nationalities, etc.
  • Do what you write. For example, don't pair "free trial" with a credit card screen. Clear and simple words should not kick complexity down the road.
  • Writing can show what's broken with the UX.

Generative Agents

Sat, 09/30/2023 - 2:00pm

In his AI Speaker Series presentation at Sutter Hill Ventures, Joon Park discussed his work on generative AI agents, their architecture, and what we might learn from them about human behavior. Here are my notes from his talk:

  • Can we create computer generated behavior that simulates human behavior in a compelling way? While this has been very complicated to date, LLMs offer us a new way to tackle the problem.
  • The way we behave and communicate is much too vast and too complex for us to be able to create with existing methods.
  • Large language models (LLMs) are trained on broad data that reflects our lives, like the traces on our social web, Wikipedia, and more. So these models include a tremendous amount about us, how we live, talk, and behave.
  • With the right method, LLMs can become the core ingredient that has been missing, enabling us to simulate human behavior.
  • Generative agents are a new way to simulate human behavior using LLMs. They are complemented with an agent architecture that remembers, reflects, and plans based on constantly growing memories and cascading social dynamics.
Smallville Simulation
  • Smallville is a custom-built game world which simulates a small village. 25 generative agents are initiated with a paragraph description of their personality and motivations. No other information is provided to them.
  • As individuals, agents set plans, and execute on them. They wake up in the morning, do their routines, and go to work in the sandbox game environment.
  • First, an agent basically generates a natural language statement describing their current action. They then translate this into concrete grounded movements that can affect the sandbox game environment.
  • They actually influence the state of the objects that are in this world. So a refrigerator can be empty after an agent uses it to make breakfast.
  • They determine whether they want to engage in conversations when they see another agent. And they generate the actual dialogue if they decide to engage.
  • Just as agents can form dialogue with each other, a user can engage in dialogue with these agents by specifying a persona. For instance, a news reporter.
  • Users can also alter the state of the agent's environment, control an agent, or actually enter as an outside visitor.
  • In this simulation, information diffuses across the community as agents share information with each other and form new relationships.

Agent Architecture
  • In the center of the architecture that powers generative agents is a memory stream that maintains a record of agents' experiences in natural language.
  • From the memory stream, records are retrieved as relevant to the agent's cognitive processes. A retrieval function takes the agent's current situation as input and returns a subset of the memory stream to pass to an LLM, which then generates the final output behavior of the agent.
  • Retrieval is a linear combination of the recency, importance, and relevance function for each piece of memory.
  • The importance function is a prompt that asks the large language model to rate an event. You're basically asking the agent in natural language: this is who you are, how important is this to you?
  • A reflection process clusters records of agents' memory into higher-level abstract thoughts called reflections. Once synthesized, these reflections are just another type of memory and are stored in the memory stream along with the raw observational memories.
  • Over time, this generates trees of reflections whose leaf nodes are the observations. As you go higher up the tree, you start to answer some of the core questions about who agents are: what drives them, what they like.
  • While we can generate plausible behavior in response to situations, this might sacrifice the quality for long-term actions. Agents need to plan over a longer time horizon than just now.
  • Plans describe a future sequence of actions for the agent and help keep the agent's behavior consistent over time and are generated by a prompt that summarizes the agent and the agent's current status.
  • In order to control granularity, plans are generated first in large chunks, then decomposed into hourly and then 1-15 minute increments.
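The retrieval scoring described in the bullets above can be sketched as a weighted sum. The decay factor, hourly units, and 1-10 importance normalization here are illustrative choices, not the talk's exact values:

```python
def retrieval_score(memory, now, relevance, weights=(1.0, 1.0, 1.0)):
    """Linear combination of recency, importance, and relevance.
    Decay factor and the 1-10 importance scale are illustrative."""
    w_recency, w_importance, w_relevance = weights
    hours_since_access = now - memory["last_accessed"]
    recency = 0.995 ** hours_since_access   # exponential decay per hour
    importance = memory["importance"] / 10  # LLM-rated 1-10, normalized
    return w_recency * recency + w_importance * importance + w_relevance * relevance
```

The top-scoring memories under this function are what get passed to the LLM to generate the agent's next behavior.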
Evaluating Agents
  • How do we evaluate whether agents remember, plan, and reflect in a believable manner?
  • Ask the agents a series of questions and have human evaluators rank the answers in order to calculate TrueSkill ratings for each condition.
  • Found that the core components of the agent architecture (observation, planning, and reflection) each contribute critically to the believability of these agents.
  • But agents would sometimes fail to retrieve certain memories and sometimes embellish their memory (with hallucinations).
  • And instruction tuning of LLMs also influenced how agents spoke to each other (overly formal or polite).
  • Going forward, the promise of generative agents is that we can create accurate simulations of human behavior.
©2003 - Present Akamai Design & Development.