This is quite interesting, but I have to ask: have you experimented much with larger LLMs as a mechanism to basically automate the entire process?
I'm doing something pretty similar right now for internal meetings, with a process like: transcribe the meeting with utterance timestamps, extract keyframes from the video along with their timestamps, request a segmented summary from an LLM with rough timestamps for the transitions, then add keyframe analysis (mainly for slides).
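Concretely, the skeleton looks something like the sketch below. This is illustrative only: it assumes the OpenAI Python SDK (whisper-1 for transcription, gpt-4o for the summary) and OpenCV for keyframes, and the prompt, paths, and function names are made up, not our actual stack.

    # Rough sketch of the pipeline above: transcribe with utterance timestamps,
    # grab keyframes, then ask an LLM for a topic-segmented summary. Assumes the
    # OpenAI Python SDK and OpenCV; prompt, models, and paths are illustrative.
    import cv2
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def transcribe(audio_path: str) -> list[dict]:
        """Return utterances as {start, end, text} with timestamps in seconds."""
        with open(audio_path, "rb") as f:
            result = client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                response_format="verbose_json",
                timestamp_granularities=["segment"],
            )
        return [{"start": s.start, "end": s.end, "text": s.text}
                for s in result.segments]

    def extract_keyframes(video_path: str, every_s: float = 30.0) -> list[tuple[float, str]]:
        """Save one frame every `every_s` seconds; return (timestamp, path) pairs."""
        cap = cv2.VideoCapture(video_path)
        frames, t = [], 0.0
        while True:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
            ok, frame = cap.read()
            if not ok:
                break
            path = f"keyframe_{int(t)}.png"
            cv2.imwrite(path, frame)
            frames.append((t, path))
            t += every_s
        cap.release()
        return frames

    def segmented_summary(utterances: list[dict]) -> str:
        """Ask the LLM for a topic-segmented summary with rough transition timestamps."""
        transcript = "\n".join(f"[{u['start']:.0f}s] {u['text']}" for u in utterances)
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": "Segment this meeting transcript into topics. For each "
                           "topic, give a title, a rough start timestamp, and a "
                           "short summary.\n\n" + transcript,
            }],
        )
        return resp.choices[0].message.content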
GPT-4o, Claude 3.5 Sonnet, Llama 3.1 405B Instruct, and Llama 3.1 70B Instruct all do a pretty stunning job of this, IMO. Each department still reviews and edits the final result before sending it out, but so far I'm quite impressed with the default output, even for 1-2 hour conversations.
I'd argue the key feature for us is still providing a simple, intuitive UI for non-technical users to manage, edit, and polish the final result and send it out.
That is a great point! I can certainly think of cases where you might want to go with an LLM instead, and we have definitely experimented with that approach. Here are some reasons why we think TreeSeg is more suitable for us:
1. A more algorithmic approach allows us to bake certain constraints into the model. As an example, you can add a regularizer to incentivize TreeSeg to split more eagerly when there are large pauses. You can also strictly enforce minimum and maximum sizes on segments (see the first sketch after this list).
2. If you are interested in reproducing a segmentation with slight variations, you might not get good results with an LLM. Our experience has been that there is significant stochasticity in the answers we get from an LLM. Even if you try to obtain a more deterministic answer (e.g. by setting temperature to zero), you will need an exact copy of the model to get the same result in the future. Depending on which LLM you are using, this might not be possible (e.g. OpenAI adjusts its models frequently). With TreeSeg you only need your block-utterance embeddings, which you probably have stored already (presumably in a vector db); the second sketch after this list illustrates this.
3. TreeSeg outputs a binary tree of segments and their sub-segments and so forth... This structure is important
to us for many reasons, some of which are subjects of future posts. One such reason is access to a continuum
between local (i.e. chapters) and global (i.e. full session) context. Obtaining such a hierarchy via an LLM
might not be that straightforward.
4. There is something attractive about not relying on an LLM for everything!
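To make point 1 concrete, here is a toy sketch of the kind of objective we have in mind: cosine dissimilarity between the two candidate halves, a pause bonus acting as the regularizer, and a hard minimum segment size. This is an illustration of the idea, not TreeSeg's actual implementation.

    # Toy split objective: dissimilarity between the halves' mean embeddings,
    # plus a bonus for splitting at large pauses, with a hard minimum segment
    # size. Illustrative only, not TreeSeg's actual code.
    import numpy as np

    def best_split(embeddings: np.ndarray, pauses: np.ndarray,
                   min_size: int = 5, pause_weight: float = 0.1) -> int:
        """embeddings: (n, d) utterance embeddings; pauses: (n,) seconds of
        silence preceding each utterance. Returns the index that starts the
        right-hand segment, or -1 if no split satisfies the size constraint."""
        n = len(embeddings)
        best_i, best_score = -1, -np.inf
        for i in range(min_size, n - min_size + 1):  # hard minimum segment size
            left = embeddings[:i].mean(axis=0)
            right = embeddings[i:].mean(axis=0)
            cos = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
            score = (1.0 - cos) + pause_weight * pauses[i]  # pause regularizer
            if score > best_score:
                best_i, best_score = i, score
        return best_i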
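And recursing on that split gives points 2 and 3 almost for free: the procedure is fully deterministic given the same stored embeddings, and its output is a binary tree of segments and sub-segments. Again a toy sketch rather than the actual algorithm, and it assumes max_size >= 2 * min_size.

    # Recursive bisection on top of best_split: deterministic for fixed
    # embeddings, enforces a maximum leaf segment size by splitting until
    # segments are small enough, and returns the hierarchy as a binary tree.
    def segment_tree(embeddings, pauses, lo=0, hi=None, max_size=40, min_size=5):
        hi = len(embeddings) if hi is None else hi
        node = {"span": (lo, hi), "children": []}
        if hi - lo <= max_size:  # small enough: stop splitting
            return node
        i = lo + best_split(embeddings[lo:hi], pauses[lo:hi], min_size=min_size)
        node["children"] = [
            segment_tree(embeddings, pauses, lo, i, max_size, min_size),
            segment_tree(embeddings, pauses, i, hi, max_size, min_size),
        ]
        return node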
The recent Stack Overflow developer survey reported a prevalence (mislabeled as popularity) of over 50% for the Microsoft Teams collaboration tool among groups of devs, higher than Slack's.
For devs using Teams, particularly remote teams: trial Teams Premium, switch on recording and enable transcripts, then switch on the Microsoft "Meet" app for Teams. (If you are colocated, Teams has a mode where each dev can join with their own device in the same room, and it uses that to enhance speaker detection.)
After a meeting, you may be surprised, stunned even, at the usefulness of the “Meet” app experience for understanding the meeting's conversation flow, participant by participant, the quality of the transcript, the quality of the OpenAI-backed summary, and the utility of the follow-ups it extracts.
This material also becomes searchable and, assuming you leverage Microsoft Stream and retain the meetings and recordings, usable as training material as well.
While Augmend takes this idea to the next level, if you are using Teams* and aren't using Meet, you are missing out.
However, this doesn't show the timeline of speakers and, more importantly, the timeline of topics, which is the most valuable part for review. For a double-click on that, see:
Meeting recap in Microsoft Teams > Topics and Chapters: