Audio-visual collaboration method with delay management for wide area broadcast
Reader's note: This technology, "Audio-visual collaboration method with delay management for wide area broadcast," was created by Anton Holmberg, Benjamin Hersh, Jeannie Yang, Perry R. Cook, and Jeffrey C. Smith on 2018-04-03. Summary: Techniques have been developed to facilitate live broadcast of group audiovisual performances. Audiovisual performances including vocals are captured and coordinated with the performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio-show entertainment format. The developed techniques provide a communication-latency-tolerant mechanism for synchronizing vocal performances captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablets, or at audiovisual capture devices geographically separated from a live studio).
1. An audio collaboration method for broadcasting a combined performance of a first performer and a second performer geographically distributed with non-negligible peer-to-peer communication delay between a host device and a guest device, the method comprising:
receiving, at the host device operating as a local peer, a media encoding of a mixed audio performance that (i) includes vocal audio captured from a first one of the performers at the guest device communicatively coupled as a remote peer, and (ii) is mixed with an accompaniment audio track;
at the host device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers; and
mixing the captured second performer vocal audio with the received mixed audio performance for transmission to an audience as the broadcast, wherein the broadcast mix includes the first and second performers' vocal audio and the accompaniment audio track with negligible time lag therebetween.
2. The method of claim 1, further comprising:
transmitting the broadcast mix as a live broadcast over a wide area network to a plurality of recipients, the plurality of recipients constituting the audience.
3. The method of claim 1, further comprising:
joining, under selective control of the second performer at the host device, the first performer into the combined performance.
4. The method of claim 3,
wherein a joining first performer is selected from the audience and the joining first performer is decoupled from live transmission of the broadcast to the audience for at least the duration of the combined performance.
5. The method of claim 4,
wherein the live broadcast transmitted to the audience lags the first performer's vocal audio capture by at least several seconds.
6. The method of claim 4, further comprising:
returning the first performer to the audience and, at the same time, re-coupling the first performer to the live transmission.
7. The method of claim 6, further comprising:
selectively joining a third performer as a new remote peer, and thereafter:
receiving, at the host device, a second media encoding of a mixed audio performance that (i) includes vocal audio captured from the third performer at a new guest device communicatively coupled as the new remote peer, and (ii) is mixed with a second accompaniment audio track;
at the host device, audibly presenting the second media encoding and accordingly capturing additional vocal audio from the second performer; and
mixing the captured additional vocal audio with the received second media encoding for transmission to the audience as a continuation of the broadcast, wherein the broadcast mix includes the second and third performers' vocal audio and the second accompaniment audio track with negligible time lag therebetween.
8. The method of claim 1, further comprising:
providing the captured second performer vocal audio to the guest device remote peer for audible presentation at the guest device with at least some guest-side time lag relative to capture of the vocal audio from the first performer.
9. The method of claim 8, wherein the apparent guest-side time lag is about 40-1200 ms.
10. The method of claim 8,
wherein substantially all of the non-negligible peer-to-peer communication delay is apparent in the guest-side time lag.
11. The method of claim 10,
wherein the non-negligible peer-to-peer communication delay is apparent neither at the host device nor in the broadcast mix of the first and second performers.
12. The method of claim 1, wherein the non-negligible peer-to-peer communication delay comprises:
input signal-to-transmission delay,
network delay,
jitter buffer delay, and
buffer and output delay.
13. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is at least about 100-250 ms.
14. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is approximately 100-600 ms.
15. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is at least approximately 30-100 ms.
16. The method of claim 1,
wherein receiving the mixed audio performance at the host device and providing the second performer captured sound audio to the guest device is via a peer-to-peer audiovisual connection using a Web real-time communication (WebRTC) type framework.
17. The method of claim 1, further comprising:
providing a broadcast mix of the first performer's and the second performer's audio over a wide area network.
18. The method of claim 17,
wherein provision of the broadcast mix is via a Real-Time Messaging Protocol (RTMP)-type audiovisual streaming protocol.
19. The method of claim 1,
wherein at least the guest device constitutes a mobile handset or a media player.
20. The method of claim 1, further comprising:
at the host device, pitch correcting the second performer vocals in accord with a vocal score encoding a sequence of notes for a vocal melody.
21. The method of claim 20, further comprising:
pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
22. The method of claim 1,
wherein the first performer vocals included in the received mixed performance are pitch-corrected vocals.
23. The method of claim 1,
wherein one of the first performer vocals and the second performer vocals is pitch corrected in accord with a vocal score encoding a sequence of notes for a vocal melody; and
wherein the other of the first performer vocals and the second performer vocals is pitch corrected in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
24. The method of claim 1,
wherein either or both of the first performer vocals and the second performer vocals are computationally processed to apply one or more audio effects prior to inclusion in the broadcast.
25. The method of claim 24, wherein the applied audio effect comprises one or more of:
reverberation,
digital filtering,
spectral equalization,
non-linear distortion,
audio compression,
pitch correction or pitch shifting, and
channel-relative gain and/or phase delay to manipulate apparent placement of the first performer or the second performer within a stereo field.
26. The method of claim 1,
wherein the received media encoding includes video that is performance-synchronized with the captured first performer vocals,
wherein the method further comprises capturing, at the host device, video that is performance-synchronized with the captured second performer vocals, and
wherein the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
27. The method of claim 26, further comprising:
dynamically varying, in the broadcast mix, at least the visual prominence of one or the other of the first and second performers based on an evaluation of computationally defined audio features of either or both of the first performer vocals and the second performer vocals.
28. The method of claim 26, further comprising:
applying one or more video effects to the broadcast mix based at least in part on computationally defined audio or video characteristics for either or both of the first performer audio or video and the second performer audio or video.
29. The method of claim 1, further comprising:
receiving, at the host device, chat messages from members of the audience.
30. The method of claim 29, further comprising:
incorporating at least some content of the chat messages as part of the broadcast mix video.
31. The method of claim 1, further comprising:
receiving, at the host device from members of the audience, one or more of: chat messages, emoticons, animated GIFs, and voting instructions.
32. The method of claim 31, further comprising:
incorporating a visual presentation of at least some of the received chat message content, emoticons, animated GIFs, or voting instructions as part of the broadcast mix.
33. The method of claim 1, further comprising:
queuing playlist requests from one or more recipients of the broadcast mix.
34. The method of claim 33, further comprising:
in response to selection, by the second performer at the host device, of a particular one of the queued playlist requests, retrieving from a content repository one or more of: the accompaniment audio track, lyrics, and score-coded note targets.
35. The method of claim 33, further comprising:
in response to selection, by the second performer at the host device, of a particular one of the queued playlist requests, requesting supply of one or more of the following to a communicatively coupled guest device: the accompaniment audio track, lyrics, and score-coded note targets.
36. The method of claim 1,
wherein the broadcast mix is presented as a vocal duet.
37. The method of claim 1, further comprising:
receiving, at the host device, a media encoding of at least one additional mixed audio performance that (i) includes vocal audio captured from a third performer at another guest device communicatively coupled as another remote peer, and (ii) is temporally aligned or alignable with the accompaniment audio track.
38. The method of claim 2,
wherein the live broadcast includes both:
a captured conversational audio portion corresponding to an interactive conversation between the first performer and the second performer; and
a captured vocal performance audio portion corresponding to a vocal performance, by either or both of the first and second performers, against the accompaniment audio track.
39. The method of claim 38, further comprising:
selecting a highlight clip set of segments from the live broadcast,
wherein the highlight clip set of segments generally includes the vocal performance portions and generally excludes the conversational audio portions.
40. The method of claim 38, further comprising:
selecting the highlight clip set of segments based on correspondence of particular audio portions of the live broadcast to lyric segments, refrains, or musical section boundaries, whether score-coded or computationally determined by audio feature analysis.
41. The method of claim 38, further comprising:
selecting a highlight clip set of segments from the live broadcast based on one or more of viewer reaction to the live broadcast, song structure, and audio power.
42. The method of claim 38, further comprising:
in response to a user selection, saving or sharing an audiovisual encoding of one or more of the highlight clips.
43. The method of claim 1, further comprising:
receiving, at the host device, one or more lyric synchronization markers from the guest device, the lyric synchronization markers conveying to the host device a temporal alignment of lyrics visually presented at the guest device with the vocal audio captured at the guest device.
44. The method of claim 43, further comprising:
visually presenting the lyrics at the host device, wherein the visual presentation of the lyrics is temporally aligned with media encoding of the mixed audio performance received from the guest device based on the received one or more lyric synchronization markers.
45. The method of claim 43,
wherein the received one or more lyric synchronization markers coordinate progress of the lyrics presented at the host device with a pause or other temporal control at the guest device.
46. A system for disseminating an apparently live broadcast of a combined performance of a first performer and a second performer that are geographically distributed, the system comprising:
a host device and a guest device coupled by a communication network as a local peer and a remote peer with non-negligible peer-to-peer latency therebetween for transmission of audiovisual content, the host device communicatively coupled as the local peer to receive a media encoding of a mixed audio performance including vocal audio captured at the guest device, and the guest device communicatively coupled as the remote peer to supply the media encoding, captured from a first one of the performers and mixed with an accompaniment audio track;
the host device configured to audibly present the received mixed audio performance, to accordingly capture vocal audio from a second one of the performers, and to mix the captured second performer vocal audio with the received mixed audio performance for transmission as the apparently live broadcast.
47. An audio collaboration method for live broadcast of a coordinated audiovisual work of first and second performers captured at geographically distributed first and second devices, respectively, the method comprising:
receiving, at the second device, a media encoding of a mixed audio performance that (i) includes vocal audio captured at the first device from a first one of the performers, and (ii) is mixed with an accompaniment audio track;
at the second device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers;
mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the accompaniment audio track without significant time lag therebetween; and
providing the broadcast mix to a service platform configured to live broadcast the broadcast mix to a plurality of recipient devices constituting an audience.
48. The method as set forth in claim 47,
wherein the first device is associated with the second device as a current live guest, and
wherein the second device operates as a current live host that controls association and dissociation of particular devices from the audience as the current live guest.
49. The method of claim 48,
wherein the current live host selects the current live guest from a queue of requests received from the audience.
50. The method of claim 47, wherein the first device operates in a live guest role and the second device operates in a live host role, the method further comprising one or both of:
the second device relinquishing the live host role for assumption by another device; and
the second device passing the live host role to a particular device selected from a set that includes the first device and devices of the audience.
Technical Field
The present invention relates generally to the capture, processing and/or broadcasting of audiovisual performances by a plurality of performers, and in particular to techniques adapted to manage transmission delays for audiovisual content captured in the context of near real-time audiovisual collaboration by a plurality of geographically distributed performers.
Background
The installed base of mobile phones, personal media players, and portable computing devices, together with media streaming and television set-top boxes, grows in sheer number and computational power each day. Ubiquitous and deeply entrenched in the lifestyles of people around the world, many of these devices transcend cultural and economic barriers. Computationally, they offer speed and storage capabilities comparable to engineering workstations or workgroup computers of less than a decade ago, and they typically include powerful media processors, making them suitable for real-time sound synthesis and other musical applications. Partly as a result, some portable handheld devices (e.g.,
iPod Touch® and other iOS® or Android devices), as well as media application platforms and set-top box (STB)-type devices (e.g., Apple TV® devices), have substantial capability to support audio and video processing while providing platforms suitable for advanced user interfaces. Indeed, applications such as the Ocarina™, Leaf Trombone®, I Am T-Pain™, Sing! Karaoke™, Guitar! By Smule®, and Magic Piano® apps (available from Smule, Inc.) have shown that advanced digital acoustic techniques can be delivered using such devices in ways that provide compelling musical experiences. Sing! Karaoke™ implementations have previously demonstrated accretion of vocal performances captured, on a non-real-time basis with respect to one another, at geographically distributed handheld devices, as well as implementations in which more closely coupled pairings of a portable handheld device and a local media application platform (e.g., in-room) are supported, typically with short-range, negligible-latency communications over a same local or personal area network segment. Improved techniques and functional capabilities are desired to extend the sense of intimacy of "now" or "live" to collaborative vocal performances in which the performers are separated by more significant geographic distances and in which communication latencies between devices are non-negligible.
As researchers attempt to transition their innovations to commercial applications deployable on modern handheld devices and media application platforms, significant practical challenges present themselves within the real-world constraints imposed by the processors, memory, and other limited computing resources described above, and/or within the typical communication bandwidth and transmission latency constraints of wireless and wide area networks. For example, while applications such as Sing! Karaoke have demonstrated the appeal of post-performance audiovisual mixes that simulate duets or collaborative vocal performances of larger numbers of performers, creating a sense of "now," in live collaboration, has proven elusive without physical co-location.
Improved techniques and functional capabilities are desired, particularly with respect to managing communication latencies and captured audiovisual content such that a combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner presented to recipients, listeners, and/or viewers as a live interactive collaboration of geographically distributed performers. It is also desirable to provide for audience intervention and participation in constructs that preserve the sense of intimacy of "now" or "live."
Disclosure of Invention
It has been found that, despite the practical limitations imposed by mobile device platforms and media application execution environments, audiovisual performances, including vocal music, can be captured and coordinated with the audiovisual performances of other users in ways that create compelling user and listener experiences. In some cases, the vocal performances of collaborating contributors (together with performance-synchronized video) are captured in the context of a karaoke-style presentation of lyrics, in correspondence with an audible rendering of an accompaniment track (backing track). In some cases, vocals (and typically synchronized video) are captured as part of a live or otherwise unscripted performance with vocal interactions (e.g., a duet or dialog) between the collaborating contributors. In either case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists in managing latencies and captured audiovisual content such that the combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner presented to recipients, listeners, and/or viewers as a live interactive collaboration.
In one technique for accomplishing such a facsimile of live interactive performance collaboration, the actual and non-negligible network communication latency is (in effect) masked in one direction between the guest and host performers and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer on a "live show" internet broadcast of a host performer may include a guest + host duet sung in apparent real-time synchrony. In some cases, the host may be a performing artist who has popularized a particular musical performance. In some cases, the guest may be an amateur vocalist given the opportunity to sing "live" (though remote) with, or as, the host of the show alongside the popular artist or group. Despite the non-negligible network communication latency from guest to host (which may be on the order of 200 ms or more), the host performs in apparent synchrony with the guest performance as received, so that the vocals of both performers appear in the resulting broadcast mix without significant time lag between them.
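The latency-masking arrangement described above can be illustrated with a toy timing model (a sketch with hypothetical delay values, not figures from the source): because the host sings against the guest mix exactly as received, the guest-to-host delay shifts guest-mix arrival and host capture identically and cancels out of the broadcast mix, while the guest alone experiences the full round trip.

```python
# Toy timeline model of one-directional latency masking (all times in ms).
# The delay values used below are hypothetical, for illustration only.

def broadcast_alignment(guest_to_host_delay_ms, host_to_guest_delay_ms):
    """Return apparent lags (ms) at the two observation points.

    The guest captures vocals against a locally rendered accompaniment,
    so guest vocal and accompaniment are aligned at capture. The host
    sings against the guest mix *as received*; whatever network delay
    that mix suffered shifts host capture by the same amount, so in the
    broadcast mix the two vocal parts stay aligned.
    """
    # Guest mix (guest vocal + accompaniment) arrives at the host late:
    guest_mix_arrival = guest_to_host_delay_ms
    # Host vocals are captured against that late arrival, hence aligned with it:
    host_capture = guest_mix_arrival
    # Apparent lag between the two vocal parts in the broadcast mix:
    broadcast_lag = host_capture - guest_mix_arrival  # always 0
    # The guest, however, hears the host's reply a full round trip late:
    guest_side_lag = guest_to_host_delay_ms + host_to_guest_delay_ms
    return {"broadcast_lag_ms": broadcast_lag, "guest_side_lag_ms": guest_side_lag}

lags = broadcast_alignment(guest_to_host_delay_ms=150, host_to_guest_delay_ms=150)
print(lags)  # {'broadcast_lag_ms': 0, 'guest_side_lag_ms': 300}
```

This mirrors the claims: the full peer-to-peer latency surfaces only as guest-side time lag, never between the vocal parts in the broadcast mix.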
The result is an apparently live interactive performance (at least from the perspective of the broadcast's recipients, listeners, and/or viewers). Although the non-negligible network communication latency from guest to host is masked, it will be appreciated that latency exists and is tolerated in the host-to-guest direction. However, while the host-to-guest latency is discernible (and may be quite noticeable) to the guest, it need not be apparent in the apparently live broadcast or other dissemination. It has been found that the delayed audible rendering of host vocals (or, more generally, of the host's captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
Performance-synchronized video may be captured and included in the combined audiovisual performance that constitutes the apparently live broadcast, where visuals may be based, at least in part, on time-varying, computationally defined audio features extracted from (or computed over) the captured vocal audio. In some cases or embodiments, these computationally defined audio features are selective, over the course of the coordinated audiovisual mix, for the performance-synchronized video of one or more of the contributing vocalists.
Optionally, and in some cases or embodiments, vocal audio may be pitch-corrected in real time at the guest performer's device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, tablet, or netbook, or on a content or media application server) in accord with pitch correction settings. In some cases, the pitch correction settings encode a particular key or scale for the vocal performance or portions thereof. In some cases, the pitch correction settings include a score-coded melody and/or harmony sequences supplied with, or associated with, the lyrics and accompaniment tracks. Harmony notes or chords may be score-coded as explicit targets, if desired, or coded relative to the score-coded melody or even to the actual pitches sounded by a vocalist.
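As a loose illustration of score-constrained pitch correction of the kind described above (the function names and the note-set representation are hypothetical, and real implementations operate on continuous pitch tracks rather than single values), a detected vocal pitch can be snapped to the nearest note permitted by a score-coded melody or harmony set:

```python
import math

A4_HZ = 440.0  # standard tuning reference; corresponds to MIDI note 69

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f_hz / A4_HZ)

def midi_to_hz(note):
    """Convert a MIDI note number back to a frequency in Hz."""
    return A4_HZ * 2.0 ** ((note - 69.0) / 12.0)

def correct_pitch(f_hz, score_notes):
    """Snap a detected vocal pitch to the nearest score-coded target note.

    `score_notes` is the set of MIDI note numbers the vocal score allows
    at this point in the melody (or a harmony note set, cf. claims 20-21).
    """
    detected = hz_to_midi(f_hz)
    target = min(score_notes, key=lambda n: abs(n - detected))
    return midi_to_hz(target)

# A singer slightly flat of A4 (440 Hz); the score allows an A-major triad:
corrected = correct_pitch(430.0, {69, 73, 76})  # A4, C#5, E5
print(round(corrected, 1))  # 440.0
```

A harmony variant would simply pass a harmony note set instead of the melody note, per the score-coded harmony sequences mentioned above.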
Using uploaded vocals captured at guest performer devices (e.g., the aforementioned portable computing devices), a content server or service for the host can further mediate coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists for further broadcast or other dissemination. Depending on the goals and implementation of a particular system, uploads may include, in addition to video content, pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still another location and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or of individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts that are suggestive of performances or annotations emanating from particular geographic locations on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise ordinary mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration, and community.
In some embodiments in accordance with the present invention(s), an audio collaboration method is provided for broadcasting a joint performance of geographically distributed performers with non-negligible peer-to-peer communication latency between host and guest devices. The method includes (1) receiving, at a host device operating as a local peer, a media encoding of a mixed audio performance that (i) includes vocal audio captured from a first one of the performers at a guest device communicatively coupled as a remote peer, and (ii) is mixed with an accompaniment audio track; (2) at the host device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers; and (3) mixing the captured second performer vocal audio with the received mixed audio performance for transmission to an audience as a broadcast, wherein the broadcast mix includes the vocal audio of the first and second performers and the accompaniment audio track with negligible time lag therebetween.
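A drastically simplified sketch of step (3), the host-side mix: the received guest encoding already combines guest vocal and accompaniment, so the host only adds its own captured vocal. Decoded audio is treated here as plain sample lists in [-1.0, 1.0] (a hypothetical simplification; real media encodings, timing, and resampling are ignored).

```python
def mix_broadcast(guest_mix, host_vocal, host_gain=1.0):
    """Mix host-captured vocal samples into the received guest mix.

    `guest_mix` already contains guest vocal + accompaniment as received
    from the remote peer; mixing is sample-by-sample with hard clipping
    to the nominal [-1.0, 1.0] full-scale range.
    """
    n = min(len(guest_mix), len(host_vocal))
    out = []
    for i in range(n):
        s = guest_mix[i] + host_gain * host_vocal[i]
        out.append(max(-1.0, min(1.0, s)))  # hard clip at full scale
    return out

# Three sample frames: plain sum, clip low, clip high.
broadcast = mix_broadcast([0.25, -0.5, 0.875], [0.25, -0.625, 0.375])
print(broadcast)  # [0.5, -1.0, 1.0]
```

Because the host vocal was captured against the guest mix as presented, no additional alignment step appears in this mix; the latency compensation happened implicitly at capture time.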
In some embodiments, the method further includes transmitting the broadcast mix as a live broadcast over a wide area network to a plurality of recipients, the plurality of recipients constituting the audience. In some embodiments, the method further includes joining, under selective control of the second performer at the host device, the first performer into the combined performance.
In some cases or embodiments, the joining first performer is selected from the audience, and the joining first performer is decoupled from the live transmission of the broadcast to the audience for at least a duration of the combined performance. In some cases or embodiments, the live broadcast transmitted to the audience lags the first performer's vocal audio capture by at least several seconds.
In some embodiments, the method further includes returning the first performer to the audience and, coincident therewith, re-coupling the first performer to the live transmission. In some embodiments, the method further includes selectively joining a third performer as a new remote peer, and thereafter (1) receiving, at the host device, a second media encoding that (i) includes vocal audio captured from the third performer at a new guest device communicatively coupled as the new remote peer, and (ii) is mixed with a second accompaniment audio track; (2) at the host device, audibly presenting the second media encoding and accordingly capturing additional vocal audio from the second performer; and (3) mixing the captured additional vocal audio with the received second media encoding for transmission to the audience as a continuation of the broadcast, wherein the broadcast mix includes the vocal audio of the second and third performers and the second accompaniment audio track with negligible time lag therebetween.
In some embodiments, the method further includes providing the captured second performer vocal audio to the guest device remote peer for audible presentation at the guest device with at least some guest-side time lag relative to capture of the vocal audio from the first performer. In some cases or embodiments, the apparent guest-side time lag is about 40-1200 ms.
In some cases or embodiments, substantially all of the non-negligible peer-to-peer communication delay is apparent in the guest-side time lag. In some cases or embodiments, the non-negligible peer-to-peer communication delay is apparent neither at the host device nor in the broadcast mix of the first and second performers. In some cases or embodiments, the non-negligible peer-to-peer communication delay includes input signal-to-transmission delay, network delay, jitter buffer delay, and buffer and output delay. Non-negligible peer-to-peer communication delays vary and, in some cases, can be psychoacoustically significant. In some cases or embodiments, the non-negligible peer-to-peer communication delay is at least about 30-100 ms. In some cases or embodiments, it is at least about 100-250 ms. In some cases or embodiments, it is approximately 100-600 ms.
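Read as an additive budget, the enumerated delay components simply sum to the end-to-end peer-to-peer figure. The per-stage values below are hypothetical placeholders (not from the source), chosen only to land within the ranges discussed:

```python
# Hypothetical per-stage delays (ms); illustrative only.
delay_budget_ms = {
    "input_to_transmit": 20,   # capture, encode, packetize
    "network": 80,             # propagation and queuing
    "jitter_buffer": 40,       # de-jitter before decode
    "buffer_and_output": 30,   # decode, buffer, render
}

total_ms = sum(delay_budget_ms.values())
print(total_ms)  # 170

def is_non_negligible(total, threshold_ms=30):
    """Crude classifier against the ~30 ms lower bound mentioned above."""
    return total >= threshold_ms

print(is_non_negligible(total_ms))  # True
```

A 170 ms total falls inside the approximately 100-600 ms band the embodiments describe, which is why the one-directional masking technique, rather than delay elimination, is the operative mechanism.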
In some cases or embodiments, receiving the mixed audio performance at the host device and providing the captured second performer vocal audio to the guest device are via a peer-to-peer audiovisual connection using a Web Real-Time Communication (WebRTC)-type framework. In some embodiments, the method further includes supplying the broadcast mix of first and second performer audio over a wide area network. In some cases or embodiments, supply of the broadcast mix is via a Real-Time Messaging Protocol (RTMP)-type audiovisual streaming protocol. In some cases or embodiments, at least the guest device constitutes a mobile handset or a media player.
In some embodiments, the method further includes pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding a sequence of notes for a vocal melody. In some embodiments, the method further includes pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
In some cases or embodiments, the first performer vocals included in the received mixed performance are pitch-corrected vocals. In some cases or embodiments, one of the first and second performer vocals is pitch corrected in accord with a vocal score encoding a sequence of notes for a vocal melody, and the other of the first and second performer vocals is pitch corrected in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
In some cases or embodiments, either or both of the first and second performer vocals are computationally processed, prior to inclusion in the broadcast, to apply one or more audio effects. In some cases or embodiments, the applied audio effects include one or more of: a reverberation effect, digital filtering, spectral equalization, non-linear distortion, audio compression, pitch correction or pitch shifting, and channel-relative gain and/or phase delay to manipulate apparent placement of the first or second performer within a stereo field.
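As one concrete illustration of the listed effects, channel-relative gain is what places a performer within the stereo field. The helper below is a hypothetical sketch (simple linear panning over Python lists), not the patent's actual DSP chain:

```python
# Illustrative stereo placement via channel-relative gain (a hypothetical
# helper; real pipelines would operate on sample buffers, not Python lists).

def stereo_place(mono_samples, pan):
    """Map a mono vocal into (left, right) pairs.

    pan ranges from -1.0 (hard left) to +1.0 (hard right); the two channel
    gains are complementary, so panning shifts apparent placement without
    changing the summed level.
    """
    left_gain = (1.0 - pan) / 2.0
    right_gain = (1.0 + pan) / 2.0
    return [(s * left_gain, s * right_gain) for s in mono_samples]

# pan = -1.0 puts the performer entirely in the left channel.
frames = stereo_place([1.0, 0.5], pan=-1.0)   # [(1.0, 0.0), (0.5, 0.0)]
```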
In some cases or embodiments, the received media encoding includes video that is performance-synchronized with the captured first performer vocals; the method further includes capturing, at the host device, video that is performance-synchronized with the captured second performer vocals; and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
In some embodiments, the method further comprises dynamically changing at least the visual prominence of one or the other of the first performer and the second performer in the broadcast mix based on an evaluation of the computationally defined audio characteristics of either or both of the first performer sound and the second performer sound. In some embodiments, the method further comprises applying one or more video effects to the broadcast mix based at least in part on computationally defined audio or video characteristics for either or both of the first performer audio or video and the second performer audio or video.
In some embodiments, the method further comprises receiving, at the host device, a chat message from a member of the audience. In some embodiments, the method further comprises incorporating at least some content of the chat message as part of the video of the broadcast mix. In some embodiments, the method further comprises receiving, at the host device from members of the audience, one or more of: chat messages, emoticons, animated GIFs, and voting indications. In some embodiments, the method further comprises incorporating a visual presentation of at least some of the received chat message content, emoticons, animated GIFs, or voting indications as part of the broadcast mix.
In some embodiments, the method further comprises queuing playlist requests from one or more recipients of the broadcast mix. In some embodiments, responsive to selection by the second performer at the host device of a particular one of the queued playlist requests, the method further comprises retrieving, from a content repository, one or more of: an accompaniment audio track, lyrics, and score-coded note targets. In some embodiments, responsive to selection by the second performer at the host device of a particular one of the queued playlist requests, the method further comprises requesting that one or more of the following be supplied to the communicatively coupled guest device: an accompaniment audio track, lyrics, and score-coded note targets.
In some cases or embodiments, the broadcast mix is presented as a vocal duet. In some embodiments, the method further includes receiving, at the host device, a media encoding of at least another mixed audio performance that (i) includes vocal audio captured from a third performer at another guest device communicatively coupled as another remote peer, and (ii) is temporally aligned or alignable with the accompaniment audio track. In some cases or embodiments, the live audio includes both a captured conversational audio portion corresponding to interactive conversation between the first and second performers and a captured vocal performance audio portion corresponding to vocal performance, against the accompaniment audio track, of either or both of the first and second performers.
In some embodiments, the method further comprises selecting a highlight set of segments from the live broadcast, wherein the highlight set generally includes vocal performance portions and generally excludes conversational audio portions. In some embodiments, the method further comprises selecting the highlight set of segments based on one or more of audience reaction to the live broadcast, song structure, and audio power. In some embodiments, the method further comprises selecting the highlight set of segments based on correspondence of particular audio portions of the live broadcast to a lyric fragment, a chorus, or a musical section boundary, whether score-coded or computationally determined by audio feature analysis. In some embodiments, responsive to a user selection, the method further comprises saving or sharing an audiovisual encoding of one or more of the highlight segments.
In some embodiments, the method further comprises receiving one or more lyric synchronization markers from the guest device. The lyric synchronization markers convey to the host device a temporal alignment of lyrics visually presented at the guest device with the vocal audio captured at the guest device. In some embodiments, the method further comprises visually presenting the lyrics at the host device, wherein the visual presentation of the lyrics is temporally aligned with the media encoding of the mixed audio performance received from the guest device based on the received one or more lyric synchronization markers. In some cases or embodiments, the received one or more lyric synchronization markers coordinate the progression of lyrics presented on the host device with a pause or other temporal control at the guest device.
In some embodiments according to the invention(s), a system for disseminating an apparently live broadcast of a combined performance of geographically distributed first and second performers comprises: a host device and a guest device coupled by a communication network as a local peer and a remote peer with non-negligible peer-to-peer delay therebetween for transmission of audiovisual content. The host device is communicatively coupled as the local peer to receive a media encoding of a mixed audio performance including vocal audio captured at the guest device, and the guest device is communicatively coupled as the remote peer to provide the media encoding captured from a first one of the performers and mixed with an accompaniment audio track. The host device is configured to audibly render the received mixed audio performance, to capture accordingly vocal audio from a second one of the performers, and to mix the captured second performer vocal audio with the received mixed audio performance for transmission as the apparently live broadcast.
In some embodiments according to the invention(s), an audio collaboration method for live broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically distributed first and second devices comprises: (a) receiving, at the second device, a media encoding of a mixed audio performance that (i) includes vocal audio captured at the first device from a first one of the performers, and (ii) is mixed with an accompaniment audio track; (b) at the second device, audibly rendering the received mixed audio performance and capturing accordingly vocal audio from a second one of the performers; (c) mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of both the first and second performers and the accompaniment audio track without apparent temporal lag therebetween; and (d) supplying the broadcast mix to a service platform configured to live broadcast the broadcast mix to plural recipient devices constituting an audience.
In some cases or embodiments, the first device is associated with the second device as the current live guest, and the second device operates as the current live host. The current live host controls association and dissociation of particular devices from the audience as the current live guest. In some cases or embodiments, the current live host selects from a queue of requests from audience members to associate as the current live guest.
In some cases or embodiments, the first device operates in a live guest role and the second device operates in a live host role. The method further comprises either or both of: the second device relinquishing the live host role for another device to assume; and the second device passing the live host role to a particular device selected from a set comprising the first device and devices of the audience.
Drawings
The present invention(s) are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements or features in general.
Fig. 1 depicts information flow between illustrative mobile phone-type portable computing devices in a host and guest configuration for a live duet group audiovisual performance in accordance with some embodiments of the invention(s).
Fig. 2 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "master-sync" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s).
Fig. 3 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "shared delay" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s).
Figure 4 is a flow diagram illustrating optional real-time continuous pitch correction and harmony generation signal flows that may be performed based on score-coded pitch correction settings for audiovisual performances captured at a guest or host device in accordance with some embodiments of the invention(s).
Fig. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to process and transmit captured audiovisual performances for use in a multi-singer live configuration of network-connected devices in accordance with some embodiments of the present invention(s).
Fig. 6 illustrates features of a mobile device that may serve as a platform for performing at least some software implementations of audiovisual performance capture and/or live performance devices in accordance with some embodiments of the invention(s).
Fig. 7 is a network diagram illustrating cooperation of exemplary devices according to some embodiments of the invention(s).
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or significance of some of the elements or features shown may be exaggerated relative to other elements or features to help to improve understanding of embodiments of the present invention. Also, while illustrated in the accompanying drawings as a single flow for the sake of brevity or to avoid complexity that might otherwise obscure the description of the inventive concepts, it is to be understood that multiple data and control flows (including component signals or code) are consistent with the description.
Detailed Description
Modes for carrying out the invention(s)
Techniques have been developed to facilitate live broadcasting of group audiovisual shows. Audiovisual performances including vocal music are captured and coordinated with the performances of other users in a manner that can create compelling user and listener experiences. For example, in some cases or embodiments duel singing with a host performer may be supported in audio-visual live broadcasts that sing in the style of an artist, where an active singer requests or queues a particular song in an entertainment format for a live radio show. The developed technology provides a communication delay tolerant mechanism for synchronizing sound performances captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablet computers, or at audiovisual capture devices geographically separated from live studios).
Although audio-only embodiments are of course contemplated, it is envisioned that live content will typically include video synchronized with the captured performance. Further, while network-connected mobile phones are illustrated as audiovisual capture devices, it will be understood based on the description herein that audiovisual capture and viewing devices may include suitably configured computers, smart TVs and/or living room-style set-top box configurations, and even smart virtual assistant devices having audio and/or audiovisual capture capabilities. Finally, although applications to vocal music are described in detail, it will be understood based on the description herein that audio or audiovisual capture applications are not necessarily limited to vocal duets, but may be adapted to other forms of group performance in which one or more successive performances are accreted to prior performances to produce a live broadcast.
In some cases, vocal performances of collaborating contributors (together with performance-synchronized video) are captured in the context of a karaoke-style presentation of lyrics and in correspondence with an audible rendering of an accompaniment track. In some cases, vocals (and typically synchronized video) are captured as part of a live or otherwise unscripted performance with vocal interactions (e.g., duet or dialog) between the collaborating contributors. In either case, it is envisioned that non-negligible network communication delays will exist between at least some of the collaborating contributors, particularly where they are geographically separated. As a result, a technical challenge exists in managing the delays and the captured audiovisual content such that the combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration.
In one technique for accomplishing this facsimile of live interactive performance collaboration, an actual and non-negligible network communication delay is (in effect) masked in one direction between the guest and host performers and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer included in a host performer's "live show" internet broadcast may include guest + host duet singing in apparent real-time synchrony. In some cases, the host may be an artist who has popularized a particular musical performance. In some cases, the guest may be an amateur singer given the opportunity to sing "live" (though remote) with the popularizing artist or group as host of, or in, the performance. Although a non-negligible network communication delay from the guest to the host (perhaps 200-.
The result is an apparently live interactive performance, at least from the perspective of recipients, listeners and/or viewers of the broadcast performance. Although the non-negligible network communication delay from guest to host is masked, it will be appreciated that the delay exists and is tolerated in the host-to-guest direction. However, while the host-to-guest delay is discernible (and may be quite noticeable) to the guest, it need not be apparent in the apparently live broadcast or other dissemination. It has been found that a delayed audible rendering of host vocals (or more generally, the host side of a captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
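The delay-masking observation above can be made concrete with a toy sample-buffer model. This is an illustrative, assumption-laden sketch (naive list arithmetic in place of a real audio pipeline): because the host vocal is captured against playback of the received guest mix, the two parts are already mutually aligned when summed, so the guest-to-host network delay never appears inside the broadcast mix.

```python
# A minimal sketch of host-side mixing. The received buffer already contains
# guest vocals mixed with the backing track; the host vocal is captured
# against local playback of that very buffer. (Toy model, not the patent's
# actual implementation.)

def host_side_broadcast_mix(received_guest_mix, host_vocal):
    """Sum the received (guest vocal + backing track) mix with the host vocal.

    Both buffers are referenced to the same local playback clock at the
    host, so no additional alignment for network delay is required: the
    delay shifted *when* the guest mix arrived, not the relative alignment
    of the parts within it.
    """
    n = min(len(received_guest_mix), len(host_vocal))
    return [received_guest_mix[i] + host_vocal[i] for i in range(n)]

mix = host_side_broadcast_mix([1, 2, 3], [1, 1, 1])   # [2, 3, 4]
```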
Although much of the description herein presumes, for purposes of illustration, a fixed host performer on a particular host device, it will be understood based on the description herein that some embodiments in accordance with the invention(s) may provide host/guest control logic that allows the host to "pass the mic," such that a new user (in some cases a user who "picks up the mic" after the current host "drops the mic," and in other cases a user selected by the current host) may take over as host. Likewise, it will be understood based on the description herein that some embodiments in accordance with the invention(s) may provide host/guest control logic that queues guests (and/or aspiring hosts) and automatically assigns queued users to appropriate roles.
In some cases or embodiments, in a karaoke-style user interface framework, vocal audio of performers in individual host and guest roles is captured together with performance-synchronized video and coordinated with the audiovisual contributions of other users to form duet- or chorus-style group audiovisual performances. For example, in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of an accompaniment track, an individual user's vocal performance (together with performance-synchronized video) may be captured at a mobile device, a television-type display, and/or a set-top box device. In some cases or embodiments, score-coded continuous pitch correction and user-selectable audio and/or video effects may be provided. Consistent with the foregoing, but without limitation as to any particular claimed embodiment, karaoke-style vocal performance capture using portable handheld devices provides an illustrative context.
Karaoke type voice performance capture
Although embodiments of the present invention are not so limited, pitch-corrected, karaoke-style vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context. For example, in some embodiments such as that illustrated in fig. 1, an iPhone™ handheld device available from Apple Inc. (or more generally,
In the illustration of fig. 1, a current host user of a
In the illustrated configuration, content that is mixed to form a group
Typically, the
In the configuration shown in fig. 1, and despite a non-negligible time lag (typically 100-250 ms, but possibly greater), the
It should be appreciated that the time lag in the peer-to-peer communication channel between the
User sounds 103A and 103B are captured at the respective
In general, lyrics, melody and harmony track note sets and related timing and control information may be encapsulated in an appropriate container or object (e.g., in Musical Instrument Digital Interface (MIDI) or JavaScript Object Notation (JSON) type formats) for supply with the accompaniment track(s). Using such information, the
As will be appreciated by those skilled in the art having benefit of the present disclosure, performances of multiple singers (including performance-synchronized video) may be accumulated and combined, such as to form a duet-style performance, a chorus, or a vocal jam session. In some embodiments of the present invention, social network constructs may at least partially supplant or inform host control of the pairings of geographically distributed singers and/or the formation of geographically distributed virtual choruses. For example, relative to fig. 1, individual singers may perform as current host and guest users in a captured manner (with audio and performance-synchronized video), ultimately streamed to an audience as a live broadcast 122. Such captured audiovisual content may, in turn, be distributed to the singers' social media contacts, members of the audience, etc., through an open call mediated by a content server. In this way, the singers themselves, members of the audience (and/or a content server or service platform on their behalf) may invite others to join in the coordinated audiovisual performance, or to queue as audience members or prospective guests.
Where the provision and use of accompaniment tracks is illustrated and described herein, it will be understood that captured, pitch-corrected (and possibly, though not necessarily, harmonized) vocals may themselves be mixed (e.g., with the guest 106) to produce a "backing track" used to motivate, guide, or frame subsequent vocal captures. Furthermore, additional singers may be invited to sing a particular part (e.g., tenor, part B in a duet, etc.) or simply to sing along, with subsequent vocal capture devices (e.g.,
Synchronization method
Based on the description herein, those skilled in the art will appreciate various host-guest synchronization methods that allow for a non-negligible time lag in the peer-to-peer communication channel between
Fig. 2 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "master-sync" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s). More specifically, fig. 2 shows an exemplary configuration of the
The key to masking the actual delay is to include the
Fig. 3 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in an optional "shared delay" peer-to-peer configuration for generating a live audiovisual performance of an audio group visual presentation, in accordance with some embodiments of the invention(s). More specifically, fig. 3 shows an exemplary configuration of the
This limited perception of latency is achieved by playing the accompaniment tracks locally on both devices and keeping them synchronized in real time. The
We have experimented with two different approaches to keep the accompaniment track synchronized on two devices (
Method 1: we adjust the playback position received at the host side by the one-way network delay, approximated as network RTT/2.
Method 2: we use the Network Time Protocol (NTP) to synchronize the clocks of two devices. In this way we do not need to adjust the timing messages based on one-way network delay, we simply add an NTP timestamp to each song timing message.
For the "shared delay" configuration, method 2 has proven to be more stable than method 1. As an optimization, to avoid excessive timing adjustments, the host only updates the accompaniment track playback position if we are currently more than 50ms away from the accompaniment track playback position of the guest.
Score-coded pitch track
Figure 4 is a flow diagram illustrating real-time continuous, score-coded pitch correction and harmony generation for a captured vocal performance in accordance with some embodiments of the invention(s). In the illustrated configuration, a user/singer (e.g., a guest or host singer at a
Both the pitch correction and the added harmonies are selected to correspond to the
In some embodiments of the techniques described herein, the note (in the current key or scale) closest to the note voiced by the user/singer is determined based on the
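The nearest-note determination just described (selecting, from score-permitted notes, the one closest to the sung pitch) might look as follows. The MIDI-note representation and the C-major note set are illustrative assumptions, not the patent's encoding:

```python
# Illustrative sketch of snapping a detected vocal pitch to the nearest
# score-permitted note. MIDI note numbers are assumed for the note set.
import math

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(f_hz / 440.0)

def midi_to_hz(note):
    """Convert a MIDI note number back to a frequency in Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

def nearest_scale_note(f_hz, allowed_notes):
    """Return the allowed MIDI note closest to the sung pitch."""
    sung = hz_to_midi(f_hz)
    return min(allowed_notes, key=lambda n: abs(n - sung))

# E.g., a slightly flat A4 (435 Hz) against a C-major note set snaps to
# MIDI note 69 (A4, 440 Hz), the target for pitch correction.
c_major = [60, 62, 64, 65, 67, 69, 71, 72]
target = nearest_scale_note(435.0, c_major)   # 69
```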
Audio-visual capture at handheld devices
Although not required to support performance-synchronized video capture in all embodiments, handheld device 101 (e.g.,
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks of software (e.g., decoder(s) 352, digital-to-analog (D/A)
As will be appreciated by those of ordinary skill in the art, pitch detection and pitch correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature extraction, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accordance with the present invention. With this in mind, and recognizing that the multi-singer synchronization techniques in accordance with the invention(s) are generally independent of any particular pitch detection or pitch correction technology, the present description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable in various designs or implementations in accordance with the present description. Instead, we simply note that in some embodiments in accordance with the present invention, pitch detection methods calculate an average magnitude difference function (AMDF) and execute logic to pick a peak corresponding to an estimate of the pitch period. Building on such estimates, pitch synchronous overlapped add (PSOLA) techniques are used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing the aperiodic effects of a splice. Specific implementations based on AMDF/PSOLA techniques are described in greater detail in commonly owned U.S. Patent No. 8,983,829, entitled "COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS," naming Cook, Lazier, Lieber, and Kirk as inventors.
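The AMDF step mentioned above lends itself to a compact teaching example. The following is a minimal sketch of AMDF pitch-period estimation only (the PSOLA resampling stage is omitted), and is not the implementation of the cited patent:

```python
# Minimal AMDF pitch-period estimator. For a periodic signal, the average
# magnitude difference dips toward zero when the lag equals the pitch
# period; we pick the lag with the deepest valley. (Teaching sketch only.)
import math

def amdf(frame, lag):
    """Average magnitude difference for a candidate pitch period (lag)."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_period(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] minimizing the AMDF."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))

# A pure sine with a 20-sample period yields a pitch-period estimate of 20.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
period = estimate_period(frame, min_lag=10, max_lag=40)   # 20
```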
Exemplary Mobile device
FIG. 6 illustrates features of a mobile device that may serve as a platform for executing software implementations in accordance with some embodiments of the present invention. More specifically, FIG. 6 is a block diagram of a mobile device generally consistent with a commercially available iPhone™ handset.
Briefly summarized,
Generally, the
In general, the
The
Other sensors may also be used or provided. A
/g/n/ac communication device and/or Bluetooth™
FIG. 7 shows various examples of computing devices (701, 720A, 720B, and 711) programmed (or programmable) with audio and video capture code, user interface code, pitch correction code, audio presentation pipeline, and playback code according to the functional description herein.
Other embodiments
While the invention(s) have been described with reference to various embodiments, it should be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, although a pitch-corrected audio performance captured according to a karaoke-type interface has been described, other variations will be understood. Moreover, although certain illustrative signal processing techniques have been described in the context of certain illustrative applications, those of ordinary skill in the art will recognize that it would be straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments according to the present invention may take the form of, and/or be provided as, a computer program product encoded in a computer-readable medium as an instruction sequence or other functional construct of software that, in turn, is executable in a computational system (such as an iPhone handset, mobile or portable computing device, media application platform, set-top box, or content server platform) to perform the methods described herein. In general, a machine-readable medium may include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computing facility of a computer, mobile device or portable computing device, media device or streamer, etc.), as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but is not limited to, magnetic storage media (e.g., disk and/or tape storage); optical storage media (e.g., CD-ROM, DVD, etc.); magneto-optical storage media; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of media suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
In general, multiple instances may be provided for a component, operation, or structure described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).