Audio-visual collaboration method with delay management for wide area broadcast
Reader's note: This technology, "Audio-visual collaboration method with delay management for wide area broadcast," was created by Anton Holmberg, Benjamin Hersh, Jeannie Yang, Perry R. Cook, and Jeffrey C. Smith on 2018-04-03. Summary: Techniques have been developed to facilitate live broadcast of group audiovisual performances. Audiovisual performances including vocals are captured and coordinated with the performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio-show entertainment format. The developed techniques provide a communication-latency-tolerant mechanism for synchronizing vocal performances captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablets, or at audiovisual capture devices geographically separated from a live studio).
1. An audio collaboration method for broadcasting a combined performance of a first performer and a second performer geographically distributed with non-negligible peer-to-peer communication delay between a host device and a guest device, the method comprising:
receiving, at the host device operating as a local peer, a media encoding of a mixed audio performance that (i) includes vocal audio captured from a first one of the performers at the guest device communicatively coupled as a remote peer, and (ii) is mixed with an accompaniment audio track;
at the host device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers; and
mixing the captured second performer vocal audio with the received mixed audio performance for transmission to an audience as the broadcast, wherein the broadcast mix includes the first and second performers' vocal audio and the accompaniment audio track with negligible time lag therebetween.
2. The method of claim 1, further comprising:
transmitting the broadcast mix as a live broadcast over a wide area network to a plurality of recipients, the plurality of recipients constituting the audience.
3. The method of claim 1, further comprising:
joining, under selective control of the second performer at the host device, the first performer into the combined performance.
4. The method of claim 3,
wherein a joining first performer is selected from the audience and the joining first performer is decoupled from live transmission of the broadcast to the audience for at least the duration of the combined performance.
5. The method of claim 4,
wherein the live broadcast transmitted to the audience lags the first performer's vocal audio capture by at least several seconds.
6. The method of claim 4, further comprising:
returning the first performer to the audience and, at the same time, re-coupling the first performer to the live transmission.
7. The method of claim 6, further comprising:
selectively joining a third performer as a new remote peer, and thereafter:
receiving, at the host device, a second media encoding of a mixed audio performance that (i) includes vocal audio captured from the third performer at a new guest device communicatively coupled as the new remote peer, and (ii) is mixed with a second accompaniment audio track;
at the host device, audibly presenting the second media encoding and accordingly capturing additional vocal audio from the second performer; and
mixing the captured additional vocal audio with the received second media encoding for transmission to the audience as a continuation of the broadcast, wherein the broadcast mix includes the second and third performers' vocal audio and the second accompaniment audio track with negligible time lag therebetween.
8. The method of claim 1, further comprising:
providing the captured second performer vocal audio to the guest device remote peer for audible presentation at the guest device with at least some guest-side time lag relative to capture of the vocal audio from the first performer.
9. The method of claim 8, wherein the apparent guest-side time lag is about 40-1200 ms.
10. The method of claim 8,
wherein substantially all of the non-negligible peer-to-peer communication delay is apparent in the guest-side time lag.
11. The method of claim 10,
wherein the non-negligible peer-to-peer communication delay is apparent neither at the host device nor in the broadcast mix of the first and second performers.
12. The method of claim 1, wherein the non-negligible peer-to-peer communication delay comprises:
input signal-to-transmission delay,
network delay,
jitter buffer delay, and
buffer and output delay.
13. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is at least about 100-250 ms.
14. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is approximately 100-600 ms.
15. The method of claim 1, wherein the non-negligible peer-to-peer communication delay is at least approximately 30-100 ms.
16. The method of claim 1,
wherein receiving the mixed audio performance at the host device and providing the second performer captured sound audio to the guest device is via a peer-to-peer audiovisual connection using a Web real-time communication (WebRTC) type framework.
17. The method of claim 1, further comprising:
providing a broadcast mix of the first performer's and the second performer's audio over a wide area network.
18. The method of claim 17,
wherein provision of the broadcast mix is via a Real-Time Messaging Protocol (RTMP)-type audiovisual streaming protocol.
19. The method of claim 1,
wherein at least the guest device constitutes a mobile handset or a media player.
20. The method of claim 1, further comprising:
at the host device, pitch correcting the second performer vocals in accord with a vocal score encoding a sequence of notes for a vocal melody.
21. The method of claim 20, further comprising:
pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
22. The method of claim 1,
wherein the first performer vocals included in the received mixed performance are pitch-corrected vocals.
23. The method of claim 1,
wherein one of the first performer vocals and the second performer vocals is pitch corrected in accord with a vocal score encoding a sequence of notes for a vocal melody; and
wherein the other of the first performer vocals and the second performer vocals is pitch corrected in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
24. The method of claim 1,
wherein either or both of the first performer vocals and the second performer vocals are computationally processed to apply one or more audio effects prior to inclusion in the broadcast.
25. The method of claim 24, wherein the applied audio effect comprises one or more of:
reverberation,
digital filtering,
spectral equalization,
non-linear distortion,
audio compression,
pitch correction or pitch shifting, and
channel-relative gain and/or phase delay to manipulate apparent placement of the first performer or the second performer within a stereo field.
26. The method of claim 1,
wherein the received media encoding includes video that is performance-synchronized with the captured first performer vocals,
wherein the method further comprises capturing, at the host device, video that is performance-synchronized with the captured second performer vocals, and
wherein the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
27. The method of claim 26, further comprising:
dynamically varying, in the broadcast mix, at least the visual prominence of one or the other of the first and second performers based on an evaluation of computationally defined audio features of either or both of the first performer vocals and the second performer vocals.
28. The method of claim 26, further comprising:
applying one or more video effects to the broadcast mix based at least in part on computationally defined audio or video characteristics for either or both of the first performer audio or video and the second performer audio or video.
29. The method of claim 1, further comprising:
receiving, at the host device, chat messages from members of the audience.
30. The method of claim 29, further comprising:
incorporating at least some content of the chat messages as part of the broadcast mix video.
31. The method of claim 1, further comprising:
receiving, at the host device from members of the audience, one or more of: chat messages, emoticons, animated GIFs, and voting instructions.
32. The method of claim 31, further comprising:
incorporating a visual presentation of at least some of the received chat message content, emoticons, animated GIFs, or voting instructions as part of the broadcast mix.
33. The method of claim 1, further comprising:
queuing playlist requests from one or more recipients of the broadcast mix.
34. The method of claim 33, further comprising:
in response to selection, by the second performer at the host device, of a particular one of the queued playlist requests, retrieving from a content repository one or more of: the accompaniment audio track, lyrics, and score-coded note targets.
35. The method of claim 33, further comprising:
in response to selection, by the second performer at the host device, of a particular one of the queued playlist requests, requesting supply of one or more of the following to a communicatively coupled guest device: the accompaniment audio track, lyrics, and score-coded note targets.
36. The method of claim 1,
wherein the broadcast mix is presented as a vocal duet.
37. The method of claim 1, further comprising:
receiving, at the host device, a media encoding of at least one additional mixed audio performance that (i) includes vocal audio captured from a third performer at another guest device communicatively coupled as another remote peer, and (ii) is temporally aligned or alignable with the accompaniment audio track.
38. The method of claim 2,
wherein the live broadcast includes both:
a captured conversational audio portion corresponding to an interactive conversation between the first performer and the second performer; and
a captured vocal performance audio portion corresponding to a vocal performance, by either or both of the first and second performers, against the accompaniment audio track.
39. The method of claim 38, further comprising:
selecting a highlight clip set of segments from the live broadcast,
wherein the highlight clip set of segments generally includes the vocal performance portions and generally excludes the conversational audio portions.
40. The method of claim 38, further comprising:
selecting the highlight clip set of segments based on correspondence of particular audio portions of the live broadcast to lyric segments, refrains, or musical section boundaries, whether score-coded or computationally determined by audio feature analysis.
41. The method of claim 38, further comprising:
selecting a highlight clip set of segments from the live broadcast based on one or more of viewer reaction to the live broadcast, song structure, and audio power.
42. The method of claim 38, further comprising:
in response to a user selection, saving or sharing an audiovisual encoding of one or more of the highlight clips.
43. The method of claim 1, further comprising:
receiving, at the host device, one or more lyric synchronization markers from the guest device, the lyric synchronization markers conveying to the host device a temporal alignment of lyrics visually presented at the guest device with the vocal audio captured at the guest device.
44. The method of claim 43, further comprising:
visually presenting the lyrics at the host device, wherein the visual presentation of the lyrics is temporally aligned with media encoding of the mixed audio performance received from the guest device based on the received one or more lyric synchronization markers.
45. The method of claim 43,
wherein the received one or more lyric synchronization markers coordinate progress of the lyrics presented at the host device with a pause or other temporal control at the guest device.
46. A system for disseminating an apparently live broadcast of a combined performance of a first performer and a second performer that are geographically distributed, the system comprising:
a host device and a guest device coupled by a communication network as a local peer and a remote peer with non-negligible peer-to-peer latency therebetween for transmission of audiovisual content, the host device communicatively coupled as the local peer to receive a media encoding of a mixed audio performance including vocal audio captured at the guest device, and the guest device communicatively coupled as the remote peer to supply the media encoding, captured from a first one of the performers and mixed with an accompaniment audio track;
the host device configured to audibly present the received mixed audio performance, to accordingly capture vocal audio from a second one of the performers, and to mix the captured second performer vocal audio with the received mixed audio performance for transmission as the apparently live broadcast.
47. An audio collaboration method for live broadcast of a coordinated audiovisual work of first and second performers captured at geographically distributed first and second devices, respectively, the method comprising:
receiving, at the second device, a media encoding of a mixed audio performance that (i) includes vocal audio captured at the first device from a first one of the performers, and (ii) is mixed with an accompaniment audio track;
at the second device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers;
mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the accompaniment audio track without significant time lag therebetween; and
providing the broadcast mix to a service platform configured to live broadcast the broadcast mix to a plurality of recipient devices constituting an audience.
48. The method as set forth in claim 47,
wherein the first device is associated with the second device as a current live guest, and
wherein the second device operates as a current live host that controls association and dissociation of particular devices from the audience as the current live guest.
49. The method of claim 48,
wherein the current live host selects the current live guest from a queue of requests received from the audience.
50. The method of claim 47, wherein the first device operates in a live guest role and the second device operates in a live host role, the method further comprising one or both of:
the second device relinquishing the live host role for assumption by another device; and
the second device passing the live host role to a particular device selected from a set that includes the first device and devices of the audience.
Technical Field
The present invention relates generally to the capture, processing and/or broadcasting of audiovisual performances by a plurality of performers, and in particular to techniques adapted to manage transmission delays for audiovisual content captured in the context of near real-time audiovisual collaboration by a plurality of geographically distributed performers.
Background
The installed base of mobile phones, personal media players, and portable computing devices, together with media streaming and television set-top boxes, grows in sheer number and computational power each day. Ubiquitous and deeply entrenched in the lifestyles of people around the world, many of these devices transcend cultural and economic barriers. Computationally, they offer speed and storage capabilities comparable to engineering workstations or workgroup computers of less than a decade ago, and they typically include powerful media processors, making them suitable for real-time sound synthesis and other musical applications. Partly as a result, some portable handheld devices (e.g.,
iPod Touch® and other iOS® or Android devices), as well as media application platforms and set-top box (STB)-type devices (e.g., Apple TV® devices), have substantial capability to support audio and video processing while providing platforms suitable for advanced user interfaces. Indeed, applications such as the Ocarina™, Leaf Trombone®, I Am T-Pain™, Sing! Karaoke™, Guitar! By Smule®, and Magic Piano® apps (available from Smule, Inc.) have shown that advanced digital acoustic techniques can be delivered using such devices in ways that provide compelling musical experiences. Sing! Karaoke™ implementations have previously demonstrated accretion of vocal performances captured, on a non-real-time basis with respect to one another, at geographically distributed handheld devices, as well as implementations in which more closely coupled pairings of a portable handheld device and a local media application platform (e.g., in-room) are supported, typically with short-range, negligible-latency communications over a same local or personal area network segment. Improved techniques and functional capabilities are desired to extend the sense of intimacy of "now" or "live" to collaborative vocal performances in which the performers are separated by more significant geographic distances and in which communication latencies between devices are non-negligible.
As researchers attempt to transition their innovations to commercial applications deployable on modern handheld devices and media application platforms, significant practical challenges present themselves within the real-world constraints imposed by the processors, memory, and other limited computing resources described above, and/or within the typical communication bandwidth and transmission latency constraints of wireless and wide area networks. For example, while applications such as Sing! Karaoke have demonstrated the appeal of post-performance audiovisual mixes that simulate duets or collaborative vocal performances of larger numbers of performers, creating a sense of "now," in live collaboration, has proven elusive without physical co-location.
Improved techniques and functional capabilities are desired, particularly with respect to managing communication latencies and captured audiovisual content such that a combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner presented to recipients, listeners, and/or viewers as a live interactive collaboration of geographically distributed performers. It is also desirable to provide for audience intervention and participation in constructs that preserve the sense of intimacy of "now" or "live."
Disclosure of Invention
It has been found that, despite the practical limitations imposed by mobile device platforms and media application execution environments, audiovisual performances, including vocal music, can be captured and coordinated with the audiovisual performances of other users in ways that create compelling user and listener experiences. In some cases, the vocal performances of collaborating contributors (together with performance-synchronized video) are captured in the context of a karaoke-style presentation of lyrics, in correspondence with an audible rendering of an accompaniment track (backing track). In some cases, vocals (and typically synchronized video) are captured as part of a live or otherwise unscripted performance with vocal interactions (e.g., a duet or dialog) between the collaborating contributors. In either case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists in managing latencies and captured audiovisual content such that the combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner presented to recipients, listeners, and/or viewers as a live interactive collaboration.
In one technique for accomplishing such a facsimile of live interactive performance collaboration, the actual and non-negligible network communication latency is (in effect) masked in one direction between the guest and host performers and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer on a "live show" internet broadcast of a host performer may include a guest + host duet sung in apparent real-time synchrony. In some cases, the host may be a performing artist who has popularized a particular musical performance. In some cases, the guest may be an amateur vocalist given the opportunity to sing "live" (though remote) with, or as, the host of the show alongside the popular artist or group. Despite the non-negligible network communication latency from guest to host (which may be on the order of 200 ms or more), the host performs in apparent synchrony with the guest performance as received, so that the vocals of both performers appear in the resulting broadcast mix without significant time lag between them.
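The latency-masking arrangement described above can be illustrated with a toy timing model (a sketch with hypothetical delay values, not figures from the source): because the host sings against the guest mix exactly as received, the guest-to-host delay shifts guest-mix arrival and host capture identically and cancels out of the broadcast mix, while the guest alone experiences the full round trip.

```python
# Toy timeline model of one-directional latency masking (all times in ms).
# The delay values used below are hypothetical, for illustration only.

def broadcast_alignment(guest_to_host_delay_ms, host_to_guest_delay_ms):
    """Return apparent lags (ms) at the two observation points.

    The guest captures vocals against a locally rendered accompaniment,
    so guest vocal and accompaniment are aligned at capture. The host
    sings against the guest mix *as received*; whatever network delay
    that mix suffered shifts host capture by the same amount, so in the
    broadcast mix the two vocal parts stay aligned.
    """
    # Guest mix (guest vocal + accompaniment) arrives at the host late:
    guest_mix_arrival = guest_to_host_delay_ms
    # Host vocals are captured against that late arrival, hence aligned with it:
    host_capture = guest_mix_arrival
    # Apparent lag between the two vocal parts in the broadcast mix:
    broadcast_lag = host_capture - guest_mix_arrival  # always 0
    # The guest, however, hears the host's reply a full round trip late:
    guest_side_lag = guest_to_host_delay_ms + host_to_guest_delay_ms
    return {"broadcast_lag_ms": broadcast_lag, "guest_side_lag_ms": guest_side_lag}

lags = broadcast_alignment(guest_to_host_delay_ms=150, host_to_guest_delay_ms=150)
print(lags)  # {'broadcast_lag_ms': 0, 'guest_side_lag_ms': 300}
```

This mirrors the claims: the full peer-to-peer latency surfaces only as guest-side time lag, never between the vocal parts in the broadcast mix.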
The result is an apparently live interactive performance (at least from the perspective of the broadcast's recipients, listeners, and/or viewers). Although the non-negligible network communication latency from guest to host is masked, it will be appreciated that latency exists and is tolerated in the host-to-guest direction. However, while the host-to-guest latency is discernible (and may be quite noticeable) to the guest, it need not be apparent in the apparently live broadcast or other dissemination. It has been found that the delayed audible rendering of host vocals (or, more generally, of the host's captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
Performance-synchronized video may be captured and included in the combined audiovisual performance that constitutes the apparently live broadcast, where visuals may be based, at least in part, on time-varying, computationally defined audio features extracted from (or computed over) the captured vocal audio. In some cases or embodiments, these computationally defined audio features are selective, over the course of the coordinated audiovisual mix, for the performance-synchronized video of one or more of the contributing vocalists.
Optionally, and in some cases or embodiments, vocal audio may be pitch-corrected in real time at the guest performer's device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, tablet, or netbook, or on a content or media application server) in accord with pitch correction settings. In some cases, the pitch correction settings encode a particular key or scale for the vocal performance or portions thereof. In some cases, the pitch correction settings include a score-coded melody and/or harmony sequences supplied with, or associated with, the lyrics and accompaniment tracks. Harmony notes or chords may be score-coded as explicit targets, if desired, or coded relative to the score-coded melody or even to the actual pitches sounded by a vocalist.
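As a loose illustration of score-constrained pitch correction of the kind described above (the function names and the note-set representation are hypothetical, and real implementations operate on continuous pitch tracks rather than single values), a detected vocal pitch can be snapped to the nearest note permitted by a score-coded melody or harmony set:

```python
import math

A4_HZ = 440.0  # standard tuning reference; corresponds to MIDI note 69

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f_hz / A4_HZ)

def midi_to_hz(note):
    """Convert a MIDI note number back to a frequency in Hz."""
    return A4_HZ * 2.0 ** ((note - 69.0) / 12.0)

def correct_pitch(f_hz, score_notes):
    """Snap a detected vocal pitch to the nearest score-coded target note.

    `score_notes` is the set of MIDI note numbers the vocal score allows
    at this point in the melody (or a harmony note set, cf. claims 20-21).
    """
    detected = hz_to_midi(f_hz)
    target = min(score_notes, key=lambda n: abs(n - detected))
    return midi_to_hz(target)

# A singer slightly flat of A4 (440 Hz); the score allows an A-major triad:
corrected = correct_pitch(430.0, {69, 73, 76})  # A4, C#5, E5
print(round(corrected, 1))  # 440.0
```

A harmony variant would simply pass a harmony note set instead of the melody note, per the score-coded harmony sequences mentioned above.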
Using uploaded vocals captured at guest performer devices (e.g., the aforementioned portable computing devices), a content server or service for the host can further mediate coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists for further broadcast or other dissemination. Depending on the goals and implementation of a particular system, uploads may include, in addition to video content, pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still another location and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or of individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts that are suggestive of performances or annotations emanating from particular geographic locations on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise ordinary mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration, and community.
In some embodiments in accordance with the present invention(s), an audio collaboration method is provided for broadcasting a joint performance of geographically distributed performers with non-negligible peer-to-peer communication latency between host and guest devices. The method includes (1) receiving, at a host device operating as a local peer, a media encoding of a mixed audio performance that (i) includes vocal audio captured from a first one of the performers at a guest device communicatively coupled as a remote peer, and (ii) is mixed with an accompaniment audio track; (2) at the host device, audibly presenting the received mixed audio performance and accordingly capturing vocal audio from a second one of the performers; and (3) mixing the captured second performer vocal audio with the received mixed audio performance for transmission to an audience as a broadcast, wherein the broadcast mix includes the vocal audio of the first and second performers and the accompaniment audio track with negligible time lag therebetween.
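A drastically simplified sketch of step (3), the host-side mix: the received guest encoding already combines guest vocal and accompaniment, so the host only adds its own captured vocal. Decoded audio is treated here as plain sample lists in [-1.0, 1.0] (a hypothetical simplification; real media encodings, timing, and resampling are ignored).

```python
def mix_broadcast(guest_mix, host_vocal, host_gain=1.0):
    """Mix host-captured vocal samples into the received guest mix.

    `guest_mix` already contains guest vocal + accompaniment as received
    from the remote peer; mixing is sample-by-sample with hard clipping
    to the nominal [-1.0, 1.0] full-scale range.
    """
    n = min(len(guest_mix), len(host_vocal))
    out = []
    for i in range(n):
        s = guest_mix[i] + host_gain * host_vocal[i]
        out.append(max(-1.0, min(1.0, s)))  # hard clip at full scale
    return out

# Three sample frames: plain sum, clip low, clip high.
broadcast = mix_broadcast([0.25, -0.5, 0.875], [0.25, -0.625, 0.375])
print(broadcast)  # [0.5, -1.0, 1.0]
```

Because the host vocal was captured against the guest mix as presented, no additional alignment step appears in this mix; the latency compensation happened implicitly at capture time.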
In some embodiments, the method further includes transmitting the broadcast mix as a live broadcast over a wide area network to a plurality of recipients, the plurality of recipients constituting the audience. In some embodiments, the method further includes joining, under selective control of the second performer at the host device, the first performer into the combined performance.
In some cases or embodiments, the joining first performer is selected from the audience, and the joining first performer is decoupled from the live transmission of the broadcast to the audience for at least a duration of the combined performance. In some cases or embodiments, the live broadcast transmitted to the audience lags the first performer's vocal audio capture by at least several seconds.
In some embodiments, the method further includes returning the first performer to the audience and, coincident therewith, re-coupling the first performer to the live transmission. In some embodiments, the method further includes selectively joining a third performer as a new remote peer, and thereafter (1) receiving, at the host device, a second media encoding that (i) includes vocal audio captured from the third performer at a new guest device communicatively coupled as the new remote peer, and (ii) is mixed with a second accompaniment audio track; (2) at the host device, audibly presenting the second media encoding and accordingly capturing additional vocal audio from the second performer; and (3) mixing the captured additional vocal audio with the received second media encoding for transmission to the audience as a continuation of the broadcast, wherein the broadcast mix includes the vocal audio of the second and third performers and the second accompaniment audio track with negligible time lag therebetween.
In some embodiments, the method further includes providing the captured second performer vocal audio to the guest device remote peer for audible presentation at the guest device with at least some guest-side time lag relative to capture of the vocal audio from the first performer. In some cases or embodiments, the apparent guest-side time lag is about 40-1200 ms.
In some cases or embodiments, substantially all of the non-negligible peer-to-peer communication delay is apparent in the guest-side time lag. In some cases or embodiments, the non-negligible peer-to-peer communication delay is apparent neither at the host device nor in the broadcast mix of the first and second performers. In some cases or embodiments, the non-negligible peer-to-peer communication delay includes input signal-to-transmission delay, network delay, jitter buffer delay, and buffer and output delay. Non-negligible peer-to-peer communication delays vary and, in some cases, can be psychoacoustically significant. In some cases or embodiments, the non-negligible peer-to-peer communication delay is at least about 30-100 ms. In some cases or embodiments, it is at least about 100-250 ms. In some cases or embodiments, it is approximately 100-600 ms.
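Read as an additive budget, the enumerated delay components simply sum to the end-to-end peer-to-peer figure. The per-stage values below are hypothetical placeholders (not from the source), chosen only to land within the ranges discussed:

```python
# Hypothetical per-stage delays (ms); illustrative only.
delay_budget_ms = {
    "input_to_transmit": 20,   # capture, encode, packetize
    "network": 80,             # propagation and queuing
    "jitter_buffer": 40,       # de-jitter before decode
    "buffer_and_output": 30,   # decode, buffer, render
}

total_ms = sum(delay_budget_ms.values())
print(total_ms)  # 170

def is_non_negligible(total, threshold_ms=30):
    """Crude classifier against the ~30 ms lower bound mentioned above."""
    return total >= threshold_ms

print(is_non_negligible(total_ms))  # True
```

A 170 ms total falls inside the approximately 100-600 ms band the embodiments describe, which is why the one-directional masking technique, rather than delay elimination, is the operative mechanism.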
In some cases or embodiments, receiving the mixed audio performance at the host device and providing the captured second performer vocal audio to the guest device are via a peer-to-peer audiovisual connection using a Web Real-Time Communication (WebRTC)-type framework. In some embodiments, the method further includes supplying the broadcast mix of first and second performer audio over a wide area network. In some cases or embodiments, supply of the broadcast mix is via a Real-Time Messaging Protocol (RTMP)-type audiovisual streaming protocol. In some cases or embodiments, at least the guest device constitutes a mobile handset or a media player.
In some embodiments, the method further includes pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding a sequence of notes for a vocal melody. In some embodiments, the method further includes pitch correcting, at the host device, the second performer vocals in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
In some cases or embodiments, the first performer vocals included in the received mixed performance are pitch-corrected vocals. In some cases or embodiments, one of the first and second performer vocals is pitch corrected in accord with a vocal score encoding a sequence of notes for a vocal melody, and the other of the first and second performer vocals is pitch corrected in accord with a vocal score encoding at least a first set of harmony notes for at least some portions of the vocal melody.
In some cases or embodiments, either or both of the first and second performer vocals are computationally processed, prior to inclusion in the broadcast, to apply one or more audio effects. In some cases or embodiments, the applied audio effects include one or more of: a reverberation effect, digital filtering, spectral equalization, non-linear distortion, audio compression, pitch correction or pitch shifting, and channel-relative gain and/or phase delay to manipulate apparent placement of the first or second performer within a stereo field.
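As one concrete illustration of the listed effects, channel-relative gain is what places a performer within the stereo field. The helper below is a hypothetical sketch (simple linear panning over Python lists), not the patent's actual DSP chain:

```python
# Illustrative stereo placement via channel-relative gain (a hypothetical
# helper; real pipelines would operate on sample buffers, not Python lists).

def stereo_place(mono_samples, pan):
    """Map a mono vocal into (left, right) pairs.

    pan ranges from -1.0 (hard left) to +1.0 (hard right); the two channel
    gains are complementary, so panning shifts apparent placement without
    changing the summed level.
    """
    left_gain = (1.0 - pan) / 2.0
    right_gain = (1.0 + pan) / 2.0
    return [(s * left_gain, s * right_gain) for s in mono_samples]

# pan = -1.0 puts the performer entirely in the left channel.
frames = stereo_place([1.0, 0.5], pan=-1.0)   # [(1.0, 0.0), (0.5, 0.0)]
```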
In some cases or embodiments, the received media encoding includes video that is performance-synchronized with the captured first performer vocals; the method further includes capturing, at the host device, video that is performance-synchronized with the captured second performer vocals; and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
In some embodiments, the method further comprises dynamically changing at least the visual prominence of one or the other of the first performer and the second performer in the broadcast mix based on an evaluation of the computationally defined audio characteristics of either or both of the first performer sound and the second performer sound. In some embodiments, the method further comprises applying one or more video effects to the broadcast mix based at least in part on computationally defined audio or video characteristics for either or both of the first performer audio or video and the second performer audio or video.
In some embodiments, the method further comprises receiving, at the host device, a chat message from a member of the audience. In some embodiments, the method further comprises incorporating at least some content of the chat message as part of the video of the broadcast mix. In some embodiments, the method further comprises receiving, at the host device from members of the audience, one or more of: chat messages, emoticons, animated GIFs, and voting indications. In some embodiments, the method further comprises incorporating a visual presentation of at least some of the received chat message content, emoticons, animated GIFs, or voting indications as part of the broadcast mix.
In some embodiments, the method further comprises queuing playlist requests from one or more recipients of the broadcast mix. In some embodiments, responsive to selection by the second performer at the host device of a particular one of the queued playlist requests, the method further comprises retrieving, from a content repository, one or more of: an accompaniment audio track, lyrics, and score-coded note targets. In some embodiments, responsive to selection by the second performer at the host device of a particular one of the queued playlist requests, the method further comprises requesting that one or more of the following be supplied to the communicatively coupled guest device: an accompaniment audio track, lyrics, and score-coded note targets.
In some cases or embodiments, the broadcast mix is presented as a vocal duet. In some embodiments, the method further includes receiving, at the host device, a media encoding of at least another mixed audio performance that (i) includes vocal audio captured from a third performer at another guest device communicatively coupled as another remote peer, and (ii) is temporally aligned or alignable with the accompaniment audio track. In some cases or embodiments, the live audio includes both a captured conversational audio portion corresponding to interactive conversation between the first and second performers and a captured vocal performance audio portion corresponding to vocal performance, against the accompaniment audio track, of either or both of the first and second performers.
In some embodiments, the method further comprises selecting a highlight set of segments from the live broadcast, wherein the highlight set generally includes vocal performance portions and generally excludes conversational audio portions. In some embodiments, the method further comprises selecting the highlight set of segments based on one or more of audience reaction to the live broadcast, song structure, and audio power. In some embodiments, the method further comprises selecting the highlight set of segments based on correspondence of particular audio portions of the live broadcast to a lyric fragment, a chorus, or a musical section boundary, whether score-coded or computationally determined by audio feature analysis. In some embodiments, responsive to a user selection, the method further comprises saving or sharing an audiovisual encoding of one or more of the highlight segments.
In some embodiments, the method further comprises receiving one or more lyric synchronization markers from the guest device. The lyric synchronization markers convey to the host device a temporal alignment of lyrics visually presented at the guest device with the vocal audio captured at the guest device. In some embodiments, the method further comprises visually presenting the lyrics at the host device, wherein the visual presentation of the lyrics is temporally aligned with the media encoding of the mixed audio performance received from the guest device based on the received one or more lyric synchronization markers. In some cases or embodiments, the received one or more lyric synchronization markers coordinate the progression of lyrics presented on the host device with a pause or other temporal control at the guest device.
In some embodiments according to the invention(s), a system for disseminating an apparently live broadcast of a combined performance of geographically distributed first and second performers comprises: a host device and a guest device coupled by a communication network as a local peer and a remote peer with non-negligible peer-to-peer delay therebetween for transmission of audiovisual content. The host device is communicatively coupled as the local peer to receive a media encoding of a mixed audio performance including vocal audio captured at the guest device, and the guest device is communicatively coupled as the remote peer to provide the media encoding captured from a first one of the performers and mixed with an accompaniment audio track. The host device is configured to audibly render the received mixed audio performance, to capture accordingly vocal audio from a second one of the performers, and to mix the captured second performer vocal audio with the received mixed audio performance for transmission as the apparently live broadcast.
In some embodiments according to the invention(s), an audio collaboration method for live broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically distributed first and second devices comprises: (a) receiving, at the second device, a media encoding of a mixed audio performance that (i) includes vocal audio captured at the first device from a first one of the performers, and (ii) is mixed with an accompaniment audio track; (b) at the second device, audibly rendering the received mixed audio performance and capturing accordingly vocal audio from a second one of the performers; (c) mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of both the first and second performers and the accompaniment audio track without apparent temporal lag therebetween; and (d) supplying the broadcast mix to a service platform configured to live broadcast the broadcast mix to plural recipient devices constituting an audience.
In some cases or embodiments, the first device is associated with the second device as the current live guest, and the second device operates as the current live host. The current live host controls association and dissociation of particular devices from the audience as the current live guest. In some cases or embodiments, the current live host selects from a queue of requests from audience members to associate as the current live guest.
In some cases or embodiments, the first device operates in a live guest role and the second device operates in a live host role. The method further comprises either or both of: the second device relinquishing the live host role for another device to assume; and the second device passing the live host role to a particular device selected from a set comprising the first device and devices of the audience.
Drawings
The present invention(s) are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements or features in general.
Fig. 1 depicts information flow between illustrative mobile phone-type portable computing devices in a host and guest configuration for a live duet group audiovisual performance in accordance with some embodiments of the invention(s).
Fig. 2 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "master-sync" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s).
Fig. 3 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "shared delay" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s).
Figure 4 is a flow diagram illustrating optional real-time continuous pitch correction and harmony generation signal flows that may be performed based on score-coded pitch correction settings for audiovisual performances captured at a guest or host device in accordance with some embodiments of the invention(s).
Fig. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to process and transmit captured audiovisual performances for use in a multi-singer live configuration of network-connected devices in accordance with some embodiments of the present invention(s).
Fig. 6 illustrates features of a mobile device that may serve as a platform for performing at least some software implementations of audiovisual performance capture and/or live performance devices in accordance with some embodiments of the invention(s).
Fig. 7 is a network diagram illustrating cooperation of exemplary devices according to some embodiments of the invention(s).
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or significance of some of the elements or features shown may be exaggerated relative to other elements or features to help to improve understanding of embodiments of the present invention. Also, while illustrated in the accompanying drawings as a single flow for the sake of brevity or to avoid complexity that might otherwise obscure the description of the inventive concepts, it is to be understood that multiple data and control flows (including component signals or code) are consistent with the description.
Detailed Description
Modes for carrying out the invention(s)
Techniques have been developed to facilitate live broadcasting of group audiovisual shows. Audiovisual performances including vocal music are captured and coordinated with the performances of other users in a manner that can create compelling user and listener experiences. For example, in some cases or embodiments duel singing with a host performer may be supported in audio-visual live broadcasts that sing in the style of an artist, where an active singer requests or queues a particular song in an entertainment format for a live radio show. The developed technology provides a communication delay tolerant mechanism for synchronizing sound performances captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablet computers, or at audiovisual capture devices geographically separated from live studios).
Although audio-only embodiments are of course contemplated, it is envisioned that live content will typically include video synchronized with the captured performance. Further, while network-connected mobile phones are illustrated as audiovisual capture devices, it will be understood based on the description herein that audiovisual capture and viewing devices may include suitably configured computers, smart TVs and/or living room-style set-top box configurations, and even smart virtual assistant devices having audio and/or audiovisual capture capabilities. Finally, although applications to vocal music are described in detail, it will be understood based on the description herein that audio or audiovisual capture applications are not necessarily limited to vocal duets, but may be adapted to other forms of group performance in which one or more successive performances are accreted to prior performances to produce a live broadcast.
In some cases, vocal performances of collaborating contributors (together with performance-synchronized video) are captured in the context of a karaoke-style presentation of lyrics and in correspondence with an audible rendering of an accompaniment track. In some cases, vocals (and typically synchronized video) are captured as part of a live or otherwise unscripted performance with vocal interactions (e.g., duet or dialog) between the collaborating contributors. In either case, it is envisioned that non-negligible network communication delays will exist between at least some of the collaborating contributors, particularly where they are geographically separated. As a result, a technical challenge exists in managing the delays and the captured audiovisual content such that the combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration.
In one technique for accomplishing this facsimile of live interactive performance collaboration, an actual and non-negligible network communication delay is (in effect) masked in one direction between the guest and host performers and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer included in a host performer's "live show" internet broadcast may include guest + host duet singing in apparent real-time synchrony. In some cases, the host may be an artist who has popularized a particular musical performance. In some cases, the guest may be an amateur singer given the opportunity to sing "live" (though remote) with the popularizing artist or group as host of, or in, the performance. Although a non-negligible network communication delay from the guest to the host (perhaps 200-.
The result is an apparently live interactive performance, at least from the perspective of recipients, listeners and/or viewers of the broadcast performance. Although the non-negligible network communication delay from guest to host is masked, it will be appreciated that the delay exists and is tolerated in the host-to-guest direction. However, while the host-to-guest delay is discernible (and may be quite noticeable) to the guest, it need not be apparent in the apparently live broadcast or other dissemination. It has been found that a delayed audible rendering of host vocals (or more generally, the host side of a captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
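The delay-masking observation above can be made concrete with a toy sample-buffer model. This is an illustrative, assumption-laden sketch (naive list arithmetic in place of a real audio pipeline): because the host vocal is captured against playback of the received guest mix, the two parts are already mutually aligned when summed, so the guest-to-host network delay never appears inside the broadcast mix.

```python
# A minimal sketch of host-side mixing. The received buffer already contains
# guest vocals mixed with the backing track; the host vocal is captured
# against local playback of that very buffer. (Toy model, not the patent's
# actual implementation.)

def host_side_broadcast_mix(received_guest_mix, host_vocal):
    """Sum the received (guest vocal + backing track) mix with the host vocal.

    Both buffers are referenced to the same local playback clock at the
    host, so no additional alignment for network delay is required: the
    delay shifted *when* the guest mix arrived, not the relative alignment
    of the parts within it.
    """
    n = min(len(received_guest_mix), len(host_vocal))
    return [received_guest_mix[i] + host_vocal[i] for i in range(n)]

mix = host_side_broadcast_mix([1, 2, 3], [1, 1, 1])   # [2, 3, 4]
```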
Although much of the description herein presumes, for purposes of illustration, a fixed host performer on a particular host device, it will be understood based on the description herein that some embodiments in accordance with the invention(s) may provide host/guest control logic that allows the host to "pass the mic," such that a new user (in some cases a user who "picks up the mic" after the current host "drops the mic," and in other cases a user selected by the current host) may take over as host. Likewise, it will be understood based on the description herein that some embodiments in accordance with the invention(s) may provide host/guest control logic that queues guests (and/or aspiring hosts) and automatically assigns queued users to appropriate roles.
In some cases or embodiments, in a karaoke-style user interface framework, vocal audio of performers in individual host and guest roles is captured together with performance-synchronized video and coordinated with the audiovisual contributions of other users to form duet- or chorus-style group audiovisual performances. For example, in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of an accompaniment track, an individual user's vocal performance (together with performance-synchronized video) may be captured at a mobile device, a television-type display, and/or a set-top box device. In some cases or embodiments, score-coded continuous pitch correction and user-selectable audio and/or video effects may be provided. Consistent with the foregoing, but without limitation as to any particular claimed embodiment, karaoke-style vocal performance capture using portable handheld devices provides an illustrative context.
Karaoke type voice performance capture
Although embodiments of the present invention are not so limited, pitch-corrected, karaoke-style vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context. For example, in some embodiments such as that illustrated in fig. 1, an iPhone™ handheld device available from Apple Inc. (or more generally,
In the illustration of fig. 1, a current host user of a
In the illustrated configuration, content that is mixed to form a group
Typically, the
In the configuration shown in fig. 1, and despite a non-negligible time lag (typically 100-250 ms, but possibly greater), the
It should be appreciated that the time lag in the peer-to-peer communication channel between the
User sounds 103A and 103B are captured at the respective
In general, lyrics, melody and harmony track note sets and related timing and control information may be encapsulated in an appropriate container or object (e.g., in Musical Instrument Digital Interface (MIDI) or JavaScript Object Notation (JSON) type formats) for supply with the accompaniment track(s). Using such information, the
As will be appreciated by those skilled in the art having benefit of the present disclosure, performances of multiple singers (including performance-synchronized video) may be accumulated and combined, such as to form a duet-style performance, a chorus, or a vocal jam session. In some embodiments of the present invention, social network constructs may at least partially supplant or inform host control of the pairings of geographically distributed singers and/or the formation of geographically distributed virtual choruses. For example, relative to fig. 1, individual singers may perform as current host and guest users in a captured manner (with audio and performance-synchronized video), ultimately streamed to an audience as a live broadcast 122. Such captured audiovisual content may, in turn, be distributed to the singers' social media contacts, members of the audience, etc., through an open call mediated by a content server. In this way, the singers themselves, members of the audience (and/or a content server or service platform on their behalf) may invite others to join in the coordinated audiovisual performance, or to queue as audience members or prospective guests.
Where the provision and use of accompaniment tracks is illustrated and described herein, it will be understood that captured, pitch-corrected (and possibly, though not necessarily, harmonized) vocals may themselves be mixed (e.g., with the guest 106) to produce a "backing track" used to motivate, guide, or frame subsequent vocal captures. Furthermore, additional singers may be invited to sing a particular part (e.g., tenor, part B in a duet, etc.) or simply to sing along, with subsequent vocal capture devices (e.g.,
Synchronization method
Based on the description herein, those skilled in the art will appreciate various host-guest synchronization methods that allow for a non-negligible time lag in the peer-to-peer communication channel between
Fig. 2 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in a "master-sync" peer-to-peer configuration for generating a live audiovisual performance of a group audiovisual show in accordance with some embodiments of the invention(s). More specifically, fig. 2 shows an exemplary configuration of the
The key to masking the actual delay is to include the
Fig. 3 is a flow diagram depicting a stream of audio signals captured and processed at respective guest and host devices coupled in an optional "shared delay" peer-to-peer configuration for generating a live audiovisual performance of an audio group visual presentation, in accordance with some embodiments of the invention(s). More specifically, fig. 3 shows an exemplary configuration of the
This limited perception of latency is achieved by playing the accompaniment tracks locally on both devices and keeping them synchronized in real time. The
We have experimented with two different approaches to keep the accompaniment track synchronized on two devices (
Method 1: we adjust the playback position received at the host side by the one-way network delay, approximated as network RTT/2.
Method 2: we use the Network Time Protocol (NTP) to synchronize the clocks of two devices. In this way we do not need to adjust the timing messages based on one-way network delay, we simply add an NTP timestamp to each song timing message.
For the "shared delay" configuration, method 2 has proven to be more stable than method 1. As an optimization, to avoid excessive timing adjustments, the host only updates the accompaniment track playback position if we are currently more than 50ms away from the accompaniment track playback position of the guest.
Score-coded pitch track
Figure 4 is a flow diagram illustrating real-time continuous, score-coded pitch correction and harmony generation for a captured vocal performance in accordance with some embodiments of the invention(s). In the illustrated configuration, a user/singer (e.g., a guest or host singer at a
Both the pitch correction and the added harmonies are selected to correspond to the
In some embodiments of the techniques described herein, the note (in the current key or scale) closest to the note voiced by the user/singer is determined based on the
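The nearest-note determination just described (selecting, from score-permitted notes, the one closest to the sung pitch) might look as follows. The MIDI-note representation and the C-major note set are illustrative assumptions, not the patent's encoding:

```python
# Illustrative sketch of snapping a detected vocal pitch to the nearest
# score-permitted note. MIDI note numbers are assumed for the note set.
import math

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(f_hz / 440.0)

def midi_to_hz(note):
    """Convert a MIDI note number back to a frequency in Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

def nearest_scale_note(f_hz, allowed_notes):
    """Return the allowed MIDI note closest to the sung pitch."""
    sung = hz_to_midi(f_hz)
    return min(allowed_notes, key=lambda n: abs(n - sung))

# E.g., a slightly flat A4 (435 Hz) against a C-major note set snaps to
# MIDI note 69 (A4, 440 Hz), the target for pitch correction.
c_major = [60, 62, 64, 65, 67, 69, 71, 72]
target = nearest_scale_note(435.0, c_major)   # 69
```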
Audio-visual capture at handheld devices
Although not required to support performance-synchronized video capture in all embodiments, handheld device 101 (e.g.,
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks of software (e.g., decoder(s) 352, digital-to-analog (D/A)
As will be appreciated by those of ordinary skill in the art, pitch detection and pitch correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature extraction, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accordance with the present invention. With this in mind, and recognizing that the multi-singer synchronization techniques in accordance with the invention(s) are generally independent of any particular pitch detection or pitch correction technology, the present description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable in various designs or implementations in accordance with the present description. Instead, we simply note that in some embodiments in accordance with the present invention, pitch detection methods calculate an average magnitude difference function (AMDF) and execute logic to pick a peak corresponding to an estimate of the pitch period. Building on such estimates, pitch synchronous overlapped add (PSOLA) techniques are used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing the aperiodic effects of a splice. Specific implementations based on AMDF/PSOLA techniques are described in greater detail in commonly owned U.S. Patent No. 8,983,829, entitled "COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS," naming Cook, Lazier, Lieber, and Kirk as inventors.
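The AMDF step mentioned above lends itself to a compact teaching example. The following is a minimal sketch of AMDF pitch-period estimation only (the PSOLA resampling stage is omitted), and is not the implementation of the cited patent:

```python
# Minimal AMDF pitch-period estimator. For a periodic signal, the average
# magnitude difference dips toward zero when the lag equals the pitch
# period; we pick the lag with the deepest valley. (Teaching sketch only.)
import math

def amdf(frame, lag):
    """Average magnitude difference for a candidate pitch period (lag)."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_period(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] minimizing the AMDF."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))

# A pure sine with a 20-sample period yields a pitch-period estimate of 20.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
period = estimate_period(frame, min_lag=10, max_lag=40)   # 20
```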
Exemplary Mobile device
FIG. 6 illustrates features of a mobile device that may serve as a platform for executing software implementations in accordance with some embodiments of the present invention. More specifically, FIG. 6 is a block diagram of a mobile device generally consistent with a commercially available iPhone™ handset.
Briefly summarized,
Generally, the
In general, the
The
Other sensors may also be used or provided. A
/g/n/ac communication device and/or Bluetooth™
FIG. 7 shows various examples of computing devices (701, 720A, 720B, and 711) programmed (or programmable) with audio and video capture code, user interface code, pitch correction code, audio presentation pipeline, and playback code according to the functional description herein.
Other embodiments
While the invention(s) have been described with reference to various embodiments, it should be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, although a pitch-corrected audio performance captured according to a karaoke-type interface has been described, other variations will be understood. Moreover, although certain illustrative signal processing techniques have been described in the context of certain illustrative applications, those of ordinary skill in the art will recognize that it would be straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments according to the present invention may take the form of, and/or be provided as, a computer program product encoded in a computer-readable medium as an instruction sequence or other functional construct of software that, in turn, is executable in a computational system (such as an iPhone handset, mobile or portable computing device, media application platform, set-top box, or content server platform) to perform the methods described herein. In general, a machine-readable medium may include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computing facility of a computer, mobile device or portable computing device, media device or streamer, etc.), as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but is not limited to, magnetic storage media (e.g., disk and/or tape storage); optical storage media (e.g., CD-ROM, DVD, etc.); magneto-optical storage media; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of media suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
In general, multiple instances may be provided for a component, operation, or structure described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).